# LIBRARIES
* Execute this cell before going any further. 

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re

<br/><br/>

# Warmup

## DataFrames and Tabular Data
You've likely worked with tabular data before, especially in Excel. In Python, we use the `pandas` library (imported as `pd`) to handle this type of data efficiently. The core component of pandas is the `pd.DataFrame` object, which functions much like a spreadsheet. Proficiency in `pandas` is an evergreen skill: it can make data processing tens or even hundreds of times faster than manual work in Excel.

In this warmup, we'll explore the basics of the `pd.DataFrame` using a simple dataset. This will prepare you for the extensive use we'll make of this data structure throughout the assignment.

### CODE
* Create three lists of data (e.g., numbers, names, or categories).
* Store these lists in a dictionary, using meaningful labels as keys.
* Use the pd.DataFrame constructor to create a DataFrame from the dictionary.
* Display the DataFrame by making it the last line of a code cell. What does it look like?
* Multiply a column by 5, store the result in a new column on the DataFrame, and display the updated DataFrame.
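
As a rough illustration of the steps above, here is a minimal sketch using made-up data. The list names, column labels, and values are hypothetical and are not the ones used in the exercise cells that follow.

```python
import pandas as pd

# Hypothetical example data (placeholder names and values).
fruit = ['apple', 'pear', 'plum']
count = [4, 7, 2]
price = [0.5, 0.75, 0.2]

# Dictionary keys become column labels; each list becomes one column.
basket = pd.DataFrame({
    'Fruit': fruit,
    'Count': count,
    'Price': price,
})

# Arithmetic on columns is applied row by row.
# Assigning to a label that does not exist yet adds a new column.
basket['Total'] = basket['Count'] * basket['Price']

print(basket)
```

In a notebook, ending the cell with `basket` alone on a line displays the table without needing `print`.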

In [1]:
students      = ['Bob','Alice','Giancarlo']
test_scores   = [95,85,75]
gpa           = [3.0,2.0,5.0]

In [None]:
my_data = {
    # Just like we define lists with square brackets [],
    # we define dictionaries with braces {}.
    # We define key-value pairs with the key and value joined by a colon,
    # and successive key-value pairs separated by commas, as follows:
    # 'my_key' : some_list,
    # 'another_key' : another_list,
    # (remember these are just placeholder names!)
    # Define your own dictionary, using the three lists above.
}

In [None]:
my_dataframe = pd.DataFrame()  # call pd.DataFrame() as a function,
# using the dict my_data as the argument.
# End this block with the variable my_dataframe alone on a line.

In [3]:
#we can access columns using square brackets [],
#just like we can access data in a dictionary by key.
#we can make new columns by using the square brackets []
#and specifying a column that doesn't exist yet, as follows:
#my_dataframe['some_column'] = my_dataframe['old_column'] + 7 
#(this would add 7 to every element in the column with title 'old_column')

<br/><br/>
<br/><br/>

# The Photoelectric Effect

## PART 1 - Defining our Routine
Before we embark on a data processing journey, a reasonable first step is to draw ourselves a map. It is very hard, after all, to write code if you do not know what you are trying to do.

### SHORT RESPONSE QUESTIONS
1. What is the terminal information we are trying to find?
2. What do we need to know to calculate this data?
3. What data do we have from our experiments?
4. How can we derive our intermediate data from our experimental data, and our terminal data from our intermediate data? (Include equations, if necessary)
### ANSWER

<br/><br/>
<br/><br/>

## PART 2 - Loading and Plotting Data
We want to process our data using `pandas`. The first thing we must do is load it into memory as a `pd.DataFrame` object. We will not use the constructor as before; instead, we will use `pd.read_csv()`. Once we have our data, it is difficult to tell what is going on by staring at columns of numbers; in fact, Jupyter will not usually display more than a few dozen rows before it truncates the `DataFrame`. We can use `plt.plot()` to visualize the relationships between columns in our `DataFrame`.
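
The loading step might look like the sketch below. It reads from an in-memory string so that it runs without any files; the column names and values are invented for illustration, and the real CSV files may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for one data file.
csv_text = """Voltage(V),Current(A)
0.0,0.90
0.5,0.40
1.0,0.10
1.5,0.00
"""

# pd.read_csv accepts a file path or any file-like object.
spectrum = pd.read_csv(io.StringIO(csv_text))

# Quick checks that do not require reading every number by eye.
print(spectrum.columns.tolist())  # column labels
print(spectrum.shape)             # (number of rows, number of columns)
print(spectrum.head())            # first few rows only
```

With a real file, the call would be `pd.read_csv('some_file.csv')`, where the filename is whatever your data file is called. Two columns can then be passed to `plt.plot()` as the x and y arguments.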

### GIVEN FUNCTIONS

In [4]:
def plot_data(dataframe,x_column,y_column,title,trendline_coeffs=None,output_file=None):
    x = dataframe[x_column]
    y = dataframe[y_column]

    plt.scatter(x,y, label = 'Data Points', color='blue')

    plt.xlabel(x_column)
    plt.ylabel(y_column)
    plt.title(title)

    if trendline_coeffs is not None:
        poly_coeff = trendline_coeffs
        poly_degree = len(trendline_coeffs) -1
        poly = np.poly1d(poly_coeff)
        x_fit = np.linspace(x.min(),x.max(), 500)
        y_fit = poly(x_fit)
    
        plt.plot(x_fit, y_fit, label=f'Polynomial Fit (Degree {poly_degree})',color='red')
        
        equation_text = "$y = " + " + ".join([f"{coef:.3e}x^{poly_degree - i}" if i != poly_degree else f"{round(coef, 3)}"
                                              for i, coef in enumerate(poly_coeff)]) + "$"
        plt.text(0.2,0.8, equation_text, transform=plt.gca().transAxes, fontsize=8,
                 verticalalignment='bottom', horizontalalignment='left')
        plt.legend()

    if output_file:
        plt.savefig(output_file,format='png',dpi=300, bbox_inches='tight')
        
    plt.show()

### CODE
* Use `pd.read_csv(filename)` to open your data.
* Show the contents.
* Use `plt.plot(x,y)` to display the spectrum contained in the data.
* Apply some basic formatting to your plot using the provided functions.

In [None]:
data = pd.read_csv()
data

In [None]:
plt.plot()  # use data['some_key'] for the x and y arguments.
plt.title()
plt.ylabel()
plt.xlabel()

### SHORT RESPONSE QUESTIONS
1. What are the names of each of the columns in `data`? How many rows are there?
2. Describe the qualitative trends in our spectrum based on the plot you made.
### ANSWER

<br/><br/>
<br/><br/>

## PART 3 - Deriving Values
There are three major ways we can process our data now.
1. We can do operations on columns to obtain new columns (such as converting wavelengths to frequencies using a formula).
2. We can do operations on columns to obtain one or more scalar values (such as the linear coefficients or x-intercept of two columns $x$ and $y$).
3. We can filter our DataFrame using some condition to include only values in a valid range (such as getting rid of all zero values).
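
The operations listed above can be sketched on a small made-up table. The column names and values are hypothetical; only the speed of light is a real constant.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; column names are placeholders.
df = pd.DataFrame({
    'Wavelength(nm)': [400.0, 500.0, 600.0, 700.0],
    'Signal': [0.0, 2.0, 4.0, 6.0],
})

c = 299_792_458.0  # speed of light in m/s (exact SI value)

# Column -> new column: convert every wavelength to a frequency.
df['Frequency(Hz)'] = c / (df['Wavelength(nm)'] * 1e-9)

# Columns -> scalars: slope and intercept of a straight-line fit.
slope, intercept = np.polyfit(df['Wavelength(nm)'], df['Signal'], 1)

# Filter: keep only the rows whose signal is nonzero.
nonzero = df[df['Signal'] != 0]
```

The filter works because `df['Signal'] != 0` produces a column of `True`/`False` values, and indexing the DataFrame with that column keeps only the `True` rows.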

Our next step would be to calculate the following:
1. The energy of a photon at each wavelength of light.
2. The cutoff voltage.
3. The work function.

How can we accomplish this using these three basic operations?

### GIVEN FUNCTIONS

In [15]:
def truncate_data(dataframe,xcol,ycol,y_threshold):
    '''
    returns the rows of dataframe up to and including the first row
    whose ycol value is lower than or equal to y_threshold
    (the whole dataframe if no value reaches the threshold)
    '''
    for index, y_value in enumerate(dataframe[ycol]):
        if y_value <= y_threshold:
            #return dataframe only up to and including this value
            return dataframe.iloc[0:index + 1]
    #nothing was below threshold
    return dataframe

In [14]:
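def find_root_and_yint(data,x_column,y_column,poly_degree=2,interval=(0,np.inf)):
    '''
    fits a polynomial of degree poly_degree to y_column vs. x_column and returns
    (root, yint): the smallest root strictly inside interval (np.nan if there is none)
    and the y-intercept of the fit
    '''
    x = data[x_column]
    y = data[y_column]

    poly_coeff = np.polyfit(x,y,poly_degree)

    try:
        root = min([float(root) for root in np.roots(poly_coeff) if interval[0] < root < interval[1]])
    except Exception:
        #no usable root inside the interval (e.g. min() of an empty list)
        root = np.nan
    yint = float(poly_coeff[-1])

    return root,yint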

### CODE
* Perform a polynomial fit on your data using `np.polyfit(x_data,y_data,degree)`.
* Using the provided function, plot this fit.  
* Also try truncating the zero values out of your data and applying a linear fit. Which method do you expect to give better results?
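
For reference, here is a minimal sketch of how `np.polyfit` reports its result, using synthetic points that lie exactly on a known line so the output is easy to check by hand. The data are invented and are not meant to resemble the experimental spectrum.

```python
import numpy as np

# Synthetic data lying exactly on y = 2 - 4x.
x = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
y = 2.0 - 4.0 * x

# Degree-1 fit: coefficients come back highest power first,
# i.e. [slope, intercept] for a straight line.
coeffs = np.polyfit(x, y, 1)

# np.poly1d wraps the coefficients in a callable polynomial.
line = np.poly1d(coeffs)

# The x-intercept is the root of the fitted polynomial.
x_intercept = np.roots(coeffs)[0]
```

The same coefficient array is what `plot_data` expects for its `trendline_coeffs` argument.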

### TEXT ANSWER

<br/><br/>
<br/><br/>
  

## PART 4 - Automating our Routine
We've separately put together functions for all the different intermediate variables we need for one set of data for a given wavelength. We will, however, need to do this for every spectrum for every wavelength. We've said before that Python is excellent for automating repetitive tasks. If you can do something once, you can do it ten thousand times. We will build two functions - one which uses everything we just did to run our whole intermediate variable routine for one spectrum, and another which uses this for all wavelengths.
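
The general pattern (one function that reduces a single dataset to a row of values, plus a loop that applies it to every dataset) can be sketched as follows. The helper `summarize`, the labels, and the data are all hypothetical stand-ins, not the functions given in this assignment.

```python
import numpy as np
import pandas as pd

def summarize(frame, label):
    """Reduce one dataset to a single row of derived values (hypothetical helper)."""
    slope, intercept = np.polyfit(frame['x'], frame['y'], 1)
    return {'Label': label, 'Slope': slope, 'Intercept': intercept}

# Hypothetical datasets keyed by a label, standing in for one spectrum per wavelength.
datasets = {
    'A': pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [1.0, 3.0, 5.0]}),
    'B': pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [4.0, 3.0, 2.0]}),
}

# Run the same routine on every dataset, then collect the rows into one table.
rows = [summarize(frame, label) for label, frame in datasets.items()]
summary = pd.DataFrame(rows)
print(summary)
```

Building a list of dictionaries and converting it once at the end is one possible design; the given `create_table` function instead grows a table row by row with `pd.concat`.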

### GIVEN FUNCTIONS

In [16]:
def process_data(dataframe,xcol,ycol,lambda_nm,poly_degree):
    data = truncate_data(dataframe,xcol,ycol,0.01)
    data_dict = {}

    poly_coeff = np.polyfit(data[xcol],data[ycol],poly_degree)   
    root_yint = find_root_and_yint(data,xcol,ycol, poly_degree)
    V0 = root_yint[0]
    phi = calculate_phi(V0,200)
    nu_lambda = calculate_cutoff_nu_lambda(phi)
    cutoff_nu_Hz = nu_lambda[0]
    cutoff_lambda_nm = nu_lambda[1]

    data_dict['Wavelength(nm)'] = lambda_nm
    data_dict['Frequency(Hz)'] = float(c) / (float(lambda_nm) * 1e-9)
    data_dict['Stopping Potential(V)'] = V0
    data_dict['Work Function(J)'] = phi
    data_dict['Cut-off Frequency(Hz)'] = cutoff_nu_Hz
    data_dict['Cut-off Wavelength(nm)'] = cutoff_lambda_nm

    return data_dict

In [5]:
def read_files():
    filenames = os.listdir('./') #'./' is the current directory
    csv_files = [file for file in filenames if file.endswith('.csv')]
    data = {}
    for filename in csv_files:
        wavelength_nm = re.search(r'\d+',filename).group(0)  #first run of digits in the filename
        data[wavelength_nm] = pd.read_csv(filename)
    return data
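
To see what the digit search in `read_files()` extracts, here is a small sketch on invented filenames. The names are hypothetical; the real data files may follow a different naming scheme.

```python
import re

# Hypothetical filenames.
filenames = ['405nm.csv', 'spectrum_532.csv', 'notes.csv']

extracted = {}
for name in filenames:
    match = re.search(r'\d+', name)  # first run of digits, or None if there is none
    extracted[name] = match.group(0) if match else None

print(extracted)
```

Note that the result is a string such as `'405'`, not a number, and that a filename without any digits gives `None`. Calling `.group(0)` directly on such a filename, as `read_files()` does, raises an `AttributeError`, so the working directory should contain only CSV files with a wavelength in the name.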

In [18]:
def create_table(table_data,poly_degree):
    table = pd.DataFrame({'Wavelength(nm)':[],
                   'Frequency(Hz)':[],
                   'Work Function(J)':[],
                   'Stopping Potential(V)':[],
                   'Cut-off Frequency(Hz)':[],
                   'Cut-off Wavelength(nm)':[]})
    for wavelength in table_data:
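        #NOTE: xcol and ycol are not parameters of create_table; they must already be defined as global names.
        derived_values = process_data(table_data[wavelength],xcol,ycol,wavelength,poly_degree)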
        derived_values = {key : [value] for key,value in derived_values.items()}
        row = pd.DataFrame(derived_values)
        numeric_values = pd.to_numeric(row.values.flatten(), errors='coerce')
        if not np.isnan(numeric_values).any():
            table = pd.concat([table,row])

    return table

### CODE
* Complete the `process_data()` function.

### TEXT ANSWER

<br/><br/>
<br/><br/>  


## PART 5 - Calculating our Terminal Data
Using the data collected from all of our files, perform a linear regression to obtain Planck's constant. Is Planck's constant the slope or the intercept?  
Also, compare the value you obtain for Planck's constant to the literature value and calculate the percent error.
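
A minimal sketch of the regression and the percent error calculation is shown below, assuming the standard Einstein relation $eV_0 = h\nu - \phi$. The data are synthetic and noise-free, generated from the literature constants with a made-up work function of 2.0 eV, so the error comes out essentially zero; real measurements will not behave this well.

```python
import numpy as np

h_lit = 6.62607015e-34  # Planck's constant in J*s (exact SI value)
e = 1.602176634e-19     # elementary charge in C (exact SI value)

# Synthetic data from e*V0 = h*nu - phi with a hypothetical work function of 2.0 eV.
phi_J = 2.0 * e
nu_Hz = np.array([6.0e14, 7.0e14, 8.0e14, 9.0e14])
V0 = (h_lit * nu_Hz - phi_J) / e

# Straight-line fit of stopping potential (V) against frequency (Hz).
slope, intercept = np.polyfit(nu_Hz, V0, 1)

# Under the assumed relation, V0 = (h/e)*nu - phi/e, so the slope is h/e.
h_fit = slope * e
percent_error = abs(h_fit - h_lit) / h_lit * 100

print(f"h from fit: {h_fit:.4e} J*s, percent error: {percent_error:.2e} %")
```

If you fit energy in joules against frequency instead of stopping potential in volts, the factor of `e` moves from the slope into the y-values.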

### CODE

### TEXT ANSWER

<br/><br/>
<br/><br/>

# REFLECTION

This same experiment was taught in prior years, using Excel instead of Python. Which method would you prefer to use, and why?  
For which situations would it be better to use a tool like Excel, and for which situations would it be better to use Python?

## TEXT ANSWER

<br/><br/>
<br/><br/>