# Session 3: Fitting polynomials

<div class="alert alert-success"> <p><b>Intended learning outcomes:</b></p>
By the end of this session, you should be able to:
<ul>
<li> Use Python to fit a polynomial to a set of data; </li>
<li> Evaluate the goodness of fit using the matrix of covariance and $\chi^2$; </li>
<li> Fit the residuals to a Gaussian. </li>
</ul>
</div>

## Why fit to a polynomial?

In physics we often meet an experimental relation between variables which is difficult to describe mathematically. This may either be because the theoretical equation which describes the observed behaviour is difficult to solve, or because the situation is complicated by several ill-defined factors and it is difficult to derive any theoretical equation which can describe it properly. However, we can still measure the dependence between the variables experimentally and we would like to have some means of predicting this dependence. In this kind of situation it is convenient to fit an equation to our experimental data. We can then use the fitted equation to interpolate, i.e. to calculate the expected value of a variable between our measured data points, and to extrapolate, i.e. to calculate the expected value beyond the range of our measured data points. This procedure is often called “parameterizing” the relationship.

In principle we could use any form of equation to fit a set of measured data, but if we have no theoretical basis for fitting a particular type of curve it is often simplest and easiest to fit a polynomial. The order of the polynomial and the coefficients of each term in the fitted equation are called the “parameters” of our fit.

In this session we will look at how to fit polynomials with numpy, using experiment E5 as an example - you will be doing this experiment yourself in the second half of term. The experiment involves calibrating a temperature sensor by measuring its output voltage over a range of temperature. A polynomial equation is then fitted to the experimental results and this calibration equation is later input into a programmable chip so that the sensor can operate as a digital thermometer.

## Using numpy to fit a polynomial to a dataset

The first thing we need to do is import the modules we'll need. Enter these in the code cell below.

In [1]:
### STUDENT GENERATED CELL ###

# The following line makes all plot output generate as images within the notebook. 
%matplotlib notebook

# the lines below import the packages needed
import numpy as np
import matplotlib.pyplot as plt

Now we should load our data file, which is called "studentdataE5.txt". This data file contains two columns, the first is the temperature in Celsius, the second the measured voltage (V). 


<div class='alert alert-success'> 
In the code cell below:
<ul>
<li> Load the data file using np.loadtxt, and unpack it into two arrays called `temp` and `voltage`. </li>
<li> Plot it on a (labelled!) graph, using data points only (no line). </li>
</ul>

</div>

In [2]:
### STUDENT GENERATED CELL ###

temp, voltage = np.loadtxt("studentdataE5.txt", unpack=True) # loads data and assigns to "temp" and "voltage"

plt.figure() # plots new graph
plt.plot(temp,voltage,'r.', label="Data Points") # plots data as red dots
plt.legend()
plt.xlabel(r'Temperature ($^\circ$C)') # raw string so that "\c" is not treated as an escape sequence
plt.ylabel('Voltage (V)')
plt.title('Temperature against Voltage for a temperature sensor')


Text(0.5,1,'Temperature against Voltage for a temperature sensor')

### Numpy's polyfit function(s)

Fortunately, we can get numpy to do all the hard work of fitting for us, by using the **polyfit** function. The documentation for this is here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.polyfit.html - have a quick look at this before proceeding.

The cell below shows an example usage of np.polyfit. To run it, change "temp" and "voltage" to whatever you are using as the relevant variable names.

In [3]:
degree = 2 # degree of polynomial we want to fit to
p = np.polyfit(temp,voltage,degree)
print ("The fitted polynomial coefficients are", p)

The fitted polynomial coefficients are [ 1.06532357e-04 -3.15934839e-02  2.21148238e+00]


***IMPORTANT NOTE:*** Numpy actually has two versions of polyfit. `numpy.polyfit`, and  `numpy.polynomial.polynomial.polyfit`. They are almost identical, and are used in the same way. But look at the one crucial difference:

In [4]:
p = np.polyfit(temp,voltage,degree)
print ("np.polyfit returns the coefficients as", p)
pp = np.polynomial.polynomial.polyfit(temp,voltage,degree)
print ("np.polynomial.polynomial.polyfit returns the coefficients as", pp)

np.polyfit returns the coefficients as [ 1.06532357e-04 -3.15934839e-02  2.21148238e+00]
np.polynomial.polynomial.polyfit returns the coefficients as [ 2.21148238e+00 -3.15934839e-02  1.06532357e-04]


`numpy.polyfit` (our `np.polyfit`) returns the coefficients with the highest power first, but `numpy.polynomial.polynomial.polyfit` returns the lowest power first. This is daft.
  
However, it is an important reminder of *why* we import our modules with named abbreviations - it makes it clear which version of which module function we're using!
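
The difference in ordering is easy to check on data where we already know the answer. The sketch below uses made-up, noise-free data from $y = 3 + 2x + 0.5x^2$ (the names `x_demo`, `y_demo`, `high_first` and `low_first` are ours, chosen for illustration) and fits it with both functions:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Synthetic, noise-free data from y = 3 + 2x + 0.5x^2 (illustrative values only)
x_demo = np.linspace(0.0, 10.0, 20)
y_demo = 3.0 + 2.0 * x_demo + 0.5 * x_demo**2

high_first = np.polyfit(x_demo, y_demo, 2)  # coefficients with the highest power first
low_first = P.polyfit(x_demo, y_demo, 2)    # coefficients with the lowest power first

print("np.polyfit:", high_first)
print("numpy.polynomial.polynomial.polyfit:", low_first)
print("Same values in reversed order:", np.allclose(high_first[::-1], low_first))
```

Reversing one array with `[::-1]` converts between the two conventions, which is handy if you ever need to mix functions from the two families.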

### Plotting fitted polynomials

Now we have our polynomial coefficients, we probably want to plot this polynomial to see how good the fit is. We could construct an expression for this from the elements of p, but there's a much easier way to do this with the numpy function `poly1d()` (http://docs.scipy.org/doc/numpy/reference/generated/numpy.poly1d.html ), which will convert the array of polynomial coefficients $p$ into a function that we can call to generate the value of the polynomial for a given value of $x$. The following code cell does this by using np.poly1d to create a _function_ called "line":

In [5]:
line = np.poly1d(p)
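
To see what `np.poly1d` is doing, here is a minimal sketch with hypothetical coefficients (not the fitted ones). It compares the callable with the polynomial written out by hand and with `np.polyval`, which evaluates a coefficient array directly:

```python
import numpy as np

coeffs = np.array([0.5, 2.0, 3.0])  # hypothetical coefficients, highest power first
quad = np.poly1d(coeffs)            # callable representing 0.5x^2 + 2x + 3

x_demo = np.array([0.0, 1.0, 2.0])
by_poly1d = quad(x_demo)                                           # call it like a function
by_hand = coeffs[0] * x_demo**2 + coeffs[1] * x_demo + coeffs[2]   # the same thing written out
by_polyval = np.polyval(coeffs, x_demo)                            # no intermediate object needed

print(by_poly1d)
print(np.allclose(by_poly1d, by_hand), np.allclose(by_poly1d, by_polyval))
```

All three give the same numbers; the advantage of `poly1d` and `polyval` is that they work for any degree without rewriting the expression.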

<div class='alert alert-success'> 

Now you need to:
<ul>
<li> Generate an array of x-values at which to evaluate the fitted polynomial </li>
<li> Use the "line" function we just created to generate a corresponding array of y-values.</li>
<li> Plot the original data (as points) and the fitted line (as a line) on a labelled graph.</li>
</ul>
Do this in the cell below.
</div>

In [6]:
### STUDENT GENERATED CELL ###

x = np.linspace(5,80,500) # creates an array of 500 equally spaced points between 5 and 80
y = line(x) # calculates y values using x array and "line" function

plt.figure() # plots new graph
plt.plot(x, y,'r-', label="Fitted Polynomial") # plots fitted line
plt.plot(temp, voltage,'b.', label="Data Points") # plots original data points
plt.legend()
plt.xlabel(r'Temperature ($^\circ$C)') # raw string so that "\c" is not treated as an escape sequence
plt.ylabel('Voltage (V)')
plt.title('Temperature against Voltage for a temperature sensor')


Text(0.5,1,'Temperature against Voltage for a temperature sensor')

At first glance, this second-order polynomial looks okayish (or it should do if you've done it right!) - but with definite room for improvement.

But how good is the fit really?

## Goodness of fit (1) - calculating the errors on the coefficients and the matrix of covariance.

We'll recalculate, this time with an important addition to the polyfit call - we'll ask it to also calculate the matrix of covariance.

(A Python aside: In the cell below, there's also a line that limits the number of decimal places that are displayed when we print a numpy array. This is just for convenience - compare this with the arrays printed out at full precision above: which do you find easier to read? You can change the number of decimal places displayed to whatever you want. Note that this will affect _all_ arrays printed after this line is run, but won't affect the formatting of any other numbers, including individual array elements printed on their own. See https://docs.scipy.org/doc/numpy/reference/generated/numpy.set_printoptions.html for full documentation of this function - we will be using it again in other sessions.)
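
The sketch below illustrates which output is affected. To avoid changing the settings for the rest of the notebook, it uses the context manager `np.printoptions` (available in NumPy 1.15 and later, as far as we know) rather than `np.set_printoptions`; the array `values` is made up for the example:

```python
import numpy as np

values = np.array([1.23456789, 2.34567891])  # made-up numbers for illustration

with np.printoptions(precision=4):
    inside = str(values)      # whole-array formatting is limited to 4 decimal places here
    element = str(values[0])  # a single element is a NumPy scalar, so it is not affected

outside = str(values)         # the previous print options are restored after the block

print("inside the block: ", inside)
print("single element:   ", element)
print("outside the block:", outside)
```

If you only want a shorter display in one place, the context manager keeps the change local; `np.set_printoptions` is the better choice when you want it for the whole notebook.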

In [7]:
# It's useful to limit the number of dp displayed for arrays - see above
np.set_printoptions(precision=4) # 4 dec.places

# recalculating the polynomial
degree = 2 # degree of polynomial we want to fit to
p, v = np.polyfit(temp,voltage,degree,cov=True)
print("The fitted polynomial coefficients are:\n", p)
print("The matrix of covariance is:\n", v)

The fitted polynomial coefficients are:
 [ 1.0653e-04 -3.1593e-02  2.2115e+00]
The matrix of covariance is:
 [[ 1.1925e-10 -1.0136e-08  1.5502e-07]
 [-1.0136e-08  9.0966e-07 -1.5220e-05]
 [ 1.5502e-07 -1.5220e-05  3.1272e-04]]


#### What is the matrix of covariance?

**The quick answer:** The matrix of covariance allows us to calculate the errors on our fitted parameters. For $n$ parameters, the matrix of covariance is an $n \times n$ matrix, whose diagonal elements are the *squares* of the uncertainties of the fitted parameters (their variances). The off-diagonal elements are the covariances between pairs of parameters, which tell us how strongly the uncertainties in those parameters are correlated - we won't use them here.

**The long (and more complete) answer** is given in sections 7.2-7.4 of [Hughes and Hase](https://www.dawsonera.com/guard/protected/dawson.jsp?name=https://shib-idp.ucl.ac.uk/shibboleth&dest=http://www.dawsonera.com/depp/reader/protected/external/AbstractView/S9780191576560).
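
As a rough illustration of the quick answer, the sketch below fits a quadratic to synthetic data (made-up coefficients and noise level, not the E5 measurements) and reads the parameter errors off the diagonal. It also rescales the matrix into correlation coefficients, which is one way of seeing how large the off-diagonal elements really are:

```python
import numpy as np

# Synthetic data: a quadratic plus Gaussian noise (all values are illustrative assumptions)
rng = np.random.default_rng(0)
x_demo = np.linspace(5.0, 80.0, 26)
y_demo = 2.2 - 0.03 * x_demo + 1e-4 * x_demo**2 + rng.normal(0.0, 0.006, x_demo.size)

coeffs, cov = np.polyfit(x_demo, y_demo, 2, cov=True)

errors = np.sqrt(np.diag(cov))                # uncertainties = square roots of the diagonal
correlation = cov / np.outer(errors, errors)  # off-diagonals rescaled to lie between -1 and 1

print("errors on the coefficients:", errors)
print("correlation matrix:\n", correlation)
```

Correlation coefficients close to $\pm 1$ mean the fitted parameters are far from independent of each other.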




When the cell below is complete, it will output the order of each coefficient, the corresponding coefficient and its error, with appropriate text strings.

Look at how we do this:

1. This is most easily done using a loop over the elements of `p`. For example, the length of an array `p` is given by `len(p)` or `np.size(p)`. The structure  `for i in range(np.size(p)):` sets up a loop that will iterate the same number of times as there are elements in the array.
2. Remember that `np.polyfit` gives the coefficients highest-order first. So for a loop with increasing index i, the order of the coefficient `p[i]` will be given by `len(p)-i-1`.


<div class='alert alert-success'>
You will need to complete the final line of this cell to calculate the error of each coefficient.

You'll probably want to use `np.diag` to extract the diagonal elements of the matrix of covariance, in the form of a 1d array. You can find out more about this numpy function here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.diag.html
</div>


In [8]:
### STUDENT COMPLETED CELL ###

# extract coefficients and errors from matrix of covariance
for i in range(np.size(p)): # creates for loop to print each order with corresponding coefficient and error
    print("coefficient order x^", len(p)-i-1, " is ", p[i], " with error ", np.sqrt(np.diag(v)[i])) ### COMPLETE THIS LINE

coefficient order x^ 2  is  0.00010653235653235627  with error  1.0920131466843892e-05
coefficient order x^ 1  is  -0.03159348392348389  with error  0.0009537595561635596
coefficient order x^ 0  is  2.2114823768823766  with error  0.017683870869644822


#### An important caveat about the interpretation of these errors

When doing calculations like this it's important to understand how the mathematics of the calculations relates to the reality of the experiment and the data. In this case we can see that the off-diagonal elements of `v` are clearly non-zero, and hence there is a significant correlation between the polynomial coefficients.

The diagonal elements of the matrix of covariance can be used to find the uncertainty of a coefficient *IF THAT COEFFICIENT ALONE IS THE REQUIRED RESULT OF THE EXPERIMENT*; but when calculating any values based on the full set of coefficients (e.g. the value of the fit for a particular abscissa value) this can give a gross overestimate. In advanced methods the full matrix is used, but at the undergraduate level some simplified approximation should be employed, such as taking just the variance of the zero-order coefficient (its diagonal element in the matrix).

You should bear this in mind later in the term when you are doing experiment E5 yourselves. For the moment, however, as we are just concerning ourselves with the polynomial fitting itself, we'll continue to take the errors of the coefficients from the matrix of covariance.

This issue will be explored in more detail next year in course PHAS0058 (Lab 3).
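
For the curious, here is a sketch of the "advanced" calculation mentioned above, which you are not expected to use in this session. It applies standard first-order error propagation to the fitted value at one temperature, once with the full matrix and once with the diagonal elements only. The data are synthetic and the evaluation point `x0` is an arbitrary choice:

```python
import numpy as np

# Synthetic data: a quadratic plus Gaussian noise (all values are illustrative assumptions)
rng = np.random.default_rng(1)
x_demo = np.linspace(5.0, 80.0, 26)
y_demo = 2.2 - 0.03 * x_demo + 1e-4 * x_demo**2 + rng.normal(0.0, 0.006, x_demo.size)

coeffs, cov = np.polyfit(x_demo, y_demo, 2, cov=True)

x0 = 40.0                         # arbitrary point at which to evaluate the fit
jac = np.array([x0**2, x0, 1.0])  # derivative of the fitted value w.r.t. each coefficient

sigma_full = np.sqrt(jac @ cov @ jac)                # uses the whole matrix of covariance
sigma_diag = np.sqrt(np.sum(jac**2 * np.diag(cov)))  # pretends the coefficients are independent

print("fitted value at x0:", np.polyval(coeffs, x0))
print("error using the full matrix:", sigma_full)
print("error using the diagonal only:", sigma_diag)
```

Because the off-diagonal terms largely cancel the diagonal ones, the diagonal-only estimate comes out noticeably larger, which is the overestimate described in the caveat.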

## Goodness of fit (2) - calculating the residuals and $\chi^2$

Remember that the residuals are defined as the vertical distance between each of the data points and the fitted line. If the fitted line passes exactly through one of the data points the residual for this point is zero. We can see intuitively that if we have a "good" fit the residual values will be small. However, we have to remember that our experimental data points are subject to random errors and so we should expect the values of the residuals to be randomly distributed about zero. If we find that all the residuals are exactly zero we should start to suspect that our line is "over-fitted". This means it fits our initial data exactly, but if we take any more measurements (subject of course to the same random errors) the line will not fit them and therefore cannot be used to predict their values in advance. So for a useful parameterization of our dataset we need a fit which is "good" but not "too good". The chi-squared test is a statistical tool which can help us find the sort of fit we need. 

*Hint: For a guide that will enable you to use a numerical value of $\chi^2$ to decide if your fit is "good", "too good", or "not good", look at the text box on page 107 of [Hughes and Hase](https://www.dawsonera.com/guard/protected/dawson.jsp?name=https://shib-idp.ucl.ac.uk/shibboleth&dest=http://www.dawsonera.com/depp/reader/protected/external/AbstractView/S9780191576560).*


The numpy polyfit function can also return information about the residuals. We obtain this extra output by setting `full=True` (but note that this is mutually exclusive with `cov=True`: you can only have one or the other). Again from the np.polyfit documentation:

       "residuals, rank, singular_values, rcond : present only if full = True
            Residuals of the least-squares fit, the effective rank of the scaled Vandermonde coefficient matrix, 
            its singular values, and the specified value of rcond. For more details, see linalg.lstsq."
            
            
(http://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.lstsq.html#numpy.linalg.lstsq)

Let's look at what this gives us:
      

In [9]:
p, residuals, rank, singular_values, rcond = np.polyfit(temp,voltage,degree,full=True)

print("p is:", p)
print("residuals array is:", residuals)
print("rank is:", rank)
print("singular_values is:", singular_values)
print("rcond is:", rcond)

p is: [ 1.0653e-04 -3.1593e-02  2.2115e+00]
residuals array is: [0.0133]
rank is: 3
singular_values is: [1.6569 0.4974 0.0845]
rcond is: 5.773159728050814e-15


Note that here "residuals" gives us the *sum* of the square of the residuals, not the individual residuals themselves, which is normally what we're interested in. But it's easy to calculate them, so normally it's more useful to have `cov=True` than `full=True`.

To calculate the residuals, we just remember that the residuals are the vertical distance between the data point and the fitted line. 


<div class="alert alert-success">
Use the cell below to calculate and print out:
<ul>
<li> the residuals; </li> 
<li>  the squares of the residuals; and</li> 
<li>  the sum of the squares of the residuals.</li> 
</ul>
</div>



In [10]:
### STUDENT GENERATED CELL ###

residual = [] # creates empty list for residuals
r_squared = [] # creates empty list for residuals squared

for i in range(len(temp)): # for loop to calculate each residual and residual squared, then added to empty array above
    r = ((p[0]*temp[i]**2 + p[1]*temp[i] + p[2]) - voltage[i]) # calculation of residual
    rsqrd = r**2 # calculation of residual squared
    residual.append(r) # adds residual to empty array
    r_squared.append(rsqrd) # adds residual squared to empty array
    
sumrsqrd = np.sum(r_squared) # calculates sum of residuals squared
print("The residuals are:", residual)
print("\nThe squares of the residuals are:", r_squared)
print("\n\nThe sum of the squares of the residuals is:", sumrsqrd)

The residuals are: [-0.02318925518925452, -0.010585543345542792, -0.010064249084248578, -0.006625372405371532, -0.01526891330891278, 0.011005128205128889, 0.021196752136752872, 0.01630595848595906, 0.022332747252747698, 0.026277118437119107, 0.02613907203907251, 0.02191860805860868, 0.015615726495727067, 0.00423042735042789, -0.0012372893772887306, -0.011787423687423138, -0.012419975579975207, -0.03513494505494452, -0.025932332112331702, -0.024812136752136338, -0.030774358974358762, -0.02981899877899874, -0.016946056166056156, 0.007844468864468768, 0.025552576312576347, 0.056178266178266156]

The squares of the residuals are: [0.0005377415562323677, 0.00011205372792036528, 0.00010128910962979833, 4.389555950985856e-05, 0.0002331397136350938, 0.0001211128468113234, 0.00044930230114693747, 0.0002658842821458203, 0.0004987515998551099, 0.0006904869533583849, 0.0006832510870638223, 0.00048042537922690534, 0.00024385091398935235, 1.7896515567248337e-05, 1.5308850031515346e-06, 0.00013894335

Check that your result for the sum of the squares of the residuals is the same as the "residuals" generated by full=True.
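
One possible way of making this check without a loop is sketched below, using `np.polyval` to evaluate the fit at every data point at once. It runs on synthetic data with our own variable names (`x_demo`, `y_demo`, `resid`), so you would substitute `temp`, `voltage` and your own results:

```python
import numpy as np

# Synthetic data: a quadratic plus Gaussian noise (all values are illustrative assumptions)
rng = np.random.default_rng(2)
x_demo = np.linspace(5.0, 80.0, 26)
y_demo = 2.2 - 0.03 * x_demo + 1e-4 * x_demo**2 + rng.normal(0.0, 0.006, x_demo.size)

coeffs, ss_from_polyfit, rank, sv, rcond = np.polyfit(x_demo, y_demo, 2, full=True)

resid = np.polyval(coeffs, x_demo) - y_demo  # fitted value minus data, for every point at once
ss_by_hand = np.sum(resid**2)                # sum of the squares of the residuals

print("sum of squares by hand:     ", ss_by_hand)
print("'residuals' from full=True: ", ss_from_polyfit[0])
print("do they agree?", np.isclose(ss_by_hand, ss_from_polyfit[0]))
```

Using `np.isclose` rather than `==` allows for the tiny rounding differences between the two calculations.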

To take account of the random experimental errors affecting our data we can also divide the residuals by the error in the dependent variable (here the voltage is our "y" value), which for this experiment was estimated by the student as 0.006 V for all values (if the error is different for each measurement, we can just have a 1D-array for this instead of a single number). 
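
The remark about using a 1D-array of errors works because numpy divides element by element. A small sketch, with hypothetical residuals and errors chosen so the arithmetic is easy to follow:

```python
import numpy as np

resid_demo = np.array([0.012, -0.006, 0.003])  # hypothetical residuals in volts

same_error = 0.006                                 # one error shared by every point
per_point_error = np.array([0.006, 0.003, 0.006])  # a different error for each point

norm_scalar = resid_demo / same_error       # every residual divided by the same number
norm_array = resid_demo / per_point_error   # each residual divided by its own error

print(norm_scalar)
print(norm_array)
```

The same line of code therefore works in both cases, as long as the error array has one entry per data point.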



<div class="alert alert-success">
The student measured the error in the voltage to be 0.006 V for all the measured values. 
<br>
In the cell below, repeat your calculations above, but using the residuals divided by the y-error rather than the residuals alone. Store the y-error in a variable rather than hardcoding the value of 0.006 V.
</div>

In [11]:
### STUDENT GENERATED CELL ###

y_error = 0.006 # error in voltage
residual1 = [] # creates empty list for residuals divided by the y-error
r_squared1 = [] # creates empty list for their squares

for i in range(len(temp)): # for loop to calculate each residual and residual squared, then added to empty array above
    new_r = ((p[0]*temp[i]**2 + p[1]*temp[i] + p[2]) - voltage[i])/y_error # calculation of residual with error
    new_rsqrd = (new_r)**2 # calculation of residual squared
    residual1.append(new_r) # adds residual to empty array
    r_squared1.append(new_rsqrd) # adds residual squared to empty array
    
sumrsqrd1 = np.sum(r_squared1) # calculates sum of residuals squared
print("The residuals are:", residual1)
print("\nThe squares of the residuals are:", r_squared1)
print("\n\nThe sum of the squares of the residuals is:", sumrsqrd1)

The residuals are: [-3.8648758648757533, -1.764257224257132, -1.677374847374763, -1.1042287342285888, -2.5448188848187967, 1.834188034188148, 3.532792022792145, 2.717659747659843, 3.7221245421246163, 4.379519739519851, 4.356512006512085, 3.6531013431014463, 2.602621082621178, 0.705071225071315, -0.20621489621478842, -1.9645706145705228, -2.069995929995868, -5.855824175824087, -4.322055352055283, -4.135356125356056, -5.129059829059794, -4.969833129833123, -2.824342694342693, 1.3074114774114614, 4.258762718762725, 9.36304436304436]

The squares of the residuals are: [14.937265450899101, 3.11260355334348, 2.813586378605509, 1.2193210974960713, 6.476103156530384, 3.364245744758983, 12.480619476303817, 7.385674504050563, 13.854211107086385, 19.180193148844022, 18.97919686288395, 13.345149422969591, 6.773636499704232, 0.497125432423565, 0.04252458342087596, 3.859537699634002, 4.284883150199458, 34.290676778165846, 18.68016246622972, 17.101170283519853, 26.307254730074884, 24.699241338386898,

The sum of the squares of the residuals divided by the y-error is the $\chi^2$ of the fit. Dividing this by the number of degrees of freedom will give us the reduced $\chi^2$. The number of degrees of freedom is defined as the total number of datapoints minus the number of coefficients or fitting parameters in the fitted equation.
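
Written out as equations, with $y_i$ the measured voltage, $f(x_i)$ the fitted polynomial evaluated at temperature $x_i$, $\sigma_i$ the error on $y_i$, $N$ the number of data points and $n$ the number of fitted coefficients (these symbols are introduced here for illustration only):

$$
\chi^2 = \sum_{i=1}^{N} \left( \frac{f(x_i) - y_i}{\sigma_i} \right)^2, \qquad
\nu = N - n, \qquad
\chi^2_\nu = \frac{\chi^2}{\nu}
$$

Here $\nu$ is the number of degrees of freedom and $\chi^2_\nu$ is the reduced $\chi^2$. For a polynomial of degree 2 there are $n = 3$ coefficients, so with the 26 data points in this dataset $\nu = 26 - 3 = 23$.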


<div class="alert alert-success">
In the cell below, calculate and output the number of degrees of freedom and the reduced $\chi^2$.
</div>

In [12]:
### STUDENT GENERATED CELL ###

v = len(temp) - len(p) # calculates degrees of freedom
r_X = sumrsqrd1/v # calculates reduced X^2 term
print("The number of degrees of freedom is:", v)
print("The reduced X^2 term is:", r_X)

The number of degrees of freedom is: 23
The reduced X^2 term is: 16.05105560569328


The reduced $\chi^2$ is useful as it gives us a single number with which we can compare the goodness of fit of different polynomials.

Now we have everything in place, let's try comparing different polynomials. 

<div class="alert alert-success">
Write code in the cell below that will calculate the best fit polynomials of order 1,2,3,4,5 and 6. For each of these:<ul>

<li>  Print out the coefficients, with their order, and error </li>
<li>  Calculate and output the number of degrees of freedom and the reduced $\chi^2$ </li>
    </ul>
</div>

Hints: 
* the most efficient way of doing this is with a loop structure
* The residuals are the vertical distance between the fitted line and the data point - so you'll need to recalculate the residuals for each fitted line
* The `line` function we generated was specific to those values of `p`. So each time the array of polynomial coefficients `p` changes, you'll also need to redefine this function.
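
One possible structure for this, sketched on synthetic data rather than the E5 file, is to wrap the fit and the reduced $\chi^2$ calculation in a small function and call it inside the loop. The function name `reduced_chi2` and all the variable names are our own, and `np.polyval` is used in place of redefining `line` each time:

```python
import numpy as np

def reduced_chi2(x, y, degree, y_err):
    """Fit a polynomial of the given degree; return (coefficients, dof, reduced chi^2)."""
    coeffs = np.polyfit(x, y, degree)
    norm_resid = (np.polyval(coeffs, x) - y) / y_err  # residuals in units of the error
    dof = len(x) - len(coeffs)                        # data points minus fitted coefficients
    return coeffs, dof, np.sum(norm_resid**2) / dof

# Synthetic data: a cubic plus Gaussian noise (all values are illustrative assumptions)
rng = np.random.default_rng(3)
x_demo = np.linspace(0.0, 1.0, 40)
sigma_demo = 0.01
y_demo = 1.0 - 2.0 * x_demo + 3.0 * x_demo**3 + rng.normal(0.0, sigma_demo, x_demo.size)

results = {}
for degree in range(1, 7):
    coeffs, dof, chi2_red = reduced_chi2(x_demo, y_demo, degree, sigma_demo)
    results[degree] = (dof, chi2_red)
    print(f"degree {degree}: {dof} degrees of freedom, reduced chi^2 = {chi2_red:.2f}")
```

Because the synthetic data really are cubic, the reduced $\chi^2$ should drop sharply at degree 3 and then stay roughly constant, which is the kind of pattern to look for in the real data.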

In [13]:
### STUDENT GENERATED CELL ###

degree1 = 6 # degree of polynomial we want to fit to
y_error = 0.006 # error in voltage

for i in range(degree1): # for loop to change the degree each time
    j=i+1 # degree of the polynomial for this iteration, running from 1 to degree1
    q = 0 # sets q to 0 each run
    residuals = []
    r_squared2 = [] # creates empty array for residuals squared
    p, v = np.polyfit(temp,voltage,j,cov=True) # assigns coefficient array to "p" and matrix of covariance to "v"
    print("For a degree", j) # prints the degree of the run
    for w in range(len(p)): # creates for loop to print each order with corresponding coefficient and error
        print ("The coefficient order x^", len(p)-w-1, " is ", p[w], " with error ", np.sqrt(np.diag(v)[w]))
        
    while q < ((degree1) - (j)): # makes sure each "p" array has 7 values by filling the array with 0's before actual values
        p = np.append([0],p) # makes sure there is a coefficient for each order of x even if its zero
        q = q + 1
        
    for k in range(len(temp)): # calculates residuals, residuals squared for all data points
        # line below calculates each residual
        new_r1 = ((p[0]*temp[k]**6 + p[1]*temp[k]**5 + p[2]*temp[k]**4 + p[3]*temp[k]**3 + p[4]*temp[k]**2 + p[5]*temp[k] + p[6]) - voltage[k])/y_error
        # line below calculates residual squared
        new_rsqrd1 = (new_r1)**2
        residuals.append(new_r1)# adds residual to empty array
        r_squared2.append(new_rsqrd1) # adds residual squared to empty array
    sumrsqrd2 = np.sum(r_squared2) # calculates sum of residuals squared
    v1 = len(temp) - (j + 1) # calculates degrees of freedom
    r_X1 = sumrsqrd2/v1 # calculates reduced X^2 term
    print("There are", v1,"degrees of freedom with reduced X^2 term of", r_X1, "\n")

For a degree 1
The coefficient order x^ 1  is  -0.022538233618233598  with error  0.0005038795798805088
The coefficient order x^ 0  is  2.0729903133903127  with error  0.024230793099587977
There are 24 degrees of freedom with reduced X^2 term of 85.09437242798354 

For a degree 2
The coefficient order x^ 2  is  0.00010653235653235627  with error  1.0920131466843892e-05
The coefficient order x^ 1  is  -0.03159348392348389  with error  0.0009537595561635596
The coefficient order x^ 0  is  2.2114823768823766  with error  0.017683870869644822
There are 23 degrees of freedom with reduced X^2 term of 16.05105560569328 

For a degree 3
The coefficient order x^ 3  is  2.2156696264087213e-06  with error  2.8107150536249555e-07
The coefficient order x^ 2  is  -0.0001759655208347549  with error  3.6259484789556216e-05
The coefficient order x^ 1  is  -0.021602364877119123  with error  0.0013560808352536893
The coefficient order x^ 0  is  2.1270343447414364  with error  0.013953947357395538
There a

<div class="alert alert-success"> Which order of fitted polynomial would you use to parameterize the relationship between voltage and temperature for this sensor? <br> 
Give the reasons for your choice in a text cell. <br> Then plot the fitted line for the polynomial you think best represents the data, together with the original data, on a labelled graph</div>

The best order for the fitted polynomial is 4. For a good fit, the reduced $\chi^2$ should be approximately equal to 1. For orders of 3 or less, the reduced $\chi^2$ is much larger than 1, whereas for orders of 4 and above it is about 1. So order 4 is the best choice, since it is the lowest order for which the reduced $\chi^2$ is about 1. Increasing the order beyond 4 doesn't improve the fit significantly, since the reduced $\chi^2$ stays about equal to 1.

### Student completed text cell ###
Explanation:
    
In order to be confident that we have a good fit, we would like the value of the reduced $\chi^2$ to be around 1. We can see that for the polynomials of degree 3 or less, $\chi^2$ is much larger than this, whereas the polynomials of degree 4, 5, and 6 are all around 1. Hence the best choice for this data is probably the polynomial of degree 4, which is the smallest degree of polynomial which gives us the desired $\chi^2$. Adding in the extra terms in the polynomial doesn't significantly improve the fit, as we can see from the size of the highest-order coefficients.

In [14]:
### STUDENT GENERATED CELL ###

degrees_fit = 4 # number of degrees used for fit
p1 = np.polyfit(temp,voltage,degrees_fit) # array of coefficients for degree 4
line2 = np.poly1d(p1) # uses coefficients to create an equation for line
x1 = np.linspace(5,80,500) # 500 equally spaced points between 5 and 80 in array
y1 = line2(x1) # y values to corresponding x values above
print("The polynomial fit is of order", degrees_fit)

plt.figure() # plots new graph
plt.plot(x1, y1,'r-', label="Fitted Polynomial") # plots fitted polynomial graph
plt.plot(temp, voltage,'b.', label="Data Points") # plots data points
plt.legend()
plt.xlabel(r'Temperature ($^\circ$C)') # raw string so that "\c" is not treated as an escape sequence
plt.ylabel('Voltage (V)')
plt.title('Temperature against Voltage for a temperature sensor')

The polynomial fit is of order 4



Text(0.5,1,'Temperature against Voltage for a temperature sensor')

## Fitting the residuals to a Gaussian

It would be interesting to have a closer look at the residuals. In theory, they should follow a Gaussian (normal) distribution. Do they?

Fit the residuals to a Gaussian using scipy.stats (following the same process as we did in session 2), and plot them as a histogram together with the fitted Gaussian.
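
As a reminder of the process, here is a minimal sketch using a synthetic sample drawn from a unit Gaussian as a stand-in for the residuals (the names `sample`, `mu`, `sigma` and `grid` are ours). Plotting is left out so that the fitting step is easy to see:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the residuals (illustrative assumption, not the E5 data)
rng = np.random.default_rng(4)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

mu, sigma = stats.norm.fit(sample)  # best-fit mean and standard deviation

grid = np.linspace(-4.0, 4.0, 200)
pdf = stats.norm.pdf(grid, mu, sigma)  # curve to overlay on a histogram drawn with density=True

print("fitted mean:", mu)
print("fitted standard deviation:", sigma)
print("same as np.mean and np.std:", np.isclose(mu, sample.mean()), np.isclose(sigma, sample.std()))
```

For a Gaussian, the fitted parameters are simply the sample mean and the (population) standard deviation, which is a useful sanity check on the output of `stats.norm.fit`.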

<div class="alert alert-success">Do you think that these residuals match the expected distribution?
What relation do you notice between the standard deviation of the residuals and the experimental error on the voltage reading estimated by the student who did this experiment? Explain in a text cell.</div>

In [17]:
### STUDENT GENERATED CELL ###

import scipy.stats as stats # imports the package needed to fit a Gaussian to the residuals

x2 = np.linspace(-4, 4, 200) # new array of 200 equally spaced points between -4 and 4
residual2 = [] # empty array for residuals
for i in range(len(temp)): # calculates the residual for each data point
    # residual of each data point from the degree-4 fitted line, divided by the y-error
    r2 = ((p1[0]*temp[i]**4 + p1[1]*temp[i]**3 + p1[2]*temp[i]**2 + p1[3]*temp[i] + p1[4]) - voltage[i])/y_error
    residual2.append(r2) # adds residuals to empty array

x0, sigma = stats.norm.fit(residual2) # calculates the mean and standard deviation of the residuals
print("The Mean is", x0, "\nThe Standard Deviation is", sigma)
gaussian_fn = stats.norm.pdf(x2, x0, sigma) # creates an array of y values for Gaussian fitted line
    
plt.figure() # plots new graph
plt.hist(residual2, bins=15, density=True, edgecolor='k') # histogram with residuals data
plt.plot(x2,gaussian_fn,'r-.', label="Gaussian.pdf") # plots fitted Gaussian line
plt.legend()
plt.xlabel("Value")
plt.ylabel("Probability density") # density=True normalises the histogram to unit area

The Mean is -7.973963369942375e-14 
The Standard Deviation is 0.9400983808065035


Text(0,0.5,'Probability density')

From the graph, we can see that the residuals, in general, do follow a Gaussian distribution. The mean is approximately equal to 0.

### Student written text cell ###
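We can see that the residuals, broadly speaking, follow a Gaussian distribution, with an average value of $x_0 \sim 0$. Because each residual was divided by the estimated error of 0.006 V, the fitted standard deviation of about 0.94 is in units of that error: the scatter of the data about the fitted line is close to the student's estimated experimental error. This is another indicator that the polynomial provides a good fit to the data.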

### What's coming next

In this session we've seen how to fit data in a general case when we don't already know from a theoretical model what function we want to fit to. In the next session, we'll be looking at how to use Python to fit a line when we know what function we want to fit the data to.