## Numerical Methods
### Activity 1: Simple Linear Regression

First load the data into a pandas dataframe and display it. Our data is from a scientific paper from 1905 which compared head size to brain weight (people were interested in that kind of thing back then).

In [1]:
%matplotlib notebook
import pandas as pd
df = pd.read_csv('brainhead.csv')
display(df)
df.describe()

Unnamed: 0,"Gender (1 = M, 2 = F)","Age (1 = 20-46, 2 = 46+)",Head size (cm^3),Brain weight (g)
0,1,1,4512,1530
1,1,1,3738,1297
2,1,1,4261,1335
3,1,1,3777,1282
4,1,1,4177,1590
...,...,...,...,...
232,2,2,3214,1110
233,2,2,3394,1215
234,2,2,3233,1104
235,2,2,3352,1170


Unnamed: 0,"Gender (1 = M, 2 = F)","Age (1 = 20-46, 2 = 46+)",Head size (cm^3),Brain weight (g)
count,237.0,237.0,237.0,237.0
mean,1.434599,1.535865,3633.991561,1282.873418
std,0.496753,0.499768,365.261422,120.340446
min,1.0,1.0,2720.0,955.0
25%,1.0,1.0,3389.0,1207.0
50%,1.0,2.0,3614.0,1280.0
75%,2.0,2.0,3876.0,1350.0
max,2.0,2.0,4747.0,1635.0


Pandas lets us generate histograms from dataframes. You can choose the number of bins (the default is 10), or explicitly define the bin boundaries.

In [None]:
df.hist(column = 'Head size (cm^3)', grid = False, edgecolor='black', bins = 10)
df.hist(column = 'Brain weight (g)', grid = False, edgecolor='black', bins = [900, 1100, 1300, 1500, 1700])
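
To see what an explicit list of bin boundaries does, the sketch below counts values per bin with `np.histogram`. The `sample` list is made up for illustration and is not part of the brain-weight data. Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.

```python
import numpy as np

# Hypothetical sample values, chosen only to illustrate binning.
sample = [11, 5, 12, 3, 10, 7, 7, 9, 15, 19, 6, 7, 13, 11, 8, 10, 16, 14, 5, 7]
edges = [0, 5, 10, 15, 20]

# np.histogram returns the count in each bin and the bin edges it used.
counts, returned_edges = np.histogram(sample, bins=edges)

for left, right, count in zip(returned_edges[:-1], returned_edges[1:], counts):
    print(f"{left} to {right}: {count} values")
```

The counts always add up to the number of data points, provided every value lies inside the outermost edges.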

We can set unequal bin widths if we like and then draw a frequency density histogram, in which each bar's height is its frequency divided by its bin width. This takes slightly more work, as pandas will not draw this kind of histogram directly. Instead, we get it to draw a 'normalized' density histogram (i.e. the total area of the bars sums to one) and then relabel the y values to obtain a frequency density histogram.

In [None]:
# This part creates a normalized histogram figure.
# We need to do a bit of backend work with matplotlib, so we start by creating the fig and ax objects separately.

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()

# This draws the normalized histogram onto ax.
df.hist(column = 'Brain weight (g)', 
        grid = False, 
        density = True, 
        edgecolor='black', 
        bins = [900, 1100, 1200, 1300, 1400, 1700],
        ax = ax)

# This part relabels the y axis. With density = True each bar height is the frequency density
# divided by the number of data points, so we multiply the default y tick values by len(df) (237 here).
labels = np.round(np.array(ax.get_yticks())*len(df),3)
plt.yticks(ax.get_yticks(), labels)

plt.show()
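
The relabelling step relies on a simple relationship: a normalized density is the frequency density divided by the number of data points. The sketch below checks this with `np.histogram` on made-up values and unequal bins; the names `values` and `edges` are illustrative only.

```python
import numpy as np

# Hypothetical values and unequal bin edges, for illustration only.
values = np.array([1, 2, 2, 3, 5, 6, 7, 9, 12, 18])
edges = np.array([0, 2, 4, 8, 20])

counts, _ = np.histogram(values, bins=edges)
widths = np.diff(edges)

# Frequency density: count per unit of bin width.
frequency_density = counts / widths

# Normalized density, as drawn with density=True: bar areas sum to one.
density, _ = np.histogram(values, bins=edges, density=True)

# Multiplying the density by the number of data points recovers frequency density.
recovered = density * len(values)

print(frequency_density)
print(recovered)
```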


We can also draw histograms 'by group'. In the example below, we plot the histograms for brain weight, dividing by gender.

In [None]:
df.hist(by = 'Gender (1 = M, 2 = F)', edgecolor='black', column = 'Brain weight (g)', bins = 8)

Now we'll extract the 'head size' and 'brain weight' data to get our x and y values for regression.

In [None]:
# We have to extract the data from the columns we're interested in.
# The regression method we'll be calling expects its inputs in a particular format:
# x must be two-dimensional (one row per data point, one column per variable) and y one-dimensional.
# This is why x is formatted as a column vector and y as a flat array.
# The reason for this format is that later, in place of single x values, we will be dealing with arrays of x values.

x = np.array(df[['Head size (cm^3)']].values)
y = np.array(df[['Brain weight (g)']].values.flatten())

print(x)
print(y)                 
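
The difference between the two shapes comes from how the column is selected. As a small sketch on a hypothetical frame (`toy`, with made-up column names and values), double brackets give a two-dimensional array and single brackets a one-dimensional one:

```python
import pandas as pd

# Hypothetical two-column frame standing in for the real data.
toy = pd.DataFrame({"size": [10, 20, 30], "weight": [1.0, 2.1, 2.9]})

# Double brackets select a one-column DataFrame, so .values is two-dimensional.
features = toy[["size"]].values

# Single brackets select a Series, so .values is one-dimensional.
target = toy["weight"].values

print(features.shape)
print(target.shape)
```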



Now we can calculate the parameters of the regression line $a + bx$ for $x$ and $y$. We will do this using the scikit-learn library for Python, one of the most widely used tools in modern data analysis.

In [None]:
from sklearn.linear_model import LinearRegression

# First we create a linear regression object.
model = LinearRegression()

# Then we fit this object to our data.
# Scikit-learn has a lot of tools for doing things like regression, and by design they all work this way.
model.fit(x,y)

# We can now extract the parameters of interest from the fitted model.
a = model.intercept_
b = model.coef_[0]  # coef_ is an array with one slope per input column; we have a single column

print('intercept:', a)
print('slope:', b)
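
To make the meaning of `intercept_` and `coef_` concrete, here is a minimal sketch on made-up points that lie exactly on the line $2 + 3x$. The names `x_toy`, `y_toy` and `toy_model` are illustrative and not part of the activity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2 + 3x.
x_toy = np.array([[0.0], [1.0], [2.0], [3.0]])  # one row per data point
y_toy = np.array([2.0, 5.0, 8.0, 11.0])

toy_model = LinearRegression().fit(x_toy, y_toy)

# intercept_ is a single number; coef_ is an array with one slope per input column.
intercept = toy_model.intercept_
slope = toy_model.coef_[0]
print(intercept, slope)

# predict gives the same values as evaluating a + b*x by hand.
manual = intercept + slope * x_toy.flatten()
print(np.allclose(manual, toy_model.predict(x_toy)))
```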

We can also use the fitted model to predict values of $y$ given values of $x$.

In [None]:
# We will predict the brain weight when the head sizes are 3500, 3700 and 4000.
# Note the input here is [[3500], [3700], [4000]]: a list of rows, one per head size, matching the format of x used for fitting.

predictions = model.predict([[3500], [3700], [4000]])
for p in predictions:
    print(p)

We can also calculate $r^2$ and $s_{yx}$, to measure the goodness of fit.

In [None]:
import math

# r^2 is easy as sklearn has a built-in method to find it
r2 = round(model.score(x, y),3)
print('coefficient of determination (to 3 decimal places):', r2)

# There's no built-in method for s_yx as far as I'm aware, but we can easily write our own.
# This function takes the x and y values as array-like objects, the parameters of the best fit line, and the decimal places. 
def findSE(x_values, y_values, a, b, dp=3):
    Sr = 0
    for i in range(0, len(x_values)):
        y = y_values[i]
        x = x_values[i]
        Sr += (y - a - b*x)**2
    return round(math.sqrt(Sr/(len(y_values) - 2)),dp)

# Here's another function that does the same thing using numpy's array arithmetic functionality internally.
def findSE2(x_values, y_values, a, b, dp=3):
    x = np.array(x_values)
    y = np.array(y_values)
    Sr = sum((y - a - b*x)**2)
    return round(math.sqrt(Sr/(len(y_values) - 2)),dp)


print('standard error of estimate (to 3 decimal places): {}'.format(findSE(x.flatten(),y,a,b)))
print('alternative standard error of estimate calculation: {}'.format(findSE2(x.flatten(),y,a,b)))
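
Both measures come from the same two sums of squares. The sketch below computes $r^2$ as one minus the ratio of the residual to the total sum of squares, which is how scikit-learn documents `score` for regressors, alongside $s_{yx}$. The helper `goodness_of_fit` and the data are hypothetical, and the line is taken from `np.polyfit` rather than from the model above.

```python
import numpy as np

def goodness_of_fit(x_values, y_values, a, b):
    """Return (r2, s_yx) for the line a + b*x. Hypothetical helper for illustration."""
    x = np.asarray(x_values, dtype=float)
    y = np.asarray(y_values, dtype=float)
    residuals = y - (a + b * x)
    ss_res = np.sum(residuals ** 2)        # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation about the mean of y
    r2 = 1 - ss_res / ss_tot
    s_yx = np.sqrt(ss_res / (len(y) - 2))
    return r2, s_yx

# Made-up data; np.polyfit with degree 1 returns the slope first, then the intercept.
x_toy = [1, 2, 3, 4, 5]
y_toy = [2, 4, 5, 4, 5]
b_toy, a_toy = np.polyfit(x_toy, y_toy, 1)

r2_toy, s_toy = goodness_of_fit(x_toy, y_toy, a_toy, b_toy)
print(round(r2_toy, 3), round(s_toy, 3))
```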

We can plot the regression line.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots() 
ax.scatter(x, y, c = 'orange')
x_line = np.linspace(2700,4800,100)

# To draw the regression line we will reformat the x_line so that it is appropriate for the 'predict' function
y_line = model.predict(x_line[:,np.newaxis])
plt.plot(x_line, y_line)
ax.set_ylabel('Brain weight')
ax.set_xlabel('Head size')
plt.show()

Our data is divided into male/female and into two categories based on age. Sometimes a regression on the combined data shows a trend that disappears, or is even reversed, when we break the data down into subcategories. You can search for 'Simpson's paradox' to get more information about this. Here we perform our regression analysis separately for men and women.
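
As a minimal illustration of such a reversal, the sketch below uses two made-up groups in which $y$ falls as $x$ rises, while the pooled data gives a rising line. The numbers are invented for the example and have nothing to do with the brain-weight data.

```python
import numpy as np

# Made-up data: within each group, y falls as x rises.
group_a_x, group_a_y = np.array([1, 2, 3]), np.array([3, 2, 1])
group_b_x, group_b_y = np.array([6, 7, 8]), np.array([8, 7, 6])

# np.polyfit with degree 1 returns (slope, intercept); we keep the slope.
slope_a = np.polyfit(group_a_x, group_a_y, 1)[0]
slope_b = np.polyfit(group_b_x, group_b_y, 1)[0]

# Pooling the two groups produces a line that rises instead.
pooled_x = np.concatenate([group_a_x, group_b_x])
pooled_y = np.concatenate([group_a_y, group_b_y])
slope_pooled = np.polyfit(pooled_x, pooled_y, 1)[0]

print(slope_a, slope_b, slope_pooled)
```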


In [None]:

temp_x = np.array(df[['Gender (1 = M, 2 = F)','Head size (cm^3)']].values) 
M_x = []
F_x = []
for i in temp_x:
    if i[0] == 1:
        M_x.append([i[1]])
    else:
        F_x.append([i[1]])
        
temp_y = np.array(df[['Gender (1 = M, 2 = F)','Brain weight (g)']].values) 
M_y = []
F_y = []
for i in temp_y:
    if i[0] == 1:
        M_y.append(i[1])
    else:
        F_y.append(i[1])
#print(M_x)
#print(F_x)
#print(M_y)
#print(F_y)
model_M = LinearRegression()
model_M.fit(M_x,M_y)

a_M = model_M.intercept_
b_M = model_M.coef_[0]

print('intercept for men:', a_M)
print('slope for men:', b_M)

r2_M = round(model_M.score(M_x, M_y),3)
print('coefficient of determination for men:', r2_M)
print('')

model_F = LinearRegression()
model_F.fit(F_x,F_y)

a_F = model_F.intercept_
b_F = model_F.coef_[0]

print('intercept for women:', a_F)
print('slope for women:', b_F)

r2_F = round(model_F.score(F_x, F_y),3)
print('coefficient of determination for women:', r2_F)

Notice how the $r^2$ values for men and women are both smaller than the $r^2$ value calculated for the combined data. 
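
The loop-based split used above can also be written with boolean masks, which some readers may find shorter. This is a sketch on a small hand-typed frame with shortened, hypothetical column names (`gender`, `head`, `brain`), not the real column labels.

```python
import pandas as pd

# Hypothetical frame with shortened column names, for illustration only.
toy = pd.DataFrame({
    "gender": [1, 2, 1, 2, 1],
    "head": [4512, 3214, 3738, 3394, 4261],
    "brain": [1530, 1110, 1297, 1215, 1335],
})

# Boolean Series with one entry per row: True where gender is coded 1.
is_male = toy["gender"] == 1

male_x = toy.loc[is_male, ["head"]].values     # two-dimensional, ready for fit
male_y = toy.loc[is_male, "brain"].values      # one-dimensional
female_x = toy.loc[~is_male, ["head"]].values
female_y = toy.loc[~is_male, "brain"].values

print(male_x.shape, male_y.shape, female_x.shape, female_y.shape)
```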

Now we draw graphs for men and women separately.

In [None]:
fig_M, ax_M = plt.subplots() 
ax_M.scatter(M_x, M_y, c = 'orange')
x_line_m = np.linspace(2700,4800,100)
y_line_m = model_M.predict(x_line_m[:,np.newaxis])
plt.plot(x_line_m, y_line_m)
ax_M.set_ylabel('Brain weight')
ax_M.set_xlabel('Head size')
ax_M.set_title('Men only')
plt.show()


fig_F, ax_F = plt.subplots() 
ax_F.scatter(F_x, F_y, c = 'orange')
x_line_w = np.linspace(2700,4800,100)
y_line_w = model_F.predict(x_line_w[:,np.newaxis])
plt.plot(x_line_w, y_line_w)
ax_F.set_ylabel('Brain weight')
ax_F.set_xlabel('Head size')
ax_F.set_title('Women only')
plt.show()

### Simple linear regression from scratch
Now you can code your own function for performing simple linear regression. In other words, you will not rely on scikit-learn.

You can try to write a function that calculates the coefficients $a$, $b$ for the regression line $a + bx$. You need to fill in the details here. The function should take an array of $x$ values and an array of $y$ values, and return the coefficients $a$, $b$ of the regression line. If you do this correctly, this cell should produce the same regression-line plot for the full dataset as the one drawn earlier with scikit-learn. You only need to guarantee correct behaviour for correctly formatted data.
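
For reference, one standard way to write the least-squares coefficients is shown below, where $\bar{x}$ and $\bar{y}$ are the means of the $x$ and $y$ values and $n$ is the number of data points. Your course notes may use a different but equivalent form.

$$b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$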

In [None]:
def LR(x_values, y_values):
    #TODO
    return 

x1 = np.array(df[['Head size (cm^3)']].values.flatten())
y1 = np.array(df[['Brain weight (g)']].values.flatten())

a1, b1 = LR(x1,y1)

fig1, ax1 = plt.subplots() 
ax1.scatter(x1, y1, c = 'orange')
x_line1 = np.linspace(2700,4800,100)
y_line1 = a1 + b1*x_line1
plt.plot(x_line1, y_line1)
ax1.set_ylabel('Brain weight')
ax1.set_xlabel('Head size')
plt.show()
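
One way to check a hand-written fit is to compare it with NumPy's degree-1 polynomial fit. The helper `check_fit` below is hypothetical and is shown on made-up points that lie exactly on $1 + 2x$.

```python
import numpy as np

def check_fit(a, b, x_values, y_values, tol=1e-8):
    """Hypothetical helper: compare (a, b) with NumPy's degree-1 polynomial fit."""
    b_ref, a_ref = np.polyfit(x_values, y_values, 1)  # slope first, then intercept
    return bool(np.isclose(a, a_ref, atol=tol) and np.isclose(b, b_ref, atol=tol))

# Made-up points lying exactly on y = 1 + 2x.
x_toy = np.array([0.0, 1.0, 2.0, 3.0])
y_toy = np.array([1.0, 3.0, 5.0, 7.0])

print(check_fit(1.0, 2.0, x_toy, y_toy))  # correct coefficients
print(check_fit(0.0, 2.0, x_toy, y_toy))  # wrong intercept
```

Once `LR` is filled in, calling `check_fit(a1, b1, x1, y1)` should give `True` if the coefficients agree with NumPy's to within the tolerance.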

### Bonus: Histograms with matplotlib
We can also use matplotlib alone to create histograms. This is very similar to the process using pandas, but without needing to create a dataframe first. There are also several other visualization libraries for Python, e.g. ggplot and seaborn, some of which are built on matplotlib, but we won't use those here.

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots() 

data = [11,5,12,3,10,7,7,9,15,19,6,7,13,11,8,10,16,14,5,7]
bins = [0,5,10,15,20]
plt.hist(data, bins = bins, edgecolor='black')
plt.show()