# Lab 1 - Overfitting & Bias-Variance Tradeoff

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model, model_selection
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score

Run the code block below to generate a toy dataset that we will use to better understand the importance of training data size and what happens when we overfit our data.

In [None]:
# Generate a noisy dataset
x = np.arange(0,10,0.1)
y = 3*x + 1 + 5*np.random.randn(x.shape[0])

## Does the Size of the Training Data Set Matter?

**Coding Work:**     
(1) Use the `train_test_split` function to randomly select 2% of the data to use for training the model.        
(2) Fit a linear model to the training data. On one plot, plot the training data set and the model predictions on the training set. On a second plot, plot the test set and the model predictions on the test set.     
(3) Print out the mean square error of the predictions on the training set and on the test set.   

**Discuss the following questions with your lab group:**   
How does the mean square error compare between the training and test sets?    
How does this relationship change if instead of using 2% of the data for training, you use 5% of the data? What about if you use 20% of the data? How about 70% of the data?
What does this tell you about the importance of training data set size and representativeness?
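
One way to organize the comparison across training set sizes is sketched below. This is a minimal illustration, not the expected lab solution: the helper `split_fit_score`, the names `x_demo` and `y_demo`, and the fixed seeds are assumptions introduced here so that the block runs on its own. It regenerates data with the same recipe as the toy dataset, fits a line on each training fraction, and prints both errors (plots are omitted).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def split_fit_score(x, y, train_fraction, seed=0):
    """Hypothetical helper: fit a line on a fraction of the data.

    Returns (training MSE, test MSE).
    """
    # scikit-learn expects a 2-D feature array, hence the reshape
    X = x.reshape(-1, 1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_fraction, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train))
    mse_test = mean_squared_error(y_test, model.predict(X_test))
    return mse_train, mse_test


# Same recipe as the toy dataset, with a fixed seed (an assumption) for repeatability
rng = np.random.default_rng(0)
x_demo = np.arange(0, 10, 0.1)
y_demo = 3 * x_demo + 1 + 5 * rng.standard_normal(x_demo.shape[0])

results = {}
for frac in (0.02, 0.05, 0.20, 0.70):
    results[frac] = split_fit_score(x_demo, y_demo, frac)
    mse_train, mse_test = results[frac]
    print(f"train fraction {frac:.2f}: train MSE = {mse_train:.2f}, test MSE = {mse_test:.2f}")
```

Note that with 2% of 100 points the model is trained on only two samples. A straight line passes through any two points with distinct `x` values exactly, so the training error in that case says nothing about how well the line generalizes.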

## Model Complexity & Overfitting    

**Coding Work:**        
(1) Use the `train_test_split` function to randomly select 10% of the data for model training. Make sure to set the random state to some integer for reproducibility.     
(2) Use the code shell below to fit a second order polynomial model to the training data and plot the results as in Part 1.      
(3) Print out the mean square error of the predictions on the training set and on the test set.   

**Discuss the following questions with your lab group:**   
Try fitting polynomials of degree 2, 4, 6, 8, and 10 to the training data.        
How does the relationship between the mean square error on the training data set and the mean square error on the test data set change as you increase the polynomial degree?       
Do you think that any of these polynomial models are overfit? Which ones and why?  
How would your answer change if you trained the model on 70% of the data instead of 10%?
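
To compare several degrees side by side without rerunning the cell by hand, a loop along the following lines may help. It is a sketch under stated assumptions rather than the intended solution: the names `x_demo`, `y_demo`, and `mse_by_degree`, the seed, and `random_state=42` are all chosen here for illustration. The data are regenerated with the same recipe as the toy dataset so that the block is self-contained.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Same recipe as the toy dataset, with a fixed seed (an assumption) for repeatability
rng = np.random.default_rng(1)
x_demo = np.arange(0, 10, 0.1).reshape(-1, 1)   # 2-D, as scikit-learn expects
y_demo = 3 * x_demo[:, 0] + 1 + 5 * rng.standard_normal(x_demo.shape[0])

# 10% of the data for training; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    x_demo, y_demo, train_size=0.10, random_state=42
)

mse_by_degree = {}
for degree in (2, 4, 6, 8, 10):
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    P_train = poly.fit_transform(X_train)   # set up the expansion on the training set
    P_test = poly.transform(X_test)         # apply the same expansion to the test set
    model = LinearRegression().fit(P_train, y_train)
    mse_by_degree[degree] = (
        mean_squared_error(y_train, model.predict(P_train)),
        mean_squared_error(y_test, model.predict(P_test)),
    )
    mse_train, mse_test = mse_by_degree[degree]
    print(f"degree {degree:2d}: train MSE = {mse_train:10.2f}, test MSE = {mse_test:10.2f}")
```

Changing `train_size` to `0.70` in the split reruns the same comparison for the last discussion question.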

In [None]:
# Enter your code for the test-train split here


poly_degree = 2   # update the degree of the polynomial to fit

# Generate the polynomial features for the train and test sets
poly = PolynomialFeatures(degree=poly_degree, include_bias=False)
poly_train = poly.fit_transform(x_train)   # set up the feature expansion on the training set
poly_test = poly.transform(x_test)         # apply the same expansion to the test set

# Train the linear regression model with the polynomial features
poly_model = linear_model.LinearRegression()
poly_model.fit(poly_train, y_train)
y_pred_poly = poly_model.predict(poly_test)
y_pred_poly_train = poly_model.predict(poly_train)

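# Indices that sort the test inputs, so the fitted curve is drawn left to right
# (named sort_idx to avoid shadowing Python's built-in sorted())
sort_idx = np.argsort(x_test, axis=0)

text_kwargs = dict(ha='left', va='center', fontsize=12)

plt.close("all")
fig, (ax1, ax2) = plt.subplots(1,2, layout="constrained")
# Left panel: training points, with the fitted curve evaluated at the test inputs
ax1.scatter(x_train, y_train, s=10)
ax1.plot(x_test[sort_idx[:,0]], y_pred_poly[sort_idx[:,0]], color="orange")
ax1.text(0, -14, "MSE: %.2f" % mean_squared_error(y_train, y_pred_poly_train), **text_kwargs)
# Right panel: test points, with the same fitted curve
ax2.scatter(x_test, y_test, s=10)
ax2.plot(x_test[sort_idx[:,0]], y_pred_poly[sort_idx[:,0]], color="orange")
ax2.text(0, -14, "MSE: %.2f" % mean_squared_error(y_test, y_pred_poly), **text_kwargs)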
ax1.set_xlabel("X")
ax1.set_ylabel("Y")
ax1.set_title("Training Data")
ax2.set_xlabel("X")
ax2.set_ylabel("Y")
ax2.set_title("Test Data")
fig.set_size_inches(10,4, forward=True)
plt.show()

## Bias-Variance Tradeoff

Bias is error that comes from erroneous assumptions in our machine learning algorithm - for example, trying to fit data generated from a cubic polynomial with a simple linear model.     
Variance is error from sensitivity to small fluctuations in the training set. High variance could be the result of a training data set that is too small, or of overfitting, where the algorithm is modeling random noise in the training data set.        
In general, there is a trade-off between bias and variance. More complex models will have lower bias, since they can capture more features of the relationship between inputs and outputs. However, they can also have higher variance due to overfitting. An optimal model tries to jointly minimize bias and variance. In the code below, we will explore this bias-variance tradeoff.   
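
For squared error, the expected prediction error at a point splits into squared bias, variance, and irreducible noise. When the noise-free function is known, the first two terms can be estimated directly by refitting a model on many freshly simulated training sets. The sketch below does this for an assumed cubic target. The function `true_f`, the helper `bias_variance`, and all parameter values are hypothetical choices made for illustration, and this is a different estimate from the one in the lab script further down, which uses training-set mean square error as a stand-in for bias.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_f(x):
    # Assumed noise-free target for this illustration: a cubic
    return 0.01 * x**3


def bias_variance(degree, n_repeats=200, n_train=30, noise_sd=1.0, seed=0):
    """Hypothetical helper: Monte Carlo estimate of (bias^2, variance).

    Both quantities are averaged over a fixed grid of evaluation points.
    """
    rng = np.random.default_rng(seed)
    x_grid = np.linspace(-10, 10, 50).reshape(-1, 1)
    preds = np.empty((n_repeats, x_grid.shape[0]))
    for r in range(n_repeats):
        # Draw a fresh noisy training set from the assumed target
        x_tr = rng.uniform(-10, 10, size=(n_train, 1))
        y_tr = true_f(x_tr[:, 0]) + noise_sd * rng.standard_normal(n_train)
        model = make_pipeline(
            PolynomialFeatures(degree, include_bias=False), LinearRegression()
        )
        model.fit(x_tr, y_tr)
        preds[r] = model.predict(x_grid)
    mean_pred = preds.mean(axis=0)
    # Bias^2: squared gap between the average prediction and the true function
    bias_sq = np.mean((mean_pred - true_f(x_grid[:, 0])) ** 2)
    # Variance: spread of the predictions around their own average
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance


bv = {degree: bias_variance(degree) for degree in (1, 3, 9)}
for degree, (bias_sq, variance) in bv.items():
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```

Because the same seed is used for every degree, each model sees the same sequence of simulated training sets, so differences in the printed values come from the model complexity rather than from the random draws.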

Run the first code block to generate a new toy dataset.

In [None]:
# Generate a noisy dataset
x = np.arange(-10,10,0.2)
y = 0.01*(x**3 + 1 + 280*np.random.randn(x.shape[0]))

The code below compares bias and variance between a simple linear model and a polynomial model of order `poly_degree`. The script runs 20 iterations; in each one it randomly selects a training set containing a fraction `1 - sample_size` of the data and fits the two models. The "Bias" plot shows a histogram of the mean square error of the models on the 20 training sets. A low value means that the model is effectively capturing most of the variation present in the training data set. The "Variance" plot shows the 20 linear and 20 polynomial model predictions over their training sets. A high vertical spread across the 20 models indicates high variance: the model coefficients change a lot depending on the data used to train the model.     

**Run the script with `sample_size = 0.8` and for `poly_degree` equal to 3, 5, 7, 9, and 11. Then discuss the following questions with your lab group:**     
How does the bias of the higher order polynomial model change as you increase the polynomial degree?         
How does the variance of the polynomial models change as you increase the polynomial degree?        
How does the higher order polynomial bias and variance compare to the linear model?            
What seems like a reasonable polynomial degree for modeling this data set? Why?

In [None]:
# Bias-Variance Tradeoff

poly_degree = 11      # polynomial model degree
sample_size = 0.8     # size of the test set (training set = 1 - sample_size)

Niter = 20            # Number of iterations to run
# Variables to save the results from each iteration's model fitting
# Training-set size; train_test_split rounds the test-set size up, so mirror that here
length = x.shape[0] - int(np.ceil(sample_size * x.shape[0]))
x_lin = np.empty((length,Niter))
y_pred_lin = np.empty((length,Niter))
bias_lin = np.empty(Niter)
x_poly = np.empty((length,Niter))
y_pred_poly = np.empty((length,Niter))
bias_poly = np.empty(Niter)

# Fit linear and polynomial models on 20 randomly selected training data sets
for k in range(0,Niter):
  x_train, x_test, y_train, y_test = model_selection.train_test_split(x.reshape(-1,1), y.reshape(-1,1), test_size=sample_size)
  # Linear Regression
  regr = linear_model.LinearRegression()
  regr.fit(x_train, y_train)
  x_lin[:,k] = x_train[:,0]
  tmp = regr.predict(x_train)
  y_pred_lin[:,k] = tmp[:,0]
  bias_lin[k] = mean_squared_error(y_train, tmp)
  # Polynomial Regression
  poly = PolynomialFeatures(degree=poly_degree, include_bias=False)
  poly_train = poly.fit_transform(x_train)
  poly_model = linear_model.LinearRegression()
  poly_model.fit(poly_train, y_train)
  sort_idx = np.argsort(x_train, axis=0)   # avoid shadowing the built-in sorted()
  tmp = poly_model.predict(poly_train)
  y_pred_poly[:,k] = tmp[sort_idx[:,0]][:,0]
  x_poly[:,k] = x_train[sort_idx[:,0]][:,0]
  bias_poly[k] = mean_squared_error(y_train, tmp)

# Plot the results
plt.close("all")
fig, (ax1, ax2) = plt.subplots(1,2, layout='constrained')

ax1.hist(bias_lin, facecolor='blue',alpha=0.5,edgecolor='black', density=True, label="Linear Model")
ax1.hist(bias_poly, facecolor='orange',alpha=0.5,edgecolor='black', density=True, label="Higher Order Polynomial")
ax1.set_xlabel("Mean Square Error")
ax1.set_ylabel("Probability Density")
ax1.set_title("Bias")
ax1.legend()

ax2.scatter(x,y,s=5, color='black')
for k in range(0,Niter):
  ax2.plot(x_lin[:,k], y_pred_lin[:,k], color='blue', alpha=0.5)
for k in range(0,Niter):
  ax2.plot(x_poly[:,k], y_pred_poly[:,k], color='orange', alpha=0.5)
ax2.set_xlabel("X")
ax2.set_ylabel("Y")
ax2.set_title("Variance")

fig.set_size_inches(10,4, forward=True)
plt.show()