# **DSFM Project**: Bias-Variance Tradeoff

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

This project explores how __Bias__ and __Variance__ change as a linear regression is expanded with additional polynomial terms. 

First we define a __Data Generating Process__ (DGP) as a sigmoid function; then we draw repeated samples from the DGP to allow for random variation; and finally we measure the __Bias__ and __Variance__ of the model at a given value of X. At the end of the project we can compare how the model performs at different degrees of the polynomial expansion.  
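
For reference, the two quantities can be written down at a single point $x_0$. The notation is introduced here and is not part of the original notebook: $f$ denotes the noise-free DGP and $\hat{f}$ a model fitted on one random sample, with expectations taken over repeated samples.

$$\text{Bias}(x_0) = E[\hat{f}(x_0)] - f(x_0), \qquad \text{Var}(x_0) = E\big[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2\big]$$

The expected squared error of the prediction relative to the truth then splits into the two parts:

$$E\big[(\hat{f}(x_0) - f(x_0))^2\big] = \text{Bias}(x_0)^2 + \text{Var}(x_0)$$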

-------------

## **Part 0**: Setup

### Import Packages

In [None]:
import matplotlib 
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np
from IPython.display import clear_output
import math 
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

%matplotlib inline

### Define Constants

In [None]:
X0         = 4       # value of the arbitrary point where bias will be measured
NOISE      = 0.2     # half-width of the uniform noise added to the DGP
FONTSIZE   = 18
OFFSET     = 0.1    # plot offset on x and y axes
DRAWS      = 100    # number of draws

### Helper Functions

We will help you get started by defining some "helper functions" for you to use. There is nothing for you to do in this section, and you don't need to worry about the internals of each function; you only need to understand at a high level what each function returns or accomplishes.

__Data Generating Function__:  
We define the true data generating function (DGP) for this example to be a logistic sigmoid function. The sigmoid function *without* any noise is the true function that we aim to recover. A sigmoid function is defined as:  $sigmoid(x) = \frac{1}{1+e^{-x}}$

In [None]:
def sigmoid(x):
    y = 1 / (1 + math.exp(-x))
    return y

__Custom Plotting Function__:  
The following function will plot the data and the given model predictions. This custom function simply "wraps" a basic matplotlib plotting function, but it does so in a standardized way that fits the needs of this project.

In [None]:
def myPlot(X, y, y_DGP, y_pred, draw, symbol = 'o'):
    
    font = {'size'   : FONTSIZE}
    matplotlib.rc('font', **font)
    
    plt.figure(figsize=(16, 12))
    plt.plot(X, y, symbol, markersize=12, linewidth=3, label='Sampled data')
    plt.plot(X, y_DGP, markersize=12, linewidth=3, label='Ground truth')
    plt.ylim(0 - OFFSET, 1 + OFFSET)
    plt.xlim(min(X) - OFFSET, max(X) + OFFSET)
    plt.hlines(0, xmin = min(X), xmax = max(X), colors='black', linewidth=3)
    plt.hlines(1, xmin = min(X), xmax = max(X), colors='black', linewidth=3)
    
    # isinstance also accepts NumPy scalars such as the np.float64 returned by np.mean
    if isinstance(y_pred, (float, np.floating)):
        plt.hlines(y_pred, xmin = min(X), xmax = max(X), colors='red', label='Latest estimate', linewidth=3)
    elif isinstance(y_pred, np.ndarray):
        plt.plot(X, y_pred, '-', markersize = 12, linewidth=3, color = 'red', label='Latest estimate')
        
    plt.ylabel('y')
    plt.xlabel('X')
    plt.title('Draw: {}'.format(draw))
    plt.legend()
    
    return plt

__Test the two functions above to see them in action__...

In [None]:
X = list(range(-50, 50))
X = [i/10 for i in X if i%2 == 0]
y_DGP = [sigmoid(i) for i in X]
y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]
myPlot(X, y, y_DGP, None, 0)
plt.show()
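
The sampling step above is repeated in every part of the notebook. As a hedged sketch, it could be wrapped in a small helper. The names `sigmoid_np`, `draw_sample`, `X_grid`, and the `rng` argument are hypothetical and not part of the original notebook; the sketch also swaps the list comprehension over `math.exp` for NumPy's vectorized `np.exp`.

```python
import numpy as np

NOISE = 0.2  # same value as the notebook constant


def sigmoid_np(x):
    """Vectorized logistic sigmoid (hypothetical NumPy variant of sigmoid)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))


def draw_sample(x, noise=NOISE, rng=None):
    """Return (y_DGP, y): the noise-free truth and one noisy draw from it."""
    rng = np.random.default_rng() if rng is None else rng
    y_dgp = sigmoid_np(x)
    y = y_dgp + rng.uniform(-noise, noise, size=y_dgp.shape)
    return y_dgp, y


# Same grid as X above: -5.0, -4.8, ..., 4.8
X_grid = np.arange(-50, 50, 2) / 10
y_dgp, y_noisy = draw_sample(X_grid, rng=np.random.default_rng(0))
```

Passing a seeded generator makes a draw reproducible; omitting `rng` gives a fresh random sample on every call, which is what the resampling exercises need.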

## **Part 1**: Model 1 - The Mean

The simplest model of all predicts the mean of the outcome variable. This model has high bias but low variance. 

We will now resample from the DGP and measure bias and variance at an arbitrary point, X0, to demonstrate this.

In [None]:
# Initialize
estimates = []
draw = 0

__Q 1: Draw data, estimate model, plot the result__:  
Repeatedly execute the following block of code to draw a new sample, estimate the model, and inspect the variance and the bias at an arbitrary point X0.

In [None]:
# Draw sample 
draw += 1
y_DGP = [sigmoid(i) for i in X]
y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]

# CODE HERE
# Estimate: compute the mean of y
y_pred = None

# Assert OK to proceed 
assert y_pred is not None, 'HINT: you need to complete the code to proceed.'

estimates.append(y_pred)

# Plot
p = myPlot(X, y, y_DGP, y_pred, draw)   # Plot data
for est in estimates[:-1]:      # Plot estimates 
    p.hlines(est, xmin = min(X), xmax = max(X), colors='gray', linestyle='dashed', linewidth=2)
p.show()

__Q 2: Summarize Bias and Variance at X0__

In [None]:
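# Redraw from the DGP, re-estimate the mean, and compute the difference to the truth at X0
biases = []
for _ in range(DRAWS):   # '_' avoids overwriting the draw counter used in Q 1
    y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]
    y_pred = float(np.mean(y))
    biases.append(y_pred - sigmoid(X0))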

In [None]:
print('Mean Bias at point X0: {}'.format(round(abs(np.mean(biases)), 4)))
print('Variance at point X0: {}'.format(round(np.var(biases), 4)))

## **Part 2**: Model 2 - A Line

The next most complicated model predicts a linear relationship between X and the outcome variable Y. The model still has moderate bias, though less than the mean model, and somewhat higher variance. 

We will now resample from the DGP and measure bias and variance at an arbitrary point, X0, to demonstrate this.

In [None]:
# Initialize
estimates = []
draw = 0

__Q 1: Draw data, estimate model, plot the result__:  
Repeatedly execute the following block of code to draw a new sample, estimate the model, and inspect the variance and the bias at an arbitrary point X0.

In [None]:
# Draw sample 
draw += 1
y_DGP = [sigmoid(i) for i in X]
y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]

# CODE HERE 
# Estimate: fit a LinearRegression() model
model = None

# Assert OK to proceed 
assert model is not None, 'HINT: you need to complete the code to proceed.'

y_pred = model.predict(np.array(X).reshape(-1, 1))
estimates.append(y_pred)

# Plot
p = myPlot(X, y, y_DGP, y_pred, draw)      # Plot data
for est in estimates[:-1]:                 # Plot estimates 
    p.plot(X, est, linewidth=2, color = 'gray', linestyle='dashed')
p.show()

__Q 2: Summarize Bias and Variance at X0__

In [None]:
# Draw from the DGP and compute the difference to the prediction
biases = []
for _ in range(DRAWS):   # '_' avoids overwriting the draw counter used in Q 1
    y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]
    model = LinearRegression().fit(np.array(X).reshape(-1, 1), y)
    y_pred = model.predict(np.array(X).reshape(-1, 1))
    biases.append(y_pred[X.index(X0)] - sigmoid(X0))


In [None]:
print('Mean Bias at point X0: {}'.format(round(abs(np.mean(biases)), 4)))
print('Variance at point X0: {}'.format(round(np.var(biases), 4)))

## **Part 3**: Model 3 - A Higher Order Polynomial Regression

We can expand the predictor space for X with higher-order polynomial terms ($X^2$, $X^3$, etc.) to allow for non-linearities in the relationship between X and Y. These models will bring down bias, but increase variance as each model chases the randomness of sampling variation. 

We will now resample from the DGP and measure bias and variance at an arbitrary point, X0, to demonstrate this.
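
To make the idea of a polynomial expansion concrete, here is a small sketch of what `PolynomialFeatures` produces for a single input column. The input values and the names `x_demo`, `expansion`, and `x_expanded` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two illustrative observations of a single feature X
x_demo = np.array([[2.0], [3.0]])

expansion = PolynomialFeatures(degree=3, include_bias=True)
x_expanded = expansion.fit_transform(x_demo)

# Each row holds [1, x, x^2, x^3]
print(x_expanded)
```

The first row becomes `[1, 2, 4, 8]` and the second `[1, 3, 9, 27]`. A linear regression on these columns is still linear in its coefficients, but the fitted curve is a cubic in X.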

In [None]:
# Initialize
estimates = []
draw = 0

degree = 4   # Degree of the polynomial basis expansion (try 20)

__Q 1: Draw data, estimate model, plot the result__:  
Repeatedly execute the following block of code to draw a new sample, estimate the model, and inspect the variance and the bias at an arbitrary point X0.

In [None]:
# Draw sample 
draw += 1
y_DGP = [sigmoid(i) for i in X]
y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]

# Generate a polynomial "basis expansion" -- new feature variables to add into the regression 
polynomial = PolynomialFeatures(degree=degree, include_bias=True)
X_ = polynomial.fit_transform(np.array(X).reshape(-1, 1))

# Estimate 
model = LinearRegression().fit(X_, y)

# CODE HERE
# Use the fitted model to predict values for X_
y_pred = None

# Assert OK to proceed 
assert y_pred is not None, 'HINT: you need to complete the code to proceed.'

# Smooth estimation line so we can see the prediction better
X_smooth = list(range(-500, 491, 1))
X_smooth = [i/100 for i in X_smooth]
y_polynomial = [sigmoid(i) for i in X_smooth]
X_polynomial = polynomial.transform(np.array(X_smooth).reshape(-1, 1))  # reuse the expansion fitted on the sampled X
y_polynomial_pred = model.predict(X_polynomial)
estimates.append(y_polynomial_pred)

# Plot
p = myPlot(X, y, y_DGP, y_pred, draw)    # Plot data
for est in estimates[:-1]:       # Plot estimates 
    p.plot(X_smooth, est, linewidth=2, color = 'gray', linestyle='dashed')
p.show()

__Q 2: Summarize Bias and Variance at X0__

In [None]:
# Draw from the DGP and compute the difference to the prediction
biases = []
for _ in range(DRAWS):   # '_' avoids overwriting the draw counter used in Q 1
    y = [i + np.random.uniform(-NOISE, +NOISE) for i in y_DGP]
    model = LinearRegression().fit(X_, y)
    y_pred = model.predict(np.array(X_))
    biases.append(y_pred[X.index(X0)] - sigmoid(X0))


In [None]:
print('Mean Bias at point X0: {}'.format(round(abs(np.mean(biases)), 4)))
print('Variance at point X0: {}'.format(round(np.var(biases), 4)))
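
To compare models side by side, the resampling loop can be repeated for several polynomial degrees. The following is a minimal, self-contained sketch rather than the notebook's own solution: the helper name `bias_variance_at_x0`, the seeded generator, and the list of degrees are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X0, NOISE, DRAWS = 4, 0.2, 100        # same values as the notebook constants
rng = np.random.default_rng(0)        # seeded for reproducibility (an assumption)

x = np.arange(-50, 50, 2) / 10        # same grid as X
truth = 1 / (1 + np.exp(-x))          # noise-free sigmoid
i0 = int(np.argmin(np.abs(x - X0)))   # index of the grid point at X0


def bias_variance_at_x0(degree):
    """Refit on DRAWS fresh samples; return (|mean bias|, variance) at X0."""
    x_poly = PolynomialFeatures(degree=degree).fit_transform(x.reshape(-1, 1))
    errors = []
    for _ in range(DRAWS):
        y = truth + rng.uniform(-NOISE, NOISE, size=x.shape)
        pred = LinearRegression().fit(x_poly, y).predict(x_poly)
        errors.append(pred[i0] - truth[i0])
    return abs(np.mean(errors)), np.var(errors)


# Degrees chosen for illustration only
results = {d: bias_variance_at_x0(d) for d in (1, 2, 4, 8)}
for d, (bias, var) in results.items():
    print('degree {}: bias {:.4f}, variance {:.4f}'.format(d, bias, var))
```

Because the errors differ from the predictions only by the constant `truth[i0]`, their variance equals the variance of the predictions themselves.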

## **Part 4, Bonus**: Bias and variance across the domain of X

We measured bias and variance at a single point, defined by the constant X0. To get a complete and reliable assessment of bias and variance for the model _overall_, however, one would have to average them across the domain of X. 

Can you copy and adapt the code above to loop the tested point X0 across the domain of X? You might want to step across the range in steps of 0.1; you will need to average the per-point averages you already compute; and for greater reliability, you should use 1000 draws at each point.
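
One possible approach is sketched below; it is not the only way to solve the bonus and is not taken from the original notebook. Instead of looping over X0, it stores the predictions for every grid point and every draw, then summarizes across draws. It relies on `LinearRegression` accepting a two-dimensional target, in which case each column is fitted as a separate regression. The degree, the seed, and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

NOISE, N_DRAWS, DEGREE = 0.2, 1000, 4   # 1000 draws as suggested; degree is illustrative
rng = np.random.default_rng(0)          # seeded for reproducibility (an assumption)

x = np.arange(-50, 50) / 10             # domain of X in steps of 0.1
truth = 1 / (1 + np.exp(-x))            # noise-free sigmoid
x_poly = PolynomialFeatures(degree=DEGREE).fit_transform(x.reshape(-1, 1))

# One column per draw: each target column is fitted as its own regression
Y = truth[:, None] + rng.uniform(-NOISE, NOISE, size=(x.size, N_DRAWS))
preds = LinearRegression().fit(x_poly, Y).predict(x_poly)   # shape (n_points, N_DRAWS)

bias_by_x = preds.mean(axis=1) - truth  # bias at every grid point
var_by_x = preds.var(axis=1)            # variance at every grid point

print('Mean absolute bias: {:.4f}'.format(np.mean(np.abs(bias_by_x))))
print('Mean variance:      {:.4f}'.format(np.mean(var_by_x)))
```

Averaging the absolute bias avoids positive and negative biases at different points cancelling each other out; averaging the squared bias would be an equally reasonable choice.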