# The Bias / Variance Problem
When constructing machine learning models, one of the problems we face is called the bias/variance problem.  In essence, these two properties of the data and model are at odds, and the key is to balance them.  This notebook steps through the bias/variance problem.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import pandas as pd
import random

from sklearn.linear_model import LinearRegression
from matplotlib import rcParams

In [None]:
# Parameter to draw pretty pictures
rcParams['figure.figsize'] = 12, 10

### Accuracy and Precision
You have likely run into the terms accuracy and precision before.  They are often presented as an infographic with bullseyes on it.

__Accuracy__:  how close can you get to the true answer

__Precision__:  how repeatable are your attempts
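
To put numbers on these two ideas, here is a minimal sketch. It assumes the target sits at the origin, measures accuracy as the distance from the average shot to the target, and measures precision as the root-mean-square distance of the shots from their own average. The helper name `summarize_shots` and both measures are illustrative choices, not standard definitions.

```python
import numpy as np

def summarize_shots(shots, target=(0.0, 0.0)):
    """Return (offset, spread) for an (n, 2) array of shots.

    offset: distance from the mean shot to the target (smaller = more accurate)
    spread: RMS distance of shots from their own mean (smaller = more precise)
    """
    shots = np.asarray(shots, dtype=float)
    center = shots.mean(axis=0)
    offset = np.linalg.norm(center - np.asarray(target, dtype=float))
    spread = np.sqrt(((shots - center) ** 2).sum(axis=1).mean())
    return offset, spread

rng = np.random.default_rng(0)  # seeded for reproducibility

# Tight cluster that misses the target: precise but not accurate
tight_off = rng.normal(loc=6.0, scale=1.0, size=(500, 2))
# Wide cluster centered on the target: accurate on average but not precise
wide_on = rng.normal(loc=0.0, scale=6.0, size=(500, 2))

tight_offset, tight_spread = summarize_shots(tight_off)
wide_offset, wide_spread = summarize_shots(wide_on)

print(f"tight cluster: offset={tight_offset:.2f}, spread={tight_spread:.2f}")
print(f"wide cluster:  offset={wide_offset:.2f}, spread={wide_spread:.2f}")
```

The tight cluster should report the larger offset and the smaller spread, which is the "precise but not accurate" corner of the bullseye picture.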

<img src="./AccuracyPrecision.png">

It is worth our time to explore this a little bit interactively.

In [None]:
# First, some functions for drawing it.
def bullseye(ax,withData=None):
    a = 3       # inner white disc: radius 0 to a
    b = 5       # red ring: radius a to b
    c = 7       # white ring: radius b to c
    d = 9       # outer red ring: radius c to d

    circle4 = plt.Circle((0, 0), d, color='red')
    circle3 = plt.Circle((0, 0), c, color='white')
    circle2 = plt.Circle((0, 0), b, color='red')
    circle1 = plt.Circle((0, 0), a, color='white')

    ax.add_artist(circle4)
    ax.add_artist(circle3)
    ax.add_artist(circle2)
    ax.add_artist(circle1)

    if withData is not None:
        ax.scatter(withData[:,0], withData[:,1],zorder=4)


    ax.axis([-22, 22, -22, 22])
    
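def shotgun(nPts,bias,variance):
    # np.random.normal takes a standard deviation as its scale argument,
    # so `variance` here acts as the spread (sigma), not sigma squared.
    return np.random.normal(bias,variance,[nPts,2])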


In [None]:
numPts = 50
# For each pair of the blank areas ________,  insert a low value and a high value.
# Use the range of 1 to 10 for both
accuracy = [________,________]
precision = [________,________]

fig, ax = plt.subplots(2,2)
bullseye(ax[0, 0],shotgun(numPts,accuracy[0],precision[0]))
bullseye(ax[0, 1],shotgun(numPts,accuracy[0],precision[1]))
bullseye(ax[1, 0],shotgun(numPts,accuracy[1],precision[0]))
bullseye(ax[1, 1],shotgun(numPts,accuracy[1],precision[1]))

ax[0,0].set_title("High Precision")
ax[0,1].set_title("Low Precision")

ax[0,0].set_ylabel("High Accuracy")
ax[1,0].set_ylabel("Low Accuracy")

plt.show()

<font color='red'>
# There are two things to think about here:

1. How might you address low accuracy and low precision?
   
2. Think about the "low" number and "high" number that you have entered.  
   Does a "low" number correspond to a "low" accuracy or "low" precision?
  
  

### Bias and Variance

Bias and variance are the mathematical properties that drive accuracy and precision.  Each relationship is inverse:

$\uparrow \text{Bias} \sim \downarrow \text{Accuracy}$

$\uparrow \text{Variance} \sim \downarrow \text{Precision}$

Unfortunately, in machine learning, __Bias__ and __Variance__ are linked to each other: lowering one by changing the model's complexity typically raises the other, so there is a tradeoff between these two factors.
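
The tradeoff can be made concrete with a small simulation. The sketch below assumes a known signal, $\sin(x)$, and Gaussian noise with standard deviation 0.15. It refits polynomials of two different degrees on many fresh noisy samples and estimates the squared bias and the variance of the fitted values. The degrees, the trial count, and the helper name `bias_variance` are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

def bias_variance(degree, x, signal, noise_sd, n_trials, rng):
    """Estimate average squared bias and variance of polynomial fits."""
    preds = np.empty((n_trials, len(x)))
    for t in range(n_trials):
        y = signal + rng.normal(0.0, noise_sd, len(x))  # fresh noisy sample
        preds[t] = Polynomial.fit(x, y, degree)(x)      # refit, then predict
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - signal) ** 2)  # systematic miss
    variance = np.mean(preds.var(axis=0))         # sensitivity to the sample
    return bias_sq, variance

rng = np.random.default_rng(0)
x = np.linspace(np.pi / 3, 5 * np.pi / 3, 60)
signal = np.sin(x)

results = {}
for degree in (1, 9):
    results[degree] = bias_variance(degree, x, signal, 0.15, 200, rng)
    b2, var = results[degree]
    print(f"degree {degree}: bias^2={b2:.5f}, variance={var:.5f}")
```

In this setup the straight line should show the larger squared bias, and the degree-9 polynomial the larger variance.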

#### Linear regression

Under ideal conditions

$\bar{y} = m*x + b$

In the real world, the observations include noise:

$y = m*x + b + \epsilon$

where

$\epsilon \sim N(\mu,\sigma^2)$

The function $N$ is the normal distribution.  We don't need its details here, but $\mu$ and $\sigma$ are important:

$\mu$:  When you were adjusting values to change the accuracy, you were changing the bias.  $\mu$ represents the bias in the noise.

$\sigma$:  When you were adjusting values to change the precision, you were changing the variance.  $\sigma$ is the standard deviation of the noise, and $\sigma^2$ is its variance.
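
A quick numerical check of these two parameters, as a sketch with illustrative values for `mu` and `sigma`: draw many noise samples and compare the sample statistics with the parameters that generated them.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded for reproducibility
mu, sigma = 0.5, 2.0             # illustrative values, not from the notebook

# NumPy's `scale` argument is the standard deviation (sigma), not the variance
eps = rng.normal(loc=mu, scale=sigma, size=100_000)

sample_mean = eps.mean()  # should land close to mu
sample_std = eps.std()    # should land close to sigma
sample_var = eps.var()    # should land close to sigma**2

print(f"mean={sample_mean:.3f}, std={sample_std:.3f}, var={sample_var:.3f}")
```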

When we are building our linear regression *(under ideal conditions)*, we fit the model by minimizing this error metric, the mean squared error:

$mse = \frac{1}{N} \sum{( y_i - \bar{y}(x_i) )^2}$

The expanded form of this is:

$mse = \frac{1}{N} \sum{(y_i - (m*x_i + b))^2}$
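
The expanded formula translates directly into code. This is a minimal sketch with a hypothetical helper `line_mse` and three hand-picked points that lie exactly on $y = 2x + 1$.

```python
import numpy as np

def line_mse(x, y, m, b):
    """Mean squared error of the line y = m*x + b against the points (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (m * x + b)
    return np.mean(residuals ** 2)

x = [0, 1, 2]
y = [1, 3, 5]  # exactly on y = 2x + 1

perfect = line_mse(x, y, m=2, b=1)  # the line that generated the points
shifted = line_mse(x, y, m=2, b=0)  # same slope, intercept off by 1

print(perfect, shifted)
```

The generating line has zero error. For the shifted line every residual equals 1, so its mean squared error is 1.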

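Now we have noise.  Each observation is $y_i = m*x_i + b + \epsilon_i$, so even the true line leaves the noise behind as error:

$mse = \frac{1}{N} \sum{(y_i - (m*x_i + b))^2} = \frac{1}{N} \sum{\epsilon_i^2}$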
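
One way to see how the noise shows up in this error is to generate observations from a known line plus noise, then score the noise-free line against them. The sketch assumes illustrative values for `m`, `b`, `mu`, and `sigma`. The expected squared value of the noise is $\mu^2 + \sigma^2$, so both the bias and the variance of the noise contribute to the error.

```python
import numpy as np

rng = np.random.default_rng(7)  # seeded for reproducibility
m, b = 2.0, 1.0                 # illustrative "true" line
mu, sigma = 0.3, 0.5            # illustrative noise bias and standard deviation

x = rng.uniform(0.0, 10.0, size=100_000)
eps = rng.normal(loc=mu, scale=sigma, size=x.size)
y = m * x + b + eps             # noisy observations

# Score the noise-free line against the noisy observations
mse = np.mean((y - (m * x + b)) ** 2)
expected = mu**2 + sigma**2

print(f"measured mse={mse:.4f}, mu^2 + sigma^2={expected:.4f}")
```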

Let's look at the real-world implication of this.

# Implications for model development
We recognize there are problems of bias and variance.  What does that look like when you create a model?

In [None]:
#  We are going to define some functions to make life easier to demonstrate

# Our synthetic data is Dave's mental well-being as predicted by the number of boys that his daughters date
def Daves_Daughters_Dates():
    x = np.array([i*np.pi/180 for i in range(60,300,4)])
    np.random.seed(10)  #Setting seed for reproducibility
    y = np.sin(x) + np.random.normal(0,0.15,len(x))
    data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
    return data

# This is the call to linear regression with some nice window dressing around it.
def linear_regression(data, power, models_to_plot):
    #initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x_%d'%i for i in range(2,power+1)])
    
    #Fit the model
    # The normalize argument was removed in scikit-learn 1.2; results for high powers may differ slightly
    linreg = LinearRegression()
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for power: %d'%power)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret

In [None]:
data = Daves_Daughters_Dates()
plt.plot(data['x'],data['y'],'.')
plt.title("Number of boys dated vs. Dave's well-being")

In [None]:
# Prep the data ... more complex models as the power goes up
for i in range(2,16):  #power of 1 is already there
    colname = 'x_%d'%i      #new var will be x_power
    data[colname] = data['x']**i

#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

#Define the powers for which a plot is required:
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

#Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

<font color='red'>
## What do we observe as the model gets more complex?

In [None]:
# Yes, this is a terrible table to look at, but necessary to understand

#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple

#### Observation
* As the model gets more complex, more coefficients are fitted, and their magnitudes grow larger and larger.
* The closer we get to fitting the data perfectly, the more we are fitting the noise rather than the signal.
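
Both observations can be checked with a short sketch. It assumes the same kind of synthetic data as above (a sine curve plus noise with standard deviation 0.15). It uses NumPy's `Polynomial.fit` in place of the notebook's `linear_regression` helper, a swapped-in technique chosen here for numerical stability. For each degree it reports the training RSS, the largest coefficient magnitude in the plain powers-of-x basis, and the squared error against the noise-free signal.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(10)  # seeded for reproducibility
x = np.array([i * np.pi / 180 for i in range(60, 300, 4)])
signal = np.sin(x)
y = signal + rng.normal(0.0, 0.15, len(x))

summary = {}
for degree in (1, 3, 6, 9, 12):
    fit = Polynomial.fit(x, y, degree)   # least squares in a rescaled domain
    y_pred = fit(x)
    rss = np.sum((y - y_pred) ** 2)                # error against the noisy data
    signal_sse = np.sum((signal - y_pred) ** 2)    # error against the true signal
    max_coef = np.max(np.abs(fit.convert().coef))  # coefficients of 1, x, x^2, ...
    summary[degree] = (rss, max_coef, signal_sse)
    print(f"degree {degree:2d}: rss={rss:.3f}, "
          f"max|coef|={max_coef:.3g}, error vs signal={signal_sse:.3f}")
```

The training RSS can only go down as the degree rises, because each model contains the previous one. The largest coefficient should grow along the way. The error against the noise-free signal is printed so you can judge where the extra flexibility stops helping.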