
*MS9002: Data Mining Techniques*
### Python lab session 1: Understanding bias and variance errors

---

In this lab session, we will investigate how a model may perform poorly due to bias and variance. 

*Learning objectives*

- To recall the concepts of bias error and variance error and how they affect model performance.



In [0]:
import pandas as pd
import seaborn as sns 
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, ColorBar
from bokeh.tile_providers import STAMEN_TERRAIN
from bokeh.transform import linear_cmap
from bokeh.palettes import Inferno256
import pyproj

train_data = pd.read_csv('./sample_data/california_housing_train.csv')
test_data = pd.read_csv('./sample_data/california_housing_test.csv')

### California housing value dataset 


Housing values are dependent on many external factors. We have here data collected from several towns and cities in the state of California. We want to build a model that will allow us to predict a region's median housing value from the various attributes of the area.




In [37]:
output_notebook()

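# Reproject longitude/latitude (EPSG:4326) to Web Mercator (EPSG:3857) to match the map tiles.
# Transformer replaces the deprecated Proj("+init=...") / pyproj.transform API.
transformer = pyproj.Transformer.from_crs("EPSG:4326", "EPSG:3857", always_xy=True)
long, lat = transformer.transform(train_data.longitude.to_numpy(),
                                  train_data.latitude.to_numpy())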

value = train_data.median_house_value
source = ColumnDataSource(data={
    'latitude': lat, 
    'longitude': long, 
    'median_house_value': value
})
mapper = linear_cmap(field_name='median_house_value', palette=Inferno256, 
                     low=value.min(), high=value.max())
color_bar = ColorBar(color_mapper=mapper['transform'], width=8, location=(0,0), 
                    label_standoff=12)

p = figure(x_axis_type='mercator', y_axis_type='mercator', 
           x_range=(min(long),max(long)), y_range=(min(lat), max(lat)))
p.add_tile(STAMEN_TERRAIN)
p.circle(x='longitude', y='latitude', source=source, size=5, color=mapper, 
        alpha=.75)
p.add_layout(color_bar, 'right')

show(p)


Let's try a linear regression model. The target variable is `median_house_value` and the remaining attributes are the predictors.

In [13]:
from sklearn.linear_model import LinearRegression
import numpy as np



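linear_model = LinearRegression()
# Predictors: every column except the last. Target: median_house_value (the last column).
X = train_data.iloc[:, :-1]
y = train_data.loc[:, 'median_house_value']
X_test = test_data.iloc[:, :-1]
y_test = test_data.loc[:, 'median_house_value']

# Fit on the training set; score() returns R^2 on the held-out test set.
linear_model.fit(X, y)
score = linear_model.score(X_test, y_test)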

print(f'Testing Rsq score for linear regression model: {score:.4f}')



Testing Rsq score for linear regression model: 0.6195


When you run the cell, you will notice that the score is fairly low. The score here is the $R^2$ value: the proportion of the variation in the response variable that is accounted for by our model. A score of 0.6195 means that the model explains about 62% of the variation in median house values in the test set.
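
For reference, the value returned by `score` for a scikit-learn regressor is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The following is a minimal sketch of that formula in plain Python. The helper name `r_squared` and the toy numbers are illustrative assumptions and are not part of the lab's dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical, for illustration only).
y_toy = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_toy, y_toy)           # predictions match exactly
baseline = r_squared(y_toy, [2.5] * 4)      # always predict the mean
reversed_fit = r_squared(y_toy, y_toy[::-1])  # predictions worse than the mean
print(perfect, baseline, reversed_fit)
```

A perfect prediction scores 1, always predicting the mean scores 0, and a model that does worse than predicting the mean scores below 0.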

This low score is an error due to bias. What happened is that we __assumed__ that the attributes affect housing prices additively. That is, each unit increase in an attribute contributes a fixed amount of "value" to the house, regardless of the levels of the other attributes. We also assumed that there is no __interaction__ between the attributes.

The $R^2$ score suggests that these simple assumptions are not met in reality. Hence the term "bias" error: the model is systematically off because of the assumptions built into it.
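
To make the idea concrete, here is a small sketch on synthetic data (not the housing data) in which the target depends only on the product of two attributes. An additive linear model cannot represent this relationship however much data it sees, whereas adding the product as an extra column lets the same least-squares procedure recover it. The toy data and the helper `fit_r_squared` are assumptions made for illustration.

```python
import numpy as np

# Toy data (hypothetical): the target is purely an interaction, y = x1 * x2.
grid = np.array([-1.0, 0.0, 1.0])
x1, x2 = (a.ravel() for a in np.meshgrid(grid, grid))
y_toy = x1 * x2


def fit_r_squared(columns, target):
    """Least-squares fit with an intercept; returns R^2 on the same data."""
    A = np.column_stack([np.ones_like(target)] + list(columns))
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - (residual @ residual) / ss_tot


r2_additive = fit_r_squared([x1, x2], y_toy)               # additive terms only
r2_interaction = fit_r_squared([x1, x2, x1 * x2], y_toy)   # plus the interaction
print(f"additive: {r2_additive:.3f}, with interaction: {r2_interaction:.3f}")
```

The additive fit scores essentially 0 even on the data it was fitted to, while the fit with the interaction column scores essentially 1. The gap comes from the model's assumptions, not from noise or a lack of data.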

### Reduce bias by adding more complexity

In real life, one way of reducing bias is to have a clearer understanding of the situation at hand. In data mining, we accomplish this by introducing a model that can account for more structure in the data. This means a more complex model. 

One simple way of introducing complexity into the model is to account for interaction and non-linearity among the attribute variables. 
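
As a rough illustration of what this expansion does, the sketch below applies scikit-learn's `PolynomialFeatures` to a single made-up sample with two attributes and then counts how many columns the expansion produces as the degree grows. The sample values, the helper `n_poly_features`, and the figure of 8 attributes are assumptions chosen for illustration.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One hypothetical sample with two attributes, a = 2 and b = 3.
sample = np.array([[2.0, 3.0]])

# A degree-2 expansion yields the columns 1, a, b, a^2, a*b, b^2.
expanded = PolynomialFeatures(degree=2).fit_transform(sample)
print(expanded)


def n_poly_features(n_attributes, degree):
    """Number of columns produced, counting the constant column."""
    return comb(n_attributes + degree, degree)


# Column counts for 8 attributes (an assumed figure) at degrees 2 to 5.
counts = {d: n_poly_features(8, d) for d in range(2, 6)}
print(counts)
```

The `a*b` column is the interaction term and the squared columns capture non-linearity. The number of columns grows quickly with the degree, which is worth keeping in mind when raising `degree`.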

In [0]:
#@title Select polynomial degree
degree = 2 #@param {type:"slider", min:2, max:5, step:1}


In [33]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline


pipeline = make_pipeline(PolynomialFeatures(degree), LinearRegression())

pipeline.fit(X, y)
score_train = pipeline.score(X, y)
score = pipeline.score(X_test, y_test)


print(f'The Rsq score of a polynomial regression model' + 
      f'\nModel of degree {degree} \n Training set: {score_train:.4f}' + 
      f'\nModel of degree {degree} \n Test set: {score:.4f}')

The Rsq score of a polynomial regression model
Model of degree 2 
 Training set: 0.7066
Model of degree 2 
 Test set: 0.6646


Now we see an improvement in the model score on the test set. Naturally the score on the training set has also improved. 

Next, try fitting the model for higher and higher values of `degree`. Note down the $R^2$ scores for both the training and test sets.

### Lab discussion 
1. Interpret and explain the scores that you observe.






### Errors due to variance

It is evident from our observations that an overly complex model does not necessarily lead to better outcomes. In fact, if we had judged the model only by its scores on the training set, we would have concluded that a high-degree model is more accurate, even though its performance on unseen data is very poor!
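
The sketch below reproduces this effect on a small synthetic problem rather than on the housing data. The noisy sine curve, the sample sizes, and the chosen degrees are all assumptions picked to keep the example short. The training score cannot decrease as the degree rises, because each lower-degree model is a special case of the higher-degree one. The test score carries no such guarantee and typically falls once the model starts fitting noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data (an assumption): a noisy sine curve on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(3.0 * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y


x_fit, y_fit = make_data(15)    # deliberately small training set
x_new, y_new = make_data(200)   # unseen data from the same process

results = []
for d in (1, 3, 12):
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(x_fit, y_fit)
    results.append((d, model.score(x_fit, y_fit), model.score(x_new, y_new)))

for d, train_r2, test_r2 in results:
    print(f"degree {d:2d}: train Rsq = {train_r2:.3f}, test Rsq = {test_r2:.3f}")
```

Comparing the two columns of scores side by side, rather than looking at the training score alone, is what exposes the problem.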


This is akin to someone who believes in "conspiracy theories". Their theory may sound plausible because they have connected "all the dots", so to speak, but they have failed to consider other data points that may challenge their position.

In data mining, this tendency to overfit the training data must be guarded against.