## Get your hands dirty!

I have been asked about bias and variance at job interviews **many, many times,** and I still feel I haven't completely grasped the concepts. So I decided to use a dataset as an example to actually capture the dynamics of bias and variance.



<img src="https://images.pexels.com/photos/4255819/pexels-photo-4255819.jpeg?cs=srgb&dl=pexels-jonathan-borba-4255819.jpg&fm=jpg" width="500px"/>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from subprocess import check_output
#print(check_output(["ls", "../input"]).decode("utf8"))
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
import zipfile

In [None]:
import matplotlib.pyplot as plt

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
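# 'seaborn-poster' was renamed 'seaborn-v0_8-poster' in matplotlib 3.6, so pick whichever exists
plt.style.use('seaborn-v0_8-poster' if 'seaborn-v0_8-poster' in plt.style.available else 'seaborn-poster')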

## Give credits to...

[Thanks to Li Li's awesome feature engineering](https://www.kaggle.com/aikinogard/random-forest-starter-with-numerical-features)

[Thanks to this fantastic explanation. Hands down the best explanation I've read so far](http://rasbt.github.io/mlxtend/user_guide/evaluate/bias_variance_decomp/)


## Theoretically speaking...

Suppose we want to estimate some unknown parameter (or function) $\theta$ using an estimator $\hat{\theta}$.

**Bias is the difference between the expected value of the estimator and the true parameter $\theta$ we want to estimate**

therefore,

$$bias = E[\hat{\theta}] - \theta $$

**Variance is the difference between the expected value of the squared estimator and the squared expectation of the estimator**

$$variance = E[\hat{\theta}^2] - (E[\hat{\theta}])^2$$

or, in a more convenient form,

$$variance = E[(E[\hat{\theta}] - \hat{\theta})^2]$$
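To make the two definitions concrete, here is a minimal simulation sketch. It assumes a toy setup that is not part of this notebook: the parameter of interest is the variance of a normal distribution, and the estimator is the sample variance that divides by n (which is known to be biased). The names `true_var`, `estimates`, and the chosen sizes are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_var = 4.0               # theta: the parameter we want to estimate (assumed toy value)
n, n_repeats = 10, 100_000   # size of each training set, number of training sets

# Draw many independent training sets and compute the estimator on each one.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(n_repeats, n))
estimates = samples.var(axis=1, ddof=0)   # theta_hat: divides by n, so it is biased

# bias = E[theta_hat] - theta
bias = estimates.mean() - true_var
# variance = E[(E[theta_hat] - theta_hat)^2]
variance = np.mean((estimates.mean() - estimates) ** 2)
# variance = E[theta_hat^2] - (E[theta_hat])^2
variance_alt = np.mean(estimates ** 2) - estimates.mean() ** 2

print(f"bias     = {bias:.3f} (theory: {-true_var / n:.3f})")
print(f"variance = {variance:.3f}, alternative form = {variance_alt:.3f}")
```

The expectation here is taken over repeated training sets, which is the same kind of average used in the rest of this notebook. Both variance forms should agree up to floating-point error.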

#### Now if we have multiple training sets drawn from the unknown distribution...

We can define these training sets as **true function plus noise**


We will use a linear regression model to illustrate the idea.



In [None]:
reg = pd.read_csv('../input/random-linear-regression/train.csv')
reg.head()

In [None]:
## remove outliers
reg = reg[reg.x <= 75]

In [None]:
# fill missing y values with the mean of y (reassign rather than modify a filtered frame in place)
reg = reg.fillna({'y': reg['y'].mean()})

In [None]:
##draw samples from the dataset

sample1 = reg.sample(frac=.01,replace=False, random_state=6)
sample2 = reg.sample(frac=.01,replace=False, random_state=4)
sample3 = reg.sample(frac=.01,replace=False, random_state=9)

Because we want to artificially create a high-bias situation, here we fit a constant model, $\hat{y} = mean(y)$ of each sample, as a deliberately too-simple approximation of the true function.

In [None]:
lm = LinearRegression()

In [None]:
plt.figure(figsize=(10, 5))
num=1
colors = ['red', 'green', 'yellow']
colors_dot= ['darkred', 'darkgreen', 'orange']
for sample in [sample1, sample2, sample3]:
    mean_y = np.mean(sample['y'])
    plt.scatter(sample.x, sample.y,color= colors_dot[num-1])
    plt.hlines(xmin= 0, xmax = 75, y=mean_y, color= colors[num-1], label= 'sample' + str(num))
    num +=1
## fit true distribution
lm.fit(X = pd.DataFrame(reg['x']), y=reg['y'])
plt.plot(reg.x, lm.predict(pd.DataFrame(reg['x'])), color='black', label='True function f(x)')
plt.legend()
plt.title('high bias');

Here, we can say that the **bias is large** because the difference between the true value and the predicted value is large on average
(here, the average is the **expectation over training sets**, not the expectation over examples within a training set).

Now, to create a high-variance scenario, we can build overfitting models that simply connect the dots of each sample.

In [None]:
plt.figure(figsize=(10, 5))
for num, sample in enumerate([sample1, sample2, sample3], start=1):
    # sort by x so the line connects neighbouring points instead of zigzagging in sampling order
    sample = sample.sort_values('x')
    plt.plot(sample['x'], sample['y'], 'o-', label='sample' + str(num))
## fit true distribution
lm.fit(X = pd.DataFrame(reg['x']), y=reg['y'])
plt.plot(reg.x, lm.predict(pd.DataFrame(reg['x'])), color='black', label='True function f(x)')
plt.legend()
plt.title('high variance');

As we can see, the variance is very large, since **on average, a prediction differs a lot from the expected value of the prediction**.
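The two pictures can also be summarized with numbers. The sketch below does not use the notebook's dataset: it assumes a synthetic true function f(x) = x with Gaussian noise (the noise level, sample sizes, and test grid are made-up values) and draws many training sets, so that the expectation over training sets can be approximated by an average. The helper `bias2_and_variance` is a hypothetical name introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Assumed true function for this toy example."""
    return x

x_test = np.linspace(5, 70, 50)            # points where the models are evaluated
n_sets, n_points, noise_sd = 500, 30, 10.0

pred_const = np.empty((n_sets, x_test.size))
pred_interp = np.empty((n_sets, x_test.size))

for i in range(n_sets):
    # One training set = true function plus noise
    x = np.sort(rng.uniform(0, 75, n_points))
    y = f(x) + rng.normal(0, noise_sd, n_points)
    pred_const[i] = y.mean()                   # underfit: horizontal line at mean(y)
    pred_interp[i] = np.interp(x_test, x, y)   # overfit: straight lines between the dots

def bias2_and_variance(preds):
    """Average squared bias and variance over the test points."""
    mean_pred = preds.mean(axis=0)             # expectation over training sets
    bias2 = np.mean((mean_pred - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

results = {
    "constant": bias2_and_variance(pred_const),
    "interpolation": bias2_and_variance(pred_interp),
}
for name, (b2, v) in results.items():
    print(f"{name:>13}: bias^2 = {b2:8.2f}, variance = {v:8.2f}")
```

Under these assumptions the constant model should show a much larger squared bias, while the connect-the-dots model should show the larger variance, mirroring the two plots.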

Now we can use the mlxtend package to quantify the bias and variance on a real-world dataset.

In [None]:
with zipfile.ZipFile("../input/two-sigma-connect-rental-listing-inquiries/"+'train.json'+".zip","r") as z:
    z.extractall("/kaggle/working/")

In [None]:
df = pd.read_json("/kaggle/working/train.json")
## sample data
df = df.sample(frac=.01, replace=False, random_state=3)
## naive feature engineering
## referenced from Li Li's notebook
df["num_photos"] = df["photos"].apply(len)
df["num_features"] = df["features"].apply(len)
df["num_description_words"] = df["description"].apply(lambda x: len(x.split(" ")))
df["created"] = pd.to_datetime(df["created"])
df["created_year"] = df["created"].dt.year
df["created_month"] = df["created"].dt.month
df["created_day"] = df["created"].dt.day
num_feats = ["bathrooms", "bedrooms", "latitude", "longitude", "price",
             "num_photos", "num_features", "num_description_words",
             "created_year", "created_month", "created_day"]
X = df[num_feats]
y = df["interest_level"]
X.head()



In [None]:
df.interest_level.value_counts()

In [None]:
y_dict = {'low': 1, 'medium':2, 'high':3}
y = df['interest_level'].apply(lambda x: y_dict[x])

In [None]:
## train test split (fixed seed so the split is reproducible; the value itself is arbitrary)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=123)


### Now let's build a random forest

We can change the number of estimators (i.e., the model complexity) and then estimate the bias and variance at each setting.
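Before calling the library, it may help to see what a 0-1 loss decomposition computes. The function below is a simplified sketch written for this explanation, not mlxtend's actual code: it assumes we already have a matrix of predictions from models trained on different bootstrap rounds, and the name `zero_one_decomposition` is hypothetical. The "main prediction" is the most frequent label per test point; bias counts where it is wrong, and variance counts how often individual rounds disagree with it.

```python
import numpy as np

def zero_one_decomposition(all_pred, y_true):
    """Decompose 0-1 loss from predictions of shape (n_rounds, n_test).

    Labels are assumed to be non-negative integers.
    """
    all_pred = np.asarray(all_pred)
    y_true = np.asarray(y_true)

    # Main prediction: the most frequent predicted label for each test point
    main_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_pred)

    avg_expected_loss = np.mean(all_pred != y_true)   # error over all rounds and points
    avg_bias = np.mean(main_pred != y_true)           # the main prediction is wrong
    avg_var = np.mean(all_pred != main_pred)          # a round disagrees with the main prediction
    return avg_expected_loss, avg_bias, avg_var

# Toy example: 4 bootstrap rounds, 3 test points, labels in {1, 2, 3}
all_pred = np.array([[1, 2, 3],
                     [1, 2, 1],
                     [1, 3, 1],
                     [2, 2, 1]])
y_true = np.array([1, 2, 3])

loss, bias, var = zero_one_decomposition(all_pred, y_true)
print(f"loss = {loss:.4f}, bias = {bias:.4f}, var = {var:.4f}")
# loss = 0.4167, bias = 0.3333, var = 0.2500
```

Note that in this toy example the loss is not equal to bias plus variance. Unlike squared loss, the 0-1 loss does not decompose into a simple sum, so the three curves plotted later should be read separately.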

In [None]:
from mlxtend.evaluate import bias_variance_decomp

In [None]:
X_train_arr = X_train.values
y_train_arr = y_train.values
X_val_arr = X_val.values
y_val_arr = y_val.values

In [None]:
estimators_nums = [10, 30, 50, 100, 300, 500]

avg_expected_losses = []
avg_biases = []
avg_vars = []

for n in estimators_nums:
    print('estimators: ', n)
    rf = RandomForestClassifier(n_estimators=n)
    avg_expected_loss, avg_bias, avg_var = bias_variance_decomp(
        rf, X_train_arr, y_train_arr, X_val_arr, y_val_arr, loss='0-1_loss',
        random_seed=123)
    
    avg_expected_losses.append(avg_expected_loss)
    avg_biases.append(avg_bias)
    avg_vars.append(avg_var)


In [None]:
plt.figure(figsize=(8,5))
plt.plot(estimators_nums, avg_expected_losses, label='avg expected loss')
plt.plot(estimators_nums, avg_biases, label='avg bias')
plt.plot(estimators_nums, avg_vars, label='avg var')
plt.xlabel('n estimator')
plt.legend();

In [None]:
lm.fit(np.array(estimators_nums).reshape(-1,1), np.array(avg_vars))
lm.coef_

In [None]:
plt.figure(figsize=(8, 5))
estimators_nums_inv = [1/i for i in estimators_nums]
plt.plot(estimators_nums_inv, avg_vars)
plt.xlabel('1 / number_of_estimator')
plt.ylabel('avg var')
plt.title('variances and inverse number of estimators');

As shown above, as the number of trees in the random forest increases, the average variance decreases: the two are negatively correlated. The second plot suggests that the average variance is almost proportional to the **inverse of the number of estimators**.

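This is because a random forest uses bagging, i.e. it averages trees trained on **bootstrap samples**. As we average over more bootstrapped trees, the mean stays the same but the variance shrinks. Suppose we have $p$ trees whose predictions $x_1, ..., x_p$ are independent, each with variance $\sigma^2$; then the variance of their average is

$$var\left(\frac{x_1 + x_2 + ... + x_p}{p}\right) = \frac{var(x_1) + var(x_2) + ... + var(x_p)}{p^2} = \frac{p \sigma^2}{p^2} = \frac{\sigma^2}{p}$$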
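The formula can be checked with a small simulation. The sketch below uses made-up random variables rather than real trees, and `var_of_mean` is a hypothetical helper. It also covers the case where the averaged quantities share a pairwise correlation $\rho$, which is closer to trees grown on overlapping bootstrap samples; a standard result says the variance of the average is then $\rho\sigma^2 + (1-\rho)\sigma^2/p$, so it flattens out at $\rho\sigma^2$ instead of going to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n_trials = 1.0, 10_000   # assumed variance of each variable, number of repetitions

def var_of_mean(p, rho):
    """Empirical variance of the average of p variables, each with variance
    sigma2 and pairwise correlation rho (rho = 0 means independent)."""
    shared = rng.normal(size=(n_trials, 1))   # component common to all p variables
    own = rng.normal(size=(n_trials, p))      # component unique to each variable
    x = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)
    return x.mean(axis=1).var()

rho = 0.3
for p in [10, 30, 100]:
    indep = var_of_mean(p, rho=0.0)
    corr = var_of_mean(p, rho=rho)
    print(f"p={p:>3}: independent {indep:.4f} (sigma^2/p = {sigma2 / p:.4f}), "
          f"correlated {corr:.4f} (theory = {rho * sigma2 + (1 - rho) * sigma2 / p:.4f})")
```

In the independent case the variance should track $\sigma^2/p$ closely. With correlation it should still fall as $p$ grows, but only roughly in proportion to $1/p$, which is one possible reading of the "almost proportional" behaviour seen in the plot.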

For more information, please refer to:

[Ensemble Learning on Bias and Variance](https://www.section.io/engineering-education/ensemble-bias-var/)