# Overfitting and Underfitting

Models too complex for the data overfit:
* they explain the data they have seen too well
* they do not generalize

Models too simple for the data underfit:
* they capture no noise
* they are limited by their expressivity
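
Both failure modes can be reproduced on a toy problem. The sketch below is not part of the original notebook: it assumes a synthetic one-dimensional dataset (a noisy cosine) and uses polynomial degree as a stand-in for model complexity. The names `true_function` and `errors` are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)


def true_function(x):
    # Assumed ground truth for this toy example
    return np.cos(1.5 * np.pi * x)


# A small noisy training set and a larger held-out set from the same process
x_train = rng.uniform(0, 1, size=30)
y_train = true_function(x_train) + rng.normal(scale=0.1, size=30)
x_test = rng.uniform(0, 1, size=200)
y_test = true_function(x_test) + rng.normal(scale=0.1, size=200)

errors = {}
for degree in (1, 4, 12):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, deg=degree, domain=[0, 1])
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    errors[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The degree-1 model is too simple to follow the curve, so its error is high on both sets (underfitting). The training error keeps shrinking as the degree grows, because the extra flexibility is spent on fitting noise; whether the held-out error follows is exactly the trade-off this section is about.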

How to find the right trade-off?

In [1]:
# Standard imports
import pandas as pd
import numpy as np

# Disable jedi autocompleter
%config Completer.use_jedi = False

## The Framework and Why We Need It

In this section, we go into the details of the cross-validation framework.  
Before we dive in, let's consider why we always keep separate training and testing sets. We first look at the limitations of using a dataset without holding any samples out.

To illustrate the different concepts, we will use the California housing dataset.

In [2]:
from sklearn.datasets import fetch_california_housing

In [5]:
housing = fetch_california_housing(as_frame=True)
type(housing)

sklearn.utils.Bunch

In [6]:
dir(housing)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']
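
A `Bunch` behaves like a dictionary whose keys are also available as attributes, which is why `dir(housing)` lists the keys. A minimal sketch on a hand-built object (`toy` and its contents are made up for illustration, so nothing needs to be downloaded):

```python
import numpy as np
from sklearn.utils import Bunch

# A small hand-made Bunch; the real dataset object has more keys
toy = Bunch(
    data=np.zeros((3, 2)),
    target=np.array([1.0, 2.0, 3.0]),
    feature_names=["a", "b"],
)

# Attribute access and key access return the same object
same_object = toy.target is toy["target"]
keys = sorted(toy.keys())

print(same_object)
print(keys)
```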

In [7]:
data = housing.data
target = housing.target
data.head()

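|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |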


To simplify future visualization, let's convert the prices from units of 100 k\$ to units of k\$ (thousands of dollars).

In [9]:
target *= 100
target.head()

0    452.6
1    358.5
2    352.1
3    341.3
4    342.2
Name: MedHouseVal, dtype: float64
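
One caveat with `target *= 100`: the operation is in place, so every name bound to the same Series sees the change, and re-running the cell scales the values a second time. A minimal sketch of the difference on a small hand-made Series (`raw`, `alias` and `scaled` are illustrative names; the two values are the first two targets shown above, divided back by 100):

```python
import pandas as pd

# Illustrative values in units of 100 k$
raw = pd.Series([4.526, 3.585], name="MedHouseVal")

# Out-of-place: builds a new Series and leaves `raw` untouched
scaled = raw * 100
untouched_first = raw.iloc[0]

# In-place through a second name: `alias` and `raw` are the same object
alias = raw
alias *= 100
mutated_first = raw.iloc[0]

print(scaled.iloc[0], untouched_first, mutated_first)
```

Writing the conversion as a new assignment from the original values (for example `target = housing.target * 100`) is one way to keep a cell safe to re-run.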

## Training Error vs. Testing Error

To solve this regression task, we will use a decision tree regressor.

In [10]:
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(data, target)

DecisionTreeRegressor(random_state=0)

After training the regressor, we would like to know its potential generalization performance once deployed in production. For this purpose, we use the <code style="background:yellow;color:black">mean absolute error</code>, which gives us an error in the native unit of the target, i.e. k\$.
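
For reference, the mean absolute error averages the absolute differences between true and predicted values. With $y_i$ the true target, $\hat{y}_i$ the prediction and $n$ the number of samples (notation chosen here, not taken from the notebook):

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

The sketch below computes it by hand and compares the result with scikit-learn. The predictions in `y_pred` are hypothetical values picked to make the arithmetic easy to follow: (12.6 + 1.5 + 0) / 3 = 4.7.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First three targets from above (in k$) and hypothetical predictions
y_true = np.array([452.6, 358.5, 352.1])
y_pred = np.array([440.0, 360.0, 352.1])

# Manual computation: mean of the absolute residuals
manual_mae = np.mean(np.abs(y_true - y_pred))

# Same quantity computed by scikit-learn
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(f"manual: {manual_mae:.2f} k$, scikit-learn: {sklearn_mae:.2f} k$")
```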

In [12]:
from sklearn.metrics import mean_absolute_error

target_predicted = regressor.predict(data)
score = mean_absolute_error(target, target_predicted)
print(f"On average, our regressor makes an error of {score:.2f} k$")

On average, our regressor makes an error of 0.00 k$


We get a perfect prediction with no error. This is too optimistic and almost always reveals a methodological problem in machine learning.

<div class="alert alert-block alert-info">
Indeed, we trained and predicted on the same dataset. Since our decision tree was fully grown, every sample in the dataset is stored in a leaf node. As a result, the tree fully memorized the dataset given during <em>fit</em> and made no error when predicting.</div>
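
The memorization can be checked directly. The sketch below assumes synthetic data in which the target is pure noise, so there is nothing to learn, and compares a fully grown tree with a depth-limited one. The variable names are illustrative; `get_n_leaves` and `get_depth` are methods of scikit-learn's tree estimators.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_samples = 200
X = rng.normal(size=(n_samples, 3))
y = rng.normal(size=n_samples)  # pure noise, unrelated to X

# Default settings: the tree grows until every leaf is pure
full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
# Depth-limited tree for comparison: at most 2**3 = 8 leaves
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

full_train_mae = np.mean(np.abs(full_tree.predict(X) - y))
shallow_train_mae = np.mean(np.abs(shallow_tree.predict(X) - y))

print(f"full tree:    {full_tree.get_n_leaves()} leaves, "
      f"depth {full_tree.get_depth()}, train MAE {full_train_mae:.3f}")
print(f"shallow tree: {shallow_tree.get_n_leaves()} leaves, "
      f"depth {shallow_tree.get_depth()}, train MAE {shallow_train_mae:.3f}")
```

Even though the target carries no signal, the fully grown tree ends up with one leaf per sample and reproduces every training value. A zero training error therefore says nothing about what the model has learned.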

The error computed above is called the <code style="background:yellow;color:black">empirical error or training error</code>.
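
To see how the training error differs from the error on unseen data, we can hold out part of the samples before fitting. A minimal sketch, assuming synthetic data from `make_regression` rather than the housing dataset; the 25% test size and the noise level are arbitrary choices made for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem with noisy targets
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)

# Keep 25% of the samples out of training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

tree = DecisionTreeRegressor(random_state=0)
tree.fit(X_train, y_train)

# Error on samples seen during fit vs. samples never seen
train_error = mean_absolute_error(y_train, tree.predict(X_train))
test_error = mean_absolute_error(y_test, tree.predict(X_test))

print(f"Training error: {train_error:.2f}")
print(f"Testing error:  {test_error:.2f}")
```

The training error is still zero, but the error on the held-out samples is not. That second number is the one that estimates generalization performance.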