# Understanding Overfitting and Underfitting

## Introduction
Because a model is built from training data, it tends to follow the patterns found in that data. As a result, it may capture patterns that do not appear in real-world data, or it may fail to capture patterns that real-world data does contain.

Overfitting is the case where the model captures patterns that are found only in the training data.
Underfitting is the case where the model fails to capture the relevant patterns, so it gives inaccurate predictions.

Whether a model overfits or underfits depends mainly on the complexity of the chosen model.
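
To make the distinction concrete, here is a minimal sketch that compares training error with validation error for trees of different sizes. The data is synthetic (a noisy sine curve chosen only for illustration), and the helper name `train_val_mae` is hypothetical; neither is part of the housing example in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Synthetic data (an assumption for illustration): a noisy sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=400)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

def train_val_mae(max_leaf_nodes):
    """Fit a tree and return (training MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, model.predict(X_train))
    val_mae = mean_absolute_error(y_val, model.predict(X_val))
    return train_mae, val_mae

# None lets the tree grow without any limit on the number of leaves.
for n in (2, 20, None):
    train_mae, val_mae = train_val_mae(n)
    print(f"max_leaf_nodes={n}: train MAE={train_mae:.3f}, val MAE={val_mae:.3f}")
```

A very small tree should show high error on both sets (underfitting), while an unconstrained tree should drive the training error to roughly zero and leave the validation error noticeably higher (overfitting).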

## Setup

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings("always")

# Returns the mean absolute error on the validation data for a tree
# limited to max_leaf_nodes leaf nodes.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    pred_y = model.predict(val_X)
    return mean_absolute_error(val_y, pred_y)

file_path = "data/mel_housing/melb_data.csv"
data = pd.read_csv(file_path)

# Drop rows that contain missing values.
data = data.dropna(axis=0)

# Set the features and the target.
# 'Lattitude' and 'Longtitude' are spelled this way in the dataset's column headers.
y = data.Price
features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
            'YearBuilt', 'Lattitude', 'Longtitude']
X = data[features]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

## Experiment
Now we check the validation mean absolute error for different values of max_leaf_nodes, which caps the number of leaves and therefore limits how deep and complex the tree can grow.

In [None]:
print('mae for max_leaf_nodes = 5    ',get_mae(5, train_X, val_X, train_y, val_y))
print('mae for max_leaf_nodes = 50   ',get_mae(50, train_X, val_X, train_y, val_y))
print('mae for max_leaf_nodes = 500  ',get_mae(500, train_X, val_X, train_y, val_y))
print('mae for max_leaf_nodes = 5000 ',get_mae(5000, train_X, val_X, train_y, val_y))
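
To see how max_leaf_nodes relates to the size of the fitted tree, the following sketch fits trees on synthetic data and reports the number of leaves and the depth using `get_n_leaves()` and `get_depth()`. The data and the helper name `tree_size` are assumptions made for illustration, not part of the housing experiment.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): two features plus noise.
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(1000, 2))
y_demo = 2 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.5, size=1000)

def tree_size(max_leaf_nodes):
    """Fit a tree and return (number of leaves, depth)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(X_demo, y_demo)
    return model.get_n_leaves(), model.get_depth()

for n in (5, 50, 500):
    leaves, depth = tree_size(n)
    print(f"max_leaf_nodes={n}: leaves={leaves}, depth={depth}")
```

The parameter is an upper bound on the leaf count rather than a direct depth setting, so the depth it produces depends on the data.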

## Conclusion
As the experiment shows, the validation mean absolute error first drops and then rises again as the complexity of the model increases.
The error at low max_leaf_nodes is due to underfitting: the model is too simple.
The error at high max_leaf_nodes is due to overfitting: the model follows patterns specific to the training data.
In the example above, max_leaf_nodes = 500 seems to be the best of the values tried.
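
Picking the best value by eye can be replaced with a small helper. This is a minimal sketch: `pick_best_leaf_count` is a hypothetical name, and it accepts any scoring function so that it does not depend on the dataset being loaded.

```python
def pick_best_leaf_count(candidates, score_fn):
    """Return (best candidate, scores), where scores maps each candidate to its error."""
    scores = {n: score_fn(n) for n in candidates}
    # The best candidate is the one with the lowest error.
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical usage with this notebook's helper and data split:
# best, scores = pick_best_leaf_count(
#     [5, 50, 500, 5000],
#     lambda n: get_mae(n, train_X, val_X, train_y, val_y),
# )
```

Because the choice is made on the validation set, the selected value is only the best among the candidates tried, not necessarily the best overall.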