# 5. Underfitting and Overfitting

**Overfitting**: The model matches the training data almost perfectly but performs poorly on test data and on new, unseen data.

**Underfitting**: The model fails to capture the relationship between X and Y, so it performs poorly on both training and test data.

## Example
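Consider a decision tree whose size we can control. A deep tree with many leaves can fit the training data so closely that it overfits. A shallow tree with only a few splits is too coarse to capture the relationship, so it may underfit. The code below controls tree size through `max_leaf_nodes`.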
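
To make the two failure modes visible side by side, the following sketch reports both training and validation error for three tree sizes. The data is synthetic (a noisy sine curve), and the names `X_demo`, `y_demo`, and `results` are assumptions introduced only for this illustration; they are not part of the Melbourne example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(400, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for leaves in [2, 20, 300]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    # Error on the rows the tree was fitted on vs. rows it has never seen.
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"leaves={leaves:>3}  train MAE={train_mae:.3f}  validation MAE={val_mae:.3f}")
```

With only 2 leaves, both errors should be high (underfitting). With 300 leaves, the tree can give nearly every training row its own leaf, so the training error drops to roughly zero while the validation error stays well above it (overfitting).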

In [1]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a tree limited to max_leaf_nodes leaves and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [2]:
import pandas as pd

melbourne_data = pd.read_csv('./resources/melb_data.csv')
filtered_melbourne_data = melbourne_data.dropna(axis=0)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

In [7]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print(f"Max leaf nodes: {max_leaf_nodes:,}  \t\t Mean Absolute Error:  {my_mae:,.2f}")

Max leaf nodes: 5  		 Mean Absolute Error:  347,380.34
Max leaf nodes: 50  		 Mean Absolute Error:  258,171.21
Max leaf nodes: 500  		 Mean Absolute Error:  243,495.96
Max leaf nodes: 5,000  		 Mean Absolute Error:  255,575.13
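
The error falls as the tree grows from 5 to 500 leaves and rises again at 5,000, which is the underfitting-to-overfitting pattern described earlier. As a minimal sketch, the best size can be read off programmatically. The dictionary `reported_mae` is a hypothetical name, and its values are copied by hand from the output above.

```python
# Validation MAE values copied from the output above, keyed by max_leaf_nodes.
reported_mae = {5: 347380.34, 50: 258171.21, 500: 243495.96, 5000: 255575.13}

# The key with the smallest validation error is the best candidate tree size.
best_leaves = min(reported_mae, key=reported_mae.get)
print(best_leaves)  # 500
```

Only four sizes were tried, so 500 is the best of these candidates rather than a proven optimum.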


## Exercises

In [8]:
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

iowa_file_path = './resources/home-data-for-ml-course/train.csv'

home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)

val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)  # (y_true, y_pred) order
print("Validation MAE: {:,.0f}".format(val_mae))

Validation MAE: 29,653


### 1. Compare Different Tree Sizes

In [9]:
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
leaf_errors = {}
for max_leaf_nodes in candidate_max_leaf_nodes:
    mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    leaf_errors[max_leaf_nodes] = mae
best_tree_size = min(leaf_errors, key=leaf_errors.get)
print(best_tree_size)

100


### 2. Fit Model Using All Data

In [None]:
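final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
final_model.fit(X, y)  # refit on all rows now that the tree size is chosen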