### **Underfitting**
* The model is oversimplified, so it makes large errors
* It fails to capture the relevant patterns in the data, so prediction accuracy is low

### **Overfitting**
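* The model fits the training data too closely: accuracy is high on the training data, but errors are large on real test data
* It captures spurious patterns that will not recur in the future, so prediction accuracy is low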

<img src = "http://i.imgur.com/AXSEOfI.png" width="400">
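The two failure modes can be reproduced on a small scale. The sketch below is not part of the original notebook: it assumes synthetic data (a sine curve plus noise), and the helper name `train_and_val_mae` is made up for illustration. It compares training and validation error for a tree that is too small, a moderately sized tree, and a tree with no leaf limit.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a sine curve plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 6.0, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.3, size=400)

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

def train_and_val_mae(max_leaf_nodes):
    """Fit a tree with the given leaf limit and return (train MAE, validation MAE)."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    train_mae = mean_absolute_error(train_y, model.predict(train_X))
    val_mae = mean_absolute_error(val_y, model.predict(val_X))
    return train_mae, val_mae

# 2 leaves: too simple. None: no leaf limit, so the tree can memorize the training set.
results = {leaves: train_and_val_mae(leaves) for leaves in (2, 20, None)}
for leaves, (train_mae, val_mae) in results.items():
    print(f"max_leaf_nodes={leaves}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")
```

An underfit tree tends to show a high error on both sets, while an overfit tree shows a training error near zero and a clearly larger validation error. The gap between the two numbers is the signal to watch.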

* A helper function that computes the MAE (mean absolute error)

In [9]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    """Fit a decision tree capped at max_leaf_nodes and return its validation MAE."""
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
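
For reference, MAE is the average of the absolute differences between actual and predicted values: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below computes it by hand on made-up numbers. `mae_by_hand` is a hypothetical name used only here, and for plain lists like these it should agree with `mean_absolute_error`.

```python
def mae_by_hand(y_true, y_pred):
    """Mean absolute error: the average of |actual - predicted|."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy prices (made up for illustration): the absolute errors are 10, 20 and 0.
print(mae_by_hand([100, 200, 300], [110, 180, 300]))  # 10.0
```

Because the errors are not squared, MAE stays in the same unit as the target (here, a price), which makes it easy to interpret.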


In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Melbourne housing data and drop rows with missing values
melbourne_file_path = '../datafile/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
filtered_melbourne_data = melbourne_data.dropna(axis=0)

# Prediction target and features
# ('Lattitude' and 'Longtitude' are spelled this way in the dataset's columns)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

# Split into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

In [11]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes : %d\t MAE : %d" % (max_leaf_nodes, my_mae))

Max leaf nodes : 5	 MAE : 347380
Max leaf nodes : 50	 MAE : 258171
Max leaf nodes : 500	 MAE : 243495
Max leaf nodes : 5000	 MAE : 254983
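
The validation MAE falls as the tree grows from 5 to 500 leaves and rises again at 5000. This is consistent with underfitting at the small end and overfitting at the large end. A minimal sketch of picking the best size, with the MAE values copied from the output above (the dictionary name `mae_by_leaf_nodes` is an assumption for illustration):

```python
# Validation MAE per max_leaf_nodes, copied from the printed results.
mae_by_leaf_nodes = {5: 347380, 50: 258171, 500: 243495, 5000: 254983}

# The best tree size is the one with the lowest validation MAE.
best_leaf_nodes = min(mae_by_leaf_nodes, key=mae_by_leaf_nodes.get)
print("Best max_leaf_nodes: %d" % best_leaf_nodes)  # Best max_leaf_nodes: 500
```

Only four candidate sizes were tried, so 500 is the best of these candidates and not necessarily the optimum. A finer grid between 50 and 5000 could locate a better value.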
