### Generating Overfit Tree Models

By default, scikit-learn tree models keep growing until every leaf is pure (or holds fewer than `min_samples_split` samples). To explore this, build models with different values of the `max_depth` parameter and determine when the tree begins to overfit the data. For depths from `max_depth = 1` until the tree is fully grown, track the accuracy on the training and test data, then plot depth on the horizontal axis and accuracy on the vertical axis, with one curve each for the train and test data.
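As a quick illustration of the default behavior, the sketch below compares an unrestricted tree with a depth-1 tree. It uses synthetic data from `make_classification` rather than the dataset in this notebook, and the names `X_demo`, `y_demo`, `full_tree`, and `stump` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration (not the dataset from this notebook).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# max_depth defaults to None, so this tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# A depth-1 tree (a "stump") makes a single split.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

print('unrestricted depth:', full_tree.get_depth())
print('unrestricted training accuracy:', full_tree.score(X_demo, y_demo))
print('depth-1 training accuracy:', stump.score(X_demo, y_demo))
```

Because the synthetic features are continuous, the unrestricted tree can separate every training point. A perfect training score on its own therefore says little about how the model generalizes.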

Repeat this process with different splits of the data to determine at what depth the tree begins to overfit. Share your results with your peers and discuss your approach to generating the visualization. What are the consequences of this overfitting for your approach to building decision trees? The dataset provided is a small set of health data, and your goal is to predict whether or not each individual survives.

In [5]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('data/Whickham.txt')

In [9]:
data.outcome = data.outcome.map({'Alive':1, 'Dead':0})
data.smoker = data.smoker.map({'Yes':1, 'No':0})
data.head()

Unnamed: 0,outcome,smoker,age
0,1,1,23
1,1,1,18
2,0,1,71
3,1,0,67
4,1,0,64


In [10]:
X = data[['smoker', 'age']]
y = data['outcome']

In [94]:
rows = []

# Fit one tree per (depth, test split size) pair and record its accuracy.
for max_depth in np.arange(1, 21):
    for test_size in np.linspace(.1, .9, 9):

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = test_size)

        model = DecisionTreeClassifier(criterion = 'gini', max_depth = max_depth)
        model.fit(X_train, y_train)

        # DataFrame.append was removed in pandas 2.0, so collect the rows
        # in a list and build the frame once after the loop.
        rows.append({'Depth': max_depth, 'Split': test_size,
                     'Train': model.score(X_train, y_train),
                     'Test': model.score(X_test, y_test)})

score = pd.DataFrame(rows, columns = ['Depth', 'Split', 'Train', 'Test'])
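One way to read off where overfitting starts is to average the train and test accuracy over repeated splits at each depth and look at the gap between them. The sketch below does this on a synthetic stand-in for the health data. The generated `smoker` and `age` columns, the survival probability formula, and the names `demo_X`, `demo_y`, `demo_score`, and `summary` are all assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the health data (an assumption, not the Whickham file):
# survival becomes less likely with age, and smoker is a random binary flag.
rng = np.random.default_rng(0)
n = 1000
demo_X = pd.DataFrame({
    'smoker': rng.integers(0, 2, size=n),
    'age': rng.integers(18, 85, size=n),
})
p_alive = 1 / (1 + np.exp((demo_X['age'] - 65) / 8))
demo_y = (rng.random(n) < p_alive).astype(int)

rows = []
for max_depth in range(1, 11):
    for seed in range(10):  # ten random 70/30 splits per depth
        X_tr, X_te, y_tr, y_te = train_test_split(
            demo_X, demo_y, test_size=0.3, random_state=seed)
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        tree.fit(X_tr, y_tr)
        rows.append({'Depth': max_depth,
                     'Train': tree.score(X_tr, y_tr),
                     'Test': tree.score(X_te, y_te)})

demo_score = pd.DataFrame(rows)

# Mean accuracy per depth, plus the train/test gap as a rough overfitting signal.
summary = demo_score.groupby('Depth')[['Train', 'Test']].mean()
summary['Gap'] = summary['Train'] - summary['Test']
print(summary.round(3))
```

The same `groupby` could be applied to the `score` frame built above. The depth at which `Gap` starts to widen while `Test` stops improving is a reasonable candidate for where the tree begins to overfit.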


In [127]:
# import matplotlib.pyplot as plt


# fig, ax = plt.subplots()

# ax.step(score.Depth, score.Train_Accuracy, label = 'Training Accuracy', alpha = .8)
# ax.step(score.Depth, score.Test_Accuracy, label = 'Test Accuracy', alpha = .8, linestyle = 'dashed')

# ax.set_xlabel('Maximum layer depth')
# ax.set_ylabel('Model accuracy')

# ax.set_xlim(1,8)
# ax.legend()

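import plotly.express as px

# score is ordered depth-major (20 depths x 9 split sizes). After rot90, columns
# follow depth and rows run from test_size = 0.9 (top) down to 0.1 (bottom).
train = np.rot90(score['Train'].to_numpy(dtype = float).reshape(20, 9))
test = np.rot90(score['Test'].to_numpy(dtype = float).reshape(20, 9))

fig = px.imshow(train, labels = {'x':'Maximum Depth', 'y':'Test Split Size', 'color':'Training Accuracy'})
fig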

In [124]:
fig = px.imshow(test, labels = {'x':'Maximum Depth', 'y':'Test Split Size', 'color':'Test Accuracy'})
fig

array([[0.85109983, 0.85823026, 0.8531012 , 0.86928934, 0.84474886,
        0.86666667, 0.83502538, 0.84732824, 0.90076336],
       [0.85448393, 0.85252141, 0.84657236, 0.86294416, 0.85235921,
        0.84190476, 0.86294416, 0.83206107, 0.90076336],
       [0.85532995, 0.84776403, 0.8476605 , 0.84517766, 0.86605784,
        0.86857143, 0.85279188, 0.88167939, 0.84732824],
       [0.85532995, 0.85252141, 0.84874864, 0.86675127, 0.85083714,
        0.84952381, 0.86294416, 0.87022901, 0.87022901],
       [0.84856176, 0.86584206, 0.86071817, 0.86040609, 0.86453577,
        0.89333333, 0.86294416, 0.88549618, 0.90076336],
       [0.86548223, 0.86203616, 0.8781284 , 0.86167513, 0.85388128,
        0.87428571, 0.86548223, 0.88549618, 0.9389313 ],
       [0.86125212, 0.86298763, 0.86833515, 0.8642132 , 0.85996956,
        0.88190476, 0.86294416, 0.90076336, 0.92366412],
       [0.85532995, 0.86489058, 0.88139282, 0.86294416, 0.87062405,
        0.87809524, 0.88832487, 0.92366412, 0.93129771],
