# Decision Tree Exercises

#### Using the Titanic data, in your classification-exercises repository, create a notebook, `model.ipynb`, where you will do the following:

#### 1. What is your baseline prediction? What is your baseline accuracy? *Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.*



In [1]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import sklearn.metrics

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

import acquire
import prepare

In [2]:
# read titanic data into dataframe
titanic = acquire.get_titanic_data()
# clean and split data
train, validate, test = prepare.prep_titanic(titanic)
train.head()

     survived  pclass     sex   age  sibsp  parch     fare  embark_town  alone  embark_town_Queenstown  embark_town_Southampton  sex_male
513         1       1  female  54.0      1      0     59.4    Cherbourg      0                       0                        0         0
169         0       3    male  28.0      0      0  56.4958  Southampton      1                       0                        1         1
276         0       3  female  45.0      0      0     7.75  Southampton      1                       0                        1         0
541         0       3  female   9.0      4      2   31.275  Southampton      0                       0                        1         0
406         0       3    male  51.0      0      0     7.75  Southampton      1                       0                        1         1


In [3]:
# determine most prevalent class
train.survived.value_counts()

0    220
1    157
Name: survived, dtype: int64

In [4]:
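# create x and y versions of train, validate, and test samples
# drop the target plus the string columns sex and embark_town, which are already encoded as sex_male and embark_town_*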
x_train = train.drop(columns=['survived', 'sex', 'embark_town'])
y_train = train.survived

x_validate = validate.drop(columns=['survived', 'sex', 'embark_town'])
y_validate = validate.survived

x_test = test.drop(columns=['survived', 'sex', 'embark_town'])
y_test = test.survived

In [5]:
# create baseline
baseline = 0
# boolean mask of where baseline was correct
matches_baseline = y_train == baseline
# calculate baseline accuracy
baseline_accuracy = matches_baseline.mean()
baseline_accuracy

0.583554376657825

Our baseline prediction is 0 (did not survive). Our baseline accuracy is about 58.4%.
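
The baseline can also be derived from the class counts instead of being hard-coded. The following is a minimal sketch that uses only the counts shown by `value_counts()` above; the name `class_counts` is illustrative and not part of the notebook.

```python
from collections import Counter

# class counts as reported by train.survived.value_counts() above
class_counts = Counter({0: 220, 1: 157})

# the baseline prediction is the most common class (the mode)
baseline_class, baseline_hits = class_counts.most_common(1)[0]

# predicting the mode for every row is correct exactly baseline_hits times
baseline_accuracy = baseline_hits / sum(class_counts.values())

print(f'Baseline prediction: {baseline_class}')
print(f'Baseline accuracy: {baseline_accuracy:.2%}')
```

With the pandas Series itself, `y_train.mode()[0]` should give the same class without typing the counts by hand.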

#### 2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [6]:
tree = DecisionTreeClassifier(max_depth=3, random_state=123)

In [7]:
tree = tree.fit(x_train, y_train)

In [8]:
import graphviz

# export the fitted tree to DOT format, then render it as a PDF
dot_data = export_graphviz(tree, feature_names=x_train.columns,
                           class_names=['not survived', 'survived'],
                           rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)

# writes titanic_train_decision_tree.pdf and opens it in the default viewer
graph.render('titanic_train_decision_tree', view=True)

'titanic_train_decision_tree.pdf'

In [9]:
y_pred = tree.predict(x_train)
y_pred[0:5]

array([1, 0, 1, 1, 0])

In [10]:
y_pred_prob = tree.predict_proba(x_train)
y_pred_prob[0:5]

array([[0.03614458, 0.96385542],
       [0.89808917, 0.10191083],
       [0.45283019, 0.54716981],
       [0.45283019, 0.54716981],
       [0.89808917, 0.10191083]])
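
Each row of `predict_proba` holds the estimated probability of class 0 and of class 1, and `predict` returns the class with the larger probability. The sketch below reproduces that step from the five rows printed above. It assumes the column order follows `tree.classes_` being `[0, 1]`, which is what scikit-learn produces for sorted integer labels; the variable names are illustrative.

```python
# first five rows of tree.predict_proba(x_train), copied from the output above
y_pred_prob = [
    [0.03614458, 0.96385542],
    [0.89808917, 0.10191083],
    [0.45283019, 0.54716981],
    [0.45283019, 0.54716981],
    [0.89808917, 0.10191083],
]
classes = [0, 1]  # assumed order of tree.classes_

# pick the class whose probability is largest in each row
y_pred_from_prob = [classes[row.index(max(row))] for row in y_pred_prob]
print(y_pred_from_prob)
```

The result matches the first five values of `y_pred` shown earlier. Rows 3 and 4 are close to a coin flip (about 55% vs. 45%), which `predict` alone would hide.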

#### 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [11]:
tree.score(x_train, y_train)

0.830238726790451

In [12]:
# rows are actual classes, columns are predicted classes
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred))

     0    1
0  193   27
1   37  120
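
In scikit-learn's layout, row `i` is the actual class and column `j` is the predicted class, so for a binary problem the four cells read TN, FP, FN, TP from left to right, top to bottom. The sketch below unpacks the matrix shown above instead of retyping the counts. Plain lists are used here for illustration; with the real array, `confusion_matrix(y_train, y_pred).ravel()` yields the same four values in the same order.

```python
# confusion matrix from the output above: rows = actual, columns = predicted
matrix = [
    [193, 27],   # actual 0: predicted 0, predicted 1
    [37, 120],   # actual 1: predicted 0, predicted 1
]

# with 1 as the positive class and 0 as the negative class
(TN, FP), (FN, TP) = matrix

print(f'TP={TP}, TN={TN}, FP={FP}, FN={FN}')
```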


In [13]:
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.88      0.86       220
           1       0.82      0.76      0.79       157

    accuracy                           0.83       377
   macro avg       0.83      0.82      0.82       377
weighted avg       0.83      0.83      0.83       377



#### 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [14]:
# with 1 being positive and 0 being negative
TP = 120
TN = 193
FP = 27
FN = 37

accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 83.02%
True Positive Rate: 76.43%
True Negative Rate: 87.73%
False Positive Rate: 12.27%
False Negative Rate: 23.57%
Precision: 81.63%
Recall: 76.43%
F1 Score: 78.95%
Support - 1: 157
Support - 0: 220
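
The same arithmetic can be wrapped in a function so it can be reused for any model's counts. This is a sketch rather than part of the original notebook; the function name `binary_metrics` and its dictionary keys are illustrative.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return common binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as the true positive rate
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'true_positive_rate': recall,
        'true_negative_rate': tn / (tn + fp),
        'false_positive_rate': fp / (fp + tn),
        'false_negative_rate': fn / (fn + tp),
        'precision': precision,
        'recall': recall,
        'f1_score': 2 * precision * recall / (precision + recall),
        'support_positive': tp + fn,
        'support_negative': tn + fp,
    }

# counts for the max_depth=3 tree, with 1 as the positive class
metrics = binary_metrics(tp=120, tn=193, fp=27, fn=37)
for name, value in metrics.items():
    # supports are counts; everything else is a rate
    print(f'{name}: {value}' if name.startswith('support') else f'{name}: {value:.2%}')
```

Swapping the roles of the classes (`tp=193, tn=120, fp=37, fn=27`) gives the class 0 row of the classification report. Note that this sketch does not guard against division by zero, which would occur if a model never predicted the positive class.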


#### 5. Run through steps 2-4 using a different `max_depth` value.

In [15]:
# repeat steps 2-4 with a deeper tree
tree2 = DecisionTreeClassifier(max_depth=4, random_state=123)
tree2 = tree2.fit(x_train, y_train)
# visualize the new tree
dot_data = export_graphviz(tree2, feature_names=x_train.columns, class_names=['not survived', 'survived'], rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)
graph.render('titanic_train_decision_tree2', view=True)
# evaluate the in-sample predictions (this reassigns y_pred)
y_pred = tree2.predict(x_train)
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       220
           1       0.84      0.76      0.80       157

    accuracy                           0.84       377
   macro avg       0.84      0.83      0.83       377
weighted avg       0.84      0.84      0.84       377



#### 6. Which model performs better on your in-sample data?

In [16]:
print('Max Depth: 3')
print(f'In-Sample Accuracy: {tree.score(x_train, y_train):.2%}')
print()
print('Max Depth: 4')
print(f'In-Sample Accuracy: {tree2.score(x_train, y_train):.2%}')

Max Depth: 3
In-Sample Accuracy: 83.02%

Max Depth: 4
In-Sample Accuracy: 83.82%


The model with a max depth of 4 performs slightly better on the in-sample data: its accuracy is 0.80 percentage points higher than that of the model with a max depth of 3.

#### 7. Which model performs best on your out-of-sample data, the `validate` set?

In [17]:
print('Max Depth: 3')
print(f'Out-of-Sample Accuracy: {tree.score(x_validate, y_validate):.2%}')
print()
print('Max Depth: 4')
print(f'Out-of-Sample Accuracy: {tree2.score(x_validate, y_validate):.2%}')

Max Depth: 3
Out-of-Sample Accuracy: 77.16%

Max Depth: 4
Out-of-Sample Accuracy: 77.78%


The model with a max depth of 4 also performs better on the validate set, by 0.62 percentage points. Both models score about 6 percentage points lower on validate than on train.
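
Comparing train and validate accuracy side by side shows how much each model's performance drops on unseen data. The sketch below uses the four accuracies printed above; the `scores` dictionary is an illustrative structure, not something the notebook defines.

```python
# accuracies copied from the outputs above, as fractions
scores = {
    3: {'train': 0.8302, 'validate': 0.7716},
    4: {'train': 0.8382, 'validate': 0.7778},
}

for max_depth, s in scores.items():
    # a large train-validate gap is a sign of overfitting
    gap = s['train'] - s['validate']
    print(f'max_depth={max_depth}: train={s["train"]:.2%}, '
          f'validate={s["validate"]:.2%}, gap={gap:.2%}')

# the model with the highest validate accuracy
best_depth = max(scores, key=lambda d: scores[d]['validate'])
print(f'Best on validate: max_depth={best_depth}')
```

In the notebook itself, the same numbers could be collected by calling `tree.score` and `tree2.score` on each sample rather than copying them.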

---

#### Bonus:

#### 1. Work through these same exercises using the Telco dataset.

#### 2. Experiment with this model on other datasets with a higher number of output classes.

---

# Random Forest Exercises

#### Continue working in your model file with the Titanic data to do the following:

#### 1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample), setting the `random_state` accordingly and setting `min_samples_leaf = 1` and `max_depth = 10`.

In [18]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_train)

#### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [19]:
rf.score(x_train, y_train)

0.9814323607427056

In [20]:
# label rows (actual) and columns (predicted) with the class values
labels = sorted(y_train.unique())
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred), index=labels, columns=labels)

     0    1
0  218    2
1    5  152
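
As a quick sanity check, the accuracy implied by the confusion matrix should equal the value returned by `rf.score`. The sketch below uses only the numbers shown above; the variable names are illustrative.

```python
# confusion matrix from the output above: rows = actual, columns = predicted
matrix = [
    [218, 2],
    [5, 152],
]

correct = matrix[0][0] + matrix[1][1]     # diagonal = correct predictions
total = sum(sum(row) for row in matrix)   # all training observations
accuracy_from_matrix = correct / total

reported_score = 0.9814323607427056       # output of rf.score(x_train, y_train)
print(f'Accuracy from matrix: {accuracy_from_matrix:.4f}')
print(f'Matches rf.score: {abs(accuracy_from_matrix - reported_score) < 1e-12}')
```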


In [26]:
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.82      0.93      0.87       220
           1       0.88      0.72      0.79       157

    accuracy                           0.84       377
   macro avg       0.85      0.82      0.83       377
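weighted avg       0.84      0.84      0.84       377

Note: this report does not match the score (98.14%) or the confusion matrix above. The cell's execution count (26) is out of order, which suggests it was re-run after a later cell had reassigned `y_pred`, so it most likely describes a different model. The metrics computed in step 3 below are consistent with the confusion matrix.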



#### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [22]:
# counts read from the confusion matrix above, with 1 being positive and 0 being negative
TP = 152
TN = 218
FP = 2
FN = 5

accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 98.14%
True Positive Rate: 96.82%
False Positive Rate: 0.91%
True Negative Rate: 99.09%
False Negative Rate: 3.18%
Precision: 98.70%
Recall: 96.82%
F1 Score: 97.75%
Support - 1: 157
Support - 0: 220


#### 4. Run through steps increasing your `min_samples_leaf` and decreasing your `max_depth`.

In [23]:
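# fit one forest for every combination of min_samples_leaf (2-10) and max_depth (1-9)
# rf and y_pred are reassigned on every pass, so afterwards they no longer refer to the step 1 model
for n in range(2, 11):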
    for i in range(1, 10):
        rf = RandomForestClassifier(min_samples_leaf=n, max_depth=i, random_state=123)
        rf.fit(x_train, y_train)
        y_pred = rf.predict(x_train)
        class_report = sklearn.metrics.classification_report(y_train, y_pred, output_dict=True)
        print(f'Random Forest with min of {n} samples per leaf and max depth of {i}:')
        print(pd.DataFrame(class_report))
        print()

Random Forest with min of 2 samples per leaf and max depth of 1:
                    0           1  accuracy   macro avg  weighted avg
precision    0.740876    0.834951  0.766578    0.787914      0.780053
recall       0.922727    0.547771  0.766578    0.735249      0.766578
f1-score     0.821862    0.661538  0.766578    0.741700      0.755096
support    220.000000  157.000000  0.766578  377.000000    377.000000

Random Forest with min of 2 samples per leaf and max depth of 2:
                    0           1  accuracy   macro avg  weighted avg
precision    0.779528    0.821138  0.793103    0.800333      0.796856
recall       0.900000    0.643312  0.793103    0.771656      0.793103
f1-score     0.835443    0.721429  0.793103    0.778436      0.787962
support    220.000000  157.000000  0.793103  377.000000    377.000000

Random Forest with min of 2 samples per leaf and max depth of 3:
                    0           1  accuracy   macro avg  weighted avg
precision    0.814516    0.860465

Random Forest with min of 4 samples per leaf and max depth of 4:
                    0           1  accuracy   macro avg  weighted avg
precision    0.837398    0.893130  0.856764    0.865264      0.860607
recall       0.936364    0.745223  0.856764    0.840793      0.856764
f1-score     0.884120    0.812500  0.856764    0.848310      0.854294
support    220.000000  157.000000  0.856764  377.000000    377.000000

Random Forest with min of 4 samples per leaf and max depth of 5:
                    0           1  accuracy   macro avg  weighted avg
precision    0.826772    0.918699  0.856764    0.872735      0.865054
recall       0.954545    0.719745  0.856764    0.837145      0.856764
f1-score     0.886076    0.807143  0.856764    0.846609      0.853205
support    220.000000  157.000000  0.856764  377.000000    377.000000

Random Forest with min of 4 samples per leaf and max depth of 6:
                    0           1  accuracy   macro avg  weighted avg
precision    0.856557    0.917293

Random Forest with min of 6 samples per leaf and max depth of 7:
                    0           1  accuracy   macro avg  weighted avg
precision    0.852459    0.909774  0.872679    0.881117      0.876328
recall       0.945455    0.770701  0.872679    0.858078      0.872679
f1-score     0.896552    0.834483  0.872679    0.865517      0.870703
support    220.000000  157.000000  0.872679  377.000000    377.000000

Random Forest with min of 6 samples per leaf and max depth of 8:
                    0           1  accuracy   macro avg  weighted avg
precision    0.854772    0.897059  0.870027    0.875915      0.872382
recall       0.936364    0.777070  0.870027    0.856717      0.870027
f1-score     0.893709    0.832765  0.870027    0.863237      0.868329
support    220.000000  157.000000  0.870027  377.000000    377.000000

Random Forest with min of 6 samples per leaf and max depth of 9:
                    0           1  accuracy   macro avg  weighted avg
precision    0.861345    0.892086

Random Forest with min of 9 samples per leaf and max depth of 1:
                    0           1  accuracy   macro avg  weighted avg
precision    0.740876    0.834951  0.766578    0.787914      0.780053
recall       0.922727    0.547771  0.766578    0.735249      0.766578
f1-score     0.821862    0.661538  0.766578    0.741700      0.755096
support    220.000000  157.000000  0.766578  377.000000    377.000000

Random Forest with min of 9 samples per leaf and max depth of 2:
                    0           1  accuracy   macro avg  weighted avg
precision    0.780392    0.827869  0.795756    0.804131      0.800164
recall       0.904545    0.643312  0.795756    0.773929      0.795756
f1-score     0.837895    0.724014  0.795756    0.780955      0.790470
support    220.000000  157.000000  0.795756  377.000000    377.000000

Random Forest with min of 9 samples per leaf and max depth of 3:
                    0           1  accuracy   macro avg  weighted avg
precision    0.792000    0.826772

#### 5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?
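
One way to approach this question is to put the in-sample accuracies in a single ranked table instead of reading through every printed report. The sketch below includes only the runs whose accuracy is fully visible in the output above; the `runs` list and its tuple layout are illustrative.

```python
# in-sample accuracy for the runs whose full report is shown above:
# (min_samples_leaf, max_depth, accuracy)
runs = [
    (2, 1, 0.766578),
    (2, 2, 0.793103),
    (4, 4, 0.856764),
    (4, 5, 0.856764),
    (6, 7, 0.872679),
    (6, 8, 0.870027),
    (9, 1, 0.766578),
    (9, 2, 0.795756),
]

# sort from most to least accurate on the training sample
ranked = sorted(runs, key=lambda run: run[2], reverse=True)

print(f'{"min_samples_leaf":>16}  {"max_depth":>9}  {"accuracy":>8}')
for min_samples_leaf, max_depth, accuracy in ranked:
    print(f'{min_samples_leaf:>16}  {max_depth:>9}  {accuracy:>8.2%}')

best = ranked[0]
```

Among these runs, in-sample accuracy generally rises with `max_depth`, and all of them fall well short of the 98.14% reached with `max_depth=10` and `min_samples_leaf=1`. Deeper trees with smaller leaves can fit the training sample more closely, so higher in-sample accuracy alone does not indicate a better model; the validate set is needed for that comparison.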

#### After making a few models, which one has the best performance (or closest metrics) on both train and validate?