# Exercises

## Using the Titanic data, in your classification-exercises repository, create a notebook, `decision_tree.ipynb`, where you will do the following:

In [1]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

import matplotlib.pyplot as plt
import seaborn as sns

import acquire
import prepare
import model

In [2]:
train, validate, test = prepare.split_data(
    prepare.prep_titanic(acquire.get_titanic_data()),
    'survived',
)

File exists - reading CSV file


In [3]:
train,validate,test = model.preprocess_titanic(train,validate,test)

### 1. What is your baseline prediction? What is your baseline accuracy? *Remember: your baseline prediction for a classification problem is the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.*

In [4]:
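# Baseline prediction is the most frequent class in train (here, 0 = did not survive)
df = pd.DataFrame({'survived': train.survived, 'prediction': train.survived.mode()[0]})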

In [5]:
(df.survived == df.prediction).mean()

0.6161048689138576
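A minimal sketch of the same baseline calculation on a small made-up label series, deriving the prediction from the mode rather than typing the class in by hand. The name `y_toy` and its values are hypothetical and are not the Titanic data.

```python
import pandas as pd

# Hypothetical toy labels standing in for train.survived
y_toy = pd.Series([0, 0, 1, 0, 1, 0, 1, 0])

# The baseline prediction is the most frequent class (the mode)
baseline_prediction = y_toy.mode()[0]

# Baseline accuracy is the share of rows that match that single prediction
baseline_accuracy = (y_toy == baseline_prediction).mean()

print(baseline_prediction, baseline_accuracy)
```

Any model worth keeping should beat this number on the same data.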

### 2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [6]:
X_train = train.drop(columns='survived')
y_train = train.survived

X_val = validate.drop(columns='survived')
y_val = validate.survived

X_test = test.drop(columns='survived')
y_test = test.survived

In [7]:
dt = DecisionTreeClassifier(max_depth=3, random_state=123)

In [8]:
dt = dt.fit(X_train,y_train)

In [9]:
y_pred = dt.predict(X_train)

In [10]:
pd.DataFrame({'actual': train.survived,'prediction':y_pred})

| | actual | prediction |
|---|---|---|
| 580 | 1 | 1 |
| 140 | 0 | 1 |
| 747 | 1 | 1 |
| 615 | 1 | 1 |
| 132 | 0 | 0 |
| ... | ... | ... |
| 461 | 0 | 0 |
| 344 | 0 | 0 |
| 513 | 1 | 1 |
| 467 | 0 | 0 |


### 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [11]:
dt.score(X_train,y_train)

0.8220973782771536

In [14]:
dt.score(X_val,y_val)

0.8370786516853933

In [15]:
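# confusion_matrix orders rows (actual) and columns (predicted) by sorted label,
# so the labels used for the index and columns must be sorted too
labels = sorted(y_train.unique())
pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

| | 0 | 1 |
|---|---|---|
| 0 | 312 | 17 |
| 1 | 78 | 127 |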
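scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, both ordered by sorted label. For a 0/1 target, `.ravel()` therefore yields the counts in the order tn, fp, fn, tp when 1 is treated as the positive class. A small sketch with made-up labels (the two lists are hypothetical, not the Titanic data):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# Passing labels explicitly pins the row/column order: 0 first, then 1
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

Unpacking the counts this way avoids copying them by hand from a printed table.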


In [16]:
print(classification_report(y_train,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.95      0.87       329
           1       0.88      0.62      0.73       205

    accuracy                           0.82       534
   macro avg       0.84      0.78      0.80       534
weighted avg       0.83      0.82      0.81       534



### 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [17]:
print(classification_report(y_train,y_pred))

              precision    recall  f1-score   support

           0       0.80      0.95      0.87       329
           1       0.88      0.62      0.73       205

    accuracy                           0.82       534
   macro avg       0.84      0.78      0.80       534
weighted avg       0.83      0.82      0.81       534
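The per-class numbers in the report can also be pulled out programmatically instead of read off the printed text. A sketch using the `output_dict=True` option of `classification_report` on made-up labels (the two lists are hypothetical):

```python
from sklearn.metrics import classification_report

# Hypothetical actual and predicted labels
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# With output_dict=True the report is a nested dict instead of a string
report = classification_report(y_true, y_hat, output_dict=True)

# Class keys are the labels converted to strings
pos = report['1']
print(pos['precision'], pos['recall'], pos['f1-score'], pos['support'])
print(report['accuracy'])
```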



In [18]:
def compute(tp,tn,fp,fn):
    """Print classification metrics from confusion-matrix counts for the positive class."""
    accuracy = (tp+tn)/(tp+tn+fp+fn)
    true_pos_rate = tp/(tp+fn)
    false_pos_rate = fp/(fp+tn)
    true_neg_rate = tn/(tn+fp)
    false_neg_rate = fn/(tp+fn)
    precision = tp/(tp+fp)
    recall = tp/(tp+fn)  # same quantity as the true positive rate
    f1_score = (2*(precision * recall))/(precision + recall)
    support = tp + fn + fp + tn  # total number of observations

    print(f'accuracy: {accuracy:2f}')
    print(f'true positive rate: {true_pos_rate:2f}')
    print(f'false positive rate: {false_pos_rate:2f}')
    print(f'true negative rate: {true_neg_rate:2f}')
    print(f'false negative rate: {false_neg_rate:2f}')
    print(f'precision: {precision:2f}')
    print(f'recall: {recall:2f}')
    print(f'f1-score: {f1_score:2f}')
    print(f'support: {support:2f}')

In [19]:
# Positive class is survived (1). These counts agree with the class 1 row of the
# classification report (precision 0.88, recall 0.62).
tp = 127
tn = 312
fp = 17
fn = 78

In [20]:
compute(tp,tn,fp,fn)

accuracy: 0.822097
true positive rate: 0.619512
false positive rate: 0.051672
true negative rate: 0.948328
false negative rate: 0.380488
precision: 0.881944
recall: 0.619512
f1-score: 0.727794
support: 534.000000
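The four rates come in complementary pairs: true positive rate and false negative rate sum to 1, as do true negative rate and false positive rate. A sketch of a variant of `compute` that returns the values instead of printing them, which makes such checks easy. The function name `rates` and the counts passed to it are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def rates(tp, tn, fp, fn):
    """Return common metrics for the positive class as a dict."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # same quantity as the true positive rate
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'true_positive_rate': recall,
        'false_positive_rate': fp / (fp + tn),
        'true_negative_rate': tn / (tn + fp),
        'false_negative_rate': fn / (fn + tp),
        'precision': precision,
        'f1_score': 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts, not taken from the Titanic model
m = rates(tp=30, tn=50, fp=10, fn=10)

for name, value in m.items():
    print(f'{name}: {value:.4f}')
```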


### 5. Run through steps 2-4 using a different `max_depth` value.

In [21]:
X_train = train.drop(columns='survived')
y_train = train.survived

X_val = validate.drop(columns='survived')
y_val = validate.survived

X_test = test.drop(columns='survived')
y_test = test.survived

In [22]:
dt = DecisionTreeClassifier(max_depth=5, random_state=123)

In [23]:
dt = dt.fit(X_train,y_train)

In [24]:
y_pred = dt.predict(X_train)

In [25]:
pd.DataFrame({'actual': train.survived,'prediction':y_pred})

| | actual | prediction |
|---|---|---|
| 580 | 1 | 1 |
| 140 | 0 | 0 |
| 747 | 1 | 1 |
| 615 | 1 | 1 |
| 132 | 0 | 1 |
| ... | ... | ... |
| 461 | 0 | 0 |
| 344 | 0 | 0 |
| 513 | 1 | 1 |
| 467 | 0 | 0 |


In [26]:
dt.score(X_train,y_train)

0.8520599250936329

In [27]:
dt.score(X_val,y_val)

0.8258426966292135

In [28]:
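# confusion_matrix orders rows (actual) and columns (predicted) by sorted label,
# so the labels used for the index and columns must be sorted too
labels = sorted(y_train.unique())
pd.DataFrame(confusion_matrix(y_train, y_pred), index=labels, columns=labels)

| | 0 | 1 |
|---|---|---|
| 0 | 317 | 12 |
| 1 | 67 | 138 |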


In [29]:
print(classification_report(y_train,y_pred))

              precision    recall  f1-score   support

           0       0.83      0.96      0.89       329
           1       0.92      0.67      0.78       205

    accuracy                           0.85       534
   macro avg       0.87      0.82      0.83       534
weighted avg       0.86      0.85      0.85       534



In [30]:
# Positive class is survived (1). These counts agree with the class 1 row of the
# classification report (precision 0.92, recall 0.67).
tp = 138
tn = 317
fp = 12
fn = 67

In [31]:
compute(tp,tn,fp,fn)

accuracy: 0.852060
true positive rate: 0.673171
false positive rate: 0.036474
true negative rate: 0.963526
false negative rate: 0.326829
precision: 0.920000
recall: 0.673171
f1-score: 0.777465
support: 534.000000


### 6. Which model performs better on your in-sample data?

> `DecisionTreeClassifier(max_depth=5, random_state=123)`: train accuracy is 0.852, versus 0.822 for `max_depth=3`.

### 7. Which model performs best on your out-of-sample data, the `validate` set?

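> `DecisionTreeClassifier(max_depth=3, random_state=123)`: validate accuracy is 0.837, versus 0.826 for `max_depth=5`. The deeper tree's in-sample gain does not carry over to unseen data, which suggests it is starting to overfit.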
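Comparing in-sample and out-of-sample accuracy for several depths at once makes this kind of question easier to answer. A sketch under stated assumptions: the data here is synthetic (generated with `make_classification`) purely so the block runs on its own, and the names `X_tr`, `X_va`, `y_tr`, `y_va` and `results` are hypothetical. In the notebook, `X_train`, `y_train`, `X_val` and `y_val` would take their place.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic dataset
X, y = make_classification(n_samples=500, n_features=8, random_state=123)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=123, stratify=y
)

results = []
for depth in range(1, 8):
    # Fit on train only, then score on both train and validate
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_tr, y_tr)
    results.append({
        'max_depth': depth,
        'train_accuracy': tree.score(X_tr, y_tr),
        'validate_accuracy': tree.score(X_va, y_va),
    })

# A growing gap between train and validate accuracy hints at overfitting
for row in results:
    gap = row['train_accuracy'] - row['validate_accuracy']
    print(row['max_depth'],
          round(row['train_accuracy'], 3),
          round(row['validate_accuracy'], 3),
          round(gap, 3))
```

The test set stays untouched during this comparison and is only scored once a final depth has been chosen.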