# Decision Tree Exercises

#### Using the titanic data, in your classification-exercises repository, create a notebook, model.ipynb where you will do the following:

#### 1. What is your baseline prediction? What is your baseline accuracy? *Remember: your baseline prediction for a classification problem is predicting the most prevalent class in the training dataset (the mode). When you make those predictions, what is your accuracy? This is your baseline accuracy.*



In [1]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import sklearn.metrics

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd

import acquire
import prepare

In [2]:
# read titanic data into dataframe
titanic = acquire.get_titanic_data()
# clean and split data
train, validate, test = prepare.prep_titanic(titanic)
train.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone,embark_town_Queenstown,embark_town_Southampton,sex_male
513,1,1,female,54.0,1,0,59.4,Cherbourg,0,0,0,0
169,0,3,male,28.0,0,0,56.4958,Southampton,1,0,1,1
276,0,3,female,45.0,0,0,7.75,Southampton,1,0,1,0
541,0,3,female,9.0,4,2,31.275,Southampton,0,0,1,0
406,0,3,male,51.0,0,0,7.75,Southampton,1,0,1,1


In [3]:
# determine most prevalent class
train.survived.value_counts()

0    220
1    157
Name: survived, dtype: int64

In [4]:
# create x and y versions of train, validate, and test samples
x_train = train.drop(columns=['survived', 'sex', 'embark_town'])
y_train = train.survived

x_validate = validate.drop(columns=['survived', 'sex', 'embark_town'])
y_validate = validate.survived

x_test = test.drop(columns=['survived', 'sex', 'embark_town'])
y_test = test.survived

In [5]:
# create baseline
baseline = 0
# boolean mask of where baseline was correct
matches_baseline = y_train == baseline
# calculate baseline accuracy
baseline_accuracy = matches_baseline.mean()
baseline_accuracy

0.583554376657825

Our baseline prediction is 0 (did not survive). Our baseline accuracy is about 58.4%.
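The baseline above hard-codes class 0 after reading the `value_counts()` output. As a minimal sketch, the same baseline can be derived programmatically from the labels. The class counts below mirror the output shown earlier; the names `labels`, `baseline_class`, and `baseline_acc` are illustrative, and the standard library is used only to keep the sketch dependency-free.

```python
from collections import Counter

# Rebuild the training labels from the counts shown above (220 zeros, 157 ones)
labels = [0] * 220 + [1] * 157

# The baseline prediction is the most common class (the mode)
baseline_class, baseline_count = Counter(labels).most_common(1)[0]

# Baseline accuracy is the share of observations that belong to that class
baseline_acc = baseline_count / len(labels)

print(f'Baseline prediction: {baseline_class}')
print(f'Baseline accuracy: {baseline_acc:.2%}')
```

With a pandas Series, `y_train.mode()[0]` would play the role of `baseline_class` here.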

#### 2. Fit the decision tree classifier to your training sample and transform (i.e. make predictions on the training sample)

In [6]:
tree = DecisionTreeClassifier(max_depth=3, random_state=123)

In [7]:
tree = tree.fit(x_train, y_train)

In [8]:
import graphviz
from graphviz import Graph

dot_data = export_graphviz(tree, feature_names=x_train.columns, class_names=['not survived', 'survived'], rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)

graph.render('titanic_train_decision_tree', view=True)

'titanic_train_decision_tree.pdf'

In [9]:
y_pred = tree.predict(x_train)
y_pred[0:5]

array([1, 0, 1, 1, 0])

In [10]:
y_pred_prob = tree.predict_proba(x_train)
y_pred_prob[0:5]

array([[0.03614458, 0.96385542],
       [0.89808917, 0.10191083],
       [0.45283019, 0.54716981],
       [0.45283019, 0.54716981],
       [0.89808917, 0.10191083]])

#### 3. Evaluate your in-sample results using the model score, confusion matrix, and classification report.

In [11]:
tree.score(x_train, y_train)

0.830238726790451

In [12]:
# confusion matrix: rows are actual classes, columns are predicted classes
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred))

Unnamed: 0,0,1
0,193,27
1,37,120


In [13]:
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.88      0.86       220
           1       0.82      0.76      0.79       157

    accuracy                           0.83       377
   macro avg       0.83      0.82      0.82       377
weighted avg       0.83      0.83      0.83       377



#### 4. Compute: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [14]:
# with 1 being positive and 0 being negative
TP = 120
TN = 193
FP = 27
FN = 37

accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 83.02%
True Positive Rate: 76.43%
True Negative Rate: 87.73%
False Positive Rate: 12.27%
False Negative Rate: 23.57%
Precision: 81.63%
Recall: 76.43%
F1 Score: 78.95%
Support - 1: 157
Support - 0: 220
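
The same formulas are repeated for every model in this notebook, so one option is to wrap them in a small helper. The sketch below is only an illustration: `classification_metrics` is a hypothetical name, not something defined elsewhere in this notebook, and the counts passed in are the ones from the confusion matrix above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return common binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'true_positive_rate': recall,
        'true_negative_rate': tn / (tn + fp),
        'false_positive_rate': fp / (fp + tn),
        'false_negative_rate': fn / (fn + tp),
        'precision': precision,
        'recall': recall,
        'f1_score': 2 * precision * recall / (precision + recall),
        'support_1': tp + fn,
        'support_0': tn + fp,
    }

# counts from the max_depth=3 decision tree, with 1 as the positive class
metrics = classification_metrics(tp=120, tn=193, fp=27, fn=37)
for name, value in metrics.items():
    if name.startswith('support'):
        print(f'{name}: {value}')
    else:
        print(f'{name}: {value:.2%}')
```

Note that the true positive rate and the false negative rate always sum to 1, as do the true negative rate and the false positive rate, which is a quick sanity check on hand-entered counts.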


#### 5. Run through steps 2-4 using a different `max_depth` value.

In [15]:
tree2 = DecisionTreeClassifier(max_depth=4, random_state=123)
tree2 = tree2.fit(x_train, y_train)
dot_data = export_graphviz(tree2, feature_names=x_train.columns, class_names=['not survived', 'survived'], rounded=True, filled=True, out_file=None)
graph = graphviz.Source(dot_data)
graph.render('titanic_train_decision_tree2', view=True)
y_pred = tree2.predict(x_train)
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.90      0.87       220
           1       0.84      0.76      0.80       157

    accuracy                           0.84       377
   macro avg       0.84      0.83      0.83       377
weighted avg       0.84      0.84      0.84       377



#### 6. Which model performs better on your in-sample data?

In [16]:
print('Max Depth: 3')
print(f'In-Sample Accuracy: {tree.score(x_train, y_train):.2%}')
print()
print('Max Depth: 4')
print(f'In-Sample Accuracy: {tree2.score(x_train, y_train):.2%}')

Max Depth: 3
In-Sample Accuracy: 83.02%

Max Depth: 4
In-Sample Accuracy: 83.82%


The model with a max depth of 4 works better (more accurate by 0.8 percentage points) on the in-sample data than the model with a max depth of 3.

#### 7. Which model performs best on your out-of-sample data, the `validate` set?

In [17]:
print('Max Depth: 3')
print(f'Out-of-Sample Accuracy: {tree.score(x_validate, y_validate):.2%}')
print()
print('Max Depth: 4')
print(f'Out-of-Sample Accuracy: {tree2.score(x_validate, y_validate):.2%}')

Max Depth: 3
Out-of-Sample Accuracy: 77.16%

Max Depth: 4
Out-of-Sample Accuracy: 77.78%


The model with a max depth of 4 performs better (more accurate by 0.62 percentage points) on the out-of-sample data than the model with a max depth of 3.

In [18]:
# using method from review to try out several max_depth values:
# use for loop, dictionary, and dataframe to view accuracies/differences in train/validate accuracy
scores = []
for n in range(3, 21):
    tree = DecisionTreeClassifier(max_depth=n, random_state=123)
    tree = tree.fit(x_train, y_train)
    accuracy_train = tree.score(x_train, y_train)
    accuracy_validate = tree.score(x_validate, y_validate)
    output = {'max_depth':n,
              'train_accuracy':accuracy_train,
              'validate_accuracy':accuracy_validate}
    scores.append(output)
decision_trees = pd.DataFrame(scores)
decision_trees['difference'] = decision_trees.train_accuracy - decision_trees.validate_accuracy
decision_trees

Unnamed: 0,max_depth,train_accuracy,validate_accuracy,difference
0,3,0.830239,0.771605,0.058634
1,4,0.838196,0.777778,0.060419
2,5,0.859416,0.796296,0.06312
3,6,0.888594,0.728395,0.160199
4,7,0.925729,0.771605,0.154125
5,8,0.965517,0.777778,0.187739
6,9,0.976127,0.765432,0.210695
7,10,0.984085,0.771605,0.21248
8,11,0.992042,0.783951,0.208092
9,12,0.994695,0.759259,0.235436


In [19]:
# narrow results to those with a train/validate difference of at most 0.1 (to avoid choosing an overfit model)
# sort by validate_accuracy and then by difference;
# want to see most accurate model on out-of-sample data 
# and one with lowest difference if models have same validate_accuracy
decision_trees[decision_trees.difference<=0.1].sort_values(by=['validate_accuracy', 'difference'], ascending=[False, True])

Unnamed: 0,max_depth,train_accuracy,validate_accuracy,difference
2,5,0.859416,0.796296,0.06312
1,4,0.838196,0.777778,0.060419
0,3,0.830239,0.771605,0.058634


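When looking at decision tree models with max depths ranging from 3 to 20 and keeping only those whose train and validate accuracies differ by at most 0.1, the model with a max depth of 5 has the highest validate accuracy (79.63%), so it appears to be the model which performs best with the titanic data.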
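
The selection rule used above (drop models with a large train/validate gap, then rank by validate accuracy and break ties with the smaller gap) can be sketched without pandas. The rows below are copied from the rows visible in the table above; `MAX_GAP`, `candidates`, and `best` are illustrative names.

```python
# (max_depth, train_accuracy, validate_accuracy) rows copied from the table above
rows = [
    (3, 0.830239, 0.771605),
    (4, 0.838196, 0.777778),
    (5, 0.859416, 0.796296),
    (6, 0.888594, 0.728395),
    (7, 0.925729, 0.771605),
    (8, 0.965517, 0.777778),
    (9, 0.976127, 0.765432),
    (10, 0.984085, 0.771605),
    (11, 0.992042, 0.783951),
    (12, 0.994695, 0.759259),
]

MAX_GAP = 0.1  # largest train/validate difference we accept

# keep only models whose train/validate gap is small enough
candidates = [
    {'max_depth': depth, 'validate_accuracy': validate, 'difference': train - validate}
    for depth, train, validate in rows
    if train - validate <= MAX_GAP
]

# highest validate accuracy first; the smaller gap breaks ties
candidates.sort(key=lambda row: (-row['validate_accuracy'], row['difference']))
best = candidates[0]
print(f"Best max_depth: {best['max_depth']}")
```

The gap threshold is a judgment call: a looser threshold would admit `max_depth=11`, which has a higher train accuracy but a train/validate gap above 0.2.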

------------------------

#### Bonus:

#### 1. Work through these same exercises using the Telco dataset.

#### 2. Experiment with this model on other datasets with a higher number of output classes.

---

# Random Forest Exercises

#### Continue working in your model file with titanic data to do the following:

#### 1. Fit the Random Forest classifier to your training sample and transform (i.e. make predictions on the training sample) setting the random_state accordingly and setting min_samples_leaf = 1 and max_depth = 10.

In [20]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(min_samples_leaf=1, max_depth=10, random_state=123)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_train)

#### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [21]:
rf.score(x_train, y_train)

0.9814323607427056

In [22]:
labels=sorted(y_train.unique())
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred), index=labels, columns=labels)

Unnamed: 0,0,1
0,218,2
1,5,152


In [23]:
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.98       220
           1       0.99      0.97      0.98       157

    accuracy                           0.98       377
   macro avg       0.98      0.98      0.98       377
weighted avg       0.98      0.98      0.98       377



#### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [24]:
TP = 152
TN = 218
FP = 2
FN = 5

accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 98.14%
True Positive Rate: 96.82%
False Positive Rate: 0.91%
True Negative Rate: 99.09%
False Negative Rate: 3.18%
Precision: 98.70%
Recall: 96.82%
F1 Score: 97.75%
Support - 1: 157
Support - 0: 220


#### 4. Run through steps increasing your min_samples_leaf and decreasing your max_depth.

In [70]:
for n in range(2, 6):
    for i in range(1, 10):
        rf = RandomForestClassifier(min_samples_leaf=n, max_depth=i, random_state=123)
        rf.fit(x_train, y_train)
        y_pred = rf.predict(x_train)
        class_report = sklearn.metrics.classification_report(y_train, y_pred, output_dict=True)
        print(f'Random Forest with min of {n} samples per leaf and max depth of {i}:')
        print(pd.DataFrame(class_report))
        print()

Random Forest with min of 2 samples per leaf and max depth of 1:
                    0           1  accuracy   macro avg  weighted avg
precision    0.740876    0.834951  0.766578    0.787914      0.780053
recall       0.922727    0.547771  0.766578    0.735249      0.766578
f1-score     0.821862    0.661538  0.766578    0.741700      0.755096
support    220.000000  157.000000  0.766578  377.000000    377.000000

Random Forest with min of 2 samples per leaf and max depth of 2:
                    0           1  accuracy   macro avg  weighted avg
precision    0.779528    0.821138  0.793103    0.800333      0.796856
recall       0.900000    0.643312  0.793103    0.771656      0.793103
f1-score     0.835443    0.721429  0.793103    0.778436      0.787962
support    220.000000  157.000000  0.793103  377.000000    377.000000

Random Forest with min of 2 samples per leaf and max depth of 3:
                    0           1  accuracy   macro avg  weighted avg
precision    0.814516    0.860465

Random Forest with min of 4 samples per leaf and max depth of 4:
                    0           1  accuracy   macro avg  weighted avg
precision    0.837398    0.893130  0.856764    0.865264      0.860607
recall       0.936364    0.745223  0.856764    0.840793      0.856764
f1-score     0.884120    0.812500  0.856764    0.848310      0.854294
support    220.000000  157.000000  0.856764  377.000000    377.000000

Random Forest with min of 4 samples per leaf and max depth of 5:
                    0           1  accuracy   macro avg  weighted avg
precision    0.826772    0.918699  0.856764    0.872735      0.865054
recall       0.954545    0.719745  0.856764    0.837145      0.856764
f1-score     0.886076    0.807143  0.856764    0.846609      0.853205
support    220.000000  157.000000  0.856764  377.000000    377.000000

Random Forest with min of 4 samples per leaf and max depth of 6:
                    0           1  accuracy   macro avg  weighted avg
precision    0.856557    0.917293

#### 5. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

The best performing model (accuracy=90.45%) on the in-sample data is the random forest with min_samples_leaf=2 and max_depth=7. After that, the random forests with min_samples_leaf=2, max_depth=6 and min_samples_leaf=3, max_depth=8 perform best with an accuracy of 89.39%.

#### After making a few models, which one has the best performance (or closest metrics) on both train and validate?

In [80]:
rf1 = RandomForestClassifier(min_samples_leaf=2, max_depth=7, random_state=123)
rf1.fit(x_train, y_train)
print('min_samples_leaf=2, max_depth=7')
print(f'Train Accuracy: {rf1.score(x_train, y_train):.2%}')
print(f'Validate Accuracy: {rf1.score(x_validate, y_validate):.2%}')
print(f'Difference: {(rf1.score(x_train, y_train)-rf1.score(x_validate, y_validate)):.2%}\n')
rf2 = RandomForestClassifier(min_samples_leaf=2, max_depth=6, random_state=123)
rf2.fit(x_train, y_train)
print('min_samples_leaf=2, max_depth=6')
print(f'Train Accuracy: {rf2.score(x_train, y_train):.2%}')
print(f'Validate Accuracy: {rf2.score(x_validate, y_validate):.2%}')
print(f'Difference: {(rf2.score(x_train, y_train)-rf2.score(x_validate, y_validate)):.2%}\n')
rf3 = RandomForestClassifier(min_samples_leaf=3, max_depth=8, random_state=123)
rf3.fit(x_train, y_train)
print('min_samples_leaf=3, max_depth=8')
print(f'Train Accuracy: {rf3.score(x_train, y_train):.2%}')
print(f'Validate Accuracy: {rf3.score(x_validate, y_validate):.2%}')
print(f'Difference: {(rf3.score(x_train, y_train)-rf3.score(x_validate, y_validate)):.2%}')

min_samples_leaf=2, max_depth=7
Train Accuracy: 90.45%
Validate Accuracy: 81.48%
Difference: 8.97%

min_samples_leaf=2, max_depth=6
Train Accuracy: 89.39%
Validate Accuracy: 80.86%
Difference: 8.53%

min_samples_leaf=3, max_depth=8
Train Accuracy: 89.39%
Validate Accuracy: 80.86%
Difference: 8.53%


The model with min_samples_leaf=2, max_depth=7 performs better than the other models on both the train and validate samples. However, the other two models have the same accuracy on both train and validate and also have a slightly smaller difference between their train accuracy and their validate accuracy than the best-performing model.

---
# K-Nearest Neighbor Exercises

#### Continue working in your `model` file with the titanic dataset.

#### 1. Fit a K-Nearest Neighbors classifier to your training sample and transform (i.e. make predictions on the training sample)

In [26]:
# import classifier
from sklearn.neighbors import KNeighborsClassifier
# create object
knn = KNeighborsClassifier(n_neighbors=1, weights='uniform')

In [27]:
# fit model
knn.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=1)

In [28]:
# make predictions
y_pred = knn.predict(x_train)

#### 2. Evaluate your results using the model score, confusion matrix, and classification report.

In [29]:
knn.score(x_train, y_train)

0.9973474801061007

In [30]:
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred))

Unnamed: 0,0,1
0,220,0
1,1,156


In [31]:
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       220
           1       1.00      0.99      1.00       157

    accuracy                           1.00       377
   macro avg       1.00      1.00      1.00       377
weighted avg       1.00      1.00      1.00       377



#### 3. Print and clearly label the following: Accuracy, true positive rate, false positive rate, true negative rate, false negative rate, precision, recall, f1-score, and support.

In [32]:
TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_train, y_pred).ravel()

TN, FP, FN, TP

(220, 0, 1, 156)
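
The unpacking order `TN, FP, FN, TP` follows from how the matrix is laid out: rows are actual classes, columns are predicted classes, and `ravel()` flattens row by row. The pure-Python sketch below mimics that layout on made-up toy labels; `confusion_counts`, `y_true_toy`, and `y_pred_toy` are hypothetical names used only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Build a 2x2 confusion matrix with rows = actual class, columns = predicted class."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# made-up labels, only to show the layout
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]

matrix = confusion_counts(y_true_toy, y_pred_toy)

# flatten row by row: [actual 0 row] then [actual 1 row]
tn, fp, fn, tp = [count for row in matrix for count in row]
print(matrix)
print(tn, fp, fn, tp)
```

Unpacking the counts this way avoids retyping them by hand for each model, which is where transcription mistakes tend to creep in.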

In [33]:
accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 99.73%
True Positive Rate: 99.36%
True Negative Rate: 100.00%
False Positive Rate: 0.00%
False Negative Rate: 0.64%
Precision: 100.00%
Recall: 99.36%
F1 Score: 99.68%
Support - 1: 157
Support - 0: 220


#### 4. Run through steps 2-4 setting k to 10

In [34]:
knn10 = KNeighborsClassifier(n_neighbors=10, weights='uniform')
knn10.fit(x_train, y_train)
y_pred10 = knn10.predict(x_train)
print(f'KNN Accuracy where k=10: {knn10.score(x_train, y_train):.2%}')
print()
print(sklearn.metrics.classification_report(y_train, y_pred10))
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred10))

KNN Accuracy where k=10: 72.41%

              precision    recall  f1-score   support

           0       0.71      0.89      0.79       220
           1       0.76      0.49      0.60       157

    accuracy                           0.72       377
   macro avg       0.74      0.69      0.69       377
weighted avg       0.73      0.72      0.71       377



Unnamed: 0,0,1
0,196,24
1,80,77


In [35]:
TP = 77
TN = 196
FP = 24
FN = 80

accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 72.41%
True Positive Rate: 49.04%
True Negative Rate: 89.09%
False Positive Rate: 10.91%
False Negative Rate: 50.96%
Precision: 76.24%
Recall: 49.04%
F1 Score: 59.69%
Support - 1: 157
Support - 0: 220


#### 5. Run through steps 2-4 setting k to 20

In [36]:
knn20 = KNeighborsClassifier(n_neighbors=20, weights='uniform')
knn20.fit(x_train, y_train)
y_pred20 = knn20.predict(x_train)
print(f'KNN Accuracy where k=20: {knn20.score(x_train, y_train):.2%}')
print()
print(sklearn.metrics.classification_report(y_train, y_pred20))
pd.DataFrame(sklearn.metrics.confusion_matrix(y_train, y_pred20))

KNN Accuracy where k=20: 68.44%

              precision    recall  f1-score   support

           0       0.68      0.86      0.76       220
           1       0.69      0.43      0.53       157

    accuracy                           0.68       377
   macro avg       0.69      0.65      0.65       377
weighted avg       0.69      0.68      0.67       377



Unnamed: 0,0,1
0,190,30
1,89,68


In [37]:
TP = 68
TN = 190
FP = 30
FN = 89

accuracy = (TP+TN)/(TP+TN+FP+FN)
print(f'Accuracy: {accuracy:.2%}')
true_pos_rate = TP/(TP+FN)
print(f'True Positive Rate: {true_pos_rate:.2%}')
true_neg_rate = TN/(TN+FP)
print(f'True Negative Rate: {true_neg_rate:.2%}')
false_pos_rate = FP/(FP+TN)
print(f'False Positive Rate: {false_pos_rate:.2%}')
false_neg_rate = FN/(FN+TP)
print(f'False Negative Rate: {false_neg_rate:.2%}')
precision = TP/(TP+FP)
print(f'Precision: {precision:.2%}')
recall = TP/(TP+FN)
print(f'Recall: {recall:.2%}')
f1_score = 2*(precision*recall)/(precision+recall)
print(f'F1 Score: {f1_score:.2%}')
support_1 = TP+FN
print(f'Support - 1: {support_1}')
support_0 = TN+FP
print(f'Support - 0: {support_0}')

Accuracy: 68.44%
True Positive Rate: 43.31%
True Negative Rate: 86.36%
False Positive Rate: 13.64%
False Negative Rate: 56.69%
Precision: 69.39%
Recall: 43.31%
F1 Score: 53.33%
Support - 1: 157
Support - 0: 220


#### 6. What are the differences in the evaluation metrics? Which performs better on your in-sample data? Why?

In [38]:
print('Classification Reports of KNN classifiers on train set:\n')
print(f'k=1:\n{sklearn.metrics.classification_report(y_train, y_pred)}\n')
print(f'k=10:\n{sklearn.metrics.classification_report(y_train, y_pred10)}\n')
print(f'k=20:\n{sklearn.metrics.classification_report(y_train, y_pred20)}\n')

Classification Reports of KNN classifiers on train set:

k=1:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       220
           1       1.00      0.99      1.00       157

    accuracy                           1.00       377
   macro avg       1.00      1.00      1.00       377
weighted avg       1.00      1.00      1.00       377


k=10:
              precision    recall  f1-score   support

           0       0.71      0.89      0.79       220
           1       0.76      0.49      0.60       157

    accuracy                           0.72       377
   macro avg       0.74      0.69      0.69       377
weighted avg       0.73      0.72      0.71       377


k=20:
              precision    recall  f1-score   support

           0       0.68      0.86      0.76       220
           1       0.69      0.43      0.53       157

    accuracy                           0.68       377
   macro avg       0.69      0.65      0.65       377
weighted avg       0.69      0.68      0.67       377

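The model where k=1 (the `knn` object above, created with `n_neighbors=1`) performs best on our in-sample data. It has a higher accuracy and a higher F1 score than the models where k is equal to 10 or 20. This is expected: with k=1, each training observation is its own nearest neighbor, so the model essentially memorizes the training set, and its near-perfect in-sample accuracy says little about how well it will generalize.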
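
Because the first classifier was created with `n_neighbors=1`, every training point is its own nearest neighbor when we predict on the training set. The sketch below illustrates that effect with a tiny hand-rolled nearest-neighbor vote. It is not scikit-learn's implementation, and the data points and labels are made up.

```python
from collections import Counter
import math

def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k training points closest to query."""
    # sort training indices by Euclidean distance to the query point
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# hypothetical (age, fare) pairs with made-up survival labels
toy_x = [(22.0, 7.25), (38.0, 71.28), (26.0, 7.92), (35.0, 53.10), (54.0, 51.86), (2.0, 21.07)]
toy_y = [0, 1, 1, 1, 0, 0]

def train_accuracy(k):
    """Accuracy when predicting the same points the model was fit on."""
    predictions = [knn_predict(toy_x, toy_y, point, k) for point in toy_x]
    return sum(p == t for p, t in zip(predictions, toy_y)) / len(toy_y)

acc_k1 = train_accuracy(1)
acc_k5 = train_accuracy(5)
print(f'k=1 train accuracy: {acc_k1:.2%}')
print(f'k=5 train accuracy: {acc_k5:.2%}')
```

With k=1 the training accuracy is perfect whenever no two identical points carry different labels. The single miss in the titanic training set is consistent with such a duplicate or tie, although the notebook does not confirm the cause.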

#### 7. Which model performs best on our out-of-sample data from `validate`?

In [39]:
print('Accuracy of KNN classifiers on validate set:\n')
print(f'KNN Accuracy where k=1: {knn.score(x_validate, y_validate)}')
print(f'KNN Accuracy where k=10: {knn10.score(x_validate, y_validate)}')
print(f'KNN Accuracy where k=20: {knn20.score(x_validate, y_validate)}')

Accuracy of KNN classifiers on validate set:

KNN Accuracy where k=1: 0.6234567901234568
KNN Accuracy where k=10: 0.654320987654321
KNN Accuracy where k=20: 0.654320987654321


In [40]:
print('Classification Reports of KNN classifiers on validate set:\n')
print(f'k=1:\n{sklearn.metrics.classification_report(y_validate, knn.predict(x_validate))}\n')
print(f'k=10:\n{sklearn.metrics.classification_report(y_validate, knn10.predict(x_validate))}\n')
print(f'k=20:\n{sklearn.metrics.classification_report(y_validate, knn20.predict(x_validate))}\n')

Classification Reports of KNN classifiers on validate set:

k=1:
              precision    recall  f1-score   support

           0       0.67      0.70      0.68        94
           1       0.56      0.51      0.53        68

    accuracy                           0.62       162
   macro avg       0.61      0.61      0.61       162
weighted avg       0.62      0.62      0.62       162


k=10:
              precision    recall  f1-score   support

           0       0.65      0.86      0.74        94
           1       0.66      0.37      0.47        68

    accuracy                           0.65       162
   macro avg       0.66      0.61      0.61       162
weighted avg       0.66      0.65      0.63       162


k=20:
              precision    recall  f1-score   support

           0       0.66      0.85      0.74        94
           1       0.65      0.38      0.48        68

    accuracy                           0.65       162
   macro avg       0.65      0.62      0.61      

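The models where k=10 and k=20 tie on out-of-sample accuracy (65.43%), and both beat the `n_neighbors=1` model (62.35%). The k=20 model has marginally higher recall and F1 score for the survived class (0.38 and 0.48 versus 0.37 and 0.47), so it performs best on the out-of-sample data by a narrow margin.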

---
# Logistic Regression Exercises

#### 1. Create a model that includes age in addition to fare and pclass. Does this model perform better than your baseline?

In [41]:
from sklearn.linear_model import LogisticRegression
import sklearn.linear_model

In [46]:
# create logistic regression object
logreg = LogisticRegression(random_state=123)

In [47]:
# assign dataframe with columns age, fare, and pclass to variable
xtrain_afp = x_train[['age', 'fare', 'pclass']]

# fit logreg to train, only with desired features
logreg.fit(xtrain_afp, y_train)

# make predictions
y_pred = logreg.predict(xtrain_afp)

# print classification report
print(sklearn.metrics.classification_report(y_train, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.81      0.76       220
           1       0.67      0.54      0.59       157

    accuracy                           0.69       377
   macro avg       0.69      0.67      0.67       377
weighted avg       0.69      0.69      0.69       377



In [50]:
print(f'Model Accuracy: {logreg.score(xtrain_afp, y_train):.2%}')
print(f'Baseline Accuracy: {baseline_accuracy:.2%}')

Model Accuracy: 69.50%
Baseline Accuracy: 58.36%


This model performs better than our baseline. The model has an accuracy of 69.50%, which is about 11.1 percentage points higher than the baseline accuracy of 58.36%.
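
When comparing two accuracies, it helps to be explicit about whether the gap is an absolute difference (percentage points) or a relative improvement. The short sketch below works through both using the two accuracies printed above; the variable names are illustrative.

```python
# accuracies as printed above, expressed as fractions
model_accuracy = 0.6950
baseline_accuracy_printed = 0.5836

# absolute difference, reported in percentage points
point_difference = model_accuracy - baseline_accuracy_printed

# relative improvement over the baseline
relative_improvement = point_difference / baseline_accuracy_printed

print(f'Difference: {point_difference * 100:.2f} percentage points')
print(f'Relative improvement over baseline: {relative_improvement:.1%}')
```

The two figures answer different questions, so stating which one is meant avoids ambiguity when writing up results.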

#### 2. Include sex in your model as well. Note that you'll need to encode or create a dummy variable of this feature before including it in a model.

In [65]:
# create copy of logistic regression object
logreg2 = LogisticRegression(random_state=123)
# filter down to desired features
xtrain_safp = x_train[['sex_male', 'age', 'fare', 'pclass']]
# fit logreg2
logreg2.fit(xtrain_safp, y_train)
# make predictions
y_pred2 = logreg2.predict(xtrain_safp)
# print classification report
print(sklearn.metrics.classification_report(y_train, y_pred2))

              precision    recall  f1-score   support

           0       0.82      0.85      0.83       220
           1       0.77      0.73      0.75       157

    accuracy                           0.80       377
   macro avg       0.79      0.79      0.79       377
weighted avg       0.80      0.80      0.80       377



#### 3. Try out other combinations of features and models.

In [52]:
train.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embark_town,alone,embark_town_Queenstown,embark_town_Southampton,sex_male
513,1,1,female,54.0,1,0,59.4,Cherbourg,0,0,0,0
169,0,3,male,28.0,0,0,56.4958,Southampton,1,0,1,1
276,0,3,female,45.0,0,0,7.75,Southampton,1,0,1,0
541,0,3,female,9.0,4,2,31.275,Southampton,0,0,1,0
406,0,3,male,51.0,0,0,7.75,Southampton,1,0,1,1


In [54]:
logreg3 = LogisticRegression(random_state=123)
xtrain_spa = x_train[['sibsp', 'parch', 'alone']]
logreg3.fit(xtrain_spa, y_train)
y_pred3 = logreg3.predict(xtrain_spa)
print(sklearn.metrics.classification_report(y_train, y_pred3))

              precision    recall  f1-score   support

           0       0.65      0.87      0.75       220
           1       0.66      0.35      0.46       157

    accuracy                           0.66       377
   macro avg       0.66      0.61      0.60       377
weighted avg       0.66      0.66      0.63       377



In [56]:
logreg4 = LogisticRegression(random_state=123)
xtrain_asp = x_train[['alone', 'sex_male', 'pclass']]
logreg4.fit(xtrain_asp, y_train)
y_pred4 = logreg4.predict(xtrain_asp)
print(sklearn.metrics.classification_report(y_train, y_pred4))

              precision    recall  f1-score   support

           0       0.80      0.85      0.82       220
           1       0.77      0.69      0.73       157

    accuracy                           0.79       377
   macro avg       0.78      0.77      0.78       377
weighted avg       0.79      0.79      0.79       377



In [58]:
logreg5 = LogisticRegression(C=0.1, random_state=123)
logreg5.fit(xtrain_safp, y_train)
y_pred5 = logreg5.predict(xtrain_safp)
print(sklearn.metrics.classification_report(y_train, y_pred5))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       220
           1       0.79      0.71      0.75       157

    accuracy                           0.80       377
   macro avg       0.80      0.79      0.79       377
weighted avg       0.80      0.80      0.80       377



In [59]:
logreg6 = LogisticRegression(C=10, random_state=123)
logreg6.fit(xtrain_safp, y_train)
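# predict with logreg6 (the report shown below came from logreg5.predict, so it duplicates logreg5's report)
y_pred6 = logreg6.predict(xtrain_safp)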
print(sklearn.metrics.classification_report(y_train, y_pred6))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84       220
           1       0.79      0.71      0.75       157

    accuracy                           0.80       377
   macro avg       0.80      0.79      0.79       377
weighted avg       0.80      0.80      0.80       377



#### 4. Use your best 3 models to predict and evaluate on your validate sample.

The best 3 models are logreg2 (80% accuracy), logreg4 (79% accuracy), and logreg5 (80% accuracy).

In [66]:
y_pred2 = logreg2.predict(x_validate[['sex_male', 'age', 'fare', 'pclass']])
print('logreg2 model')
print(sklearn.metrics.classification_report(y_validate, y_pred2))

y_pred4 = logreg4.predict(x_validate[['alone', 'sex_male', 'pclass']])
print('logreg4 model')
print(sklearn.metrics.classification_report(y_validate, y_pred4))

y_pred5 = logreg5.predict(x_validate[['sex_male', 'age', 'fare', 'pclass']])
print('logreg5 model')
print(sklearn.metrics.classification_report(y_validate, y_pred5))

logreg2 model
              precision    recall  f1-score   support

           0       0.82      0.86      0.84        94
           1       0.79      0.74      0.76        68

    accuracy                           0.81       162
   macro avg       0.81      0.80      0.80       162
weighted avg       0.81      0.81      0.81       162

logreg4 model
              precision    recall  f1-score   support

           0       0.76      0.83      0.80        94
           1       0.73      0.65      0.69        68

    accuracy                           0.75       162
   macro avg       0.75      0.74      0.74       162
weighted avg       0.75      0.75      0.75       162

logreg5 model
              precision    recall  f1-score   support

           0       0.79      0.91      0.85        94
           1       0.85      0.66      0.74        68

    accuracy                           0.81       162
   macro avg       0.82      0.79      0.80       162
weighted avg       0.81      0.8

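The logreg2 and logreg5 models tie on validate accuracy (81%), ahead of logreg4 (75%). The logreg2 model has the higher recall and F1 score for the survived class (0.74 and 0.76 versus 0.66 and 0.74), so it appears to perform the best on the validate sample.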

#### 5. Choose your best model from the validation performance, and evaluate it on the test dataset. How do the performance metrics compare to validate? To train?

In [67]:
y_pred = logreg2.predict(x_test[['sex_male', 'age', 'fare', 'pclass']])
print('logreg2 model on test')
print(sklearn.metrics.classification_report(y_test, y_pred))

y_pred = logreg2.predict(x_validate[['sex_male', 'age', 'fare', 'pclass']])
print('logreg2 model on validate')
print(sklearn.metrics.classification_report(y_validate, y_pred))

y_pred = logreg2.predict(x_train[['sex_male', 'age', 'fare', 'pclass']])
print('logreg2 model on train')
print(sklearn.metrics.classification_report(y_train, y_pred))

logreg2 model on test
              precision    recall  f1-score   support

           0       0.78      0.78      0.78        79
           1       0.69      0.68      0.68        56

    accuracy                           0.74       135
   macro avg       0.73      0.73      0.73       135
weighted avg       0.74      0.74      0.74       135

logreg2 model on validate
              precision    recall  f1-score   support

           0       0.82      0.86      0.84        94
           1       0.79      0.74      0.76        68

    accuracy                           0.81       162
   macro avg       0.81      0.80      0.80       162
weighted avg       0.81      0.81      0.81       162

logreg2 model on train
              precision    recall  f1-score   support

           0       0.82      0.85      0.83       220
           1       0.77      0.73      0.75       157

    accuracy                           0.80       377
   macro avg       0.79      0.79      0.79       377
weighted avg       0.80      0.80      0.80       377

This model performs best on the validate sample (81% accuracy) and nearly as well on the train sample (80%). It had the lowest performance on the test sample (74% accuracy), a drop of about 6 to 7 percentage points from train and validate.