# CPE695 HW3
**By: Tyler Bryk**

In this assignment, we explore the Titanic dataset and build a decision tree to predict whether a passenger on board survived.

**Step 1:** Read in Titanic.csv and observe a few samples; some features are categorical and others are numerical. Take a random 80% of the samples for training and the remaining 20% for testing.



*In this step, we load our Titanic.csv dataset into a Pandas DataFrame. First, we convert values in the sex column to be numerical, female->0 and male->1, then we do the same for the pclass column, 1st->1, 2nd->2, 3rd->3. The age column contains several missing values, so we will impute those missing instances with the median value for the age column. Lastly, we will split our dataset into 80% training data, and 20% testing data.*

In [1]:
# Import Libraries
import pandas as pd
import graphviz
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_graphviz
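from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; this thin wrapper
# keeps the plot_confusion_matrix(...) calls in later cells working.
def plot_confusion_matrix(estimator, X, y, **kwargs):
    return ConfusionMatrixDisplay.from_estimator(estimator, X, y, **kwargs)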

In [2]:
# Load Data and Split 80:20
data = pd.read_csv('Titanic.csv', index_col=0)
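# Assign the result back instead of using inplace=True on a column selection,
# which may silently fail to update the DataFrame in newer pandas versions
data['sex'] = data['sex'].replace(['female', 'male'], [0, 1])
data['pclass'] = data['pclass'].replace(['1st', '2nd', '3rd'], [1, 2, 3])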
data['age'] = data['age'].fillna(data['age'].median())
x = data[['pclass', 'sex', 'age', 'sibsp']]
y = data['survived']
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=10413641)
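
As a quick sanity check of the encoding and imputation logic, the sketch below applies the same steps to a small hypothetical frame. The rows are made up for illustration and are not taken from Titanic.csv; `toy` is an assumed name.

```python
import pandas as pd

# Hypothetical toy rows (not from Titanic.csv) with one missing age
toy = pd.DataFrame({
    'pclass': ['1st', '3rd', '2nd', '3rd'],
    'sex': ['female', 'male', 'male', 'female'],
    'age': [29.0, None, 40.0, 18.0],
})

# Encode the categorical columns with explicit mappings
toy['sex'] = toy['sex'].map({'female': 0, 'male': 1})
toy['pclass'] = toy['pclass'].map({'1st': 1, '2nd': 2, '3rd': 3})

# Impute the missing age with the median of the observed ages (18, 29, 40)
toy['age'] = toy['age'].fillna(toy['age'].median())

print(toy)
```

The missing age in the second row is filled with 29.0, the median of the three observed ages.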

# Decision Tree Modeling
**Step 2:** Fit a decision tree model using independent variables ‘pclass + sex + age + sibsp’ and dependent variable ‘survived’. Then plot the full tree.

*In this step, we fit a decision tree classifier to the training data using sklearn's default parameters. The initial tree is then plotted and saved to a DT-Initial.png file, and it is also illustrated below in the notebook. The visualization shows a tree depth of 15, which makes the tree large and hard to interpret. Recall that the sex column was encoded numerically as female->0 and male->1, so at the root node females fall in the left branch and males in the right. Despite its size, the tree makes logical sense, and it agrees with the example tree that the professor illustrated in the homework assignment. In the next step, we will determine the accuracy of the tree, and then prune it to a smaller depth.*

In [3]:
# Create a Decision Tree with Default Parameters
clf = DecisionTreeClassifier(random_state=10413641).fit(xTrain, yTrain)
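
The depth of a fitted tree can also be read programmatically rather than from the plot. The sketch below is a minimal illustration on synthetic stand-in data (an assumption, since Titanic.csv is not bundled here); `full` and `small` are hypothetical names. In the notebook, the same methods could be called on `clf`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features (not the Titanic set)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

# Default parameters: the tree grows until its leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X, y)

# Restricted tree for comparison
small = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=8, random_state=0).fit(X, y)

print("Full tree:  depth =", full.get_depth(), " leaves =", full.get_n_leaves())
print("Small tree: depth =", small.get_depth(), " leaves =", small.get_n_leaves())
```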

In [4]:
# Plot the Decision Tree using Graphviz
dot_data = export_graphviz(clf, filled=True, feature_names = ['pclass','sex','age','sibsp'], class_names=['0','1'])
graph = graphviz.Source(dot_data, format="png") 
graph.render("DT-Initial")
graph

ExecutableNotFound: failed to execute ['dot', '-Tpng', '-O', 'DT-Initial'], make sure the Graphviz executables are on your systems' PATH
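
The `ExecutableNotFound` error above means the Graphviz `dot` binary is not on the PATH. One possible workaround, sketched here under the assumption that only matplotlib is available, is sklearn's `plot_tree`, which draws the tree without calling Graphviz. The data is a synthetic stand-in, and `demo_clf` and the output file name `DT-Demo.png` are hypothetical; in the notebook, `clf` would take the place of `demo_clf`.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (an assumption, not the Titanic set)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
demo_clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# plot_tree renders with matplotlib, so no Graphviz executable is needed
fig, ax = plt.subplots(figsize=(12, 6))
artists = plot_tree(demo_clf, filled=True,
                    feature_names=['pclass', 'sex', 'age', 'sibsp'],
                    class_names=['0', '1'], ax=ax)
fig.savefig("DT-Demo.png", dpi=150)
```

For a very deep tree such as the unpruned one, the matplotlib rendering can become crowded, so a larger `figsize` or a `max_depth` argument to `plot_tree` may help readability.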

# Sampling Errors
**Step 3:** Print out the performance measures of the full model:
- In-sample percent survivors correctly predicted (on training set)
- In-sample percent fatalities correctly predicted (on training set)
- Out-of-sample percent survivors correctly predicted (on test set)
- Out-of-sample percent fatalities correctly predicted (on test set)

*In this step, we calculate the in-sample and out-of-sample accuracy rates. After training the model on the training data, we evaluate it on the training data again (in-sample) and then on the testing data (out-of-sample). The confusion matrix for each evaluation is displayed below. The in-sample evaluation gave a promising overall accuracy of roughly 88%, with the positive class (survivors) correctly predicted 76% of the time and the negative class (fatalities) correctly predicted 95% of the time. While the in-sample accuracy is very good, the out-of-sample accuracy was roughly 10 percentage points lower across the board, indicating that our decision tree model is overfit. Knowing this, we will now prune the tree and search for the optimal parameters. We will then evaluate the pruned model on the same performance metrics, and hope that the testing accuracy more closely resembles the training accuracy.*

In [None]:
# In-Sample Error
insamplePreds = clf.predict(xTrain)
tn, fp, fn, tp = confusion_matrix(yTrain, insamplePreds).ravel()
print('Percent Survivors Correctly Predicted:\t', 100*(tp / (tp + fn)), '%')
print('Percent Fatalities Correctly Predicted:\t', 100*(tn / (tn + fp)), '%')
print('Overall In-Sample Accuracy:\t\t', 100*accuracy_score(yTrain, insamplePreds), '%\n')
plot_confusion_matrix(clf, xTrain, yTrain, cmap=plt.cm.Blues, display_labels = ['Did Not Survive', 'Survived'], values_format='d')
plt.show()

In [None]:
# Out-of-Sample Error
outsamplePreds = clf.predict(xTest)
tn, fp, fn, tp = confusion_matrix(yTest, outsamplePreds).ravel()
print('Percent Survivors Correctly Predicted:\t', 100*(tp / (tp + fn)), '%')
print('Percent Fatalities Correctly Predicted:\t', 100*(tn / (tn + fp)), '%')
print('Overall Out-of-Sample Accuracy:\t\t', 100*accuracy_score(yTest, outsamplePreds), '%\n')
plot_confusion_matrix(clf, xTest, yTest, cmap=plt.cm.Blues, display_labels = ['Did Not Survive', 'Survived'], values_format='d')
plt.show()
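
To make the per-class rates above concrete, here is a small worked example on hypothetical labels (made up for illustration, not model output). For binary labels 0 and 1, `confusion_matrix(...).ravel()` returns the counts in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = survived, 0 = did not survive
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Percent of actual survivors predicted as survivors (recall for class 1)
survivor_rate = 100 * tp / (tp + fn)
# Percent of actual fatalities predicted as fatalities (recall for class 0)
fatality_rate = 100 * tn / (tn + fp)
# Percent of all passengers predicted correctly
overall = 100 * (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)          # 5 1 1 3
print(survivor_rate, overall)  # 75.0 80.0
```

Here 3 of 4 survivors and 5 of 6 fatalities are predicted correctly, so the survivor rate is 75%, the fatality rate is about 83.3%, and the overall accuracy is 80%.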

# Cross Validation

**Step 4:** Use cross-validation to find the best parameter to prune the tree. You should be able to plot a graph with the ‘tree size’ as the x-axis and ‘number of misclassifications’ as the y-axis. Find the minimum number of misclassifications and choose the corresponding tree size to prune the tree.

*In this step, we use cross-validation to prune the tree and find the optimal model parameters. We use sklearn's GridSearchCV to optimize our model. A k-fold CV with 5 folds was selected because it yielded the highest overall accuracy among the 5-, 10-, and 15-fold runs that were tried. The parameters tested for sklearn's decision tree classifier are displayed in the output below, along with their optimal values. For our tree, the optimal max depth appears to be 3 or 4 levels, and the max number of leaf nodes should be 8. The depth results disagree: the grid search suggests 4, whereas the graph of tree size vs. error suggests 3. We will try both depths on our final pruned model and select whichever gives the highest accuracy. Looking ahead, a max depth of 4 was the ideal size.*

In [None]:
# Use 5-Fold Cross Validation to Tune Model Parameters

params = [{ 'max_depth':[2,3,4,5,6],
            'max_leaf_nodes':[None,2,3,4,5,6,7,8,9,10]  }]

gridsearchDT05 = GridSearchCV(clf, params, cv= 5).fit(xTrain,yTrain)
gridsearchDT10 = GridSearchCV(clf, params, cv=10).fit(xTrain,yTrain)
gridsearchDT15 = GridSearchCV(clf, params, cv=15).fit(xTrain,yTrain)

print("Accuracy for  5-folds: {}".format(gridsearchDT05.score(xTest,yTest)))
print("Accuracy for 10-folds: {}".format(gridsearchDT10.score(xTest,yTest)))
print("Accuracy for 15-folds: {}".format(gridsearchDT15.score(xTest,yTest)))

print("Optimal Parameters with 5-fold: {}".format(gridsearchDT05.best_params_))
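
The cell above compares fold counts by their test-set score, which means the test set takes part in model selection. A possible alternative, sketched below on synthetic stand-in data (an assumption, not the Titanic set), is to compare the searches by `best_score_`, the mean cross-validated accuracy computed on the training data only. The variable names and the reduced parameter grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic set)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

params = {'max_depth': [2, 3, 4, 5, 6],
          'max_leaf_nodes': [None, 4, 8]}

cv_scores = {}
for folds in (5, 10, 15):
    search = GridSearchCV(DecisionTreeClassifier(random_state=0), params, cv=folds)
    search.fit(X_train, y_train)
    # best_score_ is the mean cross-validated accuracy of the best setting;
    # the held-out test set is never touched here
    cv_scores[folds] = search.best_score_
    print(folds, search.best_params_, round(search.best_score_, 3))
```

With this approach the test set is used only once, at the end, to report the accuracy of the chosen model.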

In [None]:
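# Plot Tree Size (max depth) vs. Number of Misclassifications
misclassTr = []
misclassTe = []
for depth in range(1, 11):
    # Fix the seed so the curves are reproducible. With max_leaf_nodes=8 a tree
    # can be at most 7 levels deep, so depths beyond 7 cannot change the tree.
    c = DecisionTreeClassifier(max_depth=depth, max_leaf_nodes=8, random_state=10413641).fit(xTrain, yTrain)
    predsTr = c.predict(xTrain)
    predsTe = c.predict(xTest)
    # The off-diagonal confusion-matrix entries (fp, fn) are the misclassifications
    _, fpTr, fnTr, _ = confusion_matrix(yTrain, predsTr).ravel()
    _, fpTe, fnTe, _ = confusion_matrix(yTest, predsTe).ravel()
    misclassTr.append(fpTr + fnTr)
    misclassTe.append(fpTe + fnTe)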
plt.figure(figsize=(12,4))
plt.subplot(121)
plt.xlabel("Tree Size")
plt.ylabel("# of Misclassifications")
plt.title("Tree Size vs. Number of Misclassifications (Train)")
plt.plot(range(1,11),misclassTr, linestyle='--', marker='o', color='b')
plt.subplot(122)
plt.xlabel("Tree Size")
plt.ylabel("# of Misclassifications")
plt.title("Tree Size vs. Number of Misclassifications (Test)")
plt.plot(range(1,11),misclassTe, linestyle='--', marker='o', color='r')
plt.show()
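
The plot above counts errors on the training and test sets. A cross-validated version of the same curve could look like the following sketch, which measures tree size by leaf count and uses `cross_val_predict` so that each sample is predicted by a model that did not train on it. The data is a synthetic stand-in (an assumption), and `sizes`, `cv_errors`, and `best_size` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the Titanic set)
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)

sizes = list(range(2, 16))  # candidate leaf counts (max_leaf_nodes must be >= 2)
cv_errors = []
for n_leaves in sizes:
    tree = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    # Out-of-fold predictions from 5-fold cross-validation
    preds = cross_val_predict(tree, X, y, cv=5)
    cv_errors.append(int((preds != y).sum()))

# Smallest tree size that attains the minimum number of CV misclassifications
best_size = sizes[cv_errors.index(min(cv_errors))]
print("Leaf count with fewest CV misclassifications:", best_size)
```

Plotting `sizes` against `cv_errors` with `plt.plot(sizes, cv_errors, marker='o')` would give a tree size vs. misclassification graph based on cross-validation alone.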

# Pruning the Tree
**Step 5:** Prune the tree with the optimal tree size and plot the pruned tree.

In [None]:
# Create a Pruned Decision Tree with Max Depth 4 and Max Leaf Nodes 8
pclf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=8, random_state=10413641).fit(xTrain, yTrain)

In [None]:
# Plot the Pruned Decision Tree using Graphviz
dot_data = export_graphviz(pclf, filled=True, feature_names = ['pclass','sex','age','sibsp'], class_names=['0','1'])
graph = graphviz.Source(dot_data, format="png") 
graph.render("DT-Pruned")
graph

# Pruned Sampling Errors

**Step 6:** For the final pruned tree, report its in-sample and out-of-sample accuracy, defined as:
- In-sample percent survivors correctly predicted (on training set)
- In-sample percent fatalities correctly predicted (on training set)
- Out-of-sample percent survivors correctly predicted (on test set)
- Out-of-sample percent fatalities correctly predicted (on test set)

Check whether out-of-sample performance improves from the full tree (bigger model) to the pruned tree (smaller model).

*Overall, pruning the decision tree helped improve the out-of-sample accuracy of the model. For the initial full tree (88% in-sample, 77% out-of-sample), the training accuracy was fairly high and the model was clearly overfit, given the large gap between training and testing accuracy. This is no longer the case for the pruned model. Looking at the new performance (81% in-sample, 79% out-of-sample), the testing accuracy increased by a modest 2 percentage points, but more importantly the gap between training and testing accuracy shrank from 11 points to 2, which indicates that the model is no longer noticeably overfit.*
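
The comparison in the paragraph above comes down to simple arithmetic on the rounded accuracies it quotes. A minimal sketch, with `results` and `gaps` as hypothetical names:

```python
# Rounded accuracies (percent) quoted in the paragraph above
results = {
    'full tree':   {'train': 88, 'test': 77},
    'pruned tree': {'train': 81, 'test': 79},
}

# Train/test gap in percentage points; a large gap suggests overfitting
gaps = {name: acc['train'] - acc['test'] for name, acc in results.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap} percentage points")

# Change in out-of-sample accuracy from pruning
test_gain = results['pruned tree']['test'] - results['full tree']['test']
print("Out-of-sample change:", test_gain, "percentage points")
```

The full tree has a gap of 11 points and the pruned tree a gap of 2, while out-of-sample accuracy rises by 2 points.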

In [None]:
# In-Sample Error
insamplePreds = pclf.predict(xTrain)
tn, fp, fn, tp = confusion_matrix(yTrain, insamplePreds).ravel()
print('Percent Survivors Correctly Predicted:\t', 100*(tp / (tp + fn)), '%')
print('Percent Fatalities Correctly Predicted:\t', 100*(tn / (tn + fp)), '%')
print('Overall In-Sample Accuracy:\t\t', 100*accuracy_score(yTrain, insamplePreds), '%\n')
plot_confusion_matrix(pclf, xTrain, yTrain, cmap=plt.cm.Blues, display_labels = ['Did Not Survive', 'Survived'], values_format='d')
plt.show()

In [None]:
# Out-of-Sample Error
outsamplePreds = pclf.predict(xTest)
tn, fp, fn, tp = confusion_matrix(yTest, outsamplePreds).ravel()
print('Percent Survivors Correctly Predicted:\t', 100*(tp / (tp + fn)), '%')
print('Percent Fatalities Correctly Predicted:\t', 100*(tn / (tn + fp)), '%')
print('Overall Out-of-Sample Accuracy:\t\t', 100*accuracy_score(yTest, outsamplePreds), '%\n')
plot_confusion_matrix(pclf, xTest, yTest, cmap=plt.cm.Blues, display_labels = ['Did Not Survive', 'Survived'], values_format='d')
plt.show()