# Lab: Titanic Survival Exploration with Decision Trees

## Getting Started
In this lab, you will see how decision trees work by training one with scikit-learn (`sklearn`) on the Titanic passenger data.

We'll start by loading the dataset and displaying some of its rows.

In [43]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from IPython.display import display # Allows the use of display() for DataFrames

# Pretty display for notebooks
%matplotlib inline

# Set a random seed
import random
random.seed(42)

# Load the dataset
in_file = 'titanic_data.csv'
full_data = pd.read_csv(in_file)

kaggle_test_file = 'test.csv'
kaggle_test_set = pd.read_csv(kaggle_test_file)

# Print the first few entries of the RMS Titanic data
display(full_data.head())

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [44]:
display(kaggle_test_set.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Recall that these are the various features present for each passenger on the ship:
- **Survived**: Outcome of survival (0 = No; 1 = Yes)
- **Pclass**: Socio-economic class (1 = Upper class; 2 = Middle class; 3 = Lower class)
- **Name**: Name of passenger
- **Sex**: Sex of the passenger
- **Age**: Age of the passenger (Some entries contain `NaN`)
- **SibSp**: Number of siblings and spouses of the passenger aboard
- **Parch**: Number of parents and children of the passenger aboard
- **Ticket**: Ticket number of the passenger
- **Fare**: Fare paid by the passenger
- **Cabin**: Cabin number of the passenger (Some entries contain `NaN`)
- **Embarked**: Port of embarkation of the passenger (C = Cherbourg; Q = Queenstown; S = Southampton)
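
Since **Age** and **Cabin** can contain `NaN`, it can help to count the missing entries per column before preprocessing. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are illustrative, not read from `titanic_data.csv`); the same `isna().sum()` call should work on `full_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few passenger rows (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count the missing entries in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```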

Since we're interested in the outcome of survival for each passenger or crew member, we can remove the **Survived** feature from this dataset and store it as its own separate variable `outcomes`. We will use these outcomes as our prediction targets.  
Run the code cell below to remove **Survived** as a feature of the dataset and store it in `outcomes`.

In [2]:
# Store the 'Survived' feature in a new variable and remove it from the dataset
outcomes = full_data['Survived']
features_raw = full_data.drop('Survived', axis = 1)

# Show the new dataset with 'Survived' removed
display(features_raw.head())

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


The same sample of the RMS Titanic data now shows the **Survived** feature removed from the DataFrame. Note that `features_raw` (the passenger data) and `outcomes` (the outcomes of survival) are now *paired*. That means any passenger `features_raw.loc[i]` has the survival outcome `outcomes[i]`.
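
To make the pairing concrete, here is a small sketch on a hypothetical three-row frame (`toy_full`, `toy_features`, and `toy_outcomes` are illustrative names, not part of the lab). The features and the outcomes share the same index, so one label looks up both.

```python
import pandas as pd

# Hypothetical three-passenger frame with the same layout idea as full_data
toy_full = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
})

# Split into targets and features, as the lab does with 'Survived'
toy_outcomes = toy_full["Survived"]
toy_features = toy_full.drop("Survived", axis=1)

# The shared index keeps each row paired with its outcome
i = 1
row = toy_features.loc[i]
label = toy_outcomes.loc[i]
print(row["Sex"], label)
```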

## Preprocessing the data

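Now, let's do some data preprocessing. First, we'll drop the columns we won't use as features (`PassengerId`, `Name`, `Ticket`, and `Cabin`) and one-hot encode the remaining categorical features.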

In [64]:
features = features_raw.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])
kaggle_test_set_dropped = kaggle_test_set.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'])

features = pd.get_dummies(features)
features_kaggle = pd.get_dummies(kaggle_test_set_dropped)
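
One thing to watch when calling `pd.get_dummies` on two frames separately: a category that appears in only one of them produces a column in only one of them. The sketch below uses hypothetical toy frames (`train_toy`, `test_toy`) to show one possible way to align the columns with `reindex`. It is not needed if both sets contain the same categories, which appears to be the case in this lab.

```python
import pandas as pd

# Toy frames: the second one never sees ports 'C' or 'Q'
train_toy = pd.DataFrame({"Pclass": [3, 1, 3], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Pclass": [2, 3], "Embarked": ["S", "S"]})

train_encoded = pd.get_dummies(train_toy)
test_encoded = pd.get_dummies(test_toy)

# Give the second frame the same columns, in the same order, as the first;
# categories it never saw become all-zero columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```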

And now we'll fill in any blanks with zeroes.

In [65]:
features = features.fillna(0.0)
features_kaggle = features_kaggle.fillna(0.0)
display(features.head())

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,3,22.0,1,0,7.25,0,1,0,0,1
1,1,38.0,1,0,71.2833,1,0,1,0,0
2,3,26.0,0,0,7.925,1,0,0,0,1
3,1,35.0,1,0,53.1,1,0,0,0,1
4,3,35.0,0,0,8.05,0,1,0,0,1
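
Filling with zeroes is simple, but for a column like **Age** it treats every missing value as an age of 0. The sketch below, on a hypothetical toy series, compares zero filling with median filling, a common alternative that the lab does not use. It is only meant to show how the choice shifts the column's statistics.

```python
import numpy as np
import pandas as pd

# Toy ages with two missing entries (illustrative values only)
ages = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan], name="Age")

# The lab's approach: replace missing values with 0.0
zero_filled = ages.fillna(0.0)

# An alternative (an assumption, not the lab's method): use the median
median_filled = ages.fillna(ages.median())

print("mean after zero fill:", zero_filled.mean())
print("mean after median fill:", median_filled.mean())
```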


## (TODO) Training the model

Now we're ready to train a model in sklearn. First, let's split the data into training and testing sets. Then we'll train the model on the training set.

In [66]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, outcomes, test_size=0.2, random_state=42)

In [67]:
len(y_test)

179
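
The 179 comes from the split arithmetic: with 891 rows and `test_size=0.2`, scikit-learn rounds the test count up, and the remaining rows form the training set. A minimal sketch on placeholder arrays (`X_toy` and `y_toy` are stand-ins, not the Titanic features):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891  # number of labeled passengers in the dataset

# Placeholder arrays with the same number of rows as the real data
X_toy = np.arange(n_rows).reshape(-1, 1)
y_toy = np.zeros(n_rows, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# 0.2 * 891 = 178.2, which is rounded up for the test set
expected_test = math.ceil(0.2 * n_rows)
print(len(y_te), len(y_tr), expected_test)
```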

In [68]:
# Import the classifier from sklearn
from sklearn.tree import DecisionTreeClassifier

# TODO: Define the classifier, and fit it to the data
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

## Testing the model
Now let's see how our model does by calculating the accuracy on both the training set and the testing set.

In [69]:
# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate the accuracy
from sklearn.metrics import accuracy_score
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.9803370786516854
The test accuracy is 0.776536312849162
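
Accuracy is just the fraction of predictions that match the true labels. The sketch below checks this by hand on hypothetical toy labels, and also shows that the test accuracy above is consistent with 139 correct predictions out of 179.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (illustrative only): 4 of 5 match
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Fraction of positions where the prediction equals the label
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# The reported test accuracy corresponds to 139 correct out of 179
reported_ratio = 139 / 179
print(reported_ratio)
```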


## Exercise: Improving the model

The training accuracy (about 98%) is much higher than the testing accuracy (about 78%), so the model is probably overfitting.
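
One way to see this effect is to compare training and testing accuracy as the tree is allowed to grow deeper. The sketch below uses synthetic data from `make_classification` with some label noise (an assumption for illustration, not the Titanic data), so the exact numbers will differ from the lab's. An unconstrained tree typically memorizes the training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (not the Titanic dataset)
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Record (training accuracy, testing accuracy) for each depth limit
results = {}
for depth in [2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    results[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, results[depth])
```

A large gap between the two numbers for `max_depth=None` is the pattern to look for.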

So now it's your turn to shine! Train a new model, and try to specify some parameters in order to improve the testing accuracy, such as:
- `max_depth`
- `min_samples_leaf`
- `min_samples_split`

You can use your intuition, trial and error, or even better, feel free to use Grid Search!

**Challenge:** Try to get to 85% accuracy on the testing set. If you'd like a hint, take a look at the solutions notebook in this same folder.
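
If you want to try Grid Search, the following is a minimal sketch using `GridSearchCV` over the three parameters listed above. It runs on synthetic data, and the candidate values in `param_grid` are arbitrary assumptions, not recommended settings. In the lab you would fit the search on `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic dataset)
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Candidate values are illustrative; adjust them for your own search
param_grid = {
    "max_depth": [2, 4, 6, 8],
    "min_samples_leaf": [1, 3, 6],
    "min_samples_split": [2, 10],
}

# 5-fold cross-validation on the training set for every combination
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.score(X_te, y_te))
```

Because the search only uses cross-validation on the training set, the testing set stays untouched until the final score.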

In [70]:
X_train.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
331,1,45.5,0,0,28.5,0,1,0,0,1
733,2,23.0,0,0,13.0,0,1,0,0,1
382,3,32.0,0,0,7.925,0,1,0,0,1
704,3,26.0,1,0,7.8542,0,1,0,0,1
813,3,6.0,4,2,31.275,1,0,0,0,1


In [82]:
X_test.head()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
709,3,0.0,1,1,15.2458,0,1,1,0,0
439,2,31.0,0,0,10.5,0,1,0,0,1
840,3,20.0,0,0,7.925,0,1,0,0,1
720,2,6.0,0,1,33.0,1,0,0,0,1
39,3,14.0,1,0,11.2417,1,0,1,0,0


In [85]:
X_train.describe()

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
count,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0,712.0
mean,2.330056,23.698511,0.553371,0.379213,32.586276,0.344101,0.655899,0.175562,0.08427,0.73736
std,0.824584,17.507272,1.176404,0.791669,51.969529,0.475408,0.475408,0.380714,0.277987,0.440378
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,5.0,0.0,0.0,7.925,0.0,0.0,0.0,0.0,0.0
50%,3.0,24.0,0.0,0.0,14.4542,0.0,1.0,0.0,0.0,1.0
75%,3.0,35.0,1.0,0.0,30.5,1.0,1.0,0.0,0.0,1.0
max,3.0,80.0,8.0,6.0,512.3292,1.0,1.0,1.0,1.0,1.0


In [71]:
# Training the model
model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10)
model.fit(X_train, y_train)

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating accuracies
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)

The training accuracy is 0.8623595505617978
The test accuracy is 0.8715083798882681


In [93]:
pd.concat([X_train, X_test], axis=1)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Pclass.1,Age.1,SibSp.1,Parch.1,Fare.1,Sex_female.1,Sex_male.1,Embarked_C.1,Embarked_Q.1,Embarked_S.1
0,3.0,22.0,1.0,0.0,7.2500,0.0,1.0,0.0,0.0,1.0,,,,,,,,,,
1,1.0,38.0,1.0,0.0,71.2833,1.0,0.0,1.0,0.0,0.0,,,,,,,,,,
2,3.0,26.0,0.0,0.0,7.9250,1.0,0.0,0.0,0.0,1.0,,,,,,,,,,
3,1.0,35.0,1.0,0.0,53.1000,1.0,0.0,0.0,0.0,1.0,,,,,,,,,,
4,3.0,35.0,0.0,0.0,8.0500,0.0,1.0,0.0,0.0,1.0,,,,,,,,,,
5,,,,,,,,,,,3.0,0.0,0.0,0.0,8.4583,0.0,1.0,0.0,1.0,0.0
6,1.0,54.0,0.0,0.0,51.8625,0.0,1.0,0.0,0.0,1.0,,,,,,,,,,
7,3.0,2.0,3.0,1.0,21.0750,0.0,1.0,0.0,0.0,1.0,,,,,,,,,,
8,3.0,27.0,0.0,2.0,11.1333,1.0,0.0,0.0,0.0,1.0,,,,,,,,,,
9,2.0,14.0,1.0,0.0,30.0708,1.0,0.0,1.0,0.0,0.0,,,,,,,,,,
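
The `NaN`-filled output above is what `axis=1` does: it places the frames side by side and aligns them on the index, so rows that exist in only one frame get blanks in the other's columns. Stacking rows requires `axis=0`. A small sketch with hypothetical toy frames:

```python
import pandas as pd

# Two toy frames with disjoint row indices, like X_train and X_test
train_part = pd.DataFrame({"Age": [22.0, 38.0]}, index=[0, 1])
test_part = pd.DataFrame({"Age": [54.0]}, index=[2])

# axis=1: columns placed side by side, rows aligned on index, gaps become NaN
side_by_side = pd.concat([train_part, test_part], axis=1)

# axis=0: rows stacked, columns shared
stacked = pd.concat([train_part, test_part], axis=0)

print(side_by_side)
print(stacked)
```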


In [92]:
pd.concat([y_train, y_test], axis=0)

331    0
733    0
382    0
704    0
813    0
118    0
536    0
361    0
29     0
55     1
865    1
595    0
239    0
721    0
81     1
259    1
486    1
716    1
800    0
781    1
542    0
326    0
534    0
535    1
483    1
762    1
533    1
713    0
390    1
495    0
      ..
668    0
584    0
514    0
688    0
109    1
77     0
611    0
643    1
82     1
518    1
657    0
296    0
507    1
808    0
375    1
5      0
54     0
398    0
457    1
521    0
363    0
97     1
417    1
572    1
852    0
433    0
773    0
25     1
84     1
10     1
Name: Survived, Length: 891, dtype: int64

In [71]:
# Retrain on all labeled data before predicting on the Kaggle test set
# (axis=0 stacks rows; axis=1 would misalign them and introduce NaN values)
X_all = pd.concat([X_train, X_test], axis=0)
y_all = pd.concat([y_train, y_test], axis=0)

model = DecisionTreeClassifier(max_depth=6, min_samples_leaf=6, min_samples_split=10)
model.fit(X_all, y_all)

# Making predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculating accuracies (both are in-sample now, since the model has seen X_test)
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

print('The training accuracy is', train_accuracy)
print('The test accuracy is', test_accuracy)


In [83]:
model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=6, min_samples_split=10,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [72]:
y_test_kaggle = model.predict(features_kaggle)

In [86]:
test_set_predictions = pd.DataFrame({'PassengerId': kaggle_test_set['PassengerId'], 'Survived': y_test_kaggle})
test_set_predictions

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
5,897,0
6,898,1
7,899,0
8,900,1
9,901,0


In [81]:
test_set_predictions.to_csv('test_predictions.csv', index=False)