# Titanic - Decision Tree

In this notebook, I will use the Decision Tree algorithm for classification (also known as Classification Tree) to predict which passengers survived the Titanic's shipwreck.

#### Table of contents

[1. Data](#data) <br/>
[2. Building the Decision Tree](#building) <br/>
&nbsp;&nbsp;&nbsp;&nbsp;[2.1. Exploring alternative trees: different criteria for partitioning the tree](#alternative_criteria) <br/>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2. Exploring alternative trees: changing the maximum depth](#alternative_depth) <br/>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3. Exploring alternative trees: training with fewer features](#alternative_features) <br/>
&nbsp;&nbsp;&nbsp;&nbsp;[2.4. Exploring alternative trees: setting a minimum number of samples in a leaf](#alternative_samples)<br/>
[3. Predictions for the test dataset](#test)

## 1. Data <a name='data'/>

I will use the training and test datasets provided by the Titanic competition. In a <a href='./titanic-exploratory-data-analysis.ipynb'>previous notebook</a>, I performed an Exploratory Data Analysis on the data and cleaned it. I will use the cleaned datasets to fit the model and make the predictions here.

In [1]:
import pandas as pd

In [2]:
train_df = pd.read_csv('data/clean_titanic_train.csv', index_col='PassengerId')
train_df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,SibSpAboard,ParChAboard
1,0,3,male,22.0,1,0,7.25,S,1,0
2,1,1,female,38.0,1,0,71.2833,C,1,0
3,1,3,female,26.0,0,0,7.925,S,0,0
4,1,1,female,35.0,1,0,53.1,S,1,0
5,0,3,male,35.0,0,0,8.05,S,0,0


In [3]:
test_df = pd.read_csv('data/clean_titanic_test.csv', index_col='PassengerId')
test_df.head()

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,SibSpAboard,ParChAboard
892,3,male,34.5,0,0,7.8292,Q,0,0
893,3,female,47.0,1,0,7.0,S,1,0
894,2,male,62.0,0,0,9.6875,Q,0,0
895,3,male,27.0,0,0,8.6625,S,0,0
896,3,female,22.0,1,1,12.2875,S,1,1


We need to adapt the data to the kind of data our model (a Decision Tree) needs. Specifically, we need to convert categorical data (the "Sex" and "Embarked" features) to numerical data.

In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Survived     891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Sex          891 non-null    object 
 3   Age          891 non-null    float64
 4   SibSp        891 non-null    int64  
 5   Parch        891 non-null    int64  
 6   Fare         891 non-null    float64
 7   Embarked     891 non-null    object 
 8   SibSpAboard  891 non-null    int64  
 9   ParChAboard  891 non-null    int64  
dtypes: float64(2), int64(6), object(2)
memory usage: 76.6+ KB


In [5]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Pclass       418 non-null    int64  
 1   Sex          418 non-null    object 
 2   Age          418 non-null    float64
 3   SibSp        418 non-null    int64  
 4   Parch        418 non-null    int64  
 5   Fare         418 non-null    float64
 6   Embarked     418 non-null    object 
 7   SibSpAboard  418 non-null    int64  
 8   ParChAboard  418 non-null    int64  
dtypes: float64(2), int64(5), object(2)
memory usage: 32.7+ KB


In [6]:
train_df['SexMale'] = train_df['Sex'].apply(lambda x: 1 if x=='male' else 0)
train_df['SexFemale'] = train_df['Sex'].apply(lambda x: 1 if x=='female' else 0)
train_df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,SibSpAboard,ParChAboard,SexMale,SexFemale
1,0,3,male,22.0,1,0,7.25,S,1,0,1,0
2,1,1,female,38.0,1,0,71.2833,C,1,0,0,1
3,1,3,female,26.0,0,0,7.925,S,0,0,0,1
4,1,1,female,35.0,1,0,53.1,S,1,0,0,1
5,0,3,male,35.0,0,0,8.05,S,0,0,1,0


In [7]:
test_df['SexMale'] = test_df['Sex'].apply(lambda x: 1 if x=='male' else 0)
test_df['SexFemale'] = test_df['Sex'].apply(lambda x: 1 if x=='female' else 0)
test_df.head()

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,SibSpAboard,ParChAboard,SexMale,SexFemale
892,3,male,34.5,0,0,7.8292,Q,0,0,1,0
893,3,female,47.0,1,0,7.0,S,1,0,0,1
894,2,male,62.0,0,0,9.6875,Q,0,0,1,0
895,3,male,27.0,0,0,8.6625,S,0,0,1,0
896,3,female,22.0,1,1,12.2875,S,1,1,0,1


In [8]:
train_df['EmbarkedS'] = train_df['Embarked'].apply(lambda x: 1 if x=='S' else 0)
train_df['EmbarkedQ'] = train_df['Embarked'].apply(lambda x: 1 if x=='Q' else 0)
train_df['EmbarkedC'] = train_df['Embarked'].apply(lambda x: 1 if x=='C' else 0)
train_df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,SibSpAboard,ParChAboard,SexMale,SexFemale,EmbarkedS,EmbarkedQ,EmbarkedC
1,0,3,male,22.0,1,0,7.25,S,1,0,1,0,1,0,0
2,1,1,female,38.0,1,0,71.2833,C,1,0,0,1,0,0,1
3,1,3,female,26.0,0,0,7.925,S,0,0,0,1,1,0,0
4,1,1,female,35.0,1,0,53.1,S,1,0,0,1,1,0,0
5,0,3,male,35.0,0,0,8.05,S,0,0,1,0,1,0,0


In [9]:
test_df['EmbarkedS'] = test_df['Embarked'].apply(lambda x: 1 if x=='S' else 0)
test_df['EmbarkedQ'] = test_df['Embarked'].apply(lambda x: 1 if x=='Q' else 0)
test_df['EmbarkedC'] = test_df['Embarked'].apply(lambda x: 1 if x=='C' else 0)
test_df.head()

PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,SibSpAboard,ParChAboard,SexMale,SexFemale,EmbarkedS,EmbarkedQ,EmbarkedC
892,3,male,34.5,0,0,7.8292,Q,0,0,1,0,0,1,0
893,3,female,47.0,1,0,7.0,S,1,0,0,1,1,0,0
894,2,male,62.0,0,0,9.6875,Q,0,0,1,0,0,1,0
895,3,male,27.0,0,0,8.6625,S,0,0,1,0,1,0,0
896,3,female,22.0,1,1,12.2875,S,1,1,0,1,1,0,0


In [10]:
train_df.drop(columns=['Sex','Embarked'],inplace=True)
test_df.drop(columns=['Sex','Embarked'],inplace=True)

In [11]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Survived     891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Age          891 non-null    float64
 3   SibSp        891 non-null    int64  
 4   Parch        891 non-null    int64  
 5   Fare         891 non-null    float64
 6   SibSpAboard  891 non-null    int64  
 7   ParChAboard  891 non-null    int64  
 8   SexMale      891 non-null    int64  
 9   SexFemale    891 non-null    int64  
 10  EmbarkedS    891 non-null    int64  
 11  EmbarkedQ    891 non-null    int64  
 12  EmbarkedC    891 non-null    int64  
dtypes: float64(2), int64(11)
memory usage: 97.5 KB


In [12]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Pclass       418 non-null    int64  
 1   Age          418 non-null    float64
 2   SibSp        418 non-null    int64  
 3   Parch        418 non-null    int64  
 4   Fare         418 non-null    float64
 5   SibSpAboard  418 non-null    int64  
 6   ParChAboard  418 non-null    int64  
 7   SexMale      418 non-null    int64  
 8   SexFemale    418 non-null    int64  
 9   EmbarkedS    418 non-null    int64  
 10  EmbarkedQ    418 non-null    int64  
 11  EmbarkedC    418 non-null    int64  
dtypes: float64(2), int64(10)
memory usage: 42.5 KB


Now all the features are numerical.
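
As a hedged alternative to the manual `apply` calls above, pandas' `get_dummies` can build the same kind of indicator columns in one step. The sketch below uses a small made-up frame (`toy_df` and its values are illustrative, not the Titanic data), and the rename step is only there to match the column names used in this notebook.

```python
import pandas as pd

# Toy frame with the two categorical columns (values are illustrative only)
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category; prefix_sep="" yields names like "EmbarkedS"
encoded_df = pd.get_dummies(
    toy_df, columns=["Sex", "Embarked"], prefix_sep="", dtype=int
)

# get_dummies keeps the original category spelling, so rename to SexMale/SexFemale
encoded_df = encoded_df.rename(
    columns={"Sexmale": "SexMale", "Sexfemale": "SexFemale"}
)

print(encoded_df)
```

If the training and test frames are encoded separately this way, a category that is missing from one of them would lead to mismatched columns, so it is worth checking that both frames end up with the same column set.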

## 2. Building the Decision Tree <a name='building'/>

We will split the training dataset and allocate 80% to training the model and 20% to validating it.

Let's first build the tree using all the features available in the dataset.

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare','SibSpAboard','ParChAboard',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree = DecisionTreeClassifier(max_depth=5, random_state=17)
tree.fit(x_train, y_train)

tree_prediction = tree.predict(x_validation)
accuracy_score(y_validation, tree_prediction)

0.8044692737430168

In the <a href='./titanic-exploratory-data-analysis.ipynb'>EDA notebook</a>, we engineered the features "SibSpAboard" and "ParChAboard". These are binary features equal to 1 if the passenger had, respectively, siblings or spouses aboard, or parents or children aboard, and 0 otherwise.

Let's build two separate trees: one considering the initial features "SibSp" and "Parch" (number of siblings and spouses aboard, and number of parents and children aboard, respectively), and the other one considering "SibSpAboard" and "ParChAboard", and see which one performs better. Note that we fix a random state, so that randomness does not influence the difference in the performance of both trees.

In [14]:
target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_original_features = DecisionTreeClassifier(max_depth=5, random_state=17)
tree_original_features.fit(x_train, y_train)

tree_original_features_prediction = tree_original_features.predict(x_validation)
accuracy_score(y_validation, tree_original_features_prediction)

0.8044692737430168

In [15]:
target = train_df['Survived']
train = train_df[
    ['Pclass','Age','Fare','SibSpAboard','ParChAboard',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_engineered_features = DecisionTreeClassifier(max_depth=5, random_state=17)
tree_engineered_features.fit(x_train, y_train)

tree_engineered_features_prediction = tree_engineered_features.predict(x_validation)
accuracy_score(y_validation, tree_engineered_features_prediction)

0.7988826815642458

Though the difference in performance is small (0.56 percentage points of accuracy, which amounts to a single passenger in the 179-passenger validation set), the tree trained with the original "SibSp" and "Parch" features performs better on the validation set.
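
To put the two accuracies in perspective, they can be converted back into counts of correctly classified passengers. The sketch below is plain arithmetic on the numbers reported above; it assumes that `train_test_split` rounds the validation size up when `test_size` is a fraction, and the variable names are illustrative.

```python
import math

n_samples = 891      # rows in the training file (from train_df.info())
test_size = 0.2      # validation share passed to train_test_split

# With a fractional test_size, the validation size is rounded up
n_validation = math.ceil(n_samples * test_size)

accuracy_original = 0.8044692737430168    # tree with SibSp / Parch
accuracy_engineered = 0.7988826815642458  # tree with SibSpAboard / ParChAboard

# Accuracy times the validation size gives the number of correct predictions
correct_original = round(accuracy_original * n_validation)
correct_engineered = round(accuracy_engineered * n_validation)

print(n_validation)                          # 179
print(correct_original, correct_engineered)  # 144 143
```

The gap between the two trees is one validation passenger, so this comparison alone is weak evidence that one feature set is better than the other.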

Let's now visualize both trees.

In [16]:
import pydotplus  # !pip install pydotplus
from sklearn.tree import export_graphviz

def tree_graph_to_png(tree, feature_names, png_file_to_save):
    tree_str = export_graphviz(
        tree, feature_names=feature_names, filled=True, out_file=None
    )
    graph = pydotplus.graph_from_dot_data(tree_str)
    graph.write_png(png_file_to_save)

In [17]:
# Tree trained with the original "SibSp" and "Parch" features
tree_graph_to_png(
    tree=tree_original_features,
    feature_names=['Pclass','Age','SibSp','Parch','Fare',
         'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC'],
    png_file_to_save='img/tree_original_features.png'
)

In [18]:
# Tree trained with the engineered "SibSpAboard" and "ParChAboard" features
tree_graph_to_png(
    tree=tree_engineered_features,
    feature_names=['Pclass','Age','Fare','SibSpAboard','ParChAboard',
         'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC'],
    png_file_to_save='img/tree_engineered_features.png'
)

The tree trained with the original "SibSp" and "Parch" features is the following:
![tree_original_features.png](./img/tree_original_features.png)

The tree trained with the engineered "SibSpAboard" and "ParChAboard" features is the following:
![tree_engineered_features.png](./img/tree_engineered_features.png)

The first feature the tree splits on is "Sex" (represented here by the binary indicator "SexFemale"). In the <a href='./titanic-exploratory-data-analysis.ipynb'>EDA performed on the Titanic dataset</a>, I saw that this is the feature that most influenced the chances of survival: the proportion of survivors was 74.20% among female passengers and 18.89% among male passengers. The tree also gives high importance to the age and the ticket class of the passenger.
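
Besides reading the plots, the relative weight a fitted tree gives to each feature can be inspected through its `feature_importances_` attribute. The sketch below is a minimal illustration on synthetic data (the `f0`..`f3` names and the `demo_tree` model are hypothetical stand-ins, not the Titanic features); in this notebook the same attribute could be read from `tree_original_features`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic features); names are hypothetical
feature_names = ["f0", "f1", "f2", "f3"]
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=17
)

demo_tree = DecisionTreeClassifier(max_depth=5, random_state=17).fit(X, y)

# feature_importances_ holds one normalised score per feature (they sum to 1)
importances = dict(zip(feature_names, demo_tree.feature_importances_))
for name, score in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```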

Since Decision Trees are not computationally expensive (that is, they train and predict quickly), we can explore some alternative trees by modifying parameters or the training dataset. We will compare their performance to `tree_original_features`, which has the best accuracy on the validation set among the trees above (tied with the tree trained on all the features).

### 2.1. Exploring alternative trees: different criteria for partitioning the tree <a name='alternative_criteria'/>

By default, the function used by the `sklearn.tree.DecisionTreeClassifier` class to measure the quality of a split is the Gini uncertainty. For a system with $N$ possible states, the Gini uncertainty (or Gini impurity) is defined as
$$
G = 1 - \sum_{k=1}^N (p_k)^2
$$
where $p_k$ is the probability of the system being in the $k$-th state.

An alternative function to measure the quality of a split is Shannon's entropy, defined as
$$
S = -\sum_{k=1}^N p_k\log_2 p_k
$$
Entropy can be described as the "degree of chaos (or uncertainty) of a system". Thus, the lower the entropy, the more "ordered" the system is. The reduction in entropy is called the information gain. The algorithm will try to maximize the information gain in every split.
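
As a small worked example of the information gain, the sketch below computes the entropy of a parent node and of its two children, and takes the difference. The labels are made up for illustration (1 = survived, 0 = did not), and the helper names `entropy` and `information_gain` are hypothetical, not part of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Made-up labels, split into two child nodes
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]

gain = information_gain(parent, left, right)
print(round(gain, 4))  # 0.1887
```

The parent has an entropy of 1 bit (a 50/50 mix), and each child is a 75/25 mix, so the split removes only part of the uncertainty.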

For two classes, the curves of Shannon's entropy and of twice the Gini uncertainty, plotted against the class probability, are very close to each other. In practice, both criteria therefore behave very similarly, so we can expect the performance not to change significantly.
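
The closeness of the two criteria can be checked numerically for the two-class case. The sketch below evaluates both formulas on a handful of probabilities; the function names and the grid are illustrative choices.

```python
import math

def gini_binary(p):
    """Gini uncertainty for two classes with probabilities p and 1 - p."""
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy_binary(p):
    """Shannon entropy (in bits) for two classes; zero-probability terms are skipped."""
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

probabilities = [0.1, 0.25, 0.5, 0.75, 0.9]
for p in probabilities:
    print(f"p={p:.2f}  2*Gini={2 * gini_binary(p):.3f}  entropy={entropy_binary(p):.3f}")

# Largest gap between the two curves on this grid
max_gap = max(abs(2 * gini_binary(p) - entropy_binary(p)) for p in probabilities)
print(f"largest gap: {max_gap:.3f}")
```

On this grid the two curves coincide at $p = 0.5$ and differ by about 0.11 at most, with entropy slightly higher away from the midpoint.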

In [19]:
target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_entropy = DecisionTreeClassifier(max_depth=5, random_state=17, criterion='entropy')
tree_entropy.fit(x_train, y_train)

tree_entropy_prediction = tree_entropy.predict(x_validation)
accuracy_score(y_validation, tree_entropy_prediction)

0.776536312849162

Switching to the entropy criterion does not improve the tree: the validation accuracy drops from 80.45% to 77.65%.

### 2.2. Exploring alternative trees: changing the maximum depth <a name='alternative_depth'/>

If we train a shallower tree, the performance on the validation set is reduced.

In [20]:
# Tree with original features, less depth

target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

# Tree with max_depth 3
tree_depth3 = DecisionTreeClassifier(max_depth=3, random_state=17)
tree_depth3.fit(x_train, y_train)

tree_depth3_prediction = tree_depth3.predict(x_validation)

# Tree with max_depth 4
tree_depth4 = DecisionTreeClassifier(max_depth=4, random_state=17)
tree_depth4.fit(x_train, y_train)

tree_depth4_prediction = tree_depth4.predict(x_validation)

print('Accuracy with max_depth 3:',accuracy_score(y_validation, tree_depth3_prediction))
print('Accuracy with max_depth 4:',accuracy_score(y_validation, tree_depth4_prediction))

Accuracy with max_depth 3: 0.776536312849162
Accuracy with max_depth 4: 0.7821229050279329


If we train a deeper tree, the validation accuracy is also lower than that of the depth-5 tree (80.45%).

In [21]:
# Tree with original features, more depth

target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

# Tree with max_depth 6
tree_depth6 = DecisionTreeClassifier(max_depth=6, random_state=17)
tree_depth6.fit(x_train, y_train)

tree_depth6_prediction = tree_depth6.predict(x_validation)

# Tree with max_depth 7
tree_depth7 = DecisionTreeClassifier(max_depth=7, random_state=17)
tree_depth7.fit(x_train, y_train)

tree_depth7_prediction = tree_depth7.predict(x_validation)

# Tree with max_depth 10
tree_depth10 = DecisionTreeClassifier(max_depth=10, random_state=17)
tree_depth10.fit(x_train, y_train)

tree_depth10_prediction = tree_depth10.predict(x_validation)

print('Accuracy with max_depth 6:',accuracy_score(y_validation, tree_depth6_prediction))
print('Accuracy with max_depth 7:',accuracy_score(y_validation, tree_depth7_prediction))
print('Accuracy with max_depth 10:',accuracy_score(y_validation, tree_depth10_prediction))

Accuracy with max_depth 6: 0.7877094972067039
Accuracy with max_depth 7: 0.7932960893854749
Accuracy with max_depth 10: 0.7988826815642458
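
The depth experiments above repeat the same fit-and-score steps, which could also be written as a loop. The sketch below shows that pattern on synthetic data generated with `make_classification` (a stand-in for the Titanic features, so the accuracies it prints are not comparable to the ones above); the `accuracy_by_depth` dictionary is an illustrative name.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features and target (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=17)
x_train, x_validation, y_train, y_validation = train_test_split(
    X, y, test_size=0.2, random_state=17
)

# Fit one tree per depth on the same split and record its validation accuracy
accuracy_by_depth = {}
for depth in [3, 4, 5, 6, 7, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=17)
    model.fit(x_train, y_train)
    accuracy_by_depth[depth] = accuracy_score(y_validation, model.predict(x_validation))

for depth, accuracy in accuracy_by_depth.items():
    print(f"max_depth={depth}: {accuracy:.4f}")
```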


### 2.3. Exploring alternative trees: training with fewer features <a name='alternative_features'/>

Let's train a tree with fewer features, removing the ones that we saw in the <a href='./titanic-exploratory-data-analysis.ipynb'>EDA</a> to be highly correlated with other features.

We saw that the "Fare" feature is highly correlated with the ticket class, and that the port of embarkation ("Embarked" feature) is related to the ticket class as well: most of the passengers picked up at Cherbourg were First class. We will remove these two features and keep "Pclass" together with the rest of the features.

In [22]:
target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch',
     'SexMale','SexFemale']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_less_features = DecisionTreeClassifier(max_depth=5, random_state=17)
tree_less_features.fit(x_train, y_train)

tree_less_features_prediction = tree_less_features.predict(x_validation)
accuracy_score(y_validation, tree_less_features_prediction)

0.7821229050279329

The performance of the tree does not improve in comparison to the initial one: the validation accuracy is 78.21%, versus 80.45%.

### 2.4. Exploring alternative trees: setting a minimum number of samples in a leaf <a name='alternative_samples'/>

Let's set different minimum numbers of samples in a leaf, and see how the performance of the tree changes.

In [23]:
# Minimum of 10 samples in a leaf

target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_10samples = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, random_state=17)
tree_10samples.fit(x_train, y_train)

tree_10samples_prediction = tree_10samples.predict(x_validation)
accuracy_score(y_validation, tree_10samples_prediction)

0.776536312849162

In [24]:
# Minimum of 20 samples in a leaf

target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_20samples = DecisionTreeClassifier(max_depth=5, min_samples_leaf=20, random_state=17)
tree_20samples.fit(x_train, y_train)

tree_20samples_prediction = tree_20samples.predict(x_validation)
accuracy_score(y_validation, tree_20samples_prediction)

0.770949720670391

In [25]:
# Minimum of 50 samples in a leaf

target = train_df['Survived']
train = train_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

x_train, x_validation, y_train, y_validation = train_test_split(
    train, target, test_size=0.2, random_state=17
)

tree_50samples = DecisionTreeClassifier(max_depth=5, min_samples_leaf=50, random_state=17)
tree_50samples.fit(x_train, y_train)

tree_50samples_prediction = tree_50samples.predict(x_validation)
accuracy_score(y_validation, tree_50samples_prediction)

0.7653631284916201
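
To see what `min_samples_leaf` does to the fitted tree itself, the number of leaves and the size of the smallest leaf can be read from the model. The sketch below does this on synthetic data (a stand-in for the Titanic features); `leaf_stats` is an illustrative name, and the low-level `tree_` arrays are used only to find the leaf nodes.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=17)

leaf_stats = {}
for min_leaf in [1, 10, 20, 50]:
    model = DecisionTreeClassifier(
        max_depth=5, min_samples_leaf=min_leaf, random_state=17
    ).fit(X, y)
    fitted = model.tree_
    is_leaf = fitted.children_left == -1  # leaf nodes have no children
    smallest_leaf = int(fitted.n_node_samples[is_leaf].min())
    leaf_stats[min_leaf] = (model.get_n_leaves(), smallest_leaf)

for min_leaf, (n_leaves, smallest_leaf) in leaf_stats.items():
    print(f"min_samples_leaf={min_leaf}: {n_leaves} leaves, smallest has {smallest_leaf} samples")
```

Every leaf holds at least `min_samples_leaf` training samples, so larger values force the tree to stop splitting earlier and make coarser predictions.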

## 3. Predictions for the test dataset <a name='test'/>

Since none of the alternative Decision Trees improved the validation accuracy, I will use the initial one, with the original features "SibSp" and "Parch" (`tree_original_features`, the one with the best accuracy, 80.45%), to predict survival for the test set.

In [26]:
final_tree = tree_original_features

test = test_df[
    ['Pclass','Age','SibSp','Parch','Fare',
     'SexMale','SexFemale','EmbarkedS','EmbarkedQ','EmbarkedC']
]

predictions = final_tree.predict(test)

test_df['Survived'] = predictions
test_df.head()

PassengerId,Pclass,Age,SibSp,Parch,Fare,SibSpAboard,ParChAboard,SexMale,SexFemale,EmbarkedS,EmbarkedQ,EmbarkedC,Survived
892,3,34.5,0,0,7.8292,0,0,1,0,0,1,0,0
893,3,47.0,1,0,7.0,1,0,0,1,1,0,0,0
894,2,62.0,0,0,9.6875,0,0,1,0,0,1,0,0
895,3,27.0,0,0,8.6625,0,0,1,0,1,0,0,0
896,3,22.0,1,1,12.2875,1,1,0,1,1,0,0,1


I will submit the result to the <a href='https://www.kaggle.com/c/titanic'>Titanic competition</a>.

In [27]:
submission_df = test_df['Survived']
submission_df.head()

PassengerId
892    0
893    0
894    0
895    0
896    1
Name: Survived, dtype: int64

In [28]:
submission_df.mean()

0.35167464114832536

The proportion of predicted survivors in the test set (35.17%) is close to the proportion of survivors in the training set (38.38%), seen in the <a href='./titanic-exploratory-data-analysis.ipynb'>EDA performed on the Titanic dataset</a>, so the result is plausible.

In [30]:
submission_df.to_csv('data/titanic_decision_tree_submission.csv',header=True,index=True)

The score of the submission made using the Decision Tree algorithm is: 0.77751