# Version 02: Decision Tree and Random Forest

Here I'll try a different approach and use tree-based models, first a single decision tree and then a random forest, to tackle this Titanic classification task. I suspect the results will be similar: my instinct is that improving model performance will take some creative feature engineering. Still, I'd like one more go with the features I used before, to see whether changing the algorithm alone affects performance.

In [1]:
# Standard imports
import pandas as pd
import numpy as np

# Load data
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

# Preview train set
train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [2]:
# Select features we want
features = ['Pclass', 'Sex', 'SibSp', 'Parch']

# One-hot encode categorical features for both training and test sets
X = pd.get_dummies(train[features], columns=['Pclass', 'Sex'])
X_test = pd.get_dummies(test[features], columns=['Pclass', 'Sex'])

# Select target vector
y = train['Survived']

X.head()

,SibSp,Parch,Pclass_1,Pclass_2,Pclass_3,Sex_female,Sex_male
0,1,0,0,0,1,0,1
1,1,0,1,0,0,1,0
2,0,0,0,0,1,1,0
3,1,0,1,0,0,1,0
4,0,0,0,0,1,0,1
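Encoding the training and test sets separately only works when both contain the same categories, which happens to hold for Pclass and Sex here. As a minimal sketch of a safeguard, assuming made-up toy frames (`toy_train` and `toy_test` are illustrative names, not part of this notebook), the test columns can be aligned to the training columns with `reindex`:

```python
import pandas as pd

# Toy frames, made up for illustration: the test frame has no Pclass == 2 rows
toy_train = pd.DataFrame({'Pclass': [1, 2, 3], 'Sex': ['male', 'female', 'male']})
toy_test = pd.DataFrame({'Pclass': [1, 3], 'Sex': ['female', 'male']})

toy_X = pd.get_dummies(toy_train, columns=['Pclass', 'Sex'])
toy_X_test = pd.get_dummies(toy_test, columns=['Pclass', 'Sex'])

# Add any dummy columns missing from the test frame (filled with 0)
# and put the columns in the same order as the training frame
toy_X_test = toy_X_test.reindex(columns=toy_X.columns, fill_value=0)

print(list(toy_X_test.columns))
```

Without this step, a model fitted on `toy_X` would be handed a test frame with a different set of columns.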


## Decision Tree

First I'll try a single decision tree.

In [3]:
# Import libraries we'll need for the model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Split training data into training and validation (hold-out) sets;
# by default train_test_split holds out 25% of the rows
X_train, X_cv, y_train, y_cv = train_test_split(X, y, random_state=1)

# Create model
model = DecisionTreeClassifier(random_state=1)

# Fit model
model.fit(X_train, y_train)

# Make predictions
y_predict = model.predict(X_cv)

# Assess accuracy
print("Accuracy score: {:.4f}".format(accuracy_score(y_cv, y_predict)))

Accuracy score: 0.7578


This tree uses the default model settings, but I'm curious whether we can tune the model and increase performance by specifying different max depths.

In [4]:
# Function that returns accuracy score of a model
def get_accuracy_score(X_train, y_train, X_cv, y_cv, max_depth):
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=1)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_cv)
    acc_score = accuracy_score(y_cv, y_predict)
    return acc_score

In [5]:
max_depth_options = np.arange(2, 10)

scores = {max_depth: get_accuracy_score(X_train, y_train, X_cv, y_cv, max_depth) 
          for max_depth in max_depth_options}

# Return max_depth that resulted in highest accuracy score
best_max_depth = max(scores, key=scores.get)
best_score = scores.get(best_max_depth)

print("Highest accuracy score of {:.4f} achieved with a max depth of {}"
      .format(best_score, best_max_depth))

Highest accuracy score of 0.7848 achieved with a max depth of 3


By tuning the max_depth parameter, we've increased the model's accuracy on the validation set by about 2.7 percentage points (from 0.7578 to 0.7848).
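A gain of this size measured on a single hold-out split can be partly down to the luck of the split. The sketch below shows one way to get a steadier estimate: k-fold cross-validation over the same depth range. It runs on synthetic stand-in data (`X_demo` and `y_demo` are assumptions for illustration, not the Titanic features), so its numbers say nothing about this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in the real X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=1)

cv_scores = {}
for depth in range(2, 10):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    # Mean accuracy over 5 folds (stratified by default for classifiers)
    fold_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[depth] = fold_scores.mean()

# Depth with the highest mean cross-validated accuracy
best_depth = max(cv_scores, key=cv_scores.get)
print("Best depth: {} (mean CV accuracy {:.4f})".format(best_depth, cv_scores[best_depth]))
```

Each depth is scored on five different validation folds, so the chosen depth depends less on any one split.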

## Random Forest

Let's see if a random forest model will yield better results.

In [6]:
from sklearn.ensemble import RandomForestClassifier

# Function that returns accuracy score of a random forest model
def get_rf_accuracy_score(X_train, y_train, X_cv, y_cv, max_depth):
    model = RandomForestClassifier(max_depth=max_depth, random_state=1)
    model.fit(X_train, y_train)
    y_predict = model.predict(X_cv)
    acc_score = accuracy_score(y_cv, y_predict)
    return acc_score

rf_scores = {max_depth: get_rf_accuracy_score(X_train, y_train, X_cv, y_cv, max_depth) 
             for max_depth in max_depth_options}

# Return max_depth that resulted in highest accuracy score
best_max_depth = max(rf_scores, key=rf_scores.get)
best_score = rf_scores.get(best_max_depth)

print("Highest accuracy score of {:.4f} achieved with a max depth of {}"
      .format(best_score, best_max_depth))

Highest accuracy score of 0.7803 achieved with a max depth of 3


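That's close to the single tree's score (0.7803 versus 0.7848). A forest averages many trees, so it should be less sensitive to the quirks of any one tree or split. I'll use this model on the test set and see how we do!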
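One rough way to probe how stable each model is: repeat the hold-out split with several seeds and compare the spread of accuracy scores. The sketch below does this on synthetic stand-in data. `X_demo`, `y_demo`, and the helper `split_accuracies` are hypothetical names introduced for illustration, and the result on the Titanic features may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; swap in the real X and y)
X_demo, y_demo = make_classification(n_samples=300, n_features=7, random_state=1)

def split_accuracies(make_model, seeds):
    """Hold-out accuracy of a freshly built model for each split seed."""
    accs = []
    for seed in seeds:
        X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=seed)
        model = make_model().fit(X_tr, y_tr)
        accs.append(accuracy_score(y_val, model.predict(X_val)))
    return np.array(accs)

seeds = range(10)
tree_accs = split_accuracies(
    lambda: DecisionTreeClassifier(max_depth=3, random_state=1), seeds)
rf_accs = split_accuracies(
    lambda: RandomForestClassifier(max_depth=3, random_state=1), seeds)

print("Tree:   mean {:.4f}, std {:.4f}".format(tree_accs.mean(), tree_accs.std()))
print("Forest: mean {:.4f}, std {:.4f}".format(rf_accs.mean(), rf_accs.std()))
```

A smaller standard deviation for one model would suggest its score depends less on the particular split. Ten splits of a small dataset is only a rough indication.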

In [7]:
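# Refit on the full training set using the max depth selected above
model = RandomForestClassifier(max_depth=3, random_state=1)
model.fit(X, y)

# Predict survival for the test set
y_test = model.predict(X_test)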

output = pd.DataFrame({'PassengerId': test.PassengerId, 'Survived': y_test})
output

,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [8]:
output.to_csv('submissions/submission02.csv', index=False)
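Before uploading, it can help to sanity-check the submission frame. The sketch below uses a hypothetical helper, `check_submission`, and a made-up toy frame. The checks (two expected columns, unique IDs, 0/1 predictions) reflect the format written above. The expected row count is passed in and not hard-coded.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Hypothetical helper: basic sanity checks on a submission frame."""
    assert list(df.columns) == ['PassengerId', 'Survived'], "unexpected columns"
    assert len(df) == expected_rows, "unexpected row count"
    assert df['PassengerId'].is_unique, "duplicate passenger IDs"
    assert df['Survived'].isin([0, 1]).all(), "predictions must be 0 or 1"
    return True

# Toy frame, made up for illustration
toy_output = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
ok = check_submission(toy_output, expected_rows=3)
print(ok)
```

For the real frame, the call would be something like `check_submission(output, expected_rows=len(test))`.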