# Titanic dataset - Simple Random Forest Model

The Titanic dataset presents a binary classification problem: the goal is to predict whether a passenger survived.  
In this notebook, I build a simple Random Forest model to predict passenger survival.

I import the data cleaned in the previous step, which has the following features:

| Variable | Definition                                 | Value                                      |
|----------|--------------------------------------------|--------------------------------------------|
| survival | Survival                                   | 0=No, 1=Yes                                |
| pclass   | Ticket class                               | 1=1st, 2=2nd, 3=3rd                        |
| sex      | Sex                                        | 0=Female, 1=Male                           |
| age      | Age in years                               |                                            |
| sibsp    | # of siblings / spouses aboard the Titanic |                                            |
| parch    | # of parents / children aboard the Titanic |                                            |
| fare     | Passenger fare                             |                                            |
| cabin    | Cabin number                               | 0=Missing/unidentified, 1=Valid cabin number |
| embarked | Port of embarkation                        | 0=Cherbourg, 1=Queenstown, 2=Southampton   |

In [1]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Load Data

In [2]:
# load cleaned data
train_data = pd.read_csv('data/train_clean.csv')
test_data = pd.read_csv('data/test_clean.csv')

# print the shape of the data
print('Train dataset shape (rows, columns):', train_data.shape)
print('Test dataset shape (rows, columns):', test_data.shape)

train_data.head(3)

Train dataset shape (rows, columns): (891, 10)
Test dataset shape (rows, columns): (418, 9)


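|   | PassengerId | Pclass | Sex | Age  | SibSp | Parch | Fare  | Cabin | Embarked | Survived |
|---|-------------|--------|-----|------|-------|-------|-------|-------|----------|----------|
| 0 | 1           | 3      | 1   | 22.0 | 1     | 0     | 7.25  | 0     | 2        | 0        |
| 1 | 2           | 1      | 0   | 38.0 | 1     | 0     | 71.28 | 1     | 0        | 1        |
| 2 | 3           | 3      | 0   | 26.0 | 0     | 0     | 7.92  | 0     | 2        | 1        |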


In [3]:
test_data.head(3)

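|   | PassengerId | Pclass | Sex | Age  | SibSp | Parch | Fare | Cabin | Embarked |
|---|-------------|--------|-----|------|-------|-------|------|-------|----------|
| 0 | 892         | 3      | 1   | 34.5 | 0     | 0     | 7.83 | 0     | 1        |
| 1 | 893         | 3      | 0   | 47.0 | 1     | 0     | 7.0  | 0     | 2        |
| 2 | 894         | 2      | 1   | 62.0 | 0     | 0     | 9.69 | 0     | 1        |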


## Structure data

### One Hot Encoding and Feature Scaling

In [4]:
from sklearn.preprocessing import StandardScaler

y = train_data["Survived"] # target variable

features = ["Pclass", "Sex", "SibSp", "Parch"] # subset of features
X = pd.get_dummies(train_data[features], columns=['Pclass', 'SibSp', 'Parch'])
X_test = pd.get_dummies(test_data[features], columns=['Pclass', 'SibSp', 'Parch'])

# Feature scaling is skipped for this model: Age and Fare are not part of this feature subset.
X.head(3)

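|   | Sex | Pclass_1 | Pclass_2 | Pclass_3 | SibSp_0 | SibSp_1 | SibSp_2 | SibSp_3 | SibSp_4 | SibSp_5 | SibSp_8 | Parch_0 | Parch_1 | Parch_2 | Parch_3 | Parch_4 | Parch_5 | Parch_6 |
|---|-----|----------|----------|----------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | 1 | False | False | True | False | True | False | False | False | False | False | True | False | False | False | False | False | False |
| 1 | 0 | True | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False |
| 2 | 0 | False | False | True | True | False | False | False | False | False | False | True | False | False | False | False | False | False |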
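
`pd.get_dummies` creates one column per category value it actually sees, so encoding the training and test sets separately can produce different column sets. The following is a minimal sketch on made-up toy frames (the data and the names `toy_train`, `toy_test`, `X_toy`, and `aligned_test` are illustrative assumptions, not the notebook's data). It lists the mismatched columns and aligns the test columns to the training columns with `DataFrame.reindex`.

```python
import pandas as pd

# Toy data: Parch=9 appears only in the test frame, Parch=1 and 2 only in train.
toy_train = pd.DataFrame({"Sex": [1, 0, 0], "Parch": [0, 1, 2]})
toy_test = pd.DataFrame({"Sex": [1, 0], "Parch": [0, 9]})

# Encoding each frame separately yields different dummy columns.
X_toy = pd.get_dummies(toy_train, columns=["Parch"])
X_toy_test = pd.get_dummies(toy_test, columns=["Parch"])

only_in_test = sorted(set(X_toy_test.columns) - set(X_toy.columns))
only_in_train = sorted(set(X_toy.columns) - set(X_toy_test.columns))
print("Only in test:", only_in_test)
print("Only in train:", only_in_train)

# Align the test frame to the training columns: extra columns are dropped,
# missing dummy columns are added and filled with False.
aligned_test = X_toy_test.reindex(columns=X_toy.columns, fill_value=False)
print(list(aligned_test.columns) == list(X_toy.columns))
```

Aligning to the training columns keeps the column order identical to what the model saw during `fit`, which is what scikit-learn checks at prediction time.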


## Define model - Random Forest v1
This version uses only a subset of the features: Pclass, Sex, SibSp, and Parch.

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

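# One-hot encoding train and test separately yields different columns
# (the test set contains Parch=9, which never occurs in training), so
# align the test matrix to the training columns before predicting.
X_test = X_test.reindex(columns=X.columns, fill_value=False)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)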


In [None]:
# Evaluate performance
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Cross-Validation Accuracy Scores', scores)
print('Mean Cross-Validation Accuracy Score', scores.mean())

In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submissions/random_forest_1.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


## Define model - Random Forest v2
This version uses all available features.

In [None]:
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Cabin", "Embarked"] # all features
X = pd.get_dummies(train_data[features], columns=['Pclass', 'SibSp', 'Parch', 'Embarked'])
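X_test = pd.get_dummies(test_data[features], columns=['Pclass', 'SibSp', 'Parch', 'Embarked'])

# Align test columns to the training columns (drops Parch_9, adds any missing dummies)
X_test = X_test.reindex(columns=X.columns, fill_value=False)

# Feature scaling: fit on the training data only, then apply the same transform to the test data
scaler = StandardScaler()
X[['Age', 'Fare']] = scaler.fit_transform(X[['Age', 'Fare']])
X_test[['Age', 'Fare']] = scaler.transform(X_test[['Age', 'Fare']])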

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)
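
The scaling step above calls `fit_transform` on the training data and only `transform` on the test data, so the test set is standardized with the training mean and standard deviation. Here is a small sketch of that behavior on made-up numbers (the toy frames and the names `toy_scaler`, `train_scaled`, and `test_scaled` are assumptions for illustration only).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric features standing in for Age and Fare.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "Fare": [10.0, 20.0, 60.0]})
toy_test = pd.DataFrame({"Age": [30.0, 50.0], "Fare": [30.0, 5.0]})

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean/std from train
test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

print("Learned means:", toy_scaler.mean_)
print("Train column means after scaling:", train_scaled.mean(axis=0))
# A test row equal to the training means maps to zeros.
print("First test row after scaling:", test_scaled[0])
```

The scaled test columns are generally not centered at zero, which is expected: they are measured relative to the training distribution.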

In [None]:
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('Cross-Validation Accuracy Scores', scores)
print('Mean Cross-Validation Accuracy Score', scores.mean())

Cross-Validation Accuracy Scores [0.75977654 0.83707865 0.86516854 0.79213483 0.82022472]
Mean Cross-Validation Accuracy Score 0.8148766555771765
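
To make the evaluation step more concrete, the sketch below reproduces what `cross_val_score` does with `cv=5` for a classifier: it splits the data into stratified folds, fits a fresh clone of the model on each training split, and scores it on the held-out split. It runs on synthetic data from `make_classification` (an assumption for illustration, not the Titanic data), and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for the real features).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=1)

# Built-in helper: an integer cv with a classifier uses stratified folds.
auto_scores = cross_val_score(demo_model, X_demo, y_demo, cv=5, scoring="accuracy")

# Manual equivalent: one fresh clone per fold, scored on the held-out part.
manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(demo_model)
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(accuracy_score(y_demo[val_idx], fold_pred))

print("cross_val_score:", auto_scores)
print("manual loop:    ", np.array(manual_scores))
```

Because each fold trains its own clone, the `model` object passed to `cross_val_score` is left unchanged, so the predictions made earlier with `model.fit(X, y)` come from a model trained on the full training set.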


In [None]:
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submissions/random_forest_2.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!
