## The Challenge

https://www.kaggle.com/c/titanic/

The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc). 

Output should be a .csv containing the columns "PassenderId" and "Survived" on the test set.

In [1319]:
import pandas as pd

train_set = pd.read_csv('datasets/Titanic_Train.csv')
test_set = pd.read_csv('datasets/Titanic_Test.csv')

In [1320]:
train_set.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [1321]:
# Look at number of missing values across the dataframe

train_set.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [1322]:
for col_name in train_set.columns:
    print(col_name, len(train_set[col_name].unique()))

PassengerId 891
Survived 2
Pclass 3
Name 891
Sex 2
Age 89
SibSp 7
Parch 7
Ticket 681
Fare 248
Cabin 148
Embarked 4


Columns that might be useful for the model:
 - Survived
 - Pclass
 - Sex
 - Age
 - SibSp
 - Parch
 - Fare
 - Embarked
 
Columns that might be useful:
 - Cabin: In the current form, there are too many values to be useful, we will split this column up next
 
Columns that probably won't be useful:
 - PassengerId
 - Ticket
 - Name

The level of each cabin (the letter) was separated and each entry was labeled whether it was located at the front, back, or centre of the ship. (data from: https://www.encyclopedia-titanica.org/titanic-deckplans.html)

In [1323]:
cabin_loc = pd.read_csv('all_cabins.csv')

In [1324]:
cabin_loc.head()

Unnamed: 0,Cabin,Level,Position
0,A10,A,Front
1,A14,A,Front
2,A16,A,Front
3,A19,A,Front
4,A20,A,Front


Add the level of the cabin to the train dataframe

In [1325]:
# Match cabins to levels

level_list_train = []
for cabin in train_set['Cabin']:
    if cabin:
        level_list_train.append(cabin_loc['Level'].loc[cabin_loc['Cabin'] == cabin].values)
        
level_list_test = []
for cabin in test_set['Cabin']:
    if cabin:
        level_list_test.append(cabin_loc['Level'].loc[cabin_loc['Cabin'] == cabin].values)

In [1326]:
# Match cabins to positions

pos_list_train = []
for cabin in train_set['Cabin']:
    if cabin:
        pos_list_train.append(cabin_loc['Position'].loc[cabin_loc['Cabin'] == cabin].values)
        
pos_list_test = []
for cabin in test_set['Cabin']:
    if cabin:
        pos_list_test.append(cabin_loc['Position'].loc[cabin_loc['Cabin'] == cabin].values)

In [1327]:
# Add the new columns to the train and test dataframes

train_set['Cabin_Level'] = pd.DataFrame(level_list_train)
test_set['Cabin_Level'] = pd.DataFrame(level_list_test)

train_set['Cabin_Pos'] = pd.DataFrame(pos_list_train)
test_set['Cabin_Pos'] = pd.DataFrame(pos_list_test)

In [1328]:
train_set.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_Level,Cabin_Pos
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C,Centre
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C,Centre
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,,


In [1329]:
# Combine columns into a 'RelativesOnboard' column
train_set["RelativesOnboard"] = train_set["SibSp"] + train_set["Parch"]
train_set = train_set.drop(['SibSp', 'Parch'], axis=1)

test_set["RelativesOnboard"] = test_set["SibSp"] + test_set["Parch"]
test_set = test_set.drop(['SibSp', 'Parch'], axis=1)

In [1330]:
train_set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,Cabin_Level,Cabin_Pos,RelativesOnboard
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S,,,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C,C,Centre,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,STON/O2. 3101282,7.9250,,S,,,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S,C,Centre,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S,,,0
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,112053,30.0000,B42,S,B,Front,0
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,W./C. 6607,23.4500,,S,,,3
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0000,C148,C,C,Centre,0


Create a pipeline for numerical attributes

In [1331]:
from sklearn.base import BaseEstimator, TransformerMixin

# Select numerical or categorical columns 
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

In [1332]:
# Create pipeline to select the numerical attributes and impute missing values with the median

from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer

num_pipeline = Pipeline([
        ("select_numeric", DataFrameSelector(["RelativesOnboard", "Fare"])),
        ("imputer", KNNImputer(n_neighbors=3)),
    ])


In [1333]:
num_pipeline.fit_transform(train_set)

array([[ 1.    ,  7.25  ],
       [ 1.    , 71.2833],
       [ 0.    ,  7.925 ],
       ...,
       [ 3.    , 23.45  ],
       [ 0.    , 30.    ],
       [ 0.    ,  7.75  ]])

Create a pipeline for categorical attributes

In [1334]:
class MostFrequentImputer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.most_frequent_ = pd.Series([X[c].value_counts().index[0] for c in X],
                                        index=X.columns)
        return self
    def transform(self, X, y=None):
        return X.fillna(self.most_frequent_)

In [1335]:
# Create age categories

bins= [0, 15, 30, 45, 60, 75, 90]
labels = ['0-15', '16-30', '31-45', '46-60', '61-75', '76-90']
train_set['Age'] = pd.cut(train_set['Age'], bins=bins, labels=labels, right=False)
test_set['Age'] = pd.cut(test_set['Age'], bins=bins, labels=labels, right=False)

In [1336]:
# Map the categorical attributes to numbers
train_set['Sex'] = train_set['Sex'].map({'male':0, 'female':1})
train_set['Embarked'] = train_set['Embarked'].map({'Q':0, 'S':1, 'C':2})
train_set['Cabin_Level'] = train_set['Cabin_Level'].map({'A':0, 'B':1, 'C':2, 'D':3, 'E':4, 'F':5, 'G':6})
train_set['Cabin_Pos'] = train_set['Cabin_Pos'].map({'Front':0, 'Centre':1, 'Back':2})
train_set['Age'] = train_set['Age'].map({'0-15':0, '16-30':1, '31-45':2, '46-60':3, '61-75':4, '76-90':5})

# Map the categorical attributes to numbers
test_set['Sex'] = test_set['Sex'].map({'male':0, 'female':1})
test_set['Embarked'] = test_set['Embarked'].map({'Q':0, 'S':1, 'C':2})
test_set['Cabin_Level'] = test_set['Cabin_Level'].map({'A':0, 'B':1, 'C':2, 'D':3, 'E':4, 'F':5, 'G':6})
test_set['Cabin_Pos'] = test_set['Cabin_Pos'].map({'Front':0, 'Centre':1, 'Back':2})
test_set['Age'] = test_set['Age'].map({'0-15':0, '16-30':1, '31-45':2, '46-60':3, '61-75':4, '76-90':5})

In [1337]:
# Impute the mapped categorical columns and round to the nearest whole number

import numpy as np
testimputer = KNNImputer(n_neighbors=3)
train_set[["Pclass", "Sex", "Age", "Embarked", "Cabin_Level", "Cabin_Pos"]] = pd.DataFrame(np.round(testimputer.fit_transform(train_set[["Pclass", "Sex", "Age", "Embarked", "Cabin_Level", "Cabin_Pos"]])))

test_set[["Pclass", "Sex", "Age", "Embarked", "Cabin_Level", "Cabin_Pos"]] = pd.DataFrame(np.round(testimputer.fit_transform(test_set[["Pclass", "Sex", "Age", "Embarked", "Cabin_Level", "Cabin_Pos"]])))

In [1338]:
train_set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,Cabin_Level,Cabin_Pos,RelativesOnboard
0,1,0,3.0,"Braund, Mr. Owen Harris",0.0,1.0,A/5 21171,7.2500,,1.0,5.0,1.0,1
1,2,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1.0,2.0,PC 17599,71.2833,C85,2.0,2.0,1.0,1
2,3,1,3.0,"Heikkinen, Miss. Laina",1.0,1.0,STON/O2. 3101282,7.9250,,1.0,5.0,1.0,0
3,4,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1.0,2.0,113803,53.1000,C123,1.0,2.0,1.0,1
4,5,0,3.0,"Allen, Mr. William Henry",0.0,2.0,373450,8.0500,,1.0,4.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2.0,"Montvila, Rev. Juozas",0.0,1.0,211536,13.0000,,1.0,2.0,1.0,0
887,888,1,1.0,"Graham, Miss. Margaret Edith",1.0,1.0,112053,30.0000,B42,1.0,1.0,0.0,0
888,889,0,3.0,"Johnston, Miss. Catherine Helen ""Carrie""",1.0,1.0,W./C. 6607,23.4500,,1.0,6.0,2.0,3
889,890,1,1.0,"Behr, Mr. Karl Howell",0.0,1.0,111369,30.0000,C148,2.0,2.0,1.0,0


In [1339]:
train_set

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked,Cabin_Level,Cabin_Pos,RelativesOnboard
0,1,0,3.0,"Braund, Mr. Owen Harris",0.0,1.0,A/5 21171,7.2500,,1.0,5.0,1.0,1
1,2,1,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1.0,2.0,PC 17599,71.2833,C85,2.0,2.0,1.0,1
2,3,1,3.0,"Heikkinen, Miss. Laina",1.0,1.0,STON/O2. 3101282,7.9250,,1.0,5.0,1.0,0
3,4,1,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1.0,2.0,113803,53.1000,C123,1.0,2.0,1.0,1
4,5,0,3.0,"Allen, Mr. William Henry",0.0,2.0,373450,8.0500,,1.0,4.0,1.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2.0,"Montvila, Rev. Juozas",0.0,1.0,211536,13.0000,,1.0,2.0,1.0,0
887,888,1,1.0,"Graham, Miss. Margaret Edith",1.0,1.0,112053,30.0000,B42,1.0,1.0,0.0,0
888,889,0,3.0,"Johnston, Miss. Catherine Helen ""Carrie""",1.0,1.0,W./C. 6607,23.4500,,1.0,6.0,2.0,3
889,890,1,1.0,"Behr, Mr. Karl Howell",0.0,1.0,111369,30.0000,C148,2.0,2.0,1.0,0


In [1340]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
        ("select_cat", DataFrameSelector(["Pclass", "Sex", "Age", "Embarked", "Cabin_Level"])),
        ("cat_encoder", OneHotEncoder(sparse=False)),
    ])

In [1341]:
cat_pipeline.fit_transform(train_set)
cat_pipeline.fit_transform(test_set)

array([[0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.],
       [0., 0., 1., ..., 0., 1., 0.]])

In [1342]:
from sklearn.pipeline import FeatureUnion
preprocess_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

In [1343]:
X_train = preprocess_pipeline.fit_transform(train_set)
y_train = train_set["Survived"]
X_test = preprocess_pipeline.fit_transform(test_set)

In [1344]:
pd.DataFrame(X_train)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,1.0,7.2500,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,71.2833,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,7.9250,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,53.1000,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,8.0500,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0.0,13.0000,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
887,0.0,30.0000,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
888,3.0,23.4500,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
889,0.0,30.0000,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [1345]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_scores = cross_val_score(forest_clf, X_train, y_train, cv=10)
forest_scores.mean()

0.8081023720349563

In [1346]:
from sklearn.model_selection import GridSearchCV


param_grid = [
    {
     'n_estimators':[21, 22, 23, 24, 25, 26, 27, 28],
     'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'max_features': ['auto', 'sqrt', 'log2'], 
    }
]

grid_search = GridSearchCV(forest_clf, param_grid, cv=10,
                          verbose=3,
                          n_jobs=-1)

grid_search.fit(X_train, y_train)

Fitting 10 folds for each of 240 candidates, totalling 2400 fits


GridSearchCV(cv=10, estimator=RandomForestClassifier(random_state=42),
             n_jobs=-1,
             param_grid=[{'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'max_features': ['auto', 'sqrt', 'log2'],
                          'n_estimators': [21, 22, 23, 24, 25, 26, 27, 28]}],
             verbose=3)

In [1347]:
grid_search.best_params_

{'max_depth': 7, 'max_features': 'auto', 'n_estimators': 22}

In [1348]:
grid_search.best_score_

0.8384269662921348

In [1349]:
forest_clf = RandomForestClassifier(n_estimators=22, max_depth=7, random_state=42)

In [1350]:
X_test = preprocess_pipeline.transform(test_set)

In [1351]:
forest_clf.fit(X_train, y_train)

RandomForestClassifier(max_depth=7, n_estimators=22, random_state=42)

In [1352]:
y_pred = forest_clf.predict(X_test)

In [1353]:
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0,