# Titanic Survival Classifier

## Preparation

In [1]:
import numpy as np
import pandas as pd

Load up the provided training dataset and index by the provided passenger ID.

In [2]:
training_set = pd.read_csv('./data/train.csv').set_index('PassengerId')

Take a quick peek and make sure everything is as expected.

In [3]:
training_set.head(5)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


As before, split the labeled training dataset further into training and testing subsets. Unlike Orange, `scikit-learn` requires some manual data preprocessing. First, categorical features such as `Sex` are expanded into dummy variables with `pd.get_dummies`. Then, after the split, `np.ravel` flattens `y` into the one-dimensional array expected by the model's `score` method.

In [4]:
from sklearn.model_selection import train_test_split

X = pd.get_dummies(training_set[['Pclass', 'Sex', 'Age', 'Fare']])
y = training_set[['Survived']]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
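
As a small illustration of the two preprocessing steps above, the sketch below applies `pd.get_dummies` and `np.ravel` to a toy frame. The `toy` data and the `toy_X` / `toy_y` names are made up for this example and are not part of the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic columns used above (values are made up).
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 0],
})

# Only the string column is expanded; the numeric column passes through unchanged.
toy_X = pd.get_dummies(toy[['Pclass', 'Sex']])
print(list(toy_X.columns))  # ['Pclass', 'Sex_female', 'Sex_male']

# Double brackets keep a two-dimensional frame; np.ravel flattens it to one dimension.
toy_y = toy[['Survived']]
print(toy_y.shape, np.ravel(toy_y).shape)  # (4, 1) (4,)
```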

## Random Forest Classifier Revisited

Now, load up the random forest classifier that was created and pickled in the Orange visual environment.

In [5]:
import pickle

with open('titanic-randomforest.pkcls', 'rb') as f:
    trained_random_forest = pickle.load(f)

Inspect the domain to ensure alignment in the ordering of features.

In [6]:
trained_random_forest.domain

[Pclass, Sex=female, Sex=male, Age, Fare | Survived] {PassengerId, Name}
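
The ordering matters because the underlying model receives a bare array and matches features by position, not by name. The following sketch, using a made-up two-row frame, shows that `pd.get_dummies` places the encoded columns after the numeric ones, so the columns have to be reselected in the domain's order. The `domain_order` list is an assumption that maps the Orange names (`Sex=female`) to the names `get_dummies` produces (`Sex_female`).

```python
import pandas as pd

# Domain order from above, spelled with the column names get_dummies produces.
domain_order = ['Pclass', 'Sex_female', 'Sex_male', 'Age', 'Fare']

# Made-up rows with the same four source columns as the notebook.
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Sex': ['male', 'female'],
    'Age': [22.0, 38.0],
    'Fare': [7.25, 71.2833],
})
dummies = pd.get_dummies(toy)

# The encoded Sex columns are appended after the numeric columns.
print(list(dummies.columns))  # ['Pclass', 'Age', 'Fare', 'Sex_female', 'Sex_male']

# Reselecting by name puts the columns into the order the model was trained with.
aligned = dummies[domain_order]
print(list(aligned.columns))  # ['Pclass', 'Sex_female', 'Sex_male', 'Age', 'Fare']
```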

Because scikit-learn is [more flexible and less lenient](https://scikit-learn.org/stable/modules/impute.html), missing values must be imputed explicitly before proceeding.

In [7]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_test_imputed = imputer.fit_transform(X_test[['Pclass', 'Sex_female', 'Sex_male', 'Age', 'Fare']])
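
To make the `strategy='mean'` behavior concrete, here is a minimal sketch on made-up arrays. It separates `fit` (which learns the per-column means) from `transform` (which fills the gaps); that split, along with the `train`, `new`, and `imp` names, is an assumption for illustration and differs from the single `fit_transform` call used in the cell above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: each column has one missing entry.
train = np.array([[20.0, 10.0],
                  [40.0, np.nan],
                  [np.nan, 30.0]])
# A new row where both values are missing.
new = np.array([[np.nan, np.nan]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

# fit() stores the mean of the observed values in each column.
imp.fit(train)
print(imp.statistics_)  # [30. 20.]

# transform() replaces each NaN with the stored mean of its column.
filled = imp.transform(new)
print(filled)  # [[30. 20.]]
```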

### Result

In [8]:
'Accuracy: {:f}%'.format(trained_random_forest.skl_model.score(X_test_imputed, y_test) * 100)

'Accuracy: 94.170404%'

## Automated Tuning

This initial model was hand-tuned with a guess-and-check approach. Can it be beaten with fancier tuning? [TPOT](https://epistasislab.github.io/tpot/) optimizes learning pipelines using genetic programming.

In [9]:
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=1, population_size=100, cv=10, n_jobs=-1, memory='auto', verbosity=2)
tpot.fit(X_train, y_train)
tpot.export('tpot_titanic_pipeline.py')

Imputing missing values in feature set



Generation 1 - Current best internal CV score: 0.8233431771623166

Best pipeline: RandomForestClassifier(input_matrix, bootstrap=True, criterion=gini, max_features=0.35000000000000003, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
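
For reference, the reported best pipeline roughly corresponds to constructing a plain scikit-learn `RandomForestClassifier` with the hyperparameters listed above. The sketch below does that on synthetic data; the `X_demo` / `y_demo` arrays and the `random_state` argument are assumptions added for illustration, and the contents of the exported `tpot_titanic_pipeline.py` are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 rows, 5 features, label depends on the first two.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hyperparameters copied from the "Best pipeline" line above.
forest = RandomForestClassifier(
    bootstrap=True,
    criterion='gini',
    max_features=0.35,
    min_samples_leaf=2,
    min_samples_split=3,
    n_estimators=100,
    random_state=0,  # assumption: not in the TPOT output, added for repeatability
)
forest.fit(X_demo, y_demo)

# One predicted label per input row.
print(forest.predict(X_demo[:5]).shape)  # (5,)
```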


### Result

In [10]:
'Accuracy: {:f}%'.format(tpot.score(X_test, y_test) * 100)

Imputing missing values in feature set


'Accuracy: 85.201794%'

## Prediction

With trained models in hand, load up the test dataset, massage the data as needed, add predictions, and export to CSV.

In [11]:
test_set = pd.read_csv('./data/test.csv').set_index('PassengerId')
test_set_dummies = pd.get_dummies(test_set[['Pclass', 'Sex', 'Age', 'Fare']])

# Hand-tuned
test_set_imputed = imputer.fit_transform(test_set_dummies[['Pclass', 'Sex_female', 'Sex_male', 'Age', 'Fare']])
test_set['Survived'] = trained_random_forest.skl_model.predict(test_set_imputed)
test_set['Survived'] = test_set['Survived'].astype(int)
test_set[['Survived']].to_csv('final-random-forest.csv')

# TPOT optimized
test_set['Survived'] = tpot.predict(test_set_dummies)
test_set['Survived'] = test_set['Survived'].astype(int)
test_set[['Survived']].to_csv('final-tpot.csv')

Imputing missing values in feature set
