# Building the model

The model is an ensemble of:

* a decision tree that tries to capture the ["women and children first"](https://en.wikipedia.org/wiki/Women_and_children_first) protocol,

* a random forest classifier,

* and logistic regression with L1 regularization.

The predicted class is decided by majority vote. Before the ensemble runs, decision trees impute the missing `Fare` and `Age` values; see the model pipeline below.
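To make the voting rule concrete, here is a minimal sketch of hard (majority) voting in plain Python. The function name `majority_vote` and the toy predictions are illustrative assumptions, not part of the notebook; the notebook itself relies on scikit-learn's `VotingClassifier(voting='hard')`.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label for each sample.

    `predictions` is a list of per-model label lists, all of equal length.
    """
    return [
        Counter(sample_votes).most_common(1)[0][0]
        for sample_votes in zip(*predictions)
    ]


# Toy predictions from three hypothetical models for four passengers.
tree = [1, 0, 1, 0]
forest = [1, 1, 0, 0]
logreg = [0, 1, 1, 0]

print(majority_vote([tree, forest, logreg]))  # [1, 1, 1, 0]
```

With three binary classifiers there is never a tie, so the vote is always well defined.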

In [1]:
%matplotlib inline

import random
import itertools

import numpy as np
import pandas as pd
import patsy
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV

from util import get_random_seed, ModelImputer, FeatureSelector

## Random seeds

In [2]:
random_seed = get_random_seed()
print('np.random.seed:', random_seed)
np.random.seed(random_seed)

random_seed = get_random_seed()
print('random.seed:', random_seed)
random.seed(random_seed)

np.random.seed: 643783166
random.seed: 434102240


## Load training data

In [3]:
train_data = pd.read_csv('clean_data/train.csv')

formula = 'Survived ~ Embarked + Class + Cabin + Fare + Sex + Age + Relatives - 1'
y_train, X_train = patsy.dmatrices(formula, train_data, NA_action='raise')

Print the column indices of the design matrix; they are passed to `ModelImputer` and `FeatureSelector` below.

In [4]:
pd.DataFrame({'Variable': X_train.design_info.column_names})

| | Variable |
|---|---|
| 0 | Embarked[Cherbourg] |
| 1 | Embarked[Queenstown] |
| 2 | Embarked[Southampton] |
| 3 | Class[T.second] |
| 4 | Class[T.third] |
| 5 | Cabin[T.B] |
| 6 | Cabin[T.C] |
| 7 | Cabin[T.D] |
| 8 | Cabin[T.E] |
| 9 | Cabin[T.F] |
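Hard-coded column positions are easy to get wrong, so it can help to look them up by name. The sketch below does this on the first ten column names copied from the output above; the helper `column_indices` is a hypothetical name introduced here. In the notebook, the full list would come from `X_train.design_info.column_names`.

```python
# First ten column names, copied from the output above.
column_names = [
    'Embarked[Cherbourg]', 'Embarked[Queenstown]', 'Embarked[Southampton]',
    'Class[T.second]', 'Class[T.third]',
    'Cabin[T.B]', 'Cabin[T.C]', 'Cabin[T.D]', 'Cabin[T.E]', 'Cabin[T.F]',
]


def column_indices(names, prefix):
    """Return the positions of all columns whose name starts with `prefix`."""
    return [i for i, name in enumerate(names) if name.startswith(prefix)]


print(column_indices(column_names, 'Class'))  # [3, 4]
print(column_names.index('Cabin[T.D]'))       # 7
```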


## Build the model pipeline

In [5]:
complete_variables = [i for i in range(17) if i not in (14, 15)]

ensemble = VotingClassifier(voting='hard', estimators=[
    ('women_and_children', make_pipeline(FeatureSelector([13, 15]), DecisionTreeClassifier())),
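    ('rf', RandomForestClassifier()),
    # The L1 penalty needs a solver that supports it; recent scikit-learn defaults to lbfgs, which does not.
    ('lr', LogisticRegression(penalty='l1', solver='liblinear')),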
])

pipeline = Pipeline([
    ('fare_imputer', ModelImputer(DecisionTreeRegressor(), 14, complete_variables)),
    ('age_imputer', ModelImputer(DecisionTreeRegressor(), 15, complete_variables)),
    ('ensemble', ensemble),
])
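`FeatureSelector` and `ModelImputer` come from the local `util` module, which is not shown here. The following is only a rough sketch of what transformers with those roles might look like. The class names `ColumnSelectorSketch` and `ModelImputerSketch` are hypothetical, and the sketch assumes that missing values are encoded as `NaN`, which may differ from the encoding used in `clean_data`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.tree import DecisionTreeRegressor


class ColumnSelectorSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for util.FeatureSelector: keep the given columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X)[:, self.columns]


class ModelImputerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for util.ModelImputer.

    Assumes missing values are NaN. Values missing in column `target`
    are predicted from the columns listed in `predictors`.
    """

    def __init__(self, model, target, predictors):
        self.model = model
        self.target = target
        self.predictors = predictors

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        known = ~np.isnan(X[:, self.target])
        # Fit a fresh copy of the model on the rows where the target is known.
        self.model_ = clone(self.model)
        self.model_.fit(X[known][:, self.predictors], X[known, self.target])
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy, so the input is left untouched
        missing = np.isnan(X[:, self.target])
        if missing.any():
            X[missing, self.target] = self.model_.predict(
                X[missing][:, self.predictors]
            )
        return X


# Toy data: column 1 is partly missing and is predicted from column 0.
X_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, np.nan], [1.0, np.nan]])
imputer = ModelImputerSketch(DecisionTreeRegressor(random_state=0), 1, [0])
X_filled = imputer.fit(X_toy).transform(X_toy)
print(X_filled[:, 1])
```

Imputing inside the pipeline means the imputation models are refit on each training fold during cross-validation, so no information from the held-out fold leaks into them.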

## Estimate the model parameters

In [6]:
param_grid = {
    'ensemble__rf__n_estimators': np.arange(10, 100, 10),
    'ensemble__lr__C': np.logspace(-4, 4, 30),
}

model = GridSearchCV(pipeline, param_grid=param_grid, cv=10, n_jobs=4, verbose=1)
model.fit(np.asarray(X_train), np.asarray(y_train.ravel()))

print('Best accuracy:', model.best_score_)
print('Best parameters:', model.best_params_)

Fitting 10 folds for each of 270 candidates, totalling 2700 fits


[Parallel(n_jobs=4)]: Done 268 tasks      | elapsed:    8.5s
[Parallel(n_jobs=4)]: Done 868 tasks      | elapsed:   26.7s
[Parallel(n_jobs=4)]: Done 1868 tasks      | elapsed:  1.0min
[Parallel(n_jobs=4)]: Done 2700 out of 2700 | elapsed:  1.5min finished


Best accuracy: 0.834645669291
Best parameters: {'ensemble__rf__n_estimators': 10, 'ensemble__lr__C': 1.3738237958832638}
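The "270 candidates, totalling 2700 fits" line in the log follows directly from the grid sizes. A short check of that arithmetic, reusing the same grid definition:

```python
import numpy as np

param_grid = {
    'ensemble__rf__n_estimators': np.arange(10, 100, 10),  # 10, 20, ..., 90
    'ensemble__lr__C': np.logspace(-4, 4, 30),             # 30 values from 1e-4 to 1e4
}

# The grid search tries every combination of the listed values.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)

cv_folds = 10
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)  # 270 2700
```

The reported best `C` of about 1.374 is the grid point at index 15 of the logarithmic grid, and the best `n_estimators` of 10 is the smallest value tried.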


## Create a submission

In [7]:
test_data = pd.read_csv('clean_data/test.csv')

X_test = patsy.build_design_matrices([X_train.design_info], test_data, NA_action='raise')[0]
y_test = model.predict(X_test)

submission = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': y_test}, dtype=int)
submission.to_csv('submission/submission.csv', index=False)