# Building the model

The model is an ensemble of:

* a decision tree that tries to capture the ["women and children first"](https://en.wikipedia.org/wiki/Women_and_children_first) protocol,

* a random forest classifier,

* and logistic regression with L1 regularization.

The predicted class is decided by soft voting. Before the ensemble runs, decision trees impute the missing `Fare` and `Age` values; see the model pipeline below.
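
As a rough illustration of soft voting: each base classifier outputs class probabilities, the probabilities are averaged, and the class with the highest mean wins. The probabilities below are made up for two hypothetical passengers and do not come from the fitted model; the second passenger shows how a soft vote can differ from a simple majority vote.

```python
import numpy as np

# Made-up probabilities [P(died), P(survived)] for two passengers,
# one array per base classifier (illustrative values only).
proba_tree = np.array([[0.90, 0.10], [0.05, 0.95]])
proba_rf = np.array([[0.70, 0.30], [0.55, 0.45]])
proba_lr = np.array([[0.80, 0.20], [0.60, 0.40]])
all_proba = [proba_tree, proba_rf, proba_lr]

# Soft voting: average the probabilities, then pick the most likely class.
mean_proba = np.mean(all_proba, axis=0)
soft_vote = mean_proba.argmax(axis=1)

# Hard voting for comparison: each classifier casts one vote for its top class.
votes = np.array([p.argmax(axis=1) for p in all_proba])
hard_vote = (votes.sum(axis=0) >= 2).astype(int)

print('soft vote:', soft_vote)
print('hard vote:', hard_vote)
```

For the second passenger, one very confident classifier outweighs two mildly opposed ones under soft voting, while the majority vote goes the other way.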

In [1]:
%matplotlib inline

import random
import pprint
import itertools

import numpy as np
import pandas as pd
from scipy.optimize import minimize
import patsy
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import GridSearchCV

from util import get_random_seed, ModelImputer, FeatureSelector

## Random seeds

In [2]:
random_seed = get_random_seed()
print('np.random.seed:', random_seed)
np.random.seed(random_seed)

random_seed = get_random_seed()
print('random.seed:', random_seed)
random.seed(random_seed)

np.random.seed: 1714891882
random.seed: 2679597300
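
`get_random_seed` lives in the local `util` module, which is not shown here. The sketch below is only a guess at what such a helper could look like (the name `get_random_seed_sketch` is hypothetical): it draws a 32-bit integer from the operating system, which fits the range `np.random.seed` accepts (0 to 2**32 - 1). Printing the seed, as the cell above does, is what makes a run reproducible later.

```python
import os
import random


def get_random_seed_sketch():
    """Return a fresh integer seed in [0, 2**32 - 1] (hypothetical helper)."""
    return int.from_bytes(os.urandom(4), byteorder='big')


seed = get_random_seed_sketch()
print('random.seed:', seed)

random.seed(seed)
first_draw = random.random()

# Re-seeding with the printed value reproduces the same random sequence.
random.seed(seed)
second_draw = random.random()
print(first_draw == second_draw)
```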


## Load training data

In [3]:
train_data = pd.read_csv('clean_data/train.csv')

formula = 'Survived ~ Embarked + Class + Cabin + Fare + Sex + Age + Relatives - 1'
y_train, X_train = patsy.dmatrices(formula, train_data)

List the design-matrix columns with their indices; the model pipeline below refers to variables by these indices.

In [4]:
pd.DataFrame({'Variable': X_train.design_info.column_names})

| | Variable |
|---|---|
| 0 | Embarked[Cherbourg] |
| 1 | Embarked[Queenstown] |
| 2 | Embarked[Southampton] |
| 3 | Class[T.second] |
| 4 | Class[T.third] |
| 5 | Cabin[T.B] |
| 6 | Cabin[T.C] |
| 7 | Cabin[T.D] |
| 8 | Cabin[T.E] |
| 9 | Cabin[T.F] |
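
The column names follow from how patsy codes the formula. A minimal sketch on a made-up four-row data frame (not the real training data): because `- 1` removes the intercept, the first categorical term gets one column per level, while later categorical terms drop their reference level and show up as `[T.level]` columns.

```python
import pandas as pd
import patsy

# Made-up toy data; only the column names mirror the notebook.
toy = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Embarked': ['Cherbourg', 'Queenstown', 'Southampton', 'Cherbourg'],
    'Class': ['first', 'second', 'third', 'second'],
    'Fare': [7.25, 71.3, 8.05, 13.0],
})

y_toy, X_toy = patsy.dmatrices('Survived ~ Embarked + Class + Fare - 1', toy)

# Print each column index next to its name, as in the table above.
toy_columns = X_toy.design_info.column_names
for index, name in enumerate(toy_columns):
    print(index, name)
```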


## Build the model pipeline

In [5]:
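# Soft-voting ensemble of the three base classifiers.
ensemble = VotingClassifier(voting='soft', estimators=[
    # "Women and children first": a tree restricted to columns 13 and 15
    # (15 is the Age column imputed in the pipeline below).
    ('women_and_children', make_pipeline(FeatureSelector([13, 15]),
                                         DecisionTreeClassifier())),
    ('rf', RandomForestClassifier()),
    # The L1 penalty needs a solver that supports it, such as liblinear.
    ('lr', LogisticRegression(penalty='l1', solver='liblinear', tol=0.00001)),
])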

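# All 17 columns except 14 (Fare) and 15 (Age), the two that get imputed.
complete_variables = [i for i in range(17) if i not in (14, 15)]

pipeline = Pipeline([
    # Fill in Fare (column 14) using the complete columns.
    ('fare_imputer', ModelImputer(DecisionTreeRegressor(), 14,
                                  complete_variables)),
    # Fill in Age (column 15) using the complete columns plus the imputed Fare.
    ('age_imputer', ModelImputer(DecisionTreeRegressor(), 15,
                                 complete_variables + [14])),
    ('ensemble', ensemble),
])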
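
`ModelImputer` and `FeatureSelector` are imported from the local `util` module, whose source is not shown. The sketch below is one plausible way to write such transformers, not the author's implementation: the class names are hypothetical, the argument order (estimator, target column, predictor columns) is inferred from the calls above, and missing values are assumed to be encoded as NaN, which may differ from the real cleaned data.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.tree import DecisionTreeRegressor


class FeatureSelectorSketch(BaseEstimator, TransformerMixin):
    """Keep only the given column indices (hypothetical stand-in)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X)[:, self.columns]


class ModelImputerSketch(BaseEstimator, TransformerMixin):
    """Fill NaNs in one column with predictions from the other columns."""

    def __init__(self, estimator, target, predictors):
        self.estimator = estimator
        self.target = target
        self.predictors = predictors

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        known = ~np.isnan(X[:, self.target])  # assumption: NaN marks missing
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X[known][:, self.predictors], X[known, self.target])
        return self

    def transform(self, X):
        X = np.array(X, dtype=float)  # copy so the input is left untouched
        missing = np.isnan(X[:, self.target])
        if missing.any():
            X[missing, self.target] = self.estimator_.predict(
                X[missing][:, self.predictors])
        return X


# Tiny made-up example: column 1 is predicted from column 0.
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, np.nan], [1.0, np.nan]])
imputer = ModelImputerSketch(DecisionTreeRegressor(random_state=0),
                             target=1, predictors=[0])
X_filled = imputer.fit(X_demo).transform(X_demo)
print(X_filled)
```

Chaining two such imputers, as the pipeline does, lets the second one use the column the first has already filled in.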

## Estimate the model parameters

In [6]:
param_grid = {
    'ensemble__rf__n_estimators': np.arange(10, 100, 10),
    'ensemble__lr__C': np.logspace(-4, 4, 30),
}

# Exhaustive search over the grid, scored by 10-fold cross-validation.
model = GridSearchCV(pipeline, param_grid=param_grid, cv=10, n_jobs=-1, verbose=1)
model.fit(np.asarray(X_train), np.asarray(y_train).ravel())

print('Best score:', model.best_score_)
print('Best weights:', model.best_params_)

Fitting 10 folds for each of 270 candidates, totalling 2700 fits


[Parallel(n_jobs=-1)]: Done 268 tasks      | elapsed:    8.4s
[Parallel(n_jobs=-1)]: Done 872 tasks      | elapsed:   27.2s
[Parallel(n_jobs=-1)]: Done 1872 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 2316 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 2700 out of 2700 | elapsed:  2.0min finished


Best score: 0.838020247469
Best weights: {'ensemble__lr__C': 4.8939009184774891, 'ensemble__rf__n_estimators': 50}
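
The fit counts in the log follow directly from the grid. A quick check, using only the grid definitions and the reported best `C` from above:

```python
import numpy as np

n_estimators_grid = np.arange(10, 100, 10)  # 10, 20, ..., 90 (100 is excluded)
C_grid = np.logspace(-4, 4, 30)             # 30 values from 1e-4 to 1e4

n_candidates = len(n_estimators_grid) * len(C_grid)
n_fits = n_candidates * 10                  # cv=10 folds per candidate
print(n_candidates, 'candidates,', n_fits, 'fits')

# The reported best C is one of the grid points, not an interpolated value.
best_index = int(np.argmin(np.abs(C_grid - 4.8939009184774891)))
print('best C is grid point', best_index, 'with value', C_grid[best_index])
```

Because the selected `C` and `n_estimators` both sit inside their ranges rather than at an edge, the grid bounds do not appear to be limiting the search.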


## Create a submission

In [7]:
test_data = pd.read_csv('clean_data/test.csv')

# Reuse the training design info so the test matrix has the same columns.
X_test = patsy.build_design_matrices([X_train.design_info], test_data)[0]
y_test = model.predict(X_test)

# Cast to int so that Survived is written as 0/1.
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'],
                           'Survived': y_test}, dtype=int)
submission.to_csv('submission/submission.csv', index=False)