# Building the model

The model is an ensemble of:

* a decision tree that tries to capture the ["women and children first"](https://en.wikipedia.org/wiki/Women_and_children_first) protocol,

* a support vector machine (SVM) with an RBF kernel, and

* a random forest classifier.

The predicted class is decided by majority vote (hard voting).
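As a minimal illustration of hard voting, the sketch below picks the most common label among three base predictions. The votes are made up for the example and the helper name `majority_vote` is hypothetical; in the notebook itself this step is performed by scikit-learn's `VotingClassifier`.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by most of the base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical votes, one row per passenger:
# (women-and-children tree, SVM, random forest)
votes = [
    (1, 1, 0),
    (0, 1, 0),
    (1, 1, 1),
]
ensemble_prediction = [majority_vote(row) for row in votes]
print(ensemble_prediction)  # [1, 0, 1]
```

With three voters and two classes a tie cannot occur, so no tie-breaking rule is needed here.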

Before the model is built, decision tree regressors impute the missing `Fare` and `Age` values.
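The general idea of model-based imputation can be sketched as follows: fit a regressor on the rows where the target is known, then predict the target for the rows where it is missing. This is only an assumption about how the `ModelImputer` helper works, since the `util` module is not shown here. To stay dependency-free, a one-split regression tree (`fit_stump`, a hypothetical helper) stands in for `DecisionTreeRegressor`, and the toy values are invented.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature by least squares."""
    overall_mean = sum(ys) / len(ys)
    best = None
    # Try every split point except the largest value (it leaves the right side empty).
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # Only one distinct feature value: fall back to the overall mean.
        return lambda x: overall_mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def impute_column(rows, target, feature):
    """Fill missing `target` values using a stump trained on the complete rows."""
    known = [row for row in rows if row[target] is not None]
    predict = fit_stump([row[feature] for row in known],
                        [row[target] for row in known])
    return [row if row[target] is not None
            else {**row, target: predict(row[feature])}
            for row in rows]


# Invented toy data: passenger class as a number, fare partly missing.
rows = [
    {'Class': 1, 'Fare': 80.0},
    {'Class': 1, 'Fare': 60.0},
    {'Class': 3, 'Fare': 8.0},
    {'Class': 3, 'Fare': 10.0},
    {'Class': 3, 'Fare': None},
    {'Class': 1, 'Fare': None},
]
imputed = impute_column(rows, target='Fare', feature='Class')
print([row['Fare'] for row in imputed])  # [80.0, 60.0, 8.0, 10.0, 9.0, 70.0]
```

A full decision tree would split repeatedly on several predictors, but the fit-on-complete-rows, predict-on-missing-rows pattern is the same.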

In [1]:
import random
import pprint

import numpy as np
import pandas as pd
import patsy

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# sklearn.grid_search was removed in scikit-learn 0.20; GridSearchCV now lives here.
from sklearn.model_selection import GridSearchCV

from util import get_random_seed, ModelImputer, ColumnSelector

## Random seeds

In [2]:
random_seed = get_random_seed()
print('np.random.seed:', random_seed)
np.random.seed(random_seed)

random_seed = get_random_seed()
print('random.seed:', random_seed)
random.seed(random_seed)

np.random.seed: 762272023
random.seed: 152070533
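Printing each seed makes a run repeatable: the logged value can be passed back to the seeding call later. The `get_random_seed` helper comes from the local `util` module, which is not shown, so the version below is a hypothetical stand-in. It assumes the seed is a fresh 32-bit integer drawn from OS entropy, a range that `np.random.seed` also accepts.

```python
import random


def get_random_seed():
    """Hypothetical stand-in for util.get_random_seed: a fresh 32-bit seed."""
    return random.SystemRandom().randrange(2 ** 32)


seed = get_random_seed()
print('random.seed:', seed)

# Seeding twice with the same value replays the same random sequence.
random.seed(seed)
first_run = [random.random() for _ in range(3)]
random.seed(seed)
second_run = [random.random() for _ in range(3)]
```

To reproduce an earlier run, the printed value would be hard-coded in place of the call to `get_random_seed()`.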


## Load training data

In [3]:
train_data = pd.read_csv('clean_data/train.csv')

formula = 'Survived ~ Embarked + Class + Cabin + Fare + Title + Sex + Age + Relatives - 1'
y_train, X_train = patsy.dmatrices(formula, train_data)

List the design-matrix columns with their indices; the model pipeline below selects variables by these indices.

In [4]:
pd.DataFrame({'Variable': X_train.design_info.column_names})

|   | Variable |
|---|---|
| 0 | Embarked[Cherbourg] |
| 1 | Embarked[Queenstown] |
| 2 | Embarked[Southampton] |
| 3 | Class[T.second] |
| 4 | Class[T.third] |
| 5 | Cabin[T.B] |
| 6 | Cabin[T.C] |
| 7 | Cabin[T.D] |
| 8 | Cabin[T.E] |
| 9 | Cabin[T.F] |
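Because the pipeline addresses columns by position, a name-to-index lookup can make those positions easier to check. The sketch below uses only the ten column names shown in the output above; in the notebook the list would come from `X_train.design_info.column_names`. The names `index_of` and `indices_with_prefix` are illustrative.

```python
# The ten column names shown in the output above.
column_names = [
    'Embarked[Cherbourg]',
    'Embarked[Queenstown]',
    'Embarked[Southampton]',
    'Class[T.second]',
    'Class[T.third]',
    'Cabin[T.B]',
    'Cabin[T.C]',
    'Cabin[T.D]',
    'Cabin[T.E]',
    'Cabin[T.F]',
]

# Map each column name to its position in the design matrix.
index_of = {name: i for i, name in enumerate(column_names)}


def indices_with_prefix(prefix):
    """Return the positions of all columns whose name starts with `prefix`."""
    return [i for i, name in enumerate(column_names) if name.startswith(prefix)]


print(index_of['Class[T.third]'])     # 4
print(indices_with_prefix('Cabin['))  # [5, 6, 7, 8, 9]
```

Looking indices up by name would also keep the pipeline correct if the formula, and therefore the column order, changed.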


## Build the model pipeline

In [5]:
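# Hard voting: each of the three estimators casts one vote per passenger.
ensemble = VotingClassifier(voting='hard', estimators=[
    # This tree sees only columns 17 and 19.
    ('women_and_children', make_pipeline(ColumnSelector([17, 19]),
                                         DecisionTreeClassifier())),
    ('svm', SVC(kernel='rbf')),
    ('rf', RandomForestClassifier()),
])

# Predictor columns for the imputers. Note that range(17) already excludes
# columns 18 (Fare) and 19 (Age), so the filter below is a no-op.
complete_variables = [i for i in range(17) if i not in (18, 19)]

pipeline = Pipeline([
    # Impute Fare (column 18) first, then use it as a predictor for Age (column 19).
    ('fare_imputer', ModelImputer(DecisionTreeRegressor(), 18,
                                  complete_variables)),
    ('age_imputer', ModelImputer(DecisionTreeRegressor(), 19,
                                 complete_variables + [18])),
    ('ensemble', ensemble),
])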

## Estimate the model parameters

In [6]:
param_grid = {
    'ensemble__svm__C': np.logspace(-2, 4, 10),
    'ensemble__svm__gamma': np.logspace(-4, 2, 10),
    'ensemble__rf__n_estimators': [5, 11, 17, 23],
}

model = GridSearchCV(pipeline, param_grid, cv=10, n_jobs=-1, verbose=1)
model.fit(np.asarray(X_train), np.asarray(y_train.ravel()))

print('Best score:', model.best_score_)
print('Best weights:', pprint.pformat(model.best_params_), sep='\n')

Fitting 10 folds for each of 400 candidates, totalling 4000 fits


[Parallel(n_jobs=-1)]: Done 144 tasks      | elapsed:    3.0s
[Parallel(n_jobs=-1)]: Done 744 tasks      | elapsed:   18.1s
[Parallel(n_jobs=-1)]: Done 1198 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 1867 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 2616 tasks      | elapsed:  2.4min
[Parallel(n_jobs=-1)]: Done 3277 tasks      | elapsed:  3.4min
[Parallel(n_jobs=-1)]: Done 3927 tasks      | elapsed:  4.2min
[Parallel(n_jobs=-1)]: Done 4000 out of 4000 | elapsed:  4.4min finished


Best score: 0.839145106862
Best weights:
{'ensemble__rf__n_estimators': 23,
 'ensemble__svm__C': 10000.0,
 'ensemble__svm__gamma': 0.0001}
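The candidate and fit counts in the log follow directly from the grid: 10 values of `C` times 10 values of `gamma` times 4 forest sizes gives 400 candidates, and 10 folds each gives 4000 fits. The sketch below reproduces that arithmetic and checks where the reported best values fall. The `logspace` function is a pure-Python stand-in for `np.logspace`, written here only so the example has no dependencies.

```python
from itertools import product
from math import isclose


def logspace(start, stop, num):
    """Stand-in for np.logspace: `num` powers of 10 with evenly spaced exponents."""
    return [10 ** (start + i * (stop - start) / (num - 1)) for i in range(num)]


grid = {
    'ensemble__svm__C': logspace(-2, 4, 10),
    'ensemble__svm__gamma': logspace(-4, 2, 10),
    'ensemble__rf__n_estimators': [5, 11, 17, 23],
}

n_candidates = len(list(product(*grid.values())))
n_fits = n_candidates * 10  # cv=10 folds per candidate
print(n_candidates, n_fits)  # 400 4000

# Best parameters as reported in the output above.
best = {
    'ensemble__rf__n_estimators': 23,
    'ensemble__svm__C': 10000.0,
    'ensemble__svm__gamma': 0.0001,
}

# Flag every best value that lies on the boundary of its search range.
on_edge = {
    name: isclose(best[name], min(values)) or isclose(best[name], max(values))
    for name, values in grid.items()
}
print(on_edge)
```

All three reported values sit on a boundary of their range. That may suggest, though it does not prove, that a wider grid could reach a higher cross-validation score.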


## Create a submission

In [7]:
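test_data = pd.read_csv('clean_data/test.csv')

# Reuse the training design_info so the test matrix gets the same columns
# and category encoding as X_train.
X_test = patsy.build_design_matrices([X_train.design_info], test_data)[0]
y_test = model.predict(X_test)

# One row per test passenger: the passenger ID and the predicted class as integers.
submission = pd.DataFrame({'PassengerId': test_data['PassengerId'],
                           'Survived': y_test}, dtype=int)
submission.to_csv('submission/submission.csv', index=False)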