# Introduction

This notebook compares a popular classifier (random forest) with a neural network for predicting the survival of Titanic passengers. The dataset [comes from Kaggle](https://www.kaggle.com/c/titanic).

The main goal of this notebook is to show how neural networks can perform well on structured data. The main trick is to take the concept of [word embedding](https://en.wikipedia.org/wiki/Word_embedding) and apply it to the categorical variables of a structured dataset. This idea comes from the article [Entity Embeddings of Categorical Variables](https://arxiv.org/abs/1604.06737) and, before that, [Artificial Neural Networks Applied to Taxi Destination Prediction](https://arxiv.org/abs/1508.00021).
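To make the idea concrete, here is a minimal sketch of what an embedding layer computes: each integer category code selects one row of a trainable table. The table values and the names `embedding_table` and `embed` are illustrative assumptions, not taken from this notebook; in a real model the rows are learned during training.

```python
# Hypothetical 3-category embedding table with 2-dimensional vectors.
# In a trained network these numbers are learned, not hand-written.
embedding_table = [
    [0.10, -0.30],   # vector for category code 0
    [0.25,  0.40],   # vector for category code 1
    [-0.50, 0.05],   # vector for category code 2
]

def embed(codes, table):
    """Replace each integer category code with its row of the table."""
    return [table[code] for code in codes]

vectors = embed([2, 0, 2], embedding_table)
print(vectors)
```

Two passengers with the same category code always receive the same vector, so whatever the network learns about a category is shared by every row that carries it.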

# Imports

In [1]:
%matplotlib inline

import pandas            as pd
import numpy             as np
import matplotlib.pyplot as plt
import seaborn           as sns
sns.set_style('whitegrid')

from sklearn.preprocessing   import LabelEncoder
from sklearn.preprocessing   import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble        import RandomForestClassifier
from sklearn.metrics         import accuracy_score

from keras.models            import Model
from keras.layers            import Dense
from keras.layers            import Input
from keras.layers            import Embedding
from keras.layers            import Activation
from keras.layers            import BatchNormalization
from keras.layers            import Flatten
from keras.layers            import Dropout
from keras.layers            import merge
from keras.utils.np_utils    import to_categorical
from keras.initializations   import uniform

Using TensorFlow backend.


# Data loading

## Basic loading

In [2]:
data_path = '../data/'

In [3]:
train_df = pd.read_csv(data_path + 'train.csv')
train_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)
train_df.head()

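|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S |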


In [4]:
test_df = pd.read_csv(data_path + 'test.csv')
test_df.drop(['Name', 'Ticket', 'Cabin'], axis = 1, inplace = True)
test_df.head()

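|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 7.8292 | Q |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 7.0 | S |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 9.6875 | Q |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 8.6625 | S |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 | 12.2875 | S |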


In [5]:
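# Fill missing embarkation ports with 'S'; assigning the result back
# avoids relying on an in-place edit of a column view.
train_df['Embarked'] = train_df['Embarked'].fillna('S')
test_df['Embarked']  = test_df['Embarked'].fillna('S')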

## Categorical attribute encoding

In [6]:
def encode_attribute(train_df, test_df, attribute):
    encoder    = LabelEncoder()
    all_values = pd.concat([train_df[attribute], test_df[attribute]])
    encoder.fit(all_values)
    train_df[attribute] = encoder.transform(train_df[attribute])
    test_df[attribute]  = encoder.transform(test_df[attribute])
    
    return encoder

In [7]:
sex_encoder      = encode_attribute(train_df, test_df, 'Sex')
pclass_encoder   = encode_attribute(train_df, test_df, 'Pclass')
embarked_encoder = encode_attribute(train_df, test_df, 'Embarked')
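
`LabelEncoder` assigns integer codes following the sorted order of the distinct values, which explains the codes that appear in the encoded tables later on (for example, `S` becomes 2 and third class becomes 2). The sketch below reproduces that mapping in plain Python; `fit_codes` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def fit_codes(values):
    """Map each distinct value to its rank in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}

embarked_codes = fit_codes(['S', 'C', 'S', 'Q'])
pclass_codes   = fit_codes([3, 1, 3, 2])
sex_codes      = fit_codes(['male', 'female'])

print(embarked_codes)  # {'C': 0, 'Q': 1, 'S': 2}
print(pclass_codes)    # {1: 0, 2: 1, 3: 2}
print(sex_codes)       # {'female': 0, 'male': 1}
```

Fitting on the concatenation of the training and test values, as `encode_attribute` does, ensures both sets share one mapping.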

## Dealing with missing values

In [8]:
pd.isnull(train_df).any()

Survived    False
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Fare        False
Embarked    False
dtype: bool

In [9]:
train_df['Age'] = train_df['Age'].fillna(train_df['Age'].mean())

In [10]:
pd.isnull(test_df).any()

PassengerId    False
Pclass         False
Sex            False
Age             True
SibSp          False
Parch          False
Fare            True
Embarked       False
dtype: bool

In [11]:
test_df['Age']  = test_df['Age'].fillna(test_df['Age'].mean())
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())
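
Mean imputation, as used above, replaces each missing entry with the average of the observed ones. Here is a plain-Python sketch with made-up ages; `mean_of_observed` and `fill_with_mean` are hypothetical helpers. Because the mean is passed in as an argument, the training-set mean could also be reused for the test set, which is a variation on what this notebook does rather than a description of it.

```python
def mean_of_observed(values):
    """Average of the entries that are not missing (None)."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

def fill_with_mean(values, mean):
    """Replace missing entries (None) with the given mean."""
    return [mean if v is None else v for v in values]

train_ages = [22.0, None, 26.0, 36.0]        # made-up values
train_mean = mean_of_observed(train_ages)    # (22 + 26 + 36) / 3
filled     = fill_with_mean(train_ages, train_mean)

print(filled)  # [22.0, 28.0, 26.0, 36.0]
```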

## Final visualization

In [12]:
train_df.head()

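|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 1 | 22.0 | 1 | 0 | 7.25 | 2 |
| 1 | 1 | 0 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 2 | 1 | 2 | 0 | 26.0 | 0 | 0 | 7.925 | 2 |
| 3 | 1 | 0 | 0 | 35.0 | 1 | 0 | 53.1 | 2 |
| 4 | 0 | 2 | 1 | 35.0 | 0 | 0 | 8.05 | 2 |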


In [13]:
test_df.head()

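|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 2 | 1 | 34.5 | 0 | 0 | 7.8292 | 1 |
| 1 | 893 | 2 | 0 | 47.0 | 1 | 0 | 7.0 | 2 |
| 2 | 894 | 1 | 1 | 62.0 | 0 | 0 | 9.6875 | 1 |
| 3 | 895 | 2 | 1 | 27.0 | 0 | 0 | 8.6625 | 2 |
| 4 | 896 | 2 | 0 | 22.0 | 1 | 1 | 12.2875 | 2 |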


# Predictions

We start with a standard random forest model, then see how a neural network with [entity embeddings of categorical variables](https://arxiv.org/abs/1604.06737) compares to it.

## Data preparation

In [14]:
# `.values` returns the underlying NumPy array (`as_matrix()` was removed in pandas 1.0)
X_train = train_df.drop(['Survived'], axis = 1).values
y_train = train_df['Survived'].values
X_test  = test_df.drop(['PassengerId'], axis = 1).values
X_train.shape, y_train.shape, X_test.shape

((891, 7), (891,), (418, 7))

In [15]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.25)
(X_train.shape, X_val.shape, y_train.shape, y_val.shape)

((668, 7), (223, 7), (668,), (223,))
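
The sizes above follow from `test_size = 0.25` applied to 891 rows, and they are consistent with the validation share being rounded up and the remainder going to the training set. A quick check of the arithmetic:

```python
import math

n_samples = 891
test_size = 0.25

n_val   = math.ceil(test_size * n_samples)  # 222.75 rounded up
n_train = n_samples - n_val

print(n_train, n_val)  # 668 223
```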

## Random forest model

In [16]:
model = RandomForestClassifier(n_estimators = 500)

In [17]:
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=500, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [18]:
# accuracy_score expects (y_true, y_pred)
print('training set accuracy:', accuracy_score(y_train, model.predict(X_train)))
print('validation set accuracy:', accuracy_score(y_val, model.predict(X_val)))

training set accuracy: 0.991017964072
validation set accuracy: 0.80269058296


## Neural network model

### Data reorganization

In order to produce embeddings for the categorical variables, they first need to be separated from the rest of the dataset.

In [76]:
def split_categorical(X):
    X_pclass   = X[:, 0].astype(np.int64)
    X_sex      = X[:, 1].astype(np.int64)
    X_embarked = X[:, -1].astype(np.int64)
    X_noncat   = X[:, 2 : -1]
    
    return (X_pclass, X_sex, X_embarked), X_noncat

In [77]:
(X_train_pclass, X_train_sex, X_train_embarked), X_train_noncat = split_categorical(X_train)
(X_val_pclass  , X_val_sex  , X_val_embarked)  , X_val_noncat   = split_categorical(X_val)
(X_test_pclass , X_test_sex , X_test_embarked) , X_test_noncat  = split_categorical(X_test)
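
To see which columns end up where, the same slicing can be applied to a single encoded row. The example uses the first row of the encoded `train_df`, with a plain Python list standing in for the NumPy array:

```python
# Column order after dropping 'Survived':
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked
row = [2, 1, 22.0, 1, 0, 7.25, 2]

pclass, sex, embarked = int(row[0]), int(row[1]), int(row[-1])
noncat = row[2:-1]   # Age, SibSp, Parch, Fare

print((pclass, sex, embarked))  # (2, 1, 2)
print(noncat)                   # [22.0, 1, 0, 7.25]
```

The four remaining columns are the ones fed to the network as ordinary numeric inputs.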

In [78]:
y_train_cat = to_categorical(y_train)
y_val_cat   = to_categorical(y_val)

Now that the categorical variables are separated from the rest of the dataset, we can create the embeddings.

As a first test, I use embeddings whose dimension is twice the number of possible values.

In [79]:
def embedding_init(shape, name = None):
    # Uniform init whose range shrinks as the embedding dimension (shape[1]) grows
    return uniform(shape = shape, scale = 2 / (shape[1] + 1), name = name)
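
The placement of parentheses in the scale expression changes the result considerably. `2 / shape[1] + 1` evaluates as `(2 / shape[1]) + 1` and is always above 1, whereas `2 / (shape[1] + 1)` shrinks as the embedding gets wider. Here is a quick comparison for embedding widths 4 and 6, the widths this notebook uses, assuming Python 3 division. The two function names are hypothetical and exist only for this check.

```python
def scale_without_parentheses(dim):
    return 2 / dim + 1        # parsed as (2 / dim) + 1

def scale_with_parentheses(dim):
    return 2 / (dim + 1)

for dim in (4, 6):
    print(dim, scale_without_parentheses(dim), scale_with_parentheses(dim))
```

Small initial embedding values are the usual intent of such a scale, which is why the parenthesized form is the more plausible reading.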

In [80]:
def create_embedding(class_number):
    inp = Input(shape = (1,), dtype = 'int64')
    x   = Embedding(input_dim = class_number, output_dim = 2 * class_number, init = embedding_init)(inp)
    x   = Flatten()(x)
    
    return inp, x

In [81]:
def evaluate_model(model, inp, y_true):
    predictions  = model.predict(inp)
    pred_classes = np.argmax(predictions, axis = 1)
    
    return accuracy_score(y_true, pred_classes)
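
`evaluate_model` turns the two output probabilities of each passenger into a class label with `argmax` before scoring. A plain-Python sketch of the same steps, with made-up probabilities and hypothetical helper names:

```python
def argmax(row):
    """Index of the largest entry in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

probabilities = [[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]]
y_true        = [0, 1, 1, 1]

pred_classes = [argmax(row) for row in probabilities]
print(pred_classes)                    # [0, 1, 0, 1]
print(accuracy(y_true, pred_classes))  # 0.75
```

This is why the function is given the original integer labels (`y_train`, `y_val`) rather than their one-hot versions.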

In [82]:
pclass_inp, pclass_embedding       = create_embedding(3)
sex_inp, sex_embedding             = create_embedding(2)
embarked_input, embarked_embedding = create_embedding(3)

We can now create our neural network. It first concatenates all the embeddings with the non-categorical features, then feeds the result to the fully connected layers.

In [83]:
non_cat_input = Input(shape = (4,))
x             = merge([pclass_embedding, sex_embedding, embarked_embedding, non_cat_input], mode = 'concat')
x             = Dropout(0.2)(x)
x             = Dense(512)(x)
x             = BatchNormalization()(x)
x             = Activation('relu')(x)
x             = Dropout(0.5)(x)
x             = Dense(512)(x)
x             = BatchNormalization()(x)
x             = Activation('relu')(x)
x             = Dropout(0.6)(x)
x             = Dense(512)(x)
x             = BatchNormalization()(x)
x             = Activation('relu')(x)
x             = Dropout(0.7)(x)
x             = Dense(2)(x)
x             = Activation('softmax')(x)
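
The sizes implied by the cells above can be checked by hand. Each embedding has `class_number * 2 * class_number` weights, and the concatenated vector is the sum of the embedding widths plus the four numeric features. A short sketch of that arithmetic (the dictionary names are illustrative):

```python
class_counts = {'pclass': 3, 'sex': 2, 'embarked': 3}
n_noncat     = 4  # Age, SibSp, Parch, Fare

# Embedding width is twice the number of classes, as in create_embedding
embedding_dims   = {name: 2 * n for name, n in class_counts.items()}
# One row of weights per class
embedding_params = {name: n * embedding_dims[name] for name, n in class_counts.items()}
concat_width     = sum(embedding_dims.values()) + n_noncat

print(embedding_dims)    # {'pclass': 6, 'sex': 4, 'embarked': 6}
print(embedding_params)  # {'pclass': 18, 'sex': 8, 'embarked': 18}
print(concat_width)      # 20
```

The 18 weights of the first embedding match the parameter count reported for it in the model summary further down.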

In [84]:
model = Model([pclass_inp, sex_inp, embarked_input, non_cat_input], x)
model.compile('adam', 'binary_crossentropy')
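
With a two-unit softmax and one-hot labels, binary cross-entropy averaged over the two outputs gives the same value as categorical cross-entropy, because the two probabilities sum to 1. A minimal check in plain Python follows. The functions are hand-written for illustration, ignore any numerical clipping a framework may apply, and are not the Keras implementations.

```python
import math

def categorical_crossentropy(y, p):
    """Cross-entropy between a one-hot label y and probabilities p."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, p))

def binary_crossentropy(y, p):
    """Per-output binary cross-entropy, averaged over the outputs."""
    terms = [-(yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
             for yi, pi in zip(y, p)]
    return sum(terms) / len(terms)

y = [0.0, 1.0]      # one-hot label for class 1
p = [0.25, 0.75]    # made-up softmax output, sums to 1

cce = categorical_crossentropy(y, p)
bce = binary_crossentropy(y, p)
print(cce, bce)
```

Both values equal `-log(0.75)`, so for this particular output layer the choice between the two losses does not change what is optimized.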

In [85]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_14 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_15 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
input_16 (InputLayer)            (None, 1)             0                                            
____________________________________________________________________________________________________
embedding_10 (Embedding)         (None, 1, 6)          18          input_14[0][0]                   
___________________________________________________________________________________________

In [86]:
fit_parameters = {
    'x'               : [X_train_pclass, X_train_sex, X_train_embarked, X_train_noncat],
    'y'               : y_train_cat,
    'batch_size'      : 64,
    'nb_epoch'        : 50,
    'validation_data' : ([X_val_pclass, X_val_sex, X_val_embarked, X_val_noncat], y_val_cat)
}

In [87]:
model.fit(**fit_parameters)

Train on 668 samples, validate on 223 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f8dd9657b38>

In [88]:
print('training set accuracy:', evaluate_model(model, [X_train_pclass, X_train_sex, X_train_embarked, X_train_noncat], y_train))
print('validation set accuracy:', evaluate_model(model, [X_val_pclass, X_val_sex, X_val_embarked, X_val_noncat], y_val))

training set accuracy: 0.811377245509
validation set accuracy: 0.789237668161


In [89]:
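from keras import backend as K
K.set_value(model.optimizer.lr, 1e-4)  # update the existing variable so the compiled model sees the new rate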

In [90]:
model.fit(**fit_parameters)

Train on 668 samples, validate on 223 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f8ddb621da0>

In [91]:
print('training set accuracy:', evaluate_model(model, [X_train_pclass, X_train_sex, X_train_embarked, X_train_noncat], y_train))
print('validation set accuracy:', evaluate_model(model, [X_val_pclass, X_val_sex, X_val_embarked, X_val_noncat], y_val))

training set accuracy: 0.808383233533
validation set accuracy: 0.80269058296
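
Because the validation set has only 223 passengers, it helps to translate the reported accuracies back into counts of correct predictions. The accuracies below are the ones printed in this notebook; the dictionary labels are my own.

```python
n_train, n_val = 668, 223

# (training accuracy, validation accuracy) as printed above
reported = {
    'random forest':       (0.991017964072, 0.80269058296),
    'network, first fit':  (0.811377245509, 0.789237668161),
    'network, second fit': (0.808383233533, 0.80269058296),
}

# Number of correctly classified passengers in each set
correct = {name: (round(train_acc * n_train), round(val_acc * n_val))
           for name, (train_acc, val_acc) in reported.items()}

for name, counts in correct.items():
    print(name, counts)
```

On this split the network and the random forest end with the same number of correct validation predictions (179 of 223), and the gap after the first fit corresponds to three passengers. The random forest fits the training set far more closely (662 of 668) than the network (540 of 668) for the same validation score.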
