In [3]:
from keras.models import Sequential
from keras.layers import Dense
import tensorflow as tf
import pandas as pd
import numpy as np
import sklearn.model_selection as sk

The Titanic dataset contains information about the passengers of the Titanic. We can use pandas to load the Excel file directly from its URL into a DataFrame.

In [4]:
df = pd.read_excel('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls')

We'll build a model to test the hypothesis that the "title" of a passenger on the Titanic can help predict whether they survived. The title is not available as a column in the dataset, but it can be extracted from the passenger's name. The title is information-rich and might have as much predictive power as sex, age, and cabin combined.
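
To make the extraction concrete, here is a minimal sketch using only the standard library `re` module. It assumes names follow a "Surname, Title. Given names" layout; the helper `extract_title`, the pattern, and the last sample name are illustrative assumptions, not part of the dataset or of pandas.

```python
import re

# Illustrative name strings in the "Surname, Title. Given names" layout.
# The last entry is a made-up name with two periods.
sample_names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Doe, Mr. John A.",
]

# Capture everything between the first comma and the first following period.
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the title embedded in a name, or None if none is found."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None


titles = [extract_title(name) for name in sample_names]

# A greedy capture runs on to the last period in the string instead.
greedy_title = re.search(r".*, (.*)\.", "Doe, Mr. John A.").group(1)
```

Stopping at the first period keeps the captured title short even when the rest of the name contains initials or abbreviations.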

In [5]:
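# Extract the passenger's title (the text between the comma and the first
# following period in 'name') into a new 'title' column
df['title'] = df['name'].str.extract(r',\s*([^.]+)\.', expand=False)
# One-hot encode the titles: one indicator column per distinct title
x_all = pd.get_dummies(df['title'])
# One-hot encode the label: column 0 = survived, column 1 = did not survive
y_all = np.array([[y == 1, y == 0] for y in df.survived]).astype(int)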
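
The encoding step can be illustrated without pandas. The sketch below uses made-up toy values (not rows from the dataset) to show the shape of the two arrays: one indicator column per distinct title for the inputs, and a two-column survived / did-not-survive indicator for the labels.

```python
# Toy values for illustration only; not taken from the dataset.
titles = ["Mr", "Mrs", "Miss", "Mr"]
survived = [0, 1, 1, 0]

# One column per distinct title, in sorted order.
categories = sorted(set(titles))

# Each row has a 1 in the column of its own title and 0 elsewhere.
x_rows = [[int(title == category) for category in categories] for title in titles]

# Column 0 flags "survived", column 1 flags "did not survive".
y_rows = [[int(s == 1), int(s == 0)] for s in survived]
```

Every row of `x_rows` and `y_rows` contains exactly one 1, which is the format the categorical cross-entropy loss expects for the labels.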

Before building a model, split the data into a training set and a test set.

In [6]:
X_train, X_test, y_train, y_test = sk.train_test_split(x_all, 
                                                       y_all, 
                                                       test_size=0.33, 
                                                       random_state=42)
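
For intuition, a seeded random split can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not scikit-learn's implementation: the function name `split_indices` and the choice to round the test size up are assumptions of this sketch.

```python
import math
import random


def split_indices(n_rows, test_size=0.33, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    # A dedicated Random instance makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(12, test_size=0.25)
```

Fixing the seed means the same rows land in the test set on every run, so scores stay comparable between experiments.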

A neural network is a good way to explore the predictive power of the 'title' field. Here, we use the high-level Keras API to build and compile a simple network.

In [14]:
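model = Sequential()
# Two hidden layers with ReLU so the stack is not purely linear;
# only the first layer needs the input size
model.add(Dense(units=32, activation='relu', input_dim=len(x_all.columns)))
model.add(Dense(units=32, activation='relu'))
# Two softmax outputs: P(survived) and P(did not survive)
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='sgd',
              metrics=['accuracy'])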
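
To show what the output layer and the loss compute, here is a pure-Python sketch of softmax and categorical cross-entropy for a single example. The function names and input values are illustrative; Keras applies the same idea to whole batches with its own numerical safeguards.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability assigned to the true class."""
    return -sum(
        t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t
    )


probs = softmax([2.0, 0.0])                      # scores favouring class 0
loss = categorical_crossentropy([1, 0], probs)   # true class is 0
uniform_loss = categorical_crossentropy([1, 0], softmax([0.0, 0.0]))
```

The loss shrinks as the probability assigned to the true class approaches 1; an undecided network that outputs 0.5 for both classes scores ln 2.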

Now we can use the training data to train the model.

In [9]:
# Keras expects NumPy arrays, so convert the DataFrame first
model.fit(X_train.to_numpy(dtype='float32'), y_train, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x11e137710>

The model can be used to predict the survival status of passengers in the test data.

In [16]:
predictions = model.predict(X_test.to_numpy(dtype='float32'))

Finally, we can calculate a simple score for our model. 

In [17]:
# Convert the softmax probabilities to booleans: True if P(survived) > 0.5
predictions_bool = predictions[:, 0] > 0.5
# Do the same for the true labels (column 0 is the 'survived' indicator)
y_test_bool = y_test[:, 0] == 1
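
With two softmax outputs, thresholding the first column at 0.5 is almost the same as picking the more probable class. The sketch below uses made-up probability rows to show that the two rules agree except on an exact tie.

```python
# Made-up softmax outputs: [P(survived), P(did not survive)].
rows = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

# Rule 1: predict "survived" only when its probability exceeds 0.5.
by_threshold = [row[0] > 0.5 for row in rows]

# Rule 2: predict the class with the larger probability
# (list.index returns the first maximum, so a tie goes to class 0).
by_argmax = [row.index(max(row)) == 0 for row in rows]
```

The strict `>` comparison sends an exact 0.5 to "did not survive", whereas the argmax rule as written here sends it to "survived".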

A simple measure of accuracy is the fraction of test passengers whose predicted survival status matches the actual outcome.

In [21]:
accuracy = np.mean(predictions_bool == y_test_bool)
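
An accuracy figure is easier to interpret next to a baseline. The sketch below uses made-up boolean lists (not model output) to compare accuracy against always predicting the majority class; the variable names are illustrative.

```python
from collections import Counter

# Made-up predictions and outcomes for illustration only.
predicted = [True, False, False, False, False, False]
actual = [True, False, True, False, False, False]

# Accuracy: fraction of positions where prediction and outcome agree.
accuracy = sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Baseline: accuracy of always predicting the most common outcome.
majority_count = Counter(actual).most_common(1)[0][1]
baseline = majority_count / len(actual)
```

A model is only informative if its accuracy clearly exceeds this baseline, since survival outcomes are not evenly balanced.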