# **Titanic Lab**

Send this file with the solution to econometrics.methods@gmail.com

Information below is from https://www.kaggle.com/c/titanic/data

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine/deep learning to predict which passengers survived the tragedy.

The data has been split into two groups:

1.   training set (train.csv)
2.   test set (test.csv)

The training set should be used to build your machine/deep learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

**Goal**
It is your job to predict if a passenger survived the sinking of the Titanic or not. 
For each passenger in the test set, you must predict a 0 or 1 value for the Survived variable.

**Metric**
Your score is the percentage of passengers you correctly predict. This is known simply as "accuracy".
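As a small illustration of the metric, accuracy is the share of predictions that match the ground truth. The arrays below are made-up values chosen for the example, not Titanic results.

```python
import numpy as np

# Hypothetical labels and predictions, for illustration only
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / number of predictions
accuracy = (y_true == y_pred).mean()
print(f"accuracy = {accuracy:.2f}")
```

Four of the five predictions match here, so the printed accuracy is 0.80.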

**Variable Notes**

**pclass**: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

**sibsp**: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
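The text above mentions that feature engineering can create new features. One possible sketch, which is an assumption for illustration and not part of this lab's solution, combines sibsp and parch into a family-size feature. The names `FamilySize` and `IsAlone` are hypothetical, and the rows are made up.

```python
import pandas as pd

# Toy rows with the same column names as the Titanic data (values are made up)
toy = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# FamilySize and IsAlone are hypothetical engineered features
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1  # +1 counts the passenger
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)
print(toy)
```

Keep in mind the note on parch: a child travelling only with a nanny would be flagged as alone by such a feature.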

In [0]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [0]:
raw_train = pd.read_csv('https://github.com/VitorKamada/ECO7110/raw/master/Data/Titanic/train.csv', index_col=0)
raw_train['is_test'] = 0
raw_test = pd.read_csv('https://github.com/VitorKamada/ECO7110/raw/master/Data/Titanic/test.csv', index_col=0)
raw_test['is_test'] = 1
raw_train = raw_train.reset_index()
raw_test = raw_test.reset_index()
combine = [raw_train, raw_test]

In [121]:
raw_train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,is_test
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0


## Cleaning and engineering the data

Drop columns that are not used in the models (Ticket and Cabin)

In [0]:
raw_train = raw_train.drop(['Ticket', 'Cabin'], axis=1)
raw_test = raw_test.drop(['Ticket', 'Cabin'], axis=1)
combine = [raw_train, raw_test]

Assign numerical values to non-numerical data based on frequency

In [123]:
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
pd.crosstab(raw_train['Title'], raw_train['Sex'])

Title,female,male
Capt,0,1
Col,0,2
Countess,1,0
Don,0,1
Dr,1,6
Jonkheer,0,1
Lady,1,0
Major,0,2
Master,0,40
Miss,182,0
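To make the extraction step easier to follow, the sketch below applies the same pattern to three names taken from the `head()` output earlier in this notebook. The pattern captures the run of letters that follows a space and ends with a period.

```python
import pandas as pd

# Sample names in the same "Surname, Title. Given names" layout as the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the letters that follow a space and are terminated by a period
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The pattern is written as a raw string (`r"..."`) so that `\.` is passed to the regex engine unchanged.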


In [124]:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(
        ['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
raw_train[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()

,Title,Survived
0,Master,0.575
1,Miss,0.702703
2,Mr,0.156673
3,Mrs,0.793651
4,Rare,0.347826


In [125]:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
raw_train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,is_test,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,0,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,0,3
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,0,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,0,3
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,0,1


In [0]:
raw_train = raw_train.drop(['Name', 'PassengerId'], axis=1)
raw_test = raw_test.drop(['Name'], axis=1)
combine = [raw_train, raw_test]

In [127]:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
raw_train.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,is_test,Title
0,0,3,0,22.0,1,0,7.25,S,0,1
1,1,1,1,38.0,1,0,71.2833,C,0,3
2,1,3,1,26.0,0,0,7.925,S,0,2
3,1,1,1,35.0,1,0,53.1,S,0,3
4,0,3,0,35.0,0,0,8.05,S,0,1


In [128]:
freq_port = raw_train.Embarked.dropna().mode()[0]
freq_port

'S'

In [129]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)
raw_train[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)

,Embarked,Survived
0,C,0.553571
1,Q,0.38961
2,S,0.339009
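The fill step above replaces missing ports with the most frequent one. A minimal sketch of the same idea on a made-up series (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy embarkation column with two missing entries (values are made up)
ports = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() returns the most frequent value(s); [0] takes the first of them
most_common = ports.dropna().mode()[0]
filled = ports.fillna(most_common)
print(most_common, filled.tolist())
```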


In [130]:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
raw_train.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,is_test,Title
0,0,3,0,22.0,1,0,7.25,0,0,1
1,1,1,1,38.0,1,0,71.2833,1,0,3
2,1,3,1,26.0,0,0,7.925,0,0,2
3,1,1,1,35.0,1,0,53.1,0,0,3
4,0,3,0,35.0,0,0,8.05,0,0,1
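Mapping the ports to 0, 1 and 2 gives them a numeric order that some models may treat as meaningful. An alternative that this notebook does not use is one-hot encoding, sketched here on a made-up column with `pd.get_dummies`.

```python
import pandas as pd

# Toy column with the three embarkation codes (values are made up)
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One indicator column per port instead of a single 0/1/2 code
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each row then has exactly one indicator set to 1, at the cost of two extra columns.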


In [131]:
raw_train.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,is_test,Title
0,0,3,0,22.0,1,0,7.25,0,0,1
1,1,1,1,38.0,1,0,71.2833,1,0,3
2,1,3,1,26.0,0,0,7.925,0,0,2
3,1,1,1,35.0,1,0,53.1,0,0,3
4,0,3,0,35.0,0,0,8.05,0,0,1


In [132]:
raw_test.head()

,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,is_test,Title
0,892,3,0,34.5,0,0,7.8292,2,1,1
1,893,3,1,47.0,1,0,7.0,0,1,3
2,894,2,0,62.0,0,0,9.6875,2,1,1
3,895,3,0,27.0,0,0,8.6625,0,1,1
4,896,3,1,22.0,1,1,12.2875,0,1,3


In [133]:
raw_train.isnull().sum(axis = 0)

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Embarked      0
is_test       0
Title         0
dtype: int64

Create two versions of each dataset: one with the rows containing NA values removed and one with those rows kept

In [103]:
raw_train.shape, raw_test.shape

((891, 10), (418, 10))

In [134]:
raw_train_nona = raw_train.dropna()
raw_train_nona.shape

(714, 10)

In [136]:
raw_test_nona = raw_test.dropna()
raw_test_nona.shape

(331, 10)
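Dropping rows shrinks the training set from 891 to 714 rows and the test set from 418 to 331, so some test passengers get no prediction. One common alternative, not used in this lab, is to fill missing ages with the median age of the training set. The frames below are made-up stand-ins for `raw_train` and `raw_test`.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for raw_train / raw_test (values are made up)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 47.0]})

# Compute the median on the training data only, then reuse it for the test data
age_median = toy_train["Age"].median()
toy_train["Age"] = toy_train["Age"].fillna(age_median)
toy_test["Age"] = toy_test["Age"].fillna(age_median)

print(age_median, toy_train["Age"].isna().sum(), toy_test["Age"].isna().sum())
```

Taking the median from the training rows alone keeps information from the test set out of the preprocessing.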

## Machine learning models
(The models ran only on the datasets with NAs removed)

In [137]:
X_train_nona = raw_train_nona.drop("Survived", axis=1)
Y_train_nona = raw_train_nona["Survived"]
X_test_nona  = raw_test_nona.drop("PassengerId", axis=1).copy()
X_train_nona.shape, Y_train_nona.shape, X_test_nona.shape

((714, 9), (714,), (331, 9))

Logistic regression

In [140]:
logreg_nona = LogisticRegression()
logreg_nona.fit(X_train_nona, Y_train_nona)
Y_pred_nona = logreg_nona.predict(X_test_nona)
# Accuracy on the training data itself, not on unseen data
acc_log_nona = round(logreg_nona.score(X_train_nona, Y_train_nona) * 100, 2)
acc_log_nona



81.09

Support vector machine

In [141]:
svc_nona = SVC()
svc_nona.fit(X_train_nona, Y_train_nona)
Y_pred_nona = svc_nona.predict(X_test_nona)
# Accuracy on the training data itself, not on unseen data
acc_svc_nona = round(svc_nona.score(X_train_nona, Y_train_nona) * 100, 2)
acc_svc_nona



91.18

K-nearest neighbor

In [142]:
knn_nona = KNeighborsClassifier(n_neighbors=3)
knn_nona.fit(X_train_nona, Y_train_nona)
Y_pred_nona = knn_nona.predict(X_test_nona)
# Accuracy on the training data itself, not on unseen data
acc_knn_nona = round(knn_nona.score(X_train_nona, Y_train_nona) * 100, 2)
acc_knn_nona

83.19

Random forest

In [143]:
random_forest_nona = RandomForestClassifier(n_estimators=100)
random_forest_nona.fit(X_train_nona, Y_train_nona)
Y_pred_nona = random_forest_nona.predict(X_test_nona)
# Accuracy on the training data itself, not on unseen data
acc_random_forest_nona = round(random_forest_nona.score(X_train_nona, Y_train_nona) * 100, 2)
acc_random_forest_nona

98.88
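The scores above are computed on the same rows the models were fitted on, so they can overstate performance on unseen passengers, especially for flexible models such as random forests. A hedged sketch of one way to estimate out-of-sample accuracy is k-fold cross-validation. The data below is synthetic (generated with `make_classification`), standing in for `X_train_nona` and `Y_train_nona`, so the printed numbers say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train_nona / Y_train_nona (not Titanic data)
X, y = make_classification(n_samples=300, n_features=9, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Accuracy on the same rows the model was fitted on
train_acc = forest.fit(X, y).score(X, y)

# Mean accuracy over 5 folds, each scored on rows held out from fitting
cv_acc = cross_val_score(forest, X, y, cv=5, scoring="accuracy").mean()

print(f"training accuracy:  {train_acc:.3f}")
print(f"5-fold CV accuracy: {cv_acc:.3f}")
```

With the lab's data, the same call would take `X_train_nona` and `Y_train_nona` in place of `X` and `y`.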

## Deep learning (attempt)

In [256]:
X_train = raw_train.drop("Survived", axis=1)
Y_train = raw_train["Survived"]
X_test  = raw_test.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape

((891, 9), (891,), (418, 9))

In [0]:
X_train = X_train.values
Y_train = Y_train.values

In [0]:
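# Standardize each column (nan-aware: Age has gaps); constant columns such as is_test stay at 0
X_train = (X_train - np.nanmean(X_train, axis=0)) / np.maximum(np.nanstd(X_train, axis=0), 1e-12)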

In [0]:
import keras
from keras.models import Sequential
from keras.layers import Dense

In [0]:
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(64, activation='relu',
                       input_shape=(9,)))
model.add(layers.Dense(1, activation='sigmoid'))

In [0]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [0]:
x_val = X_train_nona[:200]
partial_x_train = X_train_nona[200:]

y_val = Y_train_nona[:200]
partial_y_train = Y_train_nona[200:]

The model could only be trained after removing the rows with NAs in the Age column

In [268]:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=10,
                    validation_data=(x_val, y_val))

Train on 514 samples, validate on 200 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
