# Titanic RAMP: an analysis
## Introduction
This notebook builds a model and submits it on RAMP to predict which passengers survived the sinking of the Titanic, using the dataset published by Kaggle as an introductory data science challenge.

More information is available on the Kaggle and RAMP websites, and in the notebook provided with the RAMP Titanic Starting Kit.

In [4]:
%matplotlib inline
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

## Loading the data

In [25]:
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename)

y_df = data['Survived']
X_df = data.drop(['Survived'], axis=1)

display(X_df.head(5))
display(X_df.describe())

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,568,3,"Palsson, Mrs. Nils (Alma Cornelia Berglund)",female,29.0,0,4,349909,21.075,,S
1,544,2,"Beane, Mr. Edward",male,32.0,1,0,2908,26.0,,S
2,375,3,"Palsson, Miss. Stina Viola",female,3.0,3,1,349909,21.075,,S
3,604,3,"Torber, Mr. Ernst William",male,44.0,0,0,364511,8.05,,S
4,866,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0,,S


,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,356.0,356.0,290.0,356.0,356.0,356.0
mean,451.713483,2.300562,29.123862,0.550562,0.412921,31.65797
std,260.505039,0.833861,14.103122,1.120978,0.798415,43.474154
min,7.0,1.0,0.92,0.0,0.0,0.0
25%,229.75,2.0,19.0,0.0,0.0,7.925
50%,445.0,3.0,28.0,0.0,0.0,15.2458
75%,686.75,3.0,37.0,1.0,1.0,31.275
max,890.0,3.0,71.0,8.0,6.0,263.0



We notice that:
* Some values are missing (`Age`, `Cabin`), so we may need to fill in the gaps;
* Some features cannot be used as they are (e.g. `Name`, `Ticket`, ...);
* The numerical features have ranges and means of different orders of magnitude, so some feature scaling may be necessary for some models.
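
The first and last points can be illustrated on a handful of rows. The sketch below is only one possible treatment, not the one necessarily retained later: it assumes median imputation for `Age` and z-score standardisation, and the names `toy`, `Age_filled` and `Age_missing` are hypothetical. The values are taken from rows displayed in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame built from six rows displayed in this notebook (one Age is missing)
toy = pd.DataFrame({
    "Age": [29.0, 32.0, 3.0, 44.0, 42.0, np.nan],
    "Fare": [21.075, 26.0, 21.075, 8.05, 13.0, 25.4667],
})

# Assumption: fill missing ages with the median of the observed ages
age_median = toy["Age"].median()
toy["Age_filled"] = toy["Age"].fillna(age_median)

# Keep a flag so that a model can still see which ages were imputed
toy["Age_missing"] = toy["Age"].isnull().astype(int)

# Standardise each column to zero mean and unit variance (z-score)
for col in ["Age_filled", "Fare"]:
    toy[col + "_std"] = (toy[col] - toy[col].mean()) / toy[col].std(ddof=0)

print(toy)
```

Tree-based models are largely insensitive to feature scale, so the standardisation step mostly matters for linear or distance-based models.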


Let's have a look at missing data.

In [26]:
# Count missing values per column without hard-coding the number of rows
display(X_df.isnull().sum())

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             66
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          273
Embarked         0
dtype: int64
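
To put these counts in perspective, they can be expressed as a share of the 356 rows. The sketch below hard-codes the two non-zero counts from the output above; on the real data, `X_df.isnull().mean()` should give the same ratios directly.

```python
import pandas as pd

n_rows = 356  # number of rows, from the describe() output
missing = pd.Series({"Age": 66, "Cabin": 273})  # non-zero counts from the output above

# Share of missing values per column, in percent
missing_pct = (100 * missing / n_rows).round(1)
print(missing_pct)
```

`Cabin` is missing for roughly three quarters of the passengers, so any feature derived from it has to cope with absent values.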

## Predicting survival
### Raw data
The label to predict, `Survived`, has been isolated in `y_df`.

Let's look at the data for the two groups of passengers. The classes are not balanced: given the scale of the tragedy, fewer passengers survived (137) than died (219). Several features also differ in mean and variability between the two groups.

In [29]:
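display(data.groupby('Survived').count())
# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
display(data.groupby('Survived').mean(numeric_only=True))
display(data.groupby('Survived').std(numeric_only=True))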

Survived,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,219,219,219,219,179,219,219,219,219,27,219
1,137,137,137,137,111,137,137,137,137,56,137


Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,456.378995,2.534247,29.960894,0.589041,0.3379,22.478994
1,444.255474,1.927007,27.774054,0.489051,0.532847,46.330931


Survived,PassengerId,Pclass,Age,SibSp,Parch,Fare
0,268.462523,0.731008,14.052515,1.297517,0.804219,33.700918
1,248.03952,0.854358,14.143312,0.75845,0.776977,52.539184
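
The imbalance between the two groups can be quantified from the counts in the first table. The sketch below hard-codes those counts; the majority-class baseline is added here only as a reference point, an assumption about how one might later judge a model, and is not part of the original analysis.

```python
import pandas as pd

# Passenger counts per class, taken from the count() table above
counts = pd.Series({0: 219, 1: 137}, name="passengers")
counts.index.name = "Survived"

# Share of each class
shares = counts / counts.sum()

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = shares.max()

print(shares.round(3))
print("Majority-class baseline accuracy: {:.3f}".format(baseline_accuracy))
```

Any classifier worth submitting should beat this baseline by a clear margin.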


### A closer look at categories
Some features are categorical; let's look at the values they take.

In [30]:
print(X_df['Embarked'].sort_values().unique())
print("There are {} different ports.".format(X_df['Embarked'].unique().shape[0]))

['C' 'Q' 'S']
There are 3 different ports.


We recognise, of course, Cherbourg, Queenstown and Southampton.
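
Most models cannot use these letters directly. One common option, shown here as a sketch rather than as the encoding necessarily used later, is one-hot encoding with `pd.get_dummies`. The sample holds the `Embarked` values of the ten rows displayed in this notebook, and the variable names are hypothetical.

```python
import pandas as pd

# Embarked values of the ten rows displayed in this notebook (head and tail)
embarked = pd.Series(["S", "S", "S", "S", "S", "C", "S", "C", "C", "S"], name="Embarked")

# Declaring the three known ports keeps a column for Q even though it is absent from this sample
embarked = embarked.astype(pd.CategoricalDtype(categories=["C", "Q", "S"]))

# One indicator column per port
dummies = pd.get_dummies(embarked, prefix="Embarked").astype(int)
print(dummies.head())
```

Fixing the list of categories up front ensures that training and test data end up with the same columns, even if one port happens to be absent from one of them.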

In [31]:
print(X_df['Cabin'].sort_values().unique())
print("There are {} different values for Cabin.".format(X_df['Cabin'].unique().shape[0]))

['A10' 'A14' 'A20' 'A24' 'A6' 'B102' 'B18' 'B19' 'B20' 'B22' 'B3' 'B37'
 'B38' 'B42' 'B49' 'B5' 'B51 B53 B55' 'B57 B59 B63 B66' 'B77' 'B79' 'B80'
 'B82 B84' 'B96 B98' 'C103' 'C106' 'C110' 'C118' 'C123' 'C124' 'C125'
 'C126' 'C148' 'C2' 'C22 C26' 'C23 C25 C27' 'C30' 'C49' 'C52' 'C54' 'C70'
 'C78' 'C91' 'C92' 'D' 'D10 D12' 'D17' 'D33' 'D35' 'D36' 'D45' 'D56' 'D6'
 'D7' 'D9' 'E101' 'E24' 'E25' 'E33' 'E34' 'E46' 'E58' 'E63' 'E67' 'E68'
 'F2' 'F33' 'G6' nan]
There are 68 different values for Cabin.


We notice that:
* Some passengers do not have a cabin registered (2nd and/or 3rd class?);
* Cabin nomenclature seems structured: it always starts with a letter between `A` and `G`, followed in every case but one (`D`) by a number;
* Strangely enough, several passengers are registered with multiple cabins...

From the <a href=https://en.wikipedia.org/wiki/RMS_Titanic>Wikipedia page about the Titanic</a>, we learn that the Titanic had 10 decks, namely:
* Boat deck: where lifeboats were housed;
* A Deck: promenade deck;
* B Deck: bridge deck;
* C Deck: shelter deck;
* D Deck: saloon deck;
* E Deck: upper deck;
* F Deck: middle deck;
* G Deck: lower deck;
* Orlop deck/tank top: for machinery, cargo and fuel.

We'll therefore assume that the letter refers to the deck, and derive features from this data.

In [70]:
# We remove the numbers and spaces from Cabin to create Deck
# (regex=True is required: recent pandas versions treat the pattern literally by default)
X_df['Deck'] = X_df['Cabin'].str.replace("[0-9]+", "", regex=True)
X_df['Deck'] = X_df['Deck'].str.replace(" ", "", regex=False)

# We count the remaining letters to capture passengers who have booked several cabins
X_df['Cabin_nr'] = X_df['Deck'].str.len()

# We keep only the first letter as the Deck (expand=False returns a Series)
X_df['Deck'] = X_df['Deck'].str.extract("(^[A-Z])", expand=False)

X_df.tail(5)

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Deck,Cabin_nr
351,551,1,"Thayer, Mr. John Borland Jr",male,17.0,0,2,17421,110.8833,C70,C,C,1.0
352,715,2,"Greenberg, Mr. Samuel",male,52.0,0,0,250647,13.0,,S,,
353,662,3,"Badt, Mr. Mohamed",male,40.0,0,0,2623,7.225,,C,,
354,743,1,"Ryerson, Miss. Susan Parker ""Suzette""",female,21.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C,B,4.0
355,410,3,"Lefebre, Miss. Ida",female,,3,1,4133,25.4667,,S,,
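
As a cross-check, the same two features can be derived without regular expressions. This is only an alternative sketch, not the method used above: it assumes that multiple cabins are always separated by spaces and that the deck letter is always the first character, which holds for the values listed earlier. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Cabin strings taken from the outputs above, plus a missing value
cabin = pd.Series(["C70", "B57 B59 B63 B66", "D", np.nan], name="Cabin")

# The first character of the string is the deck letter (stays NaN when Cabin is missing)
deck = cabin.str[0]

# Number of whitespace-separated cabin labels (stays NaN when Cabin is missing)
cabin_nr = cabin.str.split().str.len()

print(pd.DataFrame({"Cabin": cabin, "Deck": deck, "Cabin_nr": cabin_nr}))
```

Passengers without a registered cabin keep missing values in both derived columns, so these still need to be filled or flagged before modelling.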


### Getting rid of probably useless data
Some features are probably useless because they merely act as identifiers, among them `PassengerId` and `Ticket`.

`Name` could perhaps be interpreted to extract some useful features, but we'll drop it at this stage.