In [38]:
# Import various required modules
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt

# Preprocessing 

# Ideation

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

**Overview**

The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

**Data Dictionary**

| Variable | Definition | Key |
|----------|------------|-----|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
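
The age note above can be turned into a simple flag. The following is a minimal sketch, assuming that any age of 1 or more ending in .5 is an estimate; the `ages` values and the `age_estimated` name are illustrative and not part of the dataset (only 0.42 appears in the data summary further down).

```python
import pandas as pd

# Hypothetical ages chosen to exercise each case in the note above
ages = pd.Series([0.42, 22.0, 28.5, 0.5, None], name="Age")

# Fractional part of each age; a missing age stays missing
frac = ages % 1

# Assumption: an age of 1 or more ending in .5 marks an estimate,
# while fractional ages below 1 are genuine infant ages
age_estimated = (ages >= 1) & (frac == 0.5)

print(age_estimated.tolist())
```

Missing ages compare as False here, so they are not counted as estimates.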

# Collect Data

Load the data into train and test sets, separating the features (X) from the target (y). As discussed above, the data is already split into train and test files, and the test file has no target column.

In [39]:
# Collect into train and test data
df_train = pd.read_csv('./data/train.csv', index_col='PassengerId')
X_test = pd.read_csv('./data/test.csv', index_col='PassengerId')

# Train and test sets
X_train = df_train.drop('Survived', axis=1)
y_train = df_train['Survived']

# View the feature set
X_train.head()

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# EDA

In [40]:
X_train.describe(include='all')

,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
count,891.0,891,891,714.0,891.0,891.0,891,891.0,204,889
unique,,891,2,,,,681,,147,3
top,,"Dooley, Mr. Patrick",male,,,,1601,,B96 B98,S
freq,,1,577,,,,7,,4,644
mean,2.308642,,,29.699118,0.523008,0.381594,,32.204208,,
std,0.836071,,,14.526497,1.102743,0.806057,,49.693429,,
min,1.0,,,0.42,0.0,0.0,,0.0,,
25%,2.0,,,20.125,0.0,0.0,,7.9104,,
50%,3.0,,,28.0,0.0,0.0,,14.4542,,
75%,3.0,,,38.0,1.0,0.0,,31.0,,


In [41]:
X_train.isnull().sum()

Pclass        0
Name          0
Sex           0
Age         177
SibSp         0
Parch         0
Ticket        0
Fare          0
Cabin       687
Embarked      2
dtype: int64
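
Raw counts are easier to judge as a share of all rows. A small sketch, using only the counts shown above and the 891 rows reported by `describe()`; the variable names are illustrative:

```python
import pandas as pd

# Missing-value counts copied from the output above
missing = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # row count reported by describe()

# Share of rows missing each feature, as a percentage
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

On the full frame, `X_train.isnull().mean() * 100` should give the same percentages directly.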

In [52]:
# Extract the title: the text between the comma and the first period in Name
X_train['Title'] = X_train['Name'].str.extract(r'(?<=,) (.*?)(?=\.)', expand=False)
X_train['Title'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess',
       'Jonkheer'], dtype=object)

In [53]:
X_train[X_train['Age'].isnull()]

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
6,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q,Mr
18,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,Mr
20,3,"Masselmani, Mrs. Fatima",female,,0,0,2649,7.2250,,C,Mrs
27,3,"Emir, Mr. Farred Chehab",male,,0,0,2631,7.2250,,C,Mr
29,3,"O'Dwyer, Miss. Ellen ""Nellie""",female,,0,0,330959,7.8792,,Q,Miss
...,...,...,...,...,...,...,...,...,...,...,...
860,3,"Razi, Mr. Raihed",male,,0,0,2629,7.2292,,C,Mr
864,3,"Sage, Miss. Dorothy Edith ""Dolly""",female,,8,2,CA. 2343,69.5500,,S,Miss
869,3,"van Melkebeke, Mr. Philemon",male,,0,0,345777,9.5000,,S,Mr
879,3,"Laleff, Mr. Kristo",male,,0,0,349217,7.8958,,S,Mr
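
One possible use of `Title` for the missing ages is to fill each gap with the median age of passengers sharing the same title. The notebook does not state this as its method, so treat it as an assumption. The sketch below runs on eight rows copied from the outputs above; `sample`, `title_median` and `AgeFilled` are illustrative names.

```python
import pandas as pd

# Rows copied from the outputs above: five with a known age, three without
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
        "Moran, Mr. James",
        "Masselmani, Mrs. Fatima",
        "O'Dwyer, Miss. Ellen \"Nellie\"",
    ],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, None, None, None],
})

# Same idea as the Title cell: text between the comma and the first period
sample["Title"] = sample["Name"].str.extract(r", (.*?)\.", expand=False)

# Median age per title, broadcast back to every row with that title
title_median = sample.groupby("Title")["Age"].transform("median")

# Fill only the missing ages; known ages are left unchanged
sample["AgeFilled"] = sample["Age"].fillna(title_median)

print(sample[["Title", "Age", "AgeFilled"]])
```

If this approach is used for modelling, the medians would normally be computed on the training set only and then reused for the test set.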


# Clean data

# Transform data

**Ideas for feature engineering**

- Convert title of passenger to number
- Convert cabins to binary, have and have not
- Create a deck number of cabin based on map and letter
- Create number of family members on boat
- Strip strings from ticket number
- Convert embarkation points to numbers
- Convert sex to binary 0 and 1
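
The ideas above could be sketched as follows, using the first five training rows from the `X_train.head()` output. This is one possible reading of each idea, not the notebook's final implementation: the new column names, the numeric codes, and the "U" placeholder for an unknown deck are assumptions, and the deck feature uses only the cabin letter, not a deck map.

```python
import pandas as pd

# First five training rows, copied from the X_train.head() output
X = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss. Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "373450"],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

# Title as a number: category codes follow alphabetical order of the titles
title = X["Name"].str.extract(r", (.*?)\.", expand=False)
X["TitleCode"] = title.astype("category").cat.codes

# Cabin as binary: 1 if a cabin is recorded, 0 otherwise
X["HasCabin"] = X["Cabin"].notna().astype(int)

# Deck letter: first character of the cabin, "U" (assumed) when unknown
X["Deck"] = X["Cabin"].str[0].fillna("U")

# Family members aboard: siblings/spouses + parents/children + the passenger
X["FamilySize"] = X["SibSp"] + X["Parch"] + 1

# Ticket number without its string prefix: keep only the trailing digits
X["TicketNumber"] = pd.to_numeric(
    X["Ticket"].str.extract(r"(\d+)$", expand=False), errors="coerce"
)

# Embarkation point as a number (the mapping values are an assumption)
X["EmbarkedCode"] = X["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Sex as binary: 0 = male, 1 = female
X["SexCode"] = (X["Sex"] == "female").astype(int)

print(X.drop(columns=["Name"]))
```

Tickets with no trailing digits would become NaN under `errors="coerce"`, and the two passengers with a missing `Embarked` value would get a NaN code, so both columns would still need cleaning on the full data.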


# Model

# Validation