# Titanic: Machine Learning from Disaster
Author - Rishabh Jain

In [1]:
import warnings,os,math
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

## Loading Train and Test set

In [2]:
train=pd.read_csv('data/train.csv')
test=pd.read_csv('data/test.csv')
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [None]:
recordId='PassengerId'
target='Survived'
trainId=train[recordId]
testId=test[recordId]

# Dropping the PassengerId column (a unique identifier) from the train and test sets.
train.drop(recordId,axis=1,inplace=True)
test.drop(recordId,axis=1,inplace=True)

# Checking Dataset shape
print('Train Set\t %d X %d'%(train.shape[0],train.shape[1]))
print('Test Set\t %d X %d'%(test.shape[0],test.shape[1]))

## Data Preprocessing

In [None]:
features=['Pclass','SibSp','Parch','Sex','Embarked','Age','Fare','Survived']
nrows=2
ncols=int(np.ceil(len(features)/nrows))
fig,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(14,5))
fig.subplots_adjust(wspace=0.4,hspace=0.4)
for row in range(nrows):
    for col in range(ncols):
        feature=features[row*ncols+col]
        if feature in ['Age','Fare']:
            # Pass x and y as keywords; recent seaborn versions do not accept them positionally
            sns.violinplot(x=train[target],y=train[feature],ax=ax[row,col])
        else:
            sns.barplot(x=train[feature],y=train[target],ax=ax[row,col])

**A few observations from the plots:**
- **Pclass -** An ordinal feature: passengers with `Pclass=1` had a higher probability of surviving than passengers with `Pclass=3`.
- **SibSp -** Passengers with fewer siblings and spouses aboard had a higher chance of surviving.
- **Sex -** Females were more likely to survive the Titanic disaster.
- **Embarked -** Passengers who embarked at port C had a higher probability of surviving than other passengers.
- **Fare -** Passengers who paid a higher fare, in other words passengers with higher socio-economic status, had a better chance of survival.
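
The bar heights behind these observations are group-wise means of the 0/1 target. As a minimal sketch, assuming a small made-up frame (the `toy` rows below are hypothetical and are not the real Titanic data), the same numbers can be read off with a `groupby`:

```python
import pandas as pd

# Hypothetical toy rows (NOT the real Titanic data), only to show the mechanics.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 2],
    "Sex":      ["female", "male", "male", "female", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# The mean of a 0/1 target within a group is that group's survival rate,
# which is what each bar drawn by sns.barplot (default estimator: mean) shows.
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

On the real `train` frame the equivalent call would be `train.groupby('Pclass')['Survived'].mean()`.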

**Let's first concatenate the train and test sets to handle missing data and engineer features.**

In [None]:
nTrain=train.shape[0]
nTest=test.shape[0]
trainY=train[target]
allData=pd.concat((train,test)).reset_index(drop=True)
allData.drop(target,axis=1,inplace=True)
print('Train + Test Set\t %d X %d'%(allData.shape[0],allData.shape[1]))

### Handling Missing Data

In [None]:
count=allData.isnull().sum().sort_values(ascending=False)
percentage=(allData.isnull().sum()/allData.isnull().count()).sort_values(ascending=False)*100
dtypes=allData[count.index].dtypes
missingData=pd.DataFrame({'Count':count,'Percentage':percentage,'Type':dtypes})
missingData.drop(missingData[missingData['Count']==0].index,inplace=True)
missingData.head(10)
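
The same kind of summary can be built a little more compactly, since the mean of a boolean mask is the fraction of missing values. This is only a sketch on a hypothetical toy frame (the `toy` values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps (NOT the real data).
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare":  [7.25, 71.2833, 7.925, 53.1],
})

missing = toy.isnull()
summary = pd.DataFrame({
    "Count": missing.sum(),
    "Percentage": missing.mean() * 100,  # mean of booleans = fraction missing
    "Type": toy.dtypes,
})
# Keep only the columns that actually have missing values, most affected first.
summary = summary[summary["Count"] > 0].sort_values("Count", ascending=False)
print(summary)
```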

The **Cabin** feature is closely related to the socio-economic status of the passenger and contains the deck information, so we will keep it and replace the missing values (`Cabin=NaN`) with the placeholder `M`.

In [None]:
idx=allData[allData['Cabin'].isnull()].index
allData.loc[idx,'Cabin']='M'

According to the plots above, the **Age** of a passenger does not affect survival much, so we are going to drop this feature.

In [None]:
allData.drop(columns=['Age'],inplace=True)

After looking up on the internet the passengers whose port of embarkation is missing, we can conclude that:
- Icard, Miss. Amelie embarked at Southampton ([source](https://www.encyclopedia-titanica.org/titanic-survivor/)).
- Martha Evelyn embarked at Southampton ([source](https://www.encyclopedia-titanica.org/titanic-survivor/martha-evelyn-stone.html)).

In [None]:
allData[allData['Embarked'].isnull()]

In [None]:
idx=allData[allData['Embarked'].isnull()].index
allData.loc[idx,'Embarked']='S'

Only one passenger has a missing fare. We will handle it in two steps:
- First, identify whether this passenger is from the train or test set using the passenger name.
- Then, replace the fare value with the mean fare of passengers with `Pclass=3`, `Sex=male` and `Embarked=S` in that set.

In [None]:
allData[allData['Fare'].isnull()]

In [None]:
name=allData[allData['Fare'].isnull()].Name.values[0]
dataset=train if name in train['Name'].tolist() else test
groups=dataset.groupby(['Pclass','Sex','Embarked'])['Fare'].mean().to_frame('Mean Fare')
groups

In [None]:
idx=allData[allData['Fare'].isnull()].index
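# Mean fare of third-class male passengers who embarked at S, looked up by its full MultiIndex key
allData.loc[idx,'Fare']=groups.loc[(3,'male','S'),'Mean Fare']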
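
A more general variant of this idea fills every missing fare with the mean of its own group in one step. The sketch below is an alternative to the lookup used in this notebook, not the same procedure: it uses `groupby(...).transform('mean')` on a hypothetical toy frame (the `toy` values are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows (NOT the real data); one fare is missing.
toy = pd.DataFrame({
    "Pclass":   [3, 3, 3, 1, 1],
    "Sex":      ["male", "male", "male", "female", "female"],
    "Embarked": ["S", "S", "S", "C", "C"],
    "Fare":     [8.0, 10.0, np.nan, 70.0, 90.0],
})

# transform('mean') returns one value per row: the mean fare of that row's
# (Pclass, Sex, Embarked) group, with NaN values ignored in the mean.
group_mean = toy.groupby(["Pclass", "Sex", "Embarked"])["Fare"].transform("mean")

# Only the missing entries are replaced; known fares are left untouched.
toy["Fare"] = toy["Fare"].fillna(group_mean)
print(toy)
```

Here the missing fare becomes the mean of the two known fares in its group; groups with no known fare at all would stay NaN.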

Verifying that all the missing values have been dealt with.

In [None]:
count=allData.isnull().sum().sort_values(ascending=False).to_frame(name='count')
count

### Feature Engineering
Here, we are going to create a few new features:

- **FamilySize -** Created by adding **SibSp** and **Parch**, plus 1 for the passenger.
- **IsAlone -** Binary feature set to 1 if **FamilySize** is 1, otherwise 0.
- **Title -** Extracted from the **Name** feature. All titles that occur fewer than 10 times are replaced with **'Misc'**.
- **Deck -** Extracted from the **Cabin** feature (its first letter).

After the new features are derived, the **Name**, **Cabin** and **Ticket** features will be dropped from the dataframe.

In [None]:
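# FamilySize: siblings/spouses + parents/children + the passenger
allData['FamilySize']=allData['SibSp']+allData['Parch']+1
# IsAlone: 1 when FamilySize is 1, else 0. Built as an integer column so that
# pd.get_dummies treats it as numeric instead of one-hot encoding an object column.
allData['IsAlone']=(allData['FamilySize']==1).astype(int)
# Title: the word that is followed by a period in the name, e.g. "Mr" or "Miss".
# Raw string avoids the invalid "\." escape; expand=False returns a Series.
allData['Title']=allData['Name'].str.extract(r" ([A-Za-z]+)\.",expand=False)
# Titles occurring fewer than 10 times are grouped under 'Misc'
titleNames=(allData['Title'].value_counts()<10)
allData['Title']=allData['Title'].apply(lambda title: 'Misc' if titleNames.loc[title] else title)
# Deck: first letter of the cabin
allData['Deck']=allData['Cabin'].str[0]
# Dropping Name, Cabin and Ticket features
allData.drop(columns=['Name','Cabin','Ticket'],inplace=True)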
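
To see what the title extraction does in isolation, here is a minimal sketch run on four names taken from the `train.head()` output above. The threshold `min_count = 2` is a hypothetical stand-in for the notebook's threshold of 10, chosen only so that the grouping is visible on four rows.

```python
import pandas as pd

# Names copied from the train.head() output shown earlier.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Allen, Mr. William Henry",
])

# Capture the letters between a space and the first following period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Hypothetical threshold for this tiny example (the notebook uses 10).
min_count = 2
counts = titles.value_counts()
rare = counts[counts < min_count].index

# Keep frequent titles, replace rare ones with 'Misc'.
titles_grouped = titles.where(~titles.isin(rare), "Misc")
print(pd.DataFrame({"Title": titles, "Grouped": titles_grouped}))
```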

In [None]:
allData.head()

**Let's take a look at how our new features affect the survival of a passenger in the training set.**

In [None]:
_train=allData[:nTrain]
features=['FamilySize','IsAlone','Title','Deck']
nrows=1
ncols=int(np.ceil(len(features)/nrows))
fig,ax=plt.subplots(nrows=nrows,ncols=ncols,figsize=(14,2.5))
fig.subplots_adjust(wspace=0.4,hspace=0.4)
for col in range(ncols):
    feature=features[col]
    # Compare strings with !=, not identity; pass x and y as keywords for recent seaborn
    if feature!='Deck':
        sns.barplot(x=_train[feature],y=trainY,ax=ax[col])
    else:
        sns.barplot(x=_train[feature],y=trainY,ax=ax[col],order=['A','B','C','D','E','F','G','M','T'])

**It's important to understand whether cabins/decks were assigned to passengers based on socio-economic status.**<br><br>
<img src="images/Titanic_side_plan.png" style="width:700px;">

In [None]:
_train.groupby(['Deck','Pclass']).size().to_frame(name='Passenger Count')

From the table above, we can conclude a few things:
- Decks A, B, C and T contained only passengers with `Pclass=1` and will be replaced by 'ABC'.
- Decks D & E contained passengers with `Pclass=1,2,3` and will be replaced by 'DE'.
- Decks F & G contained passengers with `Pclass=2,3` and will be replaced by 'FG'.
- Deck M is just a placeholder value for passengers without a recorded cabin.

This way, we reduce the cardinality of the Deck feature from 9 to 4.

In [None]:
allData['Deck']=allData['Deck'].replace(['A','B','C','T'],'ABC')
allData['Deck']=allData['Deck'].replace(['D','E'],'DE')
allData['Deck']=allData['Deck'].replace(['F','G'],'FG')
allData['Deck'].value_counts()
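
The same grouping can also be written as a single dictionary lookup, which keeps the whole deck-to-group mapping in one place. This is a sketch only: apart from `C85`, which appears in the data preview above, the cabin strings below are illustrative, and `deck_groups` is a name introduced here.

```python
import pandas as pd

# Illustrative cabin values; 'M' is the placeholder for a missing cabin.
cabins = pd.Series(["C85", "M", "E46", "G6", "T", "B28"])
deck = cabins.str[0]  # first letter = deck

# Hypothetical helper mapping: one entry per original deck letter.
deck_groups = {
    "A": "ABC", "B": "ABC", "C": "ABC", "T": "ABC",
    "D": "DE", "E": "DE",
    "F": "FG", "G": "FG",
    "M": "M",
}
grouped = deck.map(deck_groups)
print(grouped.value_counts())
```

Unlike `replace`, `map` turns any letter missing from the dictionary into NaN, which makes unexpected deck values easy to spot.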

In [None]:
fig,ax=plt.subplots(figsize=(5,4))
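# Restrict the correlation to numeric columns; recent pandas raises on string columns otherwise
corrMat=allData.corr(numeric_only=True)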
sns.heatmap(corrMat,annot=True)

**Converting Categorical variables into Dummy variables**

In [None]:
allData=pd.get_dummies(allData)
print('Train + Test Set\t %d X %d'%(allData.shape[0],allData.shape[1]))
allData.sample(5)
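
As a small illustration of what `pd.get_dummies` does here, the sketch below encodes a hypothetical three-row frame (the `toy` values are made up). Numeric columns pass through unchanged, while each string column is expanded into one indicator column per category.

```python
import pandas as pd

# Hypothetical toy rows (NOT the real data).
toy = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex":    ["female", "male", "female"],
    "Deck":   ["ABC", "M", "DE"],
})

encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

Note that `Pclass` stays a single integer column, so the model sees it as an ordered numeric feature rather than three separate indicators.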

**Splitting dataset back to training and test set**

In [None]:
trainX=allData[:nTrain]
testX=allData[nTrain:]

## Modelling
Train-validation split on the training data (a held-out set for validating the models)

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Splitting training set further into training and validation set
subTrainX,valX,subTrainY,valY=train_test_split(trainX,trainY,test_size=0.2,random_state=42)
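
One way the held-out split might then be used is sketched below: fit the two imported classifiers on the sub-training part and score them on the validation part. The data here is synthetic (random features with a made-up label rule), and the hyperparameters are arbitrary assumptions, so the scores say nothing about the Titanic problem.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for trainX / trainY (NOT the real data).
rng = np.random.RandomState(42)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Same split settings as above: 80% sub-training, 20% validation.
sub_X, val_X, sub_y, val_y = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters are arbitrary choices for this sketch.
models = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=42),
    "forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(sub_X, sub_y)
    scores[name] = model.score(val_X, val_y)  # mean accuracy on the validation set

print(scores)
```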