### Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Data Exploration

We load both the <i>train</i> and <i>test</i> datasets and explore and contrast their structure and content.

In [3]:
train = pd.read_csv('../../data/train.csv')
test = pd.read_csv('../../data/test.csv')

train.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [34]:
train.shape

(891, 12)

In [35]:
train.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [56]:
# Excluding the 'PassengerId' identifier column to reduce clutter

train[train.columns.difference(['PassengerId'])].describe(include = 'all')

,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
count,714.0,204,889,891.0,891,891.0,891.0,891,891.0,891.0,891.0
unique,,147,3,,891,,3.0,2,,2.0,681.0
top,,B96 B98,S,,"Brown, Mr. Thomas William Solomon",,3.0,male,,0.0,347082.0
freq,,4,644,,1,,491.0,577,,549.0,7.0
mean,29.699118,,,32.204208,,0.381594,,,0.523008,,
std,14.526497,,,49.693429,,0.806057,,,1.102743,,
min,0.42,,,0.0,,0.0,,,0.0,,
25%,20.125,,,7.9104,,0.0,,,0.0,,
50%,28.0,,,14.4542,,0.0,,,0.0,,
75%,38.0,,,31.0,,0.0,,,1.0,,


In [38]:
train['Survived'].value_counts(normalize = True)

0    0.616162
1    0.383838
Name: Survived, dtype: float64

Two features, <code>Parch</code> and <code>SibSp</code>, are numeric in type but have heavily skewed distributions. We take a closer look at the distribution of values for these two features.

In [4]:
train['SibSp'].value_counts(normalize = True)

0    0.682379
1    0.234568
2    0.031425
4    0.020202
3    0.017957
8    0.007856
5    0.005612
Name: SibSp, dtype: float64

In [5]:
train['Parch'].value_counts(normalize = True)

0    0.760943
1    0.132435
2    0.089787
5    0.005612
3    0.005612
4    0.004489
6    0.001122
Name: Parch, dtype: float64
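
The skew seen in the value counts above can also be summarised with two numbers: the share of records at the most common value and the sample skewness. The following is a minimal sketch on made-up toy values (not the Titanic data); the variable names are illustrative only.

```python
import pandas as pd

# Toy values, NOT the Titanic data: most entries are 0, a few are large.
sibsp = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 8], name="SibSp")

# Share of records equal to 0 (the mean of a boolean mask is a proportion).
share_zero = (sibsp == 0).mean()

# Sample skewness; a positive value indicates a long right tail.
skewness = sibsp.skew()

print(f"share of zeros: {share_zero:.2f}")
print(f"skewness: {skewness:.2f}")
```

On the real data the same two calls could be applied to <code>train['SibSp']</code> and <code>train['Parch']</code> before they are converted to categories.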

In [6]:
train.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We repeat the data exploration steps for the <i>test</i> dataset. 

In [7]:
test.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [8]:
test.shape

(418, 11)

In [9]:
test.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [10]:
test[test.columns.difference(['PassengerId'])].describe(include = 'all')

,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Ticket
count,332.0,91,418,417.0,418,418.0,418.0,418,418.0,418
unique,,76,3,,418,,,2,,363
top,,B57 B59 B63 B66,S,,"Danbom, Master. Gilbert Sigvard Emanuel",,,male,,PC 17608
freq,,3,270,,1,,,266,,5
mean,30.27259,,,35.627188,,0.392344,2.26555,,0.447368,
std,14.181209,,,55.907576,,0.981429,0.841838,,0.89676,
min,0.17,,,0.0,,0.0,1.0,,0.0,
25%,21.0,,,7.8958,,0.0,1.0,,0.0,
50%,27.0,,,14.4542,,0.0,3.0,,0.0,
75%,39.0,,,31.5,,0.0,3.0,,1.0,


In [12]:
test['SibSp'].value_counts(normalize = True)

0    0.677033
1    0.263158
2    0.033493
4    0.009569
3    0.009569
8    0.004785
5    0.002392
Name: SibSp, dtype: float64

In [13]:
test['Parch'].value_counts(normalize = True)

0    0.775120
1    0.124402
2    0.078947
3    0.007177
9    0.004785
4    0.004785
6    0.002392
5    0.002392
Name: Parch, dtype: float64

In [14]:
test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

#### Observations
<ol> 
    <li> Overall, the two datasets have similar structure and content; the <i>test</i> dataset lacks only the <code>Survived</code> column.
    <li> In the <i>train</i> dataset, only ~38% of passengers survived.
    <li> The <i>train</i> dataset has 891 records with 12 columns each, including the target <code>Survived</code>. The <i>test</i> dataset has 418 records with 11 columns.
    <li> The features <code>Name, Sex, Ticket, Cabin, Embarked</code> hold string data; the remaining features are numeric.
    <li> In the <i>train</i> dataset, the features <code>Age, Cabin, Embarked</code> have 177, 687 and 2 missing values respectively.
    <li> In the <i>test</i> dataset, the features <code>Age, Fare, Cabin</code> have 86, 1 and 327 missing values respectively.
    <li> In both datasets, <code>SibSp</code> and <code>Parch</code> are heavily skewed: ~68% of records have <code>SibSp</code> = 0 and ~76-78% have <code>Parch</code> = 0.
</ol>
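
The structural comparison behind these observations can be made in one step rather than by reading two separate outputs. Below is a minimal sketch, assuming a hypothetical helper named <code>compare_missing</code> and small made-up frames that stand in for <i>train</i> and <i>test</i>.

```python
import pandas as pd

def compare_missing(a: pd.DataFrame, b: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: missing-value counts of two frames, side by side.

    Columns present in only one frame show NaN for the other.
    """
    return pd.DataFrame({
        "train_missing": a.isna().sum(),
        "test_missing": b.isna().sum(),
    })

# Toy stand-ins, NOT the Titanic data.
toy_train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, None, 26.0],
    "Fare": [7.25, 71.28, 7.93],
})
toy_test = pd.DataFrame({
    "Age": [34.5, 47.0, None],
    "Fare": [7.83, None, 9.69],
})

# Columns that exist in the first frame but not in the second.
only_in_train = toy_train.columns.difference(toy_test.columns)
summary = compare_missing(toy_train, toy_test)

print(list(only_in_train))
print(summary)
```

Applied to the real frames, the same calls would be <code>train.columns.difference(test.columns)</code> and <code>compare_missing(train, test)</code>.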

### Data Pre-processing

For both the <i>train</i> and <i>test</i> datasets, we run the following data pre-processing steps:
<ol> 
    <li> Convert the features <code>Survived, Pclass</code> from <code>int</code> to <code>category</code> (<code>Survived</code> is present only in <i>train</i>).
    <li> Convert the features <code>Sex, Embarked</code> from <code>object</code> to <code>category</code>.
    <li> Given the significant skew in their distributions, convert the features <code>SibSp, Parch</code> from <code>int</code> to a two-level <code>category</code>: the value 0 forms one category and every other value is mapped to 1 to form the second.
</ol>

In [28]:
# Direct column assignment replaces the columns, so the category dtype is kept
# (assignment through .loc[:, cols] may keep the old dtype in newer pandas versions)
train[['Survived', 'Pclass']] = train[['Survived', 'Pclass']].astype('category')
test['Pclass'] = test['Pclass'].astype('category')

train[['Sex', 'Embarked']] = train[['Sex', 'Embarked']].astype('category')
test[['Sex', 'Embarked']] = test[['Sex', 'Embarked']].astype('category')

# Collapse SibSp and Parch to binary: 0 stays 0, any other value becomes 1
train.loc[train['SibSp'] != 0, 'SibSp'] = 1
train.loc[train['Parch'] != 0, 'Parch'] = 1
train[['SibSp', 'Parch']] = train[['SibSp', 'Parch']].astype('category')
test.loc[test['SibSp'] != 0, 'SibSp'] = 1
test.loc[test['Parch'] != 0, 'Parch'] = 1
test[['SibSp', 'Parch']] = test[['SibSp', 'Parch']].astype('category')
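
Because the same conversions are applied to both datasets, they could also be wrapped in one function. The sketch below is one possible arrangement and is not part of the original notebook. The name <code>preprocess</code> is hypothetical, and the frame is made-up toy data.

```python
import pandas as pd

CATEGORY_COLS = ("Survived", "Pclass", "Sex", "Embarked", "SibSp", "Parch")

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy with the three conversions applied."""
    out = df.copy()
    # Collapse counts to a binary indicator: 0 stays 0, anything else becomes 1.
    for col in ("SibSp", "Parch"):
        out[col] = (out[col] != 0).astype(int)
    # Convert only the columns that are present ('Survived' is absent from test).
    present = [c for c in CATEGORY_COLS if c in out.columns]
    return out.astype({c: "category" for c in present})

# Toy stand-in for the test frame, NOT the Titanic data.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
    "SibSp": [1, 0, 4],
    "Parch": [0, 0, 2],
})
clean = preprocess(toy)
print(clean.dtypes)
```

Returning a copy leaves the input frame untouched, so the raw counts in <code>SibSp</code> and <code>Parch</code> remain available if they are needed later.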

### Exploratory Data Analysis

We make the simplifying assumption that the survival of a passenger does not depend on the <code>string</code>-type features <code>Name, Ticket, Cabin</code>. For the remaining 7 features, we use the <i>train</i> dataset to explore how each feature varies between the Survivors and the non-Survivors.

#### Distribution of Survivors by Categorical features

In [31]:
pd.crosstab(index = train['Survived'], columns = train['Pclass'], 
            normalize = 'columns', colnames = ['Passenger Class'], 
            margins = True)

Passenger Class,1,2,3,All
Survived,,,,
0,0.37037,0.527174,0.757637,0.616162
1,0.62963,0.472826,0.242363,0.383838


In [32]:
pd.crosstab(index = train['Survived'], columns = train['Sex'], 
            normalize = 'columns', colnames = ['Sex'], 
            margins = True)

Sex,female,male,All
Survived,,,
0,0.257962,0.811092,0.616162
1,0.742038,0.188908,0.383838


In [33]:
pd.crosstab(index = train['Survived'], columns = train['Embarked'], 
            normalize = 'columns', colnames = ['Port of Embarkation'], 
            margins = True)

Port of Embarkation,C,Q,S,All
Survived,,,,
0,0.446429,0.61039,0.663043,0.617548
1,0.553571,0.38961,0.336957,0.382452


In [34]:
pd.crosstab(index = train['Survived'], columns = train['SibSp'], 
            normalize = 'columns', colnames = ['Siblings/Spouses on Board'], 
            margins = True)

Siblings/Spouses on Board,0,1,All
Survived,,,
0,0.654605,0.533569,0.616162
1,0.345395,0.466431,0.383838


In [36]:
pd.crosstab(index = train['Survived'], columns = train['Parch'], 
            normalize = 'columns', colnames = ['Parents/Children on Board'], 
            margins = True)

Parents/Children on Board,0,1,All
Survived,,,
0,0.656342,0.488263,0.616162
1,0.343658,0.511737,0.383838


#### Observations
Compared with the overall survival rate in the <i>train</i> dataset (~38%), several patterns stand out:
<ol>
    <li> Chance of survival was dramatically better for passengers in the first class (~63%) than for those in the third class (~24%); second class passengers (~47%) also fared significantly better than those in third class.
    <li> Women had a dramatically better chance of survival (~74%) than men (~19%).
    <li> Passengers embarking from port C had a much better chance of survival (~55%) than those embarking from the other two ports (~34%-39%).
    <li> Passengers with siblings or a spouse on board had a better chance of survival (~47%) than those without (~35%).
    <li> Similarly, passengers travelling with parents or children had a better chance of survival (~51%) than those without (~34%).</ol>
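
The survival rates quoted above are the <code>Survived</code> = 1 row of each column-normalized crosstab. The same rates can be obtained as a group-wise mean of a numeric 0/1 outcome, as sketched below on made-up toy data. This assumes <code>Survived</code> is still an integer column, which holds before the conversion to <code>category</code> but not after it.

```python
import pandas as pd

# Toy data, NOT the Titanic data: 3 women (2 survive), 5 men (1 survives).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 outcome within each group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# The same numbers appear in the Survived == 1 row of a column-normalized crosstab.
table = pd.crosstab(index=toy["Survived"], columns=toy["Sex"], normalize="columns")

print(rate_by_sex)
print(table.loc[1])
```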

#### Distribution of Survivors by numerical features

In [40]:
train.dtypes

PassengerId       int64
Survived       category
Pclass         category
Name             object
Sex            category
Age             float64
SibSp          category
Parch          category
Ticket           object
Fare            float64
Cabin            object
Embarked       category
dtype: object