# Titanic: Machine Learning from Disaster

## Introduction

RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on April 15, 1912, during her maiden voyage, after colliding with an iceberg. It is one of the most infamous shipwrecks in history: 1502 of the 2224 passengers and crew died, and it is considered one of the deadliest peacetime commercial maritime disasters in modern history.

In this notebook, I will build a machine learning model which will predict whether a passenger on the Titanic would have survived. 

To begin, we need to import the necessary libraries:

In [1]:
# Centering the plots in the notebook
from IPython.core.display import HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");

In [2]:
# Import the libraries used throughout the notebook
%matplotlib inline

import numpy as np 
import pandas as pd
import os
pd.options.display.max_columns = 100

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)

In [3]:
# Read in both the train and test data
train = pd.read_csv("_data/train.csv")
test = pd.read_csv("_data/test.csv")

## Exploratory Data Analysis (EDA)

### Cleaning data

In [4]:
print('Train shape:', train.shape)

Train shape: (891, 12)


In [45]:
print('Test shape:', test.shape)

Test shape: (418, 11)


In [5]:
train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


The above analysis shows that we have 891 observations, 11 features and the target variable 'Survived'.

From the Kaggle data dictionary, the main columns are:
- Survived = whether the passenger survived (0 = no, 1 = yes)
- Pclass = Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
- Sex
- Age
- SibSp = # of siblings / spouses aboard the Titanic
- Parch = # of parents / children aboard the Titanic
- Ticket = Ticket number
- Fare
- Cabin = Cabin number
- Embarked = Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

We can note a few things regarding the data quality which we will need to address in this analysis:
- Some features contain missing values.
- Some features need to be converted to numerical values so that the models can process them.
- Some features will need to be converted to roughly the same scale as each other, due to having a wide range of values.
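
As a rough illustration of the last two points, the sketch below encodes two categorical columns and rescales `Fare` on a small toy frame. The toy values and the `Fare_scaled` column name are assumptions made for this example; it is not necessarily the preprocessing used later in the notebook.

```python
import pandas as pd

# Toy rows shaped like the Titanic data (illustrative values only)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
    "Fare": [7.25, 71.2833, 7.925, 8.05],
})

# Two-valued category -> 0/1 integer column
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Multi-valued category -> one indicator column per value
toy = pd.get_dummies(toy, columns=["Embarked"], prefix="Embarked")

# Standardise Fare: subtract the mean, divide by the sample standard deviation
toy["Fare_scaled"] = (toy["Fare"] - toy["Fare"].mean()) / toy["Fare"].std()

print(toy)
```

Standardisation is only one way to put features on a similar scale; min-max scaling or a log transform of `Fare` would be reasonable alternatives.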

#### Missing values
The following function reports how many values are missing in each column of a dataset, both as a count and as a percentage:

In [19]:
def missing_values_table(data):
    missing_values = data.isnull().sum().sort_values(ascending = False)
    missing_percent = data.isnull().sum()/data.isnull().count()*100
    missing_percent = (round(missing_percent, 1)).sort_values(ascending = False)

    table = pd.concat([missing_values, missing_percent], axis=1, keys=['Total', 'Percent'])
    table = table[table['Percent'] > 0]
    return table
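
To see what this kind of summary looks like on a small input, here is a minimal, self-contained sketch. The helper name `missing_summary` and the toy frame are hypothetical; the logic follows the same idea as `missing_values_table`, but filters on the raw count rather than the rounded percentage.

```python
import numpy as np
import pandas as pd

def missing_summary(data):
    """Count and percentage of missing values per column (hypothetical helper)."""
    total = data.isnull().sum()
    # The mean of a boolean mask is the fraction of True values
    percent = (data.isnull().mean() * 100).round(1)
    table = pd.DataFrame({"Total": total, "Percent": percent})
    # Keep only columns with at least one missing value
    return table[table["Total"] > 0].sort_values("Total", ascending=False)

# Toy frame with gaps (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

summary = missing_summary(toy)
print(summary)
```

Filtering on `Total` instead of `Percent` means a column with very few missing values is still listed even if its percentage rounds down to 0.0.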

In [21]:
missing_values_table(train)

| | Total | Percent |
|---|---|---|
| Cabin | 687 | 77.1 |
| Age | 177 | 19.9 |
| Embarked | 2 | 0.2 |


In [20]:
missing_values_table(test)

| | Total | Percent |
|---|---|---|
| Cabin | 327 | 78.2 |
| Age | 86 | 20.6 |
| Fare | 1 | 0.2 |


__Cabin missing values__

The tables above show that a large share of values is missing from the Cabin column (more than 77% in both the train and test sets). The format of this feature is a deck letter followed by a cabin number, e.g. C85.

To make this feature more useful, I will extract the deck letter from the cabin number where available and store it in a new feature 'deck', using 'U' (unknown) where the cabin is missing. I will then drop the Cabin feature.

In [39]:
test['deck'] = test['Cabin'].str[0].fillna("U")
train['deck'] = train['Cabin'].str[0].fillna("U")

In [42]:
train = train.drop(['Cabin'], axis=1)
test = test.drop(['Cabin'], axis=1)

Check that this worked:

In [43]:
train['deck'].describe()

count     891
unique      9
top         U
freq      687
Name: deck, dtype: object

In [44]:
test['deck'].describe()

count     418
unique      8
top         U
freq      327
Name: deck, dtype: object
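
The deck extraction relies on `.str[0]` returning NaN for missing cabins, which `fillna("U")` then replaces. A minimal sketch on a toy Series (the cabin strings here are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy cabin values, including missing entries and a multi-cabin string
cabins = pd.Series(["C85", np.nan, "C123", "B57 B59 B63", np.nan])

# First character of each string; NaN stays NaN until fillna replaces it
deck = cabins.str[0].fillna("U")

print(deck.tolist())  # ['C', 'U', 'C', 'B', 'U']
```

Note that an entry listing several cabins is reduced to the deck of the first one.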

__Age missing values__

The above table shows that 177 values are missing from the age column in the train table, while 86 are missing in the test table. 

I have decided to fill the missing ages with random integers drawn from the range between one standard deviation below and one standard deviation above the mean age. Another option would have been to use the median age, which is robust to outliers. However, while researching how to carry out this analysis, I found that many other competitors reported better results with the random approach. It would be good to test both to see what impact they have.

In [46]:
data = [train, test]

for dataset in data:
    mean = dataset["Age"].mean()
    std = dataset["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # draw random integers from [mean - std, mean + std); bounds are cast to int for randint
    rand_age = np.random.randint(int(mean - std), int(mean + std), size=is_null)
    # fill the missing ages with the random values
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    # cast this dataset's own Age column to int (the original cast train's column into both)
    dataset["Age"] = age_slice.astype(int)

Check that this worked:

In [25]:
train.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.414141 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 13.513214 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 21.0 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 37.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
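
As a starting point for testing both imputation strategies mentioned above, the sketch below applies median filling and random filling to a toy Series and prints the resulting mean and standard deviation. The toy ages, the fixed seed and the use of NumPy's `default_rng` generator are assumptions made for this example; a real comparison would measure model accuracy rather than summary statistics alone.

```python
import numpy as np
import pandas as pd

# Toy ages with gaps (illustrative values, not the Titanic data)
ages = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan, 27.0, 14.0])
missing = ages.isnull()

# Option 1: fill every gap with the median of the observed ages
median_filled = ages.fillna(ages.median())

# Option 2: fill gaps with random integers within one std of the mean
rng = np.random.default_rng(0)  # seeded so the result is reproducible
low = int(ages.mean() - ages.std())
high = int(ages.mean() + ages.std())
random_filled = ages.copy()
random_filled[missing] = rng.integers(low, high, size=int(missing.sum()))

# Compare how each strategy shifts the mean and spread
for name, filled in [("median", median_filled), ("random", random_filled)]:
    print(name, round(filled.mean(), 2), round(filled.std(), 2))
```

Both strategies place the filled values near the centre of the distribution, so they tend to shrink the spread; in the notebook's own output above, the standard deviation of Age fell from 14.53 to 13.51 after imputation.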
