# Titanic: Machine Learning from Disaster

### Import Libraries

In [228]:
import random

import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression

### Load Data

In [229]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

In [230]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [231]:
# Separate data and target values
y_train_df = train_df['Survived']

X_train_df = train_df.drop('Survived', axis=1)
X_test_df = test_df

### Visualize data before cleanup

In [232]:
train_df.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Data Cleanup

### Missing Values

* https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/

In [233]:
train_df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

#### Age

The age of passengers is a very important attribute for our analysis, as Children were among the group of people onboard with a higher chance of survival (others being Women and Upper class). Removing records with no age information (~20% of the dataset) is not an option, neither is replacing the age with mean, median or mode. Let's use the age values we have from the other 80% of data to predict the missing 20%.

In [234]:
linear = LinearRegression()

data_with_null = train_df[train_df['Age'].isnull()][['PassengerId', 'Pclass', 'Survived', 'SibSp', 'Parch', 'Fare', 'Age']]
data_without_null = train_df[['PassengerId', 'Pclass', 'Survived', 'SibSp', 'Parch', 'Fare', 'Age']].dropna()

age_train_X = data_without_null.drop('Age', axis=1)
age_train_y = data_without_null['Age']

# FIXME - Does the presence of PassengerId affect fit?
linear.fit(age_train_X, age_train_y)

age_predicted = data_with_null
age_test_X = data_with_null.drop('Age', axis=1)
age_predicted['Age'] = pd.DataFrame(linear.predict(age_test_X))
age_predicted.head(5)

# https://stackoverflow.com/questions/41773728/pandas-fill-na-with-data-from-another-dataframe-based-on-same-id
train_df = train_df.set_index("PassengerId").combine_first(age_predicted.set_index("PassengerId")).reset_index()

#### Cabin

Unfortunately, over 77% of Cabin info is missing in the dataset. Discarding data which does not have cabin data is out of the question. Also, survival may have been affected by which cabin a person was in and consequently which deck they were on, when the Titanic sank. While we could set it to a new value like `U`, a better way would be to analyse the deck structure of RMS Titanic and assign cabins by class.

A cursory read-through of how Titanic's cabins were organized in Wikipedia gives the following insights:
* A-Deck: It was reserved exclusively for First Class passengers
* B-Deck: More First Class passenger accommodations were located here 
* C-Deck: Crew Cabins
* D-Deck: First, Second and Third Class passengers had cabins on this deck
* E-Deck: The majority of E-Deck was occupied by Second-Class
* F-Deck: Second and Third Class passengers

So, let us assign decks based on a passenger's class in the following way:
* First Class: Random assignment beteween A and B Decks
* Second Class: Random assignment between D and E Decks
* Third Class: Random assignment between E, F and G Decks

While we are at it, let's also convert values in Cabin column to Decks, because that's a better feature for our analysis.

In [235]:
# Convert Cabins to Decks
def convert_to_deck(cabin):
    if not pd.isna(cabin):
        cabin = cabin[0]
    return cabin

train_df['Deck'] = train_df['Cabin'].apply(convert_to_deck)

In [236]:
# Random assignment of Decks for passengers with no Cabin info
for i, row in train_df[train_df['Cabin'].isnull()].iterrows(): 
    if row['Pclass'] == 1:
        train_df.at[i, 'Deck'] = random.choice(['A', 'B'])
    elif row['Pclass'] == 2:
        train_df.at[i, 'Deck'] = random.choice(['D', 'E'])
    else:
        train_df.at[i, 'Deck'] = random.choice(['E', 'F', 'G'])

#### Embarked

Most passengers boarded Titanic at Southampton (923, vs. 274 in Cherbourg and 123 in Queenstown). Let's just assign the missing `Embarked` values to `S` for Southampton.

In [237]:
# Update null values in Embarked to S

train_df['Embarked'].fillna('S', inplace=True)

#### Dataset after replacing null values

In [238]:
train_df.head(10)

Unnamed: 0,PassengerId,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket,Deck
0,1,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,3,male,1,0,A/5 21171,F
1,2,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,female,1,1,PC 17599,C
2,3,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,female,0,1,STON/O2. 3101282,F
3,4,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,female,1,1,113803,C
4,5,35.0,,S,8.05,"Allen, Mr. William Henry",0,3,male,0,0,373450,F
5,6,28.967301,,Q,8.4583,"Moran, Mr. James",0,3,male,0,0,330877,F
6,7,54.0,E46,S,51.8625,"McCarthy, Mr. Timothy J",0,1,male,0,0,17463,E
7,8,2.0,,S,21.075,"Palsson, Master. Gosta Leonard",1,3,male,3,0,349909,F
8,9,27.0,,S,11.1333,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",2,3,female,0,1,347742,E
9,10,14.0,,C,30.0708,"Nasser, Mrs. Nicholas (Adele Achem)",0,2,female,1,1,237736,D


### Outliers

## Data Analysis

## Feature Selection

### PCA

## Models and Predictions

## Conclusion