# Titanic // Machine Learning from Disaster

An introduction to using machine learning for predicting which passengers survived the Titanic shipwreck.

Further resources available at: https://www.kaggle.com/c/titanic

For a tutorial on how to use Kaggle, getting set up, and finding your own environment to code in, see: https://www.kaggle.com/alexisbcook/titanic-tutorial

Discussion of scores: https://www.kaggle.com/c/titanic/discussion/57447

### Load necessary packages

In [1]:
import pandas as pd
from pandas_profiling import ProfileReport

### Load data

In [2]:
df_train = pd.read_csv('data/train.csv')

### Inspect Data

In [3]:
df_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### Assumptions

* PassengerId is not a predictor (it is an arbitrary identifier assigned after the fact).
* Survived is our "target variable", i.e. what we are trying to predict in our test dataset.
* Pclass is important. Needs to be one-hot encoded.
* Name is not important. Remove it.
* Sex may be important. Needs to be one-hot encoded.
* Age is important. Needs to be normalized.
* SibSp could be important. Could be either one-hot encoded or turned into a boolean.
* Parch could be important. Could be either one-hot encoded or turned into a boolean.
* Ticket is likely not important. Remove it.
* Fare could be important. Likely correlated with Pclass. Needs to be normalized.
* Cabin could be important. Needs to be one-hot encoded.
* Embarked is likely not important, but keep it. Needs to be one-hot encoded.
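To make the planned transformations concrete, here is a minimal sketch of one-hot encoding and normalization on the five rows shown by `df_train.head()`. The choice of min-max scaling is an assumption (the list above only says "normalized"), and the names `sample`, `encoded` and `fare_pclass_corr` are introduced purely for illustration. Five rows are far too few to draw conclusions from; the point is only to show the mechanics.

```python
import pandas as pd

# The five rows shown by df_train.head(), restricted to a few columns
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Quick check of the "Fare is likely correlated with Pclass" assumption
# (Pearson correlation; negative means higher fares go with lower class numbers)
fare_pclass_corr = sample["Fare"].corr(sample["Pclass"])

# One-hot encoding: one indicator column per category value
encoded = pd.get_dummies(sample, columns=["Pclass", "Sex"])

# Min-max normalization (an assumed choice): rescale each column to [0, 1]
for col in ["Age", "Fare"]:
    col_min, col_max = encoded[col].min(), encoded[col].max()
    encoded[col] = (encoded[col] - col_min) / (col_max - col_min)

print(fare_pclass_corr)
print(encoded)
```

On the full training set the same calls apply unchanged; only the `sample` frame would be replaced by `df_train`.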


Notes for feature engineering:
* Maybe extract letter from Cabin variable

### Automated Exploratory Data Analysis

From: https://pypi.org/project/pandas-profiling/ 

The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

* Type inference: detect the types of columns in a dataframe.
* Essentials: type, unique values, missing values
* Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent values
* Histogram
* Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* Missing values: matrix, count, heatmap and dendrogram of missing values
* Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
* File and Image analysis: extract file sizes, creation dates and dimensions, and scan for truncated images or those containing EXIF information
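For comparison, a few of the statistics listed above can be computed by hand with plain pandas. This is only a rough sketch on the five rows from `df_train.head()`, not a substitute for the report; the variable names are illustrative, and Pearson correlation is used here simply because it is the pandas default.

```python
import pandas as pd

# The five rows shown by df_train.head(), restricted to a few columns
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
})

# Essentials: number of missing values per column
missing_counts = sample.isna().sum()

# Quantile statistics: Q1, median and Q3 of the numeric features
quantiles = sample[["Age", "Fare"]].quantile([0.25, 0.5, 0.75])

# Descriptive statistics: skewness of the numeric features
skewness = sample[["Age", "Fare"]].skew()

# Correlations: Pearson correlation matrix of the numeric columns
correlations = sample[["Survived", "Pclass", "Age", "Fare"]].corr()

print(missing_counts, quantiles, skewness, correlations, sep="\n\n")
```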

In [10]:
profile = ProfileReport(df_train, title="Pandas Profiling Report - Titanic Dataset")
profile.to_file("titanic data analysis.html")




## Feature Engineering

In [57]:
# X will represent our features, y will represent our target variable
X = df_train.drop(columns='Survived')
y = df_train[['Survived']]

# Drop unnecessary columns
X = X.drop(columns=['Name', 'Ticket'])
X.head()

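| | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 3 | male | 22.0 | 1 | 0 | 7.25 | NaN | S |
| 1 | 2 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C85 | C |
| 2 | 3 | 3 | female | 26.0 | 0 | 0 | 7.925 | NaN | S |
| 3 | 4 | 1 | female | 35.0 | 1 | 0 | 53.1 | C123 | S |
| 4 | 5 | 3 | male | 35.0 | 0 | 0 | 8.05 | NaN | S |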


In [58]:
# Extract first letter of Cabin (the deck); missing cabins stay NaN
X['Cabin'] = X['Cabin'].str[0]
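A small sketch of how `.str[0]` treats missing cabins, using the Cabin values from the five rows shown earlier. A missing value has no first character, so it comes back as missing rather than as a letter. The names `cabins` and `deck` are only for illustration.

```python
import pandas as pd

# Cabin values from the first five rows; missing cabins are represented as NaN
cabins = pd.Series([float("nan"), "C85", float("nan"), "C123", float("nan")], name="Cabin")

# Take the first character of each string; missing entries remain missing
deck = cabins.str[0]

# Count each letter, including the missing group
print(deck.value_counts(dropna=False))
```

Keeping the missing group as NaN is what later allows it to receive its own indicator column during one-hot encoding.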

In [59]:
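# Convert numerical values to binned categorical values.
# Each bin is labelled with its lower edge, e.g. ages in (20, 30] get the label 20.
# Note: pd.cut uses right-closed intervals by default, so a value of exactly 0
# falls outside the first bin and becomes NaN (include_lowest=True would keep it).
age_bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
fare_bins = [0, 5, 10, 15, 20, 30, 40, 50, 75, 100, 200, 1000]
X['Age'] = pd.cut(X['Age'], bins=age_bins, labels=age_bins[:-1])
X['Fare'] = pd.cut(X['Fare'], bins=fare_bins, labels=fare_bins[:-1])

# Define which columns contain categorical values
categorical = ['Age', 'Fare', 'Pclass', 'Sex', 'SibSp', 'Parch', 'Cabin', 'Embarked']

# One-hot encode each categorical column, with an extra indicator column for missing values
for col in categorical:
    # The trailing underscore plus the default separator yields names like 'Age__20.0'
    prefix = col + '_'
    dummies = pd.get_dummies(X[col], prefix=prefix, dummy_na=True)

    X = X.drop(columns=col)
    X = pd.concat([X, dummies], axis=1)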

X

Unnamed: 0,PassengerId,Age__0.0,Age__10.0,Age__20.0,Age__30.0,Age__40.0,Age__50.0,Age__60.0,Age__70.0,Age__80.0,...,Cabin__D,Cabin__E,Cabin__F,Cabin__G,Cabin__T,Cabin__nan,Embarked__C,Embarked__Q,Embarked__S,Embarked__nan
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,2,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,3,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,4,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,5,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
887,888,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
888,889,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
889,890,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
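To see what the binning step does at the bin edges, here is a short sketch that applies the same `fare_bins` to a handful of values. The fares 7.25 and 71.2833 come from the rows shown earlier; 0.0, 5.0 and the missing value are made up to exercise the edge cases. `include_lowest=True` is shown only as one possible alternative, not as what the notebook does.

```python
import pandas as pd

fare_bins = [0, 5, 10, 15, 20, 30, 40, 50, 75, 100, 200, 1000]

# Illustrative fares: two edge cases, two real values, one missing value
fares = pd.Series([0.0, 5.0, 7.25, 71.2833, float("nan")], name="Fare")

# Default behaviour: intervals are right-closed, so the first bin is (0, 5]
# and a fare of exactly 0 is not assigned to any bin
binned = pd.cut(fares, bins=fare_bins, labels=fare_bins[:-1])

# Alternative: include_lowest=True makes the first bin include its left edge
binned_incl = pd.cut(fares, bins=fare_bins, labels=fare_bins[:-1], include_lowest=True)

# One-hot encode the bins; dummy_na=True adds a final indicator column for NaN
dummies = pd.get_dummies(binned, prefix="Fare_", dummy_na=True)

print(pd.DataFrame({"fare": fares, "default": binned, "include_lowest": binned_incl}))
print(dummies)
```

With the default settings, a zero fare and a genuinely missing fare end up in the same NaN indicator column, which may or may not be what is wanted.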
