# Titanic // Machine Learning from Disaster

An introduction to using machine learning for predicting which passengers survived the Titanic shipwreck.

Further resources available at: https://www.kaggle.com/c/titanic

For a tutorial on using Kaggle, getting set up, and finding an environment to code in, see: https://www.kaggle.com/alexisbcook/titanic-tutorial

Discussion of what makes a good score: https://www.kaggle.com/c/titanic/discussion/57447

### Load necessary packages

In [3]:
import pandas as pd
from pandas_profiling import ProfileReport

### Load data

In [50]:
df = pd.read_csv('data/train.csv')

### Inspect Data

In [51]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


### Assumptions

* PassengerId is not a predictor (since it was assigned randomly afterwards)
* Survived is our "Target Variable", i.e. what we are trying to predict in our test dataset
* Pclass is important. Needs to be one-hot encoded.
* Name is not important. Remove it.
* Sex is maybe important, one-hot encode it
* Age is important. Needs to be normalized.
* SibSp could be important. Could be either one-hot encoded, or turned into a boolean.
* Parch could be important. Could be either one-hot encoded, or turned into a boolean.
* Ticket likely not important. Remove it.
* Fare could be important. Likely correlated with Pclass. Needs to be normalized.
* Cabin could be important. Needs to be one-hot encoded.
* Embarked likely not important, but keep it. Needs to be one-hot encoded.
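
The two transformations named in the list above can be sketched on a few toy rows. This is a minimal illustration, assuming min-max scaling as the normalization (the assumptions above do not say which one is meant); the `toy` frame and its values are made up for the example.

```python
import pandas as pd

# Toy rows shaped like the Titanic columns (values are illustrative only).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 54.0],
})

# One-hot encoding: one indicator column per category value.
encoded = pd.get_dummies(toy, columns=["Pclass", "Sex"])

# Min-max normalization (an assumed choice): rescale Age to the [0, 1] range.
age_min, age_max = toy["Age"].min(), toy["Age"].max()
encoded["Age"] = (toy["Age"] - age_min) / (age_max - age_min)

print(encoded.columns.tolist())
print(encoded["Age"].tolist())
```

Note that the feature engineering further down takes a different route for Age and Fare: it bins them and one-hot encodes the bins instead of normalizing.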


Notes for feature engineering:
* Maybe extract letter from Cabin variable

### Automated Exploratory Data Analysis

From: https://pypi.org/project/pandas-profiling/ 

The pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

* Type inference: detect the types of columns in a dataframe.
* Essentials: type, unique values, missing values
* Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
* Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
* Most frequent values
* Histogram
* Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
* Missing values matrix, count, heatmap and dendrogram of missing values
* Text analysis learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data.
* File and Image analysis extract file sizes, creation dates and dimensions and scan for truncated images or those containing EXIF information.
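
A few of the statistics listed above can also be computed with plain pandas, which helps when the profiling package is not installed. This is a rough sketch on a made-up `toy` frame (one Age value is deliberately missing), not a replacement for the full report.

```python
import pandas as pd

# Toy frame with Titanic-like columns; values are illustrative only.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Cabin": [None, "C85", None, "C123", None],
})

# Essentials: missing values per column.
missing_counts = toy.isna().sum()

# Quantile statistics: Q1, median, Q3 for the numeric columns.
quantiles = toy[["Age", "Fare"]].quantile([0.25, 0.5, 0.75])

# Most frequent value of a column.
most_frequent_age = toy["Age"].mode().iloc[0]

# Pearson correlation matrix (the pandas default method).
pearson = toy[["Age", "Fare"]].corr()

print(missing_counts)
print(quantiles)
print(most_frequent_age)
print(pearson)
```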

In [6]:
# profile = ProfileReport(df, title="Pandas Profiling Report - Titanic Dataset")
# profile.to_file("titanic data analysis.html")


## Feature Engineering

In [62]:
def split_df(df):
    # X will represent our features, y will represent our target variable
    X = df.drop(columns='Survived')
    y = df[['Survived']]
    return X, y

X, y = split_df(df)

X.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [71]:
def feature_engineer(X):

    # Drop unnecessary columns
    X = X.drop(columns=['Name', 'Ticket','PassengerId'])
    
    # Extract first letter of Cabin
    X['Cabin'] = X['Cabin'].astype(str).replace('nan','')
    X['Cabin'] = X['Cabin'].astype(str).str[0]

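    # Convert numerical values to binned categorical values.
    # pd.cut uses right-closed intervals by default, e.g. (0, 10], and each bin is
    # labelled with its lower edge; values outside all bins (such as 0) become NaN.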
    age_bins = [0,10,20,30,40,50,60,70,80,90,100]
    fare_bins = [0,5,10,15,20,30,40,50,75,100,200,1000]
    X['Age'] = pd.cut(X['Age'],bins=age_bins, labels=age_bins[:-1])
    X['Fare'] = pd.cut(X['Fare'],bins=fare_bins, labels=fare_bins[:-1])

    # Define which columns contain categorical values
    categorical = ['Age', 'Fare', 'Pclass', 'Sex','SibSp','Parch','Cabin', 'Embarked']

    for col in categorical:

        prefix = col + '_'
        dummies = pd.get_dummies(X[col], prefix = prefix, dummy_na = True)

        X = X.drop(columns = col)
        X = pd.concat([X, dummies], axis=1)

    return X


X = feature_engineer(X)

X

Unnamed: 0,Age__0.0,Age__10.0,Age__20.0,Age__30.0,Age__40.0,Age__50.0,Age__60.0,Age__70.0,Age__80.0,Age__90.0,...,Cabin__D,Cabin__E,Cabin__F,Cabin__G,Cabin__T,Cabin__nan,Embarked__C,Embarked__Q,Embarked__S,Embarked__nan
0,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
1,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
887,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
888,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0
889,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
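
One detail of the binning above is worth checking. By default `pd.cut` builds right-closed intervals, so a value equal to the lowest bin edge falls outside every bin and becomes NaN; with `dummy_na = True` it then lands in the `__nan` dummy column. The sketch below uses a shortened, made-up bin list to show the effect and one possible adjustment, `include_lowest=True`. It is an illustration, not a change to the pipeline above.

```python
import pandas as pd

# Shortened, illustrative version of the fare bins used above.
fares = pd.Series([0.0, 5.0, 7.25, 10.0])
bins = [0, 5, 10, 15]

# Default behaviour: intervals are (0, 5], (5, 10], (10, 15].
default_cut = pd.cut(fares, bins=bins, labels=bins[:-1])

# include_lowest=True also places the lowest edge (0) in the first bin.
inclusive_cut = pd.cut(fares, bins=bins, labels=bins[:-1], include_lowest=True)

print(default_cut.tolist())
print(inclusive_cut.tolist())
```

A value sitting exactly on an inner edge (5.0 or 10.0 here) belongs to the bin that ends at that edge, so it receives the label of the previous lower edge.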


## Modelling

In [57]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier

In [58]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=22)

In [91]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

clf = DecisionTreeClassifier(random_state=42)

params = {"criterion": ["gini", "entropy"],
          "splitter": ["best", "random"],
          "class_weight": ['balanced', None], 
          "max_depth": randint(2, 21),
          "min_samples_leaf": randint(1, 11),
          "max_features": uniform(0.0, 1.0)}

search = RandomizedSearchCV(clf, param_distributions=params, n_iter=1000, scoring='accuracy', cv=10, verbose=2)
search = search.fit(X_train, y_train)

Fitting 10 folds for each of 1000 candidates, totalling 10000 fits
[CV] END class_weight=None, criterion=gini, max_depth=6, max_features=0.6300351588709903, min_samples_leaf=7, splitter=random; total time=   0.0s

In [92]:
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.86      0.84       110
           1       0.76      0.68      0.72        69

    accuracy                           0.79       179
   macro avg       0.79      0.77      0.78       179
weighted avg       0.79      0.79      0.79       179
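
For reference, the precision, recall and f1-score columns in a report like this are all derived from confusion-matrix counts. The sketch below shows the arithmetic with a hypothetical helper `precision_recall_f1` and made-up counts; the numbers are not taken from the run above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for the "survived" class (illustrative only).
tp, fp, fn = 40, 10, 20
p, r, f = precision_recall_f1(tp, fp, fn)
print(round(p, 2), round(r, 2), round(f, 2))
```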



## Visualize Tree

In [93]:
from sklearn import tree
from matplotlib import pyplot as plt

In [94]:
# class_names must follow the ascending order of the class labels: 0 = died, 1 = survived
cn = ['Died','Survived']
fn = list(X.columns)

fig, axes = plt.subplots(nrows = 1,ncols = 1,figsize = (5,4), dpi=2000)
tree.plot_tree(search.best_estimator_,
               feature_names = fn, 
               class_names=cn,
               filled = True)

fig.savefig('tree.png')
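
When labelling a tree, the display names are matched to the fitted model's `classes_`, which are sorted in ascending order (0 before 1). The sketch below checks this on a tiny made-up dataset and prints a text version of the tree with `export_text`, which can be easier to read than a large image. The names `X_toy`, `y_toy`, `toy_tree` and `label_names` are illustrative only.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny illustrative dataset: one binary feature, labels 0 (died) and 1 (survived).
X_toy = [[0], [0], [1], [1]]
y_toy = [0, 0, 1, 1]

toy_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# classes_ is sorted ascending, so display names must be listed in that order.
label_names = {0: "Died", 1: "Survived"}
class_names = [label_names[c] for c in toy_tree.classes_]

# Plain-text rendering of the fitted tree.
tree_text = export_text(toy_tree, feature_names=["Sex__female"])
print(class_names)
print(tree_text)
```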

## Estimate on unknown data

In [95]:
# Load test data
unknown_data = pd.read_csv('data/test.csv')

unknown_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [96]:
X_unknown_data = feature_engineer(unknown_data)

# Ensure columns are the same

for col in X_train.columns:
    if col not in X_unknown_data.columns:
        X_unknown_data[col] = 0

for col in X_unknown_data.columns:
    if col not in X_train.columns:
        X_unknown_data = X_unknown_data.drop(columns= col)

X_unknown_data = X_unknown_data[X_train.columns]
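
The same alignment can be expressed in one call with `DataFrame.reindex`, which adds missing columns, drops extra ones and fixes the order in a single step. The sketch below uses two small made-up frames (`train_cols`, `new_data`) rather than the real data.

```python
import pandas as pd

# Illustrative stand-ins for the training features and the unseen data.
train_cols = pd.DataFrame({"Sex__male": [1], "Cabin__T": [0], "Embarked__S": [1]})
new_data = pd.DataFrame({"Embarked__S": [1, 0], "Sex__male": [0, 1], "Parch__9": [0, 1]})

# Keep exactly the training columns, in training order.
# Columns missing from new_data are created and filled with 0;
# columns unknown to the training data (Parch__9) are dropped.
aligned = new_data.reindex(columns=train_cols.columns, fill_value=0)

print(aligned)
```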

In [97]:
# Predict survivals based on model
predicted_survival = search.predict(X_unknown_data)

In [98]:
# Combine submission data

submission_data = pd.DataFrame(unknown_data['PassengerId'])
submission_data['Survived'] = predicted_survival
submission_data

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,0
2,894,0
3,895,0
4,896,0
...,...,...
413,1305,0
414,1306,1
415,1307,0
416,1308,0


In [99]:
# Create submission file

# to_csv returns None when given a path, so there is nothing to assign
submission_data.to_csv('data/output.csv', index=False)
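
Before uploading, a quick sanity check of the file can catch formatting slips. The sketch below assumes the two-column `PassengerId`/`Survived` layout built above; `check_submission` is a hypothetical helper written for this example and works on CSV text, so it can be tried without touching the real file.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    problems = []
    # The header must be exactly these two columns, in this order.
    if reader.fieldnames != ["PassengerId", "Survived"]:
        problems.append(f"unexpected header: {reader.fieldnames}")
    # One prediction per passenger in the test set.
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    # Predictions must be binary.
    if any(row.get("Survived") not in {"0", "1"} for row in rows):
        problems.append("Survived must be 0 or 1")
    return problems

sample = "PassengerId,Survived\n892,0\n893,1\n"
print(check_submission(sample, expected_rows=2))
```

For the real file, the text could be read with `open('data/output.csv').read()` and `expected_rows` set to `len(unknown_data)`.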