# SHAPxTitanic

## 01 Load, inspect and clean data

SHapley Additive exPlanations (SHAP for short) can be used to explain individual predictions made by ML models by calculating the contribution of each feature to each prediction. It is particularly useful in combination with models that are not normally known for their interpretability. The fact that each prediction is analysed individually is also intriguing: I can see how this could be useful when tailoring a response at an individual level.
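To make "the contribution of each feature to each prediction" concrete, here is a minimal sketch that computes exact Shapley values by brute force for a made-up three-feature scoring function. Everything in it (`toy_model`, its coefficients, the single `baseline` row) is a hypothetical illustration, not the Titanic model or the `shap` library's API. The library works against a background dataset and uses faster algorithms, but the additive idea is the same: the contributions sum to the gap between the prediction and the baseline prediction.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x.

    Features outside a coalition are set to their baseline value
    (a simplification: one reference row instead of a background dataset).
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in features]
        return predict(z)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # marginal contribution of feature i when added to coalition s
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(z):
    # Hypothetical survival score with an interaction term (made-up numbers)
    female, first_class, age = z
    return 0.2 + 0.5 * female + 0.2 * first_class - 0.004 * age + 0.1 * female * first_class


x = [1, 1, 30.0]         # the passenger to explain: female, first class, aged 30
baseline = [0, 0, 40.0]  # hypothetical reference passenger

phi = shapley_values(toy_model, x, baseline)
for name, contribution in zip(["Female", "FirstClass", "Age"], phi):
    print(f"{name}: {contribution:+.3f}")
print("prediction - baseline:", round(toy_model(x) - toy_model(baseline), 3))
```

The interaction term is split evenly between `Female` and `FirstClass`, which is the kind of fair attribution Shapley values are designed to give.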

I've been interested in learning more about SHAP for a while, so I thought I'd have a play around with it on the Titanic dataset.

**Acknowledgements/Helpful sources** 

After doing some Googling, I found that Manuel Amunategui from Viral ML had a video detailing how to apply SHAP to the Titanic dataset, and had shared his source code (https://www.viralml.com/video-content.html?v=ZkIxZ5xlMuI). He used a model I'd not heard of before called CatBoost, which I'm going to look into further and use here too! I've watched a few of his videos since discovering him, and highly recommend his content :)

In [7]:
# import packages
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split

import shap

# Load and inspect data

In [2]:
# Downloaded the titanic dataset from the Kaggle competition
titanic_df = pd.read_csv('train.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [6]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [5]:
titanic_df.shape

(891, 12)

In [12]:
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [9]:
print(titanic_df.apply(lambda col: col.unique()))

PassengerId    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
Survived                                                  [0, 1]
Pclass                                                 [3, 1, 2]
Name           [Braund, Mr. Owen Harris, Cumings, Mrs. John B...
Sex                                               [male, female]
Age            [22.0, 38.0, 26.0, 35.0, nan, 54.0, 2.0, 27.0,...
SibSp                                      [1, 0, 3, 4, 2, 5, 8]
Parch                                      [0, 1, 2, 5, 3, 4, 6]
Ticket         [A/5 21171, PC 17599, STON/O2. 3101282, 113803...
Fare           [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51....
Cabin          [nan, C85, C123, E46, G6, C103, D56, A6, C23 C...
Embarked                                          [S, C, Q, nan]
dtype: object


In [15]:
# 20% age missing, 77% cabin is missing, 0.2% embarked missing
(titanic_df.isnull().sum()/len(titanic_df))*100

PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

In [None]:
# PassengerId - id unique to passenger
# Survived - binary response, this is a classification problem
# Pclass - numeric but relates to class of passenger so need to convert to cat var. Higher class, higher survival I assume
# Name - char. Could be useful in identifying married women?
# Sex - char but convert to dummy
# Age - missing some values, continuous. min age a decimal? Kids likely to have higher survival.
# SibSp - number of siblings/spouse on board. 
# Parch - number of parents/children on board. Perhaps those with bigger families were less likely to survive? More kids to locate on the boat.
# Ticket - ticket number
# Fare - ticket price in £
# Cabin - cabin number, contains nans
# Embarked - where the passenger embarked, (C = Cherbourg; Q = Queenstown; S = Southampton), contains nans
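The note on `Name` above suggests it might help identify married women. A minimal sketch of one way to do that is to pull the honorific (Mr, Mrs, Miss, ...) out of the name string. The `extract_title` helper and its regular expression are assumptions for illustration; they rely on names following the "Surname, Title. Given names" pattern seen in the first few rows, and "Mrs" is only a rough proxy for marital status.

```python
import re

# Assumed name layout: "Surname, Title. Given names"
TITLE_PATTERN = re.compile(r",\s*([^.]+)\.")


def extract_title(name):
    """Return the honorific between the comma and the first full stop, or None."""
    match = TITLE_PATTERN.search(name)
    return match.group(1).strip() if match else None


# Names copied from the head() output above
sample_names = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
]
titles = [extract_title(name) for name in sample_names]
print(titles)
```

On the full data this could be applied with something like `titanic_df['Name'].apply(extract_title)`, storing the result in a new column (for example a hypothetical `Title` column).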

In [19]:
# create Female field with numerical values: 1 = female, 0 = male
titanic_df['Female'] = np.where(titanic_df['Sex'] == 'female', 1, 0)

# make Pclass categorical; replace() leaves already-converted labels untouched,
# so re-running this cell does not relabel every passenger as 'Third'
titanic_df['Pclass'] = titanic_df['Pclass'].replace({1: 'First', 2: 'Second', 3: 'Third'})

# replace with unknown or use most common?
# (fillna avoids np.NaN, which was removed in NumPy 2.0)
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('Unknown')

# fill unknown ages with mean - could use mean based on whether they are mr, mrs, miss etc.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())

titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Female
0,1,0,Third,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,0
1,2,1,First,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1
2,3,1,Third,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,1
3,4,1,First,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,1
4,5,0,Third,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,0
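The age-filling comment in the cell above mentions that the mean could be taken per title (Mr, Mrs, Miss and so on) rather than over everyone. A minimal sketch of that idea follows, using a tiny made-up frame rather than the real data. The `Title` and `AgeFilled` column names and the toy ages are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up), standing in for titanic_df with a hypothetical Title column
toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Miss"],
    "Age": [30.0, 40.0, np.nan, 4.0, 6.0, np.nan],
})

# Mean age within each title group, aligned row by row with the original frame
group_mean = toy_df.groupby("Title")["Age"].transform("mean")

# Fill missing ages with the group mean; fall back to the overall mean
# in case a whole group has no known ages
toy_df["AgeFilled"] = toy_df["Age"].fillna(group_mean).fillna(toy_df["Age"].mean())
print(toy_df)
```

Compared with a single global mean, this keeps imputed ages for children (often "Miss" or "Master") from being pulled towards the adult average, though whether it helps the model here would need checking.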
