# Classification on the Titanic dataset with Shapash

We build a survival classification model and explain a few predictions with [shapash](https://github.com/MAIF/shapash).


## Variables
Variables are both categorical and numerical:


* Survived is the target variable we are trying to predict (0 or 1):
    + 1 = Survived 
    + 0 = Not Survived

* Pclass (Passenger Class):
    + 1 = Upper Class
    + 2 = Middle Class
    + 3 = Lower Class

* Sex and Age are self-explanatory

* SibSp is the number of siblings and spouses the passenger has aboard

* Parch is the number of parents and children the passenger has aboard

* Fare is the passenger fare

* Embarked is the port of embarkation, a categorical feature with 3 unique values (C, Q or S):
    + C = Cherbourg
    + Q = Queenstown
    + S = Southampton
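
The coded values listed above can be turned back into readable labels with plain dictionaries. The following is a minimal sketch on a hand-made frame; the names `PCLASS_LABELS`, `EMBARKED_LABELS` and `toy` are illustrative and are not used elsewhere in this notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the variable descriptions above.
PCLASS_LABELS = {1: "Upper Class", 2: "Middle Class", 3: "Lower Class"}
EMBARKED_LABELS = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Hypothetical three-passenger frame, only used to illustrate the decoding.
toy = pd.DataFrame({"pclass": [1, 3, 2], "embarked": ["S", "C", "Q"]})

# Series.map looks each value up in the dictionary.
decoded = toy.assign(
    pclass=toy["pclass"].map(PCLASS_LABELS),
    embarked=toy["embarked"].map(EMBARKED_LABELS),
)
print(decoded)
```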

In [24]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split


X_orig, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
X_orig['pclass'] = X_orig['pclass'].astype('int')
X_orig.head()
#X_orig.tail(50)
#X_orig.shape

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [25]:
X_orig.columns
X_orig.dtypes

pclass          int64
name           object
sex          category
age           float64
sibsp         float64
parch         float64
ticket         object
fare          float64
cabin          object
embarked     category
boat           object
body          float64
home.dest      object
dtype: object

In [26]:
#Convert y to int instead of category
y = y.astype('int')
y

0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: int64

## Features
We will use 7 variables (2 categorical features, 5 numerical features) to predict if a passenger will survive.


In [27]:
categorical_columns = ['sex', 'embarked']
numerical_columns = ['pclass', 'age', 'sibsp', 'parch', 'fare']

X = X_orig[categorical_columns + numerical_columns].copy()
X.head()
#X.info()

Unnamed: 0,sex,embarked,pclass,age,sibsp,parch,fare
0,female,S,1,29.0,0.0,0.0,211.3375
1,male,S,1,0.9167,1.0,2.0,151.55
2,female,S,1,2.0,1.0,2.0,151.55
3,male,S,1,30.0,1.0,2.0,151.55
4,female,S,1,25.0,1.0,2.0,151.55


## Features transformation
We fill missing values (median for numerical features, most frequent value for categorical features) and convert categorical features to integers with ordinal encoding.

In [28]:
#Replace na by median for numerical features
for c in numerical_columns:
    X[c] = X[c].fillna(X[c].median())
#Replace na by mode for categorical features
for c in categorical_columns:
    X[c] = X[c].astype('category')
    X[c] = X[c].fillna(X[c].mode().iloc[0])
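
To make the effect of the two imputation rules concrete, here is a small sketch on a hypothetical four-row frame (`toy` and its values are made up for illustration, they are not Titanic rows).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 30.0, 40.0],
    "embarked": pd.Series(["S", "C", None, "S"], dtype="category"),
})

# Numerical column: the median of the observed values (22, 30, 40) is 30.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Categorical column: mode() returns a Series because ties are possible,
# so we keep its first entry ("S" here, which appears twice).
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode().iloc[0])

print(toy)
```

Filling with the mode only works directly on a `category` column when the mode is already one of its categories, which is always the case since it is computed from the same column.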

In [29]:
from category_encoders import OrdinalEncoder
# from sklearn.preprocessing import OrdinalEncoder  # Not convenient: it returns a np.array instead of a DataFrame
categ_encoding = OrdinalEncoder(cols=categorical_columns, handle_unknown='ignore', return_df=True).fit(X)
#categ_encoding = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).fit(X)
X = categ_encoding.transform(X)
#X[0:2]
X.head()

Unnamed: 0,sex,embarked,pclass,age,sibsp,parch,fare
0,1,1,1,29.0,0.0,0.0,211.3375
1,2,1,1,0.9167,1.0,2.0,151.55
2,1,1,1,2.0,1.0,2.0,151.55
3,2,1,1,30.0,1.0,2.0,151.55
4,1,1,1,25.0,1.0,2.0,151.55
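
The output above suggests that the encoder numbers categories from 1 in order of first appearance (`female` becomes 1, `male` becomes 2). The sketch below reproduces that behaviour with pandas only, as an illustration of what ordinal encoding does; it does not call `category_encoders`, and the variable names are hypothetical.

```python
import pandas as pd

# The 'sex' values of the first five rows shown earlier.
sex = pd.Series(["female", "male", "female", "male", "female"])

# factorize numbers categories from 0 in order of first appearance;
# adding 1 gives codes that start at 1.
codes, uniques = pd.factorize(sex)
encoded = pd.Series(codes + 1, index=sex.index)

# Explicit label-to-code mapping, handy for decoding later.
mapping = {label: i + 1 for i, label in enumerate(uniques)}

print(encoded.tolist())
print(mapping)
```

Integer codes impose an arbitrary order on the categories. Tree-based models such as a Random Forest usually cope with this, but it may matter for other model families.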


In [30]:
features = X.columns
features

Index(['sex', 'embarked', 'pclass', 'age', 'sibsp', 'parch', 'fare'], dtype='object')

## Random Forest classifier
We build and train a Random Forest classifier on all the data. No test set is held out, so the accuracy printed below is measured on the training data.

In [31]:
#Train a RandomForestClassifier with X and y
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)
#rf.fit(X.values, y.values) #Updated to clear scikitlearn warning

RandomForestClassifier(random_state=0)

In [32]:
print("RF model accuracy: %0.3f" % rf.score(X, y))
#print("RF model accuracy: %0.3f" % rf.score(X.values, y.values))
#print("RF test accuracy: %0.3f" % rf.score(X_test, y_test))

RF model accuracy: 0.966
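
Since the score above is computed on the training data, it is likely to be optimistic. A held-out split gives a fairer estimate. The sketch below shows the pattern on synthetic data generated with `make_classification` as a stand-in for the encoded features; `X_toy`, `y_toy`, `rf_toy` and the split sizes are assumptions made for illustration, not results for the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 7 features and a binary target, like the encoded data.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=0
)

# Hold out 25% of the rows; stratify keeps the class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

rf_toy = RandomForestClassifier(n_estimators=100, random_state=0)
rf_toy.fit(X_tr, y_tr)

train_acc = rf_toy.score(X_tr, y_tr)
test_acc = rf_toy.score(X_te, y_te)
print("train accuracy: %0.3f" % train_acc)
print("test accuracy:  %0.3f" % test_acc)
```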


## Explainability with Shapash
Shapash supports several explainability methods.
We use SHAP, which is the default.
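
SHAP attributes a prediction to individual features using Shapley values. As an intuition aid, the sketch below computes exact Shapley values by brute force for a tiny hand-written function, relative to a single baseline point. The names `shapley_values` and `toy_model` are hypothetical. The SHAP library works differently in practice (for tree models it uses a dedicated, much faster algorithm), so this is not what Shapash runs internally.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to one baseline point."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their value from x, the others from the baseline.
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return predict(z)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


# Hypothetical scoring function with an interaction term (not the notebook's model).
def toy_model(z):
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The contributions sum to the difference between the prediction at `x` and the prediction at the baseline. The interaction term `z[1] * z[2]` is shared equally between the two features involved.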

In [33]:
#Use of feature & label dict for a better display
feature_dict = {
    'pclass': 'Ticket class',
    'sex': 'Sex',
    'age': 'Age',
    'sibsp': 'Relatives such as brother or wife',
    'parch': 'Relatives like children or parents',
    'fare': 'Passenger fare',
    'embarked': 'Port of embarkation',
}
label_dict = {0: "Not Survived", 1: "Survived"}

In [34]:
!pip show shapash

Name: shapash
Version: 2.0.1
Summary: Shapash is a Python library which aims to make machine learning interpretable and understandable by everyone.
Home-page: https://github.com/MAIF/shapash
Author: Yann Golhen, Sebastien Bidault, Yann Lagre, Maxime Gendre
Author-email: yann.golhen@maif.fr
License: Apache Software License 2.0
Location: /home/tlevy/.cache/pypoetry/virtualenvs/explainability-ZpE4ZdDW-py3.8/lib/python3.8/site-packages
Requires: dash, dash-bootstrap-components, dash-core-components, dash-daq, dash-html-components, dash-renderer, dash-table, matplotlib, nbformat, numba, numpy, pandas, plotly, scikit-learn, shap
Required-by: 


In [43]:
from shapash.explainer.smart_explainer import SmartExplainer

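#Shapash 1.x API: the model is passed later, to compile()
xpl = SmartExplainer(label_dict=label_dict, features_dict=feature_dict)
#Shapash 2.x expects the model in the constructor instead:
#xpl = SmartExplainer(rf, backend='shap', label_dict=label_dict, features_dict=feature_dict)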


### Global explainability
We compute global explainability on a sample of the data.

In [44]:
#Sample of data for which we compute explainability
n = 100 #Sample size as reference
X_s = X.sample(n, random_state=0)
y_pred = pd.DataFrame(rf.predict(X_s), columns=['pred'], index=X_s.index) #Build y_pred as pandas dataframe (not mandatory)

X_s.sample(10).head()

Unnamed: 0,sex,embarked,pclass,age,sibsp,parch,fare
459,2,1,2,42.0,1.0,0.0,27.0
1291,2,1,3,28.0,0.0,0.0,8.7125
916,1,2,3,4.0,0.0,1.0,13.4167
942,2,2,3,28.0,0.0,0.0,7.225
159,1,2,1,16.0,0.0,1.0,57.9792


In [45]:
#Compile without y_pred
xpl.compile(x=X_s,
            model=rf,
            preprocessing=categ_encoding,
            )

Backend: Shap TreeExplainer


In [46]:
#Display global feature importances
xpl.plot.features_importance()

In [17]:
#Various feature contributions
xpl.plot.contribution_plot(col="pclass")

In [18]:
xpl.plot.contribution_plot("age")

In [17]:
xpl.plot.contribution_plot("sex")

In [18]:
#Local explanation from a random element of the sample
import random
random_idx = random.choice(list(X_s.index))  #Named random_idx to avoid shadowing the built-in id()
xpl.plot.local_plot(index=random_idx)

# Jack and Rose

Now come Jack and Rose!
We will apply the model to them and explain its predictions.


![Jack and Rose](https://img.20mn.fr/OTHEAkuxRsabSyyULbWmkg/640x410_leonardo-dicaprio-kate-winslet-film-titanic.jpg)

In [19]:
features = ['sex', 'embarked', 'pclass','age', 'sibsp', 'parch', 'fare']

# Jack & Rose data
# Rows are plain Python lists so that pandas infers one dtype per column;
# wrapping them in np.array would turn every value, numbers included, into a string.
personal_info = [
    ['female', 'S', 1, 17, 1, 1, 151.5],
    ['male', 'S', 3, 20, 0, 0, 9.5],
    ['male', 'S', 1, 20, 0, 0, 211.3],
    ['male', 'S', 3, 5, 0, 1, 9.5],
]

df_test = pd.DataFrame(personal_info, index=['Rose','Jack','Jack_Rich','Jack_Boy'], columns=features)
#df_test = pd.DataFrame(personal_info, index=['Rose'], columns=features)
X_test = df_test.copy()
df_test.head(2)

Unnamed: 0,sex,embarked,pclass,age,sibsp,parch,fare
Rose,female,S,1,17,1,1,151.5
Jack,male,S,3,20,0,0,9.5
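
When a small mixed-type table is written by hand, the container matters: a NumPy array holds a single dtype, so mixing strings and numbers coerces everything to strings, whereas a list of rows lets pandas infer a dtype per column. The sketch below illustrates the difference on two of the rows above (`from_array` and `from_rows` are illustrative names).

```python
import numpy as np
import pandas as pd

columns = ["sex", "pclass", "fare"]
rows = [["female", 1, 151.5], ["male", 3, 9.5]]

# A NumPy array has one dtype, so the numbers are converted to strings.
from_array = pd.DataFrame(np.array(rows), columns=columns)

# A list of rows keeps integers and floats as numbers.
from_rows = pd.DataFrame(rows, columns=columns)

print(from_array["pclass"].tolist())  # strings
print(from_rows.dtypes)               # per-column dtypes
```

Both frames print the same way, which makes the string version easy to miss.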


In [20]:
#Encode categ features
X_test = categ_encoding.transform(df_test.copy())
X_test.head()

Unnamed: 0,sex,embarked,pclass,age,sibsp,parch,fare
Rose,1,1,1,17,1,1,151.5
Jack,2,1,3,20,0,0,9.5
Jack_Rich,2,1,1,20,0,0,211.3
Jack_Boy,2,1,3,5,0,1,9.5


In [21]:
y_pred = pd.DataFrame(rf.predict(X_test), columns=['pred'], index=X_test.index) #Build y_pred as pandas dataframe (not mandatory)
y_pred

Unnamed: 0,pred
Rose,1
Jack,0
Jack_Rich,0
Jack_Boy,1


In [22]:
#Build y_proba
y_proba = pd.DataFrame(rf.predict_proba(X_test), index=X_test.index)
y_proba

Unnamed: 0,0,1
Rose,0.25,0.75
Jack,0.99,0.01
Jack_Rich,0.88,0.12
Jack_Boy,0.14,0.86
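
The probability table can be turned into readable labels by picking, for each passenger, the class with the highest probability. The sketch below rebuilds the table from the values printed above so that it runs on its own; `summary` is an illustrative name.

```python
import pandas as pd

label_dict = {0: "Not Survived", 1: "Survived"}

# Probabilities copied from the y_proba output above.
y_proba = pd.DataFrame(
    {0: [0.25, 0.99, 0.88, 0.14], 1: [0.75, 0.01, 0.12, 0.86]},
    index=["Rose", "Jack", "Jack_Rich", "Jack_Boy"],
)

# idxmax(axis=1) returns, per row, the column (class) with the highest probability.
summary = pd.DataFrame({
    "pred": y_proba.idxmax(axis=1).map(label_dict),
    "proba": y_proba.max(axis=1),
})
print(summary)
```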


## Will Rose survive (and why)?

In [23]:
from shapash.explainer.smart_explainer import SmartExplainer
xpl = SmartExplainer(label_dict=label_dict, features_dict=feature_dict)
xpl.compile(x=X_test,
            model=rf,
            preprocessing=categ_encoding,
            )

Backend: Shap TreeExplainer


In [24]:
idx = 'Rose'
xpl.plot.local_plot(index=idx)

## What about Jack?

In [25]:
idx = 'Jack'
xpl.plot.local_plot(index=idx)

## What if Jack were richer?
Jack now travels in pclass 1 with a high-fare ticket.

In [26]:
idx = 'Jack_Rich'
xpl.plot.local_plot(index=idx)

## What if Jack were a little boy?
Jack is now a little boy (5 years old) travelling with his mother (parch=1) in class 3.

In [27]:
idx = 'Jack_Boy'
xpl.plot.local_plot(index=idx)

In [28]:
#Test the web app
app = xpl.run_app()

Dash is running on http://0.0.0.0:8050/

INFO:root:Your Shapash application run on http://LAPTOP-6PNC4SFU:8050/
INFO:shapash.webapp.smart_app:Dash is running on http://0.0.0.0:8050/
 * Serving Flask app 'shapash.webapp.smart_app' (lazy loading)
INFO:root:Use the method .kill() to down your app.
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off

In [29]:
#Stop the webapp
app.kill()

INFO:werkzeug: * Running on http://172.26.37.159:8050/ (Press CTRL+C to quit)


In [30]:
#Summary of all the local predictions
xpl.add(y_pred=y_pred)
summary_df = xpl.to_pandas(max_contrib=3,proba=True)
summary_df

Unnamed: 0,pred,proba,feature_1,value_1,contribution_1,feature_2,value_2,contribution_2,feature_3,value_3,contribution_3
Rose,Survived,0.75,Sex,female,0.253051,Ticket class,1,0.13812,Relatives like children or parents,1,0.043218
Jack,Not Survived,0.99,Sex,male,0.113793,Ticket class,3,0.074564,Passenger fare,9.5,0.071285
Jack_Rich,Not Survived,0.88,Sex,male,0.259451,Ticket class,1,-0.04172,Port of embarkation,S,0.015824
Jack_Boy,Survived,0.86,Age,5,0.343691,Relatives like children or parents,1,0.116543,Relatives such as brother or wife,0,0.062783
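
With `max_contrib=3`, each row of the summary keeps only three features. The Jack_Rich row, where a negative contribution sits in second position, suggests that features are ranked by the absolute value of their contribution. The sketch below illustrates that ranking rule under this assumption; the contribution values are hypothetical and are not taken from the model.

```python
import pandas as pd

# Hypothetical contributions for one passenger (illustrative numbers only).
contributions = pd.Series(
    {"Sex": 0.26, "Ticket class": -0.04, "Age": 0.01, "Passenger fare": -0.005}
)

# Rank by absolute value, but keep the sign of the selected contributions.
order = contributions.abs().sort_values(ascending=False).index
top3 = contributions.loc[order[:3]]
print(top3)
```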
