# SHAP 101 - Explaining ML Models and Beyond

## Feature Attributions
* SHAP (SHapley Additive exPlanations) - https://shap.readthedocs.io/en/latest/
* Understand individual predictions - https://www.kaggle.com/code/dansbecker/shap-values/tutorial
* Aggregate SHAP values for even more detailed model insights - https://www.kaggle.com/code/dansbecker/advanced-uses-of-shap-values/tutorial
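
For intuition before following the links, here is a minimal sketch of what a Shapley-style attribution computes. It enumerates every feature subset by brute force, so it only works for a handful of features, and it is not the `shap` library's API. The `toy_model`, the instance `x`, and the single all-zero `baseline` are assumptions made up for illustration (libraries such as `shap` typically average over a background dataset instead of one baseline row).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy model: a linear term plus an interaction term
def toy_model(features):
    a, b, c = features
    return 2.0 * a + b * c


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions add up to `toy_model(x) - toy_model(baseline)`, which is the "additive" part of the SHAP name; the interaction term `b * c` is split evenly between `b` and `c`.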

Convert SHAP Score to percentage: https://medium.com/towards-data-science/black-box-models-are-actually-more-explainable-than-a-logistic-regression-f263c22795d

## Partial Dependence Plot
* Partial Dependence Plot Theory - https://christophm.github.io/interpretable-ml-book/pdp.html
* Partial Dependence Plots - https://scikit-learn.org/stable/modules/partial_dependence.html
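
As a rough illustration of the idea behind these links, the sketch below computes a one-dimensional partial dependence curve by hand: force one feature to each grid value for every row, then average the predictions. The `toy_model`, `rows`, and `grid` are made-up assumptions; in practice scikit-learn's `sklearn.inspection` module would do this for a fitted estimator.

```python
def partial_dependence_1d(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)               # copy so the input rows stay untouched
            modified[feature_index] = value    # overwrite the feature of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


# Hypothetical toy model and data, for illustration only
def toy_model(row):
    return 3.0 * row[0] + row[1]


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid = [0.0, 1.0, 2.0]
pd_curve = partial_dependence_1d(toy_model, rows, feature_index=0, grid=grid)
print(pd_curve)
```

For this linear toy model the curve rises by 3 per unit of feature 0, offset by the mean of feature 1.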

## Additional References
Fairlearn - https://fairlearn.org

# Kaggle Titanic Competition
https://www.kaggle.com/competitions/titanic

# Data Description
https://www.kaggle.com/competitions/titanic/data?select=train.csv

| Variable | Definition | Key |
| :--- | :--- | :--- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age | in years |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | (Y.W.: Ticket price paid) |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

#### Variable Notes

**pclass:** A proxy for socio-economic status (SES)
* 1st = Upper
* 2nd = Middle
* 3rd = Lower

**age:** Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

**sibsp:** The dataset defines family relations in this way...
* Sibling = brother, sister, stepbrother, stepsister
* Spouse = husband, wife (mistresses and fiancés were ignored)

**parch:** The dataset defines family relations in this way...
* Parent = mother, father
* Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.
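
The age convention in the notes above can be turned into a small check. This is only a sketch: the function name and the returned labels are hypothetical, and treating `None`/NaN as "missing" is an assumption about how gaps are represented.

```python
import math


def describe_age(age):
    """Classify an age value using the conventions from the variable notes."""
    if age is None or math.isnan(age):
        return "missing"
    if age < 1:
        return "infant (fractional age)"
    if age % 1 == 0.5:
        return "estimated"
    return "recorded"


for sample in (0.42, 28.5, 22.0, float("nan")):
    print(sample, "->", describe_age(sample))
```

With a pandas column, the same function could be applied via `Series.map`.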

In [12]:
import pandas as pd

# Project-local helpers (not part of pandas or Kaggle's tooling)
from utils.util import KaggleData, current_dir_subpath

titanic_train_path = current_dir_subpath("data/train.csv")
titanic_test_path = current_dir_subpath("data/test.csv")

titanic = KaggleData(
    train_path=titanic_train_path,
    test_path=titanic_test_path,
)

# Training features, test features, and training labels
train_X_df, test_X_df, train_y = titanic.load()
# Train and test features concatenated into one DataFrame,
# with the Survived label column excluded
data_X_df = titanic.load_all(label_col="Survived")
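
`KaggleData` lives in the project's own `utils.util` module and is not shown here. Going only by the comment above, `load_all` presumably does something like the following; `concat_features` is a hypothetical stand-in, not the actual implementation, and the tiny frames are made-up rows rather than real Titanic data.

```python
import pandas as pd


def concat_features(train_df, test_df, label_col):
    """Drop the label column where present and stack train on top of test."""
    train_X = train_df.drop(columns=[label_col], errors="ignore")
    test_X = test_df.drop(columns=[label_col], errors="ignore")
    return pd.concat([train_X, test_X], ignore_index=True)


# Tiny stand-in frames (made-up rows, not the real Titanic data)
train_df = pd.DataFrame({"PassengerId": [1, 2], "Pclass": [3, 1], "Survived": [0, 1]})
test_df = pd.DataFrame({"PassengerId": [3], "Pclass": [2]})

all_X = concat_features(train_df, test_df, label_col="Survived")
print(all_X)
```

Whether the real helper resets the index the way `ignore_index=True` does is an assumption.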

In [6]:
data_X_df.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 1309.0 | 1309.0 | 1046.0 | 1309.0 | 1309.0 | 1308.0 |
| mean | 655.0 | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 |
| std | 378.020061 | 0.837836 | 14.413493 | 1.041658 | 0.86556 | 51.758668 |
| min | 1.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 328.0 | 2.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 655.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 982.0 | 3.0 | 39.0 | 1.0 | 0.0 | 31.275 |
| max | 1309.0 | 3.0 | 80.0 | 8.0 | 9.0 | 512.3292 |
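
The `count` row reports non-null values per column, so it doubles as a quick missing-value check. The sketch below only redoes that arithmetic with the counts copied from the output above; on the actual DataFrame, `data_X_df.isna().sum()` would give the same numbers directly and would also cover the non-numeric columns that `describe()` leaves out.

```python
# Non-null counts copied from the `count` row of data_X_df.describe()
total_rows = 1309
non_null = {
    "PassengerId": 1309,
    "Pclass": 1309,
    "Age": 1046,
    "SibSp": 1309,
    "Parch": 1309,
    "Fare": 1308,
}

# Columns with at least one missing value, and their share of all rows in percent
missing = {col: total_rows - n for col, n in non_null.items() if n < total_rows}
missing_share = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}

print(missing)
print(missing_share)
```

Roughly a fifth of the `Age` values are missing, while `Fare` is missing in a single row.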


In [7]:
train_X_df.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [8]:
test_X_df.describe()

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| count | 418.0 | 418.0 | 332.0 | 418.0 | 418.0 | 417.0 |
| mean | 1100.5 | 2.26555 | 30.27259 | 0.447368 | 0.392344 | 35.627188 |
| std | 120.810458 | 0.841838 | 14.181209 | 0.89676 | 0.981429 | 55.907576 |
| min | 892.0 | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 |
| 25% | 996.25 | 1.0 | 21.0 | 0.0 | 0.0 | 7.8958 |
| 50% | 1100.5 | 3.0 | 27.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1204.75 | 3.0 | 39.0 | 1.0 | 0.0 | 31.5 |
| max | 1309.0 | 3.0 | 76.0 | 8.0 | 9.0 | 512.3292 |


In [9]:
data_X_df.dtypes

PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [10]:
data_X_df.head(3)

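| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |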


In [5]:
# correlation
data_X_df.corr(numeric_only=True)

| | PassengerId | Pclass | Age | SibSp | Parch | Fare |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| PassengerId | 1.0 | -0.038354 | 0.028814 | -0.055224 | 0.008942 | 0.031428 |
| Pclass | -0.038354 | 1.0 | -0.408106 | 0.060832 | 0.018322 | -0.558629 |
| Age | 0.028814 | -0.408106 | 1.0 | -0.243699 | -0.150917 | 0.17874 |
| SibSp | -0.055224 | 0.060832 | -0.243699 | 1.0 | 0.373587 | 0.160238 |
| Parch | 0.008942 | 0.018322 | -0.150917 | 0.373587 | 1.0 | 0.221539 |
| Fare | 0.031428 | -0.558629 | 0.17874 | 0.160238 | 0.221539 | 1.0 |


In [16]:
train_df = pd.concat([train_X_df, train_y], axis=1)
train_df.corr(numeric_only=True)

| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Survived |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| PassengerId | 1.0 | -0.035144 | 0.036847 | -0.057527 | -0.001652 | 0.012658 | -0.005007 |
| Pclass | -0.035144 | 1.0 | -0.369226 | 0.083081 | 0.018443 | -0.5495 | -0.338481 |
| Age | 0.036847 | -0.369226 | 1.0 | -0.308247 | -0.189119 | 0.096067 | -0.077221 |
| SibSp | -0.057527 | 0.083081 | -0.308247 | 1.0 | 0.414838 | 0.159651 | -0.035322 |
| Parch | -0.001652 | 0.018443 | -0.189119 | 0.414838 | 1.0 | 0.216225 | 0.081629 |
| Fare | 0.012658 | -0.5495 | 0.096067 | 0.159651 | 0.216225 | 1.0 | 0.257307 |
| Survived | -0.005007 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 | 1.0 |
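
To read the `Survived` column of this matrix at a glance, the features can be ranked by the absolute value of their correlation with the label. The sketch below uses the values copied from the output above; on the actual DataFrame, something like `train_df.corr(numeric_only=True)["Survived"].drop("Survived").abs().sort_values(ascending=False)` should give the same ordering.

```python
# Pearson correlations with Survived, copied from the last column of the table above
corr_with_survived = {
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.077221,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
}

# Strongest linear association first, regardless of sign
ranked = sorted(corr_with_survived.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name:12s} {r:+.3f}")
```

`Pclass` and `Fare` stand out, but this only captures linear relationships among numeric columns; `Sex` and `Embarked` are stored as `object` and are left out by `numeric_only=True`.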
