# InterpretML - Titanic Demo
<br>
InterpretML is an open-source Python package for training interpretable models and explaining blackbox systems.
<br>
Interpretability is essential for:

**Model debugging** - Why did my model make this mistake?<br>
**Detecting bias** - Does my model discriminate?<br>
**Human-AI cooperation** - How can I understand and trust the model's decisions?<br>
**Regulatory compliance** - Does my model satisfy legal requirements?<br>
**High-risk applications** - Healthcare, finance, judicial, ...<br>
<br>
Historically, the most intelligible models were not very accurate, and the most accurate models were not intelligible. Microsoft Research has developed an algorithm called the **Explainable Boosting Machine (EBM)** that aims to offer both high accuracy and intelligibility. EBM uses modern machine learning techniques such as bagging and boosting to breathe new life into traditional GAMs (Generalized Additive Models). This often makes them as accurate as random forests and gradient boosted trees, while keeping them intelligible and editable.

**EBM is a fast implementation of GA2M.**
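
For reference, a GAM and its pairwise extension GA2M are conventionally written as below. Here $g$ is the link function (the logit for binary classification), $f_j$ is the shape function learned for feature $x_j$, and $f_{ij}$ is a pairwise interaction term. The notation is the standard textbook one, not copied from the InterpretML source. The model fitted later in this notebook reports `interactions=0`, so it corresponds to the first form.

```latex
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j)
\qquad \text{(GAM)}

g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{i < j} f_{ij}(x_i, x_j)
\qquad \text{(GA}^2\text{M)}
```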
<br> <br>
https://github.com/Microsoft/interpret

<img src="https://kwmp.ca/wp-content/uploads/2018/04/titanic-the-musical-1024x538.jpg">

## Variables:
- PassengerId: an ID given to each traveler on the boat
- Survived: the target. 1 if the passenger survived, 0 otherwise
- Pclass: the passenger class. It has three possible values: 1, 2, 3 (first, second and third class)
- Name: the name of the passenger
- Sex: the sex of the passenger
- Age: the age of the passenger
- SibSp: number of siblings and spouses traveling with the passenger
- Parch: number of parents and children traveling with the passenger
- Ticket: the ticket number
- Fare: the ticket fare
- Cabin: the cabin number
- Embarked: the port of embarkation. It has three possible values: S (Southampton), C (Cherbourg), Q (Queenstown)
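
As a small illustration, the `Embarked` codes can be mapped to readable port names with pandas. The sample rows below are made up for the example, and the `EmbarkedPort` column name is hypothetical. Only the `Embarked` column name and the codes S, C and Q come from the dataset.

```python
import pandas as pd

# Hypothetical sample rows; only the column name and codes match the Titanic data.
sample = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Port names for the three embarkation codes.
port_names = {"S": "Southampton", "C": "Cherbourg", "Q": "Queenstown"}

# Map each code to its port name in a new (illustrative) column.
sample["EmbarkedPort"] = sample["Embarked"].map(port_names)
print(sample)
```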

https://www.kaggle.com/c/titanic
<br><br>
Dataset: 
https://www.kaggle.com/c/titanic/data

## 0. Settings

In [1]:
import sys
sys.version

'3.6.8 |Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]'

In [2]:
import pandas as pd

### Installation of interpretML package

In [3]:
#!pip install -U interpret

## 1. Data loading

In [4]:
url="https://raw.githubusercontent.com/retkowsky/titanic/master/train.csv"
df=pd.read_csv(url, index_col=None, na_values=['NA'])

In [5]:
# Drop every row that has at least one missing value
df = df.dropna()
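
Because `dropna()` discards every row with at least one missing value, it can be worth checking which columns drive the loss before dropping. A minimal sketch on a made-up frame (the column names mirror the dataset, the values do not):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not the real Titanic rows.
toy = pd.DataFrame({
    "Age":   [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
    "Fare":  [7.25, 71.28, 8.05, 51.86],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows kept by a blanket dropna() versus dropping only on a selected column.
rows_all = len(toy.dropna())
rows_age_only = len(toy.dropna(subset=["Age"]))
print(rows_all, rows_age_only)
```

In this notebook the blanket drop leaves 183 rows, and `Cabin` is removed a few cells later, so restricting the drop to the modelled columns may retain more data.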

In [6]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


In [7]:
df.shape

(183, 12)

In [8]:
df = df.drop(['Cabin','Parch'],axis=1)

In [9]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Ticket,Fare,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,PC 17599,71.2833,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,113803,53.1000,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,17463,51.8625,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,PP 9549,16.7000,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,113783,26.5500,S
...,...,...,...,...,...,...,...,...,...,...
871,872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,11751,52.5542,S
872,873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,695,5.0000,S
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,11767,83.1583,C
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,112053,30.0000,S


## 2. Statistics

In [10]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Fare
count,183.0,183.0,183.0,183.0,183.0,183.0
mean,455.36612,0.672131,1.191257,35.674426,0.464481,78.682469
std,247.052476,0.470725,0.515187,15.643866,0.644159,76.347843
min,2.0,0.0,1.0,0.92,0.0,0.0
25%,263.5,0.0,1.0,24.0,0.0,29.7
50%,457.0,1.0,1.0,36.0,0.0,57.0
75%,676.0,1.0,1.0,47.5,1.0,90.0
max,890.0,1.0,3.0,80.0,3.0,512.3292


In [11]:
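# Correlate numeric columns only; recent pandas versions raise an error on text columns
df.select_dtypes('number').corr()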

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Fare
PassengerId,1.0,0.148495,-0.089136,0.030933,-0.083488,0.02974
Survived,0.148495,1.0,-0.034542,-0.254085,0.106346,0.134241
Pclass,-0.089136,-0.034542,1.0,-0.306514,-0.103592,-0.315235
Age,0.030933,-0.254085,-0.306514,1.0,-0.156162,-0.092424
SibSp,-0.083488,0.106346,-0.103592,-0.156162,1.0,0.286433
Fare,0.02974,0.134241,-0.315235,-0.092424,0.286433,1.0


## 3. Partitioning

In [12]:
col_target=['Survived']
col_df=['Age','Pclass','Sex','Fare','SibSp']
X=df[col_df]
y=df[col_target]

In [13]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3, random_state=10)
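
With a small dataset and an imbalanced target, a stratified split keeps the class ratio similar in both partitions. The sketch below uses synthetic labels and the `stratify` argument of `train_test_split`. It is an optional variation, not what the cell above does, and the `*_demo` names are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 70% positive class (illustrative only).
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([1] * 70 + [0] * 30)

# stratify=y_demo asks for the same class proportions in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=10, stratify=y_demo
)

# Share of the positive class in each partition.
print(y_tr.mean(), y_te.mean())
```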

## 4. InterpretML

### 4.1 Descriptive Statistics

In [14]:
from interpret import show
from interpret.data import ClassHistogram

hist = ClassHistogram().explain_data(X_train, y_train, name = 'Train Data')
show(hist)

### 4.2 EBM (Explainable Boosting Machine) to train a survival-prediction model

In [15]:
seed = 1

# Only the EBM classifier is needed in this cell
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(random_state=seed)
ebm.fit(X_train, y_train)

ExplainableBoostingClassifier(binning_strategy='uniform',
               data_n_episodes=2000, early_stopping_run_length=50,
               early_stopping_tolerance=1e-05,
               feature_names=['Age', 'Pclass', 'Sex', 'Fare', 'SibSp'],
               feature_step_n_inner_bags=0,
               feature_types=['continuous', 'continuous', 'categorical', 'continuous', 'continuous'],
               holdout_size=0.15, holdout_split=0.15, interactions=0,
               learning_rate=0.01, max_tree_splits=2,
               min_cases_for_splits=2, n_estimators=16, n_jobs=-2,
               random_state=1, schema=None, scoring=None,
               training_step_episodes=1)

#### 4.2.1 How to understand EBM

In [16]:
ebm_global = ebm.explain_global(name='EBM')
show(ebm_global)

#### 4.2.2 How to understand EBM results for each prediction

In [17]:
ebm_local = ebm.explain_local(X_test[:10], y_test[:10], name='EBM')
show(ebm_local)
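
A local explanation of an additive model can be read as a sum: each feature contributes a score, the scores and an intercept are added, and with a logit link the logistic function turns the total into a probability. The sketch below reproduces that arithmetic in plain Python. The intercept and per-feature scores are invented for illustration and are not taken from the fitted `ebm`.

```python
import math

# Hypothetical per-feature contributions (in log-odds) for one passenger.
intercept = 0.5
contributions = {"Sex": 1.2, "Age": -0.3, "Pclass": 0.4, "Fare": 0.1, "SibSp": -0.1}

# Additive model: total log-odds is the intercept plus the sum of contributions.
log_odds = intercept + sum(contributions.values())

# The logistic (sigmoid) function maps log-odds to a probability.
probability = 1.0 / (1.0 + math.exp(-log_odds))

# Rank features by absolute contribution to see which ones drive the prediction.
ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(round(log_odds, 2), round(probability, 3))
print(ranked)
```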

#### 4.2.3 EBM ROC curve

In [18]:
from interpret.perf import ROC

ebm_perf = ROC(ebm.predict_proba).explain_perf(X_test, y_test, name='EBM')
show(ebm_perf)
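
To get the area under the ROC curve as a number rather than a plot, scikit-learn's `roc_curve` and `roc_auc_score` can be applied to predicted probabilities. A minimal sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores for illustration; in the notebook these would be
# y_test and the positive-class column of ebm.predict_proba(X_test).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False/true positive rates at each threshold, and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(fpr, tpr)
print(auc)
```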

### 4.3 Now let's try a decision tree and a logistic regression

In [19]:
from interpret.glassbox import LogisticRegression, ClassificationTree

# We have to transform categorical variables to use Logistic Regression and Decision Tree
X_enc = pd.get_dummies(X, prefix_sep='.')
feature_names = list(X_enc.columns)
# Reuse the split parameters from section 3 so all models are scored on the same test rows
X_train_enc, X_test_enc, y_train, y_test = train_test_split(X_enc, y, test_size=0.3, random_state=10)

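# The liblinear solver supports the L1 penalty; extra keyword arguments are assumed to pass through to scikit-learn
lr = LogisticRegression(random_state=seed, feature_names=feature_names, penalty='l1', solver='liblinear')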
lr.fit(X_train_enc, y_train)

tree = ClassificationTree()
tree.fit(X_train_enc, y_train)

<interpret.glassbox.decisiontree.ClassificationTree at 0x16dfc15f240>
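
To see what the encoding step produces, here is `pd.get_dummies` with `prefix_sep='.'` on a two-row toy frame (values invented). Numeric columns pass through unchanged and each category of `Sex` becomes its own indicator column.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values).
toy = pd.DataFrame({"Age": [38.0, 54.0], "Sex": ["female", "male"]})

# One-hot encode the categorical column; names become "<column>.<category>".
toy_enc = pd.get_dummies(toy, prefix_sep=".")
print(list(toy_enc.columns))
```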

### 4.4 Validation of these 2 new models

#### 4.4.1 ROC

In [20]:
lr_perf = ROC(lr.predict_proba).explain_perf(X_test_enc, y_test, name='Logistic Regression')
tree_perf = ROC(tree.predict_proba).explain_perf(X_test_enc, y_test, name='Classification Tree')

show(lr_perf)

In [21]:
show(tree_perf)

#### 4.4.2 Global explanations (logistic regression and decision tree)

In [22]:
lr_global = lr.explain_global(name='LR')
tree_global = tree.explain_global(name='Tree')

show(lr_global)

In [23]:
show(tree_global)

In [24]:
show(tree_perf)

#### 4.4.3 EBM

In [25]:
show(ebm_perf)

### 4.5 Dashboard

In [26]:
show([hist, lr_global, lr_perf, tree_global, tree_perf, ebm_global, ebm_perf], share_tables=True)