# Explainable AI: Explainable Boosting Machine


In this example we use the white-box model "Explainable Boosting Machine" (EBM) from the package [InterpretML](https://github.com/interpretml/interpret) by Microsoft. The package supports a range of explainable AI tools, from white-box models to explainers for black-box models such as Shapley values, LIME, partial dependence plots and others.

EBM is based on Generalized Additive Models with Pairwise Interactions (GA2M) by Lou et al. ([Accurate Intelligible Models with Pairwise Interactions](https://www.cs.cornell.edu/~yinlou/papers/lou-kdd13.pdf)).
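
For orientation, the general form of a GA2M can be written as below. The symbols ($g$, $\beta_0$, $f_j$, $f_{ij}$, $\mathcal{S}$) follow common GAM notation and are our own choice, not taken from this notebook:

$$
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{(i,j) \in \mathcal{S}} f_{ij}(x_i, x_j)
$$

Here $g$ is the link function (the logit for binary classification), each $f_j$ is a one-dimensional shape function of a single feature, and $\mathcal{S}$ is a small set of selected feature pairs. Because every term depends on at most two features, each one can be plotted and inspected on its own.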


We will use the [Adult](https://archive.ics.uci.edu/ml/datasets/adult) dataset, which poses a binary classification task: does a person earn more than 50k USD per year or not? The data are taken from a 1994 census and were first discussed in the paper [Ron Kohavi, "Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996](https://www.academia.edu/download/40088603/Scaling_Up_the_Accuracy_of_Naive-Bayes_C20151116-5477-1fw84ob.pdf).

The data contain both categorical and numerical features.
We can access the data directly from the archive (or use a local copy).
Since we already discussed this dataset when we looked at classification, we skip the exploratory data analysis here.

In [9]:
import pandas as pd

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt
import seaborn as sns

# white-box model EBM
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show

# general parameters
n_splits = 3
random_state = 1337

## Data Access

Read in data directly from the archive.

In [4]:
df = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    header=None,
    skipinitialspace=True,  # fields are separated by ", "; drop the leading space
)
df.columns = [
    "Age", "WorkClass", "fnlwgt", "Education", "EducationNum",
    "MaritalStatus", "Occupation", "Relationship", "Race", "Gender",
    "CapitalGain", "CapitalLoss", "HoursPerWeek", "NativeCountry", "Income"
]
train_cols = df.columns[0:-1]
label = df.columns[-1]
X = df[train_cols]
y = df[label]
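
If the archive is not reachable, a local copy can be used instead. The following is a minimal sketch of such a fallback; the helper `resolve_data_source` and the path `data/adult.data` are hypothetical names introduced for illustration only.

```python
from pathlib import Path

# Remote location used in this notebook.
ADULT_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"


def resolve_data_source(local_path, url=ADULT_URL):
    """Return the local file path if the file exists, otherwise the remote URL."""
    path = Path(local_path)
    return str(path) if path.is_file() else url


# "data/adult.data" is a hypothetical location for a local copy.
source = resolve_data_source("data/adult.data")
print(source)
```

The returned string can be passed to `pd.read_csv` in place of the hard-coded URL, since `read_csv` accepts both file paths and URLs.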


In [5]:
df.head(5)

| | Age | WorkClass | fnlwgt | Education | EducationNum | MaritalStatus | Occupation | Relationship | Race | Gender | CapitalGain | CapitalLoss | HoursPerWeek | NativeCountry | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


First, we split the data into a training set and a test set, holding out 25% of the rows for testing.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=random_state)
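
As a quick sanity check, the expected sizes of the two sets can be worked out by hand. This sketch assumes that `adult.data` contains 32561 rows and that scikit-learn rounds the fractional test share up and assigns the remaining rows to the training set; both are assumptions that should be verified against the actual data.

```python
import math

n_samples = 32561  # assumed number of rows in adult.data
test_size = 0.25

# Assumption: the test share is rounded up, the remainder is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 24420 8141
```

These numbers can be compared against `len(X_train)` and `len(X_test)`.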

Now we create an instance of the model and train it.
Note that the explainable boosting classifier can work directly on categorical variables given as text, i.e. we do not need to transform them into a numerical representation.

In [8]:
model = ExplainableBoostingClassifier(random_state=random_state)
model.fit(X_train, y_train)

## Global Interpretation

First, we look at the global behaviour of the model.
In particular, the "summary" page shows the overall importance of each feature.
We can then dive into individual features and inspect how the model's score changes with the value of each one.

In [10]:
global_explanation = model.explain_global()
show(global_explanation)
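
To make the idea of an overall importance more concrete, here is a minimal sketch of one way such a ranking can be computed. It assumes that importance is defined as the mean absolute contribution of a feature over the training rows; we believe this matches InterpretML's default, but treat it as an assumption. The contribution values are made up for illustration and do not come from the fitted model.

```python
# Hypothetical per-row contributions (in log-odds) of three features
# for four training rows; the numbers are invented for illustration.
contributions = {
    "Age":          [0.40, -0.30, 0.10, -0.60],
    "CapitalGain":  [1.20, -0.10, -0.10, -0.10],
    "HoursPerWeek": [0.05, -0.05, 0.20, -0.10],
}


def mean_abs(values):
    """Mean of the absolute values of a list of numbers."""
    return sum(abs(v) for v in values) / len(values)


# One importance score per feature, then rank from most to least important.
importance = {name: mean_abs(vals) for name, vals in contributions.items()}
ranked = sorted(importance, key=importance.get, reverse=True)

for name in ranked:
    print(f"{name:>12}: {importance[name]:.3f}")
```

Taking absolute values matters here: a feature that pushes some predictions strongly up and others strongly down would otherwise average out to a misleadingly small score.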

## Local Interpretation

As this is a "white box" model, we can look at the details of individual predictions.

In the interactive widget, we can select individual cases and then analyse how each feature contributes to the classification result.


In [12]:
# explain individual predictions for the first five test rows
local_explanation = model.explain_local(X_test.iloc[0:5], y_test.iloc[0:5])

In [13]:
show(local_explanation)
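
The bars in the local view can be read as additive contributions to the model's score. The sketch below shows how such contributions would combine into a predicted probability, assuming a logit link as is usual for binary classification. The intercept and the per-feature scores are hypothetical values, not output of the fitted model.

```python
import math

# Hypothetical values; in a fitted EBM these would come from the model.
intercept = -1.5
term_scores = {
    "Age": 0.30,
    "EducationNum": 0.80,
    "MaritalStatus": 0.90,
    "CapitalGain": -0.10,
}


def sigmoid(z):
    """Map a log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# The score is the intercept plus the sum of all term contributions.
logit = intercept + sum(term_scores.values())
probability = sigmoid(logit)

print(f"log-odds: {logit:.2f}, probability: {probability:.3f}")
```

Positive contributions push the prediction towards the positive class and negative contributions push it away, so the sign and size of each bar can be interpreted directly.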