<center>
    <h1>Explainable Boosting Machines for Tabular Data</h1>
    <h2>Interpretable or Accurate? Why Not Both?</h2>
    
</center>



<p style="text-align:center;"><img src="https://imgur.com/vYz7PYw.png" alt="Dashboard" width="800" height="400"></p>

<center>A dashboard showing the performance comparison of various regressors in InterpretML</center>


As summed up by [Miller](https://arxiv.org/abs/1706.07269), interpretability refers to the degree to which a human can understand the cause of a decision. Lately, there has been a lot of emphasis on building inherently interpretable models and moving away from their black-box counterparts. [Explainable Boosting Machines](https://www.youtube.com/watch?v=MREiHgHgl0k) (EBMs) are models designed to have accuracy comparable to state-of-the-art machine learning methods like Random Forest and Boosted Trees while being highly intelligible and explainable. In this notebook, we will look at the idea behind EBMs and implement them for the given problem via [InterpretML](https://arxiv.org/pdf/1909.09223.pdf), a unified framework for machine learning interpretability.

This notebook is an extension of an article I wrote on the same topic, which you may find useful: [Interpretable or Accurate? Why Not Both?](https://towardsdatascience.com/interpretable-or-accurate-why-not-both-4d9c73512192?sk=2f44377541a2f49939c921e54eb3cde7)

---

## What are EBMs?

An EBM is a type of [generalized additive model](https://projecteuclid.org/journals/statistical-science/volume-1/issue-3/Generalized-Additive-Models/10.1214/ss/1177013604.full), or GAM for short. To see what that means, start with linear models. They assume a linear relationship between the response and the predictors, so they cannot capture non-linearities in the data.

Linear Model: y = β0 + β1x1 + β2x2 + … + βn xn

To overcome this shortcoming, in the late 1980s statisticians [Hastie & Tibshirani developed generalized additive models](https://projecteuclid.org/journals/statistical-science/volume-1/issue-3/Generalized-Additive-Models/10.1214/ss/1177013604.full) (GAMs), which keep the additive structure, and therefore the interpretability, of linear models. The linear relationship between the response and each predictor variable is [replaced by a non-linear smooth function](https://datascienceplus.com/generalized-additive-models/) (f1, f2, etc.) that captures the non-linearities in the data. GAMs are more accurate than simple linear models, and since they do not contain any interactions between features, users can still interpret them easily.

Additive Model: y = f1(x1) + f2(x2) + … + fn(xn)
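
To make the difference concrete, here is a minimal sketch in plain Python. The coefficients and the shape functions `f1` and `f2` are made up for illustration; in a real GAM they would be learned from data. The point is only that the prediction stays a plain sum of per-feature terms, so the effect of one feature can be read off independently of the others.

```python
import math

def linear_model(x1, x2, b0=1.0, b1=0.5, b2=-2.0):
    """Linear model: each feature contributes a straight line."""
    return b0 + b1 * x1 + b2 * x2

# Hypothetical shape functions; a real GAM learns these from data.
def f1(x1):
    return math.sin(x1)   # smooth, non-linear effect of x1

def f2(x2):
    return x2 ** 2        # U-shaped effect of x2

def additive_model(x1, x2):
    """Additive model: still a plain sum of per-feature terms."""
    return f1(x1) + f2(x2)

# Because the model is additive, moving x1 from 1 to 2 shifts the
# prediction by the same amount whatever the value of x2 is.
effect_at_low_x2 = additive_model(2.0, 0.0) - additive_model(1.0, 0.0)
effect_at_high_x2 = additive_model(2.0, 5.0) - additive_model(1.0, 5.0)
print(effect_at_low_x2, effect_at_high_x2)
```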

EBMs improve on GAMs by learning the feature functions with techniques like gradient boosting and bagging. EBMs also include pairwise interaction terms, which increases their accuracy even further.

EBMs: y = Ʃi fi(xi) + Ʃij fij(xi, xj)
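
The following toy sketch illustrates the boosting idea behind the per-feature terms: features are visited one at a time in round-robin fashion, and each visit nudges that feature's function towards the current residuals with a small learning rate. It is not InterpretML's implementation. As a simplification, each feature function is a lookup table over equal-width bins updated with per-bin mean residuals (the real library fits small trees), and bagging and interaction terms are left out. All names (`fit_toy_ebm`, `bin_index`, `predict`) and the toy data are hypothetical.

```python
def bin_index(value, n_bins):
    """Map a value in [0, 1] to one of n_bins equal-width bins."""
    return min(int(value * n_bins), n_bins - 1)

def fit_toy_ebm(X, y, n_bins=8, n_rounds=200, learning_rate=0.1):
    n_features = len(X[0])
    intercept = sum(y) / len(y)
    # One lookup table (shape function) per feature, initialised to zero.
    tables = [[0.0] * n_bins for _ in range(n_features)]
    bins = [[bin_index(row[j], n_bins) for j in range(n_features)] for row in X]

    def predict_row(i):
        return intercept + sum(tables[j][bins[i][j]] for j in range(n_features))

    for _ in range(n_rounds):
        for j in range(n_features):  # round-robin: one feature at a time
            residuals = [y[i] - predict_row(i) for i in range(len(y))]
            sums = [0.0] * n_bins
            counts = [0] * n_bins
            for i, r in enumerate(residuals):
                sums[bins[i][j]] += r
                counts[bins[i][j]] += 1
            # Small step towards the mean residual in each bin.
            for b in range(n_bins):
                if counts[b]:
                    tables[j][b] += learning_rate * sums[b] / counts[b]
    return intercept, tables

def predict(model, row):
    intercept, tables = model
    n_bins = len(tables[0])
    return intercept + sum(t[bin_index(v, n_bins)] for t, v in zip(tables, row))

# Toy data on a grid: y is non-linear in x1 and a step function of x2.
grid = [i / 19 for i in range(20)]
X = [[a, b] for a in grid for b in grid]
y = [a ** 2 + (1.0 if b > 0.5 else 0.0) for a, b in X]

model = fit_toy_ebm(X, y)
mse = sum((predict(model, row) - t) ** 2 for row, t in zip(X, y)) / len(y)
baseline_mse = sum((model[0] - t) ** 2 for t in y) / len(y)
print(f"baseline MSE: {baseline_mse:.4f}, toy EBM MSE: {mse:.4f}")
```

Each learned table can be plotted on its own, which is what makes the fitted model readable feature by feature.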

---


## InterpretML: A Unified Framework for Machine Learning Interpretability

EBMs come packaged within a machine learning interpretability toolkit called [InterpretML](https://arxiv.org/pdf/1909.09223.pdf). It is an open-source package for training interpretable models as well as explaining black-box systems. Within InterpretML, the explainability algorithms are organized into two major sections: **Glassbox models** and **Blackbox explanations**. This means the tool can not only explain the decisions of inherently interpretable models but also provide possible reasoning for black-box models. The following code architecture from the [official paper](https://arxiv.org/pdf/1909.09223.pdf) sums it up nicely.

![code architecture from the official paper | Source: [InterpretML: A Unified Framework for Machine Learning Interpretability](https://arxiv.org/pdf/1909.09223.pdf)](https://cdn-images-1.medium.com/max/2030/1*MxM1QHK31w16F9U0d5t7CQ.png)

To showcase EBM's properties, we'll first train a model using only the `target_carbon_monoxide` as the target column. However, later we will use all the other target columns to create our final submission.

In [None]:
#Installation
!pip install interpret -q

In [None]:
## Importing the necessary libraries and data
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

#interpretml 
from interpret import show
from interpret.data import Marginal
from interpret.glassbox import ExplainableBoostingRegressor, LinearRegression, RegressionTree

seed = 1
train = pd.read_csv('../input/tabular-playground-series-jul-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-jul-2021/test.csv')


In [None]:
train.head()

Columns to be used for training:

In [None]:
columns = ['deg_C','relative_humidity','absolute_humidity','sensor_1','sensor_2','sensor_3','sensor_4','sensor_5']

In [None]:
target = 'target_carbon_monoxide'
X_train, X_test, y_train, y_test = train_test_split(train[columns], train[target], test_size=0.20, random_state=seed)

We have split the training data into a training set and a hold-out set so that we can see how InterpretML helps with the following:
* Exploring the dataset
* Training the Explainable Boosting Machine (EBM)
* Understanding what the model learnt overall - Global Explanations
* Understanding how an individual prediction was made - Local Explanations

## Exploring the dataset

Interpret exposes a top-level method, `show`, which acts as the surface for rendering explanation visualizations.

In [None]:
marginal = Marginal().explain_data(X_train, y_train, name = 'Train Data')
show(marginal)

Feel free to interact with the above visualisation to gain an understanding of your data.

## Training the Explainable Boosting Machine (EBM)

In [None]:
ebm = ExplainableBoostingRegressor(random_state=seed, n_jobs=-1)
ebm.fit(X_train, y_train) 

## Interpretability Approaches

![](https://miro.medium.com/max/642/1*8ov3dWV39WHkx8SG6pMXWA.png)

## Global Explanations: explaining the entire model behavior

In [None]:
ebm_global = ebm.explain_global(name='EBM')
show(ebm_global)
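
As a rough intuition for the overall importance ranking in a global explanation, one common summary is the mean absolute contribution a feature makes across the rows of a dataset. The sketch below assumes that definition and uses made-up contribution values for three of our columns; it does not read anything from the fitted `ebm`.

```python
# Hypothetical per-row contributions of each feature (one value per sample).
contributions = {
    "sensor_2": [0.8, -0.5, 1.1, -0.9],
    "deg_C": [0.1, -0.2, 0.05, 0.15],
    "relative_humidity": [-0.3, 0.4, -0.2, 0.3],
}

def mean_abs(values):
    """Average size of a feature's contribution, ignoring its sign."""
    return sum(abs(v) for v in values) / len(values)

importances = {name: mean_abs(vals) for name, vals in contributions.items()}

# Rank features from most to least important.
ranking = sorted(importances, key=importances.get, reverse=True)
print(ranking)
```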

## Local Explanations: explaining individual predictions

In [None]:
ebm_local = ebm.explain_local(X_test[:5], y_test[:5], name='EBM')
show(ebm_local)
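
Because the model is additive, a local explanation is essentially a breakdown of one prediction into an intercept plus one contribution per term. A minimal sketch with hypothetical numbers (the intercept and contributions below are invented, not taken from the fitted model):

```python
intercept = 2.0  # hypothetical baseline prediction

# Hypothetical contributions of each feature for a single test row.
row_contributions = {"sensor_2": 0.6, "deg_C": -0.1, "absolute_humidity": 0.25}

# The prediction is the intercept plus the sum of all contributions.
prediction = intercept + sum(row_contributions.values())

# List the features with the largest effect (in either direction) first.
ordered = sorted(row_contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, value in ordered:
    print(f"{name:>20}: {value:+.2f}")
print(f"{'prediction':>20}: {prediction:.2f}")
```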

## Evaluating EBM performance on the hold out dataset

In [None]:
from interpret.perf import RegressionPerf

ebm_perf = RegressionPerf(ebm.predict).explain_perf(X_test, y_test, name='EBM')
show(ebm_perf)
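
For reference, the usual regression metrics behind a performance view like this are easy to compute by hand. The sketch below defines RMSE, MAE and R² in plain Python on a tiny made-up example; the function names and numbers are illustrative and are not part of InterpretML.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions.
y_true_demo = [1.0, 2.0, 3.0, 4.0]
y_pred_demo = [1.5, 2.0, 2.5, 4.0]

print(rmse(y_true_demo, y_pred_demo), mae(y_true_demo, y_pred_demo), r2(y_true_demo, y_pred_demo))
```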

## Comparing EBM performance with other Regressors - Linear Regression, RegressionTree and Random Forest

Interpret also lets us compare the performance of multiple models using the same dashboard view.

In [None]:
lr = LinearRegression(random_state=seed)
lr.fit(X_train, y_train)

rt = RegressionTree(random_state=seed)
rt.fit(X_train, y_train)

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=seed)
rf.fit(X_train, y_train)

In [None]:
lr_perf = RegressionPerf(lr.predict).explain_perf(X_test, y_test, name='Linear Regression')
rt_perf = RegressionPerf(rt.predict).explain_perf(X_test, y_test, name='Regression Tree')
rf_perf = RegressionPerf(rf.predict).explain_perf(X_test, y_test, name='Blackbox')


In [None]:
lr_global = lr.explain_global(name='Linear Regression')
rt_global = rt.explain_global(name='Regression Tree')



## Comparing Performances of different models

In [None]:
show(lr_perf)
show(rt_perf)
show(ebm_perf)
show(rf_perf)

# TPS Competition
## Creating multiple predictions and merging them in a common submission file

Now that we have an idea of how EBMs work, let's calculate the predictions for each of the target columns, i.e., `target_benzene`, `target_carbon_monoxide` and `target_nitrogen_oxides`.

In [None]:
def predictions(target_column):
    """
    Train an EBM on the full training set for the given target column
    and return its predictions for the test set.
    """
    training_set = train[columns]
    target_values = train[target_column]
    # Fit a fresh model for each target column
    ebm = ExplainableBoostingRegressor(n_jobs=-1)
    ebm.fit(training_set, target_values)
    preds = ebm.predict(test[columns])
    return preds

In [None]:
preds_benzene = predictions('target_benzene')
preds_carbon_monoxide = predictions('target_carbon_monoxide')
preds_nitrogen_oxides = predictions('target_nitrogen_oxides')

In [None]:
submission = pd.DataFrame({
    'date_time': test.date_time,
    'target_carbon_monoxide': preds_carbon_monoxide,
    'target_benzene': preds_benzene,
    'target_nitrogen_oxides': preds_nitrogen_oxides
})

submission.head()

In [None]:
submission.to_csv('submission.csv',index=False)