## 5. Explainability

Authors : Haddam Yacine, Ka Alioune, Renaud Adrien

<p align="center">
  <a>
    <img src="../src/figures/logo-hi-paris-retina.png" alt="Logo" width="280" height="180">
  </a>

  <h3 align="center">Data Science Bootcamp</h3>
</p>

Machine learning (ML) models are increasingly complex. A sophisticated model (random forest, boosting or deep learning) generally yields more accurate predictions than a simple model (linear regression or a decision tree), but it is harder to interpret. There is thus a trade-off between the performance of a model and its interpretability.

Interpretability is defined as the ability of a human to understand the reasons for a model’s decision. This criterion has become essential for several reasons:

- **Scientific**: It is about understanding the model, trusting it, and having evidence of its consistency.


- **Ethical**: It is unacceptable to entrust the fate of people or the economy to algorithms without being able to justify how these algorithms reach their decisions.


- **Legislative**: [Article 22](https://www.cnil.fr/fr/profilage-et-decision-entierement-automatisee) of the GDPR (General Data Protection Regulation, RGPD in French) provides that a person must not be subject to a decision based solely on automated processing, that is, a decision made by a machine alone.

In this lab, we present two methods for interpreting machine learning models: the **LIME** and **SHAP** algorithms.

## Interpretability methods

The different interpretability approaches can be defined according to the following typologies:

- **Agnostic** versus **specific** interpretation methods: Agnostic methods can be used with any type of model. In contrast, specific methods can only be used to interpret a specific family of algorithms.


- **Local** versus **global** methods: Local methods give an interpretation for a single observation or a small number of observations. In contrast, global methods explain the model's behavior over all observations at once.
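To make the typology concrete, here is a minimal sketch of a method that is both agnostic and global: permutation importance, which only needs a prediction function and measures how much the error grows when one feature is shuffled over the whole data set. The model `black_box_predict`, the synthetic data and the helper `permutation_importance_sketch` are hypothetical names introduced for illustration; they are not part of this lab.

```python
import numpy as np


def black_box_predict(X):
    """Hypothetical model: only its prediction function is needed (agnostic)."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1] ** 2


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MSE when one column is shuffled (computed on all rows)."""
    rng = np.random.default_rng(seed)
    base_error = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - base_error)
        importances[j] = np.mean(increases)
    return importances


rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # the third column is never used by the model
y = black_box_predict(X)        # targets produced by the model itself
importances = permutation_importance_sketch(black_box_predict, X, y)
print(importances)
```

The unused third feature gets an importance of zero, while the other two are ranked by how much shuffling them degrades the predictions. A local method would instead explain one row of `X` at a time.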


In [None]:
import sys

### Install shap package

We need to reinstall the package before doing anything else.

Then reload the notebook so that the new installation is picked up!

In [None]:
!conda install --yes -c conda-forge --prefix {sys.prefix} shap
!{sys.executable} -m pip install shap

### Data Path

`data_dir` is the path to the data folder.

In [None]:
data_dir = "/home/jovyan/personal_workspace/bootcamp/data"

In [None]:
import os
import sys

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn import tree
import lime
import lime.lime_tabular
import shap


pd.set_option('display.max_columns', 500)

## Load data

In [None]:
train = pd.read_feather(os.path.join(data_dir, 'model/train.feather'))
val = pd.read_feather(os.path.join(data_dir, 'model/val.feather'))
test = pd.read_feather(os.path.join(data_dir, 'model/test.feather'))

In [None]:
features = [
    # 'building_id',
    'lat',
    'lng',
    'square_feet',
    'air_temperature',
    'dew_temperature',
    'precip_depth_1_hr',
    'wind_speed',
    'sea_level_pressure',
    'wind_direction',
    'hour',
    'weekday',
    'month',
    'meter_name_chilledwater',
    'meter_name_electricity',
    'meter_name_hotwater',
    'meter_name_steam',
    'primary_use_Education',
    'primary_use_Entertainment/public assembly',
    'primary_use_Healthcare',
    'primary_use_Industry',
    'primary_use_Lodging/residential',
    'primary_use_Office',
    'primary_use_Other',
    'primary_use_Parking',
    'primary_use_Public services',
    'primary_use_Services',
    'zone_geo_EUROPE',
    'zone_geo_US',
    'site_id_0',
    'site_id_1',
    'site_id_2',
    'site_id_3',
    'site_id_4',
    'site_id_5',
    'site_id_6',
    'site_id_7',
    'site_id_9',
    'site_id_11',
    'site_id_12',
    'site_id_13',
    'site_id_15',
]

target = "meter_reading"

In [None]:
# Reduced feature list obtained from model optimization (replaces the full list above)
features = [
    'square_feet',
    "site_id_13",
    'air_temperature', 'precip_depth_1_hr', 'wind_speed',
    'hour', 'weekday', 'month',
    'meter_name_chilledwater',
    'meter_name_electricity',
    'meter_name_hotwater',
    'meter_name_steam',
    'zone_geo_US',
    'primary_use_Education'
]

## Fitting a model

In [None]:
regressor = RandomForestRegressor(
    n_estimators=10,
    max_depth=16,
    random_state=0,
    n_jobs=-1
)

regressor.fit(train[features], train[target])

## LIME

The LIME algorithm (Local Interpretable Model-agnostic Explanations) is a local method that seeks to explain the prediction for an individual observation by analyzing its neighborhood.

LIME builds a model that is:

- *Interpretable*. It provides a qualitative understanding of the relationship between the input variables and the response. The input-output relationships are easy to understand.

- *Locally simple*. The model to explain is globally complex, so LIME looks for simpler answers locally.

- *Agnostic*. It can explain any machine learning model.

<font color = "red"> The main drawback of the LIME method is linked to its local operation: LIME does not allow us to generalize the interpretation of the local model to a more global level. </font>
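The idea behind a local surrogate can be sketched in a few lines: perturb the instance, query the model on the perturbed points, weight them by proximity, and fit a weighted linear model. This is a simplified illustration, not the `lime` library's implementation (which, among other things, discretizes features and uses a regularized regression). The function `black_box_predict`, the helper `local_surrogate` and its parameters `scale` and `kernel_width` are assumptions made for this example.

```python
import numpy as np


def black_box_predict(X):
    """Hypothetical complex model, non-linear in both features."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2


def local_surrogate(predict, x, n_samples=2000, scale=0.1, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear model around x; return (intercept, coefs)."""
    rng = np.random.default_rng(seed)
    # 1. Perturb the instance to sample its neighborhood
    Z = x + rng.normal(scale=scale, size=(n_samples, x.size))
    # 2. Query the black box on the perturbed points
    y = predict(Z)
    # 3. Weight each point by its proximity to x (exponential kernel)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Weighted least squares on the design matrix [1, Z - x]
    A = np.hstack([np.ones((n_samples, 1)), Z - x])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return beta[0], beta[1:]


x = np.array([0.0, 1.0])
intercept, coefs = local_surrogate(black_box_predict, x)
print(intercept, coefs)
```

Around `x = (0, 1)` the coefficients come out close to the local slopes of the function (about 1 and 2). Running the same sketch at another point gives different coefficients, which is why a local explanation cannot be read as a global one.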

### LimeTabularExplainer

Now let us apply LIME. First, we create our explainer with **LimeTabularExplainer**. This class needs a training data set, whose feature statistics it uses to generate perturbed observations and to compute the similarity between observations.

In [None]:
explainer = lime.lime_tabular.LimeTabularExplainer(
    np.array(train[features]),
    feature_names=features,
    verbose=True,
    mode='regression'
)

### Explaining an instance

The LimeTabularExplainer has a method named **explain_instance()**, which takes as input a single sample and a function that predicts the output. It returns an Explanation object that holds the contribution of each feature to this particular prediction.

How can we explain this consumption according to our algorithm using LIME?

We randomly select an instance from site no. 3.

In [None]:
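instance_indice = 387608

# explain_instance expects the row to explain as a 1-D numpy array
input_test = val[features].iloc[instance_indice].to_numpy()

explanation = explainer.explain_instance(
    input_test,
    regressor.predict,
    num_features=len(features),
    # Number of perturbed samples used to fit the local model; it is very
    # small here (the lime default is 5000), so the coefficients may vary
    # noticeably from one run to the next
    num_samples=10
)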

We have applied the LIME methodology to this instance; now we plot the result:

In [None]:
results = pd.DataFrame(explanation.as_list(), columns=["names", "coef"])

with plt.style.context("ggplot"):
    fig = plt.figure(figsize=(16, 8))
    plt.barh(range(len(results.coef)), results.coef, color=["green" if coef < 0 else "red" for coef in results.coef])
    plt.yticks(range(len(results.coef)), results.names);
    plt.title("Local Explanation with LIME")

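**Red**: features that lead to overconsumption (positive coefficient). **Green**: features that reduce the predicted consumption (negative coefficient).

### Warning!

The interpretation given by LIME is strictly local and cannot be generalized to all of your data.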

## SHAP

The goal of SHAP (SHapley Additive exPlanations) is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the “payout” (= the prediction) among the features. A player can be an individual feature value, e.g. for tabular data. A player can also be a group of feature values.
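The fair distribution described above can be checked by brute force on a tiny example. The sketch below computes exact Shapley values by enumerating every coalition, under the simplifying assumption that a feature absent from a coalition is replaced by a fixed baseline value (the SHAP library instead averages over background data and uses faster algorithms). The function `model`, the helper `shapley_values` and the `baseline` are hypothetical.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical model with an interaction between features 0 and 1."""
    return 2.0 * x[0] + x[0] * x[1] + 3.0 * x[2]


def shapley_values(f, x, baseline):
    """Exact Shapley values; absent features are set to their baseline value."""
    n = len(x)

    def value(coalition):
        # Payout of a coalition: prediction with only its features "present"
        z = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi)
```

The values sum to `model(x) - model(baseline)`, and the interaction term `x[0] * x[1]` is split equally between features 0 and 1. The enumeration grows exponentially with the number of features, so it only works for very small examples.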

Shapley values can be combined into global explanations. If we run SHAP for every instance, we get a matrix of Shapley values. This matrix has one row per data instance and one column per feature. We can interpret the entire model (globally) by analyzing the Shapley values in this matrix.
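One common way to turn such a matrix into a global ranking is to average the absolute Shapley values of each column. The sketch below shows this aggregation on a small matrix; the numbers and the feature names are made up purely for illustration and do not come from the model fitted in this lab.

```python
import numpy as np

# Hypothetical matrix of Shapley values: 4 instances (rows) x 3 features (columns)
feature_names = ["feature_a", "feature_b", "feature_c"]
shap_matrix = np.array([
    [120.0, -10.0, 5.0],
    [-80.0, 30.0, -5.0],
    [200.0, -20.0, 10.0],
    [-100.0, 15.0, 0.0],
])

# Global importance of a feature: mean absolute Shapley value over all instances
global_importance = np.abs(shap_matrix).mean(axis=0)

# Rank the features from most to least important
order = np.argsort(global_importance)[::-1]
ranking = [(feature_names[j], float(global_importance[j])) for j in order]
print(ranking)
```

Taking the absolute value matters: a feature that pushes some predictions up and others down would otherwise average out to a small number despite having a large effect.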

**Advantages**

- fast implementation for tree-based models
- global model interpretations

### Explainer

In this case, we use **TreeExplainer**, which is designed for tree-based models such as decision trees, random forests and gradient boosting. There are many other kinds of explainers for different machine learning models ([here](https://coderzcolumn.com/tutorials/machine-learning/shap-explain-machine-learning-model-predictions-using-game-theoretic-approach)).

### SHAP Summary Plot
The summary plot combines feature importance with feature effects:

- Variables are ranked by feature importance in descending order.
- The color represents the value of the feature, from low to high.
- Each point on the summary plot is a Shapley value for a feature and an instance.

Example:

<img src="../src/figures/shap.png" width=1000 height=700 />

In [None]:
import shap


shap.initjs()

# Compute SHAP values on a random sample of the validation set for the summary plot
samples = val[features].sample(1000)

explainer = shap.TreeExplainer(regressor)
shap_values = explainer.shap_values(samples, approximate=False, check_additivity=False)

In [None]:
shap.summary_plot(shap_values, samples, alpha=0.5, plot_size=(20, 8))

The main conclusions we can draw from the global estimates of the explanatory factors of energy consumption are as follows:

- The area of the buildings is the most influential factor in energy consumption: the larger the area, the more energy overconsumption is observed.

- Buildings for educational use are also the buildings that consume the most energy.

- Very high temperatures and poor air circulation are also a source of energy overconsumption.

- Chilled water and steam are the most energy-intensive energy sources.

The regional effect can be explored by separating the data into European and US data to see whether the effects are the same in the two regions.

## To Do

1. Fit a regression only on European buildings
2. Explain the main factors of overconsumption with SHAP
3. Do the same analysis with US buildings
4. Do you notice a difference between the two regions in the main factors of overconsumption?


**Go further!**


- [More about tree feature importance](https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3)
- [More about permutation importance](https://scikit-learn.org/stable/modules/permutation_importance.html)
- [The technical details about LIME](https://christophm.github.io/interpretable-ml-book/lime.html)
- [A Python example of how LIME is used (the LIME GitHub is not the most helpful)](https://coderzcolumn.com/tutorials/machine-learning/how-to-use-lime-to-understand-sklearn-models-predictions)
- [SHAP's GitHub (you'll also find the research paper there)](https://github.com/slundberg/shap)