# **DSFM Workshop**: Model Interpretation

---

## **Section 2**: Model-agnostic methods

Creator: [Data Science for Managers - EPFL Program](https://www.dsfm.ch)  
Source:  [https://github.com/dsfm-org/code-bank.git](https://github.com/dsfm-org/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

---

## **Overview**

Model-agnostic methods separate the explanations from the prediction model. There are many different model-agnostic methods, including partial dependence plots (PDPs), accumulated local effects (ALEs), permutation feature importance, local surrogates (LIME), and Shapley values, each with its own advantages and disadvantages.

## **Learning goals**

- Learn how to interpret different model-agnostic methods, such as partial dependence plots, permutation feature importance, and Shapley values
- Get an intuition about the advantages and disadvantages of each method
- Experiment with `SHAP`, a powerful interpretability package for Python

## **Useful resources**

- Chapter 5, Molnar (2019)
- `SHAP` GitHub repository containing many [example notebooks](https://github.com/slundberg/shap)
- `LIME` GitHub repository, another common [model-agnostic interpretation method](https://github.com/marcotcr/lime)
- NIPS paper introducing Shapley values for model interpretation: http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions


---

<center><img src="https://images.unsplash.com/photo-1584748452591-640305621fc5?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=908&q=80" width=300></center>

A US Census Bureau letter. [Image source](https://images.unsplash.com/photo-1584748452591-640305621fc5?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=908&q=80)



## **Part 1:** Load data

We will use the **Adult income dataset** from the UCI machine learning repository, which has 12 variables and a binary target feature. We predict the probability of an individual making over $50k a year in annual income. Data source: https://archive.ics.uci.edu/ml/datasets/adult

In [None]:
from sklearn.model_selection import train_test_split
import shap

X, y = shap.datasets.adult()

# create a train/test split
SEED = 1
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=SEED)

X_train.describe()

In [None]:
print('Original shape: {}'.format(X.shape))
# Report X and y shapes together for each split
print('Train shapes: {} {} '.format(X_train.shape, y_train.shape) + 'Test shapes: {} {}'.format(X_test.shape, y_test.shape))

In [None]:
# What's the target feature base rate?
round(sum(y_train)/len(y_train), 4)

## **Part 2:** Partial dependence plots (PDPs)

A simple way to interpret the impact of each predictor on the target variable is the partial dependence plot. Assuming that the input variables are uncorrelated, these plots show how the average prediction changes when the value of the i-th feature changes.
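
To make the mechanics concrete, here is a minimal, dependency-free sketch of how a one-dimensional partial dependence curve can be computed by hand. The names `toy_predict`, `toy_data` and `partial_dependence_1d` are hypothetical and exist only for this illustration; the toy "model" is a simple linear function, not the classifier fitted below.

```python
# Hypothetical toy model: any function that maps a feature row to a prediction works here
def toy_predict(row):
    age, hours = row
    return 0.01 * age + 0.005 * hours

# Hypothetical data: (age, hours worked per week)
toy_data = [(25, 40), (35, 50), (45, 20), (55, 60)]

def partial_dependence_1d(predict, data, feature_idx, grid):
    """Average prediction when feature `feature_idx` is forced to each grid value."""
    curve = []
    for value in grid:
        preds = []
        for row in data:
            modified = list(row)
            modified[feature_idx] = value  # overwrite the feature for every row
            preds.append(predict(modified))
        curve.append(sum(preds) / len(preds))
    return curve

age_grid = [20, 40, 60]
pd_age = partial_dependence_1d(toy_predict, toy_data, feature_idx=0, grid=age_grid)
for value, avg in zip(age_grid, pd_age):
    print(value, round(avg, 4))
```

Because the toy model is linear, the resulting curve is a straight line in `age`; all other features only shift its level.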

In [None]:
from sklearn.inspection import plot_partial_dependence, partial_dependence
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [16, 14]
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=SEED).fit(X_train, y_train)
# clf = LogisticRegression(random_state=SEED, solver='liblinear').fit(X_train, y_train)
print('AUC: {}'.format(roc_auc_score([int(i) for i in y_test], clf.predict_proba(X_test)[:, 1])))

# Note: by default, the feature grid (x-axis) spans the 5th to 95th percentile of each feature
fig = plot_partial_dependence(clf, X_train, features = X_train.columns.values, grid_resolution = 300) 
fig.figure_.tight_layout()

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import numpy as np

fig = plt.figure()

features = ('Education-Num', 'Age')
pdp, axes = partial_dependence(clf, X_train, features=features, grid_resolution=20, percentiles=(0,1))
XX, YY = np.meshgrid(axes[0], axes[1])
Z = pdp[0].T
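# Create the 3D axes through the figure; Axes3D(fig) may not attach itself to the figure in recent matplotlib versions
ax = fig.add_subplot(111, projection='3d')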
surf = ax.plot_surface(XX, YY, Z, rstride=1, cstride=1, cmap=plt.cm.BuPu, edgecolor='k')
ax.set_xlabel(features[0])
ax.set_ylabel(features[1])
ax.set_zlabel('Partial dependence')
#  Pretty init view
ax.view_init(elev=22, azim=122)
plt.colorbar(surf)
plt.suptitle('Partial dependence of the predicted income class on\n'
             'education and age, with Gradient Boosting')
plt.subplots_adjust(top=0.9)

plt.show()

**Review questions**:

- What would the PDPs look like when fitting a linear regression? What about a logistic regression?
- What do the tick marks on the X-axes represent? Answer: the deciles of the feature values, i.e. the 10th, 20th, ..., 90th percentiles

**In summary**, partial dependence plots show the marginal impact of a feature on the predicted outcome. 

**Advantages**: 

- Intuitive interpretation: each point on the partial dependence function simulates the average prediction if we were to force all data points to assume that value
- Computationally efficient 

**Disadvantages**:

- Features are assumed to be independent (hardly ever true): correlated features can lead to unrealistic feature combinations that make averaging predictions unreliable
- Heterogeneous effects might be hidden because plots only show *average* marginal effects
- Partial dependence plots might not show the feature distribution


## **Part 3:** Individual conditional expectation (ICE)

Similar to PDPs, an individual conditional expectation (ICE) plot visualizes the dependence between the feature of interest and the outcome. However, unlike PDPs, which show the average effect of the features of interest, ICE plots visualize the dependence of the prediction on a feature for *each sample separately*, with one line per sample. Only one feature of interest is supported for ICE plots.
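
The relationship between ICE and PDP can be sketched in a few lines: compute one curve per sample, then average the curves to recover the PDP. This is a dependency-free illustration under stated assumptions, not what `pdpbox` does internally; `toy_predict`, `toy_data` and `ice_curves` are hypothetical names, and the toy model deliberately contains an interaction so that the individual curves differ.

```python
# Hypothetical toy model with an interaction:
# age only matters for people working at least 40 hours per week
def toy_predict(row):
    age, hours = row
    return 0.01 * age if hours >= 40 else 0.2

# Hypothetical data: (age, hours worked per week)
toy_data = [(25, 40), (35, 50), (45, 20), (55, 10)]
age_grid = [20, 40, 60]

def ice_curves(predict, data, feature_idx, grid):
    """One curve per sample: vary a single feature, keep the rest of the row fixed."""
    curves = []
    for row in data:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature_idx] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves

curves = ice_curves(toy_predict, toy_data, feature_idx=0, grid=age_grid)

# The PDP is the point-wise average of the ICE curves
pd_curve = [sum(col) / len(col) for col in zip(*curves)]

# Centered ICE: subtract each curve's value at the first grid point
centered = [[v - curve[0] for v in curve] for curve in curves]

for curve in curves:
    print([round(v, 2) for v in curve])
print('PDP:', [round(v, 2) for v in pd_curve])
```

Two of the toy samples react to age and two do not. The averaged PDP shows a moderate upward slope that describes neither group well, which is exactly the kind of heterogeneity an ICE plot is meant to reveal.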

We will use the `pdpbox` package to draw ICE plots. For details, see [here](https://github.com/SauceCat/PDPbox).

In [None]:
from pdpbox import pdp, info_plots
pdp_age = pdp.pdp_isolate(model=clf, dataset=X_train, num_grid_points=10, model_features=X_train.columns, n_jobs=-1, feature='Age')

#ICE Plot
fig, axes = pdp.pdp_plot(pdp_age, 'Age', plot_lines=True, center=True, frac_to_plot=0.5, plot_pts_dist=True, figsize=[16, 12],
                         x_quantile=True, show_percentile=True)


## **Part 4:** Permutation feature importance

The idea of permutation feature importance is to **shuffle the values of one feature at a time** so as to break its relation with the target variable. We then measure the increase in prediction error. A feature is "more important" if shuffling its values increases the model error, because the model relies more on the feature for the prediction. A feature is "less important" if shuffling its values leaves the model error unchanged, because the model does not find the feature useful for the prediction.
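
The procedure can be sketched without any library. The example below is a simplified illustration, not the `scikit-learn` implementation: the data, the rule-based `toy_predict` model and the helper names are all hypothetical, and importance is measured as the average increase in misclassification rate.

```python
import random

rng = random.Random(1)

# Hypothetical data: feature 0 determines the label, feature 1 is pure noise
X_toy = [[rng.random(), rng.random()] for _ in range(200)]
y_toy = [int(row[0] > 0.5) for row in X_toy]

def toy_predict(row):
    """Hypothetical 'fitted' model that only looks at feature 0."""
    return int(row[0] > 0.5)

def error_rate(X, y):
    return sum(toy_predict(row) != label for row, label in zip(X, y)) / len(y)

def permutation_importance_sketch(X, y, feature_idx, n_repeats=10):
    """Average increase in error after shuffling one feature column."""
    baseline = error_rate(X, y)
    increases = []
    for _ in range(n_repeats):
        column = [row[feature_idx] for row in X]
        rng.shuffle(column)  # break the link between this feature and the target
        X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
                  for row, v in zip(X, column)]
        increases.append(error_rate(X_perm, y) - baseline)
    return sum(increases) / n_repeats

importances = [permutation_importance_sketch(X_toy, y_toy, j) for j in range(2)]
print(importances)
```

Shuffling the noise feature leaves the error unchanged, so its importance is zero, while shuffling the informative feature pushes the error close to random guessing.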

In [None]:
from sklearn.inspection import permutation_importance

pi = permutation_importance(clf, X_test, y_test, n_repeats=30, random_state=SEED)
pi_sorted = pi.importances_mean.argsort()

# Print feature importance 
for i in pi_sorted[::-1]:
    if pi.importances_mean[i] - 2 * pi.importances_std[i] > 0:
        print('{}'.format(X_train.columns[i]).ljust(20) + 'mean = {}'.format(round(pi.importances_mean[i], 2)).ljust(15) + 'std = {}'.format(round(pi.importances_std[i], 4)))
        

In [None]:
# Plot importance 
plt.rcParams['figure.figsize'] = [16, 8]
fig, ax = plt.subplots()
ax.boxplot(pi.importances[pi_sorted].T, vert = False, labels = X_train.columns[pi_sorted])
ax.set_title("Permutation Importances (TEST set)")
fig.tight_layout()
plt.show()

**In summary**, the permutation importance approach tells us how much the prediction error would increase without a particular feature. 

**Advantages**: 

- Intuitive interpretation: feature importance is the increase in model error when the feature's information is destroyed
- Very compressed, global insight into the model
- Permutation also destroys the feature's interactions with all other features, so the importance captures both main and interaction effects
- No re-training required

**Disadvantages**:

- Unclear whether you should use training or testing data: experiment with both
- Shuffling features at random requires many repeats to get reliable estimates
- Beware of the feature importance of highly correlated features; these can be unstable

## **Part 5:** Shapley values

Next, we can exploit the game theory concept of Shapley values to unpack the final prediction into its feature components.

### What are Shapley values?

The setup is as follows: a coalition of players cooperates, and obtains a certain overall gain from that cooperation. Since some players may contribute more to the coalition than others or may possess different bargaining power (for example threatening to destroy the whole coalition), what final distribution of generated surplus among the players should arise in any particular game? Or phrased differently: how important is each player to the overall cooperation, and what payoff can he or she reasonably expect? The Shapley value provides one possible answer to this question. (Wikipedia)

Let's consider a concrete example from Molnar (2019). Suppose you want to predict apartment prices. You get the following predictions for two apartments, one that allows cats and one that does not: 

<img src="https://christophm.github.io/interpretable-ml-book/images/shapley-instance-intervention.png" width=600>

Image source: https://christophm.github.io/interpretable-ml-book/images/shapley-instance-intervention.png

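Loosely speaking, what's the contribution of the `cat-banned` feature in this one comparison? It's the prediction with cats banned minus the prediction with cats allowed: EUR 310,000 - EUR 320,000 = -EUR 10,000. We repeat this computation for all possible coalitions, i.e. combinations of the other feature values, and take a weighted average of the resulting contributions. For a detailed description of this example, see: https://christophm.github.io/interpretable-ml-book/shapley.html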

The more formal interpretation of the Shapley value by Molnar goes as follows: "Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean prediction is the estimated Shapley value."
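
For a handful of features, Shapley values can be computed exactly by enumerating every coalition. The sketch below does this for a made-up payoff function: the feature names echo the apartment example, but the numbers in `toy_value` are hypothetical and are not taken from Molnar's book. Each marginal contribution is weighted by |S|! (n - |S| - 1)! / n!, the share of feature orderings in which the feature joins after coalition S.

```python
from itertools import combinations
from math import factorial

toy_features = ["park-nearby", "cat-banned", "area-50"]

def toy_value(coalition):
    """Hypothetical payoff (in EUR 1,000) when only the features in `coalition` are known."""
    s = set(coalition)
    total = 0.0
    if "park-nearby" in s:
        total += 30
    if "cat-banned" in s:
        total -= 10
    if "area-50" in s:
        total += 10
    if "park-nearby" in s and "area-50" in s:
        total += 20  # interaction between park and area
    return total

def shapley_values(features, value):
    """Exact Shapley values by enumerating all coalitions of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # marginal contribution of f when it joins this coalition
                total += weight * (value(subset + (f,)) - value(subset))
        phi[f] = total
    return phi

phi = shapley_values(toy_features, toy_value)
for name, contribution in phi.items():
    print(name.ljust(15), round(contribution, 2))
print('Sum of contributions:', round(sum(phi.values()), 2))
```

The interaction bonus is split evenly between `park-nearby` and `area-50`, and the contributions add up to the payoff of the full coalition minus the payoff of the empty one.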

**Gradient boosting machine** methods are state-of-the-art for these types of prediction problems with tabular style input data of many modalities. We will use an implementation called "Tree SHAP" that allows for the exact computation of SHAP values for tree ensemble methods, and has been integrated directly into the C++ LightGBM code base. This allows fast exact computation of SHAP values without sampling and without providing a background dataset (since the background is inferred from the coverage of the trees).

In [None]:
import lightgbm as lgb

# Print the JS visualization code to the notebook
shap.initjs()

d_train = lgb.Dataset(X_train, label=y_train)
d_test = lgb.Dataset(X_test, label=y_test)

params = {
    "max_bin": 512,
    "learning_rate": 0.05,
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "binary_logloss",
    "num_leaves": 10,
    "verbose": -1,
    "min_data": 100,
    "boost_from_average": True
}

model = lgb.train(params, d_train, 10000, valid_sets=[d_test], early_stopping_rounds=50, verbose_eval=1000)

### SHAP summary plot of feature importances

Rather than use a typical feature importance bar chart, we use a density scatter plot of SHAP values for each feature to identify how much impact each feature has on the model output for individuals in the dataset. Features are sorted by the **sum of the SHAP value magnitudes across all samples**. It is interesting to note that the relationship feature has more total model impact than the capital gain feature, but for those samples where capital gain matters it has more impact than age. In other words, capital gain affects a few predictions by a large amount, while age affects all predictions by a smaller amount.
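
The ordering rule is easy to reproduce by hand. The sketch below uses a small, entirely made-up SHAP matrix (`toy_shap` is hypothetical and not the output of the model in this notebook) to show how a feature can rank high on total impact while another feature has the single largest effect.

```python
feature_names = ["Age", "Capital Gain", "Relationship"]

# Hypothetical SHAP values: one row per sample, one column per feature
toy_shap = [
    [0.3, 0.0, -0.6],
    [-0.2, 0.0, 0.5],
    [0.3, 2.0, -0.4],  # capital gain matters a lot, but only for this sample
    [-0.3, 0.0, 0.6],
]

# Global importance: sum of absolute SHAP values per feature
total_impact = {
    name: sum(abs(row[j]) for row in toy_shap)
    for j, name in enumerate(feature_names)
}

# Largest single effect per feature
max_impact = {
    name: max(abs(row[j]) for row in toy_shap)
    for j, name in enumerate(feature_names)
}

ranking = sorted(total_impact, key=total_impact.get, reverse=True)
print('Ranking by total impact:', ranking)
print('Largest single effect:', max(max_impact, key=max_impact.get))
```

In this toy matrix the relationship feature ranks first on total impact, while capital gain produces the largest single effect.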


In [None]:
# Here we use the Tree SHAP implementation integrated into LightGBM to explain the entire dataset (32561 samples).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)

### Visualize one prediction

In [None]:
# Here we visualize a single prediction
idx = 0  # Change the sample index to experiment
X_display,y_display = shap.datasets.adult(display=True)
shap.force_plot(base_value = explainer.expected_value[1], 
                shap_values = shap_values[1][idx,:], 
                features = X_display.iloc[idx,:])


### Visualize many predictions

In [None]:
shap.force_plot(explainer.expected_value[1], shap_values[1][:1000,:], X_display.iloc[:1000,:])


### Dependence plots

SHAP dependence plots show the effect of a single feature across the whole dataset. They plot a feature's value vs. the SHAP value of that feature across many samples. SHAP dependence plots are similar to partial dependence plots, but **account for the interaction effects present in the features**, and are only defined in regions of the input space supported by data. The vertical dispersion of SHAP values at a single feature value is driven by interaction effects, and another feature is chosen for coloring to highlight possible interactions.
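
Why do interactions create vertical dispersion? A two-feature toy case makes this visible. The sketch below assumes a simplified setting in which an "absent" feature is replaced by a single baseline value (the SHAP package averages over background data instead); `shapley_feature1`, `additive` and `interacting` are hypothetical names used only for this illustration.

```python
def shapley_feature1(f, x, baseline):
    """Exact Shapley value of feature 1 in a two-feature model.

    Assumption: an absent feature is set to its baseline value.
    """
    x1, x2 = x
    b1, b2 = baseline
    alone = f(x1, b2) - f(b1, b2)   # feature 1 joins the empty coalition
    joined = f(x1, x2) - f(b1, x2)  # feature 1 joins after feature 2
    return 0.5 * (alone + joined)

def additive(x1, x2):
    return x1 + x2

def interacting(x1, x2):
    return x1 * x2

baseline = (0, 0)

# Same value of feature 1 (x1 = 2), two different values of feature 2
for x2 in (1, 3):
    print('x2 =', x2,
          '| additive:', shapley_feature1(additive, (2, x2), baseline),
          '| interacting:', shapley_feature1(interacting, (2, x2), baseline))
```

In the additive model the contribution of feature 1 is the same for both samples, so a dependence plot would show a single point at `x1 = 2`. In the interacting model the contribution depends on `x2`, which shows up as vertical spread.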


In [None]:
# Plot dependence plots for the first n_features features (vertical dispersion reflects interaction effects)
# Note: the interaction feature used for coloring is selected automatically as the one with the strongest apparent interaction

n_features = 10
for name in X_train.columns[:n_features]:
    shap.dependence_plot(name, shap_values[1], X, display_features=X_display)

**In summary**, SHAP is an easy-to-use package that uses the concept of Shapley values to unpack the contributions of each feature to the prediction.

**Advantages**: 

- Shapley values guarantee that the difference between a prediction and the average prediction is fairly distributed among the feature values, i.e. the *Efficiency* property of Shapley values
- Relatively intuitive interpretation: predictions as a coalition of feature values
- Grounded in extensive theory with four core axioms - efficiency, symmetry, dummy, additivity 

**Disadvantages**: 

- Computing the Shapley values is computationally expensive, i.e. exponential in the number of features :-(
- Can use unrealistic data instances when features are correlated (e.g. height and weight)
- Shapley provides only a value for each feature, not a prediction model like "if this feature was equal to X, then the prediction would be Y"
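
On the first disadvantage: when exact enumeration is too expensive, a common workaround is to approximate Shapley values by sampling random feature orderings and averaging the marginal contributions. The sketch below illustrates that idea on a made-up payoff function. It is not the Tree SHAP algorithm used above, and `toy_value` and `sampled_shapley` are hypothetical names with hypothetical numbers.

```python
import random

toy_features = ["park-nearby", "cat-banned", "area-50"]

def toy_value(coalition):
    """Hypothetical payoff (in EUR 1,000) when only the features in `coalition` are known."""
    s = set(coalition)
    total = 0.0
    if "park-nearby" in s:
        total += 30
    if "cat-banned" in s:
        total -= 10
    if "area-50" in s:
        total += 10
    if "park-nearby" in s and "area-50" in s:
        total += 20  # interaction between park and area
    return total

def sampled_shapley(features, value, n_permutations=2000, seed=1):
    """Monte Carlo estimate: average marginal contributions over random orderings."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    for _ in range(n_permutations):
        order = features[:]
        rng.shuffle(order)
        coalition = []
        previous = value(coalition)
        for f in order:
            coalition.append(f)  # f joins the coalition built so far
            current = value(coalition)
            totals[f] += current - previous
            previous = current
    return {f: total / n_permutations for f, total in totals.items()}

estimates = sampled_shapley(toy_features, toy_value)
for name, contribution in estimates.items():
    print(name.ljust(15), round(contribution, 2))
```

For this toy payoff the exact values are 40, -10 and 20, and the estimates should land close to them. The cost is now set by the number of sampled orderings rather than by the number of possible coalitions, at the price of some sampling noise.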