In [None]:
! pip install flaml[automl]

In [None]:
! pip install shap

In [None]:
import pandas as pd
import numpy as np

from flaml import AutoML
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import xgboost
import shap

import matplotlib.pyplot as plt
import seaborn as sns

This time we load a version of the dataset with some extra features.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/signalfel/xaiclinic/main/ved_ice_trips_extra_features.csv')

In [None]:
X = data.iloc[:, 1:]
y = data.total_fuel

In [None]:
X.shape

In [None]:
X.head()

In [None]:
# Perform the train-test split. Everyone uses the same random_state so that we
# can easily compare results
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1337)

## AutoML (from flaml)

Let's use AutoML to select a model for our dataset.

For those of you unfamiliar with it, AutoML trains several different models and tunes their hyperparameters automatically. All you have to do is specify the type of problem (regression here), the metric to optimize (R2 here) and the total time in seconds that training all the models may take. AutoML will do its best within that time budget.
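
To make the idea concrete, here is a deliberately tiny sketch of a time-budgeted model search. It is not how flaml works internally: the candidate "models" are just polynomial fits, and the names `tiny_model_search` and `r2` are made up for this illustration. The point is only the loop: try candidates, score each one with R2 on held-out data, stop when the budget runs out, and keep the best.

```python
import time
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination (the metric we ask AutoML to optimize)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def tiny_model_search(x_tr, y_tr, x_val, y_val, degrees, time_budget):
    """Hypothetical helper: try polynomial degrees until the budget (seconds) is spent."""
    start = time.perf_counter()
    best_degree, best_score = None, -np.inf
    for degree in degrees:
        if time.perf_counter() - start > time_budget:
            break  # budget exhausted: keep the best candidate found so far
        coeffs = np.polyfit(x_tr, y_tr, deg=degree)
        score = r2(y_val, np.polyval(coeffs, x_val))
        if score > best_score:
            best_degree, best_score = degree, score
    return best_degree, best_score

# Toy data: a noisy quadratic, split into training and validation parts
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=400)
y = x ** 2 + rng.normal(scale=0.1, size=400)

best_degree, best_score = tiny_model_search(
    x[:300], y[:300], x[300:], y[300:], degrees=[1, 2, 3], time_budget=5.0)
print(best_degree, round(float(best_score), 3))
```

A real AutoML library searches over model families and hyperparameters far more cleverly, but the contract is the same: a task, a metric and a time budget go in, and the best model found comes out.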

In [None]:
# Initialize an AutoML instance
automl = AutoML()
# Specify automl goal and constraint
automl_settings = {
    "time_budget": 60,  # in seconds
    "metric": "r2",
    "task": "regression",
    "log_file_name": "automl.log",
}

# Train with labeled input data
automl.fit(X_train=X_train, y_train=y_train, **automl_settings)

In [None]:
# Print the best model
print(automl.model.estimator)

# Performance of best model
# (built-in round() works whether r2_score returns a NumPy or a plain float)
print(round(r2_score(y_test, automl.model.predict(X_test)), 3))

**NOTE**: We don't care what ended up being the best model chosen by the AutoML. This is one of the good things about SHAP! We don't have to! It's model agnostic! 🎉

# **Part 1 - Global feature importance**

In [None]:
# The PermutationExplainer approximates the Shapley values by iterating through
# permutations of the inputs.
# The arguments are:
# 1. the prediction function of the model in question
# 2. a masker, i.e. a background feature matrix that supplies replacement values
#   for a feature while its contribution is being measured. Here we use the
#   whole dataset to allow for maximum variability. Some argue that using only
#   the test set makes more sense.
# 3. the names of the features
permutation_explainer = shap.PermutationExplainer(automl.model.predict, X,
                                                  feature_names=X.columns)
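
If you are curious what "iterating through permutations" means, here is a minimal NumPy sketch of the idea. It is not shap's actual implementation (which is more refined), and `permutation_shapley`, `toy_predict` and `weights` are names invented for this illustration. For each random ordering of the features, we switch the features from background values to the values of the row we explain, one at a time, and credit each feature with the change in the average prediction.

```python
import numpy as np

def permutation_shapley(predict, x, background, n_permutations=20, seed=0):
    """Hypothetical helper: approximate Shapley values for a single row x."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[0]
    phi = np.zeros(n_features)
    for _ in range(n_permutations):
        order = rng.permutation(n_features)
        current = background.copy()      # all features "hidden" (background values)
        prev = predict(current).mean()
        for j in order:
            current[:, j] = x[j]         # reveal feature j
            new = predict(current).mean()
            phi[j] += new - prev         # marginal contribution of feature j
            prev = new
    return phi / n_permutations

# Toy linear model so that the exact answer is known
weights = np.array([2.0, -1.0, 0.5])

def toy_predict(X):
    return X @ weights

rng = np.random.default_rng(42)
background = rng.normal(size=(50, 3))
x = np.array([1.0, 2.0, 3.0])

phi = permutation_shapley(toy_predict, x, background)
print(phi)
print(phi.sum(), toy_predict(x[None, :])[0] - toy_predict(background).mean())
```

The two numbers on the last line agree: the contributions add up to the difference between this prediction and the average prediction over the background data. For a linear model, each contribution is simply the weight times the feature's distance from its background mean.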

In [None]:
# We select only a small subset of the test set to calculate the SHAP values
# since it takes quite a long time to calculate the permutations

# Note: we take .head(500) instead of .sample(500) so that everyone explains the
# same rows and the results are comparable (X_test is already shuffled)
shaps = permutation_explainer(X_test.head(500))

In [None]:
shap.summary_plot(shaps, feature_names=X.columns)

In [None]:
shap.plots.bar(shaps)
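
As a reading aid: with default settings, the bar plot summarizes each feature by its mean absolute SHAP value over the explained rows. The sketch below reproduces that calculation on a small invented matrix (`toy_values` and `toy_names` are made up; with the real data you would presumably start from `shaps.values`).

```python
import numpy as np

# Toy SHAP matrix: 4 explained rows x 3 features (numbers are invented)
toy_values = np.array([
    [ 0.5, -0.1,  0.0],
    [-0.7,  0.2,  0.1],
    [ 0.6, -0.3,  0.0],
    [-0.4,  0.2, -0.1],
])
toy_names = ["feature_a", "feature_b", "feature_c"]

# Global importance: average magnitude of each feature's contributions
mean_abs = np.abs(toy_values).mean(axis=0)

# Rank features from most to least important
ranking = [toy_names[i] for i in np.argsort(mean_abs)[::-1]]
for name in ranking:
    print(name, round(float(mean_abs[toy_names.index(name)]), 3))
```

Note that taking the absolute value first matters: a feature that pushes some predictions up and others down would otherwise average out to roughly zero.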

### TODO❗ What does this plot tell us? Which features are the most important ones? Which could maybe be left out?



---



# **Part 2 - Feature pruning and re-training**

### SHAP values can be harder to interpret if there are many correlated features.
This is because the feature importance tends to be spread out among those correlated features.
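
Here is a small sketch of that spreading effect, under simplifying assumptions: the "model" is a plain least-squares fit without intercept, the correlated feature is an exact copy, and we use the fact that for a linear model the SHAP value of feature j is its weight times (x_j minus the mean of x_j). All names are invented for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
y_toy = 3.0 * x1                        # the target depends on a single signal

X_single = x1.reshape(-1, 1)            # model sees the signal once
X_dup = np.column_stack([x1, x1])       # model sees the signal twice

# lstsq returns the minimum-norm solution, which splits the weight evenly
w_single, *_ = np.linalg.lstsq(X_single, y_toy, rcond=None)
w_dup, *_ = np.linalg.lstsq(X_dup, y_toy, rcond=None)

# SHAP values of the first row for each linear model
shap_single = w_single * (X_single[0] - X_single.mean(axis=0))
shap_dup = w_dup * (X_dup[0] - X_dup.mean(axis=0))

print(w_single, w_dup)
print(shap_single, shap_dup)
```

The total contribution is the same in both cases, but with the duplicated feature it is shared between two columns, so each one looks only half as important. Other model types may split the credit less evenly, but the dilution is the same in spirit.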

In [None]:
corr = X.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)

We can do a hierarchical clustering of the features with respect to the target variable, y.
Simply put, it is a way to estimate how redundant the features are with one another when predicting y, i.e. how much predictive power one feature adds on top of another.


In [None]:
clustering = shap.utils.hclust(X_train, y_train);

In [None]:
from scipy.cluster import hierarchy

fig = plt.figure(figsize=(10,6))
hierarchy.dendrogram(clustering, labels=X.columns, orientation='right');
plt.xlabel('Cluster distance (loosely 1 - corr)');

NOTE: The hierarchical clustering says little about the *actual* predictive power of a feature, as illustrated by the fact that hour_of_day is quite high up in the list in the dendrogram above (cf. the global feature importance of that feature).

Rather, this method relates the features to one another.
It tells us something more like: "we might not need **n_points**, since whatever predictive power it holds is also captured by **n_harsh_brakes**".
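
To illustrate what "relating the features to one another" can look like, here is a sketch that uses plain correlation as a stand-in for the distance. This is an assumption made for simplicity: the clustering above was fitted relative to y, and the dendrogram axis is only loosely 1 - corr. The feature names and data below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
base = rng.normal(size=n)
toy = {
    "feat_a": base,
    "feat_b": base + rng.normal(scale=0.1, size=n),  # near copy of feat_a
    "feat_c": rng.normal(size=n),                    # unrelated to the others
}
names = list(toy)
matrix = np.column_stack([toy[k] for k in names])

# Distance between features: 0 means perfectly (anti-)correlated
distance = 1.0 - np.abs(np.corrcoef(matrix, rowvar=False))
np.fill_diagonal(distance, np.inf)      # ignore each feature's distance to itself

# The closest pair is the most redundant one
i, j = np.unravel_index(np.argmin(distance), distance.shape)
closest_pair = {names[i], names[j]}
print(closest_pair, round(float(distance[i, j]), 3))
```

A pair with a small distance is a candidate for pruning: dropping one of the two should cost little, whichever of them the model currently ranks higher.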

We can also plot a SHAP bar plot using the information from the hierarchical clustering like so:

In [None]:
shap.plots.bar(shaps, clustering=clustering, clustering_cutoff=0.6,
               max_display=8)

### TODO❗ Remove at least 4 features and train a new model. Choose based on both the global feature importances and the information you can gain from the correlation study above.

In [None]:
features_to_drop = []

X_train_slimmed = X_train.drop(columns=features_to_drop)
X_test_slimmed = X_test.drop(columns=features_to_drop)
X_slimmed = X.drop(columns=features_to_drop)

In [None]:
automl_slimmed = AutoML()

automl_slimmed.fit(X_train=X_train_slimmed,
                   y_train=y_train,
                   **automl_settings)

In [None]:
# Print the best model
print(automl_slimmed.model.estimator)

# Performance of best model
print(round(r2_score(y_test, automl_slimmed.model.predict(X_test_slimmed)), 3))

### TODO❗ How is the R2-score impacted? Are you able to tell if any of the previously unimportant features are more important now due to pruning correlated features?

In [None]:
permutation_explainer_slimmed = shap.PermutationExplainer(
              automl_slimmed.model.predict, X_slimmed,
              feature_names=X_slimmed.columns)

In [None]:
shaps_slimmed = permutation_explainer_slimmed(X_test_slimmed.head(500))

In [None]:
shap.summary_plot(shaps_slimmed, feature_names=X_slimmed.columns)

# **Part 3 - Understanding individual predictions**

### TODO❗ Find two data instances that have quite different SHAP values and look at the waterfall plots
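
Hint: one possible way to search for such a pair is to compare the rows of the SHAP matrix with each other and pick the two that are furthest apart. The sketch below does this on an invented matrix `toy_values`; for the real data, the assumption is that `shaps.values` holds the corresponding 2D array (rows are instances, columns are features).

```python
import numpy as np

# Toy SHAP matrix: 6 instances x 3 features (invented numbers)
rng = np.random.default_rng(2)
toy_values = rng.normal(size=(6, 3))
toy_values[1] = [5.0, 5.0, 5.0]       # strongly pushed up
toy_values[4] = [-5.0, -5.0, -5.0]    # strongly pushed down

# Euclidean distance between every pair of rows
diff = toy_values[:, None, :] - toy_values[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Positions of the two most different instances
a, b = np.unravel_index(np.argmax(dist), dist.shape)
print(int(a), int(b), round(float(dist[a, b]), 2))
```

The result is a pair of row positions within the explained subset, which is what indexing like `shaps[index_1]` expects. Picking instances by eye from the summary plot works just as well.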

In [None]:
index_1 = ...
index_2 = ...

In [None]:
shap.waterfall_plot(shaps[index_1])

In [None]:
shap.waterfall_plot(shaps[index_2])

### TODO❗ What is the anatomy of this waterfall plot? In what way is it explaining this particular prediction?

Answer:

### TODO❗ Can you think of any use for the SHAP values other than understanding the model?

Answer:

# **Part 4 - Deep-dive on a feature**

### TODO❗ Use shap.plots.scatter to look at all the SHAP values of a feature.

If you want to color the dots by the value of another feature to look at how those features interact,
use color=shaps[:, feature]

In [None]:
shap.plots.scatter(shaps[:, '<...>'])

# or

# shap.plots.scatter(shaps[:, '<...>'], color = shaps[:, '<...>'])

# **Quick Bonus demo: TreeShap**

Let's train an XGBoost model so that we can try out the TreeExplainer. This is mostly for you to see the big difference in the time it takes to calculate the SHAP values!

In [None]:
xgb = xgboost.XGBRegressor(n_estimators=500,
                           max_depth=4,
                           min_child_weight=2,
                           random_state=411)
xgb.fit(X_train, y_train);

# r2_score expects (y_true, y_pred) in that order; R2 is not symmetric
print(round(r2_score(y_test, xgb.predict(X_test)), 3))

In [None]:
tree_explainer = shap.TreeExplainer(xgb, X,
                                    feature_names=X.columns)

In [None]:
treeshaps = tree_explainer(X_test.head(500))

In [None]:
shap.summary_plot(treeshaps, feature_names=X.columns)