<h1>Machine Learning Explainability - Kaggle tutorial</h1>
This notebook is inspired and heavily influenced by the excellent notebook by BEXGBoost:
<a href="https://www.kaggle.com/bextuychiev/model-explainability-with-shap-only-guide-u-need">Model Explainability with SHAP only guide u need</a>

<h2>Upvote if you find this notebook helpful!</h2>

In [None]:
import logging
import time
import warnings

import catboost as cb
import datatable as dt
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import shap
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore")
pd.set_option("float_format", "{:.5f}".format)

<h3>Loading Data</h3>

In [None]:
train = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-feb-2022/test.csv")

In [None]:
train.columns

In [None]:
train.head(10)

Shape of Train & Test

In [None]:
print(train.shape)
print(test.shape)

Checking for null values.

In [None]:
print(train.isnull().values.any())
print(test.isnull().values.any())

In [None]:
train.groupby('target').mean()

<h1>Permutation Importance</h1>

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Use only the first 10,000 rows; the last column is dropped from the features
X = train.iloc[0:10000, :-1]
features = [i for i in X.columns if i not in ["row_id", "target"]]

# Encode the class labels as integers
lb = LabelEncoder()
y = lb.fit_transform(train["target"])
y = y[0:10000]

X_train, X_valid, y_train, y_valid = train_test_split(X[features], y, random_state=1)
# my_model = RandomForestClassifier(n_estimators=100,
#                                   random_state=0).fit(X_train, y_train)
del train, test

In [None]:
#import eli5
#from eli5.sklearn import PermutationImportance

#perm = PermutationImportance(my_model, random_state=1).fit(X_valid, y_valid)
#eli5.show_weights(perm, feature_names=X_valid.columns.tolist())

In the previous version of this notebook, A2T1G3C4 and A1T1G4C4 had the highest permutation importance.
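Permutation importance measures how much a validation score drops when a single column is shuffled. The sketch below illustrates the idea with scikit-learn's `permutation_importance` on synthetic data. The `*_demo` arrays and the small random forest are assumptions made for this example only; they are not this notebook's data, model, or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): only column 0 determines the label
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)
rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each column of the validation set several times and record
# the average drop in accuracy caused by the shuffling
result = permutation_importance(rf_demo, X_va, y_va, n_repeats=5, random_state=1)

# Column indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)
print(result.importances_mean)
```

Shuffling the informative column should hurt accuracy far more than shuffling the noise columns, so it should come first in `ranking`.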

<h1>SHAP Values</h1>

In [None]:
model = xgb.XGBRegressor(n_estimators=1000, tree_method="gpu_hist").fit(
    X_train, y_train
)

In [None]:
# Create a tree explainer
xgb_explainer = shap.TreeExplainer(
    model, X_train, feature_names=X_train.columns.tolist()
)

In [None]:
xgb_explainer

In [None]:
%%time

# Shap values with tree explainer
shap_values = xgb_explainer.shap_values(X_train, y_train)

In [None]:
%%time

# Shap values with XGBoost core model
booster_xgb = model.get_booster()
shap_values_xgb = booster_xgb.predict(xgb.DMatrix(X_train, y_train), pred_contribs=True)

In [None]:
# Drop the last column, which holds the bias term rather than a feature contribution
shap_values_xgb = shap_values_xgb[:, :-1]

pd.DataFrame(shap_values_xgb, columns=X_train.columns.tolist()).head()
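With `pred_contribs=True`, XGBoost returns one column per feature plus a trailing bias column, and each row sums to the raw prediction for that sample. The toy sketch below mimics that layout with a hypothetical linear model, for which SHAP values (assuming independent features) have the closed form `w_i * (x_i - mean(x_i))`. The weights, intercept, and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(5, 3))
weights = np.array([2.0, -1.0, 0.5])  # hypothetical linear model
intercept = 0.3

preds = X_toy @ weights + intercept

# Closed-form SHAP values for a linear model with independent features
contribs = weights * (X_toy - X_toy.mean(axis=0))
# The bias is the average prediction over the background data
bias = preds.mean()

# Same layout as pred_contribs: feature columns followed by one bias column
contribs_with_bias = np.column_stack([contribs, np.full(len(X_toy), bias)])

# Additivity: contributions plus bias reconstruct every prediction
print(np.allclose(contribs_with_bias.sum(axis=1), preds))
```

Slicing with `[:, :-1]` therefore keeps only the per-feature contributions, which is what the plots expect.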

<h1>Feature Importances with SHAP</h1>

In [None]:
shap.summary_plot(
    shap_values_xgb, X_train, feature_names=X_train.columns, plot_type="bar"
)

A4T2G0C4 and A3T3G2C2 stand out as the features with the largest mean absolute SHAP values, i.e. the main drivers of the model's predictions.
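The bar plot shows the mean absolute SHAP value of each feature, so the same ranking can be computed directly. A minimal sketch on a made-up SHAP matrix (`toy_shap` and the feature names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP matrix: 3 samples (rows) x 3 features (columns)
toy_shap = np.array(
    [
        [0.5, -0.1, 0.0],
        [-0.7, 0.2, 0.1],
        [0.6, -0.3, -0.1],
    ]
)
toy_features = ["f_a", "f_b", "f_c"]

# Global importance = mean of |SHAP| over all samples, per feature
mean_abs = pd.Series(np.abs(toy_shap).mean(axis=0), index=toy_features)
ranking = mean_abs.sort_values(ascending=False)
print(ranking)
```

Applied to this notebook, the same two lines would take `shap_values_xgb` and `X_train.columns` in place of the toy inputs.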

In [None]:
shap.summary_plot(shap_values_xgb, X_train, feature_names=X_train.columns);

Interpretation of the above plot:

1) The left vertical axis lists the feature names, ordered by importance from top to bottom.

2) The horizontal axis shows the SHAP value of each prediction: its sign gives the direction, and its magnitude the size, of the feature's effect on the model output.

3) The color bar on the right encodes the actual value of a feature as it appears in the dataset, and the points are colored accordingly.

We see that as A3T3G2C2 increases, its effect on the model output becomes more positive. The same is true for the A3T3G3C1 feature. The A4T2G0C4 feature is trickier: it has a cluster of mixed points around the center, and the sign of its effect is mixed as it increases.
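The visual reading can be cross-checked numerically by correlating each feature with its own SHAP values. This is a rough sketch on made-up data; the helper name `shap_direction` is hypothetical. Pearson correlation captures only a linear trend, so a feature with a mixed pattern would show a coefficient near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
feat = rng.normal(size=(200, 2))

# Hypothetical SHAP values: column 0 rises with its feature, column 1 falls
toy_shap = np.column_stack([1.5 * feat[:, 0], -0.8 * feat[:, 1]])
toy_shap += rng.normal(scale=0.1, size=toy_shap.shape)


def shap_direction(values, shap_vals):
    """Pearson correlation between each feature and its own SHAP values."""
    return np.array(
        [
            np.corrcoef(values[:, j], shap_vals[:, j])[0, 1]
            for j in range(values.shape[1])
        ]
    )


direction = shap_direction(feat, toy_shap)
print(direction)
```

A coefficient close to +1 matches the "more positive as the feature increases" pattern, and a coefficient close to -1 matches the opposite.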

<h1>Feature Interactions with Shapley values - Part 2</h1>

One of the most useful properties of SHAP and Shapley values is that they can quantify how pairs of features interact, not just how each feature contributes on its own.

In [None]:
%%time

# SHAP interactions with XGB
interactions_xgb = booster_xgb.predict(
    xgb.DMatrix(X_train, y_train), pred_interactions=True
)

By setting pred_interactions to True, we get SHAP interaction values in only 10 seconds. The result is a 3D array of shape (n_samples, n_features + 1, n_features + 1), where the last row and column hold the bias terms:

In [None]:
interactions_xgb.shape
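To build intuition for what each per-sample matrix contains, the sketch below computes Shapley interaction values by brute force for a tiny hypothetical model. This is not the TreeSHAP algorithm XGBoost uses, and unlike the XGBoost output it has no bias row or column. `toy_model`, the instance `x`, and the all-zero `baseline` are assumptions made for illustration.

```python
import itertools
import math

import numpy as np


def toy_model(x):
    # Hypothetical model: two main effects plus one pairwise interaction; x[2] is unused
    return 2.0 * x[0] + 1.0 * x[1] + 3.0 * x[0] * x[1]


x = np.array([1.0, 1.0, 1.0])  # instance to explain
baseline = np.zeros(3)  # value used for "missing" features
M = len(x)


def value(subset):
    # Model output when only the features in `subset` take their real values
    z = baseline.copy()
    for k in subset:
        z[k] = x[k]
    return toy_model(z)


def subsets(items):
    for r in range(len(items) + 1):
        yield from itertools.combinations(items, r)


# Ordinary Shapley values
phi = np.zeros(M)
for i in range(M):
    others = [k for k in range(M) if k != i]
    for S in subsets(others):
        w = math.factorial(len(S)) * math.factorial(M - len(S) - 1) / math.factorial(M)
        phi[i] += w * (value(S + (i,)) - value(S))

# Shapley interaction values: off-diagonal entries first
Phi = np.zeros((M, M))
for i, j in itertools.permutations(range(M), 2):
    others = [k for k in range(M) if k not in (i, j)]
    for S in subsets(others):
        w = (
            math.factorial(len(S))
            * math.factorial(M - len(S) - 2)
            / (2 * math.factorial(M - 1))
        )
        Phi[i, j] += w * (
            value(S + (i, j)) - value(S + (i,)) - value(S + (j,)) + value(S)
        )

# Diagonal entries hold the main effects, so that each row sums to phi[i]
for i in range(M):
    Phi[i, i] = phi[i] - Phi[i].sum()

print(Phi)
```

Each row of `Phi` sums to the ordinary SHAP value of that feature. The effect of a pair is split equally between `Phi[i, j]` and `Phi[j, i]`, which is why pair strengths are usually computed as twice a single off-diagonal entry.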

In [None]:
def get_top_k_interactions(feature_names, shap_interactions, k):
    # Get the mean absolute contribution for each feature interaction
    aggregate_interactions = np.mean(np.abs(shap_interactions[:, :-1, :-1]), axis=0)
    interactions = []
    for i in range(aggregate_interactions.shape[0]):
        for j in range(aggregate_interactions.shape[1]):
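            # Visit each unordered pair once (below the diagonal); its effect is
            # split equally between [i][j] and [j][i], hence the factor of 2 below
            if j < i: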
                interactions.append(
                    (
                        feature_names[i] + "-" + feature_names[j],
                        aggregate_interactions[i][j] * 2,
                    )
                )
    # sort by magnitude
    interactions.sort(key=lambda x: x[1], reverse=True)
    interaction_features, interaction_values = map(tuple, zip(*interactions))

    return interaction_features[:k], interaction_values[:k]


top_10_inter_feats, top_10_inter_vals = get_top_k_interactions(
    X_train.columns, interactions_xgb, 10
)
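The nested loops in `get_top_k_interactions` can also be written with `np.tril_indices`, which returns the indices strictly below the diagonal. The function below is an alternative sketch, not the original helper. Its name and the toy array are hypothetical, and tied values may come out in a different order than with the loop version.

```python
import numpy as np


def top_k_interactions_vectorized(feature_names, shap_interactions, k):
    # Drop the bias row/column, then average absolute values over samples
    agg = np.abs(shap_interactions[:, :-1, :-1]).mean(axis=0)
    # Indices strictly below the diagonal: each unordered pair appears once
    rows, cols = np.tril_indices(agg.shape[0], k=-1)
    # Double each entry because the pair's effect is split across [i, j] and [j, i]
    strengths = 2 * agg[rows, cols]
    order = np.argsort(strengths)[::-1][:k]
    names = tuple(f"{feature_names[rows[o]]}-{feature_names[cols[o]]}" for o in order)
    return names, tuple(strengths[order])


# Toy input: 2 samples, 3 features plus a trailing bias row/column
toy = np.zeros((2, 4, 4))
toy[:, 1, 0] = toy[:, 0, 1] = [0.4, -0.2]  # pair f1-f0, mean |value| = 0.3
toy[:, 2, 0] = toy[:, 0, 2] = [0.1, 0.1]  # pair f2-f0, mean |value| = 0.1
toy[:, 2, 1] = toy[:, 1, 2] = [-0.5, 0.5]  # pair f2-f1, mean |value| = 0.5

names, vals = top_k_interactions_vectorized(["f0", "f1", "f2"], toy, 2)
print(names, vals)
```

On the toy array, the strongest pair is `f2-f1`, followed by `f1-f0`.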

<b>Now, top_10_inter_feats contains the 10 strongest interactions among all possible pairs of features:</b>

In [None]:
top_10_inter_feats

<h2>Hope you find this helpful!</h2>

<h1>Work in progress...</h1>