# XGBoost local interpretation

- [Online Course](https://www.trainindata.com/p/machine-learning-interpretability)

Here I show how to obtain local explanations for a regression model. The code for classification is the same; we only need to change the model class (for example, `XGBClassifier` instead of `XGBRegressor`).

In [1]:
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from xgboost import XGBRegressor

### Load data

In [2]:
# load the California House price data from Scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])

# display top 5 rows
X.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467


### Split data

In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((14448, 6), (6192, 6))

## XGBoost

In [4]:
# fit model

gbm = XGBRegressor(
    importance_type="gain",
    n_estimators=2,
    max_depth=2,
    random_state=3,
)

gbm.fit(X_train, y_train)

## Local explanations

In [5]:
# Display a few observations from test set

X_test.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
14740,4.1518,22.0,5.663073,1.075472,1551.0,4.180593
10101,5.7796,32.0,6.107226,0.927739,1296.0,3.020979
20566,4.3487,29.0,5.930712,1.026217,1554.0,2.910112
2670,2.4511,37.0,4.992958,1.316901,390.0,2.746479
15709,5.0049,25.0,4.319261,1.039578,649.0,1.712401


In [6]:
# pick one observation

sample_id = 14740

X_test.loc[sample_id]

MedInc           4.151800
HouseAge        22.000000
AveRooms         5.663073
AveBedrms        1.075472
Population    1551.000000
AveOccup         4.180593
Name: 14740, dtype: float64

In [7]:
# print the structure of each tree
# (with_stats=True adds the gain and cover of every node)

booster = gbm.get_booster()
booster.dump_model("dump.raw.txt", with_stats=True)

with open('dump.raw.txt') as file:
    print(file.read())

booster[0]:
0:[MedInc<5] yes=1,no=2,missing=2,gain=6050.60205,cover=14448
	1:[MedInc<3.07990003] yes=3,no=4,missing=4,gain=1505.21838,cover=11310
		3:leaf=-0.214339092,cover=5520
		4:leaf=0.00459926156,cover=5790
	2:[MedInc<7.80049992] yes=5,no=6,missing=6,gain=1120.00781,cover=3138
		5:leaf=0.286677152,cover=2599
		6:leaf=0.761799335,cover=539
booster[1]:
0:[MedInc<5.60379982] yes=1,no=2,missing=2,gain=3219.354,cover=14448
	1:[MedInc<3.55110002] yes=3,no=4,missing=4,gain=1047.51404,cover=12391
		3:leaf=-0.130962819,cover=7264
		4:leaf=0.0461317673,cover=5127
	2:[MedInc<6.80749989] yes=5,no=6,missing=6,gain=305.747559,cover=2057
		5:leaf=0.241858423,cover=1125
		6:leaf=0.474598616,cover=932



## Feature contribution

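The XGBoost dump only reports values at the leaves, not at the internal nodes, so we need to calculate those manually. The value of an internal node is the average of the values of its children, weighted by their cover.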
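The same cover-weighted averaging can be sketched in plain Python by parsing the text dump. This is only an illustration: the helper names `parse_dump` and `fill_values` and the regular expressions are my own, not part of XGBoost, and they assume the dump format shown in the output above.

```python
import re

# Tree dump copied from the output above (indentation does not matter here).
DUMP = """
booster[0]:
0:[MedInc<5] yes=1,no=2,missing=2,gain=6050.60205,cover=14448
1:[MedInc<3.07990003] yes=3,no=4,missing=4,gain=1505.21838,cover=11310
3:leaf=-0.214339092,cover=5520
4:leaf=0.00459926156,cover=5790
2:[MedInc<7.80049992] yes=5,no=6,missing=6,gain=1120.00781,cover=3138
5:leaf=0.286677152,cover=2599
6:leaf=0.761799335,cover=539
booster[1]:
0:[MedInc<5.60379982] yes=1,no=2,missing=2,gain=3219.354,cover=14448
1:[MedInc<3.55110002] yes=3,no=4,missing=4,gain=1047.51404,cover=12391
3:leaf=-0.130962819,cover=7264
4:leaf=0.0461317673,cover=5127
2:[MedInc<6.80749989] yes=5,no=6,missing=6,gain=305.747559,cover=2057
5:leaf=0.241858423,cover=1125
6:leaf=0.474598616,cover=932
"""

# Hypothetical patterns for the two kinds of lines in the dump.
SPLIT = re.compile(r"(\d+):\[.*\] yes=(\d+),no=(\d+),.*cover=([\d.]+)")
LEAF = re.compile(r"(\d+):leaf=([^,]+),cover=([\d.]+)")


def parse_dump(text):
    """Return one dict per tree, mapping node id -> node description."""
    trees = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("booster["):
            trees.append({})
            continue
        leaf = LEAF.match(line)
        split = SPLIT.match(line)
        if leaf:
            node_id, value, cover = leaf.groups()
            trees[-1][int(node_id)] = {"value": float(value), "cover": float(cover)}
        elif split:
            node_id, yes, no, cover = split.groups()
            trees[-1][int(node_id)] = {"yes": int(yes), "no": int(no), "cover": float(cover)}
    return trees


def fill_values(tree, node_id=0):
    """Set each internal node's value to the cover-weighted mean of its children."""
    node = tree[node_id]
    if "value" not in node:
        yes, no = tree[node["yes"]], tree[node["no"]]
        total = (fill_values(tree, node["yes"]) * yes["cover"]
                 + fill_values(tree, node["no"]) * no["cover"])
        node["value"] = total / (yes["cover"] + no["cover"])
    return node["value"]


trees = parse_dump(DUMP)
for i, tree in enumerate(trees):
    fill_values(tree)
    for node_id in (0, 1, 2):
        print(f"tree {i}, node {node_id}: {tree[node_id]['value']:.6f}")
```

Note that the leaf values keep the sign printed in the dump, and that the root value of each tree comes out very close to zero.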


In [8]:
# tree one
# value of an internal node = cover-weighted average of its children
# (leaf values keep their sign, exactly as printed in the dump)

x1_1 = (-0.214339092 * 5520 + 0.00459926156 * 5790) / 11310
print(f"{x1_1:.6f}")

-0.102257

In [9]:
x2_1 = (0.286677152 * 2599 + 0.761799335 * 539) / 3138
print(f"{x2_1:.6f}")

0.368287

In [10]:
x0_1 = (x1_1 * 11310 + x2_1 * 3138) / 14448
print(f"{x0_1:.6f}")

-0.000058

In [11]:
# tree two

x1_2 = (-0.130962819 * 7264 + 0.0461317673 * 5127) / 12391
x2_2 = (0.241858423 * 1125 + 0.474598616 * 932) / 2057
x0_2 = (x1_2 * 12391 + x2_2 * 2057) / 14448

print(f"{x1_2:.6f}, {x2_2:.6f}, {x0_2:.6f}")

-0.057687, 0.347310, -0.000026

Now we can proceed as we did for sklearn. The sample has MedInc = 4.1518, so in both trees it goes from the root to node 1 and ends in leaf 4:

In [12]:
# tree one

# feature contribution = value in child - value in parent

first_split = x1_1 - x0_1
second_split = 0.00459926156 - x1_1

MedianInc_t1 = first_split + second_split

print(f"{MedianInc_t1:.6f}")

0.004657

In [13]:
# tree two

first_split = x1_2 - x0_2
second_split = 0.0461317673 - x1_2

MedianInc_t2 = first_split + second_split

print(f"{MedianInc_t2:.6f}")

0.046158

In [14]:
# The contributions are cumulative
# (not the average like in random forests)

MedInc = MedianInc_t1 + MedianInc_t2

print(f"{MedInc:.6f}")

0.050815

In [15]:
# the bias: the base score (here it matches the mean of the
# training target) plus the root value of every tree

bias = y_train.mean() + x0_1 + x0_2

print(f"{bias:.6f}")

2.068162

In [16]:
# prediction, calculated manually

print(f"{bias + MedInc:.6f}")

2.118977

In [17]:
# prediction of xgboost

gbm.predict(X_test.loc[sample_id].to_frame().T)

array([2.1189773], dtype=float32)
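The decomposition can also be sketched as a small function that walks each tree and adds "value in child minus value in parent" to the feature used at each split. This is a minimal illustration, not XGBoost or eli5 code: the helpers `leaf`, `split` and `explain` are hypothetical, the trees are typed in from the dump, the base score is assumed to equal the training mean printed earlier, and missing values are ignored.

```python
from collections import defaultdict


def leaf(value, cover):
    return {"value": value, "cover": cover}


def split(feature, threshold, yes, no):
    # Internal node value = cover-weighted mean of its two children.
    cover = yes["cover"] + no["cover"]
    value = (yes["value"] * yes["cover"] + no["value"] * no["cover"]) / cover
    return {"feature": feature, "threshold": threshold,
            "yes": yes, "no": no, "value": value, "cover": cover}


# The two trees, copied from the dump.
trees = [
    split("MedInc", 5.0,
          split("MedInc", 3.07990003, leaf(-0.214339092, 5520), leaf(0.00459926156, 5790)),
          split("MedInc", 7.80049992, leaf(0.286677152, 2599), leaf(0.761799335, 539))),
    split("MedInc", 5.60379982,
          split("MedInc", 3.55110002, leaf(-0.130962819, 7264), leaf(0.0461317673, 5127)),
          split("MedInc", 6.80749989, leaf(0.241858423, 1125), leaf(0.474598616, 932))),
]


def explain(trees, x, base_score):
    """Return (bias, contributions) for one observation given as a dict."""
    bias = base_score
    contributions = defaultdict(float)
    for node in trees:
        bias += node["value"]  # root value of this tree
        while "feature" in node:  # descend until we reach a leaf
            go_yes = x[node["feature"]] < node["threshold"]
            child = node["yes"] if go_yes else node["no"]
            contributions[node["feature"]] += child["value"] - node["value"]
            node = child
    return bias, dict(contributions)


# Observation 14740 from the test set.
sample = {"MedInc": 4.1518, "HouseAge": 22.0, "AveRooms": 5.663073,
          "AveBedrms": 1.075472, "Population": 1551.0, "AveOccup": 4.180593}

base_score = 2.0682462451550387  # y_train.mean(), assumed to be the base score

bias, contributions = explain(trees, sample, base_score)
prediction = bias + sum(contributions.values())

print(round(bias, 4))                     # 2.0682
print(round(contributions["MedInc"], 4))  # 0.0508
print(round(prediction, 4))               # 2.119
```

The bias plus the contributions reproduces the XGBoost prediction for this observation, because the intermediate node values cancel out along the path and only the base score and the final leaf values remain.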

## Eli5

If you have problems importing eli5 with the latest version of sklearn, you need to find the eli5 files installed on your computer and change a few lines of code, as shown in [this commit](https://github.com/eli5-org/eli5/commit/840695d869e47b8e6cc05baca428d24881113fb6).

More details are available in [this issue](https://github.com/eli5-org/eli5/issues/39).

## Local explanations

In [18]:
import eli5

In [19]:
# feature contribution for 1 sample

eli5.show_prediction(
    gbm, 
    X_test.loc[sample_id], 
    feature_names=gbm.feature_names_in_,
)

Contribution?,Feature
0.051,MedInc
-0.0,<BIAS>
