# xgboost local interpretation

- [Online Course](https://www.trainindata.com/p/machine-learning-interpretability)

Here I show how to make local interpretations for an XGBoost regression model. The code for classification is the same; we just need to change the model class.

In [1]:
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from xgboost import XGBRegressor

### Load data

In [2]:
# load the California House price data from Scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])

# display top 5 rows
X.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467


### Split data

In [3]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((14448, 6), (6192, 6))

## XGBoost

In [4]:
# fit model

gbm = XGBRegressor(
    importance_type="gain",
    n_estimators=2,
    max_depth=2,
    random_state=3,
)

gbm.fit(X_train, y_train)



## Local explanations

In [5]:
# Display a few observations from test set

X_test.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
14740,4.1518,22.0,5.663073,1.075472,1551.0,4.180593
10101,5.7796,32.0,6.107226,0.927739,1296.0,3.020979
20566,4.3487,29.0,5.930712,1.026217,1554.0,2.910112
2670,2.4511,37.0,4.992958,1.316901,390.0,2.746479
15709,5.0049,25.0,4.319261,1.039578,649.0,1.712401


In [6]:
# pick one observation

sample_id = 14740

X_test.loc[sample_id]

MedInc           4.151800
HouseAge        22.000000
AveRooms         5.663073
AveBedrms        1.075472
Population    1551.000000
AveOccup         4.180593
Name: 14740, dtype: float64

In [7]:
# print structure of tree

booster = gbm.get_booster()
booster.dump_model('dump.raw.txt', with_stats = True)

with open('dump.raw.txt') as file:
    print(file.read())

booster[0]:
0:[MedInc<5.02865028] yes=1,no=2,missing=1,gain=6054.625,cover=14448
	1:[MedInc<3.06684995] yes=3,no=4,missing=3,gain=1537.88281,cover=11381
		3:leaf=0.254831105,cover=5469
		4:leaf=0.47564581,cover=5912
	2:[MedInc<7.81515026] yes=5,no=6,missing=5,gain=1089.37695,cover=3067
		5:leaf=0.761726439,cover=2532
		6:leaf=1.23420525,cover=535
booster[1]:
0:[MedInc<5.58920002] yes=1,no=2,missing=1,gain=3207.57617,cover=14448
	1:[MedInc<3.54899979] yes=3,no=4,missing=3,gain=1051.9043,cover=12369
		3:leaf=0.197873905,cover=7255
		4:leaf=0.375582844,cover=5114
	2:[MedInc<6.81954956] yes=5,no=6,missing=5,gain=316.617188,cover=2079
		5:leaf=0.568677664,cover=1159
		6:leaf=0.806142509,cover=920



## Feature contribution

The xgboost dump gives values only at the leaves, not at the internal nodes, so we need to calculate those manually. The value of an internal node is the average of its children's values, weighted by their cover.


In [8]:
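# tree one
# value at node 1: cover-weighted average of its leaves (3 and 4),
# with the leaf values truncated to three decimals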

x1_1 = (.475*5912 + .254*5469) / 11381
x1_1

0.36880115982778316

In [9]:
x2_1 = (.761*2532 + 1.234*535) / 3067
x2_1

0.8435089664166938

In [10]:
x0_1 = (x1_1 * 11381 + x2_1 * 3067) / 14448
x0_1

0.4695714285714286

In [11]:
# tree two
# same cover-weighted averages, again with truncated leaf values

x1_2 = (.197*7255 + .375*5114) / 12369
x2_2 = (.568*1159 + .806*920) / 2079
x0_2 = (x1_2 * 12369 + x2_2 * 2079) / 14448

x1_2, x2_2, x0_2

(0.2705946317406419, 0.6733198653198652, 0.32854491971207084)
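The hand calculations above can also be done recursively. The following is a minimal sketch, not part of xgboost's API: the nested-dict layout and the `node_value` helper are hypothetical names, and the numbers are copied from the dump of tree one. It uses the full-precision leaf values, so the results differ slightly from the cells above, which truncate the leaves to three decimals.

```python
# Tree one, copied by hand from the dump above (hypothetical dict layout).
# Internal nodes list their children; leaves carry the leaf value.
tree_one = {
    "cover": 14448,
    "children": [
        {"cover": 11381, "children": [
            {"cover": 5469, "leaf": 0.254831105},
            {"cover": 5912, "leaf": 0.47564581},
        ]},
        {"cover": 3067, "children": [
            {"cover": 2532, "leaf": 0.761726439},
            {"cover": 535, "leaf": 1.23420525},
        ]},
    ],
}


def node_value(node):
    """Return the cover-weighted mean of the values below `node`."""
    if "leaf" in node:
        return node["leaf"]
    total = sum(child["cover"] * node_value(child) for child in node["children"])
    return total / node["cover"]


x1_1 = node_value(tree_one["children"][0])  # node 1
x2_1 = node_value(tree_one["children"][1])  # node 2
x0_1 = node_value(tree_one)                 # root

print(x1_1, x2_1, x0_1)
```

The same function works for tree two once its covers and leaves are entered in the same layout.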

Now, we can proceed as we did for sklearn:

In [12]:
# tree one

# feature contribution = value in child - value in parent

second_split = .475 - x1_1
first_split = x1_1 - x0_1

MedianInc_t1 = first_split + second_split

MedianInc_t1

0.005428571428571394

In [13]:
# tree two

second_split = .375 - x1_2
first_split = x1_2 - x0_2

MedianInc_t2 = first_split + second_split

MedianInc_t2

0.04645508028792916

In [14]:
# The contributions add up across trees
# (they are not averaged, as in random forests)

MedInc = MedianInc_t1 + MedianInc_t2

MedInc

0.051883651716500556

In [15]:
# the bias

bias =  x0_1 + x0_2

bias

0.7981163482834994

In [16]:
# prediction, calculated manually

bias + MedInc

0.85

In [17]:
# but we still need to add the base_score;
# according to the xgboost documentation, base_score is
# the initial prediction score of all instances (the global bias)

bias + MedInc + 0.5

1.35
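The whole decomposition can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how xgboost or eli5 implement it: the `leaf`, `split` and `explain` helpers and the dict layout are hypothetical, the trees are copied by hand from the dump above, and the leaf values are kept at full precision rather than truncated.

```python
def leaf(value, cover):
    return {"value": value, "cover": cover}


def split(feature, threshold, yes, no):
    # Internal node value = cover-weighted mean of the two children.
    cover = yes["cover"] + no["cover"]
    value = (yes["value"] * yes["cover"] + no["value"] * no["cover"]) / cover
    return {"feature": feature, "threshold": threshold,
            "yes": yes, "no": no, "value": value, "cover": cover}


trees = [
    split("MedInc", 5.02865028,
          split("MedInc", 3.06684995, leaf(0.254831105, 5469), leaf(0.47564581, 5912)),
          split("MedInc", 7.81515026, leaf(0.761726439, 2532), leaf(1.23420525, 535))),
    split("MedInc", 5.58920002,
          split("MedInc", 3.54899979, leaf(0.197873905, 7255), leaf(0.375582844, 5114)),
          split("MedInc", 6.81954956, leaf(0.568677664, 1159), leaf(0.806142509, 920))),
]


def explain(sample, trees):
    """Return (sum of root values, per-feature contributions) for one row."""
    tree_bias = 0.0
    contributions = {}
    for node in trees:
        tree_bias += node["value"]
        while "feature" in node:
            name = node["feature"]
            child = node["yes"] if sample[name] < node["threshold"] else node["no"]
            # feature contribution = value in child - value in parent
            contributions[name] = contributions.get(name, 0.0) + child["value"] - node["value"]
            node = child
    return tree_bias, contributions


base_score = 0.5  # value assumed from the cell above
tree_bias, contributions = explain({"MedInc": 4.1518}, trees)
prediction = base_score + tree_bias + sum(contributions.values())
print(tree_bias, contributions, prediction)
```

Because each contribution is a child value minus its parent value, the terms telescope along the path, so the total always equals `base_score` plus the two leaf values the observation lands in.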

The "base_score" is a funny parameter in xgboost for regression. See, for example, these discussions:

   - [xgboost issue](https://github.com/dmlc/xgboost/issues/799)
   - [stackoverflow](https://stackoverflow.com/questions/47596486/xgboost-the-meaning-of-the-base-score-parameter)

In [18]:
# if you can't find the value, check the parameters
# of your model

gbm.get_params()

{'objective': 'reg:squarederror',
 'base_score': 0.5,
 'booster': 'gbtree',
 'callbacks': None,
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'early_stopping_rounds': None,
 'enable_categorical': False,
 'eval_metric': None,
 'gamma': 0,
 'gpu_id': -1,
 'grow_policy': 'depthwise',
 'importance_type': 'gain',
 'interaction_constraints': '',
 'learning_rate': 0.300000012,
 'max_bin': 256,
 'max_cat_to_onehot': 4,
 'max_delta_step': 0,
 'max_depth': 2,
 'max_leaves': 0,
 'min_child_weight': 1,
 'missing': nan,
 'monotone_constraints': '()',
 'n_estimators': 2,
 'n_jobs': 0,
 'num_parallel_tree': 1,
 'predictor': 'auto',
 'random_state': 3,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'sampling_method': 'uniform',
 'scale_pos_weight': 1,
 'subsample': 1,
 'tree_method': 'exact',
 'validate_parameters': 1,
 'verbosity': None}

I probably agree with the GitHub discussion: for a regression problem, we should set the base_score to the mean of the target in the train set, to have more meaningful interpretations. Otherwise, it will be hard to explain to your stakeholders why you add 0.5.

But I leave this notebook as is, so we can actually have this discussion.

In [19]:
# prediction of xgboost

gbm.predict(X_test.loc[sample_id].to_frame().T)



array([1.3512286], dtype=float32)

## Eli5

If you have problems importing eli5 with the latest version of sklearn, go to the eli5 files stored on your local computer and change the lines shown in this commit: https://github.com/eli5-org/eli5/commit/840695d869e47b8e6cc05baca428d24881113fb6

More details also available here:
https://github.com/eli5-org/eli5/issues/39

In [20]:
import eli5

## Global explanations

In [21]:
eli5.show_weights(gbm, feature_names=gbm.feature_names_in_)

Weight,Feature
1.0,MedInc
0.0,AveOccup
0.0,Population
0.0,AveBedrms
0.0,AveRooms
0.0,HouseAge


In [22]:
pd.Series(gbm.feature_importances_,
          index=gbm.feature_names_in_).sort_values(ascending=False)

MedInc        1.0
HouseAge      0.0
AveRooms      0.0
AveBedrms     0.0
Population    0.0
AveOccup      0.0
dtype: float32

## Local explanations

In [23]:
eli5.show_prediction(
    gbm, 
    X_test.loc[sample_id], 
    feature_names=gbm.feature_names_in_,
)



Contribution?,Feature
0.8,<BIAS>
0.052,MedInc


In [24]:
X_test.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
14740,4.1518,22.0,5.663073,1.075472,1551.0,4.180593
10101,5.7796,32.0,6.107226,0.927739,1296.0,3.020979
20566,4.3487,29.0,5.930712,1.026217,1554.0,2.910112
2670,2.4511,37.0,4.992958,1.316901,390.0,2.746479
15709,5.0049,25.0,4.319261,1.039578,649.0,1.712401


In [25]:
eli5.show_prediction(
    gbm, 
    X_test.loc[10101], 
    feature_names=gbm.feature_names_in_,
)



Contribution?,Feature
0.8,<BIAS>
0.531,MedInc


In [26]:
# if we want to obtain a dataframe instead of the 
# html output

eli5.explain_prediction_df(
    gbm, 
    X_test.loc[10101], 
    feature_names=gbm.feature_names_in_,
)



,target,feature,weight,value
0,y,<BIAS>,0.799539,1.0
1,y,MedInc,0.530865,5.7796
