# LightGBMs

- [Online Course](https://www.trainindata.com/p/machine-learning-interpretability)

This notebook uses a regression model. For classification, the code is the same except for the model class (`LGBMClassifier` instead of `LGBMRegressor`).

If you have problems importing LightGBM, check out [this solution](https://stackoverflow.com/questions/76610527/cannot-import-lightgbm-error-with-pandas).

In [1]:
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

from lightgbm import LGBMRegressor

### Load data

In [2]:
# load the California House price data from Scikit-learn
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X = X.drop(columns=["Latitude", "Longitude"])

# display top 5 rows
X.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467


### Split data

In [3]:
# split the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((14448, 6), (6192, 6))

## LightGBM

In [4]:
# fit model

gbm = LGBMRegressor(
    importance_type="gain",
    n_estimators=5,
    max_depth=3,
    random_state=3,
)

gbm.fit(X_train, y_train)

## Eli5

If you have problems importing eli5 with the latest version of sklearn, locate the eli5 files installed on your computer and change a few lines of code as shown in [this commit](https://github.com/eli5-org/eli5/commit/840695d869e47b8e6cc05baca428d24881113fb6).

More details are also available [here](https://github.com/eli5-org/eli5/issues/39).

In [5]:
import eli5

## Global explanations

In [6]:
# Feature importance (global)

eli5.show_weights(gbm, feature_names=X_train.columns.to_list())

Weight,Feature
0.8378,MedInc
0.1308,AveOccup
0.0226,AveRooms
0.0088,HouseAge
0.0,Population
0.0,AveBedrms


In [7]:
# the gain importance returned by LightGBM
# is not normalized (it does not sum to 1)

pd.Series(gbm.feature_importances_,
          index=X_train.columns.to_list()).sort_values(ascending=False)

MedInc        31005.383511
AveOccup       4839.465012
AveRooms        837.246979
HouseAge        326.010893
AveBedrms         0.000000
Population        0.000000
dtype: float64

In [8]:
# sum of all importances

total = pd.Series(gbm.feature_importances_).sum()
total

37008.10639381409

In [9]:
# after dividing by the total, the values match the eli5 weights

pd.Series(gbm.feature_importances_ / total,
          index=X_train.columns.to_list()).sort_values(ascending=False)

MedInc        0.837800
AveOccup      0.130768
AveRooms      0.022623
HouseAge      0.008809
AveBedrms     0.000000
Population    0.000000
dtype: float64
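As a quick check of the claim above, the following sketch normalizes the raw gain values and compares them with the weights eli5 displayed. It is a minimal pure-Python illustration: the numbers are copied from the outputs in this notebook, and the tolerance of half a unit in the fourth decimal is an assumption that reflects eli5 showing four decimals.

```python
# Raw gain importances, copied from the LightGBM output above.
raw_gain = {
    "MedInc": 31005.383511,
    "AveOccup": 4839.465012,
    "AveRooms": 837.246979,
    "HouseAge": 326.010893,
    "AveBedrms": 0.0,
    "Population": 0.0,
}

# Weights as displayed by eli5.show_weights (four decimals).
eli5_weights = {
    "MedInc": 0.8378,
    "AveOccup": 0.1308,
    "AveRooms": 0.0226,
    "HouseAge": 0.0088,
    "AveBedrms": 0.0,
    "Population": 0.0,
}

# Divide each gain by the total so the importances sum to 1.
total_gain = sum(raw_gain.values())
normalized = {name: gain / total_gain for name, gain in raw_gain.items()}

# Assumption: a difference below 0.00005 is only display rounding.
all_match = all(
    abs(normalized[name] - eli5_weights[name]) < 5e-5 for name in raw_gain
)
print(all_match)
```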

## Local explanations

In [10]:
# Display a few observations from test set

X_test.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup
14740,4.1518,22.0,5.663073,1.075472,1551.0,4.180593
10101,5.7796,32.0,6.107226,0.927739,1296.0,3.020979
20566,4.3487,29.0,5.930712,1.026217,1554.0,2.910112
2670,2.4511,37.0,4.992958,1.316901,390.0,2.746479
15709,5.0049,25.0,4.319261,1.039578,649.0,1.712401


In [11]:
sample_id = 14740

In [12]:
eli5.show_prediction(
    gbm,
    X_test.loc[sample_id],
    feature_names=X_train.columns.to_list()
)

Contribution?,Feature
2.068,<BIAS>
-0.003,MedInc
-0.082,AveOccup


In [13]:
gbm.predict(X_test.loc[sample_id].to_frame().T)

array([1.98264895])

In [14]:
# manual calculation

# Bias + MedInc + AveOccup

2.068 - 0.003 - 0.082

1.9829999999999999
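The manual sum (1.983) and the model prediction (1.98264895) differ slightly because eli5 displays each contribution rounded to three decimals. The sketch below makes that check explicit using the values shown above. The tolerance is an assumption: each displayed term can be off by at most 0.0005, so the sum can be off by at most that amount per term.

```python
# Contributions as displayed by eli5 above (rounded to three decimals).
contributions = {"<BIAS>": 2.068, "MedInc": -0.003, "AveOccup": -0.082}

# Prediction returned by gbm.predict for the same observation.
model_prediction = 1.98264895

# The explanation is additive: bias plus every feature contribution.
reconstructed = sum(contributions.values())

# Assumed worst-case rounding error: 0.0005 per displayed term.
tolerance = 0.0005 * len(contributions)

within_rounding = abs(reconstructed - model_prediction) <= tolerance
print(round(reconstructed, 3), within_rounding)
```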

If you want to reproduce the calculation manually, you can obtain the values for each tree, node, and leaf like this:

In [15]:
booster = gbm.booster_

In [16]:
# print structure of tree

model = booster.dump_model()
model['tree_info']

[{'tree_index': 0,
  'num_leaves': 8,
  'num_cat': 0,
  'shrinkage': 1,
  'tree_structure': {'split_index': 0,
   'split_feature': 0,
   'split_gain': 6048.080078125,
   'threshold': 5.012150000000001,
   'decision_type': '<=',
   'default_left': True,
   'missing_type': 'None',
   'internal_value': 2.06825,
   'internal_weight': 0,
   'internal_count': 14448,
   'left_child': {'split_index': 1,
    'split_feature': 0,
    'split_gain': 1521.56005859375,
    'threshold': 3.0686000000000004,
    'decision_type': '<=',
    'default_left': True,
    'missing_type': 'None',
    'internal_value': 2.03443,
    'internal_weight': 11348,
    'internal_count': 11348,
    'left_child': {'split_index': 5,
     'split_feature': 2,
     'split_gain': 280.5539855957031,
     'threshold': 4.205543464345156,
     'decision_type': '<=',
     'default_left': True,
     'missing_type': 'None',
     'internal_value': 1.99651,
     'internal_weight': 5476,
     'internal_count': 5476,
     'left_child': {'

The calculation is similar to what we saw for sklearn, so we won't do it manually here.

LightGBM adds up the contribution of each tree to get the final outcome.

It is simpler than sklearn in that you do not need to multiply each tree's contribution by the learning rate: the leaf values stored in the model are already scaled.
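For readers who do want to try the manual route, here is a minimal sketch of a path-based decomposition over a structure shaped like `model['tree_info']`. The two toy trees and the observation are hypothetical, made up for illustration, and `explain_tree` is a helper name introduced here, not part of LightGBM or eli5. The sketch assumes numerical `<=` splits, no missing values, and that leaves carry a `leaf_value` key, as they do in the output of `dump_model()`.

```python
def explain_tree(node, x, feature_names):
    """Follow x down one tree; credit each change in value to the split feature."""
    bias = node["internal_value"]  # average output at the root of this tree
    current = bias
    contributions = {}
    while "leaf_value" not in node:
        name = feature_names[node["split_feature"]]
        # Assumption: numerical '<=' split and no missing values.
        branch = "left_child" if x[name] <= node["threshold"] else "right_child"
        node = node[branch]
        value = node["leaf_value"] if "leaf_value" in node else node["internal_value"]
        contributions[name] = contributions.get(name, 0.0) + (value - current)
        current = value
    return bias, contributions, current


# Hypothetical toy model with two shallow trees (not the model fitted above).
toy_tree_info = [
    {"tree_structure": {
        "split_feature": 0, "threshold": 5.0, "internal_value": 2.0,
        "left_child": {
            "split_feature": 1, "threshold": 3.0, "internal_value": 1.8,
            "left_child": {"leaf_value": 1.9},
            "right_child": {"leaf_value": 1.7},
        },
        "right_child": {"leaf_value": 2.5},
    }},
    {"tree_structure": {
        "split_feature": 1, "threshold": 2.5, "internal_value": 0.0,
        "left_child": {"leaf_value": 0.05},
        "right_child": {"leaf_value": -0.04},
    }},
]

feature_names = ["MedInc", "AveOccup"]
x = {"MedInc": 4.0, "AveOccup": 4.0}  # hypothetical observation

total_bias, total_contrib, prediction = 0.0, {}, 0.0
for tree in toy_tree_info:
    bias, contrib, leaf = explain_tree(tree["tree_structure"], x, feature_names)
    total_bias += bias
    prediction += leaf  # trees are summed, with no learning-rate factor
    for name, value in contrib.items():
        total_contrib[name] = total_contrib.get(name, 0.0) + value

print(round(total_bias, 3), {k: round(v, 3) for k, v in total_contrib.items()})
print(round(prediction, 3))
```

By construction, the bias plus the summed contributions equals the sum of the leaf values reached in each tree, which is the additive property the eli5 table relies on.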