Fixing treeshap calculation when data is weighted (fixes #5095) #5097

Draft · wants to merge 2 commits into master

Conversation


@proto-n commented Mar 26, 2022

Fixing Issue #5095

The bug is described in the issue. Continuing with the same example, after this commit, we have:

import numpy as np
import lightgbm

X = np.array([[1, 2, 3, 4, 5, 6]*100, [1, 1.2, 0.3, 0.8, 0.4, 5]*100], dtype=np.float64).T
y = X.dot([0.5, 8])
w = np.array([1, 1, 1, 1, 1, 10]*100, dtype=np.float64)

data = lightgbm.Dataset(X)
data.set_label(y)
data.set_weight(w)

model = lightgbm.train(
    {
        'objective':'regression',
        'metric':'mse',
        'boosting':'gbdt',
        'num_leaves': 7,
        'learning_rate':0.001
    },
    data,
    num_boost_round=100,
)

model.predict(X, pred_contrib=True)

resulting in:

array([[-2.23944737,  0.08394158, 31.13999999],
       [-2.17359526,  0.218026  , 31.13999999],
       [-2.40050732, -0.19295458, 31.13999999],
       ...,
       [-2.24611195,  0.08108533, 31.13999999],
       [-2.28244961, -0.13963819, 31.13999999],
       [ 1.07870497,  0.05046017, 31.13999999]])

This is the correct behavior, now matching the expanded version (see the related issue for the expanded setup):

print(np.average(model.predict(X), weights=w)) # 31.139999991989136

Also, the contributions now equal those of the expanded version:

# see related issue for X_expand and model2
model.predict(X, pred_contrib=True)[0] # array([-2.23944737,  0.08394158, 31.13999999])
model2.predict(X_expand, pred_contrib=True)[0] # array([-2.23944737,  0.08394158, 31.13999999])

The commit modifies the codebase in two places. First, it changes the SHAP calculation to reference a newly created data_weight function. Second, I also needed to fix the weight calculation for internal nodes and the expected-value calculation: the internal weights were, for some reason, calculated incorrectly, and the expected value didn't consider weights either.
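As a sanity check of the internal-weight part of the fix, every internal node's weight should now equal the sum of its descendant leaves' weights for the regression example above. A minimal sketch, not code from the PR; it assumes the nested JSON layout returned by Booster.dump_model() (a 'tree_info' list whose internal nodes carry 'internal_weight' and 'left_child'/'right_child' sub-dicts, and whose leaf dicts carry 'leaf_value'/'leaf_weight'):

def subtree_weight(node):
    # Leaf dicts carry 'leaf_value'/'leaf_weight'; internal nodes carry
    # 'internal_weight' plus 'left_child'/'right_child' sub-dicts.
    if 'leaf_value' in node:
        return node['leaf_weight']
    total = subtree_weight(node['left_child']) + subtree_weight(node['right_child'])
    assert abs(total - node['internal_weight']) < 1e-3
    return total

for tree in model.dump_model()['tree_info']:
    subtree_weight(tree['tree_structure'])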

Funnily enough, beyond the SHAP calculations I couldn't find any other reference to either the expected value or the internal weights storage (other than serialization and cuda->cpu / cpu->cuda transfer). I may have missed something, though.

Also, it's not clear to me if anything else (like tests of some kind) is required to merge the PR. Please advise.

@ghost commented Mar 26, 2022

CLA assistant check
All CLA requirements met.

@proto-n (Author) commented Mar 26, 2022

Apparently, tests are failing: in the test cases, the contributions don't seem to sum up to the predictions. In my toy example they do:

print(np.linalg.norm(model.predict(X, pred_contrib=True).sum(axis=1)-model.predict(X))) # 6.018678656295655e-13

I will investigate why.

@jameslamb changed the title from "Fixing treeshap calculation when data is weighted (issue #5095)" to "Fixing treeshap calculation when data is weighted (fixes #5095)" on Mar 26, 2022
@StrikerRUS (Collaborator)

Kindly ping @slundberg for help.

@proto-n (Author) commented Mar 28, 2022

I've done some digging. The test I investigated is unweighted, and the error comes from the fact that, in the case of classification, the stored weights are not identical to the number of records.

Apparently, the left_weight and right_weight arguments don't simply mean the summed weight of the relevant data subsets: they come from the left_sum_hessian and right_sum_hessian fields of a SplitInfo object.

While this does appear to equal the summed data weight for a regression task, for a binary classification task (which the failing tests use) it's a different value.

Example tree from (unweighted) regression; note the leaf_weight and leaf_count lines:

Tree=0
num_leaves=6
num_cat=0
split_feature=0 1 1 0 0
split_gain=421979 2288.13 308.167 162 0.500004
threshold=5.5000000000000009 0.6000000000000002 1.1000000000000003 4.5000000000000009 3.5000000000000004
decision_type=2 2 2 2 2
left_child=1 3 4 -1 -3
right_child=-2 2 -4 -5 -6
leaf_value=31.112759993871052 31.151859993298849 31.117359994252521 31.119459994633992 31.114559993108113 31.117259993871052
leaf_weight=100 1000 100 100 100 100
leaf_count=100 1000 100 100 100 100
internal_value=31.14 31.1163 31.118 31.1137 31.1173
internal_weight=1500 500 300 200 200
internal_count=1500 500 300 200 200
is_linear=0
shrinkage=1

Example tree from (unweighted) classification:

Tree=8
num_leaves=2
num_cat=0
split_feature=0
split_gain=1470.49
threshold=5.5000000000000009
decision_type=2
left_child=-1
right_child=-2
leaf_value=-0.0029528957469360378 0.001494046189980353
leaf_weight=111.98318749666214 221.32959961891174
leaf_count=500 1000
internal_value=0
internal_weight=333.313
internal_count=1500
is_linear=0
shrinkage=0.001

Note that, while close, the ratio of 111.98318749666214 to 221.32959961891174 is not the same as 500 to 1000. Also note that my code only modifies the calculation of the internal_weight values and doesn't touch the leaf_weight values, so this discrepancy is not introduced by my change but inherited from the hessian values.
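For concreteness:

print(111.98318749666214 / 221.32959961891174)  # ~0.506
print(500 / 1000)                               # 0.5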

Unfortunately, the splitting part of the codebase is a bit esoteric to me and I'm afraid I'd need a significant time investment to understand it. I wonder if someone familiar with that part could clarify the role of hessians vs weights?

@proto-n (Author) commented May 10, 2022

It seems there hasn't been any activity on this for a while. Is there anything I can do to help?

@jameslamb (Collaborator)

Sorry for the delay and thank you for your help and patience!

@guolinke or @shiyu1994 are you able to help review this PR?

@guolinke (Collaborator) left a comment

LGTM

@StrikerRUS (Collaborator)

@guolinke Sorry, I guess you've overlooked the part of this PR about failing tests: #5097 (comment) and #5097 (comment).

@guolinke (Collaborator)

It seems the failing tests are related to the SHAP calculation. @proto-n, can you help fix the tests?

@proto-n (Author) commented May 17, 2022

@guolinke I would love to, but some of the relevant code falls outside my expertise. I've detailed the reasons for the failing tests in this comment. TL;DR: the fix works for regression, but the weight variables seem to mean something else for classification (and I'm not sure what, or whether that's correct behavior).

@guolinke (Collaborator) commented May 18, 2022

I see, thank you. In classification, the "weight" is actually the hessian, which depends on the gradient and is not constant.

Actually, what you need is the data sample weight, not the "weight" in the model file, which is the sum of hessians.
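To make that concrete: for binary log loss, the per-sample hessian is h_i = w_i * p_i * (1 - p_i), where p_i is the current predicted probability, so a node's stored "weight" (the sum of h_i over the samples it contains, per the comment above) tracks the summed sample weights only up to the varying p_i * (1 - p_i) factor. A minimal illustration with made-up values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

raw_scores = np.array([0.0, 0.5, -2.0])  # illustrative raw margins
weights = np.array([1.0, 1.0, 10.0])     # illustrative sample weights
p = sigmoid(raw_scores)
h = weights * p * (1.0 - p)              # what gets summed into the "weight" fields
print(h.sum(), weights.sum())           # ~1.535 vs 12.0 -- the sums differ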

@proto-n (Author) commented May 19, 2022

I see, thanks for the info!

Correct SHAP calculation needs the actual summed data weight that falls into each leaf. This means that, unfortunately, we either have to repurpose these variables to store it, or allocate additional fields in the tree structure for this purpose.
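For background on why the summed data weight is what matters: path-dependent TreeSHAP marginalizes a "missing" feature by descending both children of its split and weighting each branch by cover(child) / cover(parent), where cover is meant to be the amount of training data reaching the node. With weighted data, that cover must be the summed sample weight; if it is a summed hessian instead, every branch probability, and hence every attribution, is distorted.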

I couldn't find any reference to these variables being used internally other than in the SHAP calculation. Still, they end up serialized, printed, and publicly accessible from e.g. the Python API, so repurposing them would be something of a breaking change. Though I'm not sure the relevant documentation was very clear from the get-go.

What should be the course of action here?

@guolinke (Collaborator)

It is not trivial to store data weights in the tree structure, as it would affect training efficiency.

Although the hessian is not the weight, it is very close, so I would prefer hessian-based SHAP: we can treat the current output as correct and fix the ground-truth values in the tests. Besides, documentation about the hessian-based SHAP should be added as well.

@proto-n (Author) commented May 20, 2022

The thing is, this way the contribution values will violate the rule that sum(contributions) = prediction ("local accuracy"), which is one of the (four) fundamental requirements of SHAP. So I don't particularly like the idea, but ultimately it shouldn't be my decision.
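For reference, this is essentially the invariant the failing tests check, restated on the toy example from the top of the thread:

contribs = model.predict(X, pred_contrib=True)
# Local accuracy: per-row contributions (last column is the base value)
# must sum to the raw prediction.
assert np.allclose(contribs.sum(axis=1), model.predict(X))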

@guolinke (Collaborator)

I am not familiar with SHAP, so I cannot make the decision. Kindly ping @slundberg for help with data weights.

@StrikerRUS (Collaborator)

Maybe it's worth creating a cross-reference issue in the shap repository to make maintainers there aware of the problem?

There already seem to be a lot of issues about using sample weights with shap, e.g. shap/shap#2508, shap/shap#1328, shap/shap#2395.

@proto-n marked this pull request as draft on May 23, 2022
@proto-n (Author) commented May 23, 2022

Sorry for the possibly dumb question, but how do I ping @slundberg for help?

@jameslamb (Collaborator)

Sorry for the possibly dumb question, but how do I ping

Not a dumb question! @-mentions on GitHub are the main way we alert external contributors. To be clear, @guolinke wasn't instructing you to ping @slundberg ... he was sending such a notification via his comment.

@jameslamb (Collaborator)

It has been 3 months since the last activity on this PR. What needs to happen for it to move forward? Do we need an opinion from someone who is an expert in SHAP?

If that's the case, maybe we could tag in our friends from XGBoost for an opinion, or look at how XGBoost handles this?

@proto-n (Author) commented Sep 3, 2022

It has been 3 months since the last activity on this PR. What needs to happen for it to move forward? Do we need an opinion from someone who is an expert in SHAP?

If that's the case, maybe we could tag in our friends from XGBoost for an opinion, or look at how XGBoost handles this?

Unfortunately, I don't think XGBoost can help here: we already know the problem and the possible solutions; it's just that both are inconvenient.

The problem is that, for classification, LightGBM doesn't store the exact per-leaf (and per-node) summed data weights in the model. Instead, it stores summed hessians, which are close but not the same. Storing the actual weights is not trivial according to @guolinke, yet they are required for correct SHAP calculation.

Somebody (I'm not sure who) has to make a decision: either SHAP is calculated from the hessians (which breaks one of the fundamental properties of SHAP), or the actual weights have to be stored somehow (which is a nontrivial modification).
