Fixing treeshap calculation when data is weighted (fixes #5095) #5097

Draft · wants to merge 2 commits into master

Conversation


@proto-n commented Mar 26, 2022

Fixing Issue #5095

The bug is described in the issue. Continuing with the same example, after this commit, we have:

import numpy as np
import lightgbm

X = np.array([[1, 2, 3, 4, 5, 6]*100, [1, 1.2, 0.3, 0.8, 0.4, 5]*100], dtype=np.float64).T
y = X.dot([0.5, 8])
w = np.array([1, 1, 1, 1, 1, 10]*100, dtype=np.float64)

data = lightgbm.Dataset(X)
data.set_label(y)
data.set_weight(w)

model = lightgbm.train(
    {
        'objective':'regression',
        'metric':'mse',
        'boosting':'gbdt',
        'num_leaves': 7,
        'learning_rate':0.001
    },
    data,
    num_boost_round=100,
)

model.predict(X, pred_contrib=True)

resulting in:

array([[-2.23944737,  0.08394158, 31.13999999],
       [-2.17359526,  0.218026  , 31.13999999],
       [-2.40050732, -0.19295458, 31.13999999],
       ...,
       [-2.24611195,  0.08108533, 31.13999999],
       [-2.28244961, -0.13963819, 31.13999999],
       [ 1.07870497,  0.05046017, 31.13999999]])

This is the correct behavior, now matching the expanded version (see the related issue for the expanded setup):

print(np.average(model.predict(X), weights=w)) # 31.139999991989136

Also, the contributions now equal those of the expanded version:

# see related issue for X_expand and model2
model.predict(X, pred_contrib=True)[0] # array([-2.23944737,  0.08394158, 31.13999999])
model2.predict(X_expand, pred_contrib=True)[0] # array([-2.23944737,  0.08394158, 31.13999999])

The commit modifies the codebase in two places. First, it changes the SHAP calculation to reference a newly created data_weight function. Second, I also needed to fix the weight calculation for internal nodes and the expected-value calculation: the internal weights were, for some reason, calculated incorrectly, and the expected value didn't consider weights either.
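As a sanity check of the internal-weight part of the fix, every internal node's weight should now equal the sum of its descendant leaves' weights for the regression example above. A minimal sketch, not code from the PR; it assumes the nested JSON layout returned by Booster.dump_model() (a 'tree_info' list whose internal nodes carry 'internal_weight' and 'left_child'/'right_child' sub-dicts, and whose leaf dicts carry 'leaf_value'/'leaf_weight'):

def subtree_weight(node):
    # Leaf dicts carry 'leaf_value'/'leaf_weight'; internal nodes carry
    # 'internal_weight' plus 'left_child'/'right_child' sub-dicts.
    if 'leaf_value' in node:
        return node['leaf_weight']
    total = subtree_weight(node['left_child']) + subtree_weight(node['right_child'])
    assert abs(total - node['internal_weight']) < 1e-3
    return total

for tree in model.dump_model()['tree_info']:
    subtree_weight(tree['tree_structure'])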

Funnily enough, beyond the SHAP calculations I couldn't find any other reference to either the expected value or the internal weights storage (other than serialization and cuda->cpu / cpu->cuda transfer). I may have missed something, though.

Also, it's not clear to me if anything else (like tests of some kind) is required to merge the PR. Please advise.

@ghost commented Mar 26, 2022

CLA assistant check
All CLA requirements met.

@proto-n (Author) commented Mar 26, 2022

Apparently, tests are failing: in the test cases, the contributions don't seem to sum up to the predictions. In my toy example they do:

print(np.linalg.norm(model.predict(X, pred_contrib=True).sum(axis=1)-model.predict(X))) # 6.018678656295655e-13

I will investigate why.

@jameslamb changed the title from "Fixing treeshap calculation when data is weighted (issue #5095)" to "Fixing treeshap calculation when data is weighted (fixes #5095)" on Mar 26, 2022
@StrikerRUS (Collaborator)

Kindly ping @slundberg for help.

@proto-n (Author) commented Mar 28, 2022

I've done some digging. The test I investigated is unweighted, and the error comes from the fact that, in the case of classification, the stored weights are not identical to the number of records.

Apparently, the left_weight and right_weight arguments don't simply mean the summed weight of the relevant data subsets: they come from the left_sum_hessian and right_sum_hessian fields of a SplitInfo object.

While this does appear to equal the summed data weight for a regression task, for a binary classification task (which the failing tests use) it's a different value.

Example tree from (unweighted) regression; note the leaf_weight and leaf_count lines:

Tree=0
num_leaves=6
num_cat=0
split_feature=0 1 1 0 0
split_gain=421979 2288.13 308.167 162 0.500004
threshold=5.5000000000000009 0.6000000000000002 1.1000000000000003 4.5000000000000009 3.5000000000000004
decision_type=2 2 2 2 2
left_child=1 3 4 -1 -3
right_child=-2 2 -4 -5 -6
leaf_value=31.112759993871052 31.151859993298849 31.117359994252521 31.119459994633992 31.114559993108113 31.117259993871052
leaf_weight=100 1000 100 100 100 100
leaf_count=100 1000 100 100 100 100
internal_value=31.14 31.1163 31.118 31.1137 31.1173
internal_weight=1500 500 300 200 200
internal_count=1500 500 300 200 200
is_linear=0
shrinkage=1

Example tree from (unweighted) classification:

Tree=8
num_leaves=2
num_cat=0
split_feature=0
split_gain=1470.49
threshold=5.5000000000000009
decision_type=2
left_child=-1
right_child=-2
leaf_value=-0.0029528957469360378 0.001494046189980353
leaf_weight=111.98318749666214 221.32959961891174
leaf_count=500 1000
internal_value=0
internal_weight=333.313
internal_count=1500
is_linear=0
shrinkage=0.001

Note that, while close, the ratio of 111.98318749666214 to 221.32959961891174 is not the same as 500 to 1000. Also note that my code only modifies the calculation of the internal_weight values and doesn't touch the leaf_weight values, so this discrepancy is not introduced by my change but inherited from the hessian values.
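For concreteness:

print(111.98318749666214 / 221.32959961891174)  # ~0.506
print(500 / 1000)                               # 0.5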

Unfortunately, the splitting part of the codebase is a bit esoteric to me and I'm afraid I'd need a significant time investment to understand it. I wonder if someone familiar with that part could clarify the role of hessians vs weights?

@proto-n (Author) commented May 10, 2022

It seems there hasn't been any activity on this for a while. Is there anything I can do to help?

@jameslamb (Collaborator)

Sorry for the delay and thank you for your help and patience!

@guolinke or @shiyu1994 are you able to help review this PR?

@guolinke (Collaborator) left a comment

LGTM

@StrikerRUS (Collaborator)

@guolinke Sorry, I guess you've overlooked the part of this PR about failing tests: #5097 (comment) and #5097 (comment).

@guolinke (Collaborator)

It seems the failing tests are related to the SHAP calculation. @proto-n, can you help fix the tests?

@proto-n (Author) commented May 17, 2022

@guolinke I would love to, but some of the relevant code falls outside my expertise. I've detailed the reasons for the failing tests in this comment. TL;DR: the fix works for regression, but the weight variables seem to mean something else for classification (and I'm not sure what, or whether that's correct behavior).

@guolinke (Collaborator) commented May 18, 2022

I see, thank you. In classification, the "weight" is actually the hessian, which depends on the gradient and is not constant.

Actually, what you need is the data sample weight, not the "weight" in the model file, which is the sum of hessians.
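To make that concrete: for binary log loss, the per-sample hessian is h_i = w_i * p_i * (1 - p_i), where p_i is the current predicted probability, so a node's stored "weight" (the sum of h_i over the samples it contains, per the comment above) tracks the summed sample weights only up to the varying p_i * (1 - p_i) factor. A minimal illustration with made-up values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

raw_scores = np.array([0.0, 0.5, -2.0])  # illustrative raw margins
weights = np.array([1.0, 1.0, 10.0])     # illustrative sample weights
p = sigmoid(raw_scores)
h = weights * p * (1.0 - p)              # what gets summed into the "weight" fields
print(h.sum(), weights.sum())           # ~1.535 vs 12.0 -- the sums differ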

@proto-n (Author) commented May 19, 2022

I see, thanks for the info!

Correct SHAP calculation needs the actual summed data weight that falls into each leaf. This means that, unfortunately, we either have to repurpose these variables to store it, or allocate additional fields in the tree structure for this purpose.
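For background on why the summed data weight is what matters: path-dependent TreeSHAP marginalizes a "missing" feature by descending both children of its split and weighting each branch by cover(child) / cover(parent), where cover is meant to be the amount of training data reaching the node. With weighted data, that cover must be the summed sample weight; if it is a summed hessian instead, every branch probability, and hence every attribution, is distorted.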

I couldn't find any reference to these variables being used internally other than in the SHAP calculation. Still, they end up serialized, printed, and publicly accessible from e.g. the Python API, so repurposing them would be something of a breaking change. Though I'm not sure the relevant documentation was very clear from the get-go.

What should be the course of action here?

@guolinke (Collaborator)

It is not trivial to store data weights in the tree structure, as it would affect training efficiency.

Although the hessian is not the weight, it is very close, so I would prefer hessian-based SHAP: we can treat the current output as correct and fix the ground-truth values in the tests. Besides, documentation about the hessian-based SHAP should be added as well.

@proto-n (Author) commented May 20, 2022

The thing is, this way the contribution values will violate the rule that sum(contributions) = prediction ("local accuracy"), which is one of the (four) fundamental requirements of SHAP. So I don't particularly like the idea, but ultimately it shouldn't be my decision.
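For reference, this is essentially the invariant the failing tests check, restated on the toy example from the top of the thread:

contribs = model.predict(X, pred_contrib=True)
# Local accuracy: per-row contributions (last column is the base value)
# must sum to the raw prediction.
assert np.allclose(contribs.sum(axis=1), model.predict(X))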

@guolinke (Collaborator)

I am not familiar with SHAP, so I cannot make the decision. Kindly ping @slundberg for help with data weights.

@StrikerRUS (Collaborator)

Maybe it's worth creating a cross-reference issue in the shap repository to make maintainers there aware of the problem?

There already seem to be a lot of issues about using sample weights with shap, e.g. shap/shap#2508, shap/shap#1328, shap/shap#2395.

@proto-n marked this pull request as draft on May 23, 2022
@proto-n (Author) commented May 23, 2022

Sorry for the possibly dumb question, but how do I ping @slundberg for help?

@jameslamb (Collaborator)

Sorry for the possibly dumb question, but how do I ping

Not a dumb question! @-mentions on GitHub are the main way we alert external contributors. To be clear, @guolinke wasn't instructing you to ping @slundberg ... he was sending such a notification via his comment.

@jameslamb (Collaborator)

It has been 3 months since the last activity on this PR. What needs to happen for it to move forward? Do we need an opinion from someone who is an expert in SHAP?

If that's the case, maybe we could tag in our friends from XGBoost for an opinion, or look at how XGBoost handles this?

@proto-n (Author) commented Sep 3, 2022

It has been 3 months since the last activity on this PR. What needs to happen for it to move forward? Do we need an opinion from someone who is an expert in SHAP?

If that's the case, maybe we could tag in our friends from XGBoost for an opinion, or look at how XGBoost handles this?

Unfortunately, I don't think XGBoost can help here: we already know the problem and the possible solutions; it's just that both are inconvenient.

The problem is that, for classification, LightGBM doesn't store the exact per-leaf (and per-node) summed data weights in the model. Instead, it stores summed hessians, which are close but not the same. Storing the actual weights is not trivial according to @guolinke, yet they are required for correct SHAP calculation.

Somebody (I'm not sure who) has to make a decision: either SHAP is calculated from the hessians (which breaks one of the fundamental properties of SHAP), or the actual weights have to be stored somehow (which is a nontrivial modification).
