
Question about missingness #175

Open
psinger opened this issue Jul 23, 2018 · 9 comments

@psinger

commented Jul 23, 2018

Hope this is the right place to ask a question like this, but I am trying to get my head around the missingness property. In one of the papers it is written that

If the simplified inputs represent feature presence, then missingness requires features missing in the original input to have no impact.

What exactly does missing mean in this context? If a feature is zero, it can of course still have an impact on the model, and in fact the SHAP values for zero-valued input features are also non-zero.

Thanks in advance for an explanation!

@slundberg


commented Jul 23, 2018

Yes, this is the right place. The missingness property is really just a minor bookkeeping property to close a loophole. It is required since local accuracy is specified as a linear model and x' could in theory have some zero entries (meaning the input is already missing as defined by h_x). These zero entries mean that local accuracy would still hold no matter what phi values correspond to those entries, so to have a unique solution we need to constrain them to be 0 (which is what we want, since they are missing already and so have no impact). In practice for SHAP we will never consider a feature to already be perfectly missing unless that feature's value is constant over the whole background dataset.
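
To make the loophole concrete, here is a minimal sketch with made-up numbers (not shap library code): when an entry of x' is zero, the corresponding term in the local-accuracy linear model vanishes, so any phi value there would satisfy local accuracy, and missingness is what pins it to 0.

# Minimal sketch with made-up numbers (not shap library code): local accuracy
# writes the explanation model as g(x') = phi_0 + sum_i phi_i * x'_i, so any
# phi_i paired with a zero entry of x' drops out of the sum.
import numpy as np

x_prime = np.array([1.0, 0.0, 1.0])   # simplified input; feature 1 is already missing
phi_0 = 0.1
phi_a = np.array([0.5, 0.0, -0.2])    # attributions with phi_1 = 0, as missingness requires
phi_b = np.array([0.5, 7.3, -0.2])    # same attributions with an arbitrary value for phi_1

g_a = phi_0 + phi_a @ x_prime
g_b = phi_0 + phi_b @ x_prime
assert np.isclose(g_a, g_b)           # local accuracy alone cannot tell them apart,
                                      # so missingness fixes phi_1 = 0 for uniqueness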

@psinger


commented Jul 24, 2018

Thanks. How is h_x actually defined, and what role does it play in calculating the SHAP values?

@slundberg


commented Jul 24, 2018

h_x is used to connect SHAP with LIME; for SHAP it essentially replaces each missing input with a missing-at-random value indicator (meaning the variable should be integrated out). But this is just a long way of saying that f(h_x(z')) = E[f(x) | x_S], where S is the set of non-zero entries in the binary vector z'.
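
As a rough sketch of that formula under the missing-at-random reading, one could estimate E[f(x) | x_S] by fixing the features in S to x's values and averaging the model output over a background dataset for the rest. The names here (model, background) are placeholders, not part of the shap API:

# Hedged sketch: estimate f(h_x(z')) = E[f(x) | x_S] by fixing the features in S
# to x's values and averaging the model output over a background dataset for the
# remaining (missing) features. `model` and `background` are placeholders.
import numpy as np

def expected_value_given_S(model, x, S, background):
    samples = np.array(background, dtype=float, copy=True)  # one row per background sample
    cols = list(S)
    samples[:, cols] = x[cols]               # features in S take the values from x
    return model.predict(samples).mean()     # the rest are integrated (averaged) out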

@liuyanguu


commented Sep 1, 2018

I happen to have the same question, and I don't really know how the SHAP value is produced for missing (NA) feature values. I compared the SHAP values for NA and non-NA feature values; the SHAP values for NA feature values are roughly 10 times smaller in magnitude. I guess the reason traces back to xgboost being a sparsity-aware algorithm?

@slundberg


commented Sep 1, 2018

@liuyanguu NA in XGBoost is just routed down one of the two branches whenever you are splitting on that feature. So as far as XGBoost is concerned, NA doesn't mean missing; it is just a special value to split on (which you might use to represent missing). Since NA is really just a special indicator, it does not always mean missing at random (which is what SHAP means by missing). When Tree SHAP says missing, it means integrating the variable out by following both branches of the tree splits, weighted by the number of training samples that went each way.

If NA does mark samples that are nearly missing at random, then their SHAP values will be very small, since the result of following the NA path will be similar to the expected output if you integrated over all paths accessible by toggling that feature.
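
One rough way to check this empirically (assuming xgboost and shap are installed; the data below is made up) is to train on inputs with NaNs injected at random and compare the typical magnitude of SHAP values on NaN versus non-NaN entries:

# Hedged, illustrative check (made-up data): when NaN entries really are close to
# missing at random, their SHAP values should be much smaller in magnitude on average.
import numpy as np
import xgboost
import shap

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(500)
X[rng.rand(500, 4) < 0.1] = np.nan            # inject NaNs roughly at random

model = xgboost.XGBRegressor(n_estimators=50, max_depth=3).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)

na_mask = np.isnan(X)
print("mean |SHAP| on NaN entries:    ", np.abs(shap_values[na_mask]).mean())
print("mean |SHAP| on non-NaN entries:", np.abs(shap_values[~na_mask]).mean())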

@IvanUkhov


commented Oct 3, 2018

@slundberg, as you mentioned, XGBoost has no problem with missing values, which, in my case, are encoded as NaNs. I wonder if Shap can also gracefully handle examples containing such values.

I see some NaNs in the output of shap_values. Is it expected? If not, what would you suggest doing?

@slundberg


commented Oct 3, 2018

@IvanUkhov that is not expected. I need to debug that. I think it is happening because of numerical precision issues on deep trees. See #152

@xervanyo


commented Mar 4, 2019

Hello,
I am currently trying to simulate the Tree SHAP missingness of a variable, as @slundberg mentioned: "integrating the variable out by following both branches of the tree splits weighted by the number of training samples that went each way."

Do you have an easy way to program that? I am really struggling.

Thank you

@slundberg


commented Mar 11, 2019

@xervanyo Here is an example:

def _conditional_expectation(tree, S, x):
    # Expected model output for sample x when only the features in S are "present";
    # all other features are integrated out over the training distribution.
    tree_ind = 0
    def R(node_ind):
        # R closes over tree_ind, so it walks whichever tree is selected in the loop below.
        f = tree.features[tree_ind, node_ind]
        lc = tree.children_left[tree_ind, node_ind]
        rc = tree.children_right[tree_ind, node_ind]
        if lc < 0:
            # Leaf node: return its value.
            return tree.values[tree_ind, node_ind]
        elif f in S:
            # Feature is present: follow the branch selected by x's value.
            if x[f] <= tree.thresholds[tree_ind, node_ind]:
                return R(lc)
            else:
                return R(rc)
        else:
            # Feature is missing: follow both branches, weighted by the number of
            # training samples that went each way.
            lw = tree.node_sample_weight[tree_ind, lc]
            rw = tree.node_sample_weight[tree_ind, rc]
            return (R(lc) * lw + R(rc) * rw) / (lw + rw)
    out = 0.0
    num_trees = tree.values.shape[0] if tree.tree_limit is None else tree.tree_limit
    for i in range(num_trees):
        tree_ind = i
        out += R(0)  # sum the conditional expectation over all trees in the ensemble
    return out
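
A hypothetical way to call this, assuming `tree` is the internal tree-ensemble object that TreeExplainer builds (the 2D indexing above suggests something like `explainer.model`; the exact attribute layout may differ between shap versions):

# Hypothetical usage of the sketch above; attribute names such as `explainer.model`
# are assumptions about shap's internals and may differ between versions.
import shap

explainer = shap.TreeExplainer(model)   # `model` is your fitted tree model
ensemble = explainer.model              # internal ensemble holding the 2D arrays used above
x = X[0]                                # the sample being explained
S = {0, 2}                              # indices of the features treated as present
print(_conditional_expectation(ensemble, S, x))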
