Does SHAP give very high importance to outliers? #960
Comments
Good questions. In the linear model SHAP does indeed give high importance to outlier feature values. This is correct (in my opinion) because the linear model also gives very high importance to those values. You can always multiply a feature by 100 and then divide its corresponding coefficient by 100 and leave a linear model unchanged. Since this does not change the model outputs, it should also not change the explanations when those explanations are in the units of the model output (as opposed to the units of the feature). LIME tabular is actually also likely to change in response to outliers, but perhaps less so because of some details about binning etc. Also note that tree-based models are less sensitive to outliers than linear models, so the effect will be smaller there.
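A minimal numpy/sklearn sketch of that invariance argument, using the linear SHAP formula coef * (X - X.mean(0)) discussed later in this thread (the data and model here are synthetic, not from the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: two features and an exactly linear target.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = 3 + 0.08 * X[:, 0] + 0.9 * X[:, 1]

# Fit the original model and a reparameterized one (feature 0 multiplied by 100).
model = LinearRegression().fit(X, y)
X_scaled = X.copy()
X_scaled[:, 0] *= 100
model_scaled = LinearRegression().fit(X_scaled, y)  # its coef_[0] is ~100x smaller

# Linear SHAP values: coef * (x - mean(x)). Both parameterizations make identical
# predictions, so their SHAP values (in model-output units) are identical too.
shap_orig = model.coef_ * (X - X.mean(0))
shap_scaled = model_scaled.coef_ * (X_scaled - X_scaled.mean(0))
print(np.allclose(shap_orig, shap_scaled))  # True
```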
Thanks for your reply @slundberg . The explanation I gave in the form of the linear equation was just to make the discussion more interpretable. But if you look at the notebooks' results you can see that they were generated from a random forest model. There too I find that introducing an outlier gives that feature a high SHAP value, which means a higher feature importance. Could you please confirm the following things?
There are a few tricky things to keep in mind here, and the best answer will be for me to finish up my tutorial on SHAP that I have been working on. But a few shorter answers here:
Thanks @slundberg again for your immediate reply.
These are good questions that really need a longer answer than can fit here. I pushed a draft version of a tutorial-style notebook that starts by exploring SHAP on simple linear models. Check it out and see if it helps; I wrote some of it with the questions you had here in mind :)
Hi @slundberg, one last question. I am solving an NLP regression problem where, given a text response, I need to predict a score. Along with the predictions I need to identify enablers and disablers. Enablers are the words (or phrases or sentences) that have a positive effect on the output score, i.e. including the word increases the score. Disablers are the words (or phrases or sentences) that have a negative effect on the output score, i.e. including the word decreases the score. Now, can we treat words with positive SHAP values as enablers? If so, can we say that a word with a higher positive SHAP value is a stronger enabler than a word with a lower positive SHAP value? Can we say the same for disablers?
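In case it helps frame the question, here is a minimal sketch of the split I have in mind, assuming per-word SHAP values are already available from some explainer (the words and values below are invented):

```python
import numpy as np

# Hypothetical per-word SHAP values for one text response (in model-output units).
words = np.array(["excellent", "however", "unclear", "thorough", "missing"])
shap_values = np.array([0.8, 0.05, -0.4, 0.3, -0.6])

# Rank words by magnitude of their contribution, then split by sign.
order = np.argsort(-np.abs(shap_values))
enablers = [(w, v) for w, v in zip(words[order], shap_values[order]) if v > 0]
disablers = [(w, v) for w, v in zip(words[order], shap_values[order]) if v < 0]
print(enablers)   # words pushing the predicted score up, strongest first
print(disablers)  # words pushing the predicted score down, strongest first
```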
Hi @slundberg , I have some confusion about using SHAP values to compute feature importance. You have already mentioned that for linear regression the SHAP value equation is:
shap_values = regression_coefficients * (X - X.mean(0))
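A quick sketch checking that formula on synthetic data (hand-computed with numpy, not shap's own LinearExplainer), including the additivity property that each row's SHAP values sum to the prediction minus the mean prediction:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data just to exercise the formula; not the notebook from this thread.
rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + 1.0
model = LinearRegression().fit(X, y)

# Hand-computed linear SHAP values from the fitted coefficients.
shap_values = model.coef_ * (X - X.mean(0))

# Additivity check: per-row SHAP values sum to prediction minus mean prediction.
pred = model.predict(X)
print(np.allclose(shap_values.sum(axis=1), pred - pred.mean()))  # True
```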
Seeing that equation, I find that a feature's importance for a data point is directly proportional to how far that feature's value sits from its mean, so for large values it is essentially proportional to the feature's absolute value. This means that if there is an outlier (a very large value in X) for some feature, that feature may come out as the most important simply because of its large X value.
For example, suppose we have a fitted regression line y_hat = b0 + b1 * x1 + b2 * x2 with b0 = 3, b1 = 0.08, and b2 = 0.9, which gives y_hat = 3 + 0.08 * x1 + 0.9 * x2.
Now let's say we have one data point with x1 = 900 and x2 = 5 for which I want to calculate the SHAP values.
Here we see that, despite b1 having a very small value, feature 1 comes out as the most important because of the very large value of x1; and despite b2 having a much larger value, feature 2 comes out as less important than feature 1 because of its smaller absolute value for that data point.
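A quick numeric sketch of that data point; the feature means used here (100 for x1, 4 for x2) are made-up assumptions just to make the arithmetic concrete:

```python
import numpy as np

b = np.array([0.08, 0.9])        # b1, b2 from the example above
x = np.array([900.0, 5.0])       # the data point in question
x_mean = np.array([100.0, 4.0])  # hypothetical feature means over the dataset

shap_values = b * (x - x_mean)
print(shap_values)  # [64.   0.9]: x1 dominates despite its small coefficient
```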
To describe the issue, I have compared the case with both SHAP and LIME.
I changed the value of Capital Loss from 0 to 20,000 (some arbitrarily large number). Doing so, I found that Capital Loss becomes the most important variable for SHAP, because SHAP takes the (now very large) value of the feature into account. LIME, on the other hand, calculates the feature importances from the regression coefficients, which is why the outlier did not affect it as much; the overall result still changes, though, because introducing the outlier changes the fitted regression coefficients.
Is my thought process correct? If yes, is it a good measure to consider the absolute value of a feature when calculating its feature importance? If not, it would be really helpful if you could help me understand the difference between these two results.
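For reference, a rough sketch of the kind of outlier experiment described above, on synthetic data rather than the actual notebook (it assumes shap's TreeExplainer and a scikit-learn random forest):

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data and a random forest fit to it.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.5, 1.0, -0.5, 2.0]) + rng.normal(scale=0.1, size=300)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Push one feature of one row far outside its training range (like Capital Loss -> 20,000).
x_outlier = X[0].copy()
x_outlier[0] = 20_000.0

explainer = shap.TreeExplainer(model)
sv_normal = explainer.shap_values(X[0:1])
sv_outlier = explainer.shap_values(x_outlier.reshape(1, -1))
print(sv_normal[0])   # attributions for the original row
print(sv_outlier[0])  # attributions after the outlier; trees saturate, so the shift is bounded
```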