is summary plot only for classification? #41

Closed
jayden526 opened this issue Mar 6, 2018 · 8 comments

Comments

@jayden526

Thank you for this amazing work. I want to identify variable importance using the summary plot, but my model is a tree-based regressor. I am not sure I understood the paper correctly: all the examples I found that calculate SHAP values are classification problems. Could you please clarify whether this can also be used for regression? Thank you so much!

@JuanCorp
Contributor

JuanCorp commented Mar 6, 2018

I've used shap and the summary plot for a house list price problem before, which is a regression, and the explanations work just fine and match what I would expect logically. For example, construction area, distance to certain places of interest, and the house's geographical sector were all top features. I don't have the plot at hand, but a mini app that uses an XGBoost model for house list price prediction (at least in my city) is available on my profile, although it still needs some fixes.

From what I've understood, the Shapley value for each feature plays a role similar to a weighted term in a linear regression, and there is also a bias or intercept. This bias is the base value for the model's predictions, for example the average price of all houses in the dataset. For a single data point, each Shapley value represents the impact of that feature on the final prediction. For a classifier explained in log-odds, these values and the base value are added together and the sigmoid function is applied to the sum; the result is the prediction that the original model gave, which is a probability between 0 and 1. For regression models the process is the same, except that the sigmoid step is skipped, since the output is continuous rather than bounded between 0 and 1.
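The additive recipe described above can be sketched in a few lines. This is only an illustration with made-up numbers: `model_output`, the base values, and the per-feature contributions are all hypothetical and do not come from shap or from any real model.

```python
import math


def sigmoid(z):
    """Logistic function mapping a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def model_output(base_value, shap_values, link=None):
    """Base value plus per-feature contributions, optionally passed through a link."""
    margin = base_value + sum(shap_values)
    return link(margin) if link is not None else margin


# Hypothetical regression: contributions are in the units of the target (price).
price = model_output(200000.0, [35000.0, -12000.0, 4000.0])

# Hypothetical classifier explained in log-odds: apply the sigmoid at the end.
prob = model_output(-1.0, [0.5, 1.5, -1.0], link=sigmoid)

print(price, prob)
```

For the regression case nothing is applied after the sum, so the contributions can be read directly in the units of the prediction.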

@slundberg can give you better details, though, so you should wait for his input.

@jayden526
Author

Thank you @JuanCorp, I think you are right. Even for classification, the log odds need to be computed in order to find the probability. The syntax I tried is based on the classification example:

shap_values = shap.KernelExplainer(randomforest.predict, X_train).shap_values(X_test)
shap.summary_plot(shap_values, X_test)

Is this the same as yours? At least now I can get the SHAP values.
@slundberg Would you mind clarifying how shap_values work in regression? If it is already covered in your paper, please let me know and I can check there. Thank you.

@jayden526
Author

Sorry for asking again. I sometimes get a runtime error when I use a different number of samples in my X_test (sometimes it is fine, but sometimes, for example when I use only 100 samples of the test set, this error occurs):

Exception in thread Thread-15
RuntimeError: Set changed size during iteration

Could you help me with this? Thank you!
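For context, Python raises this message in general whenever a set is modified while it is being iterated over. Whether that is what happens inside shap here is not established in this thread, so the sketch below only reproduces the generic error and one common way to avoid it; the function names are hypothetical.

```python
def drop_negatives_unsafe(values):
    """Mutates the set while iterating over it, which raises RuntimeError."""
    for v in values:
        if v < 0:
            values.discard(v)
    return values


def drop_negatives_safe(values):
    """Iterates over a snapshot, so mutating the original set is fine."""
    for v in list(values):
        if v < 0:
            values.discard(v)
    return values


try:
    drop_negatives_unsafe({-1, 2, 3})
    raised = False
    message = ""
except RuntimeError as exc:
    raised = True
    message = str(exc)

print(raised, message)
print(drop_negatives_safe({-1, 2, 3}))
```

When the mutation happens in another thread, the error can appear intermittently, which would be consistent with it showing up only for some sample sizes, but that is a guess rather than a diagnosis.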

@slundberg
Collaborator

@jayden526 SHAP values work well with regressions; in fact, the Boston housing example in the README is a least squares regression. The SHAP values are in the same units as the model output (for Tree SHAP in XGBoost this is before the link function, such as a logistic). So if you are predicting dollars, the SHAP values will be in dollars and, together with the base value, will sum to the output of the model.
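To make the point about units concrete, here is a minimal sketch that computes exact Shapley values by brute force for a toy price model. It does not use shap: the model, the background rows, and the helper names are all hypothetical, and features are "removed" by averaging over the background data, which is one common convention rather than necessarily what any particular explainer does.

```python
from itertools import combinations
from math import factorial


# Hypothetical toy regressor: price in dollars from (area_m2, distance_km).
def predict(row):
    area, distance = row
    return 50000.0 + 1000.0 * area - 2000.0 * distance


# Hypothetical background data used to average out the "missing" features.
background = [(80.0, 10.0), (120.0, 5.0), (100.0, 15.0)]


def value(x, subset):
    """Average prediction when only the features in `subset` are fixed to x."""
    total = 0.0
    for b in background:
        row = tuple(x[i] if i in subset else b[i] for i in range(len(x)))
        total += predict(row)
    return total / len(background)


def shapley_values(x):
    """Exact Shapley values by enumerating every subset of the other features."""
    n = len(x)
    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            for s in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi += weight * (value(x, set(s) | {i}) - value(x, set(s)))
        phis.append(phi)
    return phis


x = (150.0, 2.0)
base_value = value(x, set())  # mean prediction over the background rows
phis = shapley_values(x)      # one contribution per feature, in dollars

print(base_value, phis, base_value + sum(phis), predict(x))
```

The base value plus the two contributions reproduces the model's dollar prediction for `x`, up to floating-point rounding, with no link function involved.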

As for the error, if there is a simple example of how you got it, please post it and I'll fix it.

FYI: if you are using a tree model, I would suggest using XGBoost and getting the exact SHAP values rather than using the model-agnostic Kernel SHAP on a scikit-learn model.

@jayden526
Author

@slundberg Thank you so much! I will definitely try XGBoost to see whether it works for me.

@slundberg
Collaborator

sounds good

@andymancodes

@slundberg Hi, thanks for the great package! I don't understand how to use my own dataset with shap. What is the purpose of *shap.dataset, and how can I use my own datasets in the form of (X, y) with SHAP? Thanks :)

@slundberg
Collaborator

Do you have a model and a dataset or just a dataset representing the output of the model? Perhaps clarifying what doesn't make sense about the examples in the README would be helpful.
