is summary plot only for classification? #41
Comments
I've used shap and the summary plot for a house list price problem before, which is a regression, and the explanations work just fine and match what I would expect from a logical standpoint. For example, construction area, distance to certain places of interest, and the house's geographical sector were all top features. I don't have the plot at hand, but a mini app that uses an XGBoost model for house list price prediction (at least in my city) is available in my profile, albeit with some fixes that I still need to make.

From what I've understood, the Shapley value for each feature acts like a weight or coefficient, as in a regression. There's also the bias or intercept: this is the base value for the model's predictions, for example the average price of all houses in the dataset. For a single data point, each coefficient represents the impact of that feature on the final prediction. These coefficients and the intercept are added up, and for classification the sigmoid function is then applied to the sum; the result is the probability between 0 and 1 that the original model predicted. For regression models the process is the same, except that the sigmoid step is skipped, since the output isn't between 0 and 1 but continuous. @slundberg can give you better details though, so you should wait for his answer.
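The additive decomposition described above can be sketched with made-up numbers (the base value and per-feature contributions below are hypothetical, standing in for what a real explainer would return):

```python
import numpy as np

# Hypothetical base value and per-feature SHAP contributions for one house.
base_value = 12.5                          # e.g. average prediction over the dataset
shap_values = np.array([1.8, -0.4, 0.6])   # contributions of three features

# Regression: the prediction is simply base value + sum of contributions.
regression_prediction = base_value + shap_values.sum()

def sigmoid(x):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

# Classification: contributions are in log-odds, so a sigmoid maps the
# sum back to the probability the original model predicted.
log_odds = -0.3 + np.array([0.9, -0.2]).sum()   # toy base log-odds + contributions
probability = sigmoid(log_odds)
```

For the toy numbers above, the regression prediction is 12.5 + 2.0 = 14.5, and the classification probability lies strictly between 0 and 1.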
Thank you @JuanCorp, I think you are right. Even for classification, the log odds need to be converted in order to find the probability. The syntax I tried follows the classification example: `shap_values = shap.KernelExplainer(randomforest.predict, X_train).shap_values(X_test)` — is this the same as yours? At least now I can get the SHAP values.
Sorry for asking again: I sometimes get a runtime error when I use a different number of samples in my X_test (sometimes it is fine, but sometimes, if I use only 100 samples from the test set, this error occurs): `Exception in thread Thread-15`. Could you help me with this? Thank you!
@jayden526 SHAP values work well with regressions; in fact, the Boston housing example in the README is a least-squares regression. The SHAP values are in the same units as the model output (for Tree SHAP in XGBoost this is before the link function, such as the logistic). So if you are predicting dollars, then the units of the SHAP values will be dollars, and they will sum to the output of the model. As for the error: if there is a simple example of how you got it, please post it and I'll fix it. FYI, if you are using a tree model, I would suggest using XGBoost and getting the exact SHAP values rather than using the model-agnostic Kernel SHAP on scikit-learn.
@slundberg Thank you so much! I will definitely try XGBoost to see whether it works for me.
sounds good |
@slundberg Hi, thanks for the great package! I am not getting how to use my own dataset with shap. What is the purpose of `shap.datasets`, and how can I use my own dataset in the form of (X, y) with SHAP? Thanks :)
Do you have a model and a dataset, or just a dataset representing the output of the model? Perhaps clarifying what doesn't make sense about the examples in the README would be helpful.
Thank you for this amazing work. Just wondering: I want to identify variable importance using the summary plot, but my model is a tree-based regressor. I am not sure whether I understand the paper correctly; all the examples I found calculating SHAP values are classifications. Could you please clarify whether this can be used for regression? Thank you so much!