Error with Pyspark GBTClassifier #884
@QuentinAmbard
That's almost the code from the unit test, so yes, it should run without error.
This also happens to me. It used to work with RandomForest in version 0.30.2. I'll try to revert and check again.
Hello, I'm having the same problem with SHAP version 0.32.1. As a test, I reproduced the unit test, copying the whole function test_pyspark_classifier_decision_tree().
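For readers landing here, a minimal sketch approximating that unit test, assuming a local Spark session; the exact data, column names, and parameters in the shap repo may differ:

```python
import pandas as pd
import shap
from sklearn.datasets import load_iris
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.master("local[2]").getOrCreate()

# Build a small labeled dataset and assemble features into one vector column
iris = load_iris()
cols = ["f0", "f1", "f2", "f3"]
pdf = pd.DataFrame(iris.data, columns=cols)
pdf["label"] = (iris.target == 0).astype(float)

df = spark.createDataFrame(pdf)
df = VectorAssembler(inputCols=cols, outputCol="features").transform(df)
model = DecisionTreeClassifier(labelCol="label", featuresCol="features").fit(df)

# shap explains the Spark model against a pandas DataFrame of raw features
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(pdf[cols])
```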
@caspiDoron You may try version 0.30.2. it works for me. |
Thanks, but it seems to work only for decision trees. Random forest fails the tests with `AssertionError: SHAP values don't sum to model output for class0!`, and GBT, which is the one I use, is not supported...
I just re-ran the unit test and something is indeed broken. It's a new check that was recently added and calls the predict function.
Thank you @QuentinAmbard, it is working with this workaround.
Great!
Thanks for checking into this @QuentinAmbard! I just pushed an updated tolerance check for additivity to address #887, but I suspect this might be a true error that this new check uncovered. Happy to help work through it on the PR.
…cies #884 add spark in setup.py tests and fix spark issue with additivity check
Hi. I can run the example code properly. How does shap_values know the difference between how the model was trained (features assembled into one vector column) and the data it is given to explain? Thank you.
shap_values takes a pandas DataFrame containing one column per feature. Once the model is trained, SHAP will explain it via shap_values(...). You have to convert your data into a pandas DataFrame to explain it. If your dataset is too big, you can easily create a Spark pandas UDF to run shap_values in a distributed fashion.
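A minimal sketch of that distributed approach, assuming Spark 3.x, four hypothetical feature columns, and an explainer that pickles cleanly for broadcasting (see the serialization discussion later in this thread); `spark`, `df`, and `explainer` are assumed to exist:

```python
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

feature_cols = ["f0", "f1", "f2", "f3"]                 # hypothetical feature names
explainer_bc = spark.sparkContext.broadcast(explainer)  # fitted shap.TreeExplainer

@F.pandas_udf(ArrayType(DoubleType()))
def shap_udf(f0: pd.Series, f1: pd.Series, f2: pd.Series, f3: pd.Series) -> pd.Series:
    # Rebuild the pandas frame in the same column order used at training time
    X = pd.concat([f0, f1, f2, f3], axis=1)
    X.columns = feature_cols
    # Assumes a single-output model; multi-class models return a list per class
    sv = explainer_bc.value.shap_values(X)
    return pd.Series(list(sv))  # one array of SHAP values per input row

df = df.withColumn("shap", shap_udf(*[F.col(c) for c in feature_cols]))
```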
Thanks @QuentinAmbard. I still wonder how shap_values knows which column of the pandas DataFrame corresponds to which element of the Spark feature vector the model was trained on.
I'm using the index of the features, so the order of the pandas columns must be the same as the order of the features added in the VectorAssembler of your Spark DataFrame. Probably worth mentioning in the documentation. https://github.com/slundberg/shap/blob/master/shap/explainers/tree.py#L951
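Illustrating that point with hypothetical column names (`raw_df` and `explainer` are assumptions): the pandas columns passed to shap_values must follow the same order as the VectorAssembler inputCols used for training.

```python
from pyspark.ml.feature import VectorAssembler

input_cols = ["age", "income", "tenure"]  # hypothetical feature names
assembler = VectorAssembler(inputCols=input_cols, outputCol="features")
train_df = assembler.transform(raw_df)    # raw_df: your training DataFrame

# When explaining, enforce the identical column order on the pandas side
X = raw_df.select(*input_cols).toPandas()[input_cols]
shap_values = explainer.shap_values(X)
```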
Hi All, |
@QuentinAmbard
If the model is a sklearn.ensemble.GradientBoostingClassifier there is no problem, but when it is a pyspark.ml.classification.GBTClassifier I get this error:
Any suggestions?
The explainer probably can't be serialized because we keep Spark references inside it. I'll try to have a look.
I tried passing the model to the calculate_shap function and computing the explainer inside the partition, but I get the same error.
I am trying to get shap to work with a pyspark GBT classifier. I got my features as a numpy array X and then tried (as in the example):
It gave the following error:
I did not provide a background dataset, so I don't understand why it wants me to provide a larger one. Also, the matrix X contains my entire training dataset, so I don't understand how it could fail to cover all the leaves in the model. Am I doing something obviously wrong? Then, when I tried using feature_perturbation="interventional", it gave a different error:
Also, if running the predictions with Spark is complicated to implement, it might be worth letting the user supply the expected predictions for validation.
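Since the original snippet isn't shown above, here's a hedged reconstruction of the two calls being discussed, where `model` is the fitted PySpark GBT and `X` is the numpy feature matrix:

```python
import shap

# Default path-dependent mode: no background data; conditional expectations
# come from the per-node sample counts recorded in the trees
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Interventional mode: requires an explicit background dataset
explainer = shap.TreeExplainer(model, data=X, feature_perturbation="interventional")
shap_values = explainer.shap_values(X)
```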
You should get this error when your tree is built with a leaf that has no data in it. If you are seeing it, I assume you are running SHAP on a model built from a small dataset?
You mean that not all paths to the leaves are the same length, so some of what would otherwise be intermediate nodes have no children (leaving those phantom child nodes empty)? Also, I created Issue #1192 to track spark predictions.
Is there any news about @amandolesi's error? Any new suggestions? Thank you.
I ran into the same problem; did you find a solution?
I haven't had time these last few weeks to track down exactly what's causing this. I suspect a reference to Spark is kept somewhere and breaks serialization of the tree explainer with a Spark model. I'll have a look when I get some time, but it shouldn't be a big deal, especially since serialization works with other models.
Hello Quentin, to recap and give you some additional feedback, I ran some tests using a local standalone instance of Spark. Since you mentioned a serialization error, I tried pickling a pyspark.ml.classification.RandomForestClassificationModel object (basically a fitted PySpark random forest) and got a Py4J error, the same one @amandolesi reported above. In explainers/tree.py, in the TreeExplainer class, line 695:
so this serialization problem propagates. I tried commenting out `self.original_model = model`, and I was then able to pickle the TreeExplainer object with a PySpark model. Of course it is a workaround, but predictions are not implemented for PySpark yet, so commenting out that line for the time being should not be an issue. What do you think?
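For anyone who would rather not patch the library, a sketch of the same workaround applied at runtime. It assumes the Spark reference lives at `explainer.model.original_model` (depending on the shap version it may sit on the explainer itself), and it sacrifices any prediction support that relies on the original model:

```python
import pickle
import shap

# spark_model: a fitted pyspark.ml tree model (construction works; pickling fails)
explainer = shap.TreeExplainer(spark_model)

# Drop the un-picklable Py4J/Spark reference before serializing
explainer.model.original_model = None
payload = pickle.dumps(explainer)  # now succeeds without a Py4J error
```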
Thanks @guidiandrea! Absolutely, that's what I had in mind too, but I still haven't found time to make the change :/ The original_model was indeed kept in order to implement predictions (#1192), but I think we should find another way that avoids breaking serialization with Spark models.
Here you are: Thank you @QuentinAmbard!
@slundberg I think we can now close this issue as everything should be solved with #1313 |
Looks like this should be closed @allard-jeff
I am still getting the `NotImplementedError: CategoricalSplit are not yet implemented` error. I am using shap==0.39.0 and Spark 3. I also got the same error using decision trees. Code:
@chengyin38 The issue is that SHAP cannot handle categorical splits. So, in the PySpark pre-processing, you really need to drop the metadata from the DataFrame that PySpark implicitly uses to decide that a feature is a categorical variable. It seems this can be done with df = df.rdd.toDF(). String indexing without categorical splits might not be an optimal approach to the modeling, however.
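A sketch of that workaround under stated assumptions (hypothetical columns and data): round-tripping through the RDD discards the ML attribute metadata, so the fitted trees use continuous splits only.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1.0, 0.0), ("b", 0.0, 1.0), ("a", 2.0, 0.0), ("c", 3.0, 1.0)],
    ["color", "x1", "label"],
)
df = StringIndexer(inputCol="color", outputCol="color_idx").fit(df).transform(df)

# StringIndexer attaches metadata marking color_idx as categorical; dropping
# it here means the trees are trained with continuous splits only, which
# avoids shap's NotImplementedError on CategoricalSplit
df = df.rdd.toDF()

df = VectorAssembler(inputCols=["color_idx", "x1"], outputCol="features").transform(df)
```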
I will also note that, in my experience, it was necessary to remove the one-hot-encoding vectors: I broke them out into individual binary features. After that (along with the step above), SHAP was able to run effectively on a PySpark tree model.
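A sketch of breaking out a one-hot vector into plain binary columns, assuming Spark >= 3.0 for `vector_to_array` and hypothetical column names (`ohe_vec`, `cat_is_*`):

```python
import pyspark.sql.functions as F
from pyspark.ml.functions import vector_to_array  # Spark >= 3.0

n_categories = 3  # hypothetical size of the one-hot vector
df = df.withColumn("ohe_arr", vector_to_array("ohe_vec"))
df = df.select(
    "*",
    *[F.col("ohe_arr")[i].alias(f"cat_is_{i}") for i in range(n_categories)],
).drop("ohe_vec", "ohe_arr")
# Feed the individual cat_is_* columns to the VectorAssembler instead of
# the original one-hot vector column
```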
@QuentinAmbard I am finding that with SHAP 0.39.0 this error continues with PySpark GBT. The weird part is that the error only seems to happen when a saved GBT is loaded; if I use the original model in memory, the error does not occur. Any ideas?
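A minimal repro sketch of that save/load behavior, with a hypothetical path and model variable:

```python
import shap
from pyspark.ml.classification import GBTClassificationModel

# gbt_model: a fitted GBTClassificationModel
gbt_model.write().overwrite().save("/tmp/gbt_model")
loaded = GBTClassificationModel.load("/tmp/gbt_model")

shap.TreeExplainer(gbt_model)  # in-memory model: works
shap.TreeExplainer(loaded)     # loaded model: reportedly raises the error above
```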
Any idea how to fix this?
I am trying to run this type of code:
Update:
Any progress with this problem? @QuentinAmbard
This issue has been inactive for two years, so it's been automatically marked as 'stale'. We value your input! If this issue is still relevant, please leave a comment below. This will remove the 'stale' label and keep it open. If there's no activity in the next 90 days the issue will be closed.
@QuentinAmbard
I just installed SHAP from PyPI (0.32.0), and running a version of your test still produces the same error, shown below. Is there something I am missing in using SHAP with a PySpark model?