Error with Pyspark GBTClassifier #884

Open
allard-jeff opened this issue Nov 7, 2019 · 36 comments · May be fixed by #2700 or #3384
Labels
stale Indicates that there has been no recent activity on an issue

Comments

@allard-jeff

@QuentinAmbard

I just installed Shap from PyPi (0.32.0) and running a version of your test still produces the same error - shown below. Is there something that I am missing in the use of Shap with a pyspark model?

import pyspark
print(pyspark.__version__)
import shap
print(shap.__version__)
import sklearn.datasets
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier, GBTClassifier
import numpy as np  # np.c_ is used below
import pandas as pd
iris_sk = sklearn.datasets.load_iris()
iris = pd.DataFrame(data= np.c_[iris_sk['data'], iris_sk['target']], columns= iris_sk['feature_names'] + ['target'])[:100]
spark = SparkSession.builder.config(conf=SparkConf().set("spark.master", "local[*]")).getOrCreate()

col = ["sepal_length","sepal_width","petal_length","petal_width","type"]
iris = spark.createDataFrame(iris, col)
iris = VectorAssembler(inputCols=col[:-1],outputCol="features").transform(iris)
iris = StringIndexer(inputCol="type", outputCol="label").fit(iris).transform(iris)

classifier = GBTClassifier(labelCol="label", featuresCol="features")
model = classifier.fit(iris)
explainer = shap.TreeExplainer(model)
X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100] # pylint: disable=E1101
shap_values = explainer.shap_values(X)

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-31-f47b3a56c25f> in <module>
     23 explainer = shap.TreeExplainer(model)
     24 X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100] # pylint: disable=E1101
---> 25 shap_values = explainer.shap_values(X)

/mnt1/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in shap_values(self, X, y, tree_limit, approximate, check_additivity)
    283 
    284         if check_additivity and self.model_output == "margin":
--> 285             self.assert_additivity(out, self.model.predict(X))
    286 
    287         return out

/mnt1/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in predict(self, X, y, output, tree_limit)
    785             import pyspark
    786             #TODO support predict for pyspark
--> 787             raise NotImplementedError("Predict with pyspark isn't implemented")
    788 
    789         # see if we have a default tree_limit in place.

NotImplementedError: Predict with pyspark isn't implemented
@allard-jeff
Author

@QuentinAmbard
Has anyone else run this code successfully?

@QuentinAmbard
Contributor

That's almost the code from the unit test, so yes it should run without error.
I'll try to debug that this week, maybe there is an issue with 0.32.0 ...

@Ekkalak-T

Ekkalak-T commented Nov 11, 2019

This also happens to me. It used to work with RandomForest in version 0.30.2.

I'll try reverting and check again.

@caspiDoron

Hello, I'm having the same problem with shap-0.32.1.
I also tried previous versions, but I get the same error with every version since the commit that added GBT and RF.

I copied the whole unit test function test_pyspark_classifier_decision_tree() and ran it in my environment, and I get the same error.
Could it be that the build was done without running the unit tests?

@Ekkalak-T

@caspiDoron You could try version 0.30.2. It works for me.

@caspiDoron

Thanks, but it seems to work only for DT. Random forest fails the tests with: AssertionError: SHAP values don't sum to model output for class0!

GBT, which is the one I use, is not supported...

@QuentinAmbard
Contributor

I just re-ran the unit test and something is indeed broken.
As a workaround you can set check_additivity=False when computing the shap_values.

It's a newly added check, and it calls the predict function.
I suspect this wasn't caught by the unit tests because Spark isn't in the test environment, and the test is skipped in that case.
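
For readers wondering why the new check needs predict: the additivity property says that, for each row, the SHAP values plus the explainer's expected value should add up to the model's raw output. Below is a minimal pure-Python sketch of such a check. The function name `additivity_gap` and the sample numbers are illustrative assumptions, not shap's actual implementation.

```python
def additivity_gap(shap_values, expected_value, model_output):
    """Return the largest absolute gap between SHAP sums and model outputs.

    shap_values:    list of rows, one list of per-feature contributions per row
    expected_value: the explainer's base value (a float)
    model_output:   list of raw model predictions, one per row
    """
    if len(shap_values) != len(model_output):
        raise ValueError("one model output is needed per explained row")
    return max(
        (abs(sum(row) + expected_value - out)
         for row, out in zip(shap_values, model_output)),
        default=0.0,
    )


# Hypothetical numbers: two rows, three features each.
phi = [[0.5, -0.25, 0.25], [0.0, 1.0, -0.5]]
base = 0.25
preds = [0.75, 0.75]

gap = additivity_gap(phi, base, preds)
```

Because the right-hand side of the comparison comes from the model's own prediction, a check like this cannot run when predict is unavailable, which is why passing check_additivity=False sidesteps the error.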

@caspiDoron

Thank you @QuentinAmbard it is working with this workaround.

@QuentinAmbard
Contributor

Great!
I suggest we do the following:

  1. Create a small fix to disable check_additivity for spark models (I'll commit that soon as a fix to this issue)
  2. Make sure the tests are run with the spark lib in the env to prevent this kind of issue (I'll create a new issue to fix that)
  3. Longer term and more viable: implement prediction with spark (I'll create a new feature request for that too)

@slundberg
Collaborator

Thanks for checking into this @QuentinAmbard! I just pushed an updated tolerance check for additivity to address #887, but I suspect this might be a true error that this new check uncovered. Happy to help work through it on the PR

slundberg added a commit that referenced this issue Dec 6, 2019
…cies

#884 add spark in setup.py tests and fix spark issue with additivity check
@ppakawatk

Hi. I can run the example code properly.
But I don't fully understand how shap_values actually works.
Can anyone please explain why shap_values takes 'X' as data in the form of one feature per column (i.e. sepal_length, sepal_width, petal_length, petal_width, each in its own column), while the GBTClassifier model actually takes the features in a single column (named 'features')?

How can shap_values reconcile the layout used when the model was trained (features in one column) with the layout used when the model is explained?

Thank you.

@QuentinAmbard
Contributor

shap_values takes a pandas DataFrame containing one column per feature.
GBTClassifier is a Spark classifier that is trained on a Spark DataFrame. Spark works with a single column containing a vector of all the features you are using (that's what the VectorAssembler produces).

Once the model is trained, shap explains it using shap_values(...).

You have to convert your data into a pandas DataFrame to explain it. If your dataset is too big, you can easily create a Spark pandas UDF to run shap_values in a distributed fashion.

@ppakawatk

Thanks @QuentinAmbard. I still wonder how shap_values knows which element of the Spark feature vector (used when the model was trained) each pandas DataFrame column corresponds to.

@QuentinAmbard
Contributor

I'm using the index of the features, so I assume the order of the pandas columns must match the order of the features added in the VectorAssembler for your Spark DataFrame. It's probably worth mentioning in the documentation.

https://github.com/slundberg/shap/blob/master/shap/explainers/tree.py#L951
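
Since features are matched by position, one way to guard against a silent mismatch is to put the pandas columns in the VectorAssembler's input order before calling shap_values. Below is a minimal sketch, assuming you still have the `inputCols` list used at training time. `align_columns` is a hypothetical helper, not part of shap or pyspark, and the column names simply mirror the iris example in this thread.

```python
def align_columns(frame_columns, input_cols):
    """Return input_cols if every one is present in frame_columns, else raise.

    frame_columns: column names of the pandas DataFrame to be explained
    input_cols:    the inputCols list given to VectorAssembler at training time
    """
    missing = [c for c in input_cols if c not in frame_columns]
    if missing:
        raise KeyError(f"columns missing from the explain frame: {missing}")
    # The returned order is the training order, whatever order the frame uses.
    return list(input_cols)


# Hypothetical column names, mirroring the iris example.
input_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
frame_cols = ["petal_width", "sepal_length", "type", "petal_length", "sepal_width"]

ordered = align_columns(frame_cols, input_cols)
```

With a pandas DataFrame `X`, `X[ordered]` then selects the features in training order and drops extras such as `type`.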

@sacmax

sacmax commented Feb 25, 2020

Hi All,
Is there any solution for "NotImplementedError: CategoricalSplit are not yet implemented" in pyspark?

@amandolesi

amandolesi commented Apr 24, 2020

@QuentinAmbard
Using the iris example, I tried to parallelize the SHAP value calculation this way:

iris_shap=iris.drop('type','features','label').repartition(10)
X_columns=iris_shap.columns
explainer = shap.TreeExplainer(model)
 
def calculate_shap(rows,X_columns,explainer):
  a=pd.DataFrame(rows,columns=X_columns)
  shap_values = explainer.shap_values(a)
  return [Row(*( [float(f) for f in shap_values[i]])) for i in range(len(shap_values))]
 
iris_shap.rdd.mapPartitions(lambda j:calculate_shap(j,X_columns,explainer)).toDF(X_columns)

If the model is a sklearn.ensemble.GradientBoostingClassifier there is no problem, but when it is a pyspark.ml.classification.GBTClassifier I get this error:

PicklingError: Could not serialize object: Py4JError: An error occurred while calling o135.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Any suggestion?

@QuentinAmbard
Contributor

The explainer probably can't be serialized because we keep Spark references inside it. I'll try to have a look.
As a workaround, maybe you can recompute the explainer in each partition?

@amandolesi

I tried passing the model to the calculate_shap function and computing the explainer inside the partition, but I get the same error.

@annagarkar

annagarkar commented Apr 30, 2020

@QuentinAmbard

I am trying to get shap to work with a pyspark GBT classifier. I got my features as a numpy array X and then tried (as in the example):

>>> model = pyspark.ml.classification.GBTClassificationModel.load("/path/to/trained/model")
>>> explainer = shap.TreeExplainer(model)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
>>> sv = explainer.shap_values(X)

It gave the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 304, in shap_values
    assert self.model.fully_defined_weighting, "The background dataset you provided does not cover all the leaves in the model, "
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

I did not provide a background dataset, so I don't understand why it wants me to provide a larger one. Also, the matrix X contains my entire training dataset, so I don't understand how it could fail to cover all the leaves in the model. Am I doing something obviously wrong?

Then, when I tried using feature_perturbation="interventional", it gave a different error:

>>> explainer = shap.TreeExplainer(model, data=X)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 151, in __init__
    self.expected_value = self.model.predict(self.data).mean(0)
  File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 972, in predict
    raise NotImplementedError("Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.")
NotImplementedError: Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.

Also, if running the predictions with spark is complicated to implement, it might be worth adding the ability of the user to supply the expected predictions for validation.

@QuentinAmbard
Contributor

You should get this error when your tree is built with a leaf that has no data in it. If you get this error, I assume you are using shap on a model built with a small dataset?
Can you open another issue to track implementing predictions with Spark, so that it works with interventional?
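
To make the "leaf without data" condition concrete: with feature_perturbation="tree_path_dependent", the explainer relies on the number of training samples that reached each node, so a node with a zero count leaves the weighting undefined. Below is a rough sketch of that kind of check using plain lists. The function name and the data layout are assumptions for illustration and do not reproduce shap's internals.

```python
def nodes_without_samples(node_sample_counts):
    """Return (tree_index, node_index) pairs whose training sample count is zero.

    node_sample_counts: one list per tree, holding the number of training
    samples that reached each node of that tree.
    """
    return [
        (t, n)
        for t, counts in enumerate(node_sample_counts)
        for n, count in enumerate(counts)
        if count <= 0
    ]


# Hypothetical ensemble of two small trees; node 2 of the second tree is empty.
counts = [[100, 60, 40], [100, 100, 0]]
empty = nodes_without_samples(counts)
fully_defined = not empty  # False here, because one node saw no samples
```

Running a check like this on the per-node counts of a model could help tell whether the assertion comes from genuinely empty nodes or from counts that were lost somewhere along the way.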

@annagarkar

annagarkar commented Apr 30, 2020

You mean that not all paths to the leaf are the same length, so some of what would otherwise be intermediate nodes have no children (leaving those phantom child nodes empty)?

Also, I created Issue #1192 to track spark predictions.

@MatteoManzari

@QuentinAmbard

Is there any news about the error @amandolesi reported? Any new suggestions?

Thank you.

@jennyivy

jennyivy commented Jul 2, 2020

@sacmax wrote: Is there any solution for "NotImplementedError: CategoricalSplit are not yet implemented" in pyspark?

I ran into the same problem. Did you find a solution?

@QuentinAmbard
Contributor

I haven't had time to look into what exactly is causing this over the last few weeks. I suspect a reference to Spark is kept somewhere and it breaks serialization of the tree explainer with a Spark model. I'll have a look when I get some time, but it shouldn't be a big deal, especially if serialization works with other models.

@guidiandrea

@QuentinAmbard

Hello Quentin, to recap and also give you some additional feedback, I performed some tests using a local standalone instance of spark.

As you mentioned a serialization error, I tried pickling a 'pyspark.ml.classification.RandomForestClassificationModel' object, basically a fitted pyspark random forest and I got a Py4J error, the same that @amandolesi reported above.

In explainer/tree.py, TreeExplainer class, row 695:

elif "pyspark.ml" in str(type(model)):
    assert_import("pyspark")
    self.original_model = model

so this serialization problem propagates. I tried commenting out "self.original_model = model" and I was then able to pickle the TreeExplainer object with a PySpark model. Of course it is a workaround but predictions are not implemented with PySpark yet, so commenting that line for the time being should not be an issue, what do you think about it?
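
The effect described above can be reproduced without Spark. In the sketch below a `threading.Lock` stands in for the Py4J-backed model (both refuse to be pickled), and a `SimpleNamespace` is a hypothetical stand-in for the TreeExplainer. Clearing the reference is one possible approach, not necessarily what an eventual fix would do.

```python
import pickle
import threading
from types import SimpleNamespace

# Stand-in explainer: plain tree data plus a handle that cannot be pickled.
explainer = SimpleNamespace(
    tree_values=[0.1, 0.2],
    original_model=threading.Lock(),  # plays the role of the Py4J-backed model
)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False if pickling fails."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


before = can_pickle(explainer)     # the lock blocks serialization
explainer.original_model = None    # drop the reference, as in the workaround
after = can_pickle(explainer)
restored = pickle.loads(pickle.dumps(explainer))
```

The extracted tree data survives the round trip, which matches the observation that the explainer pickles once the model reference is gone.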

@QuentinAmbard
Contributor

Thanks @guidiandrea! Absolutely, that's what I had in mind too, but I still haven't found time to make the change :/ The original_model was indeed kept in order to implement predictions (#1192), but I think we should find another way, to avoid breaking serialisation with Spark models.
Would you like to do the PR?

@guidiandrea

Here you are:
#1307

Thank you @QuentinAmbard!

@QuentinAmbard
Contributor

@slundberg I think we can now close this issue as everything should be solved with #1313

@antonwnk

Looks like this should be closed @allard-jeff

@chengyin38

chengyin38 commented Jul 20, 2021

I am still getting the "NotImplementedError: CategoricalSplit are not yet implemented" error. I am using shap==0.39.0 and Spark 3.

I got the same error using decision trees as well.

Code:

pipeline = Pipeline(stages=[string_indexer, vector_assembler, model])
pipeline_model = pipeline.fit(train_df)
explainer = shap.explainers.Tree(pipeline_model.stages[-1])

@AllardJM

AllardJM commented Oct 5, 2021

@chengyin38 The issue is that Shap cannot handle categorical splits. So, in the PySpark pre-processing you need to drop the metadata from the DataFrame that PySpark implicitly uses to determine that a feature is categorical. It seems this can be done with df = df.rdd.toDF(). However, string indexing without categorical splits might not be an optimal modeling approach.

@AllardJM

AllardJM commented Oct 6, 2021

I will also note that it was necessary in my experience to remove the vectors of the one hot encoding. I broke them out into binary features. After that (along with the step above), Shap was able to run effectively on a Pyspark tree model.
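
As an illustration of breaking a one-hot encoded feature out into separate binary features, here is a minimal pure-Python sketch. `expand_one_hot`, the column naming scheme, and the example categories are all assumptions for illustration, not a Spark or shap API.

```python
def expand_one_hot(name, categories, value):
    """Map one categorical value to named binary features, one per category."""
    if value not in categories:
        raise ValueError(f"unknown category for {name}: {value!r}")
    return {f"{name}_{c}": 1.0 if c == value else 0.0 for c in categories}


# Hypothetical categorical feature with three levels.
row = expand_one_hot("color", ["red", "green", "blue"], "green")
```

Each resulting key can then be passed to the assembler as an ordinary numeric column instead of a one-hot vector.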

@AllardJM

@QuentinAmbard I am finding that with Shap 0.39.0 this error continues with pyspark GBT. The weird issue is that this error only seems to happen when a saved GBT is loaded. If I use the original model in memory, the error does not occur. Any ideas?

    assert self.model.fully_defined_weighting, "The background dataset you provided does " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

@AnastasiaProkaieva

AnastasiaProkaieva commented Jan 20, 2022

Any idea how to fix this?

Model type not yet supported by TreeExplainer: <class 'sparkdl.xgboost.xgboost_core.XgboostRegressorModel'>

I am trying to run this type of code:

xgboost = XgboostRegressor(**params)
pipeline = Pipeline(stages=[stringIndexer, vecAssembler, xgboost])
pipelineModel = pipeline.fit(trainDF)
explainer = shap.TreeExplainer(pipelineModel.stages[-1])

Update:
shap.TreeExplainer(pipelineModel.stages[-1].get_booster())
does the trick!

@BDon-Tan

Any progress on the AssertionError that @AllardJM reported above for a saved and reloaded pyspark GBT model? @QuentinAmbard

@weishengtoh weishengtoh linked a pull request Sep 22, 2022 that will close this issue
@mriomoreno mriomoreno linked a pull request Nov 13, 2023 that will close this issue

This issue has been inactive for two years, so it's been automatically marked as 'stale'.

We value your input! If this issue is still relevant, please leave a comment below. This will remove the 'stale' label and keep it open.

If there's no activity in the next 90 days the issue will be closed.

@github-actions github-actions bot added the stale Indicates that there has been no recent activity on an issue label Mar 27, 2024