Error with Pyspark GBTClassifier #884

allard-jeff · 2019-11-07T02:15:48Z

I just installed Shap from PyPi (0.32.0) and running a version of your test still produces the same error - shown below. Is there something that I am missing in the use of Shap with a pyspark model?

import pyspark
print(pyspark.__version__)
import shap
print(shap.__version__)
import sklearn.datasets
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier, GBTClassifier
import numpy as np
import pandas as pd

iris_sk = sklearn.datasets.load_iris()
iris = pd.DataFrame(data= np.c_[iris_sk['data'], iris_sk['target']], columns= iris_sk['feature_names'] + ['target'])[:100]
spark = SparkSession.builder.config(conf=SparkConf().set("spark.master", "local[*]")).getOrCreate()

col = ["sepal_length","sepal_width","petal_length","petal_width","type"]
iris = spark.createDataFrame(iris, col)
iris = VectorAssembler(inputCols=col[:-1],outputCol="features").transform(iris)
iris = StringIndexer(inputCol="type", outputCol="label").fit(iris).transform(iris)

classifier = GBTClassifier(labelCol="label", featuresCol="features")
model = classifier.fit(iris)
explainer = shap.TreeExplainer(model)
X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100] # pylint: disable=E1101
shap_values = explainer.shap_values(X)


---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-31-f47b3a56c25f> in <module>
     23 explainer = shap.TreeExplainer(model)
     24 X = pd.DataFrame(data=iris_sk.data, columns=iris_sk.feature_names)[:100] # pylint: disable=E1101
---> 25 shap_values = explainer.shap_values(X)

/mnt1/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in shap_values(self, X, y, tree_limit, approximate, check_additivity)
    283 
    284         if check_additivity and self.model_output == "margin":
--> 285             self.assert_additivity(out, self.model.predict(X))
    286 
    287         return out

/mnt1/anaconda3/lib/python3.7/site-packages/shap/explainers/tree.py in predict(self, X, y, output, tree_limit)
    785             import pyspark
    786             #TODO support predict for pyspark
--> 787             raise NotImplementedError("Predict with pyspark isn't implemented")
    788 
    789         # see if we have a default tree_limit in place.

NotImplementedError: Predict with pyspark isn't implemented

allard-jeff · 2019-11-10T01:22:44Z

@QuentinAmbard
Has anyone else run this code successfully?

QuentinAmbard · 2019-11-10T10:44:50Z

That's almost the code from the unit test, so yes it should run without error.
I'll try to debug that this week, maybe there is an issue with 0.32.0 ...

Ekkalak-T · 2019-11-11T10:33:14Z

This also happens to me. It used to work with RandomForest in version 0.30.2.

I'll try to revert and check again ..

caspiDoron · 2019-11-17T08:09:48Z

Hello, I'm having the same problem with Shap version shap-0.32.1.
I also tried previous versions, but I get the same error since the commit which added GBT and RF.

To check the unit test, I copied the whole function test_pyspark_classifier_decision_tree()
and ran it in my environment, and I get the same error.
Could it be that the build was done without running the unit tests?

Ekkalak-T · 2019-11-17T08:16:18Z

@caspiDoron You may try version 0.30.2; it works for me.

caspiDoron · 2019-11-17T13:28:19Z

Thanks, but it seems to work only for DT. Random forest fails on the tests with: AssertionError: SHAP values don't sum to model output for class0!

GBT is not supported, and that is the one I use...

QuentinAmbard · 2019-11-17T17:31:06Z

I just re-ran the unit test and something is indeed broken.
As a workaround you can set check_additivity=False when computing the shap_values.

It's a new check that has been added, and it calls the predict function.
I suspect this wasn't caught by the unit tests because Spark isn't in the env and the test is skipped in that case.
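
A minimal sketch of that workaround, assuming an explainer whose shap_values method accepts the check_additivity keyword shown in the traceback above. The helper name shap_values_unchecked is hypothetical and only wraps the call:

```python
def shap_values_unchecked(explainer, X):
    """Compute SHAP values while skipping the additivity check.

    `explainer` is assumed to be a shap.TreeExplainer (or any object whose
    shap_values method accepts check_additivity, as in the traceback above).
    The helper name is hypothetical; it is not part of shap.
    """
    # Skipping the check avoids the internal model.predict(X) call that
    # raises NotImplementedError for pyspark models.
    return explainer.shap_values(X, check_additivity=False)


# Usage with the objects from the report above (requires shap and pyspark):
# explainer = shap.TreeExplainer(model)
# shap_values = shap_values_unchecked(explainer, X)
```

This only skips the check. It does not verify that the SHAP values sum to the model output, so a genuine mismatch would go unnoticed.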

caspiDoron · 2019-11-19T12:51:15Z

Thank you @QuentinAmbard it is working with this workaround.

QuentinAmbard · 2019-11-19T13:04:41Z

Great!
I suggest we do the following:

Create a small fix to disable check_additivity for Spark models (I'll commit that soon as a fix for this issue).
Make sure the tests are run with the Spark library in the environment to prevent this kind of issue (I'll create a new issue for that).
Longer term and more viable: implement prediction with Spark (I'll create a new feature request too).

slundberg · 2019-11-21T02:10:27Z

Thanks for checking into this @QuentinAmbard! I just pushed an updated tolerance check for additivity to address #887, but I suspect this might be a true error that this new check uncovered. Happy to help work through it on the PR

ppakawatk · 2019-12-12T10:43:09Z

Hi. I can run the example code properly.
But I don't fully understand how shap_values actually works.
Can anyone please explain why shap_values takes 'X' as data in the form of one feature per column (i.e. sepal_length, sepal_width, petal_length, petal_width, each in its own column), while the GBTClassifier model actually takes the features in a single column (named 'features')?

How can shap_values reconcile the difference between how the model was trained (features in one column) and how it is explained?

Thank you.

QuentinAmbard · 2019-12-12T13:50:39Z

shap_values takes a pandas DataFrame containing one column per feature.
GBTClassifier is a Spark classifier that is trained on a Spark DataFrame. Spark works with a single column containing an array of all the features you are using (that's what the VectorAssembler does).

Once the model is trained, shap will explain it using shap_values(...).

You have to convert your data into a pandas DataFrame to explain it. If your dataset is too big, you can easily create a Spark pandas UDF to run shap_values in a distributed fashion.

ppakawatk · 2019-12-12T14:45:52Z

Thanks @QuentinAmbard. I still wonder how shap_values knows which column of the pandas DataFrame corresponds to which element of the Spark feature vector (as used when the model was trained).

QuentinAmbard · 2019-12-12T15:00:57Z

I'm using the index of the features; I assume the order of the pandas columns must be the same as the order of the features added in the VectorAssembler of your Spark DataFrame. It's probably worth mentioning in the documentation.

https://github.com/slundberg/shap/blob/master/shap/explainers/tree.py#L951
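
Since features are matched by position, a small guard before calling shap_values can catch a column-order mismatch. The sketch below is only an illustration: the helper name align_to_assembler is hypothetical, and it assumes the pandas columns carry the same names as the inputCols passed to the VectorAssembler.

```python
def align_to_assembler(columns, assembler_inputs):
    """Return `columns` reordered to match `assembler_inputs`.

    columns: column names of the pandas DataFrame passed to shap_values.
    assembler_inputs: the inputCols given to the VectorAssembler, in order.
    Raises ValueError if the two do not contain the same names.
    """
    if sorted(columns) != sorted(assembler_inputs):
        missing = [c for c in assembler_inputs if c not in columns]
        extra = [c for c in columns if c not in assembler_inputs]
        raise ValueError(f"column mismatch: missing={missing}, extra={extra}")
    # Features are matched by position, so use the assembler's order.
    return list(assembler_inputs)


# Usage with pandas (assumed): X = X[align_to_assembler(list(X.columns), col[:-1])]
order = align_to_assembler(
    ["petal_width", "sepal_length", "petal_length", "sepal_width"],
    ["sepal_length", "sepal_width", "petal_length", "petal_width"],
)
print(order)
```

The name comparison is just a convenience for spotting mistakes; per the comment above, shap itself relies on position only.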

sacmax · 2020-02-25T15:32:07Z

Hi All,
Is there any solution for "NotImplementedError: CategoricalSplit are not yet implemented" in pyspark?

amandolesi · 2020-04-24T08:30:09Z

@QuentinAmbard
Using the iris example, I tried to parallelize the SHAP value calculation this way:

from pyspark.sql import Row

iris_shap = iris.drop('type', 'features', 'label').repartition(10)
X_columns = iris_shap.columns
explainer = shap.TreeExplainer(model)

def calculate_shap(rows, X_columns, explainer):
    # rows is an iterator over the partition's Row objects; materialize it for pandas
    a = pd.DataFrame(list(rows), columns=X_columns)
    shap_values = explainer.shap_values(a)
    # one output Row per input row, holding that row's SHAP values as floats
    return [Row(*[float(f) for f in shap_values[i]]) for i in range(len(shap_values))]

iris_shap.rdd.mapPartitions(lambda j: calculate_shap(j, X_columns, explainer)).toDF(X_columns)

If the model is a sklearn.ensemble.GradientBoostingClassifier there is no problem, but when it is a pyspark.ml.classification.GBTClassifier I get this error:

PicklingError: Could not serialize object: Py4JError: An error occurred while calling o135.__getstate__. Trace:
py4j.Py4JException: Method __getstate__([]) does not exist
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
	at py4j.Gateway.invoke(Gateway.java:274)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Any suggestion?

QuentinAmbard · 2020-04-24T09:20:13Z

The explainer probably can't be serialized because we are keeping Spark references inside it. I'll try to have a look.
As a workaround, maybe you can recompute the explainer in each partition?

amandolesi · 2020-04-24T09:28:18Z

I tried passing the model to calculate_shap and computing the explainer inside the partition, but I get the same error.

annagarkar · 2020-04-30T01:28:02Z

@QuentinAmbard

I am trying to get shap to work with a pyspark GBT classifier. I got my features as a numpy array X and then tried (as in the example):

>>> model = pyspark.ml.classification.GBTClassificationModel.load("/path/to/trained/model")
>>> explainer = shap.TreeExplainer(model)
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
>>> sv = explainer.shap_values(X)

It gave the following error:

Traceback (most recent call last):
File "", line 1, in
File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 304, in shap_values
assert self.model.fully_defined_weighting, "The background dataset you provided does not cover all the leaves in the model, "
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

I did not provide a background dataset, so I don't understand why it wants me to provide a larger one. Also, the matrix X contains my entire training dataset, so I don't understand how it could fail to cover all the leaves in the model. Am I doing something obviously wrong?

Then, when I tried using feature_perturbation="interventional", it gave a different error:

>>> explainer = shap.TreeExplainer(model, data=X)
Traceback (most recent call last):
File "", line 1, in
File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 151, in init
self.expected_value = self.model.predict(self.data).mean(0)
File "/lib/python3.7/site-packages/shap/explainers/tree.py", line 972, in predict
raise NotImplementedError("Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.")
NotImplementedError: Predict with pyspark isn't implemented. Don't run 'interventional' as feature_perturbation.

Also, if running the predictions with spark is complicated to implement, it might be worth adding the ability of the user to supply the expected predictions for validation.

QuentinAmbard · 2020-04-30T08:57:58Z

You get this error when your tree has a leaf with no data in it. If you see it, I assume you are using shap on a model built from a small dataset?
Can you open another issue to track the implementation of predictions with Spark, to make it work with interventional?

annagarkar · 2020-04-30T16:58:33Z

You mean that not all paths to the leaf are the same length, so some of what would otherwise be intermediate nodes have no children (leaving those phantom child nodes empty)?

Also, I created Issue #1192 to track spark predictions.

MatteoManzari · 2020-05-08T09:00:58Z

@QuentinAmbard

Is there any news about the error @amandolesi reported? Any new suggestions?

Thank you.

jennyivy · 2020-07-02T07:58:27Z

Hi All,
Is there any solution for "NotImplementedError: CategoricalSplit are not yet implemented" in pyspark?

I ran into the same problem. Did you find a solution to it?

QuentinAmbard · 2020-07-02T09:46:24Z

I haven't had time these last weeks to find out exactly what's causing this. I suspect a reference to Spark is kept somewhere and breaks the serialization of the tree explainer with a Spark model. I'll have a look when I get some time, but it shouldn't be a big deal, especially if serialization works with other models.

guidiandrea · 2020-07-09T15:56:31Z

@QuentinAmbard

Hello Quentin, to recap and also give you some additional feedback, I performed some tests using a local standalone instance of spark.

As you mentioned a serialization error, I tried pickling a 'pyspark.ml.classification.RandomForestClassificationModel' object, basically a fitted pyspark random forest and I got a Py4J error, the same that @amandolesi reported above.

In explainers/tree.py, TreeExplainer class, line 695:

elif "pyspark.ml" in str(type(model)):
    assert_import("pyspark")
    self.original_model = model

so this serialization problem propagates. I tried commenting out "self.original_model = model" and I was then able to pickle the TreeExplainer object with a PySpark model. Of course it is a workaround but predictions are not implemented with PySpark yet, so commenting that line for the time being should not be an issue, what do you think about it?
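
To illustrate why one stored reference is enough to break pickling, and one alternative to commenting out the line, here is a self-contained sketch. It does not use shap or pyspark: ExplainerSketch is a hypothetical stand-in class, and a threading.Lock plays the role of the Py4J-backed model because pickle rejects it too. This is not necessarily how the eventual fix in shap works.

```python
import pickle
import threading


class ExplainerSketch:
    """Stand-in for an explainer that keeps a reference to its source model."""

    def __init__(self, model, tree_values):
        self.original_model = model      # may wrap an unpicklable handle
        self.tree_values = tree_values   # plain Python data extracted from the model

    def __getstate__(self):
        # Copy the instance dict and drop the reference that cannot be pickled.
        state = self.__dict__.copy()
        state["original_model"] = None
        return state


# threading.Lock() plays the role of the Py4J-backed model: pickle rejects it.
unpicklable_model = threading.Lock()
explainer = ExplainerSketch(unpicklable_model, tree_values=[0.1, -0.2, 0.3])

# The round trip succeeds because __getstate__ leaves the model reference out.
restored = pickle.loads(pickle.dumps(explainer))
print(restored.tree_values, restored.original_model)
```

The trade-off is the same as with commenting out the line: the restored object no longer has access to the original model, so anything that needs it (such as a future predict implementation) would have to obtain it another way.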

QuentinAmbard · 2020-07-09T16:07:06Z

Thanks @guidiandrea! Absolutely, that's what I had in mind too, but I still haven't found time to make the change :/ The original_model was indeed kept in order to implement predictions (#1192), but I think we should find another way to avoid breaking serialisation with Spark models.
Would you like to do the PR?

guidiandrea · 2020-07-09T16:15:55Z

Here you are:
#1307

Thank you @QuentinAmbard!

QuentinAmbard · 2020-07-10T20:28:36Z

@slundberg I think we can now close this issue as everything should be solved with #1313

antonwnk · 2021-06-24T06:41:50Z

Looks like this should be closed @allard-jeff

chengyin38 · 2021-07-20T20:08:16Z

I am still getting the "NotImplementedError: CategoricalSplit are not yet implemented" error. I am using shap==0.39.0 and Spark 3.

I also got the same error using decision trees.

Code:

pipeline = Pipeline(stages=[string_indexer, vector_assembler, model])
pipeline_model = pipeline.fit(train_df)
explainer = shap.explainers.Tree(pipeline_model.stages[-1])

AllardJM · 2021-10-05T03:09:04Z

@chengyin38 The issue is that Shap cannot handle categorical splits. So, in the PySpark pre-processing you need to drop the metadata from the DataFrame that PySpark implicitly uses to determine that a feature is a categorical variable. It seems this can be done with df = df.rdd.toDF(). String indexing without categorical splits might not be an optimal modeling approach, however.

AllardJM · 2021-10-06T18:34:04Z

I will also note that it was necessary in my experience to remove the vectors of the one hot encoding. I broke them out into binary features. After that (along with the step above), Shap was able to run effectively on a Pyspark tree model.
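
As a rough illustration of "breaking out" a one-hot vector into binary features, here is a plain-Python sketch. Lists of dicts stand in for DataFrame rows, and the function name, column names, and category labels are all hypothetical. On a real Spark DataFrame you would need a Spark-side equivalent (for example via pyspark.ml.functions.vector_to_array on Spark 3.0+, which is an assumption about your setup).

```python
def expand_one_hot(rows, vector_col, categories):
    """Replace a one-hot list column with one 0/1 column per category.

    rows: list of dicts, one per record (a stand-in for DataFrame rows).
    vector_col: name of the column holding the one-hot list.
    categories: labels used to name the new columns, in vector order.
    """
    expanded = []
    for row in rows:
        vector = row[vector_col]
        if len(vector) != len(categories):
            raise ValueError("vector length does not match number of categories")
        # Keep every other column and add one scalar column per vector slot.
        new_row = {k: v for k, v in row.items() if k != vector_col}
        for label, value in zip(categories, vector):
            new_row[f"{vector_col}_{label}"] = int(value)
        expanded.append(new_row)
    return expanded


rows = [
    {"age": 31, "color": [1.0, 0.0, 0.0]},
    {"age": 45, "color": [0.0, 0.0, 1.0]},
]
print(expand_one_hot(rows, "color", ["red", "green", "blue"]))
```

Each resulting column is an ordinary numeric feature, so the tree sees only continuous splits on 0/1 values.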

AllardJM · 2022-01-15T21:25:16Z

@QuentinAmbard I am finding that with Shap 0.39.0 this error continues with pyspark GBT. The weird issue is that this error only seems to happen when a saved GBT is loaded. If I use the original model in memory, the error does not occur. Any ideas?

    assert self.model.fully_defined_weighting, "The background dataset you provided does " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

AnastasiaProkaieva · 2022-01-20T18:07:22Z

Any idea how to fix this?

Model type not yet supported by TreeExplainer: <class 'sparkdl.xgboost.xgboost_core.XgboostRegressorModel'>

I am trying to run this type of code:

xgboost = XgboostRegressor(**params)
pipeline = Pipeline(stages=[stringIndexer, vecAssembler, xgboost])
pipelineModel = pipeline.fit(trainDF)
explainer = shap.TreeExplainer(pipelineModel.stages[-1])

Update:
shap.TreeExplainer(pipelineModel.stages[-1].get_booster())
does the trick!

BDon-Tan · 2022-03-25T11:47:42Z

@QuentinAmbard I am finding that with Shap 0.39.0 this error continues with pyspark GBT. The weird issue is that this error only seems to happen when a saved GBT is loaded. If I use the original model in memory, the error does not occur. Any ideas?
    assert self.model.fully_defined_weighting, "The background dataset you provided does " \
AssertionError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, or using feature_perturbation="interventional".

Any progress with this problem? @QuentinAmbard

github-actions · 2024-03-27T02:33:45Z

This issue has been inactive for two years, so it's been automatically marked as 'stale'.

We value your input! If this issue is still relevant, please leave a comment below. This will remove the 'stale' label and keep it open.

If there's no activity in the next 90 days the issue will be closed.

QuentinAmbard mentioned this issue Nov 19, 2019

#884 add spark in setup.py tests and fix spark issue with additivity check #905

Merged

QuentinAmbard mentioned this issue Dec 2, 2019

A Spark version in plan? #38

Open

slundberg added a commit that referenced this issue Dec 6, 2019

Merge pull request #905 from QuentinAmbard/add_spark_in_test_dependan…

d5753fa

…cies #884 add spark in setup.py tests and fix spark issue with additivity check

guidiandrea mentioned this issue Jul 9, 2020

Update tree.py #1307

Closed

QuentinAmbard mentioned this issue Jul 10, 2020

Make shap explainer serializable with spark models #1313

Merged

weishengtoh linked a pull request Sep 22, 2022 that will close this issue

Fix PySpark GBT Issue #2700

Open

mriomoreno linked a pull request Nov 13, 2023 that will close this issue

Fix PySpark loaded models #3384

Open

github-actions bot added the stale Indicates that there has been no recent activity on an issue label Mar 27, 2024
