Failing additivity check in shap_values of CausalForestDML #445

Open
kennethverstraete opened this issue Mar 29, 2021 · 40 comments

Comments

@kennethverstraete

When I input the exact same training set I used to train my CausalForestDML and call shap_values I get the following error:

Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. This check failed because for one of the samples the sum of the SHAP values was -0.335371, while the model output was -0.344933. If this difference is acceptable you can set check_additivity=False to disable this check.

So the data matrix does have the correct shape. One option would be to set check_additivity=False as the error suggests, but shap_values does not accept that argument, so there is currently no way to pass it through.

@vsyrgkanis
Collaborator

The additivity assumption will fail for CausalForestDML. I thought that shap only raises a warning when this is violated. Does it now raise an error and not return anything?

The reason additivity fails is that CausalForestDML aggregates predictions across trees in a subtly different way. It does not simply average the tree predictions; it uses the trees to average the Jacobian of the moment and the moment itself, then inverts the average Jacobian and multiplies it with the average moment. This differs from averaging the predictions of each tree, which is what shap's additivity check assumes.

However, shap_values should still be qualitatively good metrics of feature importance, despite this slight inconsistency.
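
A minimal numeric sketch of the difference, assuming a scalar treatment and made-up per-tree values (just the aggregation arithmetic, not econml code):

import numpy as np

# Per-tree Jacobians J_t and moments m_t for a single sample (illustrative values only)
J = np.array([1.2, 0.8, 1.0])
m = np.array([0.6, 0.5, 0.4])

# What shap's additivity check implicitly assumes: average the per-tree predictions
avg_of_tree_predictions = (m / J).mean()

# What CausalForestDML does: average Jacobians and moments first, then solve
forest_prediction = m.mean() / J.mean()

print(avg_of_tree_predictions, forest_prediction)  # ~0.508 vs 0.5 -- generally not equal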

@kennethverstraete
Author

It currently throws an exception so no results are returned.

And thank you for this amazing package!

@vsyrgkanis
Collaborator

Can you add here a code snippet of how you calculate shap values?

@vsyrgkanis
Collaborator

Also, could you share the versions of econml and shap that you have on your system?

@vsyrgkanis
Collaborator

Also does this notebook run for you or do you get the error here too:
https://github.com/microsoft/EconML/blob/master/notebooks/Interpretability%20with%20SHAP.ipynb

@kennethverstraete
Author

kennethverstraete commented Mar 29, 2021

econml is version 0.9.0 and SHAP is 0.38.1

I use the following code:

from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

cfdml = CausalForestDML(criterion='het', model_t=RandomForestClassifier(n_estimators=100, random_state=rs), model_y=RandomForestRegressor(n_estimators=100, random_state=rs), discrete_treatment=True, cv=5, n_estimators=500)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)

The notebook you gave works for me though.

@vsyrgkanis
Collaborator

Weird that the notebook runs but not when applied to your data.

I checked the notebook with these specs and things still work, and I have the same versions as you. So that only leaves something about the data being off, which is strange.

@vsyrgkanis
Collaborator

What are the dimensions of y_train.ravel(), T_train, and X_train_imputed?
Also, is your treatment binary or categorical?

@kennethverstraete
Author

y_train.ravel(): (1331,)
T_train: (1331,)
X_train_imputed: (1331, 66)
It's binary treatment.

@vsyrgkanis
Collaborator

And what are your system specs (e.g. Mac/Windows/Linux, Python version, GPU enabled)?

@kennethverstraete
Author

Windows 10,
Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
No GPU

@vsyrgkanis
Collaborator

Could you maybe update econml to 0.10 to see if the issue persists?
pip install -U econml

@kennethverstraete
Author

kennethverstraete commented Mar 29, 2021

Still the same exception:
Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. This check failed because for one of the samples the sum of the SHAP values was -0.078317, while the model output was -0.068970. If this difference is acceptable you can set check_additivity=False to disable this check.

If I use the data from the example notebook in the same datatypes as my code, it works.

@vsyrgkanis
Collaborator

Can you try using the mse criterion?

It could be that the het criterion creates some invalid leaves and gives NaN values for some trees if you try to predict based on a single tree.

@vsyrgkanis
Collaborator

Or alternatively place a 'min_var_fraction_leaf' constraint of something like 0.1 to ensure that all leaves contain both treatments.

@vsyrgkanis
Collaborator

Also maybe try calling “tune” before fit, to optimally tune the hyperparams to your data.
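
Putting the last few suggestions together, a sketch of the kind of configuration being suggested (parameter values are illustrative only, not a recommendation):

cfdml = CausalForestDML(criterion='mse',                 # instead of 'het'
                        min_var_fraction_leaf=0.1,       # keep both treatments represented in every leaf
                        model_t=RandomForestClassifier(n_estimators=100, random_state=rs),
                        model_y=RandomForestRegressor(n_estimators=100, random_state=rs),
                        discrete_treatment=True, cv=5, n_estimators=500)
cfdml.tune(Y=y_train.ravel(), T=T_train, X=X_train_imputed)   # tune forest hyperparams before fitting
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)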

@vsyrgkanis
Collaborator

Try also adding some min_samples_leaf to your model_t, or running grid-search cross-validation to optimally select hyperparameters for each of model_y and model_t.

Your current model_t might be overfitting a lot and predicting extreme probabilities.

See, for instance, how we perform hyperparameter tuning for model_y and model_t here:
https://github.com/microsoft/EconML/blob/master/notebooks/ForestLearners%20Basic%20Example.ipynb
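
A sketch of that kind of nuisance-model tuning with plain sklearn GridSearchCV (the linked notebook uses econml's own helpers; the grids below are made up):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model_t = GridSearchCV(RandomForestClassifier(random_state=rs),
                       param_grid={'n_estimators': [100, 200], 'min_samples_leaf': [10, 50]},
                       cv=5, scoring='neg_log_loss')
model_y = GridSearchCV(RandomForestRegressor(random_state=rs),
                       param_grid={'n_estimators': [100, 200], 'min_samples_leaf': [10, 50]},
                       cv=5, scoring='neg_mean_squared_error')
# The fitted searches refit the best estimator, so they can be passed directly as model_t / model_y.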

@kennethverstraete
Author

I used the mse criterion, added the 'min_var_fraction_leaf' constraint, called tune before fitting, and also used cross-validation for model_y and model_t hyperparameter tuning, exactly like in the notebook. Still the additivity exception, unfortunately...

@vsyrgkanis
Collaborator

:) !

One more try: can you make sure that you don't have any NaNs in any of your variables?
Also, for simplicity, try passing numpy arrays and not pandas DataFrames.

@vsyrgkanis
Collaborator

Also, one final thing: use min_var_leaf_on_val=True and min_var_fraction_leaf > 0.1, and don't run tune.

@vsyrgkanis
Collaborator

I truly hope we can get to the bottom of this. You might be hitting some edge case that our tests are not catching, and we want to make sure we address it.

Here is one other thing to do to debug:
Try using only 1 tree, i.e. n_estimators=1, and see if the error still persists.
Then also plot the tree, using sklearn's plot_tree:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(15, 5))
plot_tree(est[0][0])   # est is the fitted CausalForestDML; est[t][0] indexes its trees
plt.show()

Do you see anything funny? If you are able to post the tree here, please do so.

@kennethverstraete
Author

For CausalForestDML(criterion='mse',max_samples=0.45, min_var_fraction_leaf=0.2, min_var_leaf_on_val=True, model_t=model_t, model_y=model_y, discrete_treatment=True, cv=5, n_estimators=1, subforest_size=1, inference=False):
[tree plot attached]

@vsyrgkanis
Collaborator

Thanks! And do you still get the error when you apply SHAP with this single tree?

@kennethverstraete
Author

No error with a single tree!

@vsyrgkanis
Collaborator

OK, then try two trees; if you get an error, plot both trees (e.g. [plot_tree(est[t][0]) for t in range(len(est))]).

@kennethverstraete
Author

It starts erroring from 12 trees on; do you want me to plot all the trees?

@vsyrgkanis
Collaborator

Just plot the last one then

@vsyrgkanis
Collaborator

I guess I'm trying to look for trees that contain some NaN values in the leaves...

@kennethverstraete
Author

Never mind, sometimes it works and sometimes it doesn't, so I don't think it depends on the number of trees after all. But I found one that errors with 1 tree:
[tree plot attached]

So this is a CausalForestDML with 1 estimator on which shap_values fails the additivity check.

@kennethverstraete
Author

No NaNs in any of the 1-estimator trees I find that fail on shap_values.

@vsyrgkanis
Collaborator

I'm out of ideas. If you could replicate this with synthetic data that you can share, then we can explore further.

Alternatively, if you could try with a different setup (e.g. on a different machine or in a Colab notebook) to see whether it's your machine setup or the data that triggers it, that could narrow it down a bit.

@vsyrgkanis
Collaborator

Hm oh try this one: fit_intercept=False

@vsyrgkanis
Collaborator

@kennethverstraete in your code snippet, what is cmodel?! Why isn't it cfdml?

cfdml = CausalForestDML(criterion='het', model_t=RandomForestClassifier(n_estimators=100, random_state=rs), model_y=RandomForestRegressor(n_estimators=100, random_state=rs), discrete_treatment=True, cv=5, n_estimators=500)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)

@kennethverstraete
Author

@kennethverstraete in your code snippet, what is cmodel?! Why isn't it cfdml?

cfdml = CausalForestDML(criterion='het', model_t=RandomForestClassifier(n_estimators=100, random_state=rs), model_y=RandomForestRegressor(n_estimators=100, random_state=rs), discrete_treatment=True, cv=5, n_estimators=500)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)

That was a bad copy-paste into GitHub! I used cmodel but changed it to cfdml to make it easier for you, and forgot to change it in the third line.

@vsyrgkanis
Collaborator

vsyrgkanis commented Mar 30, 2021

Here is one last try! This is a way to get shap_values on your own, outside of the econml package (here I'm only getting the shap_values for the first 100 samples, but you can pass the whole X):

import shap

background = shap.maskers.Independent(X_train_imputed, max_samples=200)
explainer = shap.Explainer(cfdml.model_cate.estimators_[0], background)
shap_values = explainer(X_train_imputed[:100])
shap.plots.beeswarm(shap_values)

And then also try this, which asks not to check the additivity constraint:

background = shap.maskers.Independent(X_train_imputed, max_samples=200)
explainer = shap.Explainer(cfdml.model_cate.estimators_[0], background)
shap_values = explainer(X_train_imputed[:100], check_additivity=False)
shap.plots.beeswarm(shap_values)

Also would be great if you could print the "explainer" object and see what explainer was chosen and make sure that a shap.explainers._tree.Tree object was chosen.

@kennethverstraete
Author

I will try this later today or tomorrow, I'll keep you up to date!

@kennethverstraete
Author

So the first fails (as expected I think), the second one works!

Also would be great if you could print the "explainer" object and see what explainer was chosen and make sure that a shap.explainers._tree.Tree object was chosen.

When I print the explainer object I get <shap.explainers._tree.Tree at 0x254e5dcf610>

@vsyrgkanis
Collaborator

Interesting! At least for now you have something that you can go with and get shap values.

But I still have to figure out what the problem might be... and why it only arises for your dataset on your machine and not for the synthetic dataset we have in the notebook.

If you would be willing to try to generate a synthetic dataset that looks similar to your real one (in terms of treatment imbalances, feature marginal distributions, and normalized outcomes), for which you can replicate the error behavior and which you can share with us, it would help tremendously in making sure that no other user hits this edge case and that you can use the integrated shap_values methods.
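
For reference, a minimal sketch of how such a synthetic dataset could be generated (the dimensions match the ones reported above; all distributions and coefficients are made up):

import numpy as np
rng = np.random.default_rng(0)

n, d = 1331, 66
X = rng.normal(size=(n, d))
propensity = 1 / (1 + np.exp(-(X[:, 0] - 1.5)))          # imbalanced binary treatment
T = rng.binomial(1, propensity)
tau = 0.5 * X[:, 1]                                       # heterogeneous treatment effect
y = tau * T + X[:, 2] + rng.normal(scale=0.5, size=n)     # roughly standardized outcome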

@kennethverstraete
Author

I will definitely try that and let you know if I can replicate it with synthetic data! Thanks for the quick help, I really appreciate it :)

@stevenfelix

I am experiencing the same issues raised by @kennethverstraete. I'm using version 0.9.1 and cannot update to 0.10 for some reason. The workaround that allows check_additivity=False does work for me too.
