Failing additivity check in shap_values of CausalForestDML #445

Open
kennethverstraete opened this issue Mar 29, 2021 · 40 comments

Comments

@kennethverstraete

When I input the exact same training set I used to train my CausalForestDML and call shap_values I get the following error:

Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. This check failed because for one of the samples the sum of the SHAP values was -0.335371, while the model output was -0.344933. If this difference is acceptable you can set check_additivity=False to disable this check.

So the data matrix does have the correct shape. One option would be to set check_additivity=False as the error suggests, but shap_values does not accept that argument, so there is currently no way to pass it through.

@vsyrgkanis
Collaborator

The additivity assumption will fail for CausalForestDML. I thought that shap only raises a warning when this is violated. Does it now raise an error and not return anything?

The reason additivity fails is that CausalForestDML aggregates predictions across trees in a subtly different way. It does not simply average the tree predictions; it uses the trees to average the Jacobian of the moment and the moment itself, then inverts the average Jacobian and multiplies it with the average moment. This differs from averaging the predictions of each tree, which is what shap's additivity check assumes.

However, shap_values should still be qualitatively good metrics of feature importance, despite this slight inconsistency.
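
A minimal numeric sketch of the difference, assuming a scalar treatment and made-up per-tree values (just the aggregation arithmetic, not econml code):

import numpy as np

# Per-tree Jacobians J_t and moments m_t for a single sample (illustrative values only)
J = np.array([1.2, 0.8, 1.0])
m = np.array([0.6, 0.5, 0.4])

# What shap's additivity check implicitly assumes: average the per-tree predictions
avg_of_tree_predictions = (m / J).mean()

# What CausalForestDML does: average Jacobians and moments first, then solve
forest_prediction = m.mean() / J.mean()

print(avg_of_tree_predictions, forest_prediction)  # ~0.508 vs 0.5 -- generally not equal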

@kennethverstraete
Author

It currently throws an exception so no results are returned.

And thank you for this amazing package!

@vsyrgkanis
Collaborator

Can you add here a code snippet of how you calculate shap values?

@vsyrgkanis
Collaborator

Also, could you share the versions of econml and shap that you have on your system?

@vsyrgkanis
Collaborator

Also does this notebook run for you or do you get the error here too:
https://github.com/microsoft/EconML/blob/master/notebooks/Interpretability%20with%20SHAP.ipynb

@kennethverstraete
Author

kennethverstraete commented Mar 29, 2021

econml is version 0.9.0 and SHAP is 0.38.1

I use the following code:

from econml.dml import CausalForestDML
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

cfdml = CausalForestDML(criterion='het', model_t=RandomForestClassifier(n_estimators=100, random_state=rs), model_y=RandomForestRegressor(n_estimators=100, random_state=rs), discrete_treatment=True, cv=5, n_estimators=500)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)

The notebook you gave works for me though.

@vsyrgkanis
Collaborator

Weird that the notebook runs but not when applied to your data.

I checked the notebook with these specs and things still work, and I have the same versions as you. So that only leaves something about the data being off, which is strange.

@vsyrgkanis
Collaborator

What are the dimensions of y_train.ravel(), T_train, and X_train_imputed?
Also, is your treatment binary or categorical?

@kennethverstraete
Author

y_train.ravel(): (1331,)
T_train: (1331,)
X_train_imputed: (1331, 66)
It's binary treatment.

@vsyrgkanis
Collaborator

And what are your system specs (e.g. Mac/Windows/Linux, Python version, GPU enabled)?

@kennethverstraete
Author

Windows 10,
Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
No GPU

@vsyrgkanis
Collaborator

Could you maybe update econml to 0.10 to see if the issue persists?
pip install -U econml

@kennethverstraete
Author

kennethverstraete commented Mar 29, 2021

Still the same exception:
Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. This check failed because for one of the samples the sum of the SHAP values was -0.078317, while the model output was -0.068970. If this difference is acceptable you can set check_additivity=False to disable this check.

If I use the data from the example notebook in the same datatypes as my code, it works.

@vsyrgkanis
Collaborator

Can you try using the mse criterion?

It could be that the het criterion creates some invalid leaves and gives NaN values for some trees if you try to predict based on a single tree.

@vsyrgkanis
Collaborator

Or alternatively place a 'min_var_fraction_leaf' constraint of something like 0.1 to ensure that all leaves contain both treatments.

@vsyrgkanis
Collaborator

Also maybe try calling “tune” before fit, to optimally tune the hyperparams to your data.
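
Putting the last few suggestions together, a sketch of the kind of configuration being suggested (parameter values are illustrative only, not a recommendation):

cfdml = CausalForestDML(criterion='mse',                 # instead of 'het'
                        min_var_fraction_leaf=0.1,       # keep both treatments represented in every leaf
                        model_t=RandomForestClassifier(n_estimators=100, random_state=rs),
                        model_y=RandomForestRegressor(n_estimators=100, random_state=rs),
                        discrete_treatment=True, cv=5, n_estimators=500)
cfdml.tune(Y=y_train.ravel(), T=T_train, X=X_train_imputed)   # tune forest hyperparams before fitting
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)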

@vsyrgkanis
Collaborator

Try also adding some min_samples_leaf to your model_t, or running grid-search cross-validation to optimally select hyperparameters for each of model_y and model_t.

Your current model_t might be overfitting a lot and predicting extreme probabilities.

See, for instance, how we perform hyperparameter tuning for model_y and model_t here:
https://github.com/microsoft/EconML/blob/master/notebooks/ForestLearners%20Basic%20Example.ipynb
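
A sketch of that kind of nuisance-model tuning with plain sklearn GridSearchCV (the linked notebook uses econml's own helpers; the grids below are made up):

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import GridSearchCV

model_t = GridSearchCV(RandomForestClassifier(random_state=rs),
                       param_grid={'n_estimators': [100, 200], 'min_samples_leaf': [10, 50]},
                       cv=5, scoring='neg_log_loss')
model_y = GridSearchCV(RandomForestRegressor(random_state=rs),
                       param_grid={'n_estimators': [100, 200], 'min_samples_leaf': [10, 50]},
                       cv=5, scoring='neg_mean_squared_error')
# The fitted searches refit the best estimator, so they can be passed directly as model_t / model_y.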

@kennethverstraete
Author

I used the mse criterion, added the 'min_var_fraction_leaf' constraint, called tune before fitting, and also used cross-validation for model_y and model_t hyperparameter tuning, exactly like in the notebook. Still the additivity exception, unfortunately...

@vsyrgkanis
Collaborator

:) !

One more try: can you make sure that you don't have any NaNs in any of your variables?
Also, for simplicity, try passing numpy arrays and not pandas DataFrames.

@vsyrgkanis
Collaborator

Also, one final thing: use min_var_leaf_on_val=True and min_var_fraction_leaf > 0.1, and don't run tune.

@vsyrgkanis
Collaborator

I truly hope we can get to the bottom of this. You might be hitting some edge case that our tests are not catching, and we want to make sure we address it.

Here is one other thing to do to debug:
Try using only 1 tree, i.e. n_estimators=1, and see if the error still persists.
Then also plot the tree, using sklearn's plot_tree:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(15, 5))
plot_tree(est[0][0])   # est is the fitted CausalForestDML; est[t][0] indexes its trees
plt.show()

Do you see anything funny? If you are able to post the tree here, please do so.

@kennethverstraete
Author

For CausalForestDML(criterion='mse',max_samples=0.45, min_var_fraction_leaf=0.2, min_var_leaf_on_val=True, model_t=model_t, model_y=model_y, discrete_treatment=True, cv=5, n_estimators=1, subforest_size=1, inference=False):
[tree plot attached]

@vsyrgkanis
Collaborator

Thanks! And do you still get the error when you apply SHAP with this single tree?

@kennethverstraete
Author

No error with a single tree!

@vsyrgkanis
Collaborator

OK, then try two trees; if you get an error, plot both trees (e.g. [plot_tree(est[t][0]) for t in range(len(est))]).

@kennethverstraete
Author

It starts erroring from 12 trees on; do you want me to plot all the trees?

@vsyrgkanis
Collaborator

Just plot the last one then

@vsyrgkanis
Collaborator

I guess I'm trying to look for trees that contain some NaN values in the leaves...

@kennethverstraete
Author

Never mind, sometimes it works and sometimes it doesn't, so I don't think it depends on the number of trees after all. But I found one that errors with 1 tree:
[tree plot attached]

So this is a CausalForestDML with 1 estimator on which shap_values fails the additivity check.

@kennethverstraete
Author

No NaNs in any of the 1-estimator trees I find that fail on shap_values.

@vsyrgkanis
Collaborator

I'm out of ideas. If you could replicate this with synthetic data that you can share, then we can explore further.

Alternatively, if you could try with a different setup (e.g. on a different machine or in a Colab notebook) to see whether it's your machine setup or the data that triggers it, that could narrow it down a bit.

@vsyrgkanis
Collaborator

Hm oh try this one: fit_intercept=False

@vsyrgkanis
Collaborator

@kennethverstraete in your code snippet, what is cmodel?! Why isn't it cfdml?

cfdml = CausalForestDML(criterion='het', model_t=RandomForestClassifier(n_estimators=100, random_state=rs), model_y=RandomForestRegressor(n_estimators=100, random_state=rs), discrete_treatment=True, cv=5, n_estimators=500)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)

@kennethverstraete
Author

@kennethverstraete in your code snippet, what is cmodel?! Why isn't it cfdml?

cfdml = CausalForestDML(criterion='het', model_t=RandomForestClassifier(n_estimators=100, random_state=rs), model_y=RandomForestRegressor(n_estimators=100, random_state=rs), discrete_treatment=True, cv=5, n_estimators=500)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)

That was a bad copy-paste into GitHub! I used cmodel but changed it to cfdml to make it easier for you, and forgot to change it in the third line.

@vsyrgkanis
Collaborator

vsyrgkanis commented Mar 30, 2021

Here is one last try! This is a way to get shap_values on your own, outside of the econml package (here I'm only getting the shap_values for the first 100 samples, but you can pass the whole X):

import shap

background = shap.maskers.Independent(X_train_imputed, max_samples=200)
explainer = shap.Explainer(cfdml.model_cate.estimators_[0], background)
shap_values = explainer(X_train_imputed[:100])
shap.plots.beeswarm(shap_values)

And then also try this, which asks not to check the additivity constraint:

background = shap.maskers.Independent(X_train_imputed, max_samples=200)
explainer = shap.Explainer(cfdml.model_cate.estimators_[0], background)
shap_values = explainer(X_train_imputed[:100], check_additivity=False)
shap.plots.beeswarm(shap_values)

Also would be great if you could print the "explainer" object and see what explainer was chosen and make sure that a shap.explainers._tree.Tree object was chosen.

@kennethverstraete
Author

I will try this later today or tomorrow, I'll keep you up to date!

@kennethverstraete
Author

So the first fails (as expected I think), the second one works!

Also would be great if you could print the "explainer" object and see what explainer was chosen and make sure that a shap.explainers._tree.Tree object was chosen.

When I print the explainer object I get <shap.explainers._tree.Tree at 0x254e5dcf610>

@vsyrgkanis
Collaborator

Interesting! At least for now you have something that you can go with and get shap values.

But I still have to figure out what the problem might be... and why it only arises for your dataset on your machine and not for the synthetic dataset we have in the notebook.

If you would be willing to try to generate a synthetic dataset that looks similar to your real one (in terms of treatment imbalances, feature marginal distributions, and normalized outcomes), for which you can replicate the error behavior and which you can share with us, it would help tremendously in making sure that no other user hits this edge case and that you can use the integrated shap_values methods.
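
For reference, a minimal sketch of how such a synthetic dataset could be generated (the dimensions match the ones reported above; all distributions and coefficients are made up):

import numpy as np
rng = np.random.default_rng(0)

n, d = 1331, 66
X = rng.normal(size=(n, d))
propensity = 1 / (1 + np.exp(-(X[:, 0] - 1.5)))          # imbalanced binary treatment
T = rng.binomial(1, propensity)
tau = 0.5 * X[:, 1]                                       # heterogeneous treatment effect
y = tau * T + X[:, 2] + rng.normal(scale=0.5, size=n)     # roughly standardized outcome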

@kennethverstraete
Author

I will definitely try that and let you know if I can replicate it with synthetic data! Thanks for the quick help, I really appreciate it :)

@stevenfelix

I am experiencing the same issues raised by @kennethverstraete. I'm using version 0.9.1 and cannot update to 0.10 for some reason. The workaround that allows check_additivity=False does work for me too.
