Failing additivity check in shap_values of CausalForestDML #445
The additivity assumption will fail for CausalForestDML. I thought that SHAP only raises a warning if this is violated; does it now raise an error and not return anything? The reason why additivity fails is that CausalForestDML performs a subtly different way of aggregating predictions across trees. It does not just average the tree predictions; it uses the trees to average the jacobian of the moment and the moment itself, and then inverts the average jacobian and multiplies it with the average moment. This is different from averaging the predictions of each tree, which is what additivity in SHAP is checking. However, shap_values should still be a qualitatively good measure of feature importance, despite this slight inconsistency.
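To make the aggregation difference concrete, here is a toy scalar illustration with hypothetical per-tree numbers (not econml's actual code): inverting the averaged jacobian generally differs from averaging the per-tree estimates, which is the quantity SHAP's additivity check reconstructs.

```python
# Hypothetical per-tree values for a two-tree "forest" (scalar case):
J = [0.5, 2.0]   # per-tree jacobians of the moment
m = [1.0, 1.0]   # per-tree averaged moments

# CausalForestDML-style aggregation: average jacobian and moment across
# trees first, then invert the average jacobian.
avg_J = sum(J) / len(J)          # 1.25
avg_m = sum(m) / len(m)          # 1.0
forest_estimate = avg_m / avg_J  # 0.8

# What SHAP's additivity check assumes: the average of per-tree predictions.
tree_average = sum(mi / Ji for mi, Ji in zip(m, J)) / len(J)  # (2.0 + 0.5) / 2 = 1.25

print(forest_estimate, tree_average)  # 0.8 vs 1.25 -- not equal
```

The two quantities agree only in special cases (e.g. identical jacobians across trees), which is why the additivity check can fail even though nothing is numerically wrong.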
It currently throws an exception so no results are returned. And thank you for this amazing package!
Can you add here a code snippet of how you calculate the SHAP values?
Also maybe the versions of econml and shap that you have on your system.
Also, does this notebook run for you, or do you get the error here too:
econml is version 0.9.0 and SHAP is 0.38.1. I use the following code:
The notebook you gave works for me, though.
Weird that the notebook runs but not when applied to your data. I checked with these specs on the notebook and things still work, and I have the same version as you. So it only leaves something about the data being wrong, which is weird.
What are the dimensions of y_train.ravel(), T_train and X_train_imputed?
y_train.ravel(): (1331,)
And what are your system specs (e.g. mac/windows/linux, python version, gpu enabled)?
Windows 10,
Could you maybe update econml to 0.10 to see if the issue persists?
Still the same exception. If I use the data from the example notebook in the same datatypes as my code, it works.
Can you try using the 'mse' criterion? It could be because 'het' creates some invalid leaves and gives NaN values for some trees if you try to predict based on a single tree.
Or alternatively place a 'min_var_fraction_leaf' constraint of something like .1 to ensure that all leaves contain both treatments.
Also maybe try calling "tune" before fit, to optimally tune the hyperparameters to your data.
Try also adding some min_samples_leaf to your model_t, or running grid-search CV to optimally select hyperparameters for each of model_y and model_t. Your current model_t might be overfitting a lot and predicting extreme probabilities. See for instance how we perform hyperparameter tuning for model_y and model_t here:
I used the mse criterion, added the 'min_var_fraction_leaf' constraint, called tune before fitting, and also used the cross-validation for model_y and model_t hyperparameter tuning; exactly like in the notebook. Still the additivity exception, unfortunately...
:) One more try! Can you make sure that you don't have any NaNs in any of your variables?
Also one final thing:
I truly hope we can get to the bottom of this. You might be hitting some edge case that our tests are not catching, and I want to make sure we address it. Here is one other thing to do to debug:

```python
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt  # needed for the figure/show calls

plt.figure(figsize=(15, 5))
plot_tree(est[0][0])
plt.show()
```

Do you see anything funny? If you are able to post the tree here, please do so.
Thanks! And you still get the error when you apply SHAP with this single tree?
No error with a single tree!
OK, then try two trees; if you get an error, plot both trees (e.g. [plot_tree(est[t][0]) for t in range(len(est))]).
It starts to error from 12 trees on; do you want me to plot all trees?
Just plot the last one then.
I guess I'm trying to look for trees that contain some NaN values in the leaves...
No NaNs in any of the trees of the one estimator I find that fails on shap_values.
I'm out of ideas. If you could replicate this with synthetic data that you can share, then we can explore further. Alternatively, if you could also try with a different setup (e.g. on a different machine or in a colab notebook) to see whether it's your machine setup or the data that triggers it, that could narrow it down a bit.
Hm, oh, try this one:
@kennethverstraete in your code snippet, what is cmodel, and why isn't it cfdml?

```python
cfdml = CausalForestDML(
    criterion='het',
    model_t=RandomForestClassifier(n_estimators=100, random_state=rs),
    model_y=RandomForestRegressor(n_estimators=100, random_state=rs),
    discrete_treatment=True,
    cv=5,
    n_estimators=500,
)
cfdml.fit(Y=y_train.ravel(), T=T_train, X=X_train_imputed)
cfdml_shaps = cmodel.shap_values(X_train_imputed)  # note: cmodel, not cfdml
```

That was a bad copy-paste into GitHub! I used cmodel but changed it to cfdml to make it easier for you, but forgot to change it in the third line.
Here is one last try! This is a way to get the shap values by calling the shap library directly:

```python
background = shap.maskers.Independent(X_train_imputed, max_samples=200)
explainer = shap.Explainer(cfdml.model_cate.estimators_[0], background)
shap_values = explainer(X_train_imputed[:100])
shap.plots.beeswarm(shap_values)
```

And then also try this, which asks it to not check the additivity constraint:

```python
background = shap.maskers.Independent(X_train_imputed, max_samples=200)
explainer = shap.Explainer(cfdml.model_cate.estimators_[0], background)
shap_values = explainer(X_train_imputed[:100], check_additivity=False)
shap.plots.beeswarm(shap_values)
```

Also it would be great if you could print the "explainer" object and see what explainer was chosen and make sure that a
I will try this later today or tomorrow; I'll keep you up to date!
So the first fails (as expected, I think), but the second one works!
When I print the explainer object I get
Interesting! At least for now you have something that you can go with to get shap values. But I still have to figure out what the problem might be... and why it only arises for your data set on your machine and not for the synthetic dataset we have in the notebook. If you would be willing to try to generate some synthetic dataset that looks similar to your real one (in terms of treatment imbalances, feature marginal distributions, and normalized outcomes), for which you could replicate the error behavior and which you could share with us, it would help tremendously in making sure that this edge case is not hit by any other user and that you can use the integrated shap_values methods.
I will definitely try that and let you know if I can replicate it with synthetic data! Thanks for the quick help, I really appreciate it :)
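The kind of synthetic dataset the maintainer asks for above could be sketched like this. Everything here is a placeholder to be matched to the real data: the feature count, the imbalance level, and the effect form are assumptions; only the sample size n=1331 comes from the thread.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 1331, 10                       # n from the thread; d is a placeholder
X = rng.normal(size=(n, d))

# Imbalanced binary treatment; the propensity depends on X to mimic confounding
propensity = 1.0 / (1.0 + np.exp(-(X[:, 0] - 2.0)))   # roughly 10-20% treated
T = rng.binomial(1, propensity)

# Outcome with a heterogeneous treatment effect, then normalized
y = X[:, 1] + (1.0 + X[:, 2]) * T + rng.normal(size=n)
y = (y - y.mean()) / y.std()

print(X.shape, T.mean(), y.std())
```

If the additivity failure reproduces on data like this, the snippet can be shared verbatim in the issue without exposing the real dataset.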
I am experiencing the same issues raised by @kennethverstraete. Using version 0.9.1, and cannot update to 0.10 for some reason. The work-around that allows for
When I input the exact same training set I used to train my CausalForestDML and call shap_values, I get the following error:

```
Exception: Additivity check failed in TreeExplainer! Please ensure the data matrix you passed to the explainer is the same shape that the model was trained on. If your data shape is correct then please report this on GitHub. This check failed because for one of the samples the sum of the SHAP values was -0.335371, while the model output was -0.344933. If this difference is acceptable you can set check_additivity=False to disable this check.
```

So the data matrix has the correct shape. An option would be to pass check_additivity=False as the error suggests, but it is not possible to pass this argument to shap_values since this is not implemented.