
Merging BorutaPy and Boruta-Shap #80

Closed
ThomasBury opened this issue Jul 1, 2020 · 3 comments


@ThomasBury

Hi,

Ekeany just wrote a package (https://github.com/Ekeany/Boruta-Shap) with the same modifications that I made in the PR (#77).

To avoid duplication and for the benefit of the community, it makes sense to merge those contributions, ideally under the scikit-learn-contrib umbrella.

Thanks!
KR

@danielhomola
Collaborator

Sorry, completely forgot about this.

I applaud your effort, but I'm worried about a couple of things here, maybe you can shine some light on these:

  • Is there any evidence that using SHAP like this yields a better feature selector than base Boruta? Any paper on this?
  • AFAIK, the tree-based method in shap (one of the main contributions of the paper, since it's fast) has also been shown to be incorrect: https://christophm.github.io/interpretable-ml-book/shap.html#treeshap
  • So we're left with the slow method. Boruta is already slow because it re-fits the model several times, and now we'd add this latency on top. Am I missing something here?
  • I'd avoid adding dependencies (shap, matplotlib) to the package, so this would need to be provably better to justify the coupling.
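For context, the re-fitting cost mentioned above comes from Boruta's core step, which can be sketched with plain scikit-learn (a minimal illustration; the dataset, seeds, and variable names are mine, not from either package): shuffle each column to create "shadow" features, fit on the augmented matrix, and flag real features whose importance beats the best shadow. The actual algorithm repeats this many times and applies a binomial test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Shadow features: each column shuffled independently, destroying any signal.
shadows = np.apply_along_axis(rng.permutation, 0, X)
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[:6]
shadow_max = rf.feature_importances_[6:].max()

# A "hit": a real feature more important than the best shadow.
hits = real_imp > shadow_max
```

Because this fit-and-compare loop runs many times, any extra per-fit cost (such as a slow SHAP computation) multiplies across iterations.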

As I see in your other comment, you've

@ThomasBury
Author

No worries (part of your post seems to be missing, but here are some thoughts):

  • SHAP and Shapley values aren't perfect, I'm aware of that. The main drawback, in my opinion, is the linear attribution: it doesn't take multivariate (interaction) effects into account. That being said, impurity-based feature importance returns even more biased results and should be avoided, since it favours numeric and high-cardinality predictors even when they are made up and random (see the dummy test I made, or the scikit-learn documentation; it would also be nice to stay in line with scikit-learn). The aforementioned feature importances have been compared in a Medium blog post.
  • Regarding computation time, I suggest changing the default estimator from random forest to lightgbm. In my tests it's up to 10x faster, so even with SHAP or permutation importance the total cost is equivalent to running RF with its built-in feature importance. Moreover, boosting-based feature selection followed by a boosting model returns more consistent results than RF feature selection followed by boosting.
  • Permutation importance is an alternative; I don't remember the computational cost offhand, but with lightgbm it's still OK.
  • I wouldn't consider matplotlib a hard dependency, at least not on the same footing as SHAP, but that's up to you. Visualization is always a good idea in my opinion. Mine is not necessarily the best; the Boruta-Shap package has a nice one.
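The impurity-bias point can be reproduced as a small "dummy test" with only scikit-learn (a sketch; the dataset, seeds, and feature layout are my own assumptions, not the exact test from the PR): append a pure-noise continuous column, then compare the forest's impurity importances with permutation importances computed on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
# Last column is pure continuous noise, unrelated to y.
X = np.hstack([X, rng.randn(500, 1)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity importance: computed on training splits, biased toward
# high-cardinality (continuous) columns, so the noise column gets credit.
impurity = rf.feature_importances_

# Permutation importance on held-out data: the noise column scores ~0.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                              random_state=0).importances_mean
```

The noise column picks up a nonzero impurity score simply because a continuous column offers many candidate split points, while its permutation importance on unseen data stays near zero.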

If you like, you can pick from the Boruta-Shap package or from the material in my PR (or neither, but then we'll be left with different implementations of essentially the same algorithm).

@ThomasBury
Author

Hi,

Actually, Shapley and permutation importances must be computed on unseen data, which is the case in my PR but not in Boruta-Shap (I raised an issue on the boruta_shap GitHub repo).
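The unseen-data point can be illustrated with plain scikit-learn (a sketch; the overfit decision tree, seeds, and thresholds are my assumptions): an overfit model assigns nonzero permutation importance to pure-noise columns when scored on its own training set, while the same importances computed on a held-out set collapse toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)
X = np.hstack([X, rng.randn(400, 3)])  # three pure-noise columns appended

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)  # overfits

# On training data, the tree memorized splits on noise, so permuting
# a noise column hurts training accuracy and the column looks "important".
perm_train = permutation_importance(tree, X_tr, y_tr, n_repeats=10,
                                    random_state=1).importances_mean
# On held-out data, those same noise columns carry no real signal.
perm_test = permutation_importance(tree, X_te, y_te, n_repeats=10,
                                   random_state=1).importances_mean

noise_train = perm_train[-3:].max()
noise_test = perm_test[-3:].max()
```

Scoring on training data therefore inflates noise features; the held-out evaluation is what lets the selector discard them.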
KR
