
Merging BorutaPy and Boruta-Shap #80

Closed
ThomasBury opened this issue Jul 1, 2020 · 3 comments


@ThomasBury

Hi,

Ekeany just wrote a package (https://github.com/Ekeany/Boruta-Shap) with the same modifications that I made in the PR (#77).

To avoid duplication and for the benefit of the community, it makes sense to merge those contributions, ideally under the scikit-learn-contrib umbrella.

Thanks!
KR

@danielhomola
Collaborator

Sorry, completely forgot about this.

I applaud your effort, but I'm worried about a couple of things here, maybe you can shine some light on these:

  • Is there any evidence that using SHAP like this yields a better feature selector than base Boruta? Any paper on this?
  • AFAIK, the tree-based method in shap (one of the main contributions of the paper, since it's fast) has also been shown to be incorrect: https://christophm.github.io/interpretable-ml-book/shap.html#treeshap
  • So we're left with the slow method. Boruta is already slow because it re-fits the model several times, and now we'd add this latency on top. Am I missing something here?
  • I'd avoid adding dependencies (shap, matplotlib) to the package, so this would need to be provably better to justify the coupling.
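For context, the re-fitting cost mentioned above comes from Boruta's core step, which can be sketched with plain scikit-learn (a minimal illustration; the dataset, seeds, and variable names are mine, not from either package): shuffle each column to create "shadow" features, fit on the augmented matrix, and flag real features whose importance beats the best shadow. The actual algorithm repeats this many times and applies a binomial test.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Shadow features: each column shuffled independently, destroying any signal.
shadows = np.apply_along_axis(rng.permutation, 0, X)
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_aug, y)
real_imp = rf.feature_importances_[:6]
shadow_max = rf.feature_importances_[6:].max()

# A "hit": a real feature more important than the best shadow.
hits = real_imp > shadow_max
```

Because this fit-and-compare loop runs many times, any extra per-fit cost (such as a slow SHAP computation) multiplies across iterations.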

As I see in your other comment, you've

@ThomasBury
Author

No worries (part of your post seems to be missing, but here are some thoughts):

  • SHAP and Shapley values aren't perfect, I'm aware of that. The main drawback, in my opinion, is the linear attribution: it doesn't take multivariate (interaction) effects into account. That being said, impurity-based feature importance returns even more biased results and should be avoided, since it favours numeric and high-cardinality predictors even when they are made up and random (see the dummy test I made, or the scikit-learn documentation; it would also be nice to stay in line with scikit-learn). The aforementioned feature importances have been compared in a Medium blog post.
  • Regarding computation time, I suggest changing the default estimator from random forest to lightgbm. In my tests it's up to 10x faster, so even with SHAP or permutation importance the total cost is equivalent to running RF with its built-in feature importance. Moreover, boosting-based feature selection followed by a boosting model returns more consistent results than RF feature selection followed by boosting.
  • Permutation importance is an alternative; I don't remember the computational cost offhand, but with lightgbm it's still OK.
  • I wouldn't consider matplotlib a hard dependency, at least not on the same footing as SHAP, but that's up to you. Visualization is always a good idea in my opinion. Mine is not necessarily the best; the Boruta-Shap package has a nice one.
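The impurity-bias point can be reproduced as a small "dummy test" with only scikit-learn (a sketch; the dataset, seeds, and feature layout are my own assumptions, not the exact test from the PR): append a pure-noise continuous column, then compare the forest's impurity importances with permutation importances computed on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
# Last column is pure continuous noise, unrelated to y.
X = np.hstack([X, rng.randn(500, 1)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Impurity importance: computed on training splits, biased toward
# high-cardinality (continuous) columns, so the noise column gets credit.
impurity = rf.feature_importances_

# Permutation importance on held-out data: the noise column scores ~0.
perm = permutation_importance(rf, X_te, y_te, n_repeats=10,
                              random_state=0).importances_mean
```

The noise column picks up a nonzero impurity score simply because a continuous column offers many candidate split points, while its permutation importance on unseen data stays near zero.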

If you like, you can pick from the Boruta-Shap package or from the material in my PR (or neither, but then we'll be left with different implementations of essentially the same algorithm).

@ThomasBury
Author

Hi,

Actually, Shapley and permutation importances must be computed on unseen data, which is the case in my PR but not in Boruta-Shap (I raised an issue on the boruta_shap GitHub repo).
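The unseen-data point can be illustrated with plain scikit-learn (a sketch; the overfit decision tree, seeds, and thresholds are my assumptions): an overfit model assigns nonzero permutation importance to pure-noise columns when scored on its own training set, while the same importances computed on a held-out set collapse toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)
X = np.hstack([X, rng.randn(400, 3)])  # three pure-noise columns appended

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)  # overfits

# On training data, the tree memorized splits on noise, so permuting
# a noise column hurts training accuracy and the column looks "important".
perm_train = permutation_importance(tree, X_tr, y_tr, n_repeats=10,
                                    random_state=1).importances_mean
# On held-out data, those same noise columns carry no real signal.
perm_test = permutation_importance(tree, X_te, y_te, n_repeats=10,
                                   random_state=1).importances_mean

noise_train = perm_train[-3:].max()
noise_test = perm_test[-3:].max()
```

Scoring on training data therefore inflates noise features; the held-out evaluation is what lets the selector discard them.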
KR
