
catboost-plot-pimp-shapimp #77

Closed
wants to merge 11 commits

Conversation

ThomasBury

Modifications are:

  • catboost checks (CatBoost doesn't allow changing the random seed after fitting), plus dedicated cases for the CatBoost estimator
  • plot: boxplot of the variable-importance history (confirmed, tentative and rejected features are colour-coded)
  • categorical features for boosting, and a plot mimicking more or less the R one. OHE is known to lead to deep and unstable trees; LightGBM and CatBoost have native methods to handle categorical predictors, which requires declaring which columns are categorical (LightGBM has an auto mode, but the columns should be integer encoded with the dtype set to category; see the sketch after this list)
  • permutation importance and SHAP importance (the impurity-based variable importance is biased towards numerical and high-cardinality predictors, and is computed on the train set)
  • testing: modif_tests.py for classification and regression testing (including a random categorical predictor with large cardinality)
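
For reference, a minimal sketch of the categorical preparation for LightGBM's native handling; the toy data and column names are made up for illustration, this is not the exact code of the PR:

```python
import pandas as pd
import lightgbm as lgb

# Toy data; "city" plays the role of a (potentially high-cardinality) categorical predictor
df = pd.DataFrame({
    "city": ["ghent", "brussels", "liege", "ghent", "namur"],
    "age": [23, 41, 35, 52, 29],
})
y = [0, 1, 0, 1, 0]

# Integer-encode the categorical column and set its dtype to "category",
# so LightGBM can handle it natively instead of relying on one-hot encoding.
df["city"] = df["city"].astype("category").cat.codes.astype("category")

model = lgb.LGBMClassifier(n_estimators=50)
model.fit(df, y)  # columns with dtype "category" are treated as categorical by default
```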

thanks
KR

@ThomasBury
Author

To summarize, this PR solves/enhances:

  • Categorical features: they are detected and encoded. Tree-based models work better with integer encoding than with OHE, which leads to deep and unstable trees. If CatBoost is used, the categorical predictors (if any) are set up accordingly
  • Works with the CatBoost sklearn API
  • Allows using sample_weight, for applications like Poisson regression or any other that requires weights
  • 3 different feature importances: native, SHAP and permutation. Native is the least consistent (its importance is biased towards numerical and high-cardinality categorical predictors) but the fastest of the 3 (a sketch combining permutation importance with sample_weight follows this list)
  • Using lightGBM as the default estimator speeds up the running time by an order of magnitude
  • Visualization like in the R package
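
As a rough illustration of the permutation-importance and sample_weight points above (a sketch with made-up data, assuming scikit-learn >= 0.24 for the sample_weight argument of permutation_importance; not the exact code of this PR):

```python
import numpy as np
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance

# Toy regression data and illustrative weights (e.g. exposure in a Poisson-type setting)
X, y = make_regression(n_samples=500, n_features=8, random_state=0)
w = np.random.default_rng(0).uniform(0.5, 2.0, size=len(y))

model = LGBMRegressor(n_estimators=100).fit(X, y, sample_weight=w)

# Permutation importance is model-agnostic, unlike the impurity-based (native) importance;
# here the scoring is weighted the same way the model was fitted.
result = permutation_importance(model, X, y, sample_weight=w, n_repeats=10, random_state=0)
print(result.importances_mean)
```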

@brunofacca

brunofacca commented Aug 2, 2020

Hi @danielhomola. First, let me thank you for this great library.

The changes in this PR would be very useful to me. Do you plan to merge?

PS: here is a write-up about the limitations of Gini as a feature importance metric.

@ThomasBury
Author

@brunofacca While they decide whether or not to merge it, you can have a look at the Boruta_shap package. It implements almost all the features of my PR (I also discussed with its author fixing some issues and a possible merge with Boruta_py, which could be beneficial to avoid confusion and to be under the scikit-learn contrib flag). I hope it helps.

@brunofacca

Thank you, @ThomasBury, for this PR and for the tip. I'm actually testing the Boruta_shap package and it looks great, except that it still has low test coverage and fewer contributors. Of course, that is likely to improve as the project matures.

@brunofacca

Did the maintainers consider merging these changes? Is there a decision? Thank you.

@ThomasBury
Author

ThomasBury commented Nov 28, 2020

Did the maintainers consider merging these changes? Is there a decision? Thank you.

@brunofacca if you're still interested, I packaged 3 All Relevant FS methods here: https://github.com/ThomasBury/arfs. It is still incubating, but there are some unit tests and the documentation is there (I guess there are too many dependencies to be compliant with the scikit-learn contrib requirements). I'm supervising a master's thesis whose goal is to study the properties of those methods, so the package is likely to evolve over time.

@danielhomola
Collaborator

That's a nice package @ThomasBury, wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.

@ThomasBury
Author

That's a nice package @ThomasBury, wish you all the best with it! I much prefer separating these ideas from the original implementation to keep things simple and closer to the Unix tooling philosophy. I'll close this PR now if you don't mind.

OK, thanks for the kind words. Would you be interested in a PR in pure sklearn (so only sklearn estimators, with native and permutation importance)? That would make the package more than Boruta: a sklearn-flavoured all-relevant FS package, instead of the opposite strategy of having different packages and a wrapper on top of them, if compliant with the sk-contrib requirements. Or perhaps just a PR with the permutation importance (slower but more accurate, and still relevant for small/mid-size data sets)?
If not, no worries (I prefer having your opinion before starting anything ^^)

@brunofacca

Thank you, @ThomasBury! That's a very nice library and it is likely to fill what I consider to be a gap in the current ML tooling: I've tried dozens of feature selection strategies (including those that are considered "state of the art") and none of them were effective for a high-dimensional dataset where even the most relevant features are quite noisy. Your strategy of running an XGBoost classifier on the entire data set ten times (in BoostARoota), for example, is a nice step towards a more effective feature selection strategy for data with a low signal-to-noise ratio.

It would be very interesting to eventually see a comparison of classification performance with the subsets of features selected by each of those 3 strategies. I will also give them a try in the near future.

@danielhomola
Collaborator

Hi Thomas, really sorry, I totally forgot about your question. I'd be happy to review a PR with permutation importance if:

  • we add it as a difference from the original paper in the readme
  • it's optional and easy to switch off
  • we at least have a notebook on some toy data (iris or whatever) comparing the original and your version and showing the difference.

@ThomasBury
Author

ThomasBury commented Aug 27, 2021

Hi Thomas, really sorry, I totally forgot about your question. I'd be happy to review a PR with permutation importance if:

  • we add it as a difference from the original paper in the readme
  • it's optional and easy to switch off
  • we at least have a notebook on some toy data (iris or whatever) comparing the original and your version and showing the difference.

Hi @danielhomola,

I submitted a PR:

  • optional pimp (permutation importance) via an argument
  • if shap is installed (soft dependency, try/except import), SHAP feature importance can be used; shap is not required for installing boruta
  • no hard dependencies added; everything is optional, with a compatibility check using try/except on import plus an error message (see the sketch below)
  • chart like the original, only if matplotlib is installed (try/except import; mpl is not mandatory for installing boruta)
  • sample_weight support added
  • categorical feature support (based on the lightGBM recommendation)
  • notebook with examples, with genuine predictors and artificial ones, to check whether the right ones are accepted/rejected
  • message added to the Readme

With this, you get pretty much the same version as in the ARFS package, but without any hard dependencies.
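
For illustration, the soft-dependency pattern mentioned above looks roughly like this (a sketch with a hypothetical helper name, not the exact code of the PR):

```python
import numpy as np

# Optional dependency: shap is only needed when SHAP importance is requested.
try:
    import shap
    _SHAP_AVAILABLE = True
except ImportError:
    _SHAP_AVAILABLE = False

def shap_importance(model, X):
    """Mean |SHAP value| per feature, with a clear error if shap is missing."""
    if not _SHAP_AVAILABLE:
        raise ImportError(
            "shap is not installed; run `pip install shap` "
            "or use the native or permutation importance instead."
        )
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    # For multiclass models shap_values is a list of arrays, one per class
    if isinstance(shap_values, list):
        return np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
    return np.abs(shap_values).mean(axis=0)
```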

KR
Thomas
