
RFECV float (percentage) step does not reduce features in each iteration as described #10368

Open
hermidalc opened this issue Dec 24, 2017 · 14 comments

Comments

@hermidalc
Contributor

hermidalc commented Dec 24, 2017

Hi scikit-learn dev team,

Apologies if I am not understanding how the step feature of RFECV works. With RFECV, when you specify an integer step, you see in verbose mode that it steps down by exactly that number of features in each iteration, e.g. step=100:

...
Fitting estimator with 1013 features.
Fitting estimator with 913 features.
Fitting estimator with 813 features.
Fitting estimator with 713 features.
Fitting estimator with 613 features.
Fitting estimator with 513 features.
Fitting estimator with 413 features.
Fitting estimator with 313 features.
Fitting estimator with 213 features.
Fitting estimator with 113 features.
Fitting estimator with 13 features.

But with a float (percentage) step it doesn't do what you'd expect. If I specify step=0.1, I would expect it to reduce the number of features by exactly 10% of the remaining features in each iteration, so that it removes fewer features and becomes more fine-tuned as the number of features decreases. But it seems to do quite the opposite, removing more than 10% of the remaining features as it gets closer to zero:

Fitting estimator with 54613 features.
Fitting estimator with 49152 features.
Fitting estimator with 43691 features.
Fitting estimator with 38230 features.
Fitting estimator with 32769 features.
Fitting estimator with 27308 features.
Fitting estimator with 21847 features.
Fitting estimator with 16386 features.
Fitting estimator with 10925 features.
Fitting estimator with 5464 features.
Fitting estimator with 3 features.

This doesn't seem to be consistent with the integer step behavior (and in general I don't think users would want a percentage step to behave this way). The reason I would like the percentage step to work as described above is that there are use cases where one starts with a lot of features and would like to fine-tune each RFECV iteration, removing fewer features as the count gets closer to zero.
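
To illustrate the difference, here is a small standalone sketch (my own illustration, not scikit-learn code) comparing the fixed decrement the current implementation appears to use against the decaying per-iteration percentage I expected:

def fixed_schedule(n_features, frac):
    # Current behavior: the step is computed once from the initial count.
    step = max(1, int(frac * n_features))
    counts = [n_features]
    while counts[-1] > step:
        counts.append(counts[-1] - step)
    return counts

def decaying_schedule(n_features, frac):
    # Expected behavior: the step is recomputed from the remaining count.
    counts = [n_features]
    while counts[-1] > 1:
        counts.append(counts[-1] - max(1, int(frac * counts[-1])))
    return counts

print(fixed_schedule(54613, 0.1)[:4])     # [54613, 49152, 43691, 38230] (constant -5461)
print(decaying_schedule(54613, 0.1)[:4])  # [54613, 49152, 44237, 39814] (shrinking decrement)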

@jnothman
Member

jnothman commented Dec 24, 2017 via email

@hermidalc
Contributor Author

hermidalc commented Dec 25, 2017

Hi there - the float step behavior I thought RFECV should have is exactly what Guyon et al. did and recommended in the SVM-RFE paper that RFECV is based on. In Section 5 (Experimental Results), in their set of experiments testing the algorithm and comparing it to others, they remove half of the remaining features (genes) in each iteration:

... From each gene expression value, we subtracted its mean and divided the result by its standard deviation. We used the Recursive Feature Elimination (RFE) method, as explained in Section 3. We eliminated chunks of genes at a time. At the first iteration, we reached the number of genes, which is the closest power of 2. At subsequent iterations, we eliminated half of the remaining genes. ...

That would be step=0.5 with the option working the way I thought it should. Most importantly, in Section 6.1 (Computational Considerations) they recommend stepping in this way:

All our feature selection experiments using various classifiers (SVM, LDA, MSE) indicated that better features are obtained by using RFE than by using the weights of a single classifier (see Section 6.2 for details). Similarly, better results are obtained by eliminating one feature at a time than by eliminating chunks of features. However, there are only significant differences for the smaller subset of genes (less than 100). This suggests that, without trading accuracy for speed, one can use RFE by removing chunks of features in the first few iterations and then remove one feature at a time once the feature set reaches a few hundreds. This may become necessary if the number of genes increases to millions, as is expected to happen in the near future.

Having the RFECV float step option work as I described is very valuable to life sciences researchers like myself who use scikit-learn. In genomics we can now measure 50-100k or more features depending on the technology used and, for classification, as Guyon et al. describe above, discriminating between two classes typically comes down to a very small subset of features (genes) that are relevant. The vast majority of features are not important, so one can save greatly on the number of iterations without losing accuracy by eliminating large numbers of features initially and fewer and fewer as the remaining set becomes smaller. This is easily achieved by removing a fixed percentage of the remaining features in each iteration (the way I thought the float step option should work).
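
To put rough numbers on the savings (my arithmetic, not figures from the paper): halving the remaining features each iteration reaches a single feature in about log2(n) fits, versus n - 1 fits when removing one feature at a time:

import math

n = 54613
print(n - 1)                    # 54612 fits removing one feature at a time
print(math.ceil(math.log2(n)))  # 16 fits halving the remaining features each time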

I've attached the SVM-RFE Guyon et al. paper for your convenience - A_1012487302797.pdf

@hermidalc
Contributor Author

hermidalc commented Dec 25, 2017

Hi again - I believe the easiest way to modify the step functionality to work as described is the following. I copied the first part of the step if statement here, from lines 155-156:

# step is computed once, from the initial n_features
if 0.0 < self.step < 1.0:
    step = int(max(1, self.step * n_features))

and I put that into the while loop below it, replacing n_features with np.sum(support_):

# Elimination
while np.sum(support_) > n_features_to_select:
    # Remaining features
    features = np.arange(n_features)[support_]

    # Recompute the step from the number of remaining features
    if 0.0 < self.step < 1.0:
        step = int(max(1, self.step * np.sum(support_)))

    # Rank the remaining features

Am I breaking or missing anything by doing that? I know you would want to provide a new option for this functionality; I'm just testing it with a local fix.
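
As a quick sanity check of the patched loop's schedule (a simulation I wrote that skips the estimator fitting entirely), step=0.5 reproduces the halving behavior from the paper:

import numpy as np

n_features, step_frac = 1013, 0.5
support_ = np.ones(n_features, dtype=bool)
while np.sum(support_) > 1:
    n_remaining = int(np.sum(support_))
    print("Fitting estimator with %d features." % n_remaining)
    step = int(max(1, step_frac * n_remaining))
    # Pretend the step lowest-ranked features were eliminated:
    support_[np.flatnonzero(support_)[-step:]] = False

This prints 1013, 507, 254, 127, 64, 32, 16, 8, 4, 2 - the same halving sequence Guyon et al. describe.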

@jnothman
Member

jnothman commented Dec 25, 2017 via email

@hermidalc
Contributor Author

Hi - I can create a pull request with a proper option and cleaner code than the local fix above that I used for testing. What should we call this option?

@jnothman
Member

jnothman commented Dec 31, 2017 via email

@hermidalc
Contributor Author

What are your thoughts on having a boolean parameter reduce_step (optional, default False), where if it is set to True and step is set to a float, it turns on the behavior I describe here? Just checking whether that is appropriate for sklearn and good design (or not).
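
As a sketch, the hypothetical reduce_step parameter (it does not exist in scikit-learn; this is only the proposal) would gate the step computation inside the elimination loop like this:

# Hypothetical reduce_step parameter; not part of the current RFE/RFECV API.
if 0.0 < self.step < 1.0:
    if self.reduce_step:
        # Proposed: recompute the step from the remaining features.
        step = int(max(1, self.step * np.sum(support_)))
    else:
        # Current: fixed step from the initial feature count.
        step = int(max(1, self.step * n_features))
else:
    step = int(self.step)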

@FrancisHChen

FrancisHChen commented Sep 12, 2018

I would also suggest having an option with a decaying number of features to drop at each step.

The purpose of RFE is to

'find the optimal number of features'.

If the number of input features is large but the size of the best subset is unknown, setting 'step' to too small an integer is computationally inefficient, while dropping too many features at once would very likely skip over the best subset.

In general, we want to be aggressive at the beginning but conservative as we approach the optimum, as in other iterative methods.

Making the float step a fraction of the number of features remaining at each step is logically much more meaningful. I don't think it needs rigorous academic support.

@jnothman
Member

jnothman commented Sep 13, 2018 via email

@hermidalc
Contributor Author

Hi @chenhuiyou and @jnothman - I've been using my own customized code since the OP; I will submit a PR. As discussed, I changed the behavior so that a float step between 0.0 and 1.0 removes that percentage of the features remaining at each step (so the step decays rather than staying fixed). I also introduced a new feature-count parameter where, if specified, stepping reverts to one feature at a time once the remaining set reaches that size, regardless of what step was set before. Both changes follow the original Guyon et al. SVM-RFE recommendations.
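
Roughly, the step logic I've been running looks like this (fine_step_at is a placeholder name here, not necessarily what the PR will use):

def step_size(step_frac, n_remaining, fine_step_at=None):
    # Revert to one-at-a-time elimination once the remaining set is small.
    if fine_step_at is not None and n_remaining <= fine_step_at:
        return 1
    # Otherwise remove a fixed fraction of the remaining features.
    return max(1, int(step_frac * n_remaining))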

@malikyousef

malikyousef commented Aug 25, 2019

So what do we have at the end? Is it the standard SVM-RFE?

@hermidalc
Contributor Author

So what do we have at the end? Is it the standard SVM-RFE?

Not sure exactly what you mean, but the original SVM-RFE algorithm paper that I referenced in the thread describes doing exactly what I'm working on in the pull request. So the short answer is yes, it will be standard SVM-RFE.

@adamyli

adamyli commented Jul 24, 2020

Was this ever implemented for the main sklearn RFE package? I've been needing the same thing, where the same percentage of the remaining features is eliminated at each iteration. I've tried editing the code locally, but it won't pip install in my environment, so I was wondering if this PR worked out.

@adamyli

adamyli commented Jul 24, 2020

So what do we have at the end? Is it the standard SVM-RFE?

Not sure exactly what you mean, but the original SVM-RFE algorithm paper that I referenced in the thread describes doing exactly what I'm working on in the pull request. So the short answer is yes, it will be standard SVM-RFE.

Regarding the paper you linked above by Guyon et al., I echo your sentiments completely, as I'm currently working with large methylation data sets that need feature selection. How did Guyon et al. do it without sklearn? Is there a way to perform it if the PR does not go through?
Thanks!
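
For what it's worth, one way to approximate this with the released scikit-learn API (my own sketch, assuming a linear-kernel SVC as the estimator) is to call RFE repeatedly, removing a fixed fraction of the remaining features in each round:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=500, random_state=0)

frac, n_final = 0.5, 10  # drop half of the remaining features per round
support = np.ones(X.shape[1], dtype=bool)
while support.sum() > n_final:
    n_remaining = int(support.sum())
    n_keep = max(n_final, n_remaining - max(1, int(frac * n_remaining)))
    # One fit per round: step covers the whole chunk being eliminated.
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_keep,
              step=n_remaining - n_keep)
    rfe.fit(X[:, support], y)
    # Map the surviving features back to the original column indices.
    support[np.flatnonzero(support)[~rfe.support_]] = False

print("Selected feature indices:", np.flatnonzero(support))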
