
RFECV float (percentage) step does not reduce features in each iteration as described #10368

Open
hermidalc opened this issue Dec 24, 2017 · 14 comments

Comments

@hermidalc
Contributor

hermidalc commented Dec 24, 2017

Hi scikit-learn dev team,

Apologies if I am not understanding how the step feature of RFECV works. With RFECV, when you specify an integer step, you see in verbose mode that it steps down by exactly that number of features in each iteration, e.g. step=100:

...
Fitting estimator with 1013 features.
Fitting estimator with 913 features.
Fitting estimator with 813 features.
Fitting estimator with 713 features.
Fitting estimator with 613 features.
Fitting estimator with 513 features.
Fitting estimator with 413 features.
Fitting estimator with 313 features.
Fitting estimator with 213 features.
Fitting estimator with 113 features.
Fitting estimator with 13 features.

But with a float (percentage) step it doesn't do what you'd expect. If I specify step=0.1, I would expect it to reduce the number of features by exactly 10% of the remaining features in each iteration, so that it removes fewer features and becomes more fine-tuned as the number of features decreases. But it seems to do quite the opposite, removing more than 10% of the remaining features as it gets closer to zero:

Fitting estimator with 54613 features.
Fitting estimator with 49152 features.
Fitting estimator with 43691 features.
Fitting estimator with 38230 features.
Fitting estimator with 32769 features.
Fitting estimator with 27308 features.
Fitting estimator with 21847 features.
Fitting estimator with 16386 features.
Fitting estimator with 10925 features.
Fitting estimator with 5464 features.
Fitting estimator with 3 features.

This doesn't seem to be consistent with the integer step behavior (and in general I don't think users would want a percentage step to behave this way). The reason I would like the percentage step to work as described above is that there are use cases where one starts with a lot of features and would like to fine-tune each RFECV iteration, removing fewer features as the count gets closer to zero.
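
To illustrate the difference, here is a small standalone sketch (my own illustration, not scikit-learn code) comparing the fixed decrement the current implementation appears to use against the decaying per-iteration percentage I expected:

def fixed_schedule(n_features, frac):
    # Current behavior: the step is computed once from the initial count.
    step = max(1, int(frac * n_features))
    counts = [n_features]
    while counts[-1] > step:
        counts.append(counts[-1] - step)
    return counts

def decaying_schedule(n_features, frac):
    # Expected behavior: the step is recomputed from the remaining count.
    counts = [n_features]
    while counts[-1] > 1:
        counts.append(counts[-1] - max(1, int(frac * counts[-1])))
    return counts

print(fixed_schedule(54613, 0.1)[:4])     # [54613, 49152, 43691, 38230] (constant -5461)
print(decaying_schedule(54613, 0.1)[:4])  # [54613, 49152, 44237, 39814] (shrinking decrement)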

@jnothman
Member

jnothman commented Dec 24, 2017 via email

@hermidalc
Contributor Author

hermidalc commented Dec 25, 2017

Hi there - the float step behavior I thought RFECV should have is exactly what Guyon et al. did and recommended in the SVM-RFE paper that RFECV is based on. In Section 5 (Experimental Results), in their set of experiments testing the algorithm and comparing it to others, they remove half of the remaining features (genes) in each iteration:

... From each gene expression value, we subtracted its mean and divided the result by its standard deviation. We used the Recursive Feature Elimination (RFE) method, as explained in Section 3. We eliminated chunks of genes at a time. At the first iteration, we reached the number of genes, which is the closest power of 2. At subsequent iterations, we eliminated half of the remaining genes. ...

That would be step=0.5 with the option working the way I thought it should. Most importantly, in Section 6.1 (Computational Considerations) they recommend stepping in this way:

All our feature selection experiments using various classifiers (SVM, LDA, MSE) indicated that better features are obtained by using RFE than by using the weights of a single classifier (see Section 6.2 for details). Similarly, better results are obtained by eliminating one feature at a time than by eliminating chunks of features. However, there are only significant differences for the smaller subset of genes (less than 100). This suggests that, without trading accuracy for speed, one can use RFE by removing chunks of features in the first few iterations and then remove one feature at a time once the feature set reaches a few hundreds. This may become necessary if the number of genes increases to millions, as is expected to happen in the near future.

Having the RFECV float step option work as I described is very valuable to life sciences researchers like myself who use scikit-learn. In genomics we can now measure 50-100k or more features depending on the technology used and, for classification, as Guyon et al. describe above, discriminating between two classes typically comes down to a very small subset of features (genes) that are relevant. The vast majority of features are not important, so one can save greatly on the number of iterations without losing accuracy by eliminating large numbers of features initially and fewer and fewer as the remaining set becomes smaller. This is easily achieved by removing a fixed percentage of the remaining features in each iteration (the way I thought the float step option should work).
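
To put rough numbers on the savings (my arithmetic, not figures from the paper): halving the remaining features each iteration reaches a single feature in about log2(n) fits, versus n - 1 fits when removing one feature at a time:

import math

n = 54613
print(n - 1)                    # 54612 fits removing one feature at a time
print(math.ceil(math.log2(n)))  # 16 fits halving the remaining features each time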

I've attached the SVM-RFE Guyon et al. paper for your convenience - A_1012487302797.pdf

@hermidalc
Contributor Author

hermidalc commented Dec 25, 2017

Hi again - I believe the easiest way to modify the step functionality to work as described is the following. I copied the first part of the step if statement here, from lines 155-156:

# step is computed once, from the initial n_features
if 0.0 < self.step < 1.0:
    step = int(max(1, self.step * n_features))

and I put that into the while loop below it, replacing n_features with np.sum(support_):

# Elimination
while np.sum(support_) > n_features_to_select:
    # Remaining features
    features = np.arange(n_features)[support_]

    # Recompute the step from the number of remaining features
    if 0.0 < self.step < 1.0:
        step = int(max(1, self.step * np.sum(support_)))

    # Rank the remaining features

Am I breaking or missing anything by doing that? I know you would want to provide a new option for this functionality; I'm just testing it with a local fix.
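
As a quick sanity check of the patched loop's schedule (a simulation I wrote that skips the estimator fitting entirely), step=0.5 reproduces the halving behavior from the paper:

import numpy as np

n_features, step_frac = 1013, 0.5
support_ = np.ones(n_features, dtype=bool)
while np.sum(support_) > 1:
    n_remaining = int(np.sum(support_))
    print("Fitting estimator with %d features." % n_remaining)
    step = int(max(1, step_frac * n_remaining))
    # Pretend the step lowest-ranked features were eliminated:
    support_[np.flatnonzero(support_)[-step:]] = False

This prints 1013, 507, 254, 127, 64, 32, 16, 8, 4, 2 - the same halving sequence Guyon et al. describe.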

@jnothman
Member

jnothman commented Dec 25, 2017 via email

@hermidalc
Contributor Author

Hi - I can create a pull request with a proper option and cleaner code than the local fix above that I used for testing. What should we call this option?

@jnothman
Member

jnothman commented Dec 31, 2017 via email

@hermidalc
Contributor Author

What are your thoughts on having a boolean parameter reduce_step (optional, default False), where if it is set to True and step is set to a float, it turns on the behavior I describe here? Just checking whether that is appropriate for sklearn and good design (or not).
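
As a sketch, the hypothetical reduce_step parameter (it does not exist in scikit-learn; this is only the proposal) would gate the step computation inside the elimination loop like this:

# Hypothetical reduce_step parameter; not part of the current RFE/RFECV API.
if 0.0 < self.step < 1.0:
    if self.reduce_step:
        # Proposed: recompute the step from the remaining features.
        step = int(max(1, self.step * np.sum(support_)))
    else:
        # Current: fixed step from the initial feature count.
        step = int(max(1, self.step * n_features))
else:
    step = int(self.step)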

@FrancisHChen

FrancisHChen commented Sep 12, 2018

I would also suggest having an option with a decaying number of features to drop at each step.

The purpose of RFE is to

'find the optimal number of features'.

If the number of input features is large but the size of the best subset is unknown, setting 'step' to too small an integer is computationally inefficient, while dropping too many features at once would very likely skip over the best subset.

In general, we want to be aggressive at the beginning but conservative as we approach the optimum, as in other iterative methods.

Making the float step a fraction of the number of features remaining at each step is logically much more meaningful. I don't think it needs rigorous academic support.

@jnothman
Member

jnothman commented Sep 13, 2018 via email

@hermidalc
Contributor Author

Hi @chenhuiyou and @jnothman - I've been using my own customized code since the OP; I will submit a PR. As discussed, I changed the behavior so that a float step between 0.0 and 1.0 removes that percentage of the features remaining at each step (so the step decays rather than staying fixed). I also introduced a new feature-count parameter where, if specified, stepping reverts to one feature at a time once the remaining set reaches that size, regardless of what step was set before. Both changes follow the original Guyon et al. SVM-RFE recommendations.
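
Roughly, the step logic I've been running looks like this (fine_step_at is a placeholder name here, not necessarily what the PR will use):

def step_size(step_frac, n_remaining, fine_step_at=None):
    # Revert to one-at-a-time elimination once the remaining set is small.
    if fine_step_at is not None and n_remaining <= fine_step_at:
        return 1
    # Otherwise remove a fixed fraction of the remaining features.
    return max(1, int(step_frac * n_remaining))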

@malikyousef

malikyousef commented Aug 25, 2019

So what do we have at the end? Is it the standard SVM-RFE?

@hermidalc
Contributor Author

So what do we have at the end? Is it the standard SVM-RFE?

Not sure exactly what you mean, but the original SVM-RFE algorithm paper that I referenced in the thread describes doing exactly what I'm working on in the pull request. So the short answer is yes, it will be standard SVM-RFE.

@adamyli

adamyli commented Jul 24, 2020

Was this ever implemented for the main sklearn RFE package? I've been needing the same thing, where the same percentage of the remaining features is eliminated at each iteration. I've tried editing the code locally, but it won't pip install in my environment, so I was wondering if this PR worked out.

@adamyli

adamyli commented Jul 24, 2020

So what do we have at the end? Is it the standard SVM-RFE?

Not sure exactly what you mean, but the original SVM-RFE algorithm paper that I referenced in the thread describes doing exactly what I'm working on in the pull request. So the short answer is yes, it will be standard SVM-RFE.

Regarding the paper you linked above by Guyon et al., I echo your sentiments completely, as I'm currently working with large methylation data sets that need feature selection. How did Guyon et al. do it without sklearn? Is there a way to perform it if the PR does not go through?
Thanks!
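
For what it's worth, one way to approximate this with the released scikit-learn API (my own sketch, assuming a linear-kernel SVC as the estimator) is to call RFE repeatedly, removing a fixed fraction of the remaining features in each round:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=500, random_state=0)

frac, n_final = 0.5, 10  # drop half of the remaining features per round
support = np.ones(X.shape[1], dtype=bool)
while support.sum() > n_final:
    n_remaining = int(support.sum())
    n_keep = max(n_final, n_remaining - max(1, int(frac * n_remaining)))
    # One fit per round: step covers the whole chunk being eliminated.
    rfe = RFE(SVC(kernel="linear"), n_features_to_select=n_keep,
              step=n_remaining - n_keep)
    rfe.fit(X[:, support], y)
    # Map the surviving features back to the original column indices.
    support[np.flatnonzero(support)[~rfe.support_]] = False

print("Selected feature indices:", np.flatnonzero(support))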
