RFECV float (percentage) step does not reduce features in each iteration as described #10368
Comments
It is reducing by the same number of features in each step. Thus choosing 0.1 means 11 iterations. A clarification to the documentation is welcome, in the first instance.
Maybe it's worth considering removing fewer features in each subsequent step? We could consider adding a separate parameter to control this, but only if the literature or an empirical demonstration showed it was especially helpful.
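For concreteness, a small sketch of the current behaviour being described (hypothetical feature counts; the float step is converted once, up front, into a fixed absolute step):

```python
import math

def rfe_iterations_fixed(n_features, step_frac, n_select=1):
    """Iterations RFE needs when a float step is converted once,
    up front, into a fixed absolute step (the behaviour described
    in the comment above; an illustrative sketch, not rfe.py)."""
    step = max(1, int(step_frac * n_features))  # computed once, never updated
    return math.ceil((n_features - n_select) / step)

# With 100 features and step=0.1, the step is fixed at 10 features
# per iteration regardless of how many remain.
print(rfe_iterations_fixed(100, 0.1))
```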
Hi there - the behavior I thought the float step option should have is exactly what Guyon et al. did and recommended in the SVM-RFE paper that RFECV is based on. In Section 5 (Experimental Results), in their experiments testing and comparing the algorithm to others, they do their runs removing half of the remaining features (genes) in each iteration:
That would be step=0.5 under the behavior I expected. Most importantly, in Section 6.1 (Computational Considerations) they recommend stepping in this way:
Having the RFECV float step option work as I described is very valuable to life-sciences researchers like myself who use scikit-learn. In genomics we can now measure 50-100k or more features, depending on the technology used, and, for classification as Guyon et al. described above, discriminating between two classes typically comes down to a very small subset of features (genes) that are relevant. The vast majority of features are not important, so one can save greatly on the number of iterations without losing accuracy by initially eliminating large numbers of features and eliminating fewer and fewer as the remaining set becomes smaller. This can easily be achieved by removing a fixed percentage of the remaining features in each iteration (the way I thought the float step option should work). I've attached the SVM-RFE Guyon et al. paper for your convenience - A_1012487302797.pdf
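To illustrate the savings that schedule gives, here is a sketch (with hypothetical feature counts) comparing iteration counts for a fixed absolute step versus removing half of the remaining features each time:

```python
def iterations_fixed(n_features, step, n_select=1):
    """Fixed absolute step: the iteration count grows linearly
    with the number of starting features."""
    count = 0
    while n_features > n_select:
        n_features -= min(step, n_features - n_select)
        count += 1
    return count

def iterations_halving(n_features, n_select=1):
    """Remove half of the *remaining* features each iteration, as
    Guyon et al. do: the iteration count grows logarithmically."""
    count = 0
    while n_features > n_select:
        drop = max(1, n_features // 2)
        n_features -= min(drop, n_features - n_select)
        count += 1
    return count

# e.g. reducing 100,000 features down to 1:
print(iterations_fixed(100_000, 10))  # thousands of iterations
print(iterations_halving(100_000))    # a few dozen at most
```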
Hi again - I believe the easiest way to modify the step functionality to work as described is the following. I copied the first part of the step if statement here, lines 155-156: scikit-learn/sklearn/feature_selection/rfe.py Lines 155 to 156 in 5311c81
and moved it into the while loop below it, replacing n_features with np.sum(support_):
Am I breaking or missing anything by doing that? I know you would want to provide a new option for this functionality; I'm just testing it with a local fix.
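A simulation of the elimination schedule that local change produces (an illustrative sketch, not the actual rfe.py source; `support_` follows RFE's convention of a boolean mask over surviving features):

```python
import numpy as np

def decaying_schedule(n_features, step, n_select=1):
    """Simulate eliminations when the float step is recomputed each
    iteration from np.sum(support_), i.e. from the features still
    surviving, rather than once from the initial n_features."""
    support_ = np.ones(n_features, dtype=bool)
    removed_per_iter = []
    while np.sum(support_) > n_select:
        remaining = int(np.sum(support_))
        if 0.0 < step < 1.0:
            n_remove = max(1, int(step * remaining))  # recomputed each pass
        else:
            n_remove = int(step)
        n_remove = min(n_remove, remaining - n_select)
        # which features get dropped is irrelevant to the schedule,
        # so simply drop the last surviving ones
        support_[np.flatnonzero(support_)[-n_remove:]] = False
        removed_per_iter.append(n_remove)
    return removed_per_iter

# With step=0.5 the removals decay: half of whatever remains each time.
print(decaying_schedule(100, 0.5))  # [50, 25, 12, 6, 3, 2, 1]
```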
It's easy to fix computationally, but we try to maintain backwards compatibility from release to release, so a solution, IMO, will involve adding a parameter.
Hi - I can create a pull request with a better option name and code than the local fix I tested above. What should we call this option?
Not sure right now. Happy for you to propose something, as you are obviously more familiar with the algorithm and its literature. Thanks.
What are your thoughts on having a boolean parameter reduce_step (optional, default False) where, if set to True and step is a float, it turns on the behavior I describe here? Just checking whether that is appropriate for sklearn and good design (or not).
I would also suggest having an option with a decaying number of features to drop at each step. The purpose of RFE is to find the best-performing feature subset.
If the number of input features is large but the best subset size is unknown, setting step to too small an integer is computationally inefficient, while dropping too many features would very likely skip the best subset. In general, we want to be aggressive at the beginning but conservative as we approach the optimum, as in other iterative methods. Making the float step a fraction of the number of features remaining at each step is much more meaningful by logic. I don't think it needs any rigorous academic support.
I don't mind that kind of parameter. PR welcome.
Hi @chenhuiyou and @jnothman - I've been using my own customized code since the OP, and I will submit a PR. As discussed, I changed the behavior so that a float step between 0 and 1 removes that fraction of the features remaining at each step (so decaying, not fixed). I also introduced a new feature-number parameter where, if specified, stepping reverts to one feature at a time, no matter what was set before. Both changes follow the original Guyon et al. SVM-RFE recommendations.
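One reading of that schedule, sketched below with hypothetical parameter names (assumed here: the fraction applies until the survivor count reaches the new parameter, after which one feature is removed per iteration, per Guyon et al.'s recommendation):

```python
def guyon_schedule(n_features, step_frac, fine_tune_at, n_select=1):
    """Remove `step_frac` of the remaining features per iteration,
    reverting to one-at-a-time elimination once the survivor count
    drops to `fine_tune_at` (a hypothetical parameter name)."""
    remaining, removals = n_features, []
    while remaining > n_select:
        if remaining > fine_tune_at:
            drop = max(1, int(step_frac * remaining))
        else:
            drop = 1  # fine-grained elimination near the optimum
        drop = min(drop, remaining - n_select)
        remaining -= drop
        removals.append(drop)
    return removals

# 1000 features: halve until 100 or fewer remain, then step by 1.
print(guyon_schedule(1000, 0.5, 100))
```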
So what do we have at the end? Is it the standard SVM-RFE?
I'm not sure exactly what you mean, but the original SVM-RFE paper which I referenced in this thread describes doing exactly what I'm working on in the pull request. So, short answer: yes, it will be standard SVM-RFE.
Was this ever implemented in the main sklearn RFE package? I've been needing the same thing, where I would like the same percentage of the remaining features to be eliminated at each iteration. I've tried editing the code locally, but it won't pip install in my environment, so I was wondering whether this PR worked out.
Regarding the paper you linked above by Guyon et al., I echo your sentiments completely, as I'm currently working with large methylation data sets that need feature selection. How did Guyon et al. do it without sklearn? Is there a way to perform it if the PR does not go through?
Hi scikit-learn dev team,
Apologies if I am not understanding how the step feature of RFECV works. With RFECV, when you specify an integer step, you see in verbose mode that it steps down by exactly that number of features until zero, e.g. step=100:
But with a float (percentage) step it doesn't do what you'd expect. If I specify step=0.1, I would expect it to reduce the number of features by exactly 10% of the remaining features in each iteration, so that it removes fewer and becomes more fine-tuned as the number of features decreases. But it seems to do quite the opposite, removing more than 10% of the remaining features as it gets closer to zero:
This doesn't seem consistent with the integer step specification (and in general I don't think users would want the percentage step to behave this way). The reason I would like the percentage step to work as described above is that there are use cases where one starts with a lot of features and would like to fine-tune each RFECV iteration, removing fewer features as it gets closer to zero.
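A sketch of why the current behaviour looks inverted (hypothetical numbers): when the float step is converted once into a fixed absolute step, the *fraction* of remaining features removed grows every iteration.

```python
def relative_removal(n_features, step_frac):
    """Fraction of the remaining features removed per iteration when
    the float step is fixed up front from the initial feature count
    (an illustrative sketch of the behaviour reported above)."""
    step = max(1, int(step_frac * n_features))  # fixed absolute step
    remaining, fractions = n_features, []
    while remaining > step:
        fractions.append(step / remaining)
        remaining -= step
    return fractions

# step=0.1 with 100 features: 10%, then 11%, 12.5%, ... up to 50%
# of whatever remains -- the opposite of fine-tuning near the end.
print([round(f, 3) for f in relative_removal(100, 0.1)])
```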