RFECV is broken #2403

pprett opened this Issue Aug 28, 2013 · 6 comments

4 participants

scikit-learn member

RFECV has some issues

  1. scores.shape[1] is chosen as n_features, which equals the number of evaluations only in the worst case (step size 1); if a different step size is chosen, not all cells are filled. Cells are initialized with zero and each evaluation's score is added to its cell, but scores can be either all negative or all positive (depending on the scorer); finally the max is chosen. This is a bug in the case of negative scores and step != 1, because the unfilled zero cells beat every actual score.

  2. best score is the sum of all CV evaluations not the mean - this is confusing because the best score cannot be related to the other CV experiments.

  3. the number of final features selected is k -- which seems to refer to the RFE.ranking_ of features, where ties have the same rank. Imagine a ranking where there are two blocks of tied features [1,1,1,2,2,2]. Since k is at most 1, at most two features (n_features_to_select = k+1) could be selected ... maybe I'm not getting the whole picture but it smells fishy (ping @amueller @NicolasTr )
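A minimal sketch of the bug in point 1 (illustrative numbers, not scikit-learn's actual code): scores are accumulated into a zero-initialized array of length n_features, but with step > 1 only some cells are ever written, so with an all-negative scorer the argmax lands on an unevaluated cell.

```python
import numpy as np

n_features = 6
# Hypothetical accumulated CV scores for the evaluated cells only; all
# negative, as with a scorer such as negated mean squared error.
evaluated = {0: -0.9, 2: -0.5, 4: -0.7}

scores = np.zeros(n_features)  # unevaluated cells stay at 0.0
for k, s in evaluated.items():
    scores[k] += s

# The "best" index is an unevaluated cell: 0.0 beats every real score.
best = np.argmax(scores)
print(best, scores[best])  # -> 1 0.0
```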

scikit-learn member

Problem 1) could be addressed by initializing with NaN and using masked arrays. I will submit a PR later.
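One way the NaN-initialization fix could look (a sketch of the idea, not the actual PR): unevaluated cells hold NaN, which nanargmax ignores, so only real scores can win.

```python
import numpy as np

n_features = 6
evaluated = {0: -0.9, 2: -0.5, 4: -0.7}  # same hypothetical scores as above

scores = np.full(n_features, np.nan)  # NaN marks "never evaluated"
for k, s in evaluated.items():
    scores[k] = s

best = np.nanargmax(scores)  # maximum over evaluated cells only
print(best, scores[best])    # -> 2 -0.5
```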


A follow-up: after fit, the grid_scores_ attribute contains the sum of the fold scores divided by the number of folds minus one. This is confusing; for example, as in the docs, when the number of folds is two, it yields the sum of the accuracies of both folds, i.e. values greater than 1. See this Q on SO.
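The arithmetic with made-up fold accuracies, to show why the reported score can exceed 1:

```python
fold_accuracies = [0.9, 0.8]  # hypothetical per-fold accuracies, 2 folds
n_folds = len(fold_accuracies)

buggy = sum(fold_accuracies) / (n_folds - 1)  # divides by 1: exceeds 1.0
fixed = sum(fold_accuracies) / n_folds        # the actual mean, 0.85
print(buggy, fixed)
```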


I was working with RFECV for a few hours and noticed these issues. Some of these have already been mentioned but I reiterate them for completeness and elaboration.

  1. [Critical] grid_scores_ is of shape [n_features] but the documentation states it is [n_subsets_of_features]. Note that with a normalized value of step (i.e.
if 0.0 < self.step < 1.0:
    step = int(self.step * n_features)
else:
    step = int(self.step)

n_subsets_of_features is int(np.ceil((n_features-n_features_to_select)/float(step))) + 1.

scores can be initialized with scores = np.zeros(n_subsets_of_features) on line 331 as a quick fix to address this.

  1. [Critical] n_features_to_select on line 358 is set to k+1. As mentioned, this is incorrect when step != 1. With a normalized step, the correct value depends on k and would be step*k + 1 if it weren't for the endpoints. That is, for k=0, n_features_to_select is always 1; however, depending on whether n_features is divisible by step, the subsequent values of n_features_to_select vary. One could derive a closed-form formula for this or use the sum where ranking_ < k or something similar.

  2. [done] [Critical] On line 371, self.grid_scores_ = scores / n is wrong since n = len(cv) - 1 after the execution of the loop on line 334. Quick fix can be self.grid_scores_ = scores / len(cv).

  3. [Minor] The order of grid_scores_ should be reversed. Currently, it is ascending in the number of features included but since this is for feature elimination analysis rather than that of feature inclusion, it should probably be reversed. The plot http://scikit-learn.org/stable/auto_examples/plot_rfe_with_cross_validation.html#example-plot-rfe-with-cross-validation-py might suggest it is illustrating the latter. The distinction is important because the method for observing the effects of adding features incrementally would be different, since we have no a priori knowledge of the coefficients of the features and can only use univariate feature selection metrics like chi-squared or information gain.

  4. [Improvement] For each subset of features, we fit the model at least 2*len(cv) times (so a total of 2*len(cv)*n_subsets_of_features fits): once on line 339 and again in the loop beginning on line 341. The constant factor of 2 is negligible in most circumstances, but here each step is a non-trivial model-fitting operation, so a significant performance increase could be obtained by refactoring the code so we're not performing so much redundant work.
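Putting the quick fixes from points 1 and 2 (shapes) together as a sketch (illustrative, not the actual diff; the line numbers above refer to the RFECV source discussed in this thread):

```python
import numpy as np

# Hypothetical configuration: eliminate 3 features per step, down to 1.
n_features, n_features_to_select, step = 10, 1, 3

# Number of feature subsets actually evaluated (the formula quoted above):
n_subsets_of_features = int(
    np.ceil((n_features - n_features_to_select) / float(step))) + 1
# Subsets have sizes 10, 7, 4, 1 here, so 4 evaluations, not n_features=10.

# Allocate one cell per evaluated subset rather than one per feature ...
scores = np.zeros(n_subsets_of_features)
# ... (fold scores would be accumulated into `scores` here) ...
# ... and normalize by the number of folds, not len(cv) - 1:
n_folds = 5
grid_scores = scores / n_folds
print(n_subsets_of_features, grid_scores.shape)  # -> 4 (4,)
```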

@amueller amueller added this to the 0.15.1 milestone Jul 18, 2014
scikit-learn member

If anyone has time, it would be good to go through the issues again and see which were fixed in #3824

scikit-learn member

From @ltiao's list,
1. and 2. were fixed, 3. I wouldn't fix, and 4. I don't think can be done.
Hmm, thinking about it, 4. is possible, but it needs a significant amount of refactoring.

scikit-learn member

I'm convinced all issues are fixed, so I'm closing this.

@amueller amueller closed this Mar 10, 2015