[MRG+2] Fix edge case of tied CV scores in RFECV (#9222)
* Fix edge case of tied CV scores in RFECV

In the feature_selection module, RFECV selects the model with the
highest cross-validation score. In the event of CV score ties, one
expects RFECV to return the best model with the fewest features.
This fix addresses such an edge case where two or more models have
identical cross-validation scores.

* Adding an entry to what's new addressing bug fix in RFECV edge case

* Re-add what's new entry

* Use double backticks in whats_new entry
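The tie-breaking idea described above can be sketched in plain Python. This is an illustration, not the library code: it assumes `scores[i]` holds the summed CV score after `i` elimination rounds (so index 0 corresponds to all features), and the helper names and the `min_features` parameter are hypothetical stand-ins for the `np.argmax` calls and the existing lower bound in `fit`.

```python
def first_argmax(scores):
    """Index of the first maximum; on ties this matches np.argmax."""
    # max() returns the first maximal item it encounters.
    return max(range(len(scores)), key=lambda i: scores[i])


def last_argmax(scores):
    """Index of the last maximum: reverse, take the first argmax, map back."""
    scores_rev = scores[::-1]
    return len(scores) - first_argmax(scores_rev) - 1


def n_features_selected(scores, n_features, step, min_features=1):
    """Feature count kept when ties resolve to the fewest features.

    min_features is a hypothetical name for the lower bound.
    """
    return max(n_features - last_argmax(scores) * step, min_features)


# Four features, step=1, every model tied at the same score.
tied = [1.0, 1.0, 1.0, 1.0]
before_fix = 4 - first_argmax(tied) * 1          # first maximum wins: keeps all 4
after_fix = n_features_selected(tied, 4, 1)      # last maximum wins: keeps 1
```

Later indices correspond to fewer remaining features, so picking the last maximum instead of the first is what turns a tie into the smallest model.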
nickypie authored and amueller committed May 24, 2018
1 parent 68b981f commit 1755b89
Showing 3 changed files with 13 additions and 1 deletion.
5 changes: 5 additions & 0 deletions doc/whats_new/v0.20.rst
@@ -471,6 +471,11 @@ Preprocessing
   ``inverse_transform`` on unseen labels. :issue:`9816` by :user:`Charlie Newey
   <newey01c>`.

+Feature selection
+
+- Fixed computation of ``n_features_to_compute`` for edge case with tied
+  CV scores in :class:`RFECV`. :issue:`9222` by `Nick Hoh <nickypie>`.
+
 Model evaluation and meta-estimators

 - Add improved error message in :func:`model_selection.cross_val_score` when
4 changes: 3 additions & 1 deletion sklearn/feature_selection/rfe.py
@@ -449,8 +449,10 @@ def fit(self, X, y, groups=None):
             for train, test in cv.split(X, y, groups))

         scores = np.sum(scores, axis=0)
+        scores_rev = scores[::-1]
+        argmax_idx = len(scores) - np.argmax(scores_rev) - 1
         n_features_to_select = max(
-            n_features - (np.argmax(scores) * step),
+            n_features - (argmax_idx * step),
             n_features_to_select)

         # Re-execute an elimination with best_k over the whole set
5 changes: 5 additions & 0 deletions sklearn/feature_selection/tests/test_rfe.py
@@ -167,6 +167,11 @@ def test_scorer(estimator, X, y):
                   scoring=test_scorer)
     rfecv.fit(X, y)
     assert_array_equal(rfecv.grid_scores_, np.ones(len(rfecv.grid_scores_)))
+    # In the event of cross validation score ties, the expected behavior of
+    # RFECV is to return the FEWEST features that maximize the CV score.
+    # Because test_scorer always returns 1.0 in this example, RFECV should
+    # reduce the dimensionality to a single feature (i.e. n_features_ = 1)
+    assert_equal(rfecv.n_features_, 1)

     # Same as the first two tests, but with step=2
     rfecv = RFECV(estimator=SVC(kernel="linear"), step=2, cv=5)
