[MRG] Replacing grid_scores_ by cv_results_ in _rfe.py #16961
Conversation
Merging changes from the main repository
lgtm
sklearn/feature_selection/_rfe.py (Outdated):

```python
grid_size = len(self.cv_results_) - 2
return np.asarray(
    [self.cv_results_["split{}_score".format(i)]
    for i in range(grid_size)]).T
```
I think you indented this wrongly... and a newline is missing.

```
sklearn/feature_selection/_rfe.py:607:13: E128 continuation line under-indented for visual indent
    for i in range(grid_size)]).T
    ^
sklearn/feature_selection/_rfe.py:607:42: W292 no newline at end of file
    for i in range(grid_size)]).T
                                 ^
```

Exited with code exit status 1
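For reference, a flake8-clean version of that return block would align the comprehension's `for` under the visual indent of the opening bracket and end the file with a newline (a sketch of the fix, not the committed code):

```python
grid_size = len(self.cv_results_) - 2
# E128 fixed: the continuation line is aligned with the visual indent
# of the opening "[".
return np.asarray(
    [self.cv_results_["split{}_score".format(i)]
     for i in range(grid_size)]).T
# W292 fixed: the file ends with a trailing newline after this line.
```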
sklearn/feature_selection/_rfe.py (Outdated):

```python
grid_scores = scores[::-1] / cv.get_n_splits(X, y, groups)
self.cv_results_ = {}
for i in range(grid_scores.shape[0]):
    key = "split{}_score".format(i)
```
key = "split%d_score" % i
Merging changes from the main repository
This needs tests for `grid_scores_` and `cv_results_`.

I think your tests fail because you need to rebase. Check this great tutorial by my dear friend: https://www.youtube.com/watch?v=Gjd44YpucEA
@arka204 are you still working on this?
A Sphinx warning in the documentation is preventing your build from completing.
Once the Sphinx warning is fixed, and if you think this PR is ready for review, would you mind changing the title from [WIP] to [MRG] as specified in the documentation? Thanks!
sklearn/feature_selection/_rfe.py (Outdated):

```
@@ -457,6 +458,24 @@ class RFECV(RFE):
        ``grid_scores_[i]`` corresponds to
        the CV score of the i-th subset of features.

    .. deprecated:: 0.23
        The `grid_scores_` attribute is deprecated in version 0.23 in favor
```
Suggested change (indentation only; the whitespace difference is not visible in this transcript):

```
The `grid_scores_` attribute is deprecated in version 0.23 in favor
```
This indentation fix and the next one will resolve the 'unexpected unindent' Sphinx warning (see the CI artifacts).
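For context, this class of Sphinx warning usually comes from a directive body that is not indented relative to its directive. A correctly nested attribute entry would look roughly like this (a sketch; the wording and version numbers follow the later diff in this thread, and the shape annotation is an assumption):

```python
class RFECV(RFE):
    """...

    Attributes
    ----------
    grid_scores_ : ndarray of shape (n_subsets_of_features,)
        The cross-validation scores.

        .. deprecated:: 0.24
            The `grid_scores_` attribute is deprecated in version 0.24 in
            favor of `cv_results_` and will be removed in version 0.25.
    """
```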
sklearn/feature_selection/_rfe.py (Outdated):

```
@@ -457,6 +458,24 @@ class RFECV(RFE):
        ``grid_scores_[i]`` corresponds to
        the CV score of the i-th subset of features.

    .. deprecated:: 0.23
        The `grid_scores_` attribute is deprecated in version 0.23 in favor
        of `cv_results_` and will be removed in version 0.25
```
Suggested change (indentation only; the whitespace difference is not visible in this transcript):

```
of `cv_results_` and will be removed in version 0.25
```
sklearn/feature_selection/_rfe.py (Outdated):

```
split(i)_score : float
    corresponds to the CV score of the i-th subset of features

mean_score : float
    mean of split(i)_score values in dict

std_score : float
    std of split(i)_score values in dict
```
Maybe the dictionary keys could be listed as a bullet list?
I am okay with this for now; we use this formatting in other places where we return dicts (`fetch_openml`).
…ps://github.com/arka204/scikit-learn into Replacing-grid_scores_-by-cv_results-in-_rfe.py
Can I have your opinion on this, @jnothman, @thomasjpfan?
We need a nontrivial test to make sure `cv_results_["mean_score"]` and `cv_results_["std_score"]` are computed correctly.
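Such a test could, for example, rebuild the per-split matrix from `cv_results_` and compare against NumPy's statistics directly. A sketch (the test name and dataset are hypothetical; the key names follow this PR's proposal):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

def test_rfecv_mean_and_std_are_over_splits():  # hypothetical test name
    X, y = make_classification(n_samples=100, n_features=10, random_state=0)
    rfecv = RFECV(estimator=SVC(kernel="linear"), cv=3).fit(X, y)

    # Stack the per-split score vectors: shape (n_splits, n_subsets).
    split_scores = np.asarray(
        [rfecv.cv_results_["split%d_score" % i] for i in range(3)])

    # mean_score/std_score should be vectors over feature subsets,
    # i.e. statistics taken across the splits (axis 0).
    np.testing.assert_allclose(
        rfecv.cv_results_["mean_score"], split_scores.mean(axis=0))
    np.testing.assert_allclose(
        rfecv.cv_results_["std_score"], split_scores.std(axis=0))
```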
sklearn/feature_selection/_rfe.py (Outdated):

```python
    "The grid_scores_ attribute is deprecated in version 0.24 in favor "
    "of cv_results_ and will be removed in version 0.25"
)
@property  # type: ignore
```
Suggested change:

```diff
- @property  # type: ignore
+ @property
```
sklearn/feature_selection/_rfe.py (Outdated):

```
A dict with keys:

split(i)_score : float
    corresponds to the CV score of the i-th subset of features
```
Suggested change (whitespace only; the indentation difference is not visible in this transcript):

```
corresponds to the CV score of the i-th subset of features
```
sklearn/feature_selection/_rfe.py (Outdated):

```
corresponds to the CV score of the i-th subset of features

mean_score : float
    mean of split(i)_score values in dict
```
Suggested change:

```diff
- mean of split(i)_score values in dict
+ mean of split(i)_score values
```
sklearn/feature_selection/_rfe.py (Outdated):

```
mean of split(i)_score values in dict

std_score : float
    std of split(i)_score values in dict
```
Suggested change:

```diff
- std of split(i)_score values in dict
+ standard deviation of split(i)_score values in dict
```
This is looking good!

We can have `test_rfecv` be the only test that explicitly checks for the deprecation warning. The other tests can be decorated with `ignore_warnings` and drop the `pytest.warns`. For example:

```python
# TODO: Remove in 0.25 when grid_scores_ is removed
@ignore_warnings(category=FutureWarning)
def test_rfecv_cv_results_size():
    ...
```
sklearn/feature_selection/_rfe.py (Outdated):

```python
        return self

    # mypy error: Decorated property not supported
    @deprecated(
```
Suggested change:

```diff
- @deprecated(
+ @deprecated(  # type: ignore
```
```python
with pytest.warns(FutureWarning, match=msg):
    assert len(rfecv.grid_scores_) == score_len

assert (len(rfecv.cv_results_) - 2) == score_len
```
Suggested change:

```diff
- assert (len(rfecv.cv_results_) - 2) == score_len
+ assert len(rfecv.cv_results_) - 2 == score_len
```
```diff
      formula1(n_features, n_features_to_select, step))
- assert (rfecv.grid_scores_.shape[0] ==
+ assert ((len(rfecv.cv_results_) - 2) ==
```
Suggested change:

```diff
- assert ((len(rfecv.cv_results_) - 2) ==
+ assert (len(rfecv.cv_results_) - 2 ==
```
```python
assert (rfecv_cv_results_.keys() == rfecv.cv_results_.keys())
for key in rfecv_cv_results_.keys():
    assert (rfecv_cv_results_[key] == rfecv.cv_results_[key])
```
This is comparing floats:
Suggested change:

```diff
- assert (rfecv_cv_results_[key] == rfecv.cv_results_[key])
+ assert rfecv_cv_results_[key] == pytest.approx(rfecv.cv_results_[key])
```
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
The computation of `cv_results_` seems a little off.

```python
for i in range(grid_scores.shape[0]):
    key = "split%d_score" % i
    self.cv_results_[key] = grid_scores[i]
self.cv_results_["mean_score"] = np.mean(grid_scores, axis=0)
```
The `scores` are already the sum along axis 0, so this mean would not be over the splits. We need to keep the original scores as they were returned by:

scikit-learn/sklearn/feature_selection/_rfe.py, lines 578 to 582 in 728b413:

```python
scores = parallel(
    func(rfe, self.estimator, X, y, train, test, scorer)
    for train, test in cv.split(X, y, groups))
scores = np.sum(scores, axis=0)
```

so that we can compute the mean and std correctly. (Also, the `grid_scores_` values are already means over the splits.)
If we hold on to the scores:

```python
scores = parallel(
    func(rfe, self.estimator, X, y, train, test, scorer)
    for train, test in cv.split(X, y, groups))
scores = np.array(scores)
scores_sum = np.sum(scores, axis=0)  # technically could use mean here
...
# reverse to stay consistent with before
scores_rev = scores[:, ::-1]
self.cv_results_ = {}
self.cv_results_["mean_score"] = np.mean(scores_rev, axis=0)
self.cv_results_["std_score"] = np.std(scores_rev, axis=0)
for i in range(scores.shape[0]):
    self.cv_results_[f"split{i}_score"] = scores_rev[i]
```

And then `grid_scores_` is just `self.cv_results_["mean_score"]`.
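Tying this together, the backward-compatible accessor could then simply delegate to the new dict. A sketch combining the decorator placement and deprecation message from the diffs above (not the final merged code):

```python
# Sketch of the deprecated alias, assuming the message and decorator
# order shown in the diffs above.
@deprecated(  # type: ignore
    "The grid_scores_ attribute is deprecated in version 0.24 in favor "
    "of cv_results_ and will be removed in version 0.25"
)
@property
def grid_scores_(self):
    # grid_scores_ was the mean CV score per feature subset, which is
    # exactly what cv_results_["mean_score"] now stores.
    return self.cv_results_["mean_score"]
```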
```python
values = np.asarray(
    [rfecv.cv_results_["split{}_score".format(i)]
     for i in range(results_size - 2)]).T
assert rfecv.cv_results_["mean_score"] == np.mean(values)
assert rfecv.cv_results_["std_score"] == np.std(values)
```
It looks like `'mean_score'` here is a single number rather than a vector of means, one per feature subset.
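A per-subset version of that assertion might look like this (a sketch; note the `axis` argument and the tolerant float comparison):

```python
# values has shape (n_subsets, n_splits) after the transpose above,
# so per-subset statistics are taken along axis=1.
np.testing.assert_allclose(
    rfecv.cv_results_["mean_score"], values.mean(axis=1))
np.testing.assert_allclose(
    rfecv.cv_results_["std_score"], values.std(axis=1))
```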
```python
for key in rfecv.cv_results_.keys():
    if key == 'std_score':
        assert (rfecv.cv_results_[key] == 0)
    else:
        assert (rfecv.cv_results_[key] == 1)
```
Let's remove this and use `test_std_and_mean` to test this explicitly.
Hi @arka204, would you be able to finish this pull request? Thanks!
Working on it in #20161.
Closed by #20161.
Reference Issues/PRs
Partially fixes #11198
Based on #16392
What does this implement/fix? Explain your changes.
This PR replaces `grid_scores_` with `cv_results_` in _rfe.py. It also adds a temporary `grid_scores_` property.

Any other comments?
I plan to change the tests in a similar way (replacing `grid_scores_` with the code of the `grid_scores_` property) after confirming that this is the correct way to do so. Am I mistaken, or is `grid_scores_` a one-dimensional array?
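A quick way to check the shape on this branch (a sketch on a toy dataset; the `cv_results_` key names are the ones proposed in this PR):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), cv=5).fit(X, y)

# One mean CV score per candidate number of features, so 1-D:
print(rfecv.grid_scores_.shape)               # (8,) with the default step=1
print(rfecv.cv_results_["mean_score"].shape)  # same shape on this branch
```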