
[MRG] Replacing grid_scores_ by cv_results_ in _rfe.py #16961

Conversation

@arka204 (Contributor) commented Apr 18, 2020

Reference Issues/PRs

Partially fixes #11198
Based on #16392

What does this implement/fix? Explain your changes.

This PR replaces grid_scores_ with cv_results_ in _rfe.py.
It also adds a temporary grid_scores_ property.

Any other comments?

I plan to change the tests in a similar way (replacing grid_scores_ with the code of the grid_scores_ property) after confirming that this is the correct way to do so.
Am I mistaken, or is grid_scores_ a one-dimensional array?

Merging changes from the main repository
@KumarGanesha1996 left a comment:

lgtm

grid_size = len(self.cv_results_) - 2
return np.asarray(
[self.cv_results_["split{}_score".format(i)]
for i in range(grid_size)]).T


I think the indentation is wrong, and a newline is missing at the end of the file.

sklearn/feature_selection/_rfe.py:607:13: E128 continuation line under-indented for visual indent
    for i in range(grid_size)]).T
sklearn/feature_selection/_rfe.py:607:42: W292 no newline at end of file
    for i in range(grid_size)]).T

Exited with code exit status 1

grid_scores = scores[::-1] / cv.get_n_splits(X, y, groups)
self.cv_results_ = {}
for i in range(grid_scores.shape[0]):
    key = "split{}_score".format(i)


key = "split%d_score" % i

@thomasjpfan (Member) left a comment:


This needs tests for grid_scores_ and cv_results_.

@KumarGanesha1996

I think your tests fail because you need to rebase. Check out this great tutorial by my dear friend: https://www.youtube.com/watch?v=Gjd44YpucEA

@KumarGanesha1996

@arka204 are you still working on this?

@cmarmo (Member) left a comment:

A Sphinx warning in the documentation is preventing your build from completing.
Once the Sphinx warning is fixed, and if you think this PR is ready for review, do you mind changing the title from [WIP] to [MRG], as specified in the documentation? Thanks!

@@ -457,6 +458,24 @@ class RFECV(RFE):
``grid_scores_[i]`` corresponds to
the CV score of the i-th subset of features.

.. deprecated:: 0.23
The `grid_scores_` attribute is deprecated in version 0.23 in favor
Member

Suggested change
The `grid_scores_` attribute is deprecated in version 0.23 in favor
    The `grid_scores_` attribute is deprecated in version 0.23 in favor

Member

This indentation and the next one will fix the 'unexpected unindent' sphinx warning (see artifacts)

@@ -457,6 +458,24 @@ class RFECV(RFE):
``grid_scores_[i]`` corresponds to
the CV score of the i-th subset of features.

.. deprecated:: 0.23
The `grid_scores_` attribute is deprecated in version 0.23 in favor
of `cv_results_` and will be removed in version 0.25
Member

Suggested change
of `cv_results_` and will be removed in version 0.25
    of `cv_results_` and will be removed in version 0.25

Comment on lines 467 to 475

split(i)_score : float
corresponds to the CV score of the i-th subset of features

mean_score : float
mean of split(i)_score values in dict

std_score : float
std of split(i)_score values in dict
Member

Maybe the dictionary keys could be listed as a bullet list?

Member

I am okay with this for now, we use this formatting for other places where we return dicts (fetch_openml).

@arka204 changed the title [WIP] Replacing grid_scores_ by cv_results_ in _rfe.py → [MRG] Replacing grid_scores_ by cv_results_ in _rfe.py on May 30, 2020
@arka204 (Contributor, Author) commented Jun 6, 2020

Can I have your opinion on this, @jnothman, @thomasjpfan?

@thomasjpfan (Member) left a comment:

We need a nontrivial test to make sure cv_results_["mean_score"] and cv_results_["std_score"] are computed correctly.
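For illustration only, a tiny NumPy sketch (with made-up numbers, not taken from the PR) of the property such a test should pin down: mean_score and std_score are aggregates over the per-split scores, one value per feature-subset size.

import numpy as np

# Toy scores: 3 CV splits x 4 feature-subset sizes (illustrative values only).
split_scores = np.array([
    [0.70, 0.80, 0.90, 0.85],
    [0.72, 0.78, 0.88, 0.84],
    [0.68, 0.82, 0.92, 0.86],
])

# What cv_results_["mean_score"] and cv_results_["std_score"] should contain:
# one aggregate per subset size, computed over the splits (axis 0).
mean_score = split_scores.mean(axis=0)  # array([0.70, 0.80, 0.90, 0.85])
std_score = split_scores.std(axis=0)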

sklearn/feature_selection/_rfe.py (resolved)
"The grid_scores_ attribute is deprecated in version 0.24 in favor "
"of cv_results_ and will be removed in version 0.25"
)
@property # type: ignore
Member

Suggested change
@property # type: ignore
@property


A dict with keys:

split(i)_score : float
corresponds to the CV score of the i-th subset of features
Member

Suggested change
corresponds to the CV score of the i-th subset of features
corresponds to the CV score of the i-th subset of features

corresponds to the CV score of the i-th subset of features

mean_score : float
mean of split(i)_score values in dict
Member

Suggested change
mean of split(i)_score values in dict
mean of split(i)_score values

mean of split(i)_score values in dict

std_score : float
std of split(i)_score values in dict
Member

Suggested change
std of split(i)_score values in dict
standard deviation of split(i)_score values in dict

@thomasjpfan (Member) left a comment:

This is looking good!

We can have test_rfecv be the only one that explicitly checks for the deprecation warning.

The other tests can be decorated with ignore_warnings and have their pytest.warns removed. For example:

# TODO: Remove in 0.25 when grid_scores_ is deprecated
@ignore_warnings(category=FutureWarning)
def test_rfecv_cv_results_size():
    ...

return self

# mypy error: Decorated property not supported
@deprecated(
Member

Suggested change
@deprecated(
@deprecated( # type: ignore
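Putting the pieces discussed in this review together, the deprecated alias might look roughly like the sketch below. This is only an illustration: the class stub _RFECVSketch is hypothetical, the decorator placement follows the suggestion above, the message text comes from the diff earlier in this thread, and the body mirrors the reconstruction snippet reviewed at the top of the conversation.

import numpy as np
from sklearn.utils import deprecated


class _RFECVSketch:
    """Hypothetical stand-in for RFECV; only the deprecated alias is shown."""

    def __init__(self, cv_results_):
        self.cv_results_ = cv_results_

    # mypy error: Decorated property not supported
    @deprecated(  # type: ignore
        "The grid_scores_ attribute is deprecated in version 0.24 in favor "
        "of cv_results_ and will be removed in version 0.25"
    )
    @property
    def grid_scores_(self):
        # Rebuild the old grid_scores_ array from the split{i}_score entries;
        # mean_score and std_score are the two non-split keys.
        grid_size = len(self.cv_results_) - 2
        return np.asarray(
            [self.cv_results_["split{}_score".format(i)]
             for i in range(grid_size)]).T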

with pytest.warns(FutureWarning, match=msg):
assert len(rfecv.grid_scores_) == score_len

assert (len(rfecv.cv_results_) - 2) == score_len
Member

Suggested change
assert (len(rfecv.cv_results_) - 2) == score_len
assert len(rfecv.cv_results_) - 2 == score_len

formula1(n_features, n_features_to_select, step))
assert (rfecv.grid_scores_.shape[0] ==
assert ((len(rfecv.cv_results_) - 2) ==
Member

Suggested change
assert ((len(rfecv.cv_results_) - 2) ==
assert (len(rfecv.cv_results_) - 2 ==

sklearn/feature_selection/tests/test_rfe.py (outdated, resolved)

assert (rfecv_cv_results_.keys() == rfecv.cv_results_.keys())
for key in rfecv_cv_results_.keys():
    assert (rfecv_cv_results_[key] == rfecv.cv_results_[key])
Member

This is comparing floats:

Suggested change
assert (rfecv_cv_results_[key] == rfecv.cv_results_[key])
assert rfecv_cv_results_[key] == pytest.approx(rfecv.cv_results_[key])

arka204 and others added 2 commits June 9, 2020 01:38
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@thomasjpfan (Member) left a comment:

The computation of cv_results_ seems a little off.

for i in range(grid_scores.shape[0]):
    key = "split%d_score" % i
    self.cv_results_[key] = grid_scores[i]
self.cv_results_["mean_score"] = np.mean(grid_scores, axis=0)
Member

The scores are already summed along axis 0, so this mean would not be over the splits.

We need to keep the original scores as returned by:

scores = parallel(
    func(rfe, self.estimator, X, y, train, test, scorer)
    for train, test in cv.split(X, y, groups))
scores = np.sum(scores, axis=0)

so we can compute the mean and std correctly.

(Also, the grid_scores_ values are already the means over the splits.)

If we hold on to the scores:

scores = parallel(
    func(rfe, self.estimator, X, y, train, test, scorer)
    for train, test in cv.split(X, y, groups))

scores = np.array(scores)
scores_sum = np.sum(scores, axis=0)  # technically could use mean here
...

# reverse to stay consistent with before
scores_rev = scores[:, ::-1]
self.cv_results_ = {}
self.cv_results_["mean_score"] = np.mean(scores_rev, axis=0)
self.cv_results_["std_score"] = np.std(scores_rev, axis=0)

for i in range(scores.shape[0]):
    self.cv_results_[f"split{i}_score"] = scores_rev[i]

And then grid_scores_ is just self.cv_results_["mean_score"].

Comment on lines +518 to +522
values = np.asarray(
    [rfecv.cv_results_["split{}_score".format(i)]
     for i in range(results_size - 2)]).T
assert rfecv.cv_results_["mean_score"] == np.mean(values)
assert rfecv.cv_results_["std_score"] == np.std(values)
Member

This looks like 'mean_score' is a single number and not a vector of means, one for each feature subset.

Comment on lines +184 to +188
for key in rfecv.cv_results_.keys():
    if key == 'std_score':
        assert (rfecv.cv_results_[key] == 0)
    else:
        assert (rfecv.cv_results_[key] == 1)
Member

Let's remove this and use test_std_and_mean to explicitly test this.
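A minimal sketch of what such a test_std_and_mean could look like, assuming the split{i}_score / mean_score / std_score key layout proposed in this PR (not the released scikit-learn API); the dataset and estimator below are illustrative, not the ones used in the actual test.

import numpy as np
import pytest
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression


def test_std_and_mean():
    # Illustrative data; any small classification problem would do.
    X, y = make_classification(n_samples=60, n_features=6, random_state=0)
    rfecv = RFECV(estimator=LogisticRegression(), cv=3)
    rfecv.fit(X, y)

    # Every key except mean_score/std_score is a per-split score vector.
    n_splits = len(rfecv.cv_results_) - 2
    split_scores = np.asarray(
        [rfecv.cv_results_["split{}_score".format(i)] for i in range(n_splits)])

    assert rfecv.cv_results_["mean_score"] == pytest.approx(split_scores.mean(axis=0))
    assert rfecv.cv_results_["std_score"] == pytest.approx(split_scores.std(axis=0))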

@cmarmo (Member) commented Aug 23, 2020

Hi @arka204, would you be able to finish this pull request? Thanks!

@wowry (Contributor) commented May 28, 2021

Working on it in #20161.

@cmarmo added the Superseded label (PR has been replaced by a newer PR) and removed the help wanted label on May 28, 2021
@cmarmo (Member) commented Jul 30, 2021

Closed by #20161.

@cmarmo closed this on Jul 30, 2021
Development

Successfully merging this pull request may close these issues.

Deprecate grid_scores_ everywhere, replace by cv_results_