
Add number of features used at each step to RFECV.cv_results_ #28670

Merged

Conversation

miguelcsilva
Contributor

Reference Issues/PRs

Addresses the suggestion made here.

What does this implement/fix? Explain your changes.

Adds a new key to the RFECV.cv_results_ dictionary. The key is named n_features and its value is a numpy array with the number of features used at each step of the recursive feature elimination process.
It also adds a new test that verifies: 1) the added array is correct; 2) all arrays in this dict have the same size.
Finally, it updates the docs to make use of the simplified way to build the plot. See the plot below for the rendered version of the new doc page:
[Screenshot of the rendered documentation page]

Any other comments?

I tried to keep the code roughly aligned with the current codebase logic, though I'm not sure I've fully been able to adhere to the repo's conventions (especially on the type hints in the tests). So feel free to propose any changes/corrections.


github-actions bot commented Mar 20, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 24be84e.

Member

@jeremiedbb jeremiedbb left a comment


Thanks for the PR @miguelcsilva.

I think it would make things even simpler if RFE exposed the list of n_features kept at each iteration as a fitted attribute array. This way you wouldn't have to recompute the list with this complicated logic when building the cv_results_ dict. It would also make the computation of the final number of features to select easier. The following

scores = np.array(scores)
scores_sum = np.sum(scores, axis=0)
scores_sum_rev = scores_sum[::-1]
argmax_idx = len(scores_sum) - np.argmax(scores_sum_rev) - 1
n_features_to_select = max(
    n_features - (argmax_idx * step), self.min_features_to_select
)

could be replaced by

scores_sum = np.sum(scores, axis=0)
n_features_to_select = rfe.n_features_list_[np.argmax(scores_sum)]

where rfe.n_features_list_ is the new attribute, better name to be found.
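To illustrate the suggested simplification with toy numbers (a standalone sketch; the variable names mirror the snippets above, but the values and the contents of the proposed attribute are made up):

```python
import numpy as np

# Toy setup: 8 features, eliminating 2 per step down to a minimum of 2,
# so the candidate counts per step are 8, 6, 4, 2 (index 0 = all features).
n_features, step, min_features_to_select = 8, 2, 2
n_features_list = np.array([8, 6, 4, 2])     # the proposed fitted attribute
scores_sum = np.array([2.0, 3.0, 3.5, 1.0])  # summed CV scores per step

# Current logic: recover the winning count from an argmax index.
argmax_idx = len(scores_sum) - np.argmax(scores_sum[::-1]) - 1
old = max(n_features - argmax_idx * step, min_features_to_select)

# Suggested logic: read the count directly off the new attribute.
new = n_features_list[np.argmax(scores_sum)]

assert old == new == 4
```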

Review threads (outdated, resolved): sklearn/feature_selection/_rfe.py (two threads), sklearn/feature_selection/tests/test_rfe.py
Member

@ogrisel ogrisel left a comment


Thanks very much for the clean PR. In addition to @jeremiedbb's comment, here are a few more:

Review threads (outdated, resolved): sklearn/feature_selection/tests/test_rfe.py, sklearn/feature_selection/_rfe.py
miguelcsilva force-pushed the add-n_features-rfecv-cv_results_ branch from acbb2c6 to ec9dce6 (March 24, 2024 17:06)
miguelcsilva force-pushed the add-n_features-rfecv-cv_results_ branch from 6e97197 to c213ea5 (March 24, 2024 21:31)
@miguelcsilva
Contributor Author

@ogrisel and @jeremiedbb, I think I have addressed most of your suggestions.

I did as @jeremiedbb proposed and added an n_features_fitted_ attribute to the RFE class. I couldn't, however, get this #28670 (review) suggestion to work. The problem is that in case of a tie for the best score, the one with the lowest number of features must be selected (there is a test validating that).

There are still two tests failing that I'm not quite sure how to get passing. The first is that, in the snippet below, calling .set_params(n_jobs=2) makes the rfecv object lose the n_features_fitted_ attribute that was set before:

rfecv.set_params(n_jobs=2)

The second error is on test_fit_docstring_attributes. I'm not quite sure how to work with the docstrings in this repo yet.

Finally, it seems like I also have an error on the changelog, even though I added the changes to the changelog for v1.5. I'm also not sure if I'm doing something wrong here: should I have added them to v1.4 instead?

It would be great if one of the maintainers could steer me in the right direction on the points above. Hopefully I can then learn this for next time.

@ogrisel
Member

ogrisel commented Mar 25, 2024

Could you please add an entry to the changelog under doc/whats_new/v1.5.rst?

I will try to review the rest later today.

@jeremiedbb
Member

jeremiedbb commented Mar 25, 2024

Thanks @miguelcsilva. I directly pushed some changes to fix the CI issues.

> Did as @jeremiedbb proposed and added an n_features_fitted_ attribute to the RFE class. Couldn't however get this #28670 (review) suggestion to work. The problem is that in case of a tie for the best score, the one with the lowest number of features must be selected (there is a test validating that).

It could be fixed by reversing the arrays so that, in case of a tie, the lowest number of features is selected. I pushed it back in my last commit. I still find it a lot more readable than before, and I added a comment to explain why we reverse.
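A toy sketch (made-up numbers) of why the reversal gives the required tie-break:

```python
import numpy as np

# Summed CV scores aligned with the feature counts tried (highest first),
# with a tie between keeping 6 and keeping 4 features.
n_features_list = np.array([8, 6, 4, 2])
scores_sum = np.array([2.0, 3.0, 3.0, 1.0])

# np.argmax returns the first maximum, so on the raw arrays the tie
# resolves to the larger feature count:
first = n_features_list[np.argmax(scores_sum)]

# Reversing both arrays makes the first maximum the tied candidate
# with the fewest features, which is what the test suite requires:
tiebreak = n_features_list[::-1][np.argmax(scores_sum[::-1])]

assert (first, tiebreak) == (6, 4)
```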

> There are still two tests failing that I'm not quite sure how to get passing. The first is that, in the snippet below, calling .set_params(n_jobs=2) makes the rfecv object lose the n_features_fitted_ attribute that was set before

Actually, I reverted making it a public attribute. I made it a fitted attribute only accessible by RFECV, like the scores_ attributes, which makes things easier and keeps the scope of the PR more focused. It also needed to be returned by fit_single_rfe because, when fitting in parallel, the fitted RFE objects live in subprocesses and are not available from the main process. Hence the missing-attribute error you got.

> The second error is on test_fit_docstring_attributes. I'm not quite sure how to work with the docstrings in this repo yet.

Since I made the attribute private, this CI issue is no longer relevant.

Member

@jeremiedbb jeremiedbb left a comment


LGTM

Review thread (outdated, resolved): doc/whats_new/v1.5.rst
Member

@ogrisel ogrisel left a comment


Nitpicking about variable names, but LGTM, whatever your final decision.

Review thread (outdated, resolved): sklearn/feature_selection/_rfe.py
@miguelcsilva
Contributor Author


Thanks for your help with this @jeremiedbb. Just read through the changes you made. Very insightful. Will try to keep it in mind for the next one.

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Thanks @miguelcsilva !

jeremiedbb enabled auto-merge (squash) on March 27, 2024 14:02
jeremiedbb merged commit 710fe97 into scikit-learn:main on March 27, 2024
28 checks passed