
Add number of features used at each step to RFECV.cv_results_ #28670

Merged

Conversation

miguelcsilva
Contributor

Reference Issues/PRs

Addresses the suggestion made here.

What does this implement/fix? Explain your changes.

Adds a new key to the RFECV.cv_results_ dictionary. The key is named n_features and its value is a numpy array with the number of features used at each step of the recursive feature elimination process.
It also adds a new test that verifies: 1) the added array is correct; 2) all arrays in this dict have the same size.
Finally, it updates the docs to make use of the simplified way to build the plot. See the plot below for the rendered version of the new doc page:
[Screenshot of the rendered documentation page]

Any other comments?

I tried to keep the code roughly aligned with the current codebase logic, though I'm not sure I've fully been able to adhere to the repo's conventions (especially on the type hints in the tests). So feel free to propose any changes/corrections.


github-actions bot commented Mar 20, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 24be84e.

Member

@jeremiedbb jeremiedbb left a comment


Thanks for the PR @miguelcsilva.

I think it would make things even simpler if RFE exposed the list of n_features kept at each iteration as a fitted attribute array. This way you wouldn't have to recompute the list with this complicated logic when building the cv_results_ dict. It would also make the computation of the final number of features to select easier. The following

scores = np.array(scores)
scores_sum = np.sum(scores, axis=0)
scores_sum_rev = scores_sum[::-1]
argmax_idx = len(scores_sum) - np.argmax(scores_sum_rev) - 1
n_features_to_select = max(
    n_features - (argmax_idx * step), self.min_features_to_select
)

could be replaced by

scores_sum = np.sum(scores, axis=0)
n_features_to_select = rfe.n_features_list_[np.argmax(scores_sum)]

where rfe.n_features_list_ is the new attribute, better name to be found.
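To illustrate the suggested simplification with toy numbers (a standalone sketch; the variable names mirror the snippets above, but the values and the contents of the proposed attribute are made up):

```python
import numpy as np

# Toy setup: 8 features, eliminating 2 per step down to a minimum of 2,
# so the candidate counts per step are 8, 6, 4, 2 (index 0 = all features).
n_features, step, min_features_to_select = 8, 2, 2
n_features_list = np.array([8, 6, 4, 2])     # the proposed fitted attribute
scores_sum = np.array([2.0, 3.0, 3.5, 1.0])  # summed CV scores per step

# Current logic: recover the winning count from an argmax index.
argmax_idx = len(scores_sum) - np.argmax(scores_sum[::-1]) - 1
old = max(n_features - argmax_idx * step, min_features_to_select)

# Suggested logic: read the count directly off the new attribute.
new = n_features_list[np.argmax(scores_sum)]

assert old == new == 4
```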

Review threads (outdated, resolved): sklearn/feature_selection/_rfe.py (two threads), sklearn/feature_selection/tests/test_rfe.py
Member

@ogrisel ogrisel left a comment


Thanks very much for the clean PR. In addition to @jeremiedbb's comment, here are a few more:

Review threads (outdated, resolved): sklearn/feature_selection/tests/test_rfe.py, sklearn/feature_selection/_rfe.py
miguelcsilva force-pushed the add-n_features-rfecv-cv_results_ branch from acbb2c6 to ec9dce6 (March 24, 2024 17:06)
miguelcsilva force-pushed the add-n_features-rfecv-cv_results_ branch from 6e97197 to c213ea5 (March 24, 2024 21:31)
@miguelcsilva
Contributor Author

@ogrisel and @jeremiedbb, I think I have addressed most of your suggestions.

I did as @jeremiedbb proposed and added an n_features_fitted_ attribute to the RFE class. I couldn't, however, get this #28670 (review) suggestion to work. The problem is that in case of a tie for the best score, the one with the lowest number of features must be selected (there is a test validating that).

There are still two tests failing that I'm not quite sure how to get passing. The first is that, in the snippet below, calling .set_params(n_jobs=2) makes the rfecv object lose the n_features_fitted_ attribute that was set before:

rfecv.set_params(n_jobs=2)

The second error is on test_fit_docstring_attributes. I'm not quite sure how to work with the docstrings in this repo yet.

Finally, it seems like I also have an error on the changelog, even though I added the changes to the changelog for v1.5. I'm also not sure if I'm doing something wrong here: should I have added them to v1.4 instead?

It would be great if one of the maintainers could steer me in the right direction on the points above. Hopefully I can then learn this for next time.

@ogrisel
Member

ogrisel commented Mar 25, 2024

Could you please add an entry to the changelog under doc/whats_new/v1.5.rst?

I will try to review the rest later today.

@jeremiedbb
Member

jeremiedbb commented Mar 25, 2024

Thanks @miguelcsilva. I directly pushed some changes to fix the CI issues.

> Did as @jeremiedbb proposed and added an n_features_fitted_ attribute to the RFE class. Couldn't however get this #28670 (review) suggestion to work. The problem is that in case of a tie for the best score, the one with the lowest number of features must be selected (there is a test validating that).

It could be fixed by reversing the arrays so that, in case of a tie, the lowest number of features is selected. I pushed it back in my last commit. I still find it a lot more readable than before, and I added a comment to explain why we reverse.
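A toy sketch (made-up numbers) of why the reversal gives the required tie-break:

```python
import numpy as np

# Summed CV scores aligned with the feature counts tried (highest first),
# with a tie between keeping 6 and keeping 4 features.
n_features_list = np.array([8, 6, 4, 2])
scores_sum = np.array([2.0, 3.0, 3.0, 1.0])

# np.argmax returns the first maximum, so on the raw arrays the tie
# resolves to the larger feature count:
first = n_features_list[np.argmax(scores_sum)]

# Reversing both arrays makes the first maximum the tied candidate
# with the fewest features, which is what the test suite requires:
tiebreak = n_features_list[::-1][np.argmax(scores_sum[::-1])]

assert (first, tiebreak) == (6, 4)
```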

> There are still two tests failing that I'm not quite sure how to get passing. The first is that, in the snippet below, calling .set_params(n_jobs=2) makes the rfecv object lose the n_features_fitted_ attribute that was set before

Actually, I reverted making it a public attribute. I made it a fitted attribute only accessible by RFECV, like the scores_ attributes, which makes things easier and keeps the scope of the PR more focused. It also needed to be returned by fit_single_rfe because, when fitting in parallel, the fitted RFE objects live in subprocesses and are not available from the main process. Hence the missing-attribute error you got.

> The second error is on test_fit_docstring_attributes. I'm not quite sure how to work with the docstrings in this repo yet.

Since I made the attribute private, this CI issue is no longer relevant.

Member

@jeremiedbb jeremiedbb left a comment


LGTM

Review thread (outdated, resolved): doc/whats_new/v1.5.rst
Member

@ogrisel ogrisel left a comment


Nitpicking about variable names, but LGTM, whatever your final decision.

Review thread (outdated, resolved): sklearn/feature_selection/_rfe.py
@miguelcsilva
Contributor Author


Thanks for your help with this @jeremiedbb. Just read through the changes you made. Very insightful. Will try to keep it in mind for the next one.

Member

@jeremiedbb jeremiedbb left a comment


LGTM. Thanks @miguelcsilva !

jeremiedbb enabled auto-merge (squash) on March 27, 2024 14:02
jeremiedbb merged commit 710fe97 into scikit-learn:main on March 27, 2024
28 checks passed