
FIX Improves feature names support for SelectFromModel + Est w/o names #21991

Merged: 11 commits merged into scikit-learn:main on Dec 24, 2021

Conversation

thomasjpfan (Member):

Reference Issues/PRs

Fixes #21949

What does this implement/fix? Explain your changes.

transform will validate twice if the inner estimator supports feature_names_in_ and performs its own validation. This double validation already happens on main.

Any other comments?

In a future PR, I think we need a way to configure feature_names_in_ validation depending on if the delegated estimator supports feature_names_in_.

@@ -428,3 +428,34 @@ def test_importance_getter(estimator, importance_getter):
)
selector.fit(data, y)
assert selector.transform(data).shape[1] == 1


class RandomForestNoFeatureNames(RandomForestClassifier):
Review comment (Member):

I think that we should be using MinimalClassifier and MinimalRegressor.
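GitHub collapses the body of RandomForestNoFeatureNames in this diff; a plausible sketch of such a helper (a hypothetical reconstruction, not the PR's exact code) is a forest that casts X to an ndarray before fitting, so it never records feature_names_in_:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


class RandomForestNoFeatureNames(RandomForestClassifier):
    def fit(self, X, y):
        # Casting to an ndarray discards the DataFrame's column names,
        # so the fitted forest never sets feature_names_in_.
        return super().fit(np.asarray(X), y)


X_df = pd.DataFrame({"a": [0, 1, 0, 1], "b": [1, 0, 1, 0]})
y = [0, 1, 0, 1]
est = RandomForestNoFeatureNames(n_estimators=5, random_state=0).fit(X_df, y)
print(hasattr(est, "feature_names_in_"))
```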

return self


def test_estimator_does_not_support_feature_names():
Review comment (Member):

Could we make a more general test that iterates over all estimators inheriting from MetaEstimatorMixin, creating a classifier or a regressor as appropriate, to ensure the behaviour with the minimal classifier or regressor?

thomasjpfan (Member Author):

Updated test in sklearn/tests/test_common.py to generate meta-estimators with MinimalEstimators.

I left this one here to test get_feature_names_out.

@thomasjpfan thomasjpfan changed the title FIX Fixes feature names support for SelectFromModel + Est w/o names FIX Fixes feature attributes for meta estimators + inner estimator that do not support feature attributes Dec 20, 2021
thomasjpfan (Member Author) left a comment:

Scope of PR increased to cover all subclasses of MetaEstimatorMixin.


thomasjpfan commented Dec 20, 2021:

The scope of this PR increased quite a bit. Meta-estimators are validating inputs when the delegate does not set the required feature_names_in_ or n_features_in_.

Should we consider this behavior change a bug fix for 1.0.2 or something for 1.1? I can reduce this PR back down to just fixing SelectFromModel if we think the scope is too big.

@@ -28,8 +28,7 @@ Version 1.0.2
     :class:`multioutput.MultiOutputRegressor`,
     :class:`multiclass.OneVsRestClassifier`,
     :class:`multiclass.OutputCodeClassifier`,
-    :class:`multiclass.OutputCodeClassifier`,
-    :class:`pipeline.Pipeline`, and :class:`pipeline.FeatureUnion`
+    :class:`multiclass.OutputCodeClassifier`.
Review comment (Member):

It looks like we have twice the same classifier here :)

@@ -158,10 +158,9 @@ def predict_proba(self, X):
         force_all_finite=False,
         dtype=None,
         accept_sparse=True,
-        ensure_2d=False,
+        ensure_2d=True,
Review comment (Member):

Should this change come with a test?
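For context on the flag itself (a general illustration of check_array, not the test the reviewer is asking for): with ensure_2d=True, validation rejects 1-D input instead of letting it pass through.

```python
import numpy as np
from sklearn.utils import check_array

X_1d = np.array([1.0, 2.0, 3.0])

# With ensure_2d=True, a 1-D array fails validation with a ValueError.
try:
    check_array(X_1d, ensure_2d=True)
    raised = False
except ValueError:
    raised = True
print(raised)

# With ensure_2d=False, the same input is accepted as-is.
X_ok = check_array(X_1d, ensure_2d=False)
print(X_ok.shape)
```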

glemaitre (Member):

I think that we could limit to only the bug fix in SelectFromModel for 1.0.2
The additional change will not break backward compatibility and will add support for feature names in cases where we were not doing it before, so we could postpone it to 1.1.

sklearn/base.py Outdated
@@ -602,6 +602,42 @@ def _validate_data(

return out

def _check_features_support(self, X, *, delegate=None, reset=True):
"""Set or check both `n_features_in_` and `feature_names_in_` based on delegate.
Review comment (Member):

Since _check_features_support might be a bit generic for people not aware of what we intend to do, would it be good to reference the SLEPs for n_features_in_ and feature_names_in_ as well (in the long description of the docstring)?

ogrisel commented Dec 21, 2021:

> I think that we could limit to only the bug fix in SelectFromModel for 1.0.2

I have the same feeling. But the PR looks nice otherwise.

@thomasjpfan thomasjpfan changed the title FIX Fixes feature attributes for meta estimators + inner estimator that do not support feature attributes FIX Improves feature names support for SelectFromModel + Est w/o names Dec 21, 2021
thomasjpfan commented Dec 21, 2021:

Updated PR to reduce the scope back to focusing on SelectFromModel. Technically this PR changes the behavior of SelectFromModel by setting feature_names_in_ if the delegate does not set it.

The alternative is to pass the delegate to _validate_data and not raise the warning when the delegate does not have feature_names_in_.

thomasjpfan commented Dec 22, 2021:

I updated the PR with the bare minimal change to be a bug fix. SelectFromModel no longer validates feature names and delegates validation to the base estimator.

That get_feature_names_out gives incorrect feature names is expected behavior, since SelectFromModel does not have feature names when the base estimator does not define them. One would need to pass the original feature names to get_feature_names_out to get the expected results.

The better behavior is for the meta-estimator to learn the feature names when the delegate does not, which will be a 1.1 feature.

Edit: To pass the common tests, SelectFromModel needs to validate. I updated the PR to have the better behavior described above.

glemaitre self-assigned this Dec 24, 2021
glemaitre merged commit 6db0e2c into scikit-learn:main Dec 24, 2021
glemaitre added a commit to glemaitre/scikit-learn referencing this pull request (scikit-learn#21991) Dec 24, 2021, co-authored by Guillaume Lemaitre <g.lemaitre58@gmail.com>
glemaitre added a commit referencing this pull request (#21991) Dec 25, 2021
venkyyuvy pushed a commit to venkyyuvy/scikit-learn referencing this pull request Jan 1, 2022
mathijs02 pushed a commit to mathijs02/scikit-learn referencing this pull request (scikit-learn#21991) Dec 27, 2022
Successfully merging this pull request may close this issue: "SelectFromModel function in scikit 1.0.1 does not work properly with catboost and caching"