FIX Improves feature names support for SelectFromModel + Est w/o names #21991

Merged
11 commits merged on Dec 24, 2021
6 changes: 6 additions & 0 deletions doc/whats_new/v1.0.rst
@@ -35,6 +35,12 @@ Changelog
and :class:`decomposition.MiniBatchSparsePCA` to be convex and match the referenced
article. :pr:`19210` by :user:`Jérémie du Boisberranger <jeremiedbb>`.

:mod:`sklearn.feature_selection`
................................

- |Fix| Fixed :class:`feature_selection.SelectFromModel` to improve support
for estimators that set `feature_names_in_`. :pr:`21991` by `Thomas Fan`_.

:mod:`sklearn.metrics`
......................

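As context for the changelog entry above, here is a minimal sketch of the behaviour the fix gives users. It is illustrative only and not part of the diff; it assumes a scikit-learn build that contains this fix and pandas installed, and LogisticRegression is an arbitrary choice of estimator.

# Illustrative sketch, not part of the PR diff. LogisticRegression is an
# arbitrary choice of estimator that records feature_names_in_ when fit on a
# DataFrame; any such estimator would do.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_iris(as_frame=True, return_X_y=True)  # X is a pandas DataFrame
selector = SelectFromModel(LogisticRegression(max_iter=1000)).fit(X, y)

# SelectFromModel now exposes the input feature names itself ...
print(selector.feature_names_in_)
# ... so the selected columns can be reported by name.
print(selector.get_feature_names_out())
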
4 changes: 4 additions & 0 deletions sklearn/feature_selection/_from_model.py
@@ -265,8 +265,12 @@ def fit(self, X, y=None, **fit_params):
            raise NotFittedError("Since 'prefit=True', call transform directly")
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)

        if hasattr(self.estimator_, "feature_names_in_"):
            self.feature_names_in_ = self.estimator_.feature_names_in_
        else:
            self._check_feature_names(X, reset=True)

        return self

    @property
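The new branch above covers two cases: if the fitted sub-estimator exposes feature_names_in_, SelectFromModel copies it; otherwise the meta-estimator derives the names from X via _check_feature_names. A minimal sketch of the fallback case, modelled on the test added below; the ForestWithoutFeatureNames wrapper is illustrative and not part of scikit-learn.

# Illustrative sketch of the fallback branch. ForestWithoutFeatureNames is a
# made-up helper that simulates an estimator which never records
# feature_names_in_, mirroring RandomForestNoFeatureNames in the new test.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel


class ForestWithoutFeatureNames(RandomForestClassifier):
    def fit(self, X, y):
        super().fit(X, y)
        del self.feature_names_in_  # drop the names the forest recorded
        return self


X, y = load_iris(as_frame=True, return_X_y=True)
selector = SelectFromModel(ForestWithoutFeatureNames(n_estimators=10)).fit(X, y)

# The meta-estimator records the names from X itself (the `else` branch above),
# so transforming a DataFrame no longer warns about mismatched feature names.
print(selector.feature_names_in_)
print(selector.transform(X.head()).shape)
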
31 changes: 31 additions & 0 deletions sklearn/feature_selection/tests/test_from_model.py
@@ -428,3 +428,34 @@ def test_importance_getter(estimator, importance_getter):
    )
    selector.fit(data, y)
    assert selector.transform(data).shape[1] == 1


Review comment (Member), on `class RandomForestNoFeatureNames`:
I think that we should be using MinimalClassifier and MinimalRegressor.


class RandomForestNoFeatureNames(RandomForestClassifier):
    def fit(self, X, y):
        super().fit(X, y)
        # Remove feature names
        del self.feature_names_in_
        return self


Review comment (Member), on `def test_estimator_does_not_support_feature_names`:
Could we make a more general test that iterates over all estimators inheriting from MetaEstimatorMixin and checks the behaviour with a minimal classifier or regressor?

Reply (Member, PR author):
Updated the test in sklearn/tests/test_common.py to generate meta-estimators with MinimalEstimators. I left this one here to test get_feature_names_out.


def test_estimator_does_not_support_feature_names():
    """SelectFromModel works with estimators that do not support feature_names_in_.

    Non-regression test for #21949.
    """
    pytest.importorskip("pandas")
    X, y = datasets.load_iris(as_frame=True, return_X_y=True)
    all_feature_names = set(X.columns)

    rf = RandomForestNoFeatureNames()
    selector = SelectFromModel(rf).fit(X, y)

    # selector learns the feature names itself
    assert_array_equal(selector.feature_names_in_, X.columns)

    feature_names_out = set(selector.get_feature_names_out())
    assert feature_names_out < all_feature_names

    with pytest.warns(None) as records:
        selector.transform(X.iloc[1:3])
    assert not [str(record.message) for record in records]
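
A note on the final assertion: pytest.warns(None) was the usual way to assert that no warning is raised when this PR was merged, but it has since been deprecated in pytest 7. On newer pytest versions, an equivalent check for the last three lines of the test could use the standard warnings module, for example:

# Equivalent "no warning is raised" check for newer pytest versions, reusing
# `selector` and `X` from the test above: escalate warnings to errors so any
# spurious warning fails the test.
import warnings

with warnings.catch_warnings():
    warnings.simplefilter("error")
    selector.transform(X.iloc[1:3])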