diederikwp
Contributor

Reference Issues/PRs

What does this implement/fix? Explain your changes.

FeatureUnion allows "passthrough" to be supplied instead of a regular transformer, in which case all input features are passed through unchanged. However, calling get_feature_names_out on such a union currently fails. This PR makes it return the input feature names (prefixed with the transformer name) instead.

Example: This code

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion


ft = FeatureUnion([("imp", SimpleImputer()), ("pass", "passthrough")])

X = np.array([[1, 2, 3], [4, np.nan, 5]])
ft.fit(X)
ft.get_feature_names_out(["f1", "f2", "f3"])

will return array(['imp__f1', 'imp__f2', 'imp__f3', 'pass__f1', 'pass__f2', 'pass__f3'], dtype=object) after merging this PR.

Currently it raises AttributeError: Transformer pass (type FunctionTransformer) does not provide get_feature_names_out.
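
The naming scheme described above can be sketched in plain Python. This is only an illustration of the prefixing logic, not the code in this PR: `union_feature_names` is a hypothetical helper, and each non-passthrough transformer is represented by a callable that stands in for its get_feature_names_out.

```python
def union_feature_names(transformer_list, input_features):
    """Build prefixed output names for a FeatureUnion-like list (illustrative only).

    transformer_list holds (name, transformer) pairs. Each transformer is either
    the string "passthrough" or a callable mapping input names to output names,
    which stands in for a real transformer's get_feature_names_out.
    """
    names = []
    for name, trans in transformer_list:
        if trans == "passthrough":
            # A passthrough entry forwards every input feature unchanged.
            out = list(input_features)
        else:
            out = list(trans(input_features))
        # Prefix each output name with the transformer's name.
        names.extend(f"{name}__{feat}" for feat in out)
    return names


# In the example above the imputer keeps one output per input column,
# so an identity function stands in for it here (an assumption of this sketch).
result = union_feature_names(
    [("imp", lambda feats: feats), ("pass", "passthrough")],
    ["f1", "f2", "f3"],
)
print(result)
```

The passthrough branch is the part that was missing: it contributes the input names under the "pass__" prefix rather than raising.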

Any other comments?

I believe this is the behaviour most users would expect. It also makes FeatureUnion consistent with ColumnTransformer when it comes to handling "passthrough" in get_feature_names_out.

@thomasjpfan changed the title from [MRG] Implement get_feature_names_out for FeatureUnion in case of "passthrough" to ENH Implement get_feature_names_out for FeatureUnion in case of "passthrough" on Aug 1, 2022
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR! I left a minor comment on the whats new.

Overall LGTM

@thomasjpfan added the Quick Review label on Aug 1, 2022
@diederikwp force-pushed the feature-names-out-for-featurunion-passthrough branch from 17de56e to 9ca75e9 on August 13, 2022 09:09
@diederikwp
Contributor Author

Updated with main branch and resolved a conflict in the whats_new. Anything else required before we can merge @thomasjpfan :) ?

@jeremiedbb jeremiedbb added this to the 1.2 milestone Sep 6, 2022
@jeremiedbb
Member

Thanks for the PR @diederikwp. Mostly looks good, however it still does not work if we don't pass the feature names to get_feature_names_out:

import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
union = FeatureUnion([("pca", PCA(n_components=1)),
                      ("pass", "passthrough")])
X = pd.DataFrame([[0., 1., 3], [2., 2., 5]], columns=["a", "b", "c"])
union.fit_transform(X)
union.get_feature_names_out()

raises

When 'feature_names_out' is 'one-to-one', either 'input_features' must be passed, or 'feature_names_in_' and/or 'n_features_in_' must be defined. If you set 'validate' to 'True', then they will be defined automatically when 'fit' is called. Alternatively, you can set them in 'func'.

It comes from FunctionTransformer.

@thomasjpfan
Member

Good catch @jeremiedbb, we do not have a test for not passing in input_features to FeatureUnion.get_feature_names_out.

I think the simplest solution is for FeatureUnion to set feature_names_in_ by calling self._check_feature_names. I would put the call in FeatureUnion._parallel_func because _parallel_func is used in both fit_transform and fit. Then we can use those names in get_feature_names_out:

def get_feature_names_out(self, input_features=None):
    input_features = _check_feature_names_in(
        self, input_features, generate_names=True
    )
    for name, trans, _ in self._iter():
        ...

Note _check_feature_names_in is from sklearn.utils.validation.

(There is a more complicated solution that requires #23993 and having FeatureUnion remember the state of the FunctionTransformer from fit. FunctionTransformer would be responsible for storing feature_names_in_, which is used when input_features=None)
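
The fallback order this suggestion relies on can be sketched roughly as follows. This is a simplified stand-in written for illustration, not scikit-learn's implementation of _check_feature_names_in: `resolve_input_features` and `FittedStub` are hypothetical names, and the consistency checks between explicit and stored names are left out.

```python
class FittedStub:
    """Hypothetical stand-in for a fitted estimator (not a scikit-learn class)."""

    def __init__(self, feature_names_in=None, n_features_in=None):
        # Only set the attributes that fit would have been able to record.
        if feature_names_in is not None:
            self.feature_names_in_ = list(feature_names_in)
        if n_features_in is not None:
            self.n_features_in_ = n_features_in


def resolve_input_features(estimator, input_features=None):
    """Pick input feature names: explicit, then stored, then generated."""
    if input_features is not None:
        # Names passed by the caller take priority.
        return list(input_features)
    stored = getattr(estimator, "feature_names_in_", None)
    if stored is not None:
        # Names recorded during fit, e.g. from DataFrame columns.
        return list(stored)
    n_features = getattr(estimator, "n_features_in_", None)
    if n_features is None:
        raise ValueError("Unable to generate feature names without n_features_in_")
    # Last resort: generate placeholder names x0, x1, ...
    return [f"x{i}" for i in range(n_features)]


from_frame = resolve_input_features(FittedStub(feature_names_in=["a", "b", "c"]))
from_array = resolve_input_features(FittedStub(n_features_in=3))
explicit = resolve_input_features(FittedStub(n_features_in=3), ["f1", "f2", "f3"])
print(from_frame, from_array, explicit)
```

With the union itself remembering the names seen during fit, the passthrough entry no longer needs the caller to supply input_features.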

@jeremiedbb
Member

#23993 has been merged, which automatically fixes the issue reported in #24058 (comment). I just added a test for that. @thomasjpfan, is this still good for you?

Member

@jeremiedbb jeremiedbb left a comment

LGTM. Thanks @diederikwp

@diederikwp
Contributor Author

Thanks @thomasjpfan and @jeremiedbb for making it work without passing input_features and for the reviews!

Member

@thomasjpfan thomasjpfan left a comment

Yup this still looks good. Thank you for the update @diederikwp and @jeremiedbb !

@thomasjpfan thomasjpfan merged commit b4f1701 into scikit-learn:main Sep 24, 2022
Labels
module:pipeline Quick Review For PRs that are quick to review
3 participants