ENH ColumnTransformer.get_feature_names() handles passthrough #14048

lrjball · 2019-06-08T23:40:49Z

Currently, if a ColumnTransformer is created with remainder='passthrough', get_feature_names() raises a NotImplementedError. This pull request adds that functionality.

With this pull request, if the ColumnTransformer was fitted on a DataFrame with remainder='passthrough', the columns that passed through appear in get_feature_names() under their DataFrame column names. If it was not fitted on a DataFrame, the indices of those columns appear instead.

Also, if someone explicitly passes the string 'passthrough' as a transformer, the feature names will be name__{column_name}, where name is the transformer's name and column_name is the column as they specified it when passing it.
e.g.
ct = ColumnTransformer([('trans', 'passthrough', ['col0', 'col1'])])
will produce features
['trans__col0', 'trans__col1']
which is in keeping with the existing behavior for other transformers.

The behaviour for when a transformer does not have a get_feature_names method has also been changed. Now, instead of raising an error, the feature names are given as 'name__x0', ..., 'name__xN'. This makes get_feature_names useful by giving at least some indication of where the features came from, even if some of the transformers do not give explicit feature names.
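
A minimal sketch of the fallback naming described above, assuming the number of output columns of each transformer is known. The helper `default_feature_names` is hypothetical and not part of scikit-learn; the double underscore follows the existing name__feature prefix convention.

```python
def default_feature_names(name, n_outputs):
    """Placeholder names for a transformer without get_feature_names.

    Hypothetical helper for illustration only.
    """
    return [f"{name}__x{i}" for i in range(n_outputs)]


print(default_feature_names("scaler", 3))
# ['scaler__x0', 'scaler__x1', 'scaler__x2']
```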

I don't believe this fixes any currently open issues.

…sthough' Currently, if remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality. Now if the transformer is fit on a DataFrame, then the passthrough columns will appear in get_feature_names() as the respective column names in the DataFrame, and if it is not a DataFrame then the column indices will be used instead.

jnothman

I think the only reason we left this unimplemented was to avoid making hard commitments when first implementing the ColumnTransformer. Thanks for the PR.

jnothman · 2019-06-11T08:19:20Z

sklearn/compose/tests/test_column_transformer.py

-    assert_raise_message(
-        NotImplementedError, 'get_feature_names is not yet supported',
-        ct.get_feature_names)
+    assert_equal(ct.get_feature_names(), ['trans__a', 'trans__b', 1])


Not entirely happy with the mix of string and numeric types.

I know what you mean. The alternative would be to cast the array indices to string, would that be better?

This comes a bit in the general "get feature names" discussion domain. For arrays, we might give default names like "x0", "x1", ... but that's not fully decided yet.

"x0", "x1", ... seems to be used in a few other transformers (e.g. PolynomialFeatures), and is mentioned in scikit-learn/enhancement_proposals#18 as well. Will this be dependent on that SLEP then?

That SLEP could certainly inform the choice. I'd be most happy with "x0", "x1", ... here for now.

Should the indices be the same as the indices in the input array, or should they run from 0 to the number of passthrough columns? I think that keeping the indices the same as the input would make them easier to interpret, but this goes against the behaviour of something like PolynomialFeatures, or even the behaviour implemented here for transformers with no get_feature_names, where they always start counting from "x0".

If starting from 0, the strings would probably need to start with something like 'passthrough__' to avoid confusion.

Number according to the input features, please.

I've added this now, thanks
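
The two numbering options discussed in this thread can be compared with a small sketch. Both function names are hypothetical, and the 'passthrough__' prefix in the second one is only the suggestion floated above, not something the PR implements.

```python
def names_by_input_index(passthrough_idx):
    # Keep each column's position in the input array: column 3 -> 'x3'.
    return [f"x{i}" for i in passthrough_idx]


def names_from_zero(passthrough_idx, prefix="passthrough__"):
    # Restart counting at 0, so a prefix is needed to avoid clashing
    # with the input columns that really sit at positions 0, 1, ...
    return [f"{prefix}x{k}" for k in range(len(passthrough_idx))]


remaining = [2, 5]  # input columns left untouched
print(names_by_input_index(remaining))  # ['x2', 'x5']
print(names_from_zero(remaining))       # ['passthrough__x0', 'passthrough__x1']
```

Numbering by input position, the option requested by the reviewer, lets a reader map each name straight back to a column of the original array.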

jnothman · 2019-06-11T08:19:27Z

sklearn/compose/tests/test_column_transformer.py

-    assert_raise_message(
-        NotImplementedError, 'get_feature_names is not yet supported',
-        ct.get_feature_names)
+    assert_equal(ct.get_feature_names(), ['trans__0', 'trans__1'])


I'm not convinced that we should be applying a prefix if trans does nothing... Can you imagine a context where this helps? Yes, it avoids naming conflicts...

This was just to keep it consistent with the 'name__column' syntax used for when an explicit name is passed with a transformer, if for any reason someone passed transformers=[(name, 'passthrough', [columns])]. This seemed more consistent than dropping the name.

Hmmm.... I still am not happy with this prefixing. I'd be interested to hear other opinions, or use-cases.

I'm not sure why 'passthrough' is a valid option for a transformer at all, it seems like that could just be handled with remainder='passthrough' and dropping any unused columns before using the ColumnTransformer. But as it can be passed as a transformer with a name, it seems to make sense to use that name on the feature names, like with the other transformers. Having said that, these should probably be 'trans_x0', 'trans_x1' to be consistent with that notation. Happy to change it altogether though if there is a better option.

I'm not sure why 'passthrough' is a valid option for a transformer at all, it seems like that could just be handled with remainder='passthrough' and dropping any unused columns before using the ColumnTransformer.

It is available to allow the user to try disabling some transformations in a parameter search such as grid search.

Ah okay, so I suppose in that case you would want those disabled columns to be treated the same way as all of the other passthrough columns. I will make this change.

While making the changes for ‘passthrough’, it seems sensible to make an additional change to ColumnTransformer.get_feature_names() so that it always returns something. This implementation adds feature names name__x0, …, name__xN for transformers without a get_feature_names method. It seems harsh to raise an error if some of the transformers do have feature names, and even if none of them has a get_feature_names method, it is still helpful to know which features came from which transformer.

jorisvandenbossche

Thanks for working on this!

jorisvandenbossche · 2019-06-18T06:40:58Z

sklearn/compose/_column_transformer.py

-            feature_names.extend([name + "__" + f for f in
-                                  trans.get_feature_names()])
+                feature_names.extend([name + "__x" + str(i)
+                                      for i in range(dim)])


So this is also giving names if the transformer has no feature names method? (just to clarify, this is not strictly related to the PR of enabling feature names for passthrough?)

Yes, that's right. It seems better to give default names than to error out, especially if several of the other transformers have feature names. And even if none of them do, at least this lets you know which features came from which transformer. I suppose it isn't strictly related to the passthrough parameter, although it is pretty similar. Would it be worth making a separate pull request for this?

Hmm... Yes, I don't think we should be doing this. I'd rather tell the user that they have to define get_feature_names in a component transformer than to invent something.

Fair enough, I suppose this would have just been a workaround until every Transformer has a get_feature_names() method.

I've now removed this change

jorisvandenbossche · 2019-06-18T06:46:49Z

sklearn/compose/tests/test_column_transformer.py

-    assert_raise_message(
-        NotImplementedError, 'get_feature_names is not yet supported',
-        ct.get_feature_names)
+    assert_equal(ct.get_feature_names(), ['trans__a', 'trans__b', 1])


This comes a bit in the general "get feature names" discussion domain. For arrays, we might give default names like "x0", "x1", ... but that's not fully decided yet.

…sthough' Currently, if remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality. Now if the transformer is fit on a DataFrame, then the passthrough columns will appear in get_feature_names() as the respective column names in the DataFrame, and if it is not a DataFrame then the column indices will be used instead, where the feature names will be 'xi' for the ith index.

jnothman · 2019-07-11T10:42:17Z

sklearn/compose/_column_transformer.py

+        remaining_idx = sorted(list(set(range(n_columns)) - set(cols)))
+        if hasattr(X, 'columns'):
+            columns = X.columns
+            self._remainder_names = [columns[idx] for idx in remaining_idx]


This code can be changed after #14237 is merged
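
As a rough illustration of the diff above, the remainder columns are the input positions not claimed by any transformer, in ascending order. The sketch below is standalone Python rather than the PR's code: `remainder_names` and its parameters are assumed names, and the 'x<i>' fallback for array input follows the convention requested earlier in this review.

```python
def remainder_names(n_columns, used_cols, columns=None):
    """Names of the columns that fall through to the remainder.

    n_columns : number of columns in the fitted input
    used_cols : positional indices claimed by the transformers
    columns   : column labels if the input was a DataFrame, else None
    """
    remaining_idx = sorted(set(range(n_columns)) - set(used_cols))
    if columns is not None:
        # DataFrame input: look up the label at each remaining position.
        return [columns[idx] for idx in remaining_idx]
    # Array input: fall back to names numbered by input position.
    return [f"x{idx}" for idx in remaining_idx]


print(remainder_names(4, [0, 2], columns=["a", "b", "c", "d"]))  # ['b', 'd']
print(remainder_names(4, [0, 2]))                                # ['x1', 'x3']
```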

jnothman · 2019-07-11T10:46:03Z

sklearn/compose/tests/test_column_transformer.py

-    assert_raise_message(
-        NotImplementedError, 'get_feature_names is not yet supported',
-        ct.get_feature_names)
+    assert_equal(ct.get_feature_names(), ['trans__0', 'trans__1'])


Hmmm.... I still am not happy with this prefixing. I'd be interested to hear other opinions, or use-cases.

…h' the same way as remainder='passthrough'. The behaviour of get_feature_names for passthrough is now the following:
- If fitted on a dataframe, then the columns passed with the trans='passthrough' will be treated as positional if int (in which case the feature name will be the column name at that position) or they will be used for the actual feature name. Any columns in remainder='passthrough' will be appended to the end of feature_names, as the column names from the fitted dataframe.
- If fitted on an array, then the column names for both when trans='passthrough' or when remainder='passthrough' will be 'xi' for each index value i. In terms of ordering, the remainder columns will again come after the rest of the feature names.

Removed as this was waiting on PR #14495, which did not go ahead.

adrinjalali · 2020-02-14T13:19:36Z

Opened #16444 to discuss the options related to the remaining question.

adrinjalali · 2020-02-19T16:40:37Z

@lrjball thanks for your patience. #16444 resolves the prefixing decision issue, and as you can see there, we'd like to have the feature names not prefixed in either of the cases:

remainder=passthrough
estimator is passthrough

Would you have time to apply the changes?

lrjball · 2020-02-19T20:10:17Z

@adrinjalali No problem. Can I just confirm this is the agreed logic before I make any changes:

when a dataframe is passed, the feature names when either the transformer is 'passthrough' or remainder='passthrough' will just be the column names from the dataframe.
when an array is passed, the feature names for either passthrough case will be the string 'xi' where i is the index of that feature in the input array.

In either case, if any of the transformers other than 'passthrough' (or 'drop') don't have a get_feature_names attribute, an error will be raised.

Is that right?

adrinjalali · 2020-02-24T10:55:59Z

Yes, that's how I think of it.
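
The agreed behaviour can be summarised in a short standalone sketch. It is not the scikit-learn implementation: `sketch_feature_names`, the toy `Upper` class, and the restriction to positional integer columns are assumptions made for illustration.

```python
def sketch_feature_names(transformers, n_columns, columns=None,
                         remainder="passthrough"):
    """transformers: list of (name, trans, positional column indices).

    columns holds the DataFrame labels, or None for array input.
    """

    def passthrough_name(idx):
        # DataFrame input: use the column label; array input: 'x<idx>'.
        return columns[idx] if columns is not None else f"x{idx}"

    feature_names, used = [], set()
    for name, trans, cols in transformers:
        used.update(cols)
        if trans == "drop":
            continue
        if trans == "passthrough":
            # Explicit passthrough: no name prefix.
            feature_names.extend(passthrough_name(i) for i in cols)
        elif hasattr(trans, "get_feature_names"):
            feature_names.extend(
                f"{name}__{f}" for f in trans.get_feature_names()
            )
        else:
            raise AttributeError(
                f"Transformer {name!r} does not provide get_feature_names."
            )
    if remainder == "passthrough":
        # Untouched columns come last, also without a prefix.
        feature_names.extend(
            passthrough_name(i) for i in range(n_columns) if i not in used
        )
    return feature_names


class Upper:
    """Toy transformer standing in for one that reports feature names."""

    def get_feature_names(self):
        return ["A"]


specs = [("up", Upper(), [0]), ("keep", "passthrough", [2])]
print(sketch_feature_names(specs, 4, columns=["a", "b", "c", "d"]))
# ['up__A', 'c', 'b', 'd']
print(sketch_feature_names(specs, 4))
# ['up__A', 'x2', 'x1', 'x3']
```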

Fixed issue caused by missing import introduced when doing a merge in the browser. _check_key_type has been replaced with _determine_key_type.

lrjball · 2020-03-02T21:22:49Z

Okay, the behavior is now as agreed above, and the merge conflict has been fixed.

adrinjalali

Nice! Thanks for your patience and persistence @lrjball

thomasjpfan

Thank you for the PR @lrjball !

Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>

- Separated the pandas part of the test into its own function to avoid the whole test being skipped when pandas is not installed.
- Removed the unused _output_dims attribute.

…rjball/scikit-learn into column_transformer_passthrough

jnothman

Please add an |Enhancement| entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

lrjball · 2020-03-05T19:12:44Z

Thanks, I have added an entry to the change log now. Is it worth me updating the get_feature_names docstring as well, or is that unnecessary?

Added support for mask and slices for both dataframes and arrays, as well as tests for each case.
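
To name passthrough columns selected by a boolean mask or a slice, the selection first has to be turned into positional indices. Below is a minimal pure-Python sketch of that step; `resolve_positions` is a hypothetical helper, and the PR itself may rely on different internals.

```python
def resolve_positions(key, n_columns):
    """Turn a slice, boolean mask, or list of ints into column positions."""
    if isinstance(key, slice):
        # Slicing a range applies start/stop/step without materialising data.
        return list(range(n_columns)[key])
    key = list(key)
    if key and all(isinstance(k, bool) for k in key):
        if len(key) != n_columns:
            raise ValueError("Boolean mask must have one entry per column.")
        return [i for i, keep in enumerate(key) if keep]
    return key


print([f"x{i}" for i in resolve_positions(slice(1, 3), 4)])
# ['x1', 'x2']
print([f"x{i}" for i in resolve_positions([True, False, True], 3)])
# ['x0', 'x2']
```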

jnothman reviewed Jun 11, 2019

jorisvandenbossche reviewed Jun 18, 2019

removed typo fix to raise as separate PR

728cc8a

lrjball mentioned this pull request Jun 19, 2019

Fixed typo in test_column_transformer #14128

Merged

jnothman reviewed Jul 11, 2019

rth added the Needs work label Jul 25, 2019

lrjball added 2 commits July 26, 2019 22:22

Merge branch 'master' into column_transformer_passthrough

073848d

replaced deprecated assert_equal with assert

4cacd93

rth removed the Needs work label Jul 26, 2019

lrjball mentioned this pull request Jul 28, 2019

Allowed trans='passthrough' to handle scalar column input. #14495

Closed

lrjball commented Jul 28, 2019

lrjball added 2 commits August 18, 2019 22:03

removed checks on scalar columns for passthrough

38f9db0

Removed as this was waiting on PR #14495, which did not go ahead.

Merge branch 'master' into column_transformer_passthrough

1b55757

jnothman added the Waiting for Reviewer label Aug 25, 2019

adrinjalali added module:compose Needs Decision Requires decision labels Feb 14, 2020

adrinjalali mentioned this pull request Feb 14, 2020

RFC: prefixing output feature names in ColumnTransformer with passthrough #16444

Closed

adrinjalali added Needs work and removed Needs Decision Requires decision Waiting for Reviewer labels Feb 19, 2020

lrjball added 2 commits March 2, 2020 20:23

Merge branch 'master' into column_transformer_passthrough

e1c16ba

Fixed missing import issue

76f47ac

Fixed issue caused by missing import introduced when doing a merge in the browser. _check_key_type has been replaced with _determine_key_type.

adrinjalali approved these changes Mar 3, 2020

thomasjpfan reviewed Mar 3, 2020

lrjball and others added 3 commits March 3, 2020 22:12

Update sklearn/compose/_column_transformer.py

07b9403

Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>

Separated pandas test into own function, and removed unused attribute.

762850e

- Separated the pandas part of the test into its own function to avoid the whole test being skipped when pandas is not installed.
- Removed the unused _output_dims attribute.

Merge branch 'column_transformer_passthrough' of https://github.com/l…

d031673

…rjball/scikit-learn into column_transformer_passthrough

lrjball mentioned this pull request Mar 3, 2020

TST Fixes test so that whole test isn't skipped if pandas not… #16627

Merged

jnothman approved these changes Mar 4, 2020

added enhancement entry to whats_new

319842e

thomasjpfan reviewed Mar 5, 2020

Added support for boolean masks and slices

447a3ba

Added support for mask and slices for both dataframes and arrays, as well as tests for each case.

cmarmo removed the Needs work label Mar 6, 2020

jnothman changed the title ~~Extended get_feature_names() for ColumnTransformer to include 'passthrough'~~ ENH ColumnTransformer.get_feature_names() handles passthrough Apr 19, 2020

jnothman merged commit 670b85c into scikit-learn:master Apr 19, 2020

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020

ENH ColumnTransformer.get_feature_names() handles passthrough (scikit…

f9bbf80

…-learn#14048)

viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020

ENH ColumnTransformer.get_feature_names() handles passthrough (scikit…

3994098

…-learn#14048)
