Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH ColumnTransformer.get_feature_names() handles passthrough #14048

Merged
merged 16 commits into from Apr 19, 2020
Merged

ENH ColumnTransformer.get_feature_names() handles passthrough #14048

merged 16 commits into from Apr 19, 2020

Conversation

lrjball
Copy link
Contributor

@lrjball lrjball commented Jun 8, 2019

Currently, if ColumnTransformer is called with remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality.

In this pull request, if the ColumnTransformer was fitted on a DataFrame and remainder='passthrough' then the columns which passed through will appear in get_feature_names() as their column names in the DataFrame, and if it was not fitted on a DataFrame, then it will be the indices of the columns which will appear in get_feature_names().

Also, if someone explicitly passes a transformer as the text 'passthrough', then the feature names will be name__{column name}, where column_name is whatever have defined when they passed it.
e.g.
ct = ColumnTransformer([('trans', 'passthrough', ['col0', 'col1'])])
will produce features
['trans__col0', 'trans__col1']
which is in keeping with the existing behavior for other transformers.

The behaviour for when a transformer does not have a get_feature_names method has also been changed. Now, instead of an error being raised, the feature names will be given as 'name_x0', ..., 'name_xN'. This allows get_feature_names to be useful by giving at least some indication of where the features came from, even if some of the transformers do not give explicit feature names.

Don't believe this fixes any currently open issues.

…sthough'

Currently, if remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality. Now if the transformer is fit on a DataFrame, then the passthrough columns will appear in gte_feature_names() as the respective column names in the DataFrame, and if it is not a DataFrame then the column indices will be used instead.
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only reason we left this unimplemented was to avoid making hard commitments when first implementing the ColumnTransformer. Thanks for the pr

sklearn/compose/tests/test_column_transformer.py Outdated Show resolved Hide resolved
assert_raise_message(
NotImplementedError, 'get_feature_names is not yet supported',
ct.get_feature_names)
assert_equal(ct.get_feature_names(), ['trans__a', 'trans__b', 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely happy with the mix of string and numeric types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what you mean. The alternative would be to cast the array indices to string, would that be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes a bit in the general "get feature names" discussion domain. For arrays, we might give default names like "x0", "x1", ... but that's not fully decided yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"x0", "x1", ... seems to be used in a few other transformers (e.g. PolynomialFeatures), and is mentioned in scikit-learn/enhancement_proposals#18 as well. Will this be dependent on that SLEP then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That SLEP could certainly inform the choice. I'd be most happy with "x0", "x1", ... here for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the indices be the same as the indices in the input array, or should they be from 0 to the number of passthrough columns? I think that keeping the indices the same as the input would make them easier to interpret, but this goes against the behaviour of something like PolynomialFeatures, or even the behaviour implemented here for transformers with no get_feature_names, where they always start counting from "x0" .

If starting from 0, the strings would probably need to start with something like 'passthrough__' to avoid confusion.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number according to the input features, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added this now, thanks

assert_raise_message(
NotImplementedError, 'get_feature_names is not yet supported',
ct.get_feature_names)
assert_equal(ct.get_feature_names(), ['trans__0', 'trans__1'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that we should be applying a prefix if trans does nothing... Can you imagine a context where this helps? Yes, it avoids naming conflicts...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just to keep it consistent with the 'name__column' syntax used for when an explicit name is passed with a transformer, if for any reason someone passed transformers=[(name, 'passthrough', [columns])]. This seemed more consistent than dropping the name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.... I still am not happy with this prefixing. I'd be interested to hear other opinions, or use-cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why 'passthrough' is a valid option for a transformer at all, it seems like that could just be handled with remainder='passthrough' and dropping any unused columns before using the ColumnTransformer. But as it can be passed as a transformer with a name, it seems to make sense to use that name on the feature names, like with the other transformers. Having said that, these should probably be 'trans_x0', 'trans_x1' to be consistent with that notation. Happy to change it altogether though if there is a better option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why 'passthrough' is a valid option for a transformer at all, it seems like that could just be handled with remainder='passthrough' and dropping any unused columns before using the ColumnTransformer.

It is available to allow the user to try disabling some transformations in a parameter search such as grid search.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, so I suppose in that case you would want those disabled columns to be treated the same way as all of the other passthrough columns. I will make this change.

While making the changes for ‘passthrough’, it seems sensible to make an additional change to ColumnTransformer.get_feature_names(), to make it always returns something. This implementation adds feature names name__x0, …, name__xN for transformers without a get_feature_names method.

It seems harsh to raise an error if there are some transformers which do have feature names, and even if none of them have a get_feature_names then it is still helpful to know which features came from which transformer.
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this!

feature_names.extend([name + "__" + f for f in
trans.get_feature_names()])
feature_names.extend([name + "__x" + str(i)
for i in range(dim)])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is also giving names if the transformer has no feature names method? (just to clarify, this is not strictly related to the PR of enabling feature names for passthrough?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's right, it seems better to give default names than to error out especially if several of the other transformers have feature names. And even if none of them do, at least this let's you know which features came from which transformer. I suppose it isn't strictly related to the passthrough parameter, although is pretty similar. Would it be worth making a separate pull request for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... Yes, I don't think we should be doing this. I'd rather tell the user that they have to define get_feature_names in a component transformer than to invent something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I suppose this would have just been a workaround until every Transformer has a get_feature_names() method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now removed this change

sklearn/compose/tests/test_column_transformer.py Outdated Show resolved Hide resolved
assert_raise_message(
NotImplementedError, 'get_feature_names is not yet supported',
ct.get_feature_names)
assert_equal(ct.get_feature_names(), ['trans__a', 'trans__b', 1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comes a bit in the general "get feature names" discussion domain. For arrays, we might give default names like "x0", "x1", ... but that's not fully decided yet.

…sthough'

Currently, if remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality. Now if the transformer is fit on a DataFrame, then the passthrough columns will appear in gte_feature_names() as the respective column names in the DataFrame, and if it is not a DataFrame then the column indices will be used instead, where the feature names will be 'xi' for the ith index.
remaining_idx = sorted(list(set(range(n_columns)) - set(cols)))
if hasattr(X, 'columns'):
columns = X.columns
self._remainder_names = [columns[idx] for idx in remaining_idx]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code can be changed after #14237 is merged

assert_raise_message(
NotImplementedError, 'get_feature_names is not yet supported',
ct.get_feature_names)
assert_equal(ct.get_feature_names(), ['trans__0', 'trans__1'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm.... I still am not happy with this prefixing. I'd be interested to hear other opinions, or use-cases.

@rth rth added the Needs work label Jul 25, 2019
@rth rth removed the Needs work label Jul 26, 2019
…h' the same way

as remainder='passthrough'.

The behaviour of get_feature_names for passthrough is now the following:
- If fitted on a dataframe, then the columns passed with the trans='passthrough'
will be treated as positional if int (in which case the feature name will be
the column name at that position) or they will be used for the actual feature name.
Any columns in remainder='passthrough' will be appended to the end of feature_names,
as the column names from the fitted dataframe
- If fitted on an array, then the column names for both when trans='passthrough' or
when remainder='passthrough' will be 'xi' for each index value i. In terms of ordering,
the remainder columns will again come after the rest of the feature names.
@adrinjalali
Copy link
Member

Opened #16444 to discuss the options related to the remaining question.

@adrinjalali
Copy link
Member

@lrjball thanks for your patience. #16444 resolves the prefixing decision issue, and as you can see there, we'd like to have the feature names not prefixed in either of the cases:

  • remainder=passthrough
  • estimator is passthrough

Would you have time to apply the changes?

@lrjball
Copy link
Contributor Author

lrjball commented Feb 19, 2020

@adrinjalali No problem. Can I just confirm this is the agreed logic before I make any changes:

  • when a dataframe is passed, the feature names when either the transformer is 'passthrough' or remainder='passthrough' will just be the column names from the dataframe.
  • when an array is passed, the feature names for either passthough case will be the string 'xi' where i is the index of that feature in the input array.

In either case, if any of the transformers other than 'passthrough' (or 'drop') don't have a get_feature_names attribute, an error will be raised.

Is that right?

@adrinjalali
Copy link
Member

Yes, that's how I think of it.

Fixed issue caused by missing import introduced when doing a merge in the browser. _check_key_type has been replaced with _determine_key_type.
@lrjball
Copy link
Contributor Author

lrjball commented Mar 2, 2020

Okay, the behavior is now as agreed above, and the merge conflict has been fixed.

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for your patience and persistence @lrjball

Copy link
Member

@thomasjpfan thomasjpfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @lrjball !

sklearn/compose/_column_transformer.py Outdated Show resolved Hide resolved
sklearn/compose/_column_transformer.py Outdated Show resolved Hide resolved
lrjball and others added 3 commits March 3, 2020 22:12
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
- Seperated the pandas part of the test into its own function to avoid the whole test being skipped when pandas is not installed.
- Removed the unused _output_dims attribute.
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an |Enhancement| entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:

@lrjball
Copy link
Contributor Author

lrjball commented Mar 5, 2020

Thanks, I have added an entry to the change log now. Is it worth me updating the get_feature_names docstring as well, or is that unnecessary?

Added support for mask and slices for both dataframes and arrays, as well as tests for each case.
@cmarmo cmarmo removed the Needs work label Mar 6, 2020
@jnothman jnothman changed the title Extended get_feature_names() for ColumnTransformer to include 'passthrough' ENH ColumnTransformer.get_feature_names() handles passthrough Apr 19, 2020
@jnothman jnothman merged commit 670b85c into scikit-learn:master Apr 19, 2020
gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
viclafargue pushed a commit to viclafargue/scikit-learn that referenced this pull request Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants