-
-
Notifications
You must be signed in to change notification settings - Fork 25.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH ColumnTransformer.get_feature_names() handles passthrough #14048
ENH ColumnTransformer.get_feature_names() handles passthrough #14048
Conversation
…sthough' Currently, if remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality. Now if the transformer is fit on a DataFrame, then the passthrough columns will appear in gte_feature_names() as the respective column names in the DataFrame, and if it is not a DataFrame then the column indices will be used instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only reason we left this unimplemented was to avoid making hard commitments when first implementing the ColumnTransformer. Thanks for the pr
assert_raise_message( | ||
NotImplementedError, 'get_feature_names is not yet supported', | ||
ct.get_feature_names) | ||
assert_equal(ct.get_feature_names(), ['trans__a', 'trans__b', 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not entirely happy with the mix of string and numeric types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what you mean. The alternative would be to cast the array indices to string, would that be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes a bit in the general "get feature names" discussion domain. For arrays, we might give default names like "x0", "x1", ... but that's not fully decided yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"x0", "x1", ... seems to be used in a few other transformers (e.g. PolynomialFeatures), and is mentioned in scikit-learn/enhancement_proposals#18 as well. Will this be dependent on that SLEP then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That SLEP could certainly inform the choice. I'd be most happy with "x0", "x1", ... here for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the indices be the same as the indices in the input array, or should they be from 0 to the number of passthrough columns? I think that keeping the indices the same as the input would make them easier to interpret, but this goes against the behaviour of something like PolynomialFeatures, or even the behaviour implemented here for transformers with no get_feature_names, where they always start counting from "x0" .
If starting from 0, the strings would probably need to start with something like 'passthrough__' to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Number according to the input features, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this now, thanks
assert_raise_message( | ||
NotImplementedError, 'get_feature_names is not yet supported', | ||
ct.get_feature_names) | ||
assert_equal(ct.get_feature_names(), ['trans__0', 'trans__1']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not convinced that we should be applying a prefix if trans does nothing... Can you imagine a context where this helps? Yes, it avoids naming conflicts...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just to keep it consistent with the 'name__column' syntax used for when an explicit name is passed with a transformer, if for any reason someone passed transformers=[(name, 'passthrough', [columns])]. This seemed more consistent than dropping the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm.... I still am not happy with this prefixing. I'd be interested to hear other opinions, or use-cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why 'passthrough' is a valid option for a transformer at all, it seems like that could just be handled with remainder='passthrough' and dropping any unused columns before using the ColumnTransformer. But as it can be passed as a transformer with a name, it seems to make sense to use that name on the feature names, like with the other transformers. Having said that, these should probably be 'trans_x0', 'trans_x1' to be consistent with that notation. Happy to change it altogether though if there is a better option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why 'passthrough' is a valid option for a transformer at all, it seems like that could just be handled with remainder='passthrough' and dropping any unused columns before using the ColumnTransformer.
It is available to allow the user to try disabling some transformations in a parameter search such as grid search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah okay, so I suppose in that case you would want those disabled columns to be treated the same way as all of the other passthrough columns. I will make this change.
While making the changes for ‘passthrough’, it seems sensible to make an additional change to ColumnTransformer.get_feature_names(), to make it always returns something. This implementation adds feature names name__x0, …, name__xN for transformers without a get_feature_names method. It seems harsh to raise an error if there are some transformers which do have feature names, and even if none of them have a get_feature_names then it is still helpful to know which features came from which transformer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this!
feature_names.extend([name + "__" + f for f in | ||
trans.get_feature_names()]) | ||
feature_names.extend([name + "__x" + str(i) | ||
for i in range(dim)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is also giving names if the transformer has no feature names method? (just to clarify, this is not strictly related to the PR of enabling feature names for passthrough?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's right, it seems better to give default names than to error out especially if several of the other transformers have feature names. And even if none of them do, at least this let's you know which features came from which transformer. I suppose it isn't strictly related to the passthrough parameter, although is pretty similar. Would it be worth making a separate pull request for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... Yes, I don't think we should be doing this. I'd rather tell the user that they have to define get_feature_names in a component transformer than to invent something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I suppose this would have just been a workaround until every Transformer has a get_feature_names() method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've now removed this change
assert_raise_message( | ||
NotImplementedError, 'get_feature_names is not yet supported', | ||
ct.get_feature_names) | ||
assert_equal(ct.get_feature_names(), ['trans__a', 'trans__b', 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes a bit in the general "get feature names" discussion domain. For arrays, we might give default names like "x0", "x1", ... but that's not fully decided yet.
…sthough' Currently, if remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality. Now if the transformer is fit on a DataFrame, then the passthrough columns will appear in gte_feature_names() as the respective column names in the DataFrame, and if it is not a DataFrame then the column indices will be used instead, where the feature names will be 'xi' for the ith index.
remaining_idx = sorted(list(set(range(n_columns)) - set(cols))) | ||
if hasattr(X, 'columns'): | ||
columns = X.columns | ||
self._remainder_names = [columns[idx] for idx in remaining_idx] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code can be changed after #14237 is merged
assert_raise_message( | ||
NotImplementedError, 'get_feature_names is not yet supported', | ||
ct.get_feature_names) | ||
assert_equal(ct.get_feature_names(), ['trans__0', 'trans__1']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm.... I still am not happy with this prefixing. I'd be interested to hear other opinions, or use-cases.
…h' the same way as remainder='passthrough'. The behaviour of get_feature_names for passthrough is now the following: - If fitted on a dataframe, then the columns passed with the trans='passthrough' will be treated as positional if int (in which case the feature name will be the column name at that position) or they will be used for the actual feature name. Any columns in remainder='passthrough' will be appended to the end of feature_names, as the column names from the fitted dataframe - If fitted on an array, then the column names for both when trans='passthrough' or when remainder='passthrough' will be 'xi' for each index value i. In terms of ordering, the remainder columns will again come after the rest of the feature names.
Removed as this was waiting on PR #14495, which did not go ahead.
Opened #16444 to discuss the options related to the remaining question. |
@adrinjalali No problem. Can I just confirm this is the agreed logic before I make any changes:
In either case, if any of the transformers other than Is that right? |
Yes, that's how I think of it. |
Fixed issue caused by missing import introduced when doing a merge in the browser. _check_key_type has been replaced with _determine_key_type.
Okay, the behavior is now as agreed above, and the merge conflict has been fixed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thanks for your patience and persistence @lrjball
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @lrjball !
Co-Authored-By: Thomas J Fan <thomasjpfan@gmail.com>
- Seperated the pandas part of the test into its own function to avoid the whole test being skipped when pandas is not installed. - Removed the unused _output_dims attribute.
…rjball/scikit-learn into column_transformer_passthrough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add an |Enhancement|
entry to the change log at doc/whats_new/v0.23.rst
. Like the other entries there, please reference this pull request with :pr:
and credit yourself (and other contributors if applicable) with :user:
Thanks, I have added an entry to the change log now. Is it worth me updating the get_feature_names docstring as well, or is that unnecessary? |
Added support for mask and slices for both dataframes and arrays, as well as tests for each case.
Currently, if ColumnTransformer is called with remainder='passthrough', then get_feature_names() will raise a NotImplementedError, but this pull request adds in that functionality.
In this pull request, if the ColumnTransformer was fitted on a DataFrame and remainder='passthrough' then the columns which passed through will appear in get_feature_names() as their column names in the DataFrame, and if it was not fitted on a DataFrame, then it will be the indices of the columns which will appear in get_feature_names().
Also, if someone explicitly passes a transformer as the text 'passthrough', then the feature names will be name__{column name}, where column_name is whatever have defined when they passed it.
e.g.
ct = ColumnTransformer([('trans', 'passthrough', ['col0', 'col1'])])
will produce features
['trans__col0', 'trans__col1']
which is in keeping with the existing behavior for other transformers.
The behaviour for when a transformer does not have a get_feature_names method has also been changed. Now, instead of an error being raised, the feature names will be given as 'name_x0', ..., 'name_xN'. This allows get_feature_names to be useful by giving at least some indication of where the features came from, even if some of the transformers do not give explicit feature names.
Don't believe this fixes any currently open issues.