add get_feature_names to CategoricalEncoder #10181

Closed
amueller opened this Issue Nov 21, 2017 · 11 comments

amueller (Member) commented Nov 21, 2017

We should add a get_feature_names to the new CategoricalEncoder, as discussed here. I think it would be good to be consistent with PolynomialFeatures, which allows passing in original feature names to map them to new feature names. Also see #6425.

Nirvan101 (Contributor) commented Nov 21, 2017

I'd like to try this one.

amueller (Member) commented Nov 21, 2017

If you haven't contributed before, I suggest you try an issue labeled "good first issue". Though this one isn't too hard, either.

Nirvan101 (Contributor) commented Nov 22, 2017

@amueller
I think I can handle it.
So we want something like this, right?

enc.fit([['male',0], ['female', 1]])
enc.get_feature_names()

>> ['female', 'male', 0, 1]

Can you please give an example of how original feature names can map to new feature names? I have seen the get_feature_names() from PolynomialFeatures, but I don't understand what that means in this case.

jnothman (Member) commented Nov 22, 2017

Nirvan101 (Contributor) commented Nov 22, 2017

@jnothman Is this what you mean?

enc.fit([['male', 0, 1],
         ['female', 1, 0]])

enc.get_feature_names(['one', 'two', 'three'])

>> ['one_female', 'one_male', 'two_0', 'two_1', 'three_0', 'three_1']

And in case I don't pass any strings, it should just use x0, x1 and so on for the prefixes, right?
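A minimal sketch of that behavior in pure Python (the function name mirrors the proposal above, but `categories` standing in for a fitted `categories_` attribute, and all names here, are illustrative rather than final API):

```python
# Hypothetical sketch of the proposed get_feature_names, assuming the
# encoder stores one sorted category list per input column (passed in
# here as `categories` instead of a fitted `categories_` attribute).
def get_feature_names(categories, input_features=None):
    if input_features is None:
        # Fall back to x0, x1, ... prefixes, as PolynomialFeatures does.
        input_features = ['x%d' % i for i in range(len(categories))]
    names = []
    for feature, cats in zip(input_features, categories):
        names.extend('%s_%s' % (feature, cat) for cat in cats)
    return names

get_feature_names([['female', 'male'], [0, 1], [0, 1]],
                  ['one', 'two', 'three'])
# ['one_female', 'one_male', 'two_0', 'two_1', 'three_0', 'three_1']
```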

jnothman (Member) commented Nov 22, 2017

jorisvandenbossche (Contributor) commented Nov 22, 2017

I like the idea to be able to specify input feature names.

Regarding the syntax for combining the two names: as prior art we have e.g. DictVectorizer, which does something like ['0=female', '0=male', '1=0', '1=1'] (assuming we use 0 and 1 as the column names for arrays), or Pipeline, which uses double underscores (['0__female', '0__male', '1__0', '1__1']). Others?
I personally like the __ a bit more, but the fact that it is already used by pipelines is actually a reason for me to use = in this case. E.g. in combination with the ColumnTransformer (assuming it would use the __ syntax like Pipeline), you would then get a feature name like 'cat__0=male' instead of 'cat__0__male'.
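The two conventions side by side, as a toy illustration (the `combine` helper and the 'cat' transformer prefix are hypothetical, not existing API):

```python
# Toy illustration of the separator trade-off discussed above: the
# transformer name is joined with '__' (as Pipeline does), while the
# column/category pair uses either '=' or '__'.
def combine(transformer, column, category, sep):
    return transformer + '__' + column + sep + str(category)

combine('cat', '0', 'male', '=')   # 'cat__0=male'  (levels stay distinct)
combine('cat', '0', 'male', '__')  # 'cat__0__male' (nesting is ambiguous)
```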

jorisvandenbossche (Contributor) commented Nov 22, 2017

Additional question:

  • if the input is a pandas DataFrame, do we want to preserve the column names (to use instead of 0, 1, ...)?
    (ideally yes IMO, but this would require some extra code, as currently it is not detected whether a DataFrame is passed; it is just coerced to an array)

jnothman (Member) commented Nov 22, 2017

jorisvandenbossche (Contributor) commented Nov 22, 2017

> it's hard for us to keep them

It's not really 'hard':

class CategoricalEncoder():

    def fit(self, X, ...):
        ...
        if hasattr(X, 'iloc'):
            self._input_features = X.columns
        ...

    def get_feature_names(self, input_features=None):
        if input_features is None:
            input_features = self._input_features
        ...

but of course it is added complexity, and more explicit support for pandas dataframes, which is not necessarily something we want to add (I just don't think 'hard' is the correct reason :-)).

But e.g. if you combine multiple sets of columns and transformers in a ColumnTransformer, it is not always that straightforward for the user to keep track of them IMO, because you then need to combine the different sets of selected columns into one list to pass to get_feature_names.
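A sketch of the bookkeeping such a container could do on the user's behalf (the `combined_feature_names` helper is hypothetical; `named_feature_lists` stands in for the per-transformer get_feature_names results):

```python
# Hypothetical sketch: a ColumnTransformer-like container could spare
# the user this bookkeeping by prefixing each sub-transformer's output
# names with the transformer name, Pipeline-style.
def combined_feature_names(named_feature_lists):
    out = []
    for name, feature_names in named_feature_lists:
        out.extend(name + '__' + f for f in feature_names)
    return out

combined_feature_names([('cat', ['0=female', '0=male']),
                        ('num', ['age'])])
# ['cat__0=female', 'cat__0=male', 'num__age']
```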

jnothman (Member) commented Nov 22, 2017

amueller added a commit that referenced this issue Jul 16, 2018

[MRG] Add get_feature_names to OneHotEncoder (#10198)
**Reference Issues/PRs**
Fixes #10181 

**What does this implement/fix? Explain your changes.**
Added function **get_feature_names()** to **CategoricalEncoder** class. This is in `data.py` under `sklearn.preprocessing` 

