Transformative get_feature_names for various transformers #6425

Closed
5 of 11 tasks
jnothman opened this issue Feb 23, 2016 · 57 comments

Comments

@jnothman (Member) commented Feb 23, 2016

#6372 adds get_feature_names to PolynomialFeatures. It accepts a list of names of input_features (or substitutes with defaults) and constructs feature name strings that are human-readable and informative. Similar support should be available for other transformers, including feature selectors, feature agglomeration, FunctionTransformer, and perhaps even PCA (giving the top contributors to each component). FeatureUnion should be modified to handle the case where an argument is supplied. A proposal for support in Pipeline is given in #6424.

Modelled on #6372, each enhancement can be contributed as a separate PR. Note that the default names for features are [x0, x1, ...].
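
For reference, the behaviour added in #6372 looks roughly like this (output shown is illustrative):

```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

X = np.arange(6).reshape(3, 2)
poly = PolynomialFeatures(degree=2).fit(X)

# With no input_features given, names fall back to the x0, x1, ... defaults:
poly.get_feature_names()
# ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']

# Supplying input names produces human-readable output names:
poly.get_feature_names(['a', 'b'])
# ['1', 'a', 'b', 'a^2', 'a b', 'b^2']
```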

@yenchenlin (Contributor) commented Feb 23, 2016

@jnothman May I try this?

@jnothman (Member, Author) commented Feb 23, 2016

On which (family of) estimator?

@iamved commented Feb 23, 2016

Hi @jnothman, I am interested in taking this issue. Could you please suggest how I can get started?

@nelson-liu (Contributor) commented Feb 23, 2016

I'll handle implementing this for FunctionTransformer for now, and we'll see if there are more classes to implement this in after I'm done :)

@yenchenlin (Contributor) commented Feb 23, 2016

@jnothman I'll modify the FeatureUnion.

including feature selectors, feature agglomeration, FunctionTransformer, and perhaps even PCA

Is feature agglomeration here referring to cluster.FeatureAgglomeration?

@nelson-liu (Contributor) commented Feb 23, 2016

@yenchenlin1994 I assume so?

@yenchenlin (Contributor) commented Feb 23, 2016

@nelson-liu Thx!

If so, I would also love to implement it for cluster.FeatureAgglomeration.

@jnothman (Member, Author) commented Feb 23, 2016

I have added an extended list of transformers where this may apply, and noted the default feature naming convention (though maybe its generation belongs in utils).

@yenchenlin (Contributor) commented Feb 23, 2016

Hello @jnothman ,

What should preprocessing.Normalizer do when input_features passed into get_feature_names is None?

PolynomialFeatures doesn't suffer from this since it sets both self.n_input_features_ and self.n_output_features_ during fit().

Maybe preprocessing.Normalizer should set self.n_input_features_ too during fit()?

@jnothman (Member, Author) commented Feb 23, 2016

Fair question, which I don't currently have an answer for. One option is for it to just return feature_names even if that means returning None.
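
A minimal sketch of that option for a pass-through transformer like Normalizer, assuming a hypothetical n_input_features_ attribute set during fit():

```python
def get_feature_names(self, input_features=None):
    # Fall back to the default x0, x1, ... names when the number of input
    # features is known; otherwise pass through whatever was given, even
    # if that means returning None.
    if input_features is None and hasattr(self, 'n_input_features_'):
        return ['x%d' % i for i in range(self.n_input_features_)]
    return input_features
```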

@yenchenlin (Contributor) commented Feb 23, 2016

Oh, and even if the input_features passed into get_feature_names of preprocessing.Normalizer is not None,
I guess all it can do is return feature_names, which is the same as input_features in this case?

@jnothman (Member, Author) commented Feb 23, 2016

Yes, trivial, as noted in the issue description.

@yenchenlin (Contributor) commented Feb 23, 2016

Oh okay!
I will also do the scalers, normalizers, imputers, and Binarizer.
Will send a PR right away.

Thanks for your clarification.

@yenchenlin (Contributor) commented Feb 24, 2016

Hello @jnothman ,
about

feature selection and randomized L1

Do you mean all classes listed here:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

It seems that all these classes may be put into Pipeline and therefore need get_feature_names too.
Please correct me if I'm wrong. Thanks!
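
For the selectors this is mostly a matter of indexing input_features with the existing get_support() mask; a rough sketch (selected_feature_names is a hypothetical helper, not a proposed API):

```python
def selected_feature_names(selector, input_features=None):
    # get_support() is already part of the feature_selection API and
    # returns a boolean mask over the input features.
    mask = selector.get_support()
    if input_features is None:
        input_features = ['x%d' % i for i in range(len(mask))]
    return [name for name, keep in zip(input_features, mask) if keep]
```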

@jnothman (Member, Author) commented Feb 24, 2016

Yes, I mean those.

@maniteja123 (Contributor) commented Feb 24, 2016

Hi everyone, if it is fine I too would like to work on this issue. It would be helpful if the estimators currently being worked on could be mentioned, so that I can try something that does not overlap. Thanks!

@yenchenlin (Contributor) commented Feb 24, 2016

I think I can also work on

feature selection and randomized L1

@maniteja123 everything from PCA to the end of the issue description is not yet done.

@maniteja123 (Contributor) commented Feb 24, 2016

@yenchenlin1994, thanks for letting me know.

@maniteja123 (Contributor) commented Feb 24, 2016

@jnothman It would be of great help if you could confirm whether the output for PCA needs to have shape n_components, where each element is the input feature having the maximum contribution. Should the case of multiple features having a high contribution along one component be handled? Thank you!

@jnothman (Member, Author) commented Feb 24, 2016

I'm really not sure about PCA. Try to make something useful. If you think it will be helpful to users to have names for projection-style features, submit a PR. There is definitely a component of art to this.

@maniteja123 (Contributor) commented Feb 24, 2016

Thanks for the reply. My doubt is mainly about choosing dominant features, and also that all the components are not equally significant. Since multiple features can have almost the same contribution along a component, there might be a need for some threshold to figure out the number of input features to be considered. Anyway, I will create an initial PR with just the most dominant feature along each component and continue the discussion there. Hope that is fine.
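
For illustration only, naming each component by its top contributors could look something like this (pca_feature_names is a hypothetical helper; top stands in for the threshold discussed above):

```python
import numpy as np

def pca_feature_names(pca, input_features=None, top=2):
    # components_ has shape (n_components, n_features); take the `top`
    # largest loadings by absolute value for each component.
    n_features = pca.components_.shape[1]
    if input_features is None:
        input_features = ['x%d' % i for i in range(n_features)]
    return [' + '.join('%.2f*%s' % (comp[i], input_features[i])
                       for i in np.argsort(-np.abs(comp))[:top])
            for comp in pca.components_]
```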

@maniteja123 (Contributor) commented Feb 24, 2016

@yenchenlin1994 one more question. I am not sure how to handle this for SparseRandomProjection and GaussianRandomProjection. Have you already worked on all of these?

feature selection and randomized L1
feature agglomeration
FeatureUnion

If you have already started working on them, I will be waiting for your PRs :) Thanks!

@amueller (Member) commented Dec 12, 2017

I would like to build a transformer which selects (or excludes) features by name.

Can you get rid of that with a ColumnTransformer? I guess the question is a bit whether it's always possible to have the ColumnTransformer be right at the beginning of the pipeline, where we still know the names / positions of the columns.
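
Something like this would cover the select-by-name case, as long as the ColumnTransformer sits at the head of the pipeline where the column names are still known (the column names here are made up):

```python
from sklearn.compose import ColumnTransformer

# Keep only the named columns; remainder='drop' excludes everything else.
select_by_name = ColumnTransformer(
    [('selected', 'passthrough', ['age', 'income'])],
    remainder='drop')
```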

@amueller (Member) commented Dec 12, 2017

@kmike Are the structured annotations mostly needed because of the ranges?

Also, @GaelVaroquaux, any more opinions on this? I think doing the easy cases like feature selection and imputation (which might drop columns), in addition to having some support in FeatureUnion and Pipeline (and ColumnTransformer), will be very useful.

@kmike (Contributor) commented Dec 12, 2017

@amueller once feature names get more complex (e.g. PCA on top of tf*idf), showing them as text gets more and more opinionated, and maybe problem-specific. How concise should a feature name be, e.g. should we show only the top PCA components (how many?), or all of them? Note the amount of bikeshedding @jnothman got from me at TeamHG-Memex/eli5#208.

It seems the root of the problem is that formatting a feature name is not the same as figuring out where a feature comes from.

This is where a structured representation helps. Full information (all PCA components, or (start, end) ranges in the case of text vectorizers) can be excessive for a default feature name, but it allows richer display: highlighting features in text, or showing the rest of the components on mouse hover / click.
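
As a sketch, a structured feature name might carry the full provenance and leave formatting to the display layer (all field names here are made up):

```python
# Hypothetical structured feature names, as opposed to flat strings:
pca_feature = {'name': 'pca0',
               'components': [('x1', 0.71), ('x2', -0.70)]}  # keep all loadings
ngram_feature = {'name': "char_ngram('lly')",
                 'spans': [(4, 7)]}  # (start, end) ranges into the source text
```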

@amueller (Member) commented Dec 14, 2017

@kmike thanks for the explanation :) Maybe doing strings first would still work. For PCA I would basically just do pca1, pca2, etc. for now (aka punt).

@amueller (Member) commented Jun 2, 2018

@jnothman @GaelVaroquaux should we include this in the "Townhall" meeting?

@iamkiros commented Mar 3, 2019

Any news?

@adrinjalali (Member) commented Mar 3, 2019

@kiros32 #13307

@tgy commented May 1, 2020

I think OrdinalEncoder can be added to this list.
