
Transformative get_feature_names for various transformers #6425

Open
jnothman opened this issue Feb 23, 2016 · 56 comments


@jnothman
Member

commented Feb 23, 2016

#6372 adds get_feature_names to PolynomialFeatures. It accepts a list of names of input_features (or substitutes with defaults) and constructs feature name strings that are human-readable and informative. Similar support should be available for other transformers, including feature selectors, feature agglomeration, FunctionTransformer, and perhaps even PCA (giving the top contributors to each component). FeatureUnion should be modified to handle the case where an argument is supplied. A proposal for support in Pipeline is given in #6424.

Modelled on #6372, each enhancement can be contributed as a separate PR. Note that default names for features are [x0, x1, ...].
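For illustration, usage modelled on #6372 would look roughly like this (a sketch; the exact output strings may vary across versions):

```python
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2).fit([[0, 1], [2, 3]])
# default names x0, x1, ... are used when no input names are given
print(poly.get_feature_names())            # ['1', 'x0', 'x1', 'x0^2', 'x0 x1', 'x1^2']
print(poly.get_feature_names(['a', 'b']))  # ['1', 'a', 'b', 'a^2', 'a b', 'b^2']
```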

@yenchenlin

Contributor

commented Feb 23, 2016

@jnothman May I try this?

@jnothman

Member Author

commented Feb 23, 2016

On which (family of) estimator?

@iamved


commented Feb 23, 2016

Hi @jnothman, I am interested in taking this issue. Could you please suggest how I can get started?

@nelson-liu

Contributor

commented Feb 23, 2016

I'll handle implementing this for FunctionTransformer for now, and we'll see if there are more classes to implement this in after I'm done :)

@yenchenlin

Contributor

commented Feb 23, 2016

@jnothman I'll modify the FeatureUnion.

including feature selectors, feature agglomeration, FunctionTransformer, and perhaps even PCA

Is feature agglomeration here referring to cluster.FeatureAgglomeration?

@nelson-liu

Contributor

commented Feb 23, 2016

@yenchenlin1994 I assume so?

@yenchenlin

Contributor

commented Feb 23, 2016

@nelson-liu Thx!

If so, I would also love to implement it for cluster.FeatureAgglomeration.

@jnothman

Member Author

commented Feb 23, 2016

I have added an extended list of transformers where this may apply and noted the default feature naming convention (though maybe its generation belongs in utils).

@yenchenlin

Contributor

commented Feb 23, 2016

Hello @jnothman ,

What should preprocessing.Normalizer do when input_features passed into get_feature_names is None?

PolynomialFeatures doesn't suffer from this since it sets both self.n_input_features_ and self.n_output_features_ during fit().

Maybe preprocessing.Normalizer should set self.n_input_features_ too during fit()?

@jnothman

Member Author

commented Feb 23, 2016

Fair question, which I don't currently have an answer for. One option is for it to just return feature_names even if that means returning None.

@yenchenlin

Contributor

commented Feb 23, 2016

Oh, and even if the input_features passed into get_feature_names of preprocessing.Normalizer is not None, I guess all it can do is return feature_names, which is the same as input_features in this case?

@jnothman

Member Author

commented Feb 23, 2016

Yes, trivial, as noted in the issue description.

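For a one-to-one transformer like Normalizer, the trivial implementation discussed above might be as simple as this (a sketch, assuming the get_feature_names signature from #6372):

```python
def get_feature_names(self, input_features=None):
    # Normalizer maps features one-to-one, so output names equal input names;
    # if input_features is None, this simply returns None, per the option above
    return input_features
```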

@yenchenlin

Contributor

commented Feb 23, 2016

Oh okay!
I will also do scalers, normalizers, imputers and Binarizer.
Will send a PR right away.

Thanks for your clarification.

@yenchenlin

Contributor

commented Feb 24, 2016

Hello @jnothman ,
about

feature selection and randomized L1

Do you mean all classes listed here:
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

It seems that all these classes may be put into a Pipeline and therefore need get_feature_names too.
Please correct me if I'm wrong. Thanks!

@jnothman

Member Author

commented Feb 24, 2016

Yes, I mean those.


@maniteja123

Contributor

commented Feb 24, 2016

Hi everyone, if it is fine I too would like to work on this issue. It would be helpful if the estimators that are currently being worked on could be mentioned, so that I can try something that does not overlap. Thanks!

@yenchenlin

Contributor

commented Feb 24, 2016

I think I can also work on

feature selection and randomized L1

@maniteja123 everything from PCA to the end of the issue description is not yet done

@maniteja123

Contributor

commented Feb 24, 2016

@yenchenlin1994, thanks for letting me know.

@maniteja123

Contributor

commented Feb 24, 2016

@jnothman It would be of great help if you could confirm whether the output for PCA needs to have shape n_components, where each element is the input feature with the maximum contribution. Should the case of multiple features having a high contribution along one component be handled? Thank you!

@jnothman

Member Author

commented Feb 24, 2016

I'm really not sure about PCA. Try to make something useful. If you think it will be helpful to users to have names for projection-style features, submit a PR. There is definitely a component of art to this.


@maniteja123

Contributor

commented Feb 24, 2016

Thanks for the reply. My doubt is mainly about choosing dominant features, and also that not all components are equally significant. Since multiple features can have almost the same contribution along a component, there might be a need for some threshold to decide how many input features to consider. Anyway, I will create an initial PR with just the most dominant feature along each component and continue the discussion there. Hope that is fine.
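A possible sketch of such naming (a hypothetical helper, not an agreed API; the top_k cutoff stands in for the threshold discussed above):

```python
import numpy as np

def pca_feature_names(pca, input_features=None, top_k=2):
    # name each component after its top-|loading| input features,
    # e.g. 'pca0 = 0.71*x1 + -0.70*x3'; pca must already be fitted
    n_features = pca.components_.shape[1]
    if input_features is None:
        input_features = ['x%d' % i for i in range(n_features)]
    names = []
    for i, component in enumerate(pca.components_):
        top = np.argsort(-np.abs(component))[:top_k]
        terms = ' + '.join('%.2f*%s' % (component[j], input_features[j])
                           for j in top)
        names.append('pca%d = %s' % (i, terms))
    return names
```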

@maniteja123

Contributor

commented Feb 24, 2016

@yenchenlin1994 one more question. I am not sure how to handle this for SparseRandomProjection and GaussianRandomProjection. Have you already worked on all of these?

feature selection and randomized L1
feature agglomeration
FeatureUnion

If you have already started working on them, I will be waiting for your PRs :) Thanks!

@jnothman

Member Author

commented Jan 18, 2017

I'd also be most interested in @kmike's opinion on what better feature name support for pipelines should look like, whether aiming for perfection or aiming for agility.

@jnothman

Member Author

commented Jan 18, 2017

@kmike's comment at https://github.com/scikit-learn/scikit-learn/pull/6431/files#r88185069 might suggest that in practice an interface like get_output_feature_names(indices, input_feature_names=None) saves computation and memory:

import numpy as np
from sklearn.pipeline import Pipeline

# hypothetical interface: names are computed only for the requested indices
importances = pipeline.steps[-1][1].feature_importances_
truncated_pipeline = Pipeline(pipeline.steps[:-1])
top_features = np.argsort(importances)[-10:]
print(dict(zip(truncated_pipeline.get_output_feature_names(top_features,
                                                           X.columns.values),
               importances[top_features])))  # for dataframe X

Then a FeatureUnion can (with enough information stored) query only the necessary constituent transformers.

But I'm sure this involves more scaffolding work than get_feature_names as proposed here does.

@kmike

Contributor

commented Jan 18, 2017

Hey,

Here are the bits of structured information we missed recently. Maybe they are a bit too specific, but anyway:

  1. For CountVectorizer, HashingVectorizer and TfidfVectorizer we needed (start, end) spans which map feature names back to the input text. It required quite a lot of copy-paste to implement: https://github.com/TeamHG-Memex/eli5/blob/master/eli5/sklearn/_span_analyzers.py, and then you need to pass this information through a side channel. This allows implementing highlighting like this: (screenshot of highlighted text omitted)

  2. For FeatureHasher and HashingVectorizer there can be several "sub-feature names" for a single dimension - when recovering possible features from a corpus there can be collisions. Each "sub-feature name" also has a sign. For each dimension we show such "sub-feature names" sorted by their frequency in the corpus; if the first "sub-feature name" is negative then the whole feature is treated as negative, and the signs of all other "sub-feature names" are inverted. Collisions are also truncated for display (with an option to expand) if there are too many of them.

  3. For FeatureUnion it'd be nice to map feature names back to transformers, i.e. to know which transformer is responsible for a feature; currently this is done with .startswith() (see the sketch after this list). Ideally, it'd be nice to have the whole chain, along with meta-information, available in a structured form.

  4. For pipelines it'd be nice to preserve meta-information about feature names, e.g. spans from (1) if CountVectorizer is followed by TfidfTransformer.
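A minimal sketch of the .startswith() hack from (3), relying on FeatureUnion prefixing output names with '<step>__' (the helper name is made up):

```python
def transformer_for_feature(union, feature_name):
    # FeatureUnion.get_feature_names prefixes each name with '<step>__',
    # so string matching is currently the only way back to the transformer
    for step_name, transformer in union.transformer_list:
        if feature_name.startswith(step_name + '__'):
            return transformer
    raise KeyError(feature_name)
```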

All of the above is inconvenient and hacky to implement with feature names as strings.

Also, as @jnothman said, performance can sometimes be a problem - e.g. if you're building feature names for HashingVectorizer then the dimension is huge, and most feature names (but not all) are empty or auto-generated. Another use case is a FeatureUnion of several transformers where some of them don't have get_feature_names defined; it's nice to still be able to inspect the feature names that are defined, and have some kind of auto-generated names for the undefined ones. It is not a super-big deal, but it can cause delays of several seconds in interactive usage; this could add up quickly if scikit-learn starts to use auto-generated feature names everywhere.

That said, I see the appeal of plain string feature names; they are much easier to understand and implement.

@jnothman

Member Author

commented Jan 18, 2017

Somehow this turned into an eli5 wishlist to scikit-learn?

For CountVectorizer, HashingVectorizer and TfIdfVectorizer we needed (start, end) spans

Hmm. Can we leave that out of the picture for now? Or do you just mean that this is part of a structured representation you would appreciate in eli5?

Ideally, it'd be nice to have the whole chain, along with meta-information, available in a structured form.

I'm not sure what exactly you mean by this. I proposed ages ago having an attribute on a FeatureUnion to describe which output features come from which constituent transformer. It's easy to get that information when fit_transform is called, but not when fit is called without transform; hence its design is trickier than it looks.

For pipelines it'd be nice to preserve meta-information about feature names, e.g. spans from (1) if CountVectorizer is followed by TfIdfTransformer.

So you mean a structured feature description? I can see the need for this, but designing it would be a big effort, and it would still be best if it remained marked "experimental".

I would rather have something that provides the user with information for basic cases, but could be expanded later.

@kmike

Contributor

commented Jan 18, 2017

Somehow this turned into an eli5 wishlist to scikit-learn?

Heh, right. I don't have good API feedback, so I just enumerated related problems we had.

Hmm. Can we leave that out of the picture for now? Or do you just mean that this is part of a structured representation you would appreciate in eli5?

Sure, this is just an example of a structured feature name representation. Actually it is no longer a 'feature name' - it's a structured object which describes where the feature came from, like a link to the transformer in the case of FeatureUnion. It could be a naming issue; structured representation could be irrelevant if we're talking only about feature names.

I would rather have something that provides the user with information for basic cases, but could be expanded later.

All of this can be expanded later by adding new methods which return information, e.g. in lists of the same length as get_feature_names(). So yeah, structured representation is not a blocker for get_feature_names improvements.

@jnothman

Member Author

commented Jan 25, 2017

For my work, I've created a module that monkey-patches scikit-learn transformers with a transform_feature_names method that can generate feature names for a pipeline/featureunion construction: https://gist.github.com/jnothman/bb1608e6ffea3109ff2ce7c926b0e0cb

@jnothman

Member Author

commented Jan 31, 2017

Never mind that; I should have used singledispatch there instead of monkey-patching, and I have updated it accordingly.

More importantly, I've submitted a patch to eli5 which should handle explaining feature importances in pipelines (TeamHG-Memex/eli5#158). I hope eli5, particularly with its prolific use of singledispatch, is able to adopt this with greater agility than scikit-learn.
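The singledispatch approach looks roughly like this (a condensed sketch in the spirit of the gist, not its exact contents):

```python
from functools import singledispatch

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

@singledispatch
def transform_feature_names(est, in_names=None):
    # fallback: defer to get_feature_names where the transformer defines it
    return est.get_feature_names(in_names)

@transform_feature_names.register(Normalizer)
def _one_to_one(est, in_names=None):
    return in_names  # one-to-one transformers pass names through unchanged

@transform_feature_names.register(Pipeline)
def _pipeline(est, in_names=None):
    # thread the names through each step of a fitted pipeline in order
    for _, step in est.steps:
        in_names = transform_feature_names(step, in_names)
    return in_names
```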

@jnothman

Member Author

commented Feb 1, 2017

A flaw with this design: I would like to build a transformer which selects (or excludes) features by name. It can be designed as a meta-transformer which gets the transformed feature names from the base transformer. However, doing this properly requires that the input feature names are known at fit time:

from sklearn.base import BaseEstimator, TransformerMixin, clone

class SelectByName(BaseEstimator, TransformerMixin):
    def __init__(self, transformer, names, exclude=False):
        self.transformer = transformer
        self.names = names
        self.exclude = exclude

    def fit(self, X, y=None, **kwargs):
        self.transformer_ = clone(self.transformer)
        self.transformer_.fit(X, y, **kwargs)
        # XXX: how do we get in_names here?
        feature_names = self.transformer_.transform_feature_names()
        self.support_mask_ = ...

For this application, it is necessary to pass the feature names alongside X.

@amueller

Member

commented Dec 12, 2017

I would like to build a transformer which selects (or excludes) features by name.

Can you get rid of that with a ColumnTransformer? I guess the question is a bit whether it's always possible to have the ColumnTransformer be right at the beginning of the pipeline, where we still know the names / positions of the columns.
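For reference, a minimal ColumnTransformer sketch of selecting columns by name at the head of a pipeline (the column names here are made up):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# select columns by name while the names are still known, at the head of
# the pipeline; anything not listed is dropped, i.e. excluded by name
ct = ColumnTransformer(
    [('num', StandardScaler(), ['age', 'fare']),
     ('cat', OneHotEncoder(), ['sex', 'embarked'])],
    remainder='drop')
```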

@amueller

Member

commented Dec 12, 2017

@kmike Are the structured annotations mostly needed because of the ranges?

Also, @GaelVaroquaux, any more opinions on this? I think doing the easy cases like feature selection and imputation (which might drop columns), in addition to having some support in FeatureUnion and Pipeline (and ColumnTransformer), would be very useful.

@kmike

Contributor

commented Dec 12, 2017

@amueller once feature names get more complex (e.g. PCA on top of tf*idf), showing them as text gets more and more opinionated, and maybe problem-specific. How concise should a feature name be, e.g. should we show only the top PCA components (how many?), or all of them? Note the amount of bikeshedding @jnothman got from me at TeamHG-Memex/eli5#208.

It seems the root of the problem is that formatting a feature name is not the same as figuring out where a feature comes from.

This is where a structured representation helps. Full information - all PCA components, or (start, end) ranges in the case of text vectorizers - can be excessive for a default feature name, but it allows richer display: highlighting features in text, or showing the rest of the components on mouse hover / click.

@amueller

Member

commented Dec 14, 2017

@kmike thanks for the explanation :) Maybe doing strings first would still work. For PCA I would basically just do pca1, pca2, etc. for now (aka punt)

@amueller

Member

commented Jun 2, 2018

@jnothman @GaelVaroquaux should we include this in the "Townhall" meeting?

@jnothman

Member Author

commented Jun 3, 2018

@GaelVaroquaux

Member

commented Jun 6, 2018

@kiros32


commented Mar 3, 2019

Any news?

@adrinjalali

Member

commented Mar 3, 2019

@kiros32 #13307
