get_feature_names support for pipelines #2007

Open

4 participants

@kmike
Contributor
kmike commented May 27, 2013

Hi,

This pull request adds a Pipeline.get_feature_names() method. This fixes FeatureUnion.get_feature_names() when one of the transformers is a Pipeline.

I tried to provide an example in the tests. I'm not entirely sure there isn't a better way to write the code in the example - please double-check it.

@jnothman
Member

Interesting... I am not sure I'm happy with this meaning of Pipeline.get_feature_names.
What does it mean, then, when I want to transform the output of the feature extractor:

my_pipeline = Pipeline([
    ('vectorize', DictVectorizer()),
    ('scale', StandardScaler()),
    ('select', SelectKBest()),
])

In this case (as opposed to if I tacked on a PCA), all the output features can be named, but those names cannot be retrieved from the last step alone. You could perhaps implement something crazy like:

def get_feature_names(self):
    """Assuming a single transformer in the `Pipeline` has `get_feature_names`,
    call it and transform its result through the remainder of the pipeline."""
    names = None
    # self.steps is the Pipeline's list of (name, estimator) pairs
    for name, step in self.steps:
        if hasattr(step, 'get_feature_names'):
            if names is not None:
                raise ValueError('Multiple steps with get_feature_names')
            names = step.get_feature_names()
        elif names is not None:
            # push the names through each later step as if they were data
            names = step.transform(names)
    if names is None:
        raise ValueError('No step with get_feature_names')
    return names

get_feature_names seems to be a fairly informal part of the scikit-learn API. It seems specific, but not universal, to feature extractors. Until it is clearer, it may be better just to privately extend Pipeline (i.e. write class ExtractorPipeline) for your purposes, or else just get the feature names on an ad-hoc basis. (I think it's more important to be able to find out which features in a union come from where as in #1952, than what their names are.)

@kmike
Contributor
kmike commented Sep 19, 2013

Just faced this issue again. I still think that feature names could be very helpful. For example, if LinearSVC or LogisticRegression is used for text categorization, it is convenient to look at the features whose coefficients have the largest absolute values to see what the classifier learned and why it is making errors - feature names that come from CountVectorizer/DictVectorizer/TfidfVectorizer can give this insight.
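A minimal sketch of that inspection step, using plain Python lists in place of a fitted model. The names and weights are made up for illustration; with a real model they would presumably come from the vectorizer's get_feature_names() and one row of the classifier's coef_. The helper top_features is hypothetical and not part of scikit-learn.

```python
def top_features(feature_names, coefficients, k=3):
    """Return the k (name, weight) pairs with the largest absolute weight."""
    if len(feature_names) != len(coefficients):
        raise ValueError("names and coefficients must have the same length")
    ranked = sorted(
        zip(feature_names, coefficients),
        key=lambda pair: abs(pair[1]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical values standing in for vectorizer.get_feature_names()
# and one row of classifier.coef_.
names = ["cheap", "meeting", "free", "report", "winner"]
weights = [1.7, -0.9, 2.1, -0.2, 0.4]

top = top_features(names, weights, k=3)
for name, weight in top:
    print(f"{name:10s} {weight:+.2f}")
```

The length check matters here: if the names and the coefficients come from different points in a pipeline (for example, before and after a selection step), pairing them silently would give misleading results.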

Your get_feature_names trick is very smart :) But it could fail if some step can't transform data in the format feature_names uses. What about adding a 'previous_feature_names' argument to the get_feature_names() functions, and implementing get_feature_names for SelectKBest, SelectPercentile, etc.?
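One way to picture this proposal is the sketch below. ToySelector is a hypothetical stand-in for a selector such as SelectKBest, not scikit-learn code: its mask is passed in directly rather than learned in fit(), and the get_feature_names(previous_feature_names) signature is the one suggested in this comment, not an existing API.

```python
class ToySelector:
    """Stand-in for a feature selector; the mask is given, not learned."""

    def __init__(self, support_mask):
        self.support_mask = list(support_mask)

    def transform(self, rows):
        # Keep only the selected columns of each row.
        return [
            [value for value, keep in zip(row, self.support_mask) if keep]
            for row in rows
        ]

    def get_feature_names(self, previous_feature_names):
        # Hypothetical signature: the caller supplies the names of the
        # selector's input columns, and the selector filters them.
        if len(previous_feature_names) != len(self.support_mask):
            raise ValueError("expected one name per input column")
        return [
            name
            for name, keep in zip(previous_feature_names, self.support_mask)
            if keep
        ]


selector = ToySelector([True, False, True])
selected_rows = selector.transform([[1, 2, 3], [4, 5, 6]])
selected_names = selector.get_feature_names(["age", "height", "weight"])
print(selected_rows)
print(selected_names)
```

Because the names are handled by a separate method, nothing extra is computed or stored during transform() when the caller does not ask for names.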

@ogrisel
Member
ogrisel commented Sep 19, 2013

I agree that having to trace feature provenance manually can be a real pain in practice.

Maybe we could think about using record arrays as an alternative to regular numpy arrays in some cases and subclasses of scipy.sparse matrices that would have string metadata to store the column names.

Feature selectors and other transformers that preserve the feature meaning (like scalers) could take care of outputting transformed data structures that preserve this information when it is available on the input.
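A rough sketch of that idea in plain Python, to make the behaviour concrete. NamedMatrix, scale_columns and select_columns are hypothetical names invented for this illustration; they stand in for an array type with optional column-name metadata and for transformers that propagate it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NamedMatrix:
    """Hypothetical container: rows of floats plus optional column names."""
    rows: List[List[float]]
    names: Optional[List[str]] = None


def scale_columns(data, factors):
    """Scaler-like step: values change, column meaning and names do not."""
    rows = [[v * f for v, f in zip(row, factors)] for row in data.rows]
    return NamedMatrix(rows, data.names)


def select_columns(data, indices):
    """Selector-like step: subset the columns and, if present, the names."""
    rows = [[row[i] for i in indices] for row in data.rows]
    names = None if data.names is None else [data.names[i] for i in indices]
    return NamedMatrix(rows, names)


# With names on the input, they survive both steps.
named = NamedMatrix([[1.0, 2.0, 3.0]], names=["a", "b", "c"])
result = select_columns(scale_columns(named, [2.0, 2.0, 2.0]), [0, 2])

# Without names on the input, the steps still work and names stay None.
unnamed = NamedMatrix([[1.0, 2.0, 3.0]])
plain_result = select_columns(unnamed, [0, 2])

print(result)
print(plain_result)
```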

@kmike
Contributor
kmike commented Sep 19, 2013

I didn't have a chance to use record arrays yet, but won't using them incur overhead even if feature names are not interesting to the caller? Passing previous feature_names to get_feature_names functions doesn't have this problem.

@ogrisel
Member
ogrisel commented Sep 19, 2013

@kmike wrote:

I didn't have a chance to use record arrays yet, but won't using them incur overhead even if feature names are not interesting to the caller? Passing previous feature_names to get_feature_names functions doesn't have this problem.

I have not tried myself either. My plan is not to make the use of record arrays mandatory but just to make sure that the info is preserved when available.

I think we need to experiment with various options to better understand the practical tradeoffs.

@jnothman
Member

Having played a bit with recarrays, I think their support across sklearn as a data format will be an enormous change. For example, numpy does not (and cannot) treat the fields as an axis, so you can't perform vectorised operations without changing the view dtype:

>>> a = np.array([(0,1)], dtype=[('a', 'f'), ('b', 'f')])
>>> a.mean()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot perform reduce with flexible type
>>> a.view(dtype='f').mean()
0.5

Yes, it would be very nice to have a way to pass around named columns, but without a wrapper like pandas provides, support will not come easily.

@kmike wrote:

Just faced this issue again. I still think that feature names could be very helpful. For example, if LinearSVC or LogisticRegression is used for text categorization, it is convenient to look at the features whose coefficients have the largest absolute values to see what the classifier learned and why it is making errors - feature names that come from CountVectorizer/DictVectorizer/TfidfVectorizer can give this insight.
Your get_feature_names trick is very smart :) But it could fail if some step can't transform data in the format feature_names uses. What about adding a 'previous_feature_names' argument to the get_feature_names() functions, and implementing get_feature_names for SelectKBest, SelectPercentile, etc.?

Note that if the LinearSVC or LogisticRegression at the end of the pipeline uses l1 penalty, my get_feature_names implementation will return the names of features with non-zero coefficients, due to the _LearntSelectorMixin. It's a little annoying that you are not assured their correspondence with the actual coefficients; for that you really want to get the names of the features that are input to the last step, not output.

@ogrisel
Member
ogrisel commented Sep 22, 2013

Indeed, so recarrays are no solution either... I wish numpy arrays were not a builtin class and allowed plugging in arbitrary custom metadata that we could then update manually when suitable in sklearn...

@jnothman
Member

Perhaps the solution here is that get_feature_names() should be understood as "transform feature names" and should take an array/list of feature names as input. Such an API is still lacking relative to what @ogrisel wishes in this discussion; for example, a "select features by name" transformer still needs to be a meta-estimator. But at least the semantics of Pipeline.get_feature_names() would be straightforward.
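Under that reading, a pipeline's get_feature_names() reduces to threading the names through each step in order. The sketch below uses hypothetical toy classes (ToyVectorizer, ToyScaler, ToySelector, ToyPipeline) rather than scikit-learn estimators, and the optional input_names parameter is an assumption about how such a signature might look.

```python
class ToyVectorizer:
    """Extractor-like step: creates names and ignores any input names."""

    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def get_feature_names(self, input_names=None):
        return list(self.vocabulary)


class ToyScaler:
    """Scaler-like step: one output per input, so names pass through."""

    def get_feature_names(self, input_names=None):
        if input_names is None:
            raise ValueError("a scaler needs the names of its input features")
        return list(input_names)


class ToySelector:
    """Selector-like step: keeps the names where the mask is True."""

    def __init__(self, support_mask):
        self.support_mask = list(support_mask)

    def get_feature_names(self, input_names=None):
        if input_names is None:
            raise ValueError("a selector needs the names of its input features")
        return [n for n, keep in zip(input_names, self.support_mask) if keep]


class ToyPipeline:
    """Holds (name, step) pairs and chains their name transformations."""

    def __init__(self, steps):
        self.steps = steps

    def get_feature_names(self, input_names=None):
        names = input_names
        for _, step in self.steps:
            names = step.get_feature_names(names)
        return names


pipeline = ToyPipeline([
    ("vectorize", ToyVectorizer(["cat", "dog", "fish"])),
    ("scale", ToyScaler()),
    ("select", ToySelector([True, False, True])),
])
pipeline_names = pipeline.get_feature_names()
print(pipeline_names)
```

Unlike the earlier trick of calling transform() on the names, no step has to accept a list of strings as data; each step only describes how it maps input names to output names.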

@jnothman jnothman changed the title from Better support for feature transformation pipelines to get_feature_names support for feature transformation pipelines Nov 15, 2014
@jnothman jnothman changed the title from get_feature_names support for feature transformation pipelines to get_feature_names support for pipelines Nov 15, 2014
@jnothman jnothman referenced this pull request Jun 6, 2015
Open

[MRG] Add feature_extraction.ColumnTransformer #3886

@vene
Member
vene commented Jun 6, 2015

But taking "input feature names" as a parameter will be awkward for the user who just has e.g. a vectorizer. (The call would be vect.get_feature_names(["foo"]).)

Maybe it should be a new API point transform_feature_names and estimators for which it makes sense (feature extractors, even PCA: estimators that create features) could implement a feature_names_ property.
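A possible shape for this split, sketched with hypothetical toy classes. The names transform_feature_names and feature_names_ come from the suggestion above; how a caller would combine them (the output_feature_names helper below) is an assumption made for illustration only.

```python
class ToyVectorizer:
    """Creates features, so it exposes their names as a property."""

    def __init__(self, vocabulary):
        self._vocabulary = list(vocabulary)

    @property
    def feature_names_(self):
        return list(self._vocabulary)


class ToySelector:
    """Does not create features; it only maps input names to output names."""

    def __init__(self, support_mask):
        self.support_mask = list(support_mask)

    def transform_feature_names(self, input_names):
        return [n for n, keep in zip(input_names, self.support_mask) if keep]


def output_feature_names(steps):
    """Hypothetical helper: walk the steps and track the current names."""
    names = None
    for step in steps:
        if hasattr(step, "feature_names_"):
            # A feature-creating step resets the names.
            names = step.feature_names_
        elif hasattr(step, "transform_feature_names"):
            if names is None:
                raise ValueError("no earlier step provides feature names")
            names = step.transform_feature_names(names)
        else:
            raise ValueError("step cannot describe its output features")
    return names


vect = ToyVectorizer(["foo", "bar", "baz"])
final_names = output_feature_names([vect, ToySelector([False, True, True])])
print(vect.feature_names_)
print(final_names)
```

With this split, the user who only has a vectorizer reads vect.feature_names_ and never has to pass a dummy list of input names.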
