Pipeline doesn't work with Label Encoder #3956

Closed
johnny555 opened this issue Dec 11, 2014 · 24 comments
@johnny555 commented Dec 11, 2014

I've found that I cannot use pipelines if I want to use the LabelEncoder. In the following, I try to build a pipeline that first encodes the labels and then constructs a one-hot encoding from that labelling.

from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.pipeline import make_pipeline
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

enc = LabelEncoder()
hot = OneHotEncoder()

pipe = make_pipeline(enc, hot)
pipe.fit_transform(X)  # raises the TypeError shown below

However, the following error is returned:

lib/python2.7/site-packages/sklearn/pipeline.pyc in _pre_transform(self, X, y, **fit_params)
    117         for name, transform in self.steps[:-1]:
    118             if hasattr(transform, "fit_transform"):
--> 119                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    120             else:
    121                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

TypeError: fit_transform() takes exactly 2 arguments (3 given)

It seems that the problem is that the fit method of LabelEncoder only takes a y argument, whereas the pipeline assumes every transformer's fit will take an X and an optional y.
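
A minimal sketch of the mismatch, reusing the X from above (the second call mimics how the pipeline invokes its transformers):

from sklearn.preprocessing import LabelEncoder
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])

enc = LabelEncoder()
enc.fit_transform(X)        # fine: LabelEncoder.fit_transform(y) takes only the labels
enc.fit_transform(X, None)  # TypeError: the pipeline passes y as a second positional argument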

@jnothman (Member) commented Dec 11, 2014

Do you mean to use LabelBinarizer instead? This has been a subject of complaint previously, and we should probably document this more clearly. The label "transformers" should probably not have the same interface as data transformers.
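
For labels (rather than features), a minimal sketch of the LabelBinarizer route, reusing the array from the original example as y:

from sklearn.preprocessing import LabelBinarizer
import numpy as np

y = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])
LabelBinarizer().fit_transform(y)
# array([[1, 0, 0],
#        [0, 0, 1],
#        [0, 1, 0],
#        [1, 0, 0],
#        [0, 1, 0],
#        [0, 0, 1]])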

@amueller amueller added the API label Jan 9, 2015

@amueller (Member) commented Jan 9, 2015

It is a bit odd that we don't have these for data, though. We have DictVectorizer and OneHotEncoder, but I don't see a way to perform the operation here without creating a dict first.
In particular, the docs at http://scikit-learn.org/dev/modules/preprocessing.html#encoding-categorical-features leave out exactly the step that is breaking here, which is not great.
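
A minimal sketch of the dict detour being described (the 'animal' key is just an illustrative name):

from sklearn.feature_extraction import DictVectorizer
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])
records = [{'animal': value} for value in X]  # wrap each value in a dict first
DictVectorizer(sparse=False).fit_transform(records)
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])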

@amueller amueller added the Enhancement label Jan 9, 2015

@mkneierV commented Jul 29, 2015

Is there a good reason why OneHotEncoder can't just work with string values**? In my experience, one-hot encoding a string-valued feature is an incredibly common task, and it seems tedious to have to chain two preprocessing transformers to accomplish it. I'd be happy to contribute code for it.

** By work with string values I mean:

string_feature = np.array(['cat', 'dog', 'cat', 'ferret'])

would be transformed to:

[[1, 0, 0],
 [0, 1, 0],
 [1, 0, 0],
 [0, 0, 1]]
@jnothman (Member) commented Jul 29, 2015

@mkneierV, that particular case happens to work fine with CountVectorizer. See also #4920, but as you note, the description of "work with strings" is underspecified: in general, I think we must require 2d input for OneHotEncoder. So it would encode [['cat'], ['dog'], ['cat'], ['ferret']] as above, and [['cat', 'cat'], ['dog', 'dog'], ['cat', 'cat'], ['ferret', 'ferret']] as the same thing duplicated horizontally.
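
For illustration, with the string-aware OneHotEncoder that later materialized (scikit-learn >= 0.20), the 2d behaviour described here looks roughly like this sketch: each column is encoded independently and the blocks are stacked horizontally.

from sklearn.preprocessing import OneHotEncoder

X = [['cat', 'cat'], ['dog', 'dog'], ['cat', 'cat'], ['ferret', 'ferret']]
OneHotEncoder(sparse=False).fit_transform(X)  # sparse=False; newer releases use sparse_output=False
# array([[1., 0., 0., 1., 0., 0.],
#        [0., 1., 0., 0., 1., 0.],
#        [1., 0., 0., 1., 0., 0.],
#        [0., 0., 1., 0., 0., 1.]])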

@amueller (Member) commented Jul 29, 2015

+1. Currently this can only be done by using the (not yet merged) FunctionTransformer and a FeatureUnion :-/

This should really work for arbitrary categorical matrices (and mixed type data).

@dukebody commented Aug 9, 2015

This is biting me too. I constantly have to one-hot-encode string features. Since these features usually come from a Pandas dataframe, I don't see why we should force users to transform the columns into a dictionary just to be able to use the DictVectorizer.

I guess one should never mix label transformers (which usually expect a single 1-dimensional argument) with data transformers, but I think people (including me) do this because there are no equivalent data transformers for certain label ones, like LabelEncoder or LabelBinarizer.
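
A minimal sketch of the detour being described, with illustrative column names (DataFrame to list of dicts to DictVectorizer):

import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'animal': ['cat', 'dog', 'cow'], 'weight': [4.0, 12.0, 600.0]})
records = df.to_dict(orient='records')  # the extra conversion step being complained about
DictVectorizer(sparse=False).fit_transform(records)
# array([[  1.,   0.,   0.,   4.],
#        [  0.,   0.,   1.,  12.],
#        [  0.,   1.,   0., 600.]])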

@amueller (Member) commented Aug 10, 2015

I agree. The solution is to improve OneHotEncoder, I think. See #4920. There was also a related issue that I can't find any more (we should be able to encode multiple string-valued columns). Maybe @jnothman remembers?

@hshteingart commented Jul 31, 2016

I found a way to work around the problem by using the CountVectorizer, which can turn those strings into a binary representation directly:

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog'])  # same X as in the original example
CountVectorizer().fit_transform(X).todense()
Out[]: 
matrix([[1, 0, 0],
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1]], dtype=int64)
@kyteague commented Sep 11, 2017

+1 for this functionality. I need to convert tokenized words into integer representations to pass into an embedding layer in a NN. I do not want one-hot encoding. It does feel like this transformer might live in the feature_extraction.text module, though.

This is pretty trivial code to write, but it would be nice not to have to reimplement it every time.
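
A minimal sketch of the kind of "trivial code" meant here (a plain vocabulary lookup, not a scikit-learn API):

import numpy as np

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}  # stable token -> id mapping
ids = np.array([vocab[tok] for tok in tokens])
# vocab: {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
# ids:   array([4, 0, 3, 2, 4, 1])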

@jnothman (Member) commented Sep 11, 2017 (comment minimized)

@kyteague commented Sep 11, 2017

The data is sequential (it's text), but you can think of it as a single categorical feature per time step. The integers get fed into an embedding layer that acts as a latent feature space.

It might make sense to have a separate SequenceVectorizer in the feature_extraction.text module. There are a couple other common options (padding and reversing) as well.

@jnothman (Member) commented Sep 11, 2017 (comment minimized)

@jorisvandenbossche (Member) commented Nov 24, 2017

This can be closed now since the CategoricalEncoder PR is merged (#9151).
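
With that work in place (it shipped in scikit-learn 0.20 as the string-aware OneHotEncoder and OrdinalEncoder), the original example no longer needs a LabelEncoder step at all; a minimal sketch:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

X = np.array(['cat', 'dog', 'cow', 'cat', 'cow', 'dog']).reshape(-1, 1)  # 2d column of strings
OneHotEncoder(sparse=False).fit_transform(X)  # sparse=False; sparse_output=False in newer releases
# array([[1., 0., 0.],
#        [0., 0., 1.],
#        [0., 1., 0.],
#        [1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])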

@qinhanmin2014 (Member) commented Nov 24, 2017

Resolved in #9151.

@ryanpeach commented Nov 8, 2018

Hey, so I was wondering: what if we made Pipeline allow y transformations? It would look like this:

Pipeline(features=[Xtransform1],
         labels=[Ytransform1],
         model=clf)

On fit it would do:

clf.fit(
    Xtransform1.fit_transform(X), 
    Ytransform1.fit_transform(y)
)

On predict it would do:

Ytransform1.inverse_transform(
    clf.predict(
        Xtransform1.transform(X)
    )
)

The one big positive I see to this is that if you save the model, it could be given natural input and return natural output, without having to go back to the documentation. I just encountered this at work, which is why I'm asking.

Would sklearn have any interest in this kind of object?
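
A minimal sketch of the wrapper being proposed (XYPipeline is a hypothetical name, not an existing scikit-learn class):

from sklearn.base import BaseEstimator

class XYPipeline(BaseEstimator):
    def __init__(self, x_transform, y_transform, model):
        self.x_transform = x_transform
        self.y_transform = y_transform
        self.model = model

    def fit(self, X, y):
        Xt = self.x_transform.fit_transform(X)
        yt = self.y_transform.fit_transform(y)
        self.model.fit(Xt, yt)
        return self

    def predict(self, X):
        # map predictions back to the original label space
        yt_pred = self.model.predict(self.x_transform.transform(X))
        return self.y_transform.inverse_transform(yt_pred)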

@jnothman (Member) commented Nov 8, 2018 (comment minimized)

@ryanpeach commented Nov 8, 2018

Just curious, does that work for classifiers?

It does look like my target use case.

@jnothman (Member) commented Nov 8, 2018 (comment minimized)

@ryanpeach commented Nov 9, 2018

So we used a LabelEncoder for transforming our string labels into numbers. Then we saved the model after training to S3. We did this for, get this, over 10,000 models, all as pickle files. What we forgot was that LabelEncoders don't encode labels the same way every time. So now those models are useless.
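
A minimal sketch of the pitfall: the integer assigned to a label depends on which labels were present at fit time, so the mapping is meaningless without the fitted encoder:

from sklearn.preprocessing import LabelEncoder

LabelEncoder().fit(['ham', 'spam']).transform(['spam'])          # array([1])
LabelEncoder().fit(['eggs', 'ham', 'spam']).transform(['spam'])  # array([2])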

We would like the ability to save models without having to save multiple objects (like the LabelEncoder and the model itself), and without having to refer to documentation to see how it works. We'd like it to just work.

Not saying we can't just write our own object for this, but I was just asking if sklearn would like such a thing too. No promises, though.

I'll also point out that proper label encoding is an important part of cross-validation for classifiers as well as regressors. What if a class you've never seen before shows up in your test data? You need to be informed that you should have stratified your splitting. Or, in the real world, you want an error to be raised.
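
For reference, a fitted LabelEncoder does raise on labels it has never seen; a minimal sketch:

from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder().fit(['cat', 'dog'])
enc.transform(['cat', 'ferret'])  # raises ValueError: y contains previously unseen labels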

@jnothman (Member) commented Nov 10, 2018 (comment minimized)

@ryanpeach commented Nov 11, 2018

That's good to know! I had no idea!

Still might be useful for more complex label processing.

My problems have been resolved, just hoping this sparks discussion in future versions.

@amueller (Member) commented Nov 12, 2018

@ryanpeach we have been discussing transforming y for years, but haven't come up with enough usage examples that would require it to draft a good API, imho.

@ryanpeach commented Nov 14, 2018

I feel like a side project might be suitable. I've long wanted a way to represent an sklearn model with full I/O transformation in something as general as a complex DAG, lol.

As long as label encoding is built into classifiers, you are right that the use case is limited.

@amueller (Member) commented Nov 14, 2018

https://mcasl.github.io/PipeGraph/
https://github.com/Microsoft/NimbusML

See #4143 for the general issue on transforming y in pipelines.
