LabelBinarizer regression between 0.14.1 and 0.15.0 #3462

Closed
ogrisel opened this Issue · 15 comments

5 participants

@ogrisel
scikit-learn member

In 0.14.1 we have the following behavior:

>>> lb = LabelBinarizer()
>>> lb.fit_transform(['a', 'b', 'c'])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> lb.transform(['a', 'd', 'e'])
array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In 0.15.0 the call to transform with unseen labels raises a ValueError. If we want to change to a new behavior, we should at least raise a deprecation warning and keep the old behavior by default, while making the new behavior available behind a flag.
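
For reference, a minimal repro of the 0.15.0 behavior described above (the exact error message is elided here):

>>> lb = LabelBinarizer().fit(['a', 'b', 'c'])
>>> lb.transform(['a', 'd', 'e'])
Traceback (most recent call last):
    ...
ValueError: ...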

@ogrisel ogrisel added this to the 0.15.1 milestone
@ogrisel ogrisel added the Bug label
@ogrisel
scikit-learn member

This is related to the discussion on PR #3243 .

@cjauvin

Just to summarize what I have already suggested on the mailing list, I see three options for dealing with unseen labels:

  1. Map them to the all-zero vector (like before version 0.15)

  2. Raise an error (like current version)

  3. Map them to an extra column; this would be the most complicated option, since it involves provisioning an extra column at creation (which by definition could only be non-zero in results returned by transform)
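
A rough sketch of what option 3 could look like (binarize_with_unknown is just an illustrative name, not an existing scikit-learn API):

import numpy as np

def binarize_with_unknown(values, classes):
    # One column per known class, plus a trailing "<unknown>" column
    # that only values absent from `classes` can activate.
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(values), len(classes) + 1), dtype=int)
    for row, value in enumerate(values):
        out[row, index.get(value, len(classes))] = 1
    return out

>>> binarize_with_unknown(['a', 'd', 'e'], ['a', 'b', 'c'])
array([[1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 0, 1]])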

@arjoly
scikit-learn member

+1 for being backward compatible!

@ogrisel
scikit-learn member
  3. Map them to an extra column; this would be the most complicated option, since it involves provisioning an extra column at creation (which by definition could only be non-zero in results returned by transform)

@cjauvin do you have a use-case for this option?

@jnothman
scikit-learn member

There are some benefits to the previous behaviour. For example, if I want to binarize a multiclass problem with labels ['majority', 'a', 'b'] in order to ignore the "majority" class, all I need to do is use:

label_binarize(labels, ['a', 'b'])
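
For illustration, adding a third retained class 'c' (with only two classes, label_binarize collapses its output to a single column), the ignored "majority" rows simply come out all-zero under the pre-0.15 behaviour:

>>> label_binarize(['majority', 'a', 'b'], ['a', 'b', 'c'])
array([[0, 0, 0],
       [1, 0, 0],
       [0, 1, 0]])
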
@cjauvin

@cjauvin do you have a use-case for this option?

@ogrisel To be honest, the very fact that you ask makes me suspect that I might not be using LabelBinarizer in a "proper" way, because I've actually had quite a regular need for this. For instance, if a dataset is relatively small, a categorical variable that I want to one-hot encode may have values in the test part of a particular random train/test split that were never seen in the train part (assuming you are doing cross-validation in a "legal" way, i.e. rigorously applying the same preprocessing/scaling/encoding to each split). In such a case I simply map all the unseen values to a single extra class, interpreted as "<unknown>". How do you normally deal with such problems? Is there something I overlooked?
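
A minimal sketch of that mapping (collapse_unseen is just an illustrative name, not an existing scikit-learn function):

def collapse_unseen(values, seen, unknown='<unknown>'):
    # Replace any value not seen at fit time with a single sentinel class.
    seen = set(seen)
    return [v if v in seen else unknown for v in values]

>>> collapse_unseen(['a', 'd', 'e'], seen=['a', 'b', 'c'])
['a', '<unknown>', '<unknown>']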

Also, as I'm writing this, I realize that I don't really understand the difference between LabelBinarizer and OneHotEncoder: when do you use one or the other? Perhaps my confusion is due to that, in fact?

@jnothman
scikit-learn member
@cjauvin

@jnothman Thanks, I was not aware of this distinction between data and label, and somehow always assumed that "label" meant "categorical", rather than strictly "predicted class".

But then it seems that there are many ways to deal with categorical variables. I can do it with a DictVectorizer:

>>> dv = DictVectorizer(sparse=False)
>>> dv.fit([{'k': v} for v in ['a', 'b', 'c']])

and then with unseen values, we have this mapping to the all-zero vector behavior that we've been talking about in this thread:

>>> dv.transform([{'k': v} for v in ['a', 'd', 'e']])
array([[ 1.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

but this "dict-ification", required for a single variable, seems a bit clumsy: is there another way? OneHotEncoder only deals with integers I think.. But there's also pandas.get_dummies:

>>> pandas.get_dummies(['a', 'b', 'c'])
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1

and then there's also LabelBinarizer, which can do the job, but is not meant for it, as you pointed out.

So what's the best way to deal with nominal variables, and for that given method, what is the best way to deal with unseen values?
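
For reference, the integer-only route would be LabelEncoder followed by OneHotEncoder, but it does not help with unseen values either, since LabelEncoder.transform also raises a ValueError on labels it has not seen (a sketch, not a recommendation; error message elided):

>>> from sklearn.preprocessing import LabelEncoder, OneHotEncoder
>>> le = LabelEncoder().fit(['a', 'b', 'c'])
>>> enc = OneHotEncoder(sparse=False).fit(le.transform(['a', 'b', 'c']).reshape(-1, 1))
>>> enc.transform(le.transform(['a', 'b', 'c']).reshape(-1, 1))
array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
>>> le.transform(['a', 'd'])
Traceback (most recent call last):
    ...
ValueError: ...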

@jnothman
scikit-learn member
@hamsal

I will be glad to implement the fix for this issue to properly maintain backwards compatibility. I did not notice this when I made the changes.

@arjoly
scikit-learn member

Thanks @hamsal!

@ogrisel
scikit-learn member

Thanks @hamsal, please let me know ASAP when it's ready. I would like to release 0.15.1 as soon as possible.

@hamsal

I will do my best to complete it early tomorrow.

@hamsal

You can find my work in the pull request above; all that is left to finalize is to fix any Travis issues that may come up.

@ogrisel
scikit-learn member

Fix in #3486

@ogrisel ogrisel closed this