LabelBinarizer regression between 0.14.1 and 0.15.0 #3462

Closed
ogrisel opened this Issue Jul 21, 2014 · 15 comments

@ogrisel
Member

ogrisel commented Jul 21, 2014

In 0.14.1 we have the following behavior:

>>> lb = LabelBinarizer()
>>> lb.fit_transform(['a', 'b', 'c'])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> lb.transform(['a', 'd', 'e'])
array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

In 0.15.0 the call to transform with unseen labels raises a ValueError. If we want to change to a new behavior, we should at least raise a deprecation warning and keep the old behavior by default, while implementing the new behavior behind a flag.

@ogrisel ogrisel added this to the 0.15.1 milestone Jul 21, 2014

@ogrisel ogrisel added the Bug label Jul 21, 2014

@ogrisel
Member

ogrisel commented Jul 21, 2014

This is related to the discussion on PR #3243.

@cjauvin
Contributor

cjauvin commented Jul 21, 2014

Just to summarize what I have already suggested on the mailing list about this, I see three options for dealing with unseen labels:

  1. Map them to the all-zero vector (the behavior before version 0.15)
  2. Raise an error (the current behavior)
  3. Map them to an extra column; this would be the most complicated option, since it involves provisioning an extra column at creation (which by definition could only be non-zero in results returned by transform)
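Option 3 could look roughly like the following sketch. This is plain NumPy, not scikit-learn API; `binarize_with_unknown` and its `classes` argument are hypothetical names standing in for a fitted LabelBinarizer's classes_:

```python
import numpy as np

def binarize_with_unknown(labels, classes):
    """One-hot encode labels, sending unseen values to an extra last column."""
    index = {c: i for i, c in enumerate(classes)}
    out = np.zeros((len(labels), len(classes) + 1), dtype=int)
    for row, label in enumerate(labels):
        # Any label absent from `classes` lands in the extra "<unknown>" column.
        out[row, index.get(label, len(classes))] = 1
    return out

# columns: a, b, c, <unknown>
print(binarize_with_unknown(['a', 'd', 'e'], ['a', 'b', 'c']))
```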
@arjoly
Member

arjoly commented Jul 21, 2014

+1 for being backward compatible!

@ogrisel
Member

ogrisel commented Jul 21, 2014

  3. Map them to an extra column; this would be the most complicated option, since it involves provisioning an extra column at creation (which by definition could only be non-zero in results returned by transform)

@cjauvin do you have a use-case for this option?

@jnothman
Member

jnothman commented Jul 22, 2014

There are some benefits to the previous behaviour. For example, if I want to binarize a multiclass problem with labels ['majority', 'a', 'b'] in order to ignore the "majority" class, all I need to do is use:

label_binarize(labels, ['a', 'b'])
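For reference, here is a runnable variant of this trick with three retained classes (the two-class case collapses to a single column, which obscures the point), using the keyword form `classes=` that later scikit-learn versions require. Labels outside the supplied classes get all-zero rows, so the majority class simply disappears from the binarized problem:

```python
from sklearn.preprocessing import label_binarize

labels = ['majority', 'a', 'b', 'c', 'majority']
# 'majority' is not listed in classes, so it maps to an all-zero row.
binarized = label_binarize(labels, classes=['a', 'b', 'c'])
print(binarized)
```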
@cjauvin
Contributor

cjauvin commented Jul 22, 2014

  @cjauvin do you have a use-case for this option?

@ogrisel To be honest, the very fact that you ask makes me suspect that I might not be using the LabelBinarizer in a "proper" way, because I've had quite a regular need for it. For instance, if a dataset is relatively small, a categorical variable (that I want to one-hot encode) may have values in the test part of a particular random train/test split that were not seen in the train part (assuming that you are doing cross-validation in a "legal" way, i.e. rigorously applying the same preprocessing/scaling/encoding to each split). In such a case I simply map all the unseen values to a single extra class, interpreted as "<unknown>". How do you normally deal with such problems? Is there something I overlooked?

Also, as I'm writing this, I realize that I don't really understand the difference between LabelBinarizer and OneHotEncoder: when do you use one or the other? Perhaps my confusion is due to that.

@jnothman
Member

jnothman commented Jul 22, 2014

LabelBinarizer is for the target variable (i.e. the class in classification), not the predictors. The differences can be subtle, but these are indeed for two different purposes, and their appearing in separate label and data modules is meant to accentuate this. The LabelBinarizer is used internally to reformulate a multiclass problem as a series of binary problems; it may be useful to a user who wants more control of the exact transformation (e.g. excluding a majority class from the transformed problem), but otherwise I don't think it's used a lot by end-users.

@cjauvin
Contributor

cjauvin commented Jul 22, 2014

@jnothman Thanks, I was not aware of this distinction between data and label; I somehow always assumed that "label" meant "categorical", rather than strictly "predicted class".

But then it seems that there are many ways to deal with categorical variables. I can do it with a DictVectorizer:

>>> dv = DictVectorizer(sparse=False)
>>> dv.fit([{'k': v} for v in ['a', 'b', 'c']])

and then with unseen values, we get the mapping to the all-zero vector that we've been talking about in this thread:

>>> dv.transform([{'k': v} for v in ['a', 'd', 'e']])
array([[ 1.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

but this "dict-ification", required for a single variable, seems a bit clumsy: is there another way? OneHotEncoder only deals with integers, I think. But there's also pandas.get_dummies:

>>> pandas.get_dummies(['a', 'b', 'c'])
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1

and then there's also LabelBinarizer, which can do the job, but is not meant for it, as you pointed out.

So what's the best way to deal with nominal variables, and for that given method, what is the best way to deal with unseen values?
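As a side note on the pandas.get_dummies route: one way to get the all-zero behavior for unseen test values is to align the test dummies with the training columns via reindex. This is a sketch of a common workaround, not something proposed in this thread:

```python
import pandas as pd

train = pd.get_dummies(pd.Series(['a', 'b', 'c']))
test = pd.get_dummies(pd.Series(['a', 'd', 'e']))
# Keep only the columns seen at train time; unseen values ('d', 'e')
# become all-zero rows, like the pre-0.15 LabelBinarizer behavior.
aligned = test.reindex(columns=train.columns, fill_value=0).astype(int)
print(aligned)
```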

@jnothman
Member

jnothman commented Jul 22, 2014

I think the assumption in scikit-learn is that after feature extraction, features are a numeric array. And in the case of DictVectorizer it ignores vocabulary unseen in training, like text.CountVectorizer, and assumes that if you want something else you have a way to massage the data before input; however, perhaps just as it has a 'restrict' method, it could have an 'extend' method...
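The hypothetical 'extend' could be approximated today by appending entries to a fitted DictVectorizer's vocabulary_ and feature_names_ attributes. This is a sketch that pokes at internals, not a supported API, and the `extend` helper below is invented for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

def extend(dv, new_feature_names):
    """Add extra feature names to a fitted DictVectorizer (illustrative hack)."""
    for name in new_feature_names:
        if name not in dv.vocabulary_:
            dv.vocabulary_[name] = len(dv.feature_names_)
            dv.feature_names_.append(name)
    return dv

dv = DictVectorizer(sparse=False)
dv.fit([{'k': v} for v in ['a', 'b', 'c']])
extend(dv, ['k=d'])  # provision a column for a value unseen at fit time
print(dv.transform([{'k': 'd'}]))
```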

@hamsal
Contributor

hamsal commented Jul 22, 2014

I will be glad to implement the fix for this issue to properly maintain backwards compatibility. I did not notice this when I made the changes.

@arjoly
Member

arjoly commented Jul 23, 2014

Thanks @hamsal!

@ogrisel
Member

ogrisel commented Jul 24, 2014

Thanks @hamsal, please let me know ASAP when it's ready. I would like to release 0.15.1 as soon as possible.

@hamsal
Contributor

hamsal commented Jul 24, 2014

I will do my best to complete it early tomorrow.

@hamsal
Contributor

hamsal commented Jul 25, 2014

You can find my work in the pull request above; all that is left to finalize is fixing any Travis issues that may come up.

@ogrisel
Member

ogrisel commented Jul 29, 2014

Fix in #3486

@ogrisel ogrisel closed this Jul 29, 2014
