non-numerical labels with LabelBinarizer #856

Closed
mblondel opened this Issue May 15, 2012 · 7 comments

Owner

mblondel commented May 15, 2012

LabelBinarizer may support non-numerical labels just like the recently added LabelEncoder but this is currently not tested. Things to test:

  • binary case
  • multi-class case
  • multi-label case
  • array-like input
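
As a rough illustration of the binary and multi-class cases in the checklist above, here is a minimal sketch. It assumes a recent scikit-learn release, not the 2012 code under discussion, and the label values are made up for the example.

```python
from sklearn.preprocessing import LabelBinarizer

# Binary case: two distinct string labels are encoded as a single column.
binary = LabelBinarizer()
y_bin = binary.fit_transform(["spam", "ham", "spam"])
print(binary.classes_)  # learned classes, stored in sorted order
print(y_bin.ravel())    # 1 marks the second class in classes_

# Multi-class case: one indicator column per distinct label.
multi = LabelBinarizer()
y_multi = multi.fit_transform(["cat", "dog", "bird", "dog"])
print(multi.classes_)
print(y_multi)

# Round trip: the indicator matrix maps back to the original strings.
restored = multi.inverse_transform(y_multi)
print(restored)
```

A test along these lines would check both the learned `classes_` and the round trip through `inverse_transform`.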

larsmans added a commit that referenced this issue May 24, 2012

ENH fix and test LabelBinarizer's handling of string labels
Solves Issue #856. Not advertised in the documentation yet.
855257f
Owner

larsmans commented May 24, 2012

Added some handling for the binary and multiclass cases, though not the multilabel case. Mind you, our handling of multilabel data overloads the meaning of Python sequences quite a bit now...

Owner

mblondel commented May 24, 2012

> Mind you, our handling of multilabel data overloads the meaning of Python sequences quite a bit now...

Can you elaborate?

Owner

larsmans commented May 24, 2012

See preprocessing._is_multilabel:

  • when y[0] is an ndarray, we have an indicator matrix
  • when y[0] is a string, we've got string labels
  • when y[0] is another kind of sequence, we've got a multilabel list
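
The three rules above can be sketched roughly as follows. This is not the actual `preprocessing._is_multilabel` code; `guess_label_format` is a hypothetical helper written for illustration, and it uses Python 3's `str` where the 2012 code base would have had to deal with Python 2 string types.

```python
import numpy as np


def guess_label_format(y):
    """Hypothetical helper: classify y by inspecting its first element."""
    first = y[0]
    if isinstance(first, np.ndarray):
        # Rows of a 2-D array: y is treated as an indicator matrix.
        return "indicator matrix"
    if isinstance(first, str):
        # A string is a sequence too, so it must be checked before
        # the generic sequence test below.
        return "string labels"
    if hasattr(first, "__len__"):
        # Any other sized sequence (list, tuple, ...): multilabel list.
        return "multilabel list"
    return "plain labels"


print(guess_label_format(np.array([[1, 0], [0, 1]])))
print(guess_label_format(["a", "b", "a"]))
print(guess_label_format([("a", "b"), ("c",)]))
print(guess_label_format([1, 2, 3]))
```

The order of the checks matters: because strings are themselves sequences, moving the string test after the generic sequence test would misclassify string labels as a multilabel list, which is the kind of overloading the comment refers to.
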
Owner

mblondel commented May 24, 2012

We should probably remove the special handling of the indicator matrix: if Y is already an indicator matrix, there's no point passing it through LabelBinarizer.

Contributor

samuela commented Aug 8, 2013

I can confirm that this was working in 0.13.X but is now broken in 0.14.X. LabelBinarizer used to accept a numpy array of dtype 'object' but doesn't anymore. I believe https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/multiclass.py#L293 is at least part of the problem.

This is an extremely important feature for working with categorical features and will likely force me to revert to version 0.13. Whenever loading any type of CSV file with pandas, I end up with a numpy array that uses the 'object' dtype to describe string columns.
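
For reference, a minimal sketch of the input described here. It assumes a recent scikit-learn release (where this input is accepted) and builds the `object` array by hand as a stand-in for a string column loaded with pandas; the color values are made up.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Stand-in for the values of a string column read from a CSV file:
# a 1-D numpy array with dtype 'object' holding Python strings.
colors = np.array(["red", "green", "blue", "green"], dtype=object)

lb = LabelBinarizer()
encoded = lb.fit_transform(colors)

print(colors.dtype)   # object
print(lb.classes_)    # learned classes, in sorted order
print(encoded)        # one indicator column per class
```
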

Contributor

samuela commented Aug 21, 2013

Turns out I was running into a related issue #2374.

Owner

arjoly commented Jul 19, 2014

Given the split between label binarizer and multilabel binarizer, this should be well tested now. Thanks to @jnothman and @hamsal.
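
A short sketch of the multilabel side of that split, assuming a recent scikit-learn release where `MultiLabelBinarizer` is available in `sklearn.preprocessing`; the tag values are invented for the example.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each sample carries a collection of string labels; an empty
# collection means the sample has no labels at all.
samples = [["news", "sports"], ["tech"], []]

mlb = MultiLabelBinarizer()
indicator = mlb.fit_transform(samples)

print(mlb.classes_)  # learned labels, in sorted order
print(indicator)     # one row per sample, one column per label

# inverse_transform returns one tuple of labels per row.
print(mlb.inverse_transform(indicator))
```

With the multilabel case handled by its own class, `LabelBinarizer` no longer has to guess from the first element whether a sequence of sequences is a multilabel list.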

arjoly closed this Jul 19, 2014
