Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Already on GitHub? Sign in to your account

Replace lists of labels for multi-label by sparse matrices #2270

Closed
amueller opened this Issue Jul 27, 2013 · 10 comments

Comments

Projects
None yet
4 participants
Owner

amueller commented Jul 27, 2013

We agreed on the sprint that having lists of lists is too hard from an api standpoint and we should rather use sparse matrices, to ease input validation and remove ambiguities.

We will need to provide some helper functions to create the sparse matrix labels, though.

Owner

arjoly commented Nov 15, 2013

For reference #2458

Owner

jnothman commented Dec 8, 2013

Support for sparse label indicators being well under way, I think we need to look at deprecation: We need to continue to support conversion from sequence of sequences (for longer than the deprecation cycle, if not forever) to label indicator matrix, i.e. when the user uses label_binarize or LabelBinarizer.fit, there should be no warning. But other times that sklearn.utils.multiclass.is_sequence_of_sequences returns True, a DeprecationWarning is appropriate. Does that seem right? Does it mean that every time LabelBinarizer or label_binarize is called internally we should add a flag to indicate the need for a warning? Is there a better way to do this?

We also need a warning in make_multiclass_classification(..., return_indicator=False) but that's uncontroversial.

Owner

GaelVaroquaux commented Dec 8, 2013

Support for sparse label indicators being well under way,

This is great!

We need to continue to support conversion from sequence of sequences
(for longer than the deprecation cycle, if not forever)

I think that supporting sequence of sequences leads to code that is
complex and very tricky to get right. I would prefer if in the long run
we did not support it. We can keep a function that does the conversion,
and maybe use it in the LabelBinarizer and label_binarize

to label indicator matrix, i.e. when the user uses label_binarize or
LabelBinarizer.fit, there should be no warning. But other times that
sklearn.utils.multiclass.is_sequence_of_sequences returns True, a
DeprecationWarning is appropriate. Does that seem right?

It does seem right to me.

Does it mean that every time LabelBinarizer or label_binarize is called
internally we should add a flag to indicate the need for a warning?

Forgive me, I am probably missing something trivial, however is it not
possible to make LabelBinarizer and label_binarize use internally a label
indicator array?

Thanks a lot for these efforts, I believe that they are very important
for refactoring the code and making it easier to maintain.

Owner

jnothman commented Dec 8, 2013

Forgive me, I am probably missing something trivial, however is it not possible to make LabelBinarizer and label_binarize use internally a label indicator array?

LabelBinarize and label_binarizer output a label indicator array, given a sequence of sequences (or a label indicator, but in that case they are a no-op; or an ordinary multiclass array). So the preprocessors are used in some parts of the code (OVR, metrics) to ensure normalised input and so that, using inverse_transform, the output predictions are in the same form as the input.

We want to allow users to use those utilities, but when they are used internally they should warn against the deprecated format.

Owner

GaelVaroquaux commented Dec 8, 2013

Makes sens. Thanks for the explanation.

Owner

jnothman commented Dec 9, 2013

I feel like the solution I've proposed is too complex. It would require an argument to label_binarize or LabelBinarizer like permit_sequences=True that is passed as False when used internally, and that would need to stay after deprecation. If we offer a separate tool like a LabelSetsBinarizer to handle the sequence of sequences/sets case when a user wants it, then that requires minimal disruption to everything else. I'll pull together a PR.

Owner

GaelVaroquaux commented Dec 10, 2013

If we offer a separate tool like a LabelSetsBinarizer to handle
the sequence of sequences/sets case when a user wants it, then that requires
minimal disruption to everything else.

Sounds good to me. We need to be careful to document this well enough so
that users find it. Maybe it can even be suggested in the error msg that
complains that the input format is not good.

Owner

jnothman commented Dec 10, 2013

Maybe it can even be suggested in the error msg that complains that the input format is not good.

That was my intention. Coming soon...

Owner

arjoly commented Jul 19, 2014

We finally make it thanks to @jnothman and @hamsal.
there is still some work to have sparse output metrics, but this is coming and all the sequence of sequence stuff is now deprecated.

@arjoly arjoly closed this Jul 19, 2014

Owner

GaelVaroquaux commented Jul 19, 2014

We finally make it thanks to @jnothman and @hamsal.

Good job people! That was a tough, but important work.

@jnothman jnothman referenced this issue Feb 11, 2015

Closed

[WIP] ENH support sparse y in GridSearchCV #4228

2 of 4 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment