[MRG+1] deprecate sequences of sequences multilabel support #2657

jnothman · 2013-12-11T09:09:27Z

Towards #2270

add warnings: most uses of the deprecated format are through type_of_target. Its helper, is_sequence_of_sequences, triggers the warning. A further warning applies to make_multilabel_classification.
provide alternative binarizer, sklearn.preprocessing.MultiLabelBinarizer
fix documentation

This will require a bit of updating once the alternative sparse matrix support is incorporated from #2458 and https://github.com/jnothman/scikit-learn/tree/sparse_multi_metrics ... or vice-versa.
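To make the two formats concrete, here is a minimal pure-Python sketch of the conversion between a collection of label collections and a 0/1 indicator matrix. It is not the code from this PR; the function names `binarize_label_sets` and `labels_from_indicator` are hypothetical, and the real transformer works with NumPy arrays rather than nested lists.

```python
from itertools import chain


def binarize_label_sets(label_sets):
    """Return (classes, indicator) for a list of label collections."""
    # Sorted union of every label seen, so the column order is deterministic.
    classes = sorted(set(chain.from_iterable(label_sets)))
    column = {label: j for j, label in enumerate(classes)}
    indicator = []
    for labels in label_sets:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        indicator.append(row)
    return classes, indicator


def labels_from_indicator(classes, indicator):
    """Inverse direction: turn each 0/1 row back into a tuple of labels."""
    return [tuple(c for c, flag in zip(classes, row) if flag) for row in indicator]


classes, Y = binarize_label_sets([("sci-fi", "thriller"), ("comedy",), ()])
```

A sample with no labels becomes an all-zero row, which the indicator format represents without any special casing.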

coveralls · 2013-12-11T09:56:52Z

Coverage remained the same when pulling 0e0ac76 on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

coveralls · 2013-12-11T23:47:19Z

Coverage remained the same when pulling 2ab7c62 on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

mblondel · 2013-12-12T01:45:10Z

I think sequences of sequences should also be deprecated from OneVsRestClassifier. Then if y is a 1d array, OneVsRestClassifier should call LabelBinarizer on it. If y is a 2d array, it should assume that it's a label indicator matrix (optionally, it should lazily densify each column if y is sparse).

mblondel · 2013-12-12T01:52:22Z

My reasoning is that we really don't want to have code like

if _sequences_of_sequences(y):
    Y = MultiLabelBinarizer().fit_transform(y)
else:
    Y = LabelBinarizer().fit_transform(y)

lying around. So we should deprecate sequences of sequences even in OneVsRestClassifier and rely on the shape of y instead.
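The shape-based dispatch proposed here could look roughly like the sketch below. It uses nested Python lists instead of arrays, the names `is_2d` and `to_indicator` are hypothetical, and unlike LabelBinarizer it always emits one column per class, so treat it as an illustration of the idea rather than of scikit-learn's behaviour.

```python
def is_2d(y):
    """True when the first element of y is itself a list or tuple (a row)."""
    return len(y) > 0 and isinstance(y[0], (list, tuple))


def to_indicator(y):
    """Binarize targets by looking only at the dimensionality of y."""
    if is_2d(y):
        # 2d input: assume it is already a label indicator matrix.
        width = len(y[0])
        for row in y:
            if len(row) != width or any(v not in (0, 1) for v in row):
                raise ValueError("2d y must be a rectangular 0/1 indicator matrix")
        return [list(row) for row in y]
    # 1d input: one column per class, exactly one 1 per row.
    classes = sorted(set(y))
    return [[int(label == c) for c in classes] for label in y]
```

With this rule a ragged list of label lists is rejected outright instead of being silently interpreted, which is the ambiguity the deprecation is meant to remove.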

coveralls · 2013-12-12T07:28:32Z

Coverage remained the same when pulling 301061f on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

jnothman · 2013-12-12T07:34:57Z

@mblondel, anything that uses is_sequence_of_sequences will produce a
DeprecationWarning, including OvR's use of LabelBinarizer. Sorry for not
clarifying that in my PR blurb, but there was so long between coding the
warnings and finishing all else for the PR.
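The propagation described above can be sketched as follows: a low-level detection helper emits the warning, so every caller inherits it without further code. The `_sketch` function names and the warning text are assumptions for illustration, and the check is simplified to nested lists and tuples; the real helper also has to rule out arrays.

```python
import warnings


def is_sequence_of_sequences_sketch(y):
    """Return True for nested lists/tuples of labels, warning as a side effect."""
    found = len(y) > 0 and isinstance(y[0], (list, tuple))
    if found:
        warnings.warn(
            "Sequence-of-sequences multilabel targets are deprecated; "
            "use a label indicator matrix instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return found


def type_of_target_sketch(y):
    """Any caller of the helper triggers the warning without extra code."""
    if is_sequence_of_sequences_sketch(y):
        return "multilabel-sequences"
    return "binary" if len(set(y)) <= 2 else "multiclass"
```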


jnothman · 2013-12-12T07:37:13Z

But you're right that I've not checked parameters sections of comments. I shall do that.

jnothman · 2013-12-12T07:40:23Z

Also, I see that we will in the future not need a full type_of_target in something like OvR. But given that we need to support the old format for now, I think that's best to clean up at the end of the deprecation cycle.

coveralls · 2013-12-12T08:36:16Z

Coverage remained the same when pulling 99dc630 on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

ogrisel · 2013-12-12T17:44:23Z

Putting that PR on the 0.15 milestone as well. I will try to have a deeper look at it in the coming days.

jnothman · 2013-12-12T21:47:33Z

Ideally, this should be pulled in with the sparse label indicator support PRs. But at a minimum it would be nice to start warning that the end is nigh...

mblondel · 2013-12-13T04:40:23Z

So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y right? I was thinking this behavior should probably go too if multi-label support is removed from LabelBinarizer.

arjoly · 2013-12-13T07:34:04Z

> So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y right?

Note that this is no longer the case since #1993 was fixed.

jnothman · 2013-12-13T07:51:42Z

> So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y right? I was thinking this behavior should probably go too if multi-label support is removed from LabelBinarizer.

Perhaps. I guess I'd need to review where LabelBinarizer is used to work that out. And it would be a much bigger PR...

Is it possible to consider the changes here and then work out what we want the function of LabelBinarizer to be? For now, we are certain that we want it to give a warning on sequences of sequences, and to remove support for them in a couple of versions.

mblondel · 2013-12-13T08:15:32Z

> Note that this is no longer the case since #1993 was fixed.

Then multilabel classification with indicator matrix probably doesn't work yet in OneVsRestClassifier. It would be nice to test this.

jnothman · 2013-12-17T03:11:03Z

@mblondel

> Then multilabel classification with indicator matrix probably doesn't work yet in OneVsRestClassifier. It would be nice to test this.

I don't understand why you think so. If I understand you correctly, this is tested at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_multiclass.py#L94

mblondel · 2013-12-17T07:41:59Z

It's just that if the no-op behavior was removed from LabelBinarizer, I don't see how this line can work in the multilabel case with indicator matrix:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multiclass.py#L88

jnothman · 2013-12-17T07:58:45Z

The no-op behaviour was removed because it ignored LabelBinarizer's pos_label and neg_label parameters. When passed a label indicator matrix, LabelBinarizer now rewrites it to use the specified pos_label and neg_label, hence an op. But it still works when used with the default pos/neg labels.
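A minimal sketch of why the transformation is no longer a no-op but still harmless with defaults: rewriting a 0/1 indicator matrix with the requested values changes it only when non-default labels are given. The function name `apply_pos_neg_labels` is hypothetical; the parameter names mirror those mentioned in the comment.

```python
def apply_pos_neg_labels(indicator, pos_label=1, neg_label=0):
    """Rewrite a 0/1 indicator matrix to use the requested pos/neg values."""
    return [[pos_label if v else neg_label for v in row] for row in indicator]


Y = [[1, 0, 1], [0, 0, 1]]
unchanged = apply_pos_neg_labels(Y)               # defaults: identical values
signed = apply_pos_neg_labels(Y, neg_label=-1)    # zeros become -1
```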


jnothman · 2014-01-02T07:43:29Z

I think your implication is right, @mblondel: that LabelBinarizer and label_binarize should no longer support multilabel at all. I'll change this back to WIP and consider how to implement it.

jnothman · 2014-01-16T03:17:51Z

> I think your implication is right, @mblondel: that LabelBinarizer and label_binarize should no longer support multilabel at all. I'll change this back to WIP and consider how to implement it.

Having said this, I'm not really sure what the best way to sort out the various uses of label binarizer is, and whether it's worth having a utility that will binarize multiclass data, but pass through -- or change the negative value on -- an array that is already binarized (i.e. multilabel).

jnothman · 2014-02-24T02:02:48Z

@mblondel, I am no longer convinced that this is the right place to deprecate multilabel support in LabelBinarizer: OvR currently stores its state in a label_binarizer_ enabling the inverse transformation during prediction. To take multilabel out of LabelBinarizer entails selecting another way of storing this state information and performing the inverse transformation. LabelBinarizer is then poorly named for multilabel data, because it takes a binarized input, yet it is still a convenient abstraction even though it more-or-less passes the data through.

As a result, I have changed this PR back to [MRG] and welcome reviews. (ping @arjoly, I know that you're working in related space)
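The role of the stored binarizer can be illustrated with a small stand-in: it remembers the class order at fit time so that per-class scores can be mapped back to labels at prediction time. `TinyBinarizer` is a hypothetical class covering only the multiclass case, not OneVsRestClassifier's actual code.

```python
class TinyBinarizer:
    """Remembers the class order so predictions can be mapped back to labels."""

    def fit(self, y):
        self.classes_ = sorted(set(y))
        return self

    def transform(self, y):
        # One column per class, a single 1 marking the class of each sample.
        return [[int(label == c) for c in self.classes_] for label in y]

    def inverse_transform(self, scores):
        # Pick the class whose column has the highest score in each row.
        return [
            self.classes_[max(range(len(row)), key=row.__getitem__)]
            for row in scores
        ]


lb = TinyBinarizer().fit(["cat", "dog", "bird"])
predicted = lb.inverse_transform([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]])
```

Removing multilabel handling from such an object would mean keeping the class order, and the inverse mapping, somewhere else in the meta-estimator.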

arjoly · 2014-02-24T16:33:23Z

Can you rebase on top of master? I will try to have a look at this in the coming days.

coveralls · 2014-02-24T23:52:40Z

Coverage remained the same when pulling 8066ff6 on jnothman:seq_of_seqs_warn into b96f354 on scikit-learn:master.

coveralls · 2014-02-25T02:39:39Z

Coverage remained the same when pulling 2b69aa9 on jnothman:seq_of_seqs_warn into b96f354 on scikit-learn:master.

arjoly · 2014-02-25T07:37:03Z

~~In metrics.py, should we raise deprecation warning with multilabel-sequence ?~~ You suppress a part of the doc on this subject, but not all.

edit: Ok, you used is_sequence to perform the deprecation.

arjoly · 2014-02-25T07:46:27Z

sklearn/preprocessing/label.py

+
+        # Automatically increment on new class
+        class_mapping = defaultdict(int)
+        class_mapping.default_factory = class_mapping.__len__


Looks like a hack. Can you comment on what you are doing here?

It is a hack; one that I've already used in CountVectorizer. I've commented it two lines prior, but maybe putting the following in utils is better:

class AutoIncrementer(dict):
    def __missing__(self, key):
        out = len(self)
        self[key] = out
        return out

WDYT?

For what it's worth, this solution is almost twice as slow as the defaultdict hack, but I don't expect this assignment to dominate anywhere we use it:

%timeit a=AutoIncrementer(); sum(a[k] for k in range(100000))
10 loops, best of 3: 66.1 ms per loop

I don't have any strong opinion on this. Let's keep this version.
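For readers unfamiliar with the trick, the two variants discussed here behave identically: both assign each previously unseen key the next integer index. The sketch below places them side by side; the wrapper functions `index_with_defaultdict` and `index_with_subclass` are hypothetical names added for comparison.

```python
from collections import defaultdict


class AutoIncrementer(dict):
    """dict that assigns len(self) to any key looked up for the first time."""

    def __missing__(self, key):
        out = len(self)
        self[key] = out
        return out


def index_with_defaultdict(items):
    mapping = defaultdict(int)
    # Replace the factory so a missing key gets the current size as its value.
    mapping.default_factory = mapping.__len__
    return [mapping[item] for item in items], dict(mapping)


def index_with_subclass(items):
    mapping = AutoIncrementer()
    return [mapping[item] for item in items], dict(mapping)
```

The defaultdict version works because the factory is called before the new key is stored, so `len` still reports the size without it.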

If there's a clear name for it, I'd be happy to refactor some utilities relating to this indexation or vocabulary construct (mapping from objects to contiguous integers from 0 to n_objects - 1). The operations that commonly happen are: create a dict that autoincrements; map that dict to an equivalent array/list; map back the other direction. I'm not sure if these would benefit from being abstracted behind functions.

Yet, for that second function, for example, we have used:

l = sorted(d, key=d.__getitem__)

or similar to convert to a list, but the following is faster (linear complexity; faster benchmark):

l = np.empty(len(d), dtype=object)
keys, vals = zip(*six.iteritems(d))
l[list(vals)] = keys

so putting it behind a named function might make sense.
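The two ways of turning a key-to-index mapping back into an ordered list can be compared without NumPy. This is a plain-list sketch of the same idea; both function names are hypothetical.

```python
def invert_by_sorting(mapping):
    """O(n log n): order the keys by their assigned index."""
    return sorted(mapping, key=mapping.__getitem__)


def invert_by_assignment(mapping):
    """O(n): drop each key straight into the slot given by its index."""
    out = [None] * len(mapping)
    for key, index in mapping.items():
        out[index] = key
    return out


vocabulary = {"b": 0, "a": 1, "c": 2}
```

The linear version relies on the indices being exactly 0 to n - 1, which holds for mappings built by auto-incrementing.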

If those patterns are repeated several times, it would be nice to make some code refactoring. But this would be the topic of another pr.

Yes, and I've been wondering whether I should submit a patch to perform parallelised feature extraction for text which would probably incorporate these sorts of helpers.

arjoly · 2014-02-25T07:56:20Z

While MultiLabelBinarizer clearly states that it is intended for multilabel data, I don't see a direct link to the sequence-of-sequences (collection-of-collections) label format.

What do you think of LabelSequenceBinarizer, MultilabelSequenceBinarizer, LabelSequenceEncoder or MultilabelSequenceEncoder?

arjoly · 2014-02-25T07:57:52Z

sklearn/preprocessing/label.py

+        return yt.toarray()
+
+    def _transform(self, y, class_mapping):
+        indices = array.array('i')


Can you add a small comment of what is _transform and class_mapping?
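As a rough illustration of what a `_transform`-style helper with a `class_mapping` argument might do, the sketch below walks the label collections once and accumulates the column indices and row pointers of a CSR matrix in compact `array.array` buffers. The function name and return shape are assumptions, not the PR's code; the three returned sequences are the pieces that a `scipy.sparse.csr_matrix((data, indices, indptr))` call would take.

```python
import array


def transform_to_csr_parts(label_sets, class_mapping):
    """Build (data, indices, indptr) for a sparse indicator matrix."""
    indices = array.array("i")        # column index of every stored 1
    indptr = array.array("i", [0])    # where each row starts in `indices`
    for labels in label_sets:
        # A set removes labels repeated within one sample.
        indices.extend(sorted({class_mapping[label] for label in labels}))
        indptr.append(len(indices))
    data = [1] * len(indices)
    return data, list(indices), list(indptr)


parts = transform_to_csr_parts(
    [["a", "c"], [], ["b", "b"]], {"a": 0, "b": 1, "c": 2}
)
```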

Also avoid stride trick

arjoly · 2014-03-10T12:25:33Z

LGTM!

Thanks @jnothman

GaelVaroquaux · 2014-03-11T02:30:14Z

doc/modules/multiclass.rst

+Producing multilabel data as a list of sets of labels may be more intuitive.
+The transformer :class:`MultiLabelBinarizer <preprocessing.MultiLabelBinarizer>`
+will convert between a collection of collections of labels and the indicator
+format.


Maybe it would be useful to add a note giving the version at which the MultiLabelBinarizer was added. This can be done using the versionadded markup in Sphinx: http://sphinx-doc.org/markup/para.html

In this case the version will be 0.15.

hamsal · 2014-05-23T23:03:09Z

Hi @jnothman, does this PR need a rebase? If so, is it OK if I take the code here and rebase it on top of the current repository?

I will soon be working on #2458 to fill in the missing pieces and be responsive to comments/reviews of the implementation. I would like to make this PR, which is pretty much complete, a prerequisite because it will be easier to deprecate sequence of sequences now before #2458 is implemented.

Thanks!

jnothman · 2014-05-24T11:43:46Z

> Does this PR need a rebase? If so, is it OK if I take the code here and rebase it on top of the current repository?

Github suggests there shouldn't be any conflicts for a merge/rebase. Ideally, we should just merge this PR ASAP, but it needs another review.

arjoly · 2014-05-29T16:20:57Z

A last review would be very nice!

kastnerkyle · 2014-06-04T08:33:25Z

Checked this out, and had to rebase with master. Once that was done, I see 1 failing doctest

======================================================================
FAIL: Doctest: preprocessing.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/volatile/accounts/kkastner/anaconda3/lib/python3.4/doctest.py", line 2193, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
nose.proxy.AssertionError: Failed doctest test for preprocessing.rst
  File "/volatile/accounts/kkastner/src/scikit-learn/doc/modules/preprocessing.rst", line 0

----------------------------------------------------------------------
File "/volatile/accounts/kkastner/src/scikit-learn/doc/modules/preprocessing.rst", line 386, in preprocessing.rst
Failed example:
    lb.classes_
Expected:
    array([1, 2, 3])
Got:
    array([1, 2, 3], dtype=object)

>>  raise self.failureException(self.format_failure(<_io.StringIO object at 0x2b298afcbc18>.getvalue()))


----------------------------------------------------------------------

Can you rebase with master? This seems like a Very Good Thing (TM)

ogrisel · 2014-06-04T11:12:26Z

sklearn/preprocessing/label.py

+            classes = sorted(set(itertools.chain.from_iterable(y)))
+        else:
+            classes = self.classes
+        self.classes_ = np.empty(len(classes), dtype=object)


I would find it more natural to only use the object dtype for non integer y. That would fix the doctest failure.

This could be implemented as (untested):

if all(isinstance(c, int) for c in classes):
    dtype = int
else:
    dtype = object
self.classes_ = np.asarray(classes, dtype=dtype)

@jnothman @arjoly do you agree with this suggested change? If so I can do it and then merge this PR.

it's ok for me

As long as we can assume consistent typing from the first element of the iterable of iterables.

Sorry, got confused about context.
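The dtype rule suggested in this thread reduces to a single type check over the collected classes. The helper below isolates that decision so it can be exercised without NumPy; `classes_dtype` is a hypothetical name, and the returned type is what would be passed as `dtype` to `np.asarray`.

```python
def classes_dtype(classes):
    """Return int when every class is an integer, otherwise object."""
    if all(isinstance(c, int) for c in classes):
        return int
    return object
```

Note that an empty class list also yields `int`, because `all` over an empty iterable is true, and that `bool` values count as integers under `isinstance`.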

ogrisel · 2014-06-05T10:07:22Z

I will merge #3246 as that includes the fix for the int dtype as soon as travis is green.

ogrisel · 2014-06-05T11:44:54Z

Alright, merged! Thank you very much @jnothman for the fix.

arjoly · 2014-06-05T11:50:34Z

Great, this is finally done! Congratulations @jnothman for all your efforts!

jnothman · 2014-06-05T12:45:48Z

Thanks for pulling this through.

jnothman · 2014-06-05T12:52:47Z

And I'm looking forward to the 0.15 beta. Is there sense in trying to include in it some support for a sparse multilabel format (e.g. sparse_output in LabelBinarizer, recognition in type_of_target and sparse support in metrics, all of which have been largely implemented)? Warning users whose label spaces consume too much memory as dense indicator matrices may get annoying when there is no recourse to a sparse alternative.

jnothman mentioned this pull request Feb 4, 2014

[MRG+1] allow y to be a list in GridSearchCV, cross_val_score and train_test_split #2694

Merged

arjoly reviewed Feb 25, 2014

jnothman added 10 commits March 10, 2014 22:56

FIX don't use dict comprehension for Py 2.6

04a98f8

FIX No set construction shorthand in Py2.6

4ee94da

DOC comment on _transform interface

4888c54

Also avoid stride trick

TST Validate input MultiLabelBinarizer.inverse_transform

cbc75fe

No set construction shorthand in Py2.6

8d20025

DOC/FIX Address @arjoly's comments

642f3ae

TST stronger test for non-integers in MultiLabelBinarizer

17c2f4f

Assert or ignore all sequence of sequences deprecation warnings

e84fe03

TST avoid more warnings related to sequence of sequences

ec2dc07

TST fix testing for sequence of sequences warning in metrics

567acd6

GaelVaroquaux reviewed Mar 11, 2014

ogrisel reviewed Jun 4, 2014

ogrisel mentioned this pull request Jun 5, 2014

[MRG+1] deprecate sequences of sequences multilabel support #3246

Merged

ogrisel closed this Jun 5, 2014

arjoly mentioned this pull request Jun 5, 2014

[MRG+2] Sparse label_binarizer #3203

Closed

17 tasks

amueller modified the milestones: 0.16, 0.15 Jul 15, 2014

jnothman mentioned this pull request Feb 11, 2015

[WIP] ENH support sparse y in GridSearchCV #4228

Closed

4 tasks
