
[MRG+1] deprecate sequences of sequences multilabel support #2657

Closed
wants to merge 16 commits into from

Conversation

jnothman
Member

Towards #2270

  • add warnings: most uses of the deprecated format go through type_of_target, whose helper is_sequence_of_sequences triggers the warning. A further warning applies to make_multilabel_classification.
  • provide alternative binarizer, sklearn.preprocessing.MultiLabelBinarizer
  • fix documentation

This will require a bit of updating once the alternative sparse matrix support is incorporated from #2458 and https://github.com/jnothman/scikit-learn/tree/sparse_multi_metrics ... or vice versa.
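For readers unfamiliar with the replacement binarizer, here is a minimal usage sketch. It assumes a scikit-learn release that ships `MultiLabelBinarizer`; the sample labels are made up for illustration.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Old-style targets: one collection of labels per sample (illustrative data).
y = [{"sci-fi", "thriller"}, {"comedy"}, set()]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y)               # 0/1 indicator matrix, one column per class
classes = list(mlb.classes_)           # columns follow the sorted class order
round_trip = mlb.inverse_transform(Y)  # back to one tuple of labels per sample
```

The indicator matrix `Y` is the format that the rest of the library is meant to consume once sequences of sequences are gone.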

@coveralls

Coverage Status

Coverage remained the same when pulling 0e0ac76 on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 2ab7c62 on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

@mblondel
Member

I think sequences of sequences should also be deprecated from OneVsRestClassifier. Then if y is a 1d array, OneVsRestClassifier should call LabelBinarizer on it. If y is a 2d array, it should assume that it's a label indicator matrix (optionally, it should lazily densify each column if y is sparse).

@mblondel
Member

My reasoning is that we really don't want to have code like

if _sequences_of_sequences(y):
    Y = MultiLabelBinarizer().fit_transform(y)
else:
    Y = LabelBinarizer().fit_transform(y)

lying around. So we should deprecate sequences of sequences even in OneVsRestClassifier and rely on the shape of y instead.
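A rough sketch of what relying on the shape of y could look like. The helper name `to_indicator` is hypothetical and this is not the code OneVsRestClassifier uses; it only illustrates dispatching on the number of dimensions instead of inspecting the contents of y.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer


def to_indicator(y):
    """Hypothetical helper: choose the handling from y's shape alone."""
    y = np.asarray(y)
    if y.ndim == 1:
        # 1d targets are multiclass labels and need binarizing.
        return LabelBinarizer().fit_transform(y)
    if y.ndim == 2:
        # 2d targets are taken to be a label indicator matrix already.
        return y
    raise ValueError("expected a 1d or 2d target array, got %dd" % y.ndim)


multiclass = to_indicator([0, 2, 1, 2])
multilabel = to_indicator([[1, 0, 1], [0, 1, 0]])
```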

@coveralls

Coverage Status

Coverage remained the same when pulling 301061f on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

@jnothman
Member Author

@mblondel, anything that uses is_sequence_of_sequences will produce a
DeprecationWarning, including OvR's use of LabelBinarizer. Sorry for not
clarifying that in my PR blurb, but there was so long between coding the
warnings and finishing all else for the PR.
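To illustrate how a single helper can carry the deprecation for every caller, here is a simplified sketch. The function below is a hypothetical stand-in, not scikit-learn's is_sequence_of_sequences, and the warning text is invented.

```python
import warnings


def is_sequence_of_sequences_sketch(y):
    """Hypothetical stand-in: detect the old format and warn about it."""
    try:
        first = y[0]
    except (IndexError, KeyError, TypeError):
        return False
    found = (not isinstance(y, str)
             and isinstance(first, (list, tuple, set, frozenset)))
    if found:
        # Every caller that routes through this check inherits the warning.
        warnings.warn(
            "sequence of sequences multilabel targets are deprecated; "
            "use a label indicator matrix instead",
            DeprecationWarning,
            stacklevel=2,
        )
    return found


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_format = is_sequence_of_sequences_sketch([[1, 2], [3]])
    flat = is_sequence_of_sequences_sketch([1, 2, 3])
```

Only the first call warns; the flat list of labels passes through silently.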

On Thu, Dec 12, 2013 at 6:28 PM, Coveralls notifications@github.com wrote:

Coverage Status: https://coveralls.io/builds/377101

Coverage remained the same when pulling 301061f on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

@jnothman
Member Author

But you're right that I've not checked parameters sections of comments. I shall do that.

@jnothman
Member Author

Also, I see that we will in the future not need a full type_of_target in something like OvR. But given that we need to support the old format for now, I think that's best to clean up at the end of the deprecation cycle.

@coveralls

Coverage Status

Coverage remained the same when pulling 99dc630 on jnothman:seq_of_seqs_warn into f86ed77 on scikit-learn:master.

@ogrisel
Member

ogrisel commented Dec 12, 2013

Putting that PR on the 0.15 milestone as well. I will try to have a deeper look at it in the coming days.

@jnothman
Member Author

Ideally, this should be pulled in with the sparse label indicator support PRs. But at a minimum it would be nice to start warning that the end is nigh...

@mblondel
Member

So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y right? I was thinking this behavior should probably go too if multi-label support is removed from LabelBinarizer.

@arjoly
Member

arjoly commented Dec 13, 2013

> So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y right?

Note that this is no longer the case since #1993 was fixed.

@jnothman
Member Author

> So OneVsRestClassifier's support for label indicator matrices relies on the fact that LabelBinarizer is a no-op when it is fed with a 2d y right? I was thinking this behavior should probably go too if multi-label support is removed from LabelBinarizer.

Perhaps. I guess I'd need to review where LabelBinarizer is used to work that out. And it would be a much bigger PR...

Is it possible to consider the changes here and then work out what we want the function of LabelBinarizer to be? For now, we are certain that we want it to give a warning on sequences of sequences, and to remove support for them in a couple of versions.

@mblondel
Member

> Note that this is no longer the case since #1993 was fixed.

Then multilabel classification with an indicator matrix probably doesn't work yet in OneVsRestClassifier. It would be nice to test this.

@jnothman
Member Author

@mblondel

> Then multilabel classification with an indicator matrix probably doesn't work yet in OneVsRestClassifier. It would be nice to test this.

I don't understand why you think so. If I understand you correctly, this is tested at https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tests/test_multiclass.py#L94

@mblondel
Member

It's just that if the no-op behavior was removed from LabelBinarizer, I don't see how this line can work in the multilabel case with indicator matrix:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multiclass.py#L88

@jnothman
Member Author

That's because the no-op behaviour was removed only because it did not
respect LabelBinarizer's pos_label and neg_label parameters. When passed a
label indicator matrix, LabelBinarizer now rewrites it to use the specified
pos_label/neg_label, hence an op. But it still works when used with the
default pos/neg labels.
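The point about pos_label and neg_label can be sketched with plain NumPy. The helper `relabel_indicator` is hypothetical and only mimics the relabelling idea described here; it is not LabelBinarizer's code.

```python
import numpy as np


def relabel_indicator(Y, pos_label=1, neg_label=0):
    """Hypothetical helper: map a 0/1 indicator matrix onto pos/neg labels."""
    Y = np.asarray(Y)
    return np.where(Y == 1, pos_label, neg_label)


Y = np.array([[1, 0, 1], [0, 1, 0]])
same = relabel_indicator(Y)                  # defaults: values are unchanged
signed = relabel_indicator(Y, neg_label=-1)  # non-default: values are rewritten
```

With the defaults the output equals the input, which is why the old pass-through behaviour looked like a no-op; with any other labels the matrix has to be rewritten.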

On Tue, Dec 17, 2013 at 6:42 PM, Mathieu Blondel notifications@github.com wrote:

It's just that if the no-op behavior was removed from LabelBinarizer, I don't see how this line can work in the multilabel case with indicator matrix:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/multiclass.py#L88

@jnothman
Member Author

jnothman commented Jan 2, 2014

I think your implication is right, @mblondel: that LabelBinarizer and label_binarize should no longer support multilabel at all. I'll change this back to WIP and consider how to implement it.

@jnothman
Member Author

> I think your implication is right, @mblondel: that LabelBinarizer and label_binarize should no longer support multilabel at all. I'll change this back to WIP and consider how to implement it.

Having said this, I'm not really sure what the best way to sort out the various uses of label binarizer is, and whether it's worth having a utility that will binarize multiclass data, but pass through -- or change the negative value on -- an array that is already binarized (i.e. multilabel).

@jnothman
Member Author

@mblondel, I am no longer convinced that this is the right place to deprecate multilabel support in LabelBinarizer: OvR currently stores its state in a label_binarizer_ enabling the inverse transformation during prediction. To take multilabel out of LabelBinarizer entails selecting another way of storing this state information and performing the inverse transformation. LabelBinarizer is then poorly named for multilabel data, because it takes a binarized input, yet it is still a convenient abstraction even though it more-or-less passes the data through.

As a result, I have changed this PR back to [MRG] and welcome reviews. (ping @arjoly, I know that you're working in related space)

@arjoly
Member

arjoly commented Feb 24, 2014

Can you rebase on top of master? I will try to have a look at this in the coming days.

@coveralls

Coverage Status

Coverage remained the same when pulling 8066ff6 on jnothman:seq_of_seqs_warn into b96f354 on scikit-learn:master.

@coveralls

Coverage Status

Coverage remained the same when pulling 2b69aa9 on jnothman:seq_of_seqs_warn into b96f354 on scikit-learn:master.

@arjoly
Member

arjoly commented Feb 25, 2014

In metrics.py, should we raise a deprecation warning for the multilabel sequence format? You removed part of the documentation on this subject, but not all of it.

edit: Ok, you used is_sequence to perform the deprecation.


# Automatically increment on new class
class_mapping = defaultdict(int)
class_mapping.default_factory = class_mapping.__len__
Member

Looks like a hack. Can you add a comment explaining what you are doing here?

Member Author

It is a hack; one that I've already used in CountVectorizer. I've commented it two lines prior, but maybe putting the following in utils is better:

class AutoIncrementer(dict):
    def __missing__(self, key):
        out = len(self)
        self[key] = out
        return out

WDYT?
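To make the comparison concrete, the following sketch runs both variants side by side on made-up labels. The wrapper `defaultdict_mapping` is a hypothetical name introduced here; the two constructs themselves are the ones quoted in this thread.

```python
from collections import defaultdict


class AutoIncrementer(dict):
    def __missing__(self, key):
        # A new key gets the next free integer: the current size of the dict.
        out = len(self)
        self[key] = out
        return out


def defaultdict_mapping():
    # The "hack": the default factory is the dict's own __len__, so a missing
    # key is assigned the number of keys seen so far.
    mapping = defaultdict(int)
    mapping.default_factory = mapping.__len__
    return mapping


labels = ["spam", "eggs", "spam", "ham", "eggs"]
hack = defaultdict_mapping()
explicit = AutoIncrementer()
hack_ids = [hack[label] for label in labels]
explicit_ids = [explicit[label] for label in labels]
```

Both end up mapping each distinct label to a contiguous integer in order of first appearance.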

Member Author

For what it's worth, this solution is almost twice as slow as the defaultdict hack, but I don't expect this assignment to dominate anywhere we use it:

%timeit a=AutoIncrementer(); sum(a[k] for k in range(100000))
10 loops, best of 3: 66.1 ms per loop

Member

I don't have any strong opinion on this. Let's keep this version.

Member Author

If there's a clear name for it, I'd be happy to refactor some utilities relating to this indexation or vocabulary construct (mapping from objects to contiguous integers from 0 to n_objects - 1). The operations that commonly happen are: create a dict that autoincrements; map that dict to an equivalent array/list; map back the other direction. I'm not sure if these would benefit from being abstracted behind functions.

Yet, for that second function, for example, we have used:

l = sorted(d, key=d.__getitem__)

or similar to convert to a list, but the following is faster (linear complexity; faster benchmark):

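l = np.empty(len(d), dtype=object)  # allocate one slot per key
keys, vals = zip(*six.iteritems(d))
l[list(vals)] = keys  # index with a list; a tuple would be read as a multi-dimensional index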

so putting it behind a named function might make sense.
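A small runnable comparison of the two inversion strategies, written for Python 3 (so `d.items()` stands in for `six.iteritems(d)`). The dictionary contents are illustrative.

```python
import numpy as np

# A vocabulary-style mapping from objects to contiguous integers 0..n-1.
d = {"spam": 0, "eggs": 1, "ham": 2}

# Sort-based inversion: O(n log n).
by_sort = sorted(d, key=d.__getitem__)

# Index-based inversion: each key is written directly into its slot.
by_index = np.empty(len(d), dtype=object)
keys, vals = zip(*d.items())
by_index[list(vals)] = keys
```

Both give the keys ordered by their integer value; the second relies on the values being exactly 0 to n_objects - 1 with no gaps.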

Member

If those patterns are repeated several times, it would be nice to refactor them. But that would be the topic of another PR.

Member Author

Yes, and I've been wondering whether I should submit a patch to perform parallelised feature extraction for text which would probably incorporate these sorts of helpers.

@arjoly
Member

arjoly commented Feb 25, 2014

While the name MultiLabelBinarizer clearly states that it is intended for multilabel data, I don't directly see the link with the sequence-of-sequences (collection of collections) label format.

What do you think of LabelSequenceBinarizer, MultilabelSequenceBinarizer, LabelSequenceEncoder or MultilabelSequenceEncoder?

return yt.toarray()

def _transform(self, y, class_mapping):
    indices = array.array('i')
Member

Can you add a short comment explaining what _transform and class_mapping are?
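As a guess at the kind of explanation being asked for, here is one way such a method could build a sparse indicator matrix from a label-to-column mapping. The function `transform_sketch` is hypothetical and is not the code under review; only the `array.array('i')` accumulation idea is taken from the quoted hunk.

```python
import array

import numpy as np
import scipy.sparse as sp


def transform_sketch(y, class_mapping):
    """Hypothetical sketch: y is an iterable of label collections and
    class_mapping maps each label to its column index."""
    indices = array.array('i')      # column index of every nonzero entry
    indptr = array.array('i', [0])  # row boundaries into `indices`
    for labels in y:
        indices.extend(sorted(set(class_mapping[label] for label in labels)))
        indptr.append(len(indices))
    data = np.ones(len(indices), dtype=int)
    return sp.csr_matrix(
        (data,
         np.asarray(indices, dtype=np.intc),
         np.asarray(indptr, dtype=np.intc)),
        shape=(len(indptr) - 1, len(class_mapping)),
    )


mapping = {"a": 0, "b": 1, "c": 2}
Y = transform_sketch([["a", "c"], ["b"], []], mapping)
```

Passing an auto-incrementing mapping instead of a fixed dict would let the same loop discover the classes while it transforms.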

@arjoly
Member

arjoly commented Mar 10, 2014

LGTM!

Thanks @jnothman

Producing multilabel data as a list of sets of labels may be more intuitive.
The transformer :class:`MultiLabelBinarizer <preprocessing.MultiLabelBinarizer>`
will convert between a collection of collections of labels and the indicator
format.
Member

Maybe it would be useful to add a note giving the version at which the MultiLabelBinarizer was added. This can be done using the versionadded markup in Sphinx: http://sphinx-doc.org/markup/para.html

Member

+1

Member

In this case the version will be 0.15.

@hamsal
Contributor

hamsal commented May 23, 2014

Hi @jnothman, does this PR need a rebase? If so, is it OK if I take the code here and rebase it on top of the current repository?

I will soon be working on #2458 to fill in the missing pieces and be responsive to comments/reviews of the implementation. I would like to make this PR, which is pretty much complete, a prerequisite because it will be easier to deprecate sequence of sequences now before #2458 is implemented.

Thanks!

@jnothman
Member Author

> Does this PR need a rebase? If so, is it OK if I take the code here and rebase it on top of the current repository?

Github suggests there shouldn't be any conflicts for a merge/rebase. Ideally, we should just merge this PR ASAP, but it needs another review.

@arjoly
Member

arjoly commented May 29, 2014

A last review would be very nice!

@kastnerkyle
Member

Checked this out, and had to rebase with master. Once that was done, I see 1 failing doctest

======================================================================
FAIL: Doctest: preprocessing.rst
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/volatile/accounts/kkastner/anaconda3/lib/python3.4/doctest.py", line 2193, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
nose.proxy.AssertionError: Failed doctest test for preprocessing.rst
  File "/volatile/accounts/kkastner/src/scikit-learn/doc/modules/preprocessing.rst", line 0

----------------------------------------------------------------------
File "/volatile/accounts/kkastner/src/scikit-learn/doc/modules/preprocessing.rst", line 386, in preprocessing.rst
Failed example:
    lb.classes_
Expected:
    array([1, 2, 3])
Got:
    array([1, 2, 3], dtype=object)

>>  raise self.failureException(self.format_failure(<_io.StringIO object at 0x2b298afcbc18>.getvalue()))


----------------------------------------------------------------------

Can you rebase with master? This seems like a Very Good Thing (TM)
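The doctest mismatch above comes down to how NumPy prints object arrays. A tiny reproduction, independent of scikit-learn:

```python
import numpy as np

int_classes = np.array([1, 2, 3])
obj_classes = np.array([1, 2, 3], dtype=object)

int_repr = repr(int_classes)  # what the doctest expected
obj_repr = repr(obj_classes)  # what an object-dtype classes_ prints
same_values = int_classes.tolist() == obj_classes.tolist()
```

The values are identical; only the dtype suffix in the repr differs, which is enough to fail an exact-output doctest.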

    classes = sorted(set(itertools.chain.from_iterable(y)))
else:
    classes = self.classes
self.classes_ = np.empty(len(classes), dtype=object)
Member

I would find it more natural to only use the object dtype for non integer y. That would fix the doctest failure.

Member

This could be implemented as (untested):

if all(isinstance(c, int) for c in classes):
    dtype = np.int
else:
    dtype = object
self.classes_ = np.asarray(classes, dtype=dtype)
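A standalone version of this suggestion that runs on current NumPy might look like the following. It uses the builtin `int`, since the `np.int` alias has been removed from recent NumPy releases; the function name `classes_array` is hypothetical.

```python
import numpy as np


def classes_array(classes):
    """Hypothetical helper: integer dtype for all-int classes, else object."""
    if all(isinstance(c, int) for c in classes):
        dtype = int
    else:
        dtype = object
    return np.asarray(classes, dtype=dtype)


int_classes = classes_array([1, 2, 3])
str_classes = classes_array(["comedy", "sci-fi"])
mixed = classes_array([1, "two"])
```

One caveat of this check: NumPy integer scalars such as `np.int64` are not instances of the builtin `int` on Python 3, so classes taken from a NumPy array would fall through to the object branch.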

Member

@jnothman @arjoly do you agree with this suggested change? If so, I can do it and then merge this PR.

Member

it's ok for me

Member Author

As long as we can assume consistent typing from the first element of the iterable of iterables.

Member Author

Sorry, got confused about context.

@ogrisel
Member

ogrisel commented Jun 5, 2014

I will merge #3246 as that includes the fix for the int dtype as soon as travis is green.

@ogrisel
Member

ogrisel commented Jun 5, 2014

Alright, merged! Thank you very much @jnothman for the fix.

@ogrisel ogrisel closed this Jun 5, 2014
@arjoly
Member

arjoly commented Jun 5, 2014

Great, this is finally done! Congratulations @jnothman on all your efforts!

@arjoly arjoly mentioned this pull request Jun 5, 2014
17 tasks
@jnothman
Member Author

jnothman commented Jun 5, 2014

Thanks for pulling this through.

@jnothman
Member Author

jnothman commented Jun 5, 2014

And I'm looking forward to the 0.15 beta. Is there sense in trying to include in it some support for a sparse multilabel format (e.g. sparse_output in LabelBinarizer, recognition in type_of_target and sparse support in metrics, all of which have been largely implemented)? The warnings may get annoying for users whose label spaces consume too much memory as dense indicators, when they have no recourse to a sparse alternative.

@amueller amueller modified the milestones: 0.16, 0.15 Jul 15, 2014