Skip to content

Conversation

cbrummitt
Copy link
Contributor

This PR fixes a bug for multi-label targets in StratifiedShuffleSplit. The solution being used now is the "label powerset" method: each sequence of labels is mapped to a string with str(row), which transforms a multi-label problem into a multi-class problem.

To see the source of the problem, note that len(str(np.arange(1000))) returns 4056 while len(str(np.arange(1001))) returns 36. The reason is that arrays with > 1000 elements are truncated with an ellipsis: str(np.arange(1001)) gives '[ 0 1 2 ..., 998 999 1000]'. Thus, for multi-label targets with > 1000 labels, samples are mapped onto the same short string whenever their first three values and last three values are the same, which is not the intended behavior.

The solution proposed here, discussed with @vene in this comment thread, is to use ' '.join(row.astype('str')) to convert each target to a string. We are guaranteed that we can do call .astype('str') on row because y = check_array(y, ensure_2d=False, dtype=None) converts y to a numpy array.

As an added benefit, this approach ends up being several faster than str(row) when len(row) < 1000:

In [1]: import numpy as np
In [2]: row = np.random.randint(0, 2, size=500)
In [3]: %timeit ' '.join(row.astype('str'))
169 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: %timeit str(row)
738 µs ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also considered using sklearn.utils.murmurhash3_32 (suggested by @vene as another option) but concluded it was tricky to coerce all the kinds of labels people might use into an int32 data type.

…ecause str(row) uses an ellipsis when len(row) > 1000
@vene
Copy link
Member

vene commented Oct 13, 2017

Awesome, thank you for catching this and solving it!

Could you also modify the relevant test so that it fails without the patch? Thanks!

@jnothman jnothman added this to the 0.19.1 milestone Oct 15, 2017
@lesteve
Copy link
Member

lesteve commented Oct 16, 2017

It would be nice to add a test.

@cbrummitt
Copy link
Contributor Author

Good idea @vene and @lesteve. I added a test for a y with > 1000 labels. I simply added a new function test_stratified_shuffle_split_multilabel_many_labels to test_split.py that fails on the old method and passes with this bug fix.

Is there anything else that would need to be done to hook up this test?

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jnothman jnothman changed the title Fix bug in StratifiedShuffleSplit for multi-label data with targets having > 1000 labels [MRG+1] Fix bug in StratifiedShuffleSplit for multi-label data with targets having > 1000 labels Oct 17, 2017
@lesteve
Copy link
Member

lesteve commented Oct 17, 2017

LGTM, merging, thanks a lot!

@lesteve lesteve merged commit d074e40 into scikit-learn:master Oct 17, 2017
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 17, 2017
…argets having > 1000 labels (scikit-learn#9922)

* Use ' '.join(row) for multi-label targets in StratifiedShuffleSplit because str(row) uses an ellipsis when len(row) > 1000
* Add a new test for multilabel problems with more than a thousand labels
@cbrummitt cbrummitt deleted the fix-multilabel-StratifiedShuffleSplit branch October 19, 2017 01:06
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…argets having > 1000 labels (scikit-learn#9922)

* Use ' '.join(row) for multi-label targets in StratifiedShuffleSplit because str(row) uses an ellipsis when len(row) > 1000
* Add a new test for multilabel problems with more than a thousand labels
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
…argets having > 1000 labels (scikit-learn#9922)

* Use ' '.join(row) for multi-label targets in StratifiedShuffleSplit because str(row) uses an ellipsis when len(row) > 1000
* Add a new test for multilabel problems with more than a thousand labels
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants