[MRG+1] Fix bug in StratifiedShuffleSplit for multi-label data with targets having > 1000 labels #9922

cbrummitt · 2017-10-13T18:34:15Z

This PR fixes a bug for multi-label targets in StratifiedShuffleSplit. The current implementation uses the "label powerset" method: each row of labels is mapped to a string with str(row), which transforms a multi-label problem into a multi-class problem.

To see the source of the problem, note that len(str(np.arange(1000))) returns 4056 while len(str(np.arange(1001))) returns 36. The reason is that arrays with > 1000 elements are truncated with an ellipsis: str(np.arange(1001)) gives '[ 0 1 2 ..., 998 999 1000]'. Thus, for multi-label targets with > 1000 labels, samples are mapped onto the same short string whenever their first three values and last three values are the same, which is not the intended behavior.
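The collision described above can be reproduced with a minimal sketch. It assumes NumPy's default print options (summarization threshold of 1000 elements); the variable names are illustrative and not taken from the scikit-learn source.

```python
import numpy as np

# Two label rows of length 1001 that differ only in the middle position.
row_a = np.zeros(1001, dtype=int)
row_b = np.zeros(1001, dtype=int)
row_b[500] = 1

# With default print options, arrays with more than 1000 elements are
# summarized, so only the first and last few values reach the string.
key_a = str(row_a)
key_b = str(row_b)

rows_differ = not np.array_equal(row_a, row_b)
keys_collide = key_a == key_b
has_ellipsis = "..." in key_a
print(rows_differ, keys_collide, has_ellipsis)
```

Both rows end up with the same string key even though they are different label sets, so they would be treated as one class during stratification.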

The solution proposed here, discussed with @vene in this comment thread, is to use ' '.join(row.astype('str')) to convert each target to a string. We are guaranteed that we can call .astype('str') on row because y = check_array(y, ensure_2d=False, dtype=None) converts y to a numpy array.
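A rough sketch of how the joined-string keys behave on rows with more than 1000 labels. The helper name rows_to_keys is hypothetical, introduced only for illustration; the np.unique step is one plausible way to turn the keys into class indices and is an assumption, not a quote from the patch.

```python
import numpy as np


def rows_to_keys(y):
    """Map each row of a 2-D label array to one string (hypothetical helper)."""
    return np.array([" ".join(row.astype("str")) for row in y])


# Four samples with 1001 labels each; rows 1 and 3 differ from rows 0 and 2
# only in the middle position, which str(row) would hide behind an ellipsis.
y = np.zeros((4, 1001), dtype=int)
y[1, 500] = 1
y[3, 500] = 1

keys = rows_to_keys(y)

# Treat each distinct key as one class of a multi-class problem.
classes, y_indices = np.unique(keys, return_inverse=True)
print(len(classes), y_indices.tolist())
```

Because every element is written into the key, rows that differ anywhere map to different classes, regardless of row length.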

As an added benefit, this approach ends up being several times faster than str(row) when len(row) < 1000:

In [1]: import numpy as np
In [2]: row = np.random.randint(0, 2, size=500)
In [3]: %timeit ' '.join(row.astype('str'))
169 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [4]: %timeit str(row)
738 µs ± 23.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I also considered using sklearn.utils.murmurhash3_32 (suggested by @vene as another option) but concluded it was tricky to coerce all the kinds of labels people might use into an int32 data type.

vene · 2017-10-13T19:22:05Z

Awesome, thank you for catching this and solving it!

Could you also modify the relevant test so that it fails without the patch? Thanks!

lesteve · 2017-10-16T09:22:22Z

It would be nice to add a test.

cbrummitt · 2017-10-16T14:17:53Z

Good idea @vene and @lesteve. I added a test for a y with > 1000 labels. I simply added a new function test_stratified_shuffle_split_multilabel_many_labels to test_split.py that fails on the old method and passes with this bug fix.

Is there anything else that would need to be done to hook up this test?
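For readers who want to try the behaviour themselves, here is a hedged sketch of a check in the same spirit. It is not the test added in this PR; the data layout, variable names, and split parameters are assumptions chosen for illustration, and it assumes a scikit-learn version that includes this fix.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Two distinct label rows with 1006 labels each. They share the first and
# last three values, so a truncated str(row) could not tell them apart.
row_zeros = [1, 0, 1] + [0] * 1000 + [1, 0, 1]
row_ones = [1, 0, 1] + [1] * 1000 + [1, 0, 1]
y = np.array([row_zeros] * 10 + [row_ones] * 100)
X = np.ones_like(y)

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(sss.split(X, y))

# Column 4 alone identifies which of the two rows a sample is, so its mean
# gives the class ratio in each subset.
expected_ratio = np.mean(y[:, 4])
train_ratio = np.mean(y[train][:, 4])
test_ratio = np.mean(y[test][:, 4])
print(expected_ratio, train_ratio, test_ratio)
```

If whole rows are stratified correctly, the train and test ratios should match the ratio in the full data set.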

jnothman

LGTM

lesteve · 2017-10-17T07:44:20Z

LGTM, merging, thanks a lot!

…argets having > 1000 labels (scikit-learn#9922) * Use ' '.join(row) for multi-label targets in StratifiedShuffleSplit because str(row) uses an ellipsis when len(row) > 1000 * Add a new test for multilabel problems with more than a thousand labels

Use ' '.join(row) for multi-label targets in StratifiedShuffleSplit b…

1dbebc1

…ecause str(row) uses an ellipsis when len(row) > 1000

cbrummitt mentioned this pull request Oct 13, 2017

[MRG+1] fix StratifiedShuffleSplit with 2d y #9044

Merged

jnothman added this to the 0.19.1 milestone Oct 15, 2017

Add a new test for multilabel problems with more than a thousand labels

233c0c6

Change tabs to four spaces

66f8250

jnothman approved these changes Oct 16, 2017

jnothman changed the title from "Fix bug in StratifiedShuffleSplit for multi-label data with targets having > 1000 labels" to "[MRG+1] Fix bug in StratifiedShuffleSplit for multi-label data with targets having > 1000 labels" Oct 17, 2017

lesteve merged commit d074e40 into scikit-learn:master Oct 17, 2017

cbrummitt deleted the fix-multilabel-StratifiedShuffleSplit branch October 19, 2017 01:06
