[MRG+1] fix StratifiedShuffleSplit with 2d y #9044


Merged
MechCoder merged 4 commits into scikit-learn:master on Jun 13, 2017

Conversation

@vene (Member) commented Jun 7, 2017

Fixes #9037

What does this implement/fix?

StratifiedShuffleSplit gives unexpected results when given a 2d y, such as a multi-label or multi-task indicator matrix: the current behaviour implicitly flattens y, which makes the resulting split indices meaningless.

This fix computes the split indices by replacing each row of y with a unique hash. In the multi-label case this corresponds to considering the whole set of active labels as the label for that sample.

This is a WIP since I might have missed some use cases where this is not appropriate.
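
For concreteness, here is the gist of the approach as a minimal standalone sketch (names are illustrative, not the exact patch):

import numpy as np

y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])  # multi-label indicator matrix
# Collapse each row into one hashable identifier, so the existing 1d
# stratification logic sees a single "class" per distinct label combination:
y_1d = np.array([hash(tuple(row)) for row in y])
print(np.unique(y_1d, return_counts=True))  # two row-classes, two samples each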

@@ -667,6 +667,15 @@ def test_stratified_shuffle_split_overlap_train_test_bug():
assert_array_equal(np.intersect1d(train, test), [])


def test_stratified_shuffle_split_multilabel():
# issue 9037
for y in [np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),
Member:
we don't support list of labels right now, right?

Member Author:
not afaik, why? This is supposed to be a multi-label indicator Y.

Member Author:
I could build it explicitly with a (Multi)LabelBinarizer if it makes the test clearer?
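
(For reference, a sketch of how that could look; MultiLabelBinarizer is the existing scikit-learn transformer, and the merged test kept the literal arrays instead:)

from sklearn.preprocessing import MultiLabelBinarizer

# Samples 0 and 3 carry label 1; samples 1 and 2 carry label 0.
Y = MultiLabelBinarizer().fit_transform([[1], [0], [0], [1]])
# array([[0, 1], [1, 0], [1, 0], [0, 1]]) -- same as the first literal y above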

@jnothman (Member) commented Jun 7, 2017

#8913 shows that StratifiedShuffleSplit is broken even in the simple binary y case

@vene (Member Author) commented Jun 8, 2017

Indeed; however, I think that breakage is orthogonal to the one I'm addressing here.
I suspect this PR should fix #8670 but not #8913. We could aim for that in this PR too, or separate them.

@jnothman (Member) commented Jun 8, 2017 via email

@vene (Member Author) commented Jun 8, 2017

I am now thinking they should be fixed separately, since this is much more trivial than #8913.

@jnothman (Member) commented Jun 8, 2017

Okay

@jnothman (Member) left a comment

Is this still WIP?

X = np.ones_like(y)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(iter(sss.split(X=X, y=y)))
assert_array_equal(np.intersect1d(train, test), [])
Member:

we should also test that we produce a complete partition in test, and that it has the right distribution.

assert_array_equal(np.intersect1d(train, test), [])

# complete partition
assert_array_equal(np.union1d(train, test), np.arange(len(y)))
Member Author:

I strengthened this unrelated test as well. (Here, because of the high number of classes, it is not trivial to check the test proportions.)

@vene changed the title from [WIP] fix StratifiedShuffleSplit with 2d y to [MRG] fix StratifiedShuffleSplit with 2d y on Jun 8, 2017
@vene (Member Author) commented Jun 8, 2017

@jnothman I strengthened the test and changed the title to MRG.


# correct stratification of entire rows
# (by design, here y[:, 0] uniquely determines the entire row of y)
assert_equal(np.mean(y[train, 0]), 0.5)
Member:

I'm tired, but I'm finding these assertions hard to understand

Member Author:

good point

@jnothman (Member) left a comment

I'm a little concerned that the test only checks that stratification occurs over one column of y.


if y.ndim == 2:
# for multi-label y, map each distinct row to a unique identifier:
y = np.array([hash(tuple(row)) for row in y])
Member:

I had meant to point out that Python's hash is far from a uniform/unique hash, but I don't think that's a big concern here.

Member Author:

Uniformity certainly doesn't matter here, but uniqueness would.

However, these should all be tuples of integers, which I hope Python handles well. Another alternative would be [str(row) for row in y], which should guarantee the desired behaviour. What do you think?
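
(A quick illustration of the two candidate encodings on small rows; illustrative only, and note the truncation caveat that surfaces later in this thread:)

import numpy as np

y = np.array([[0, 1], [1, 0]])
[hash(tuple(row)) for row in y]  # two ints; distinct here, though hash collisions are not formally excluded
[str(row) for row in y]          # ['[0 1]', '[1 0]'] -- distinct strings, as long as numpy prints the full row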

Member Author:

(here is the implementation of tuplehash)

@ngoix (Contributor) commented Jun 8, 2017

LGTM

np.array([[0, 1], [1, 1], [1, 1], [0, 1]])]:
X = np.ones_like(y)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(iter(sss.split(X=X, y=y)))
Contributor:

do you really need an iter here?

Member Author:

good catch. I think it used to be this way long ago :)

@vene (Member Author) commented Jun 8, 2017

> I'm a little concerned that the test only checks that stratification occurs over one column of y.

See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".

The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b

EDIT: are you concerned about the strength of the test or about its readability for future development?
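
(A hypothetical check, not part of the test suite, that both test matrices satisfy this property:)

import numpy as np
from itertools import combinations

for y in [np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),
          np.array([[0, 1], [1, 1], [1, 1], [0, 1]])]:
    for a, b in combinations(y, 2):
        assert (a[0] == b[0]) == np.array_equal(a, b)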


if y.ndim == 2:
# for multi-label y, map each distinct row to its string repr:
y = np.array([str(row) for row in y])
Member Author:

@jnothman str is guaranteed to do the right thing. It even seems faster on large y, but it seems like it would take up more memory (I need to install memit...)


In [10]: Y.shape
Out[10]: (1000, 10000)

In [11]: Y[:3, :3]
Out[11]:
array([[False,  True, False],
       [False,  True,  True],
       [False,  True,  True]], dtype=bool)

In [12]: %timeit [str(y) for y in Y]
10 loops, best of 3: 178 ms per loop

In [13]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 216 ms per loop

In [14]: Y = Y.astype(np.int)

In [15]: %timeit [str(y) for y in Y]
10 loops, best of 3: 180 ms per loop

In [16]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 580 ms per loop

@vene (Member Author), Jun 8, 2017:

In [19]: %memit [hash(tuple(y)) for y in Y]
peak memory: 141.09 MiB, increment: 0.07 MiB

In [20]: %memit [str(y) for y in Y]
peak memory: 141.09 MiB, increment: 0.00 MiB

In [21]: Y = Y.astype(np.bool)

In [22]: %memit [hash(tuple(y)) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

In [23]: %memit [str(y) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

Member:

Can you try our Cython murmurhash? Maybe it will be faster; it's also optimized for numpy arrays (if you treat each row as one?).

Contributor:

@vene Are you sure str(row) is the right thing to do here? I am finding that for a row with many elements, str(row) ends up being a string representing just the first 3 and last 3 entries:
'[0 0 0 ..., 0 0 0]'
I have a sparse target matrix y that is very wide and has lots of zeros, so y = np.array([str(row) for row in y]) converts y to a very small number of strings that look like the one above. I think this behavior is not what was intended, right?
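
(This is easy to reproduce: numpy abbreviates the printed form of arrays beyond its print threshold, 1000 elements by default, so distinct wide rows can collapse to the same string. A minimal repro; exact formatting varies across numpy versions:)

import numpy as np

row = np.zeros(2000, dtype=int)
str(row)  # '[0 0 0 ..., 0 0 0]' -- the middle 1994 entries are elided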

Member Author:

@cbrummitt you're correct, this seems to trigger for vectors with over 1000 elements... Thanks for the catch!

Unfortunately repr has the same issue. A quick fix seems to be " ".join(x.astype('str')), which is even faster than str on arrays smaller than 1000 elements. Another option would be to use murmurhash3_32. Would you like to submit a fix?
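
(A sketch of that quick fix; the actual change landed in #9922. Joining every element explicitly means nothing is elided, regardless of row width:)

import numpy as np

row = np.zeros(2000, dtype=int)
token = " ".join(row.astype(str))
len(token)  # 3999: all 2000 elements are represented, so distinct rows stay distinct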

Contributor:

@vene I can try submitting a fix. It may take me several days to get around to it, so feel free to beat me to it if you want :)

Contributor:

@vene Just submitted a PR: #9922


if y.ndim == 2:
# for multi-label y, map each distinct row to a unique identifier:
y = np.array([hash(tuple(row)) for row in y])
Member:

This is neat ;)

@raghavrv (Member) commented Jun 8, 2017

@vene I think our sklearn.utils.murmurhash is faster...

>>> y = np.random.randint(0, 10, (2000, 1000))
>>> %timeit murmurhash3_32(y.astype(np.int32, copy=False), 32)
100 loops, best of 3: 15.4 ms per loop

@vene (Member Author) commented Jun 8, 2017

@raghavrv we have different machines; the difference is not that overwhelming: on my PC it's 108 ms.

It's true that it seems faster; however, I think str is more correct here, because we don't want collisions and we don't care about the distribution.

The only point of this step is to prevent np.unique from treating the rows as a 2d array and returning unique scalar values instead of unique rows, like so:


In [24]: x = [(1, 2), (1, 2), (3, 4)]

In [25]: np.unique(x)
Out[25]: array([1, 2, 3, 4])

In [26]: np.unique([str(x_) for x_ in x])
Out[26]:
array(['(1, 2)', '(3, 4)'],
      dtype='<U6')

@vene (Member Author) commented Jun 8, 2017

@jnothman, what is your preference between str and murmurhash3_32?

@jnothman (Member) commented Jun 8, 2017

> I'm a little concerned that the test only checks that stratification occurs over one column of y.

> See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".
> The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b.

I realise. I'm still not sure it's sufficient, wrt the strength of the test.

@raghavrv (Member) commented Jun 8, 2017

> @raghavrv we have different machines; the difference is not that overwhelming: on my PC it's 108 ms.

Ah that was dumb of me! Thx :)

@vene (Member Author) commented Jun 8, 2017

@jnothman what do you suggest? What other cases do you think I should try to cover?

@jnothman (Member) commented Jun 9, 2017 via email

@vene (Member Author) commented Jun 9, 2017

I think uniformity is not important, because the output of this modified y I'm building is only ever seen by np.unique, then used to compute the splits, then discarded.

However, collisions could cause subtle issues.

* str(y) is guaranteed to have no collisions, but it's a bit slower.
* murmurhash3_32(y) is fast, but it's not clear to me whether it can have collisions in this setting (my guess would be yes).
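
(For illustration, a per-row use of murmurhash3_32 might look like the sketch below; it returns a 32-bit digest, which is exactly why collisions cannot be ruled out as the number of distinct rows grows:)

import numpy as np
from sklearn.utils import murmurhash3_32

y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=np.int32)
tokens = np.array([murmurhash3_32(row.tobytes()) for row in y])  # one 32-bit hash per row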

@jnothman (Member) commented Jun 9, 2017 via email

@jnothman added this to the 0.19 milestone on Jun 12, 2017
@jnothman (Member) left a comment

LGTM

@jnothman changed the title from [MRG] fix StratifiedShuffleSplit with 2d y to [MRG+1] fix StratifiedShuffleSplit with 2d y on Jun 12, 2017
@MechCoder merged commit 93563b0 into scikit-learn:master on Jun 13, 2017
@MechCoder (Member):

Pretty neat. Thank you @vene !

@vene (Member Author) commented Jun 13, 2017

thanks @MechCoder!

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* regression test and fix for 2d stratified shuffle split

* strengthen non-overlap sss tests

* clarify test and comment

* remove iter from tests, use str instead of hash
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
@amueller (Member):
I'm confused by this. Why is this the right thing to do? Shouldn't we just error? It's pretty unclear what stratification means in a multi-label setup, right?

Successfully merging this pull request may close these issues:

StratifiedShuffleSplit generates overlapping train and test indices for multilabel data
7 participants