
[MRG+1] fix StratifiedShuffleSplit with 2d y #9044

Merged
merged 4 commits into scikit-learn:master on Jun 13, 2017

Conversation

7 participants
@vene
Member

vene commented Jun 7, 2017

Fixes #9037

What does this implement/fix?

StratifiedShuffleSplit gives unexpected results if given a 2d y, such as a multi-label or multi-task indicator matrix: basically the current behaviour implicitly flattens y, which makes the indices meaningless.

This fix computes the split indices by replacing each row of y with a unique hash. In the multi-label case this corresponds to considering the whole set of active labels as the label for that sample.

This is a WIP since I might have missed some use cases where this is not appropriate.
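
To make the intended behaviour concrete, here is a rough, self-contained sketch of the idea (illustrative only; these names are not the actual code in sklearn/model_selection/_split.py):

import numpy as np

y = np.array([[0, 1],
              [0, 1],
              [1, 0],
              [1, 1]])

# Before this fix, np.unique effectively saw the flattened values, so the
# "classes" used for stratification lost the row structure entirely:
np.unique(y)                               # array([0, 1])

# The fix maps each row to a single hashable label first, so stratification
# happens over whole rows, i.e. over label sets:
y_rows = np.array([str(row) for row in y])
np.unique(y_rows)                          # array(['[0 1]', '[1 0]', '[1 1]'], dtype='<U5')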

@jnothman
Member

jnothman commented Jun 7, 2017

#8913 shows that StratifiedShuffleSplit is broken even in the simple binary y case

@vene
Member

vene commented Jun 8, 2017

Indeed; however, I think the breakage is orthogonal to the one I'm addressing here.
I suspect this PR should fix #8670 but not #8913. We could aim for that in this PR too, or separate them.

@jnothman
Member

jnothman commented Jun 8, 2017

@vene
Member

vene commented Jun 8, 2017

I am now thinking they should be fixed separately, since this is much more trivial than #8913.

@jnothman
Member

jnothman commented Jun 8, 2017

Okay

@jnothman

Is this still WIP?

assert_array_equal(np.intersect1d(train, test), [])
# complete partition
assert_array_equal(np.union1d(train, test), np.arange(len(y)))

@vene

vene Jun 8, 2017

Member

I strengthened this unrelated test as well. (Here, because of the high number of classes, it's not trivial to check test proportions.)
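
For readers following along, a minimal, self-contained version of the kind of check discussed in this thread (a sketch, not the actual scikit-learn test; names and sizes here are made up):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0, 1, 2, 3] * 25)     # 100 samples, 4 classes
X = np.zeros((len(y), 1))           # features are irrelevant to the splitter
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
for train, test in sss.split(X, y):
    # train and test indices must not overlap
    assert np.intersect1d(train, test).size == 0
    # and together they must form a complete partition of the samples
    np.testing.assert_array_equal(np.union1d(train, test), np.arange(len(y)))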

@vene vene changed the title from [WIP] fix StratifiedShuffleSplit with 2d y to [MRG] fix StratifiedShuffleSplit with 2d y Jun 8, 2017

@vene
Member

vene commented Jun 8, 2017

@jnothman strengthened the test and changed to MRG

@jnothman

I'm a little concerned that the test only checks that stratification occurs over one column of y.

(review thread on sklearn/model_selection/_split.py, now outdated)
@ngoix
Contributor

ngoix commented Jun 8, 2017

LGTM

@vene
Member

vene commented Jun 8, 2017

I'm a little concerned that the test only checks that stratification occurs over one column of y.

See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".

The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b

EDIT: are you concerned about the strength of the test or about its readability for future development?

if y.ndim == 2:
    # for multi-label y, map each distinct row to its string repr:
    y = np.array([str(row) for row in y])

@vene

vene Jun 8, 2017

Member

@jnothman str is guaranteed to do the right thing. It even seems faster on large y, but it seems like it would take up more memory (I need to install %memit...)

In [10]: Y.shape
Out[10]: (1000, 10000)

In [11]: Y[:3, :3]
Out[11]:
array([[False,  True, False],
       [False,  True,  True],
       [False,  True,  True]], dtype=bool)

In [12]: %timeit [str(y) for y in Y]
10 loops, best of 3: 178 ms per loop

In [13]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 216 ms per loop

In [14]: Y = Y.astype(np.int)

In [15]: %timeit [str(y) for y in Y]
10 loops, best of 3: 180 ms per loop

In [16]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 580 ms per loop

@vene

vene Jun 8, 2017

Member

In [19]: %memit [hash(tuple(y)) for y in Y]
peak memory: 141.09 MiB, increment: 0.07 MiB

In [20]: %memit [str(y) for y in Y]
peak memory: 141.09 MiB, increment: 0.00 MiB

In [21]: Y = Y.astype(np.bool)

In [22]: %memit [hash(tuple(y)) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

In [23]: %memit [str(y) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

@raghavrv

raghavrv Jun 8, 2017

Member

Can you try our cython murmurhash? Maybe it will be faster? + it's optimized for numpy arrays (if you would consider each row as one?)

@cbrummitt

cbrummitt Sep 27, 2017

Contributor

@vene Are you sure str(row) is the right thing to do here? I am finding that for a row with many elements, str(row) ends up being a string representing just the first 3 and last 3 entries:
'[0 0 0 ..., 0 0 0]'
I have a sparse target matrix y that is very wide and has lots of zeros, so y = np.array([str(row) for row in y]) converts y to a very small number of strings that look like the one above. I think this behavior is not what was intended, right?

@vene

vene Sep 27, 2017

Member

@cbrummitt you're correct, this seems to trigger for vectors with over 1000 elements... Thanks for the catch!

Unfortunately repr has the same issue. A quick fix seems to be " ".join(x.astype('str')), which is even faster than str on arrays smaller than 1000 elements. Another one would be to use murmurhash3_32. Would you like to submit a fix?
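
For illustration, the truncation and the proposed workaround can be reproduced like this (a standalone sketch using numpy's default print options; the exact summarized string varies across numpy versions):

import numpy as np

row = np.zeros(2000, dtype=int)
row[5] = 1

str(row)                    # '[0 0 0 ..., 0 0 0]': summarized beyond numpy's
                            # default print threshold (1000 elements), so the 1 is lost
" ".join(row.astype(str))   # '0 0 0 0 0 1 0 ...': full, lossless representation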

@cbrummitt

cbrummitt Sep 29, 2017

Contributor

@vene I can try submitting a fix. It may take me several days to get around to it, so feel free to beat me to it if you want :)

@cbrummitt

cbrummitt Oct 13, 2017

Contributor

@vene Just submitted a PR: #9922

@raghavrv
Member

raghavrv commented Jun 8, 2017

@vene I think our sklearn.utils.murmurhash is faster...

>>> y = np.random.randint(0, 10, (2000, 1000))
>>> %timeit murmurhash3_32(y.astype(np.int32, copy=False), 32)
100 loops, best of 3: 15.4 ms per loop
@vene
Member

vene commented Jun 8, 2017

@raghavrv we have different machines; the difference is not that overwhelming, on my pc it's 108ms.

It's true that it seems faster; however, I think str is more correct here, because we don't want collisions and we don't care about the distribution.

The only point of this is to prevent np.unique from automatically converting it into a 2d array, like so:

In [24]: x = [(1, 2), (1, 2), (3, 4)]

In [25]: np.unique(x)
Out[25]: array([1, 2, 3, 4])

In [26]: np.unique([str(x_) for x_ in x])
Out[26]:
array(['(1, 2)', '(3, 4)'],
      dtype='<U6')
@vene
Member

vene commented Jun 8, 2017

@jnothman, what is your preference between str and murmurhash3_32?

@jnothman
Member

jnothman commented Jun 8, 2017

I'm a little concerned that the test only checks that stratification occurs over one column of y.

See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".
The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b

I realise. I'm still not sure it's sufficient, wrt the strength of the test.

@raghavrv
Member

raghavrv commented Jun 8, 2017

@raghavrv we have different machines; the difference is not that overwhelming, on my pc it's 108ms.

Ah that was dumb of me! Thx :)

@vene
Member

vene commented Jun 8, 2017

@jnothman what do you suggest? What other cases do you think I should try to cover?

@jnothman
Member

jnothman commented Jun 9, 2017

@vene
Member

vene commented Jun 9, 2017

I think uniformity is not important, because the output of this modified y I'm building is only ever seen by np.unique, then used to compute the splits, then discarded.

However, collisions could cause subtle issues.

str(y) is guaranteed to have no collisions but it's a bit slower
murmurhash3_32(y) is fast but it's not clear to me if it can have collisions in this setting (my guess would be yes)
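
For reference, a rough sketch of what the murmurhash alternative being weighed here might look like (hashing each row's raw bytes; this is illustrative only, not what was merged, and a 32-bit hash can in principle collide):

import numpy as np
from sklearn.utils import murmurhash3_32

Y = np.random.randint(0, 2, size=(1000, 50)).astype(np.int32)

# map each row to a single 32-bit hash, then stratify on those values
y_hashed = np.array([murmurhash3_32(row.tobytes(), seed=0) for row in Y])
classes, y_indices = np.unique(y_hashed, return_inverse=True)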

@jnothman
Member

jnothman commented Jun 9, 2017
@jnothman jnothman added this to the 0.19 milestone Jun 12, 2017

@jnothman

LGTM

@jnothman jnothman changed the title from [MRG] fix StratifiedShuffleSplit with 2d y to [MRG+1] fix StratifiedShuffleSplit with 2d y Jun 12, 2017

@MechCoder MechCoder merged commit 93563b0 into scikit-learn:master Jun 13, 2017

2 of 3 checks passed

continuous-integration/appveyor/pr: AppVeyor build failed
ci/circleci: Your tests passed on CircleCI!
continuous-integration/travis-ci/pr: The Travis CI build passed
@MechCoder
Member

MechCoder commented Jun 13, 2017

Pretty neat. Thank you @vene !

@vene
Member

vene commented Jun 13, 2017

thanks @MechCoder!

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] fix StratifiedShuffleSplit with 2d y (scikit-learn#9044)
* regression test and fix for 2d stratified shuffle split

* strengthen non-overlap sss tests

* clarify test and comment

* remove iter from tests, use str instead of hash

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

dmohns added a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017

NelleV added a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

AishwaryaRK added a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017

@rdvelazquez rdvelazquez referenced this pull request Sep 26, 2017

Merged

Revise parameter grid #114

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017

@amueller
Member

amueller commented Mar 23, 2018

I'm confused by this. Why is this the right thing to do? Shouldn't we just error? It's pretty unclear what stratification means in a multi-label setup, right?
