
[MRG+1] fix StratifiedShuffleSplit with 2d y #9044

Merged: 4 commits into scikit-learn:master, Jun 13, 2017

Conversation

@vene (Member) commented Jun 7, 2017

Fixes #9037

What does this implement/fix?

StratifiedShuffleSplit gives unexpected results if given a 2d y, such as a multi-label or multi-task indicator matrix: basically the current behaviour implicitly flattens y, which makes the indices meaningless.

This fix computes the split indices by replacing each row of y with a unique hash. In the multi-label case this corresponds to considering the whole set of active labels as the label for that sample.
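A minimal sketch of the idea (the helper name is hypothetical, and this is not the exact merged code):

import numpy as np

def rows_to_tokens(y):
    # hypothetical helper: collapse a 2d multi-label indicator matrix
    # into one hashable token per row, so that each distinct set of
    # active labels acts as a single class for stratification
    y = np.asarray(y)
    if y.ndim == 2:
        return np.array([hash(tuple(row)) for row in y])
    return y

y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])
print(np.unique(rows_to_tokens(y)))  # two distinct tokens, i.e. two "classes"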

This is a WIP since I might have missed some use cases where this is not appropriate.

@@ -667,6 +667,15 @@ def test_stratified_shuffle_split_overlap_train_test_bug():
assert_array_equal(np.intersect1d(train, test), [])


def test_stratified_shuffle_split_multilabel():
    # issue 9037
    for y in [np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),

@amueller (Member), Jun 7, 2017:

we don't support lists of labels right now, right?

@vene (Author, Member), Jun 7, 2017:

not afaik, why? This is supposed to be a multi-label indicator Y.

@vene (Author, Member), Jun 8, 2017:

I could build it explicitly with a (Multi)LabelBinarizer if it makes the test clearer?
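For reference, a sketch of how that construction could look (the label sets below are chosen to reproduce the first indicator matrix used in the test):

from sklearn.preprocessing import MultiLabelBinarizer

# each sample's set of active labels; binarizing yields
# the indicator matrix [[0, 1], [1, 0], [1, 0], [0, 1]]
label_sets = [{1}, {0}, {0}, {1}]
y = MultiLabelBinarizer().fit_transform(label_sets)
print(y)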

@jnothman (Member) commented Jun 7, 2017

#8913 shows that StratifiedShuffleSplit is broken even in the simple binary y case

@vene (Author, Member) commented Jun 8, 2017

Indeed; however, I think the breakage is orthogonal to the one I'm addressing here. I suspect this PR should fix #8670 but not #8913. We could aim for that in this PR too, or keep them separate.

@jnothman (Member) commented Jun 8, 2017

@vene (Author, Member) commented Jun 8, 2017

I am now thinking they should be fixed separately, since this is much more trivial than #8913.

@jnothman (Member) commented Jun 8, 2017

Okay

@jnothman (Member) left a comment

Is this still WIP?

X = np.ones_like(y)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(iter(sss.split(X=X, y=y)))
assert_array_equal(np.intersect1d(train, test), [])

@jnothman (Member), Jun 8, 2017:

we should also test that we produce a complete partition in test, and that it has the right distribution.

assert_array_equal(np.intersect1d(train, test), [])

# complete partition
assert_array_equal(np.union1d(train, test), np.arange(len(y)))
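A distribution check could look something like this (an illustrative sketch continuing the snippet above, not the assertion that landed in the test suite; the axis argument to np.unique requires numpy >= 1.13):

# with test_size=0.5, each distinct row (class) should contribute
# half of its samples to the training set
_, counts_full = np.unique(y, axis=0, return_counts=True)
_, counts_train = np.unique(y[train], axis=0, return_counts=True)
assert_array_equal(counts_train * 2, counts_full)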

@vene (Author, Member), Jun 8, 2017:

I strengthened this unrelated test as well. (Here, because of the high number of classes, it's not trivial to check the test proportions.)

@vene changed the title from "[WIP] fix StratifiedShuffleSplit with 2d y" to "[MRG] fix StratifiedShuffleSplit with 2d y" on Jun 8, 2017
@vene (Author, Member) commented Jun 8, 2017

@jnothman strengthened the test and changed to MRG


# correct stratification of entire rows
# (by design, here y[:, 0] uniquely determines the entire row of y)
assert_equal(np.mean(y[train, 0]), 0.5)

@jnothman (Member), Jun 8, 2017:

I'm tired, but I'm finding these assertions hard to understand

@vene (Author, Member), Jun 8, 2017:

good point

@jnothman (Member) left a comment

I'm a little concerned that the test only checks that stratification occurs over one column of y.


if y.ndim == 2:
    # for multi-label y, map each distinct row to a unique identifier:
    y = np.array([hash(tuple(row)) for row in y])

@jnothman (Member), Jun 8, 2017:

I had meant to point out that Python's hash is far from a uniform/unique hash, but I don't think that's a big concern here.

@vene (Author, Member), Jun 8, 2017:

Uniformity certainly doesn't matter here, but uniqueness would.

However, these should all be tuples of integers, which I hope Python handles well. Another alternative would be [str(row) for row in y], which should guarantee the desired behaviour. What do you think?

@vene (Author, Member), Jun 8, 2017:

(here is the implementation of tuplehash)

@ngoix (Contributor) commented Jun 8, 2017

LGTM

np.array([[0, 1], [1, 1], [1, 1], [0, 1]])]:
X = np.ones_like(y)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(iter(sss.split(X=X, y=y)))

@ngoix (Contributor), Jun 8, 2017:

do you really need an iter here?

@vene (Author, Member), Jun 8, 2017:

good catch. I think it used to be this way long ago :)

@vene (Author, Member) commented Jun 8, 2017

I'm a little concerned that the test only checks that stratification occurs over one column of y.

See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".

The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b.
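That property is easy to check mechanically (an illustrative sketch):

import numpy as np

for y in (np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),
          np.array([[0, 1], [1, 1], [1, 1], [0, 1]])):
    for a in y:
        for b in y:
            # the first column determines the whole row:
            # a[0] == b[0] if and only if a == b elementwise
            assert (a[0] == b[0]) == np.array_equal(a, b)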

EDIT: are you concerned about the strength of the test or about its readability for future development?


if y.ndim == 2:
    # for multi-label y, map each distinct row to its string repr:
    y = np.array([str(row) for row in y])

@vene (Author, Member), Jun 8, 2017:

@jnothman str is guaranteed to do the right thing. It even seems faster on large y, but it seems like it would take up more memory (I need to install memit...)


In [10]: Y.shape
Out[10]: (1000, 10000)

In [11]: Y[:3, :3]
Out[11]:
array([[False,  True, False],
       [False,  True,  True],
       [False,  True,  True]], dtype=bool)

In [12]: %timeit [str(y) for y in Y]
10 loops, best of 3: 178 ms per loop

In [13]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 216 ms per loop

In [14]: Y = Y.astype(np.int)

In [15]: %timeit [str(y) for y in Y]
10 loops, best of 3: 180 ms per loop

In [16]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 580 ms per loop

@vene (Author, Member), Jun 8, 2017:

In [19]: %memit [hash(tuple(y)) for y in Y]
peak memory: 141.09 MiB, increment: 0.07 MiB

In [20]: %memit [str(y) for y in Y]
peak memory: 141.09 MiB, increment: 0.00 MiB

In [21]: Y = Y.astype(np.bool)

In [22]: %memit [hash(tuple(y)) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

In [23]: %memit [str(y) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

@raghavrv (Member), Jun 8, 2017:

Can you try our Cython murmurhash? Maybe it will be faster? Plus, it's optimized for numpy arrays (if you consider each row as one).

@cbrummitt (Contributor), Sep 27, 2017:

@vene Are you sure str(row) is the right thing to do here? I am finding that for a row with many elements, str(row) ends up being a string representing just the first 3 and last 3 entries:
'[0 0 0 ..., 0 0 0]'
I have a sparse target matrix y that is very wide and has lots of zeros, so y = np.array([str(row) for row in y]) converts y to a very small number of strings that look like the one above. I think this behavior is not what was intended, right?
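This is straightforward to reproduce: numpy summarizes the printed form of arrays longer than its print threshold (1000 elements by default), so str(row) is no longer unique per row:

import numpy as np

row = np.zeros(1001, dtype=int)
print(str(row))  # summarized: '[0 0 0 ... 0 0 0]'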

@vene (Author, Member), Sep 27, 2017:

@cbrummitt you're correct, this seems to trigger for vectors with over 1000 elements... Thanks for the catch!

Unfortunately repr has the same issue. A quick fix seems to be " ".join(x.astype('str')), which is even faster than str on arrays smaller than 1000 elements. Another one would be to use murmurhash3_32. Would you like to submit a fix?
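A sketch of that workaround (with the closing parenthesis restored):

import numpy as np

row = np.zeros(1001, dtype=int)
token = " ".join(row.astype(str))  # full representation, never summarized
assert len(token.split()) == 1001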

@cbrummitt (Contributor), Sep 29, 2017:

@vene I can try submitting a fix. It may take me several days to get around to it, so feel free to beat me to it if you want :)

@cbrummitt (Contributor), Oct 13, 2017:

@vene Just submitted a PR: #9922


if y.ndim == 2:
    # for multi-label y, map each distinct row to a unique identifier:
    y = np.array([hash(tuple(row)) for row in y])

@raghavrv (Member), Jun 8, 2017:

This is neat ;)

@raghavrv (Member) commented Jun 8, 2017

@vene I think our sklearn.utils.murmurhash is faster...

>>> y = np.random.randint(0, 10, (2000, 1000))
>>> %timeit murmurhash3_32(y.astype(np.int32, copy=False), 32)
100 loops, best of 3: 15.4 ms per loop

@vene (Author, Member) commented Jun 8, 2017

@raghavrv we have different machines; the difference is not that overwhelming: on my PC it's 108 ms.

It's true that it seems faster, however, I think str is more correct here, because we don't want collisions and we don't care about the distribution.

The only point of this is to prevent np.unique from automatically flattening it into an array of scalar elements, like so:


In [24]: x = [(1, 2), (1, 2), (3, 4)]

In [25]: np.unique(x)
Out[25]: array([1, 2, 3, 4])

In [26]: np.unique([str(x_) for x_ in x])
Out[26]:
array(['(1, 2)', '(3, 4)'],
      dtype='<U6')

@vene (Author, Member) commented Jun 8, 2017

@jnothman, what is your preference between str or murmurhash3_32?

@jnothman (Member) commented Jun 8, 2017

I'm a little concerned that the test only checks that stratification occurs over one column of y.

See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".
The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b

I realise. I'm still not sure it's sufficient, wrt the strength of the test.

@raghavrv (Member) commented Jun 8, 2017

@raghavrv we have different machines; the difference is not that overwhelming, on my pc it's 108ms.

Ah that was dumb of me! Thx :)

@vene (Author, Member) commented Jun 8, 2017

@jnothman what do you suggest? What other cases do you think I should try to cover?

@jnothman (Member) commented Jun 9, 2017

@vene (Author, Member) commented Jun 9, 2017

I think uniformity is not important, because the output of this modified y I'm building is only ever seen by np.unique, then used to compute the splits, then discarded.

However, collisions could cause subtle issues.

- str(y) is guaranteed to have no collisions, but it's a bit slower.
- murmurhash3_32(y) is fast, but it's not clear to me whether it can have collisions in this setting (my guess would be yes; see the rough estimate below).
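A rough birthday-bound estimate, assuming murmurhash3_32 behaves like a uniform 32-bit hash (my numbers, not from the thread):

import math

# P(at least one collision among n distinct rows) ~= 1 - exp(-n*(n-1)/2**33)
for n in (1000, 100000, 1000000):
    p = 1.0 - math.exp(-n * (n - 1) / 2.0 ** 33)
    print(n, round(p, 4))

By this estimate a collision becomes more likely than not somewhere around 10^5 distinct rows, which supports preferring the collision-free string representation.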

@jnothman (Member) commented Jun 9, 2017

@jnothman added this to the 0.19 milestone on Jun 12, 2017
@jnothman (Member) left a comment

LGTM

@jnothman changed the title from "[MRG] fix StratifiedShuffleSplit with 2d y" to "[MRG+1] fix StratifiedShuffleSplit with 2d y" on Jun 12, 2017
@MechCoder merged commit 93563b0 into scikit-learn:master on Jun 13, 2017
2 of 3 checks passed:
- continuous-integration/appveyor/pr: AppVeyor build failed
- ci/circleci: Your tests passed on CircleCI!
- continuous-integration/travis-ci/pr: The Travis CI build passed
@MechCoder (Member) commented Jun 13, 2017

Pretty neat. Thank you @vene !

@vene (Author, Member) commented Jun 13, 2017

thanks @MechCoder!

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* regression test and fix for 2d stratified shuffle split

* strengthen non-overlap sss tests

* clarify test and comment

* remove iter from tests, use str instead of hash
The same four commits were also referenced by:
- dmohns (dmohns/scikit-learn), Aug 7, 2017 (twice)
- NelleV (NelleV/scikit-learn), Aug 11, 2017
- paulha (paulha/scikit-learn), Aug 19, 2017
- AishwaryaRK (AishwaryaRK/scikit-learn), Aug 29, 2017
- maskani-moh (maskani-moh/scikit-learn), Nov 15, 2017
- jwjohnson314 (jwjohnson314/scikit-learn, pushed), Dec 18, 2017
@amueller (Member) commented Mar 23, 2018

I'm confused by this. Why is this the right thing to do? Shouldn't we just error? It's pretty unclear what stratification means in a multi-label setup, right?
