[MRG+1] fix StratifiedShuffleSplit with 2d y #9044


Merged
MechCoder merged 4 commits into scikit-learn:master on Jun 13, 2017

Conversation

@vene (Member) commented Jun 7, 2017

Fixes #9037

What does this implement/fix?

StratifiedShuffleSplit gives unexpected results when given a 2d y, such as a multi-label or multi-task indicator matrix: the current behaviour implicitly flattens y, which makes the resulting split indices meaningless.

This fix computes the split indices by replacing each row of y with a unique hash. In the multi-label case this corresponds to considering the whole set of active labels as the label for that sample.

This is a WIP since I might have missed some use cases where this is not appropriate.
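
For concreteness, here is the gist of the approach as a minimal standalone sketch (names are illustrative, not the exact patch):

import numpy as np

y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])  # multi-label indicator matrix
# Collapse each row into one hashable identifier, so the existing 1d
# stratification logic sees a single "class" per distinct label combination:
y_1d = np.array([hash(tuple(row)) for row in y])
print(np.unique(y_1d, return_counts=True))  # two row-classes, two samples each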

@@ -667,6 +667,15 @@ def test_stratified_shuffle_split_overlap_train_test_bug():
assert_array_equal(np.intersect1d(train, test), [])


def test_stratified_shuffle_split_multilabel():
# issue 9037
for y in [np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),
Member:
we don't support list of labels right now, right?

Member Author:
not afaik, why? This is supposed to be a multi-label indicator Y.

Member Author:
I could build it explicitly with a (Multi)LabelBinarizer if it makes the test clearer?
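
(For reference, a sketch of how that could look; MultiLabelBinarizer is the existing scikit-learn transformer, and the merged test kept the literal arrays instead:)

from sklearn.preprocessing import MultiLabelBinarizer

# Samples 0 and 3 carry label 1; samples 1 and 2 carry label 0.
Y = MultiLabelBinarizer().fit_transform([[1], [0], [0], [1]])
# array([[0, 1], [1, 0], [1, 0], [0, 1]]) -- same as the first literal y above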

@jnothman (Member) commented Jun 7, 2017

#8913 shows that StratifiedShuffleSplit is broken even in the simple binary y case

@vene (Member Author) commented Jun 8, 2017

Indeed; however, I think that breakage is orthogonal to the one I'm addressing here.
I suspect this PR should fix #8670 but not #8913. We could aim for that in this PR too, or separate them.

@jnothman (Member) commented Jun 8, 2017 via email

@vene (Member Author) commented Jun 8, 2017

I am now thinking they should be fixed separately, since this is much more trivial than #8913.

@jnothman (Member) commented Jun 8, 2017

Okay

@jnothman (Member) left a comment

Is this still WIP?

X = np.ones_like(y)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(iter(sss.split(X=X, y=y)))
assert_array_equal(np.intersect1d(train, test), [])
Member:

we should also test that we produce a complete partition in test, and that it has the right distribution.

assert_array_equal(np.intersect1d(train, test), [])

# complete partition
assert_array_equal(np.union1d(train, test), np.arange(len(y)))
Member Author:

I strengthened this unrelated test as well. (Here, because of the high number of classes, it is not trivial to check the test proportions.)

@vene changed the title from [WIP] fix StratifiedShuffleSplit with 2d y to [MRG] fix StratifiedShuffleSplit with 2d y on Jun 8, 2017
@vene (Member Author) commented Jun 8, 2017

@jnothman I strengthened the test and changed the title to MRG.


# correct stratification of entire rows
# (by design, here y[:, 0] uniquely determines the entire row of y)
assert_equal(np.mean(y[train, 0]), 0.5)
Member:

I'm tired, but I'm finding these assertions hard to understand

Member Author:

good point

@jnothman (Member) left a comment

I'm a little concerned that the test only checks that stratification occurs over one column of y.


if y.ndim == 2:
# for multi-label y, map each distinct row to a unique identifier:
y = np.array([hash(tuple(row)) for row in y])
Member:

I had meant to point out that Python's hash is far from a uniform/unique hash, but I don't think that's a big concern here.

Member Author:

Uniformity certainly doesn't matter here, but uniqueness would.

However, these should all be tuples of integers, which I hope Python handles well. Another alternative would be [str(row) for row in y], which should guarantee the desired behaviour. What do you think?
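
(A quick illustration of the two candidate encodings on small rows; illustrative only, and note the truncation caveat that surfaces later in this thread:)

import numpy as np

y = np.array([[0, 1], [1, 0]])
[hash(tuple(row)) for row in y]  # two ints; distinct here, though hash collisions are not formally excluded
[str(row) for row in y]          # ['[0 1]', '[1 0]'] -- distinct strings, as long as numpy prints the full row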

Member Author:

(here is the implementation of tuplehash)

@ngoix (Contributor) commented Jun 8, 2017

LGTM

np.array([[0, 1], [1, 1], [1, 1], [0, 1]])]:
X = np.ones_like(y)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(iter(sss.split(X=X, y=y)))
Contributor:

do you really need an iter here?

Member Author:

good catch. I think it used to be this way long ago :)

@vene (Member Author) commented Jun 8, 2017

> I'm a little concerned that the test only checks that stratification occurs over one column of y.

See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".

The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b

EDIT: are you concerned about the strength of the test or about its readability for future development?
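
(A hypothetical check, not part of the test suite, that both test matrices satisfy this property:)

import numpy as np
from itertools import combinations

for y in [np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),
          np.array([[0, 1], [1, 1], [1, 1], [0, 1]])]:
    for a, b in combinations(y, 2):
        assert (a[0] == b[0]) == np.array_equal(a, b)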


if y.ndim == 2:
# for multi-label y, map each distinct row to its string repr:
y = np.array([str(row) for row in y])
Member Author:

@jnothman str is guaranteed to do the right thing. It even seems faster on large y, but it seems like it would take up more memory (I need to install memit...)


In [10]: Y.shape
Out[10]: (1000, 10000)

In [11]: Y[:3, :3]
Out[11]:
array([[False,  True, False],
       [False,  True,  True],
       [False,  True,  True]], dtype=bool)

In [12]: %timeit [str(y) for y in Y]
10 loops, best of 3: 178 ms per loop

In [13]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 216 ms per loop

In [14]: Y = Y.astype(np.int)

In [15]: %timeit [str(y) for y in Y]
10 loops, best of 3: 180 ms per loop

In [16]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 580 ms per loop

@vene (Member Author), Jun 8, 2017:

In [19]: %memit [hash(tuple(y)) for y in Y]
peak memory: 141.09 MiB, increment: 0.07 MiB

In [20]: %memit [str(y) for y in Y]
peak memory: 141.09 MiB, increment: 0.00 MiB

In [21]: Y = Y.astype(np.bool)

In [22]: %memit [hash(tuple(y)) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

In [23]: %memit [str(y) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

Member:

Can you try our Cython murmurhash? Maybe it will be faster; it's also optimized for numpy arrays (if you treat each row as one?).

Contributor:

@vene Are you sure str(row) is the right thing to do here? I am finding that for a row with many elements, str(row) ends up being a string representing just the first 3 and last 3 entries:
'[0 0 0 ..., 0 0 0]'
I have a sparse target matrix y that is very wide and has lots of zeros, so y = np.array([str(row) for row in y]) converts y to a very small number of strings that look like the one above. I think this behavior is not what was intended, right?
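
(This is easy to reproduce: numpy abbreviates the printed form of arrays beyond its print threshold, 1000 elements by default, so distinct wide rows can collapse to the same string. A minimal repro; exact formatting varies across numpy versions:)

import numpy as np

row = np.zeros(2000, dtype=int)
str(row)  # '[0 0 0 ..., 0 0 0]' -- the middle 1994 entries are elided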

Member Author:

@cbrummitt you're correct, this seems to trigger for vectors with over 1000 elements... Thanks for the catch!

Unfortunately repr has the same issue. A quick fix seems to be " ".join(x.astype('str')), which is even faster than str on arrays smaller than 1000 elements. Another option would be to use murmurhash3_32. Would you like to submit a fix?
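
(A sketch of that quick fix; the actual change landed in #9922. Joining every element explicitly means nothing is elided, regardless of row width:)

import numpy as np

row = np.zeros(2000, dtype=int)
token = " ".join(row.astype(str))
len(token)  # 3999: all 2000 elements are represented, so distinct rows stay distinct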

Contributor:

@vene I can try submitting a fix. It may take me several days to get around to it, so feel free to beat me to it if you want :)

Contributor:

@vene Just submitted a PR: #9922


if y.ndim == 2:
# for multi-label y, map each distinct row to a unique identifier:
y = np.array([hash(tuple(row)) for row in y])
Member:

This is neat ;)

@raghavrv (Member) commented Jun 8, 2017

@vene I think our sklearn.utils.murmurhash is faster...

>>> y = np.random.randint(0, 10, (2000, 1000))
>>> %timeit murmurhash3_32(y.astype(np.int32, copy=False), 32)
100 loops, best of 3: 15.4 ms per loop

@vene (Member Author) commented Jun 8, 2017

@raghavrv we have different machines; the difference is not that overwhelming: on my PC it's 108 ms.

It's true that it seems faster; however, I think str is more correct here, because we don't want collisions and we don't care about the distribution.

The only point of this step is to prevent np.unique from treating the rows as a 2d array and returning unique scalar values instead of unique rows, like so:


In [24]: x = [(1, 2), (1, 2), (3, 4)]

In [25]: np.unique(x)
Out[25]: array([1, 2, 3, 4])

In [26]: np.unique([str(x_) for x_ in x])
Out[26]:
array(['(1, 2)', '(3, 4)'],
      dtype='<U6')

@vene (Member Author) commented Jun 8, 2017

@jnothman, what is your preference between str and murmurhash3_32?

@jnothman (Member) commented Jun 8, 2017

> I'm a little concerned that the test only checks that stratification occurs over one column of y.

> See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)".
> The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b.

I realise. I'm still not sure it's sufficient, wrt the strength of the test.

@raghavrv (Member) commented Jun 8, 2017

> @raghavrv we have different machines; the difference is not that overwhelming: on my PC it's 108 ms.

Ah that was dumb of me! Thx :)

@vene (Member Author) commented Jun 8, 2017

@jnothman what do you suggest? What other cases do you think I should try to cover?

@jnothman (Member) commented Jun 9, 2017 via email

@vene (Member Author) commented Jun 9, 2017

I think uniformity is not important, because the output of this modified y I'm building is only ever seen by np.unique, then used to compute the splits, then discarded.

However, collisions could cause subtle issues.

* str(y) is guaranteed to have no collisions, but it's a bit slower.
* murmurhash3_32(y) is fast, but it's not clear to me whether it can have collisions in this setting (my guess would be yes).
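
(For illustration, a per-row use of murmurhash3_32 might look like the sketch below; it returns a 32-bit digest, which is exactly why collisions cannot be ruled out as the number of distinct rows grows:)

import numpy as np
from sklearn.utils import murmurhash3_32

y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]], dtype=np.int32)
tokens = np.array([murmurhash3_32(row.tobytes()) for row in y])  # one 32-bit hash per row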

@jnothman (Member) commented Jun 9, 2017 via email

@jnothman added this to the 0.19 milestone on Jun 12, 2017
@jnothman (Member) left a comment

LGTM

@jnothman changed the title from [MRG] fix StratifiedShuffleSplit with 2d y to [MRG+1] fix StratifiedShuffleSplit with 2d y on Jun 12, 2017
@MechCoder merged commit 93563b0 into scikit-learn:master on Jun 13, 2017
@MechCoder (Member):

Pretty neat. Thank you @vene !

@vene (Member Author) commented Jun 13, 2017

thanks @MechCoder!

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
* regression test and fix for 2d stratified shuffle split

* strengthen non-overlap sss tests

* clarify test and comment

* remove iter from tests, use str instead of hash
dmohns pushed a commit to dmohns/scikit-learn that referenced this pull request Aug 7, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
AishwaryaRK pushed a commit to AishwaryaRK/scikit-learn that referenced this pull request Aug 29, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
@amueller (Member):
I'm confused by this. Why is this the right thing to do? Shouldn't we just error? It's pretty unclear what stratification means in a multi-label setup, right?

Successfully merging this pull request may close these issues:

StratifiedShuffleSplit generates overlapping train and test indices for multilabel data
7 participants