[MRG+1] fix StratifiedShuffleSplit with 2d y #9044
Conversation
```
@@ -667,6 +667,15 @@ def test_stratified_shuffle_split_overlap_train_test_bug():
    assert_array_equal(np.intersect1d(train, test), [])


def test_stratified_shuffle_split_multilabel():
    # issue 9037
    for y in [np.array([[0, 1], [1, 0], [1, 0], [0, 1]]),
```
we don't support list of labels right now, right?
not afaik, why? This is supposed to be a multi-label indicator Y.
I could build it explicitly with a (Multi)LabelBinarizer if it makes the test clearer?
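For context, a small sketch (with made-up label sets, not taken from the PR) of how such an indicator Y can be built explicitly:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# hypothetical per-sample sets of active labels
label_sets = [[0], [1], [1], [0, 1]]
Y = MultiLabelBinarizer().fit_transform(label_sets)
# Y == array([[1, 0],
#             [0, 1],
#             [0, 1],
#             [1, 1]])
```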
#8913 shows that StratifiedShuffleSplit is broken even in the simple binary y case.

I don't mind if they're fixed together or separately, but #8913 seems the more urgent!
On 8 June 2017 at 15:10, Vlad Niculae wrote:

> Indeed, however I think the breakage is orthogonal to the one I'm addressing here.
> I suspect this PR should fix #8670 but not #8913. We could aim for that in this PR too, or separate them.
I am now thinking they should be fixed separately, since this is much more trivial than #8913.

Okay
Is this still WIP?
```
        X = np.ones_like(y)
        sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
        train, test = next(iter(sss.split(X=X, y=y)))
        assert_array_equal(np.intersect1d(train, test), [])
```
we should also test that we produce a complete partition in test, and that it has the right distribution.
```
    assert_array_equal(np.intersect1d(train, test), [])

    # complete partition
    assert_array_equal(np.union1d(train, test), np.arange(len(y)))
```
I strengthened this unrelated test as well. (Here, because of the high number of classes, it's not trivial to check test proportions.)
@jnothman strengthened the test and changed to MRG.
```
        # correct stratification of entire rows
        # (by design, here y[:, 0] uniquely determines the entire row of y)
        assert_equal(np.mean(y[train, 0]), 0.5)
```
I'm tired, but I'm finding these assertions hard to understand
good point
I'm a little concerned that the test only checks that stratification occurs over one column of y.
sklearn/model_selection/_split.py (outdated)

```
        if y.ndim == 2:
            # for multi-label y, map each distinct row to a unique identifier:
            y = np.array([hash(tuple(row)) for row in y])
```
I had meant to point out that Python's hash is far from a uniform/unique hash, but I don't think that's a big concern here.
Uniformity certainly doesn't matter here, but uniqueness would. However, these should all be tuples of integers, which I hope Python handles well. Another alternative would be `[str(row) for row in y]`, which should guarantee the desired behaviour. What do you think?
(here is the implementation of tuplehash)
LGTM
```
              np.array([[0, 1], [1, 1], [1, 1], [0, 1]])]:
        X = np.ones_like(y)
        sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
        train, test = next(iter(sss.split(X=X, y=y)))
```
do you really need an `iter` here?
good catch. I think it used to be this way long ago :)
See the comment "(by design, here y[:, 0] uniquely determines the entire row of y)". The two ys tested here satisfy the property that, for any two rows a, b, a[0] == b[0] iff a == b. EDIT: are you concerned about the strength of the test or about its readability for future development?
```
        if y.ndim == 2:
            # for multi-label y, map each distinct row to its string repr:
            y = np.array([str(row) for row in y])
```
@jnothman `str` is guaranteed to do the right thing. It even seems faster on large y, but it seems like it would take up more memory (I need to install memit...)
```
In [10]: Y.shape
Out[10]: (1000, 10000)

In [11]: Y[:3, :3]
Out[11]:
array([[False,  True, False],
       [False,  True,  True],
       [False,  True,  True]], dtype=bool)

In [12]: %timeit [str(y) for y in Y]
10 loops, best of 3: 178 ms per loop

In [13]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 216 ms per loop

In [14]: Y = Y.astype(np.int)

In [15]: %timeit [str(y) for y in Y]
10 loops, best of 3: 180 ms per loop

In [16]: %timeit [hash(tuple(y)) for y in Y]
1 loop, best of 3: 580 ms per loop
```
```
In [19]: %memit [hash(tuple(y)) for y in Y]
peak memory: 141.09 MiB, increment: 0.07 MiB

In [20]: %memit [str(y) for y in Y]
peak memory: 141.09 MiB, increment: 0.00 MiB

In [21]: Y = Y.astype(np.bool)

In [22]: %memit [hash(tuple(y)) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB

In [23]: %memit [str(y) for y in Y]
peak memory: 74.33 MiB, increment: 0.00 MiB
```
Can you try our Cython murmurhash? Maybe it will be faster? Plus it's optimized for numpy arrays (if you would consider each row as one?)
@vene Are you sure `str(row)` is the right thing to do here? I am finding that for a row with many elements, `str(row)` ends up being a string representing just the first 3 and last 3 entries: `'[0 0 0 ..., 0 0 0]'`. I have a sparse target matrix `y` that is very wide and has lots of zeros, so `y = np.array([str(row) for row in y])` converts `y` to a very small number of strings that look like the one above. I think this behavior is not what was intended, right?
@cbrummitt you're correct, this seems to trigger for vectors with over 1000 elements... Thanks for the catch! Unfortunately `repr` has the same issue. A quick fix seems to be `" ".join(x.astype('str'))`, which is even faster than `str` on arrays smaller than 1000 elements. Another option would be to use `murmurhash3_32`. Would you like to submit a fix?
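For illustration, a minimal sketch of the truncation issue and the proposed workaround (this assumes numpy's default print threshold of 1000 elements; the variable names are made up):

```python
import numpy as np

row = np.zeros(1001, dtype=int)

# str() elides arrays longer than numpy's print threshold, so many
# distinct wide rows can end up mapping to the same string:
print(str(row))                  # '[0 0 0 ... 0 0 0]'

# joining the element-wise string conversion keeps every entry:
key = " ".join(row.astype(str))  # '0 0 0 0 ...' with all 1001 entries spelled out
```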
@vene I can try submitting a fix. It may take me several days to get around to it, so feel free to beat me to it if you want :)
sklearn/model_selection/_split.py (outdated)

```
        if y.ndim == 2:
            # for multi-label y, map each distinct row to a unique identifier:
            y = np.array([hash(tuple(row)) for row in y])
```
This is neat ;)
@vene I think our murmurhash is faster:

```
>>> y = np.random.randint(0, 10, (2000, 1000))
>>> %timeit murmurhash3_32(y.astype(np.int32, copy=False), 32)
100 loops, best of 3: 15.4 ms per loop
```
@raghavrv we have different machines; the difference is not that overwhelming, on my PC it's 108ms. It's true that it seems faster. However, the only point of this step is to prevent np.unique from coercing the list of rows back into a 2d array (and then flattening it).
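A minimal illustration of that point (the snippet that originally followed this comment was lost; this reconstruction assumes only standard np.unique behaviour):

```python
import numpy as np

y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])

# a list of tuples is coerced back into a 2d array, which np.unique
# then flattens, losing the row structure:
np.unique([tuple(row) for row in y])   # array([0, 1])

# one scalar key per row keeps each sample atomic:
np.unique([str(row) for row in y])     # array(['[0 1]', '[1 0]'], dtype='<U5')
```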
@jnothman, what is your preference between these options?
I realise. I'm still not sure it's sufficient, wrt the strength of the test.

Ah that was dumb of me! Thx :)

@jnothman what do you suggest? What other cases do you think I should try to cover?
I don't think in this case that the hash uniformity is very important. It was just a thought while reviewing.
I think uniformity is not important, because the output of this modified `y` I'm building is only ever seen by np.unique, then used to compute the splits, then discarded. However, collisions could cause subtle issues. `str(y)` is guaranteed to have no collisions but it's a bit slower; `murmurhash3_32(y)` is fast but it's not clear to me whether it can have collisions in this setting (my guess would be yes).
I would doubt collisions are likely with murmurhash..? But str is fine, IMO.
LGTM
Pretty neat. Thank you @vene!

thanks @MechCoder!
* regression test and fix for 2d stratified shuffle split
* strengthen non-overlap sss tests
* clarify test and comment
* remove iter from tests, use str instead of hash
I'm confused by this. Why is this the right thing to do? Shouldn't we just error? It's pretty unclear what stratification means in a multi-label setup, right?
Fixes #9037
What does this implement/fix?
StratifiedShuffleSplit gives unexpected results if given a 2d y, such as a multi-label or multi-task indicator matrix: basically the current behaviour implicitly flattens y, which makes the indices meaningless.
This fix computes the split indices by replacing each row of y with a unique hash. In the multi-label case this corresponds to considering the whole set of active labels as the label for that sample.
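As a minimal sketch of the idea (not the exact patch; the toy data below is made up), the fix amounts to collapsing each row of y to a scalar key before the usual 1-d stratification logic runs:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# toy multi-label indicator target
y = np.array([[0, 1], [1, 0], [1, 0], [0, 1]])
X = np.ones_like(y)

# map each distinct row to a single identifier, so samples sharing
# the same label set fall into the same stratum
y_keys = np.array([hash(tuple(row)) for row in y])

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train, test = next(sss.split(X, y_keys))
```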
This is a WIP since I might have missed some use cases where this is not appropriate.