
Conversation

@schwarty

It definitely requires further testing, I'm just interested in knowing if you can find cases where it doesn't behave properly so that it can be fixed. And I would like to be positively sure it works properly before discussing design considerations.

@agramfort
Member

can you comment with the previous use case that was failing and that now works?

cc/ @npinto

@schwarty
Author

Example

import numpy as np
from sklearn import cross_validation as cv

y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
sss = cv.StratifiedShuffleSplit(y, n_iterations=3, train_size=6,
                                test_size=.2, random_state=0)

for train, test in sss:
    print 'indices', train, test
    print 'values', y[train], y[test]

Output

indices [2, 1, 6, 4, 7, 9] [0, 5, 8]
values [1 1 2 2 3 3] [1 2 3]
indices [2, 0, 6, 5, 8, 7] [1, 4, 9]
values [1 1 2 2 3 3] [1 2 3]
indices [1, 2, 4, 5, 7, 8] [0, 6, 9]
values [1 1 2 2 3 3] [1 2 3]

So it appears to work now, where it previously failed badly. If you tweak the test_size parameter you might end up with test sets that are too small to contain all the classes (e.g. try test_size=.1). I don't think the cv scheme itself needs to be modified here; rather, the validation function has to check extra things (typically that the test and train sets are not smaller than the number of classes).
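
For illustration, the kind of extra check I have in mind could look like this (a rough sketch with hypothetical names, not code from this PR; n_train and n_test stand for the resolved absolute fold sizes):

import numpy as np

def _check_stratified_split_sizes(y, n_train, n_test):
    # hypothetical helper: refuse splits whose folds cannot hold every class
    n_classes = np.unique(y).shape[0]
    if n_train < n_classes:
        raise ValueError("train_size=%d should be at least the number of "
                         "classes (%d)" % (n_train, n_classes))
    if n_test < n_classes:
        raise ValueError("test_size=%d should be at least the number of "
                         "classes (%d)" % (n_test, n_classes))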

Member

Not important since you don't use it in the outer loop, but you are redefining the i variable in the nested loop

Member

Apparently, you didn't address @fabianp's comment yet.

Author

@mblondel: I did, the i variable is no longer defined in the outer loop.

@GaelVaroquaux
Member

y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
sss = cv.StratifiedShuffleSplit(y, n_iterations=3, train_size=6,
                                test_size=.2, random_state=0)

for train, test in sss:
    print 'indices', train, test
    print 'values', y[train], y[test]

Output

indices [2, 1, 6, 4, 7, 9] [0, 5, 8]
values [1 1 2 2 3 3] [1 2 3]
indices [2, 0, 6, 5, 8, 7] [1, 4, 9]
values [1 1 2 2 3 3] [1 2 3]
indices [1, 2, 4, 5, 7, 8] [0, 6, 9]
values [1 1 2 2 3 3] [1 2 3]

Could you make a unit test out of this? Basically, I think a good check would be to verify that each fold has instances of 1, 2 and 3.

Something that I don't really like is that the classes are sorted in the folds. This can be detrimental to some algorithms. I'd rather avoid it.
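
Something along these lines, for example (just a sketch of the idea, using the same example y and the n_iterations-style constructor as above):

import numpy as np
from sklearn import cross_validation as cv
from nose.tools import assert_equal

def test_stratified_shuffle_split_all_classes():
    y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
    sss = cv.StratifiedShuffleSplit(y, n_iterations=3, train_size=6,
                                    test_size=.2, random_state=0)
    for train, test in sss:
        # every class must show up in both the train and the test fold
        assert_equal(set(np.unique(y[train])), set([1, 2, 3]))
        assert_equal(set(np.unique(y[test])), set([1, 2, 3]))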

@kyleabeauchamp
Contributor

I played with the class for a bit and it seems to do what I want.

Regarding the sorted classes--can't we just throw a permutation on the training and test indices?

@amueller
Member

@kyleabeauchamp Thanks for tackling this. Yes, doing a permutation would be ok. It might be possible to avoid it but I'd rather go for an easy to understand solution, even if we go over the indices once more.

@schwarty
Author

Thanks for the comments. If you all agree, before I make the implementation more efficient and cleaner, I'd rather add the tests and make it rock solid. The other stuff is less essential and can come later.

@schwarty
Author

I added tests, and permuted the test and train sets as asked. It should be pretty good now. One comment I have is that the implementation itself is not very efficient and starts to be a bit slow when you get around 1M samples.

@ogrisel
Member

ogrisel commented Aug 25, 2012

It would be great if someone could volunteer to add benchmarks for CV iterators and other utility functions (such as the score functions in the metrics module) to the benchmark suite from scikit-learn-speed.

Current benchmark source code lives in this folder:

https://github.com/scikit-learn/scikit-learn-speed/blob/master/benchmarks/

and use these templates:

https://github.com/scikit-learn/scikit-learn-speed/blob/master/benchmarks/templates.py

Currently all of those benchmarks use the same template, which is focused on benchmarking classes that implement the fit / predict API, but nothing prevents us from adding other utility functions or classes to the benchmark suite.

Maybe @vene could add a new benchmark for a non-fit-predict object, to give a first example for new benchmark contributors.

Member

Could you add assert_equal or assert_almost_equal assertions too? (to check that the proportions of each class are roughly respected)
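
For example, something like this inside the loop over the folds (a sketch; y, train and test as in the existing test):

import numpy as np
from numpy.testing import assert_array_almost_equal

# class proportions inside each fold should roughly match the overall ones
p_overall = np.bincount(y) / float(len(y))
p_train = np.bincount(y[train], minlength=len(p_overall)) / float(len(train))
p_test = np.bincount(y[test], minlength=len(p_overall)) / float(len(test))
assert_array_almost_equal(p_overall, p_train, decimal=1)
assert_array_almost_equal(p_overall, p_test, decimal=1)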

Member

Also, I would argue that we don't care if it does a better job than ShuffleSplit. We just care that it does the job as expected. So, I would just remove the above inequality assertions (comparison with ShuffleSplit).

Contributor

Perhaps we want something like this (incomplete snippet), where we check that the observed training counts are within +-1 of their desired values. The +-1 results because desired_train_counts might be a float, so we can only assume equality to within rounding.

import numpy as np
import sklearn.cross_validation

# train_size and test_size are assumed here to be absolute sample counts
observed_counts = np.bincount(y, minlength=y.max() + 1)
observed_probs = 1. * observed_counts / observed_counts.sum()

desired_train_counts = train_size * observed_probs
desired_test_counts = test_size * observed_probs

cv = sklearn.cross_validation.StratifiedShuffleSplit(y, indices=True)

for train_ind, test_ind in cv:
    y_train, y_test = y[train_ind], y[test_ind]
    train_counts = np.array([sum(y_train == i) for i in range(y.max() + 1)])
    test_counts = np.array([sum(y_test == i) for i in range(y.max() + 1)])
    np.testing.assert_(np.abs(train_counts - desired_train_counts).max() <= 1.)
    np.testing.assert_(np.abs(test_counts - desired_test_counts).max() <= 1.)


@ogrisel
Member

ogrisel commented Aug 26, 2012

@schwarty could you adapt the problem reported by Dan on the mailing list into a non-regression test?

from sklearn.cross_validation import StratifiedShuffleSplit
import numpy as np

y = np.hstack(([-1] * 800, [1] * 50))
tr_idx, te_idx = iter(StratifiedShuffleSplit(y, 1, test_size=0.3)).next()

print np.unique(y[tr_idx])
# Prints [-1]. I don't get any sample from class `1`.

print len(tr_idx) + len(te_idx) 
# Prints 808. Some samples are lost.
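
Something along these lines could serve as the non-regression test (a sketch of the idea, not necessarily the exact test to add; it assumes the splitter yields index arrays as above):

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

def test_stratified_shuffle_split_unbalanced_classes():
    # non-regression test for the case reported by Dan on the mailing list
    y = np.hstack(([-1] * 800, [1] * 50))
    for train, test in StratifiedShuffleSplit(y, 1, test_size=0.3):
        # the minority class must not disappear from the training fold
        assert set(np.unique(y[train])) == set([-1, 1])
        # no sample should be silently dropped
        assert len(train) + len(test) == len(y)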

@schwarty
Author

I think the problem reported by Dan actually happens with the previous implementation of the StratifiedShuffleSplit. Basically just check your installation and you should be good!

@amueller
Member

@schwarty Can you still please add a test?

@schwarty
Author

@amueller: Done. I also added additional validation for corner cases, and the associated tests. And I replaced the comparison to the ShuffleSplit by something that should be more relevant.

@amueller
Member

Thanks. Do you think this should be ok now and should I or someone else have another close look?
This should really be in the next release!

Member

can you also check that the size of the two sets together gives the total training set size?
And that training and test don't overlap?
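
For instance (a sketch; train and test are the index arrays yielded by the splitter):

import numpy as np

# the folds must not overlap, and together they must account for every sample
assert len(np.intersect1d(train, test)) == 0
assert len(train) + len(test) == len(y)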

Author

Ok that's also done. And I think @GaelVaroquaux would like to have another look before we merge it. But he will probably be busy for the next couple of days...

…training and testing sets, and that they don't overlap

Member

It seems to me that you could use np.minimum, and be more readable. No?

Author

Yep, done

Member

I think that you could code this as:

train = rng.permutation(train)
test = rng.permutation(test)

Author

Good to know you can pass a sequence to permutation, thanks for the tip. Done as well.

@GaelVaroquaux
Member

LGTM. +1 for merge. Good work, @schwarty: you drew almost no complaints from me :)

@mblondel
Member

Did you address the scalability issue? I'm working with a 1 million example dataset and want to use StratifiedShuffleSplit.

Author

@ogrisel : that's the line doing it (147)

Member

Oops sorry I missed it.

@schwarty
Author

@mblondel : I didn't change anything regarding the speed, is it too slow at the moment?

@mblondel
Member

@schwarty Haven't tried yet and won't have time to try before next week. We can merge and optimize later if necessary (correctness is more important).

@ogrisel
Member

ogrisel commented Aug 29, 2012

LGTM, +1 for merging.

@schwarty
Author

@mblondel I agree, FYI currently it takes around 0.8s per fold

@amueller
Member

Looks good, merging. Thanks a lot for the fix!

@amueller
Member

I get this error:

File "/home/andy/checkout/scikit-learn/sklearn/cross_validation.py", line 945, in sklearn.cross_validation.StratifiedShuffleSplit
Failed example:
    for train_index, test_index in sss:
       print("TRAIN: %s TEST: %s" % (train_index, test_index))
       X_train, X_test = X[train_index], X[test_index]
       y_train, y_test = y[train_index], y[test_index]
Expected:
    TRAIN: [0 3] TEST: [1 2]
    TRAIN: [0 2] TEST: [1 3]
    TRAIN: [1 2] TEST: [0 3]
Got:
    TRAIN: [1 2] TEST: [3 0]
    TRAIN: [0 2] TEST: [1 3]
    TRAIN: [0 2] TEST: [3 1]

Member

I don't understand this test (maybe it is too late). This calls the constructor, right? Where does the constructor do input validation?
It shouldn't; that should be done in fit. But I don't see where it happens at all.

@amueller
Member

nondeterministic test failures, my favourite -_-

@amueller
Member

Ok, it was just a doctest. Hopefully I fixed it and merged.

@ogrisel
Member

ogrisel commented Aug 30, 2012

@amueller did you merge or just close the PR this time?

@amueller
Member

It shows up in the commits so I guess I did what I intended for once ;)

@GaelVaroquaux
Member

> @schwarty Haven't tried yet and won't have time to try before next week. We can merge and optimize later if necessary (correctness is more important).

In terms of speed, np.unique(y) should be computed only once: with many samples this is not a cheap operation (150ms with 1e6 elements and 10 classes on my box).
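
That is, compute it once up front and reuse the result, along these lines (a standalone sketch, not the actual code in cross_validation.py):

import numpy as np

y = np.random.randint(0, 10, size=int(1e6))

# call np.unique a single time and keep the results around,
# instead of recomputing them at every iteration of the split
classes, y_indices = np.unique(y, return_inverse=True)
class_counts = np.bincount(y_indices)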

@mblondel
Member

@GaelVaroquaux: I fixed that in master.

@schwarty: CS 101 "Don't repeat the same computation twice" :)

@GaelVaroquaux
Member

> @GaelVaroquaux: I fixed that in master.

Argh, good! I have a really bad Internet connection, so I am having a hard time following the high-speed train that is the scikit.

@amueller
Member

Yeah I really have problems catching up with all that is going on! Crazy :)

@mblondel
Member

@GaelVaroquaux : no, I fixed that thanks to your remark! :)

Am I the only one who hates the new notification system in github? I'm flooded with notifications now. I preferred the old system: notifications on mentions, new PRs and commit comments...

@GaelVaroquaux
Member

> Am I the only one who hates the new notification system in github? I'm flooded with notifications now. I preferred the old system: notifications on mentions, new PRs and commit comments...

Same thing here, it's a nightmare. I have the feeling that it is killing my productivity.

@amueller
Member

@mblondel yeah it does flood my inbox :-/
