
Conversation

@schwarty

It definitely requires further testing, I'm just interested in knowing if you can find cases where it doesn't behave properly so that it can be fixed. And I would like to be positively sure it works properly before discussing design considerations.

@agramfort
Member

can you comment with the previous use case that was failing and that now works?

cc/ @npinto

@schwarty
Author

Example

import numpy as np
from sklearn import cross_validation as cv

y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
sss = cv.StratifiedShuffleSplit(y, n_iterations=3, train_size=6,
                                test_size=.2, random_state=0)

for train, test in sss:
    print 'indices', train, test
    print 'values', y[train], y[test]

Output

indices [2, 1, 6, 4, 7, 9] [0, 5, 8]
values [1 1 2 2 3 3] [1 2 3]
indices [2, 0, 6, 5, 8, 7] [1, 4, 9]
values [1 1 2 2 3 3] [1 2 3]
indices [1, 2, 4, 5, 7, 8] [0, 6, 9]
values [1 1 2 2 3 3] [1 2 3]

So it appears to work now, where it previously failed badly. If you tweak the test_size parameter you might end up with test sets that are too small to contain all the classes (e.g. try test_size=.1). I don't think the cv scheme itself needs to be modified here; rather, the validation function has to check extra things (typically that the test and train sets are not smaller than the number of classes).
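
For illustration, the kind of extra check I have in mind could look like this (a rough sketch with hypothetical names, not code from this PR; n_train and n_test stand for the resolved absolute fold sizes):

import numpy as np

def _check_stratified_split_sizes(y, n_train, n_test):
    # hypothetical helper: refuse splits whose folds cannot hold every class
    n_classes = np.unique(y).shape[0]
    if n_train < n_classes:
        raise ValueError("train_size=%d should be at least the number of "
                         "classes (%d)" % (n_train, n_classes))
    if n_test < n_classes:
        raise ValueError("test_size=%d should be at least the number of "
                         "classes (%d)" % (n_test, n_classes))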

Member

Not important since you don't use it in the outer loop, but you are redefining the i variable in the nested loop

Member

Apparently, you didn't address @fabianp's comment yet.

Author

@mblondel: I did, the i variable is no longer defined in the outer loop.

@GaelVaroquaux
Member

y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
sss = cv.StratifiedShuffleSplit(y, n_iterations=3, train_size=6,
                                test_size=.2, random_state=0)

for train, test in sss:
    print 'indices', train, test
    print 'values', y[train], y[test]

Output

indices [2, 1, 6, 4, 7, 9] [0, 5, 8]
values [1 1 2 2 3 3] [1 2 3]
indices [2, 0, 6, 5, 8, 7] [1, 4, 9]
values [1 1 2 2 3 3] [1 2 3]
indices [1, 2, 4, 5, 7, 8] [0, 6, 9]
values [1 1 2 2 3 3] [1 2 3]

Could you make a unit test out of this? Basically, I think a good check would be to verify that each fold has instances of 1, 2 and 3.

Something that I don't really like is that the classes are sorted in the folds. This can be detrimental to some algorithms. I'd rather avoid it.
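
Something along these lines, for example (just a sketch of the idea, using the same example y and the n_iterations-style constructor as above):

import numpy as np
from sklearn import cross_validation as cv
from nose.tools import assert_equal

def test_stratified_shuffle_split_all_classes():
    y = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3])
    sss = cv.StratifiedShuffleSplit(y, n_iterations=3, train_size=6,
                                    test_size=.2, random_state=0)
    for train, test in sss:
        # every class must show up in both the train and the test fold
        assert_equal(set(np.unique(y[train])), set([1, 2, 3]))
        assert_equal(set(np.unique(y[test])), set([1, 2, 3]))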

@kyleabeauchamp
Contributor

I played with the class for a bit and it seems to do what I want.

Regarding the sorted classes--can't we just throw a permutation on the training and test indices?

@amueller
Member

@kyleabeauchamp Thanks for tackling this. Yes, doing a permutation would be ok. It might be possible to avoid it but I'd rather go for an easy to understand solution, even if we go over the indices once more.

@schwarty
Author

Thanks for the comments. If you all agree, before I make the implementation more efficient and cleaner, I'd rather add the tests and make it rock solid. The other stuff is less essential and can come later.

@schwarty
Author

I added tests, and permuted the test and train sets as asked. It should be pretty good now. One comment I have is that the implementation itself is not very efficient and starts to be a bit slow when you get around 1M samples.

@ogrisel
Member

ogrisel commented Aug 25, 2012

It would be great if someone could volunteer to add benchmarks for CV iterators and other utility functions (such as the score functions in the metrics module) to the benchmark suite from scikit-learn-speed.

Current benchmark source code lives in this folder:

https://github.com/scikit-learn/scikit-learn-speed/blob/master/benchmarks/

and use these templates:

https://github.com/scikit-learn/scikit-learn-speed/blob/master/benchmarks/templates.py

Currently all of those benchmarks use the same template, which is focused on benchmarking classes that implement the fit / predict API, but nothing prevents us from adding other utility functions or classes to the benchmark suite.

Maybe @vene could add a new benchmark for a non-fit-predict object, to give a first example for new benchmark contributors.

Member

Could you add assert_equal or assert_almost_equal assertions too? (to check that the proportions of each class are roughly respected)
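
For example, something like this inside the loop over the folds (a sketch; y, train and test as in the existing test):

import numpy as np
from numpy.testing import assert_array_almost_equal

# class proportions inside each fold should roughly match the overall ones
p_overall = np.bincount(y) / float(len(y))
p_train = np.bincount(y[train], minlength=len(p_overall)) / float(len(train))
p_test = np.bincount(y[test], minlength=len(p_overall)) / float(len(test))
assert_array_almost_equal(p_overall, p_train, decimal=1)
assert_array_almost_equal(p_overall, p_test, decimal=1)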

Member

Also, I would argue that we don't care if it does a better job than ShuffleSplit. We just care that it does the job as expected. So, I would just remove the above inequality assertions (comparison with ShuffleSplit).

Contributor

Perhaps we want something like this (incomplete snippet), where we check that the observed training counts are within +-1 of their desired values. The +-1 results because desired_train_counts might be a float, so we can only assume equality to within rounding.

import numpy as np
import sklearn.cross_validation

# train_size and test_size are assumed here to be absolute sample counts
observed_counts = np.bincount(y, minlength=y.max() + 1)
observed_probs = 1. * observed_counts / observed_counts.sum()

desired_train_counts = train_size * observed_probs
desired_test_counts = test_size * observed_probs

cv = sklearn.cross_validation.StratifiedShuffleSplit(y, indices=True)

for train_ind, test_ind in cv:
    y_train, y_test = y[train_ind], y[test_ind]
    train_counts = np.array([sum(y_train == i) for i in range(y.max() + 1)])
    test_counts = np.array([sum(y_test == i) for i in range(y.max() + 1)])
    np.testing.assert_(np.abs(train_counts - desired_train_counts).max() <= 1.)
    np.testing.assert_(np.abs(test_counts - desired_test_counts).max() <= 1.)


@ogrisel
Member

ogrisel commented Aug 26, 2012

@schwarty could you adapt the problem reported by Dan on the mailing list into a non-regression test?

from sklearn.cross_validation import StratifiedShuffleSplit
import numpy as np

y = np.hstack(([-1] * 800, [1] * 50))
tr_idx, te_idx = iter(StratifiedShuffleSplit(y, 1, test_size=0.3)).next()

print np.unique(y[tr_idx])
# Prints [-1]. I don't get any sample from class `1`.

print len(tr_idx) + len(te_idx) 
# Prints 808. Some samples are lost.
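
Something along these lines could serve as the non-regression test (a sketch of the idea, not necessarily the exact test to add; it assumes the splitter yields index arrays as above):

import numpy as np
from sklearn.cross_validation import StratifiedShuffleSplit

def test_stratified_shuffle_split_unbalanced_classes():
    # non-regression test for the case reported by Dan on the mailing list
    y = np.hstack(([-1] * 800, [1] * 50))
    for train, test in StratifiedShuffleSplit(y, 1, test_size=0.3):
        # the minority class must not disappear from the training fold
        assert set(np.unique(y[train])) == set([-1, 1])
        # no sample should be silently dropped
        assert len(train) + len(test) == len(y)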

@schwarty
Author

I think the problem reported by Dan actually happens with the previous implementation of the StratifiedShuffleSplit. Basically just check your installation and you should be good!

@amueller
Member

@schwarty Can you still please add a test?

@schwarty
Author

@amueller: Done. I also added additional validation for corner cases, and the associated tests. And I replaced the comparison to the ShuffleSplit by something that should be more relevant.

@amueller
Member

Thanks. Do you think this should be ok now and should I or someone else have another close look?
This should really be in the next release!

Member

can you also check that the size of the two sets together gives the total training set size?
And that training and test don't overlap?
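
For instance (a sketch; train and test are the index arrays yielded by the splitter):

import numpy as np

# the folds must not overlap, and together they must account for every sample
assert len(np.intersect1d(train, test)) == 0
assert len(train) + len(test) == len(y)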

Author

Ok that's also done. And I think @GaelVaroquaux would like to have another look before we merge it. But he will probably be busy for the next couple of days...

…training and testing sets, and that they don't overlap

Member

It seems to me that you could use np.minimum, and be more readable. No?

Author

Yep, done

Member

I think that you could code this as:

train = rng.permutation(train)
test = rng.permutation(test)

Author

Good to know you can pass a sequence to permutation, thanks for the tip. Done as well.

@GaelVaroquaux
Member

LGTM. +1 for merge. Good work, @schwarty: you drew almost no complaints from me :)

@mblondel
Member

Did you address the scalability issue? I'm working with a 1 million example dataset and want to use StratifiedShuffleSplit.

Author

@ogrisel : that's the line doing it (147)

Member

Oops sorry I missed it.

@schwarty
Author

@mblondel : I didn't change anything regarding the speed, is it too slow at the moment?

@mblondel
Member

@schwarty Haven't tried yet and won't have time to try before next week. We can merge and optimize later if necessary (correctness is more important).

@ogrisel
Member

ogrisel commented Aug 29, 2012

LGTM, +1 for merging.

@schwarty
Author

@mblondel I agree, FYI currently it takes around 0.8s per fold

@amueller
Member

Looks good, merging. Thanks a lot for the fix!

@amueller
Member

I get this error:

File "/home/andy/checkout/scikit-learn/sklearn/cross_validation.py", line 945, in sklearn.cross_validation.StratifiedShuffleSplit
Failed example:
    for train_index, test_index in sss:
       print("TRAIN: %s TEST: %s" % (train_index, test_index))
       X_train, X_test = X[train_index], X[test_index]
       y_train, y_test = y[train_index], y[test_index]
Expected:
    TRAIN: [0 3] TEST: [1 2]
    TRAIN: [0 2] TEST: [1 3]
    TRAIN: [1 2] TEST: [0 3]
Got:
    TRAIN: [1 2] TEST: [3 0]
    TRAIN: [0 2] TEST: [1 3]
    TRAIN: [0 2] TEST: [3 1]

Member

I don't understand this test (maybe it is too late). This calls the constructor, right? Where does the constructor do input validation?
It shouldn't; that should be done in fit. But I don't see where it happens at all.

@amueller
Member

nondeterministic test failures, my favourite -_-

@amueller
Member

Ok, it was just a doctest. Hopefully I fixed it and merged.

@ogrisel
Member

ogrisel commented Aug 30, 2012

@amueller did you merge or just close the PR this time?

@amueller
Member

It shows up in the commits so I guess I did what I intended for once ;)

@GaelVaroquaux
Member

> @schwarty Haven't tried yet and won't have time to try before next week. We can merge and optimize later if necessary (correctness is more important).

In terms of speed, np.unique(y) should be computed only once: with many samples this is not a cheap operation (150ms with 1e6 elements and 10 classes on my box).
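
That is, compute it once up front and reuse the result, along these lines (a standalone sketch, not the actual code in cross_validation.py):

import numpy as np

y = np.random.randint(0, 10, size=int(1e6))

# call np.unique a single time and keep the results around,
# instead of recomputing them at every iteration of the split
classes, y_indices = np.unique(y, return_inverse=True)
class_counts = np.bincount(y_indices)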

@mblondel
Member

@GaelVaroquaux: I fixed that in master.

@schwarty: CS 101 "Don't repeat the same computation twice" :)

@GaelVaroquaux
Member

> @GaelVaroquaux: I fixed that in master.

Argh, good! I have a really bad Internet connection, so I am having a hard time following the high-speed train that is the scikit.

@amueller
Member

Yeah I really have problems catching up with all that is going on! Crazy :)

@mblondel
Member

@GaelVaroquaux : no, I fixed that thanks to your remark! :)

Am I the only one who hates the new notification system in github? I'm flooded with notifications now. I preferred the old system: notifications on mentions, new PRs and commit comments...

@GaelVaroquaux
Member

> Am I the only one who hates the new notification system in github? I'm flooded with notifications now. I preferred the old system: notifications on mentions, new PRs and commit comments...

Same thing here, it's a nightmare. I have the feeling that it is killing my productivity.

@amueller
Member

@mblondel yeah it does flood my inbox :-/
