[MRG+1] fix sampling in stratified shuffle split #6472

Merged
merged 13 commits into scikit-learn:master from amueller:stratified_shuffle_split_simplify Sep 9, 2016

Conversation

@amueller
Member

amueller commented Mar 1, 2016

Fixes #6471.
This breaks some tests for stratified shuffle split. From my understanding, it actually now has the correct statistics, and the tests are wrong.

I'll remove the asserts etc. if we can agree on this implementation (and on how the tests need to be changed).

This tries to adjust t_i and n_i before the sampling happens, and makes sure sum(t_i) == n_test and sum(n_i) == n_train.

Also, it adds in the "missing" samples based on their class frequency. So in expectation this should be doing exactly the right thing (I think; I didn't have any coffee yet today).

The ugly bit is

np.bincount(rng.permutation(np.repeat(range(n_classes), left_over_per_class))[:missing_train], minlength=n_classes)

which is the only way I knew to say "sample missing_train points from the classes in left_over_per_class". I feel there should be a better way to express that.
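
As a small, made-up illustration of what that expression computes (spreading missing_train extra picks over the classes in proportion to how many leftover samples each class still has):

    import numpy as np

    rng = np.random.RandomState(0)
    n_classes = 3
    left_over_per_class = np.array([2, 1, 3])  # unassigned samples left in each class
    missing_train = 2                          # extra training samples still needed

    # one "ticket" per leftover sample, shuffled; keep the first missing_train
    # tickets and count how many landed in each class
    extra_per_class = np.bincount(
        rng.permutation(np.repeat(range(n_classes), left_over_per_class))[:missing_train],
        minlength=n_classes)
    print(extra_per_class)  # sums to missing_train, e.g. array([1, 0, 1])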

The first test that breaks checks that if something with class_counts = [4, 3, 5] is split into 8 training and 4 test points, the class probabilities in the training and test sets are the same.
With the given random state, it is split into [2, 1, 1] and [2, 2, 4], which I think is fine; the test disagrees.

@amueller amueller changed the title from fix sampling in stratified shuffle split, break tests that test sampl… to fix sampling in stratified shuffle split, break tests that test sampling Mar 1, 2016

@amueller

Member

amueller commented Mar 1, 2016

Ah, it's actually equivalent to

    to_n_i = np.random.choice(n_classes, missing_train, replace=False,
                              p=left_over_per_class / left_over_per_class.sum())

I think.

n_i = np.round(n_train * p_i).astype(int)
# n_i = number of samples per class in training set
n_i = np.floor(n_train * p_i).astype(int)
# n_i = number of samples per class in test set

@raghavrv

raghavrv Mar 2, 2016

Member

(Minor nitpick) n_i --> t_i

@@ -1006,6 +1006,13 @@ def _iter_indices(self, X, y, labels):
yield train, test
def _sample_without_replacement(class_counts, n_samples, rng):
n_classes = len(class_counts)

@raghavrv

raghavrv Mar 2, 2016

Member

(Minor nitpick) We could pass the n_classes?

@amueller

amueller Mar 2, 2016

Member

We could, but I'd rather keep the logic of the function simple. What happens if you pass the wrong length?

@raghavrv

Member

raghavrv commented Mar 2, 2016

Is this a fix that should go back to cross_validation.py (and also be backported to the recently released version)?

@lesteve

Member

lesteve commented Mar 2, 2016

numpy.random.RandomState.choice was added in numpy 1.7, according to the numpy doc, so you get some errors like this:

    choice = rng.choice(n_classes, n_samples, replace=False,

AttributeError: 'mtrand.RandomState' object has no attribute 'choice'

@lesteve

Member

lesteve commented Mar 2, 2016

I do agree there was a problem with train getting too many samples, as you noted in the associated issue, and it's great that you managed to remove the dodgy second-pass "unassigned samples" logic.

@GaelVaroquaux

Member

GaelVaroquaux commented Mar 2, 2016

numpy.random.RandomState.choice was added in numpy 1.7,

Use sklearn.utils.random.choice

To find such a thing, don't hesitate to do a git grep in the scikit-learn
codebase.

@vighneshbirodkar

Contributor

vighneshbirodkar commented Mar 2, 2016

@amueller What exactly should I be looking for here?

@amueller

Member

amueller commented Mar 2, 2016

Hm, maybe the best choice would be to remove the randomness in picking t_i and n_i and use a mode of the multivariate hypergeometric instead of sampling from it. Not sure whether that is easy to compute. I asked on stackoverflow ^^

@amueller

Member

amueller commented Mar 3, 2016

OK, so I can't spend that much more time on it, but I think the newest version without sampling is OK.
I'm using a bad approximation of the mode of the multivariate hypergeometric. Here "bad approximation" means that it can be off by one for each class, so that should not be too bad for us. This is all about how to do the rounding.

What I'm doing now is computing the most likely draw of n_train points from the original class_counts. Then, from class_counts minus what is in the training part, I get the most likely outcome for drawing n_test points.

There is an alternative, which is "draw n_train points from class_counts, draw n_test points from class_counts, and make sure that they don't sum up to more than class_counts". While this behavior might be a bit more "intuitive", the "ensure that they don't sum up to more than class_counts" part is more or less what was buggy in #6121.

So I'd rather stay with the simpler semantics of doing one draw and then the other draw from what's left over.
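
A minimal, deterministic sketch of those two steps (ignoring the random tie-breaking that comes up later in the thread), using the class counts from the failing test discussed below:

    import numpy as np

    class_counts = np.array([4, 4, 4, 3])
    n_train, n_test = 10, 5

    # most likely per-class training counts: floor the expected counts, then
    # give the leftover draws to the classes with the largest fractional parts
    expected = n_train * class_counts / class_counts.sum()  # [2.67, 2.67, 2.67, 2.0]
    n_i = np.floor(expected).astype(int)                    # [2, 2, 2, 2]
    leftover = n_train - n_i.sum()                          # 2 draws still to place
    n_i[np.argsort(expected - n_i)[::-1][:leftover]] += 1   # e.g. [3, 3, 2, 2]

    # the test counts are the most likely draw of n_test points from what is
    # left; here n_train + n_test uses up all samples, so there is no choice
    t_i = class_counts - n_i                                # e.g. [1, 1, 2, 1]
    print(n_i, t_i)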

I have no idea what to do with the failing tests, though.
The first test is now failing because of the following situation:

(Pdb) p np.bincount(y[test])
array([2, 1, 1, 1])
(Pdb) p np.bincount(y[train])
array([2, 3, 3, 2])
(Pdb) p np.bincount(y)
array([4, 4, 4, 3])

 x: array([ 0.2,  0.3,  0.3,  0.2])
 y: array([ 0.4,  0.2,  0.2,  0.2])
>>  raise AssertionError('\nArrays are not almost equal to 1 decimals\n\n(mismatch 25.0%)\n x: array([ 0.2,  0.3,  0.3,  0.2])\n y: array([ 0.4,  0.2,  0.2,  0.2])')

The distributions of the training and test parts are not the same to one decimal.
But matching them to one decimal is impossible in this setup! I checked, and a training set of [2, 3, 3, 2] is actually the true mode, so we do find the most likely configuration of the training set. As we have n_train + n_test = n_samples in this example, we have no choice in how to create the test set.
I have no idea how to fix the test, or why it was passing before. Probably because of one of the bugs.
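
For reference, the failing comparison can be reproduced from those bincounts (the assertion shown above uses decimal=1):

    import numpy as np

    train_counts = np.array([2, 3, 3, 2])  # np.bincount(y[train]) from the pdb session
    test_counts = np.array([2, 1, 1, 1])   # np.bincount(y[test])

    p_train = train_counts / train_counts.sum()  # [0.2, 0.3, 0.3, 0.2]
    p_test = test_counts / test_counts.sum()     # [0.4, 0.2, 0.2, 0.2]

    # roughly what the test asserts; it fails because the first entries
    # already differ in the first decimal (0.2 vs 0.4)
    np.testing.assert_array_almost_equal(p_train, p_test, decimal=1)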

@amueller

Member

amueller commented Mar 3, 2016

ping @GaelVaroquaux who wrote the tests ;)

@amueller

Member

amueller commented Mar 3, 2016

So indeed, the test that is failing passes on master with this configuration:

ipdb> p np.bincount(y)
array([4, 4, 4, 3])
ipdb> p np.bincount(y[train])
array([3, 3, 3, 2])
ipdb> p np.bincount(y[test])
array([1, 1, 1, 1])

So it violated the n_train / n_test sizes and put one more in the training set than it should have, to balance the classes.
@GaelVaroquaux was that expected behavior?
I mean it kind of makes sense, but it is non-obvious to me.

@MechCoder

Member

MechCoder commented Mar 7, 2016

I can't think clearly enough about how to attain a balance between

a] maintaining the train_size and test_size
b] maintaining the proportionality of the class labels between the train and test sets :(

@MechCoder

Member

MechCoder commented Mar 7, 2016

It is also not clear to me how this block of the previous code preserves the class proportionality. Was it simply that it was not tested enough?

        if len(train) + len(test) < n_train + n_test:
            # We complete by affecting randomly the missing indexes
            missing_indices = np.where(bincount(train + test,
                                                minlength=len(y)) == 0)[0]
            missing_indices = rng.permutation(missing_indices)
            n_missing_train = n_train - len(train)
            n_missing_test = n_test - len(test)

            if n_missing_train > 0:
                train.extend(missing_indices[:n_missing_train])
            if n_missing_test > 0:
                test.extend(missing_indices[-n_missing_test:])

@amueller

Member

amueller commented Mar 8, 2016

@MechCoder it doesn't; it just samples randomly. And yet there are very strict tests that passed.

My intuition of the problem was: n_train and n_test are specified directly by the user, so these are hard limits, while the stratification is a "best effort" kind of thing. The best possible choice is to pick a mode of the multivariate hypergeometric, but doing that exactly would require quite a bit of code. So I do an approximation that might be off by one per class (I have not proven this bound).

@amueller amueller added this to the 0.18 milestone Mar 8, 2016

@amueller amueller changed the title from fix sampling in stratified shuffle split, break tests that test sampling to [MRG] fix sampling in stratified shuffle split Aug 26, 2016

@amueller

Member

amueller commented Aug 26, 2016

This should be good now. Reviews please?

@agramfort

Member

agramfort commented Aug 27, 2016

CIs are not happy

@amueller

Member

amueller commented Aug 29, 2016

Next try. Actually computing the mode (approximately) now, not sampling, to satisfy the proportion tests. But now breaking ties at random to satisfy the IID test.

@amueller

Member

amueller commented Aug 29, 2016

Hm, is there a reason for n_test = ceil(test_size * n_samples) in _validate_shuffle_split? I would have expected round.
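
A tiny made-up example of the difference in question:

    import numpy as np

    test_size, n_samples = 0.3, 11
    print(int(np.ceil(test_size * n_samples)))   # ceil(3.3)  -> 4 test samples
    print(int(np.round(test_size * n_samples)))  # round(3.3) -> 3 test samples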

@amueller

Member

amueller commented Aug 29, 2016

More tests, making the regression more explicit.

# multivariate hypergeometric given by class_counts and n_draws
continuous = n_draws * class_counts / class_counts.sum()
# floored means we don't overshoot n_samples, but probably undershoot
floored = np.floor(continuous)

@jnothman

jnothman Aug 30, 2016

Member

this is identical to using // above?


@ogrisel

ogrisel Sep 8, 2016

Member

non-volatile memory?

@jnothman

jnothman Sep 10, 2016

Member

never mind.

floored[inds] += 1
need_to_add -= add_now
if need_to_add == 0:
    break

@jnothman

jnothman Aug 30, 2016

Member

indentation

# add according to remainder, but break ties
# randomly to avoid biases
for value in values:
    inds, = np.where(remainder == value)

@jnothman

jnothman Aug 30, 2016

Member

It's a bit hard to see that, over a sorted array, this is just a groupby operation. Should you be using groupby?

@amueller

amueller Aug 30, 2016

Member

It's ugly right now. I'll try refactoring using groupby
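
For context, a rough, self-contained sketch of the rounding scheme these diff fragments come from: floor the expected per-class counts, then distribute the remaining draws by descending fractional remainder, shuffling tied classes so ties are broken at random. The variable names mirror the snippets above; the function name and the surrounding wiring are assumptions for illustration, not the merged implementation.

    import numpy as np

    def approx_hypergeom_mode(class_counts, n_draws, rng):
        """Sketch: approximately the most likely per-class counts when drawing
        n_draws samples without replacement from a population with class_counts."""
        class_counts = np.asarray(class_counts, dtype=float)
        # expected counts under the multivariate hypergeometric
        continuous = n_draws * class_counts / class_counts.sum()
        # flooring guarantees we never overshoot n_draws
        floored = np.floor(continuous)
        need_to_add = int(n_draws - floored.sum())
        if need_to_add > 0:
            remainder = continuous - floored
            # unique returns sorted ascending; reverse for largest remainder first
            values = np.unique(remainder)[::-1]
            for value in values:
                inds, = np.where(remainder == value)
                # break ties at random to avoid a systematic bias
                add_now = min(len(inds), need_to_add)
                inds = rng.permutation(inds)[:add_now]
                floored[inds] += 1
                need_to_add -= add_now
                if need_to_add == 0:
                    break
        return floored.astype(int)

    rng = np.random.RandomState(42)
    print(approx_hypergeom_mode([4, 4, 4, 3], 10, rng))  # one of the tied modes, e.g. [3 3 2 2]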

@jnothman

Member

jnothman commented Aug 30, 2016

LGTM

@jnothman jnothman changed the title from [MRG] fix sampling in stratified shuffle split to [MRG+1] fix sampling in stratified shuffle split Aug 31, 2016

@amueller amueller added the Bug label Aug 31, 2016

@amueller

Member

amueller commented Sep 8, 2016

The CI error is a server error in curl.

It is the mostly likely outcome of drawing n_draws many
samples from the population given by class_counts.

@ogrisel

ogrisel Sep 8, 2016

Member

Could you please add a doctest that would serve both as an example, to make it easier to understand what the function is doing, and as a unit test / sanity check on a couple of simple cases?

@amueller

amueller Sep 8, 2016

Member

are doctests run on private functions?

@ogrisel

ogrisel Sep 9, 2016

Member

I don't know, but I would expect so. Have you checked?

@amueller

amueller Sep 9, 2016

Member

I have, they are :)

need_to_add -= add_now
if need_to_add == 0:
    break
return floored.astype(np.int)

@ogrisel

ogrisel Sep 8, 2016

Member

Actually, wouldn't it be possible to just import that function from sklearn.model_selection instead of copy-pasting it?

@amueller

amueller Sep 8, 2016

Member

I don't like importing any of the new stuff into the old stuff because then changes in the new stuff might affect the old stuff. Not that big a danger here, but it might get messy.

@ogrisel

ogrisel Sep 9, 2016

Member

Fair enough.

@ogrisel

Member

ogrisel commented Sep 8, 2016

Weird that the appveyor error report happened on @agramfort's account instead of sklearn-ci. Anyway, this is not a big deal and can be ignored.

There are still some inline comments to tackle in the diff view, but other than that, +1 as well.

@amueller

Member

amueller commented Sep 8, 2016

@ogrisel I added (hopefully helpful) doctests.

array([0, 1, 1, 0])
>>> _approximate_mode(class_counts=np.array([2, 2, 2, 1]),
...                   n_draws=2, rng=42)
array([1, 1, 0, 0])

@ogrisel

ogrisel Sep 9, 2016

Member

Nice. To me those examples explain very intuitively what the function actually does.

@ogrisel

Member

ogrisel commented Sep 9, 2016

LGTM, let me squash-merge that. Thanks for the fix @amueller.

@ogrisel ogrisel merged commit 31a4691 into scikit-learn:master Sep 9, 2016

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@amueller

Member

amueller commented Sep 9, 2016

thanks for the reviews and merge :)

@amueller amueller deleted the amueller:stratified_shuffle_split_simplify branch Sep 9, 2016

yangarbiter added a commit to yangarbiter/scikit-learn that referenced this pull request Sep 10, 2016

[MRG+1] fix sampling in stratified shuffle split (scikit-learn#6472)
Fix sampling in stratified shuffle split, break tests that test sampling.

rsmith54 pushed a commit to rsmith54/scikit-learn that referenced this pull request Sep 14, 2016

[MRG+1] fix sampling in stratified shuffle split (scikit-learn#6472)
Fix sampling in stratified shuffle split, break tests that test sampling.

TomDLT added a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG+1] fix sampling in stratified shuffle split (scikit-learn#6472)
Fix sampling in stratified shuffle split, break tests that test sampling.

Sundrique added a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] fix sampling in stratified shuffle split (scikit-learn#6472)
Fix sampling in stratified shuffle split, break tests that test sampling.

paulha added a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] fix sampling in stratified shuffle split (scikit-learn#6472)
Fix sampling in stratified shuffle split, break tests that test sampling.