[MRG+1] fix sampling in stratified shuffle split #6472
Changes from all commits
dea5a0e
3d13d37
2644a6e
143addf
f227e8c
7fdf852
7c59f2f
1903e85
ae20158
e2883e3
e982ace
c865c83
4bf494c
@@ -27,6 +27,7 @@
 from .utils.validation import (_is_arraylike, _num_samples,
                                column_or_1d)
 from .utils.multiclass import type_of_target
+from .utils.random import choice
 from .externals.joblib import Parallel, delayed, logger
 from .externals.six import with_metaclass
 from .externals.six.moves import zip
@@ -414,9 +415,9 @@ def __init__(self, labels, n_folds=3):
         if n_folds > n_labels:
             raise ValueError(
-                ("Cannot have number of folds n_folds={0} greater"
-                 " than the number of labels: {1}.").format(n_folds,
-                                                            n_labels))
+                ("Cannot have number of folds n_folds={0} greater"
+                 " than the number of labels: {1}.").format(n_folds,
+                                                            n_labels))

         # Weight labels by their number of occurrences
         n_samples_per_label = np.bincount(labels)
@@ -906,6 +907,59 @@ def _validate_shuffle_split(n, test_size, train_size):
     return int(n_train), int(n_test)


+def _approximate_mode(class_counts, n_draws, rng):
+    """Computes the approximate mode of a multivariate hypergeometric.
+
+    This is an approximation to the mode of the multivariate
+    hypergeometric given by class_counts and n_draws.
+    It shouldn't be off by more than one.
+
+    It is the most likely outcome of drawing n_draws many
+    samples from the population given by class_counts.
+
+    Parameters
+    ----------
+    class_counts : ndarray of int
+        Population per class.
+    n_draws : int
+        Number of draws (samples to draw) from the overall population.
+    rng : random state
+        Used to break ties.
+
+    Returns
+    -------
+    sampled_classes : ndarray of int
+        Number of samples drawn from each class.
+        np.sum(sampled_classes) == n_draws
+    """
+    # this computes a bad approximation to the mode of the
+    # multivariate hypergeometric given by class_counts and n_draws
+    continuous = n_draws * class_counts / class_counts.sum()
+    # flooring means we don't overshoot n_draws, but probably undershoot
+    floored = np.floor(continuous)
+    # we add samples according to how much "left over" probability
+    # they had, until we arrive at n_draws
+    need_to_add = int(n_draws - floored.sum())
+    if need_to_add > 0:
+        remainder = continuous - floored
+        values = np.sort(np.unique(remainder))[::-1]
+        # add according to remainder, but break ties
+        # randomly to avoid biases
+        for value in values:
+            inds, = np.where(remainder == value)
+            # if need_to_add is less than what's in inds,
+            # we draw randomly from them.
+            # if we need to add more, we add them all and
+            # go to the next value
+            add_now = min(len(inds), need_to_add)
+            inds = choice(inds, size=add_now, replace=False, random_state=rng)
+            floored[inds] += 1
+            need_to_add -= add_now
+            if need_to_add == 0:
+                break
+    return floored.astype(np.int)
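The allocation strategy above can be sketched as standalone NumPy, using `numpy.random.RandomState.choice` in place of sklearn's internal `choice` backport (an illustration of the technique, not the PR code):

```python
import numpy as np

def approximate_mode(class_counts, n_draws, rng):
    """Approximate mode of the multivariate hypergeometric (sketch)."""
    # ideal fractional allocation, proportional to class size
    continuous = n_draws * class_counts / class_counts.sum()
    # flooring guarantees we never overshoot n_draws
    floored = np.floor(continuous)
    need_to_add = int(n_draws - floored.sum())
    if need_to_add > 0:
        remainder = continuous - floored
        # hand the leftover draws to the largest remainders,
        # breaking ties at random to avoid systematic bias
        for value in np.sort(np.unique(remainder))[::-1]:
            inds, = np.where(remainder == value)
            add_now = min(len(inds), need_to_add)
            inds = rng.choice(inds, size=add_now, replace=False)
            floored[inds] += 1
            need_to_add -= add_now
            if need_to_add == 0:
                break
    return floored.astype(int)

rng = np.random.RandomState(0)
exact = approximate_mode(np.array([4, 2]), 3, rng)       # shares already integral
tied = approximate_mode(np.array([2, 2, 2, 1]), 2, rng)  # random tie-break
```

With class_counts = [4, 2] and 3 draws the fractional shares are already integral, so the result is [2, 1] regardless of rng; in the tied case only the invariants are deterministic: the total equals n_draws and no class exceeds its population.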
Review thread:
- Actually, wouldn't it be possible to just import that function from ...?
- I don't like importing any of the new stuff into the old stuff, because then changes in the new stuff might affect the old stuff. Not that big a danger here, but it might get messy.
- Fair enough.
 class StratifiedShuffleSplit(BaseShuffleSplit):
     """Stratified ShuffleSplit cross validation iterator
@@ -991,39 +1045,24 @@ def __init__(self, y, n_iter=10, test_size=0.1, train_size=None,
     def _iter_indices(self):
         rng = check_random_state(self.random_state)
         cls_count = bincount(self.y_indices)
-        p_i = cls_count / float(self.n)
-        n_i = np.round(self.n_train * p_i).astype(int)
-        t_i = np.minimum(cls_count - n_i,
-                         np.round(self.n_test * p_i).astype(int))

         for n in range(self.n_iter):
+            # if there are ties in the class-counts, we want
+            # to make sure to break them anew in each iteration
+            n_i = _approximate_mode(cls_count, self.n_train, rng)
+            class_counts_remaining = cls_count - n_i
+            t_i = _approximate_mode(class_counts_remaining, self.n_test, rng)

             train = []
             test = []

-            for i, cls in enumerate(self.classes):
+            for i, _ in enumerate(self.classes):
                 permutation = rng.permutation(cls_count[i])
-                cls_i = np.where((self.y == cls))[0][permutation]
-
-                train.extend(cls_i[:n_i[i]])
-                test.extend(cls_i[n_i[i]:n_i[i] + t_i[i]])
-
-            # Because of rounding issues (as n_train and n_test are not
-            # dividers of the number of elements per class), we may end
-            # up here with less samples in train and test than asked for.
-            if len(train) + len(test) < self.n_train + self.n_test:
-                # We complete by affecting randomly the missing indexes
-                missing_idx = np.where(bincount(train + test,
-                                                minlength=len(self.y)) == 0,
-                                       )[0]
-                missing_idx = rng.permutation(missing_idx)
-                n_missing_train = self.n_train - len(train)
-                n_missing_test = self.n_test - len(test)
-
-                if n_missing_train > 0:
-                    train.extend(missing_idx[:n_missing_train])
-                if n_missing_test > 0:
-                    test.extend(missing_idx[-n_missing_test:])
+                perm_indices_class_i = np.where(
+                    (i == self.y_indices))[0][permutation]
+
+                train.extend(perm_indices_class_i[:n_i[i]])
+                test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])

             train = rng.permutation(train)
             test = rng.permutation(test)
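The indexing idiom in the new loop (permute the positions of class i via `np.where(i == y_indices)[0][permutation]`, then slice the first `n_i[i]` entries for train and the next `t_i[i]` for test) can be exercised on toy data with hand-picked quotas; a sketch, not the PR code:

```python
import numpy as np

rng = np.random.RandomState(0)
y_indices = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2])  # encoded labels
cls_count = np.bincount(y_indices)   # [4, 3, 5]
n_i = np.array([2, 2, 3])            # per-class train quota (chosen by hand)
t_i = np.array([1, 1, 1])            # per-class test quota (chosen by hand)

train, test = [], []
for i in range(len(cls_count)):
    # shuffle the positions belonging to class i
    permutation = rng.permutation(cls_count[i])
    perm_indices_class_i = np.where(i == y_indices)[0][permutation]
    # first n_i[i] shuffled positions go to train, next t_i[i] to test
    train.extend(perm_indices_class_i[:n_i[i]])
    test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])
```

Train and test are disjoint by construction and hit their per-class quotas exactly, which is why the old "complete the missing indexes" fallback could be deleted.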
sklearn/model_selection/_split.py
@@ -30,6 +30,7 @@
 from ..externals.six.moves import zip
 from ..utils.fixes import bincount
 from ..utils.fixes import signature
+from ..utils.random import choice
 from ..base import _pprint
 from ..gaussian_process.kernels import Kernel as GPKernel
@@ -1098,6 +1099,73 @@ def _iter_indices(self, X, y, labels):
         yield train, test


+def _approximate_mode(class_counts, n_draws, rng):
+    """Computes the approximate mode of a multivariate hypergeometric.
+
+    This is an approximation to the mode of the multivariate
+    hypergeometric given by class_counts and n_draws.
+    It shouldn't be off by more than one.
+
+    It is the most likely outcome of drawing n_draws many
+    samples from the population given by class_counts.
+
+    Parameters
+    ----------
+    class_counts : ndarray of int
+        Population per class.
+    n_draws : int
+        Number of draws (samples to draw) from the overall population.
+    rng : random state
+        Used to break ties.
+
+    Returns
+    -------
+    sampled_classes : ndarray of int
+        Number of samples drawn from each class.
+        np.sum(sampled_classes) == n_draws
+
+    Examples
+    --------
+    >>> from sklearn.model_selection._split import _approximate_mode
+    >>> _approximate_mode(class_counts=np.array([4, 2]), n_draws=3, rng=0)
+    array([2, 1])
+    >>> _approximate_mode(class_counts=np.array([5, 2]), n_draws=4, rng=0)
+    array([3, 1])
+    >>> _approximate_mode(class_counts=np.array([2, 2, 2, 1]),
+    ...                   n_draws=2, rng=0)
+    array([0, 1, 1, 0])
+    >>> _approximate_mode(class_counts=np.array([2, 2, 2, 1]),
+    ...                   n_draws=2, rng=42)
+    array([1, 1, 0, 0])
Review comment:
- Nice. To me those examples explain very intuitively what the function actually does.
+    """
+    # this computes a bad approximation to the mode of the
+    # multivariate hypergeometric given by class_counts and n_draws
+    continuous = n_draws * class_counts / class_counts.sum()
+    # flooring means we don't overshoot n_draws, but probably undershoot
+    floored = np.floor(continuous)
Review thread:
- this is identical to using ...
- nvm
- non-volatile memory?
- never mind.
+    # we add samples according to how much "left over" probability
+    # they had, until we arrive at n_draws
+    need_to_add = int(n_draws - floored.sum())
+    if need_to_add > 0:
+        remainder = continuous - floored
+        values = np.sort(np.unique(remainder))[::-1]
+        # add according to remainder, but break ties
+        # randomly to avoid biases
+        for value in values:
+            inds, = np.where(remainder == value)
Review thread:
- It's a bit hard to read that, over a sorted array, this is just a groupby operation. Should you be using groupby?
- It's ugly right now. I'll try refactoring using groupby.
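The groupby refactor the reviewer hints at might look roughly like this, iterating tie groups of the remainder in descending order via `itertools.groupby` over a stable sort. This is hypothetical; the merged code kept the `np.unique`/`np.where` version:

```python
import numpy as np
from itertools import groupby

def remainder_groups(remainder):
    """Yield (value, indices) groups tied on remainder, largest value first."""
    order = np.argsort(-remainder, kind="mergesort")  # stable, descending
    for value, grp in groupby(order, key=lambda i: remainder[i]):
        yield value, list(grp)

remainder = np.array([0.25, 0.5, 0.5, 0.1])
groups = list(remainder_groups(remainder))
```

Each group would then receive up to need_to_add extra draws, with a random subset taken from the first group that cannot be added wholesale.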
+            # if need_to_add is less than what's in inds,
+            # we draw randomly from them.
+            # if we need to add more, we add them all and
+            # go to the next value
+            add_now = min(len(inds), need_to_add)
+            inds = choice(inds, size=add_now, replace=False, random_state=rng)
+            floored[inds] += 1
+            need_to_add -= add_now
+            if need_to_add == 0:
Review thread:
- In our tests, we always have ...
- I found that this array: np.concatenate([[i] * (100 + i) for i in range(11)]) yielded need_to_add = 7. Since there are no ties it will execute the loop 7 times, so that covers the edge case I was mentioning above. Adding it to the tests would be great, i.e.:

      diff --git a/sklearn/model_selection/tests/test_split.py b/sklearn/model_selection/tests/test_split.py
      index f127ab1..d413018 100644
      --- a/sklearn/model_selection/tests/test_split.py
      +++ b/sklearn/model_selection/tests/test_split.py
      @@ -551,7 +551,8 @@ def test_stratified_shuffle_split_iter():
           np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
           np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2] * 2),
           np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4]),
      -    np.array([-1] * 800 + [1] * 50)
      +    np.array([-1] * 800 + [1] * 50),
      +    np.concatenate([[i] * (100 + i) for i in range(11)])
       ]
       for y in ys:

- This comment was taken into account.
+                break
+    return floored.astype(np.int)
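The reviewer's need_to_add = 7 claim checks out. With classes of sizes 100 through 110 (1155 samples) and the test's 0.33 test fraction, the floored train-side allocation falls exactly 7 short, and the 11 remainders are all distinct, so the top-up loop runs 7 times with no random tie-breaking. A quick reproduction, assuming (as in the tests) n_train = len(y) - ceil(0.33 * len(y)):

```python
import numpy as np

y = np.concatenate([[i] * (100 + i) for i in range(11)])
class_counts = np.bincount(y)             # [100, 101, ..., 110]
n_test = int(np.ceil(0.33 * len(y)))      # 382 of 1155 samples
n_train = len(y) - n_test                 # 773

# floored allocation for the train side, as in _approximate_mode
continuous = n_train * class_counts / class_counts.sum()
floored = np.floor(continuous)
need_to_add = int(n_train - floored.sum())
remainder = continuous - floored
```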
 class StratifiedShuffleSplit(BaseShuffleSplit):
     """Stratified ShuffleSplit cross-validator
@@ -1181,12 +1249,14 @@ def _iter_indices(self, X, y, labels=None):
                              (n_test, n_classes))

         rng = check_random_state(self.random_state)
-        p_i = class_counts / float(n_samples)
-        n_i = np.round(n_train * p_i).astype(int)
-        t_i = np.minimum(class_counts - n_i,
-                         np.round(n_test * p_i).astype(int))

         for _ in range(self.n_splits):
+            # if there are ties in the class-counts, we want
+            # to make sure to break them anew in each iteration
+            n_i = _approximate_mode(class_counts, n_train, rng)
+            class_counts_remaining = class_counts - n_i
+            t_i = _approximate_mode(class_counts_remaining, n_test, rng)

             train = []
             test = []

@@ -1196,23 +1266,6 @@ def _iter_indices(self, X, y, labels=None):

             train.extend(perm_indices_class_i[:n_i[i]])
             test.extend(perm_indices_class_i[n_i[i]:n_i[i] + t_i[i]])

-            # Because of rounding issues (as n_train and n_test are not
-            # dividers of the number of elements per class), we may end
-            # up here with less samples in train and test than asked for.
-            if len(train) + len(test) < n_train + n_test:
-                # We complete by affecting randomly the missing indexes
-                missing_indices = np.where(bincount(train + test,
-                                                    minlength=len(y)) == 0)[0]
-                missing_indices = rng.permutation(missing_indices)
-                n_missing_train = n_train - len(train)
-                n_missing_test = n_test - len(test)
-
-                if n_missing_train > 0:
-                    train.extend(missing_indices[:n_missing_train])
-                if n_missing_test > 0:
-                    test.extend(missing_indices[-n_missing_test:])

             train = rng.permutation(train)
             test = rng.permutation(test)
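For context on why the one-shot `np.round` allocation had to go: rounding each class's share independently can miss the requested total in either direction, which is exactly what the deleted completion block was compensating for. A minimal illustration of the overshoot, using toy counts not taken from the PR:

```python
import numpy as np

class_counts = np.array([3, 3, 3])   # 9 samples in 3 equal classes
n_train = 5
p_i = class_counts / float(class_counts.sum())

# old scheme: round each class's share independently
n_i_old = np.round(n_train * p_i).astype(int)   # each 5/3 = 1.67 rounds up to 2

# new scheme (deterministic core): floor, then top up the largest remainders
continuous = n_train * p_i
floored = np.floor(continuous).astype(int)
order = np.argsort(-(continuous - floored))
floored[order[:int(n_train - floored.sum())]] += 1
n_i_new = floored
```

The rounded allocation hands out 6 train samples when 5 were requested; the floor-plus-remainder allocation always sums exactly to n_train.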
@@ -479,24 +479,30 @@ def test_stratified_shuffle_split_init():
 def test_stratified_shuffle_split_iter():
     ys = [np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]),
           np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
-          np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2]),
+          np.array([0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2] * 2),
           np.array([1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4]),
           np.array([-1] * 800 + [1] * 50)
           ]

     for y in ys:
         sss = cval.StratifiedShuffleSplit(y, 6, test_size=0.33,
                                           random_state=0)
+        test_size = np.ceil(0.33 * len(y))
+        train_size = len(y) - test_size
         for train, test in sss:
             assert_array_equal(np.unique(y[train]), np.unique(y[test]))
             # Checks if folds keep classes proportions
-            p_train = (np.bincount(np.unique(y[train], return_inverse=True)[1])
-                       / float(len(y[train])))
-            p_test = (np.bincount(np.unique(y[test], return_inverse=True)[1])
-                      / float(len(y[test])))
+            p_train = (np.bincount(np.unique(y[train],
+                                   return_inverse=True)[1]) /
+                       float(len(y[train])))
+            p_test = (np.bincount(np.unique(y[test],
+                                  return_inverse=True)[1]) /
+                      float(len(y[test])))
             assert_array_almost_equal(p_train, p_test, 1)
-            assert_equal(y[train].size + y[test].size, y.size)
-            assert_array_equal(np.intersect1d(train, test), [])
+            assert_equal(len(train) + len(test), y.size)
+            assert_equal(len(train), train_size)
+            assert_equal(len(test), test_size)
+            assert_array_equal(np.lib.arraysetops.intersect1d(train, test), [])
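The new exact-size assertions only hold because both n_i and t_i now come from the floor-plus-remainder allocation. For y = [-1] * 800 + [1] * 50 the bookkeeping works out as follows (a deterministic sketch of the allocation, with ties broken by sort order rather than by rng):

```python
import numpy as np

def allocate(class_counts, n_draws):
    # deterministic core of the approximate-mode allocation
    continuous = n_draws * class_counts / class_counts.sum()
    floored = np.floor(continuous).astype(int)
    order = np.argsort(-(continuous - floored))
    floored[order[:int(n_draws - floored.sum())]] += 1
    return floored

y = np.array([-1] * 800 + [1] * 50)
class_counts = np.bincount(np.unique(y, return_inverse=True)[1])  # [800, 50]
test_size = int(np.ceil(0.33 * len(y)))   # 281 of 850
train_size = len(y) - test_size           # 569

n_i = allocate(class_counts, train_size)       # per-class train sizes
t_i = allocate(class_counts - n_i, test_size)  # per-class test sizes
```

len(train) == 569 and len(test) == 281 then hold exactly, with no leftover samples to distribute.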
Review thread:
- Just curious, why replace np.intersect1d by np.lib.arraysetops.intersect1d? AFAICT they are identical.
- not my doing.
- 😕 this is not what the diff is saying... but it doesn't really matter. I reckon you copied and pasted from another test.
- I backported the new implementation of ...
 def test_stratified_shuffle_split_even():
Review thread:
- Could you please add a doctest that would serve both as an example, to make it easier to understand what the function is doing, and as a unit test / sanity check on a couple of simple cases?
- are doctests run on private functions?
- I don't know, but I would expect so. Have you checked?
- I have, they are :)