Description
Unfortunately, probably due to the stratification balancing, the train and test sizes can vary between folds. This is highly undesirable: the splitter should always rebalance to the same sizes. Otherwise, combining all the folds and doing the computations with efficient vectorized operations on a matrix with an extra (fold) dimension is not easily possible, as the sketch below illustrates.
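To make the vectorization point concrete, here is a minimal sketch of my own (not part of the reproduction further down): with equal-sized folds the test indices can be stacked into a single (n_splits, fold_size) array and reduced in one vectorized expression, whereas unequal folds cannot be stacked at all.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(7590)[:, np.newaxis]
y = np.random.RandomState(0).randint(0, 3, 7590)

# Plain KFold yields equal-sized folds here because 7590 % 5 == 0, so the test
# index arrays stack into one (n_splits, fold_size) matrix and per-fold
# quantities can be computed in a single vectorized expression.
tests = [test for _, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X)]
stacked = np.stack(tests)                      # shape (5, 1518)
class0_share = (y[stacked] == 0).mean(axis=1)  # fraction of class 0 per fold, one op

# With the StratifiedKFold sizes reported below (scikit-learn 0.21.3) the fold
# lengths differ, so the same stacking is impossible.
tests = [test for _, test in
         StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)]
try:
    np.stack(tests)
except ValueError as exc:
    print("cannot stack unequal folds:", exc)
else:
    print("folds happen to be equal-sized on this scikit-learn version")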
This honestly feels like a poor stratification algorithm. At worst it is a flat-out bug; at best it is a mode of operation, in which case a Boolean flag such as force_exact_division=True should be added. Of course, in some cases it is impossible to balance the classes exactly, but even then the splitter could prefer to guarantee equal fold sizes rather than equal per-fold class counts, since only fractional remainders are at stake - and the class proportions are never going to be perfectly balanced anyway, which can be unavoidable when, for example, only one sample belongs to a class. This is surprisingly awkward behaviour.
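To spell out the remainder argument, here is a small illustration of my own using the same data as the reproduction below: the per-class counts are generally not divisible by n_splits, but whenever the total sample count is, the per-class remainders necessarily sum to a multiple of n_splits, so equal total fold sizes are always achievable by spreading the leftovers across folds (at most one extra sample of a given class per fold).

import numpy as np

n_splits = 5
y = np.random.RandomState(0).randint(0, 3, 7590)

counts = np.bincount(y)      # samples per class
print(counts)                # generally not divisible by n_splits
print(counts % n_splits)     # per-class remainders that have to land somewhere

# The total, 7590, is divisible by n_splits, so those remainders necessarily
# sum to a multiple of n_splits and could be spread so that every test fold
# receives exactly 7590 // n_splits == 1518 samples in total.
print((counts % n_splits).sum() % n_splits)   # always 0 here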
Steps/Code to Reproduce
import numpy as np
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
X, y = np.arange(7590)[:, np.newaxis], np.random.RandomState(0).randint(0, 3, 7590)

# Print the train/test size of every fold.
for train, test in kfold.split(X, y):
    print(len(train), len(test))

# 7590 is divisible by 5, so every test fold should hold exactly 1518 samples.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in kfold.split(X, y):
    assert(len(test) == 7590/5)
    assert(len(train) == 7590*4/5)
Expected Results
6072 1518
6072 1518
6072 1518
6072 1518
6072 1518
Actual Results
6071 1519
6071 1519
6072 1518
6072 1518
6074 1516
Traceback (most recent call last):
File "<ipython-input-106-1c41a73d61b2>", line 8, in <module>
assert(len(test) == 7590/5)
AssertionError
The fold sizes always seem to come out ordered from min(len(train)) to max(len(train)).
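Until this changes, a possible workaround for the stacking use case (my own sketch, not a scikit-learn API) is to truncate every test fold to the common minimum length before stacking; it discards a handful of samples per fold but restores the rectangular shape:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(7590)[:, np.newaxis]
y = np.random.RandomState(0).randint(0, 3, 7590)

tests = [test for _, test in
         StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)]

# Drop the trailing indices of the larger folds so all folds share the minimum
# length and can be stacked into a single (n_splits, fold_size) array.
min_len = min(len(t) for t in tests)
stacked = np.stack([t[:min_len] for t in tests])
print(stacked.shape)   # e.g. (5, 1516) with the fold sizes shown above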
Versions
Windows-10-10.0.18362-SP0
Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
NumPy 1.17.3
SciPy 1.3.1
Scikit-Learn 0.21.3