Unable to create a small sample of 1000 train and 100 using MultilabelStratifiedShuffleSplit #15

Open

meltedhead opened this issue Mar 31, 2021 · 3 comments

meltedhead commented Mar 31, 2021

Hi trent-b,

Thanks for this repository; I hope you can help with my issue. I have a large JSON dataset, and I want to use MultilabelStratifiedShuffleSplit to draw a smaller sample set from it.

import warnings
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def mlb_train_test_split(labels, test_size, train_size, random_state=0):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        msss = MultilabelStratifiedShuffleSplit(
            test_size=test_size, train_size=train_size, random_state=random_state
        )
    # Only the first split is needed; X is a dummy array because the labels
    # alone drive the stratification.
    train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
    return train_idx, test_idx

I then call the function as:

train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

When I look at the resulting indices, I'm seeing far more than 200 training rows. Is there a limitation? The labels array has approximately 500,000 rows.
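
For reference, a self-contained script along these lines should exercise the same call pattern; the random label matrix and its shape are illustrative stand-ins for the real dataset, not the actual data:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

# Random binary labels standing in for the real ~500,000-row multilabel data;
# the shape (500_000, 10) is assumed purely for illustration.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(500_000, 10))

msss = MultilabelStratifiedShuffleSplit(
    n_splits=1, test_size=1000, train_size=200, random_state=0
)
train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))

# With train_size=200 requested, train_idx should hold roughly 200 rows.
print("train rows:", len(train_idx), "test rows:", len(test_idx))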

trent-b commented Apr 1, 2021

meltedhead,

Thank you for catching this bug. I do not think I ever tested with train_size set to a value other than None. As a workaround, you could do the following:

# Stage 1: carve out a stratified subset of 1200 samples (1000 test + 200 train).
_, test_idx = mlb_train_test_split(labels, test_size=1200, train_size=None, random_state=0)
subset_labels = labels[test_idx].copy()
# Stage 2: split that 1200-sample subset into 200 train and 1000 test samples.
train_idx, test_idx = mlb_train_test_split(subset_labels, test_size=1000, train_size=None, random_state=1)
print('Num train labels:', len(subset_labels[train_idx]), '; proportions:', np.mean(subset_labels[train_idx], axis=0))
print('Num test labels:', len(subset_labels[test_idx]), '; proportions:', np.mean(subset_labels[test_idx], axis=0))
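
If this two-stage split is needed in more than one place, it could be wrapped in a small helper along the lines of the sketch below; the name mlb_stratified_subset and its argument names are hypothetical, not part of iterstrat:

def mlb_stratified_subset(labels, n_train, n_test, random_state=0):
    # Stage 1: carve out a stratified subset of n_train + n_test samples.
    _, subset_idx = mlb_train_test_split(
        labels, test_size=n_train + n_test, train_size=None, random_state=random_state
    )
    # Stage 2: split that subset into train and test, again with stratification.
    subset_labels = labels[subset_idx]
    train_rel, test_rel = mlb_train_test_split(
        subset_labels, test_size=n_test, train_size=None, random_state=random_state + 1
    )
    # Map the subset-relative indices back to positions in the full label array.
    return subset_idx[train_rel], subset_idx[test_rel]

Calling mlb_stratified_subset(labels, n_train=200, n_test=1000) should then return roughly 200 train and 1000 test indices into the original array, subject to the small size deviations discussed further down.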

nlassaux commented Apr 12, 2023

Hi there,

I don't know if it helps, but I see the same behavior in the following case with only test_size set:

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

# Random multilabel targets (600 samples, 40 labels) and dummy features.
y = np.random.randint(2, size=(600, 40))
X = np.random.randint(2, size=(600, 5))

expected_test_size = 64
mskf = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=expected_test_size)

for train_index, test_index in mskf.split(X, y):
    print("TRAIN:", len(train_index), "TEST:", len(test_index))

The above prints:

TRAIN: 529 TEST: 71
TRAIN: 533 TEST: 67
TRAIN: 531 TEST: 69
TRAIN: 532 TEST: 68
TRAIN: 532 TEST: 68
TRAIN: 530 TEST: 70
TRAIN: 532 TEST: 68
TRAIN: 532 TEST: 68
TRAIN: 533 TEST: 67
TRAIN: 533 TEST: 67

but I expected:

TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64
TRAIN: 536 TEST: 64

@nlassaux

Ah, I just read this in the documentation for MultilabelStratifiedShuffleSplit:

Train and test sizes may be slightly different from desired due to the
preference of stratification over perfectly sized folds.

Given that the labels in the case above should be very evenly distributed, I wonder whether a split that exactly matches the requested test size is really that uncommon.
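
If an exact test size matters more than perfect stratification, one option, as a workaround rather than anything iterstrat provides, is to post-process each split and move the surplus test indices back into the train fold. The sketch below reuses mskf, X, y, and expected_test_size from the snippet above and only handles the surplus case seen there:

rng = np.random.default_rng(0)

for train_index, test_index in mskf.split(X, y):
    # Move randomly chosen surplus test samples back to the train side so the
    # test fold ends up with exactly expected_test_size entries. This slightly
    # weakens the stratification for the moved samples.
    surplus = len(test_index) - expected_test_size
    if surplus > 0:
        moved = rng.choice(len(test_index), size=surplus, replace=False)
        train_index = np.concatenate([train_index, test_index[moved]])
        test_index = np.delete(test_index, moved)
    print("TRAIN:", len(train_index), "TEST:", len(test_index))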
