Splitting a dataset from tfds results in inaccurate partition sizes #665

Closed
vmirly opened this issue Jun 12, 2019 · 6 comments
Labels
bug Something isn't working

Comments

@vmirly

vmirly commented Jun 12, 2019

I am trying to split the iris dataset into train/test, with 2/3 for training and 1/3 for testing, so I used tfds.percent as follows:

import tensorflow_datasets as tfds

first_67_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:67])
last_33_percent = tfds.Split.TRAIN.subsplit(tfds.percent[-33:])

ds_train_orig = tfds.load('iris', split=first_67_percent)
ds_test = tfds.load('iris', split=last_33_percent)

But this results in 117 samples for training and 33 for testing, which is not correct. It should be 100 for training and 50 for testing. I also tried subsplit with weights, as follows, but got similar results:

split_train, split_test = tfds.Split.TRAIN.subsplit([2, 1])

ds_train_orig = tfds.load('iris', split=split_train)
ds_test = tfds.load('iris', split=split_test)

This time, it gives me 116 train and 34 test samples. I have also posted this question on Stack Overflow: https://stackoverflow.com/questions/56553357/splitting-a-tensorflow-dataset-using-tfds-percent

vmirly added the bug label Jun 12, 2019
vmirly changed the title from "Splitting a dataset from tfds results in inaccurate splits" to "Splitting a dataset from tfds results in inaccurate partition sizes" Jun 12, 2019
@Conchylicultor
Member

Thanks for reporting. This is a known issue caused by the implementation, which rounds the percentages when the number of examples in the dataset is not a multiple of 100 (number of examples % 100 != 0).
We are currently in the process of updating this with a better implementation.
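
The counts reported above are consistent with a simple bucket model. The sketch below is only an illustration under stated assumptions, not the library's actual code: it assumes example i lands in bucket i % 100, that tfds.percent[:67] selects buckets 0 through 66, and that subsplit([2, 1]) cuts the 100 buckets into 66 and 34. The helper names are hypothetical.

```python
def bucket_sizes(num_examples, num_buckets=100):
    """Count examples per bucket, assuming example i goes to bucket i % num_buckets."""
    sizes = [0] * num_buckets
    for i in range(num_examples):
        sizes[i % num_buckets] += 1
    return sizes


def count_in_buckets(num_examples, start, stop):
    """Number of examples that fall in buckets [start, stop)."""
    return sum(bucket_sizes(num_examples)[start:stop])


# iris has 150 examples, so buckets 0-49 hold 2 examples and buckets 50-99 hold 1.
train_pct = count_in_buckets(150, 0, 67)    # assumed meaning of percent[:67]
test_pct = count_in_buckets(150, 67, 100)   # assumed meaning of percent[-33:]

train_w = count_in_buckets(150, 0, 66)      # assumed 66/34 bucket cut for weights [2, 1]
test_w = count_in_buckets(150, 66, 100)

print(train_pct, test_pct)  # 117 33
print(train_w, test_w)      # 116 34
```

Under this model, the split sizes are exact only when every bucket holds the same number of examples, that is, when the dataset size is a multiple of 100.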

@vmirly
Author

vmirly commented Jun 12, 2019

Ok, cool. That sounds great! 👍

For now, I figured I can load the entire dataset and then use ds_train = ds.take(100) and ds_test = ds.skip(100) to split it into train/test.

@Conchylicultor
Member

Be careful with this: the data is not guaranteed to be generated in the same order for the train set unless you use as_dataset(shuffle_files=False). Otherwise you may end up with overlapping train/test sets.
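
The hazard can be illustrated without TensorFlow. The sketch below is a pure-Python stand-in, with hypothetical take and skip helpers that mimic the dataset methods on plain lists. It compares a take/skip split when both passes see the same order against one where the second pass sees a different order (simply reversed here, to keep the example deterministic).

```python
def take(order, n):
    """First n elements of a pass over the data (stand-in for ds.take)."""
    return order[:n]


def skip(order, n):
    """Everything after the first n elements (stand-in for ds.skip)."""
    return order[n:]


examples = list(range(150))

# Case 1: both splits are read from passes that yield the same order.
stable_train = set(take(examples, 100))
stable_test = set(skip(examples, 100))

# Case 2: the test split is read from a pass with a different order
# (reversed is an arbitrary choice made for this illustration).
second_pass = list(reversed(examples))
unstable_train = set(take(examples, 100))
unstable_test = set(skip(second_pass, 100))

stable_overlap = stable_train & stable_test
unstable_overlap = unstable_train & unstable_test

print(len(stable_overlap), len(unstable_overlap))  # 0 50
```

A take/skip split is only a true partition when every pass over the source yields the examples in the same order.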

@vmirly
Author

vmirly commented Jun 16, 2019

Yes, thanks.
I shuffle twice: once on the original dataset before splitting into train/test, and again on the train set after the split. In the first shuffle, I set reshuffle_each_iteration=False:

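import tensorflow as tf

# Shuffle once with a fixed order, so take() and skip() see the same sequence.
ds = tf.data.Dataset.range(15)
ds = ds.shuffle(15, reshuffle_each_iteration=False)

# Split: the first 10 elements for training, the remaining 5 for testing.
ds_train = ds.take(10)
ds_test = ds.skip(10)

# Reshuffle within each split on every repetition.
ds_train = ds_train.shuffle(10).repeat(10)
ds_test = ds_test.shuffle(5).repeat(10)

# Collect the distinct values seen in each split across all repetitions.
set_train = set()
for item in ds_train:
    set_train.add(item.numpy())

set_test = set()
for item in ds_test:
    set_test.add(item.numpy())

print(set_train, set_test)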

If I don't do that, repeating the train and test datasets results in overlapping samples.

@kaushikacharya

this time, it gives me 116 train and 34 test samples. I have also posted this question on Stackoverflow: https://stackoverflow.com/questions/56553357/splitting-a-tensorflow-dataset-using-tfds-percent

Related issue: #292

vmirly closed this as completed Jul 3, 2019
@vmirly
Author

vmirly commented Jul 3, 2019

Thanks, I closed the issue then.
