Splitting a dataset from tfds results in inaccurate partition sizes #665

vmirly · 2019-06-12T02:02:39Z

I am trying to split the iris dataset into train/test sets, with 2/3 for training and 1/3 for testing. I used tfds.percent as follows:

import tensorflow_datasets as tfds

first_67_percent = tfds.Split.TRAIN.subsplit(tfds.percent[:67])
last_33_percent = tfds.Split.TRAIN.subsplit(tfds.percent[-33:])

ds_train_orig = tfds.load('iris', split=first_67_percent)
ds_test = tfds.load('iris', split=last_33_percent)

However, this yields 117 samples for training and 33 for testing, which is not correct: it should be 100 for training and 50 for testing. I also tried a weighted subsplit, as follows, but got similar results:

split_train, split_test = tfds.Split.TRAIN.subsplit([2, 1])

ds_train_orig = tfds.load('iris', split=split_train)
ds_test = tfds.load('iris', split=split_test)

This time, it gives me 116 train and 34 test samples. I have also posted this question on Stack Overflow: https://stackoverflow.com/questions/56553357/splitting-a-tensorflow-dataset-using-tfds-percent

Conchylicultor · 2019-06-12T15:43:02Z

Thanks for reporting. This is a known issue: the current implementation rounds the percentages when the number of examples in the dataset is not a multiple of 100 (num_examples % 100 != 0).
We are currently in the process of replacing it with a better implementation.
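The reported sizes can be reproduced with a small pure-Python model. This is only a sketch under an assumption that the thread does not confirm: each example is placed round-robin into one of 100 percent buckets (index % 100), and a subsplit selects whole buckets. The helper name `bucketed_subsplit_size` is hypothetical and is not part of tfds.

```python
def bucketed_subsplit_size(num_examples, buckets):
    """Count examples whose bucket (index % 100) falls in `buckets`.

    Assumed model, not the confirmed tfds implementation.
    """
    return sum(1 for i in range(num_examples) if i % 100 in buckets)


NUM_EXAMPLES = 150  # size of the iris train split

# tfds.percent[:67] and tfds.percent[-33:] under the assumed model
first_67 = bucketed_subsplit_size(NUM_EXAMPLES, range(0, 67))
last_33 = bucketed_subsplit_size(NUM_EXAMPLES, range(67, 100))

# subsplit([2, 1]) under the assumed model: 100 buckets divided 66 / 34
weighted_train = bucketed_subsplit_size(NUM_EXAMPLES, range(0, 66))
weighted_test = bucketed_subsplit_size(NUM_EXAMPLES, range(66, 100))

print(first_67, last_33)              # 117 33
print(weighted_train, weighted_test)  # 116 34
```

With 150 examples, the first 50 buckets hold two examples each and the last 50 hold one, so any slice that starts at bucket 0 is over-represented. That matches both pairs of numbers reported above.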

vmirly · 2019-06-12T16:15:35Z

Ok, cool. That sounds great! 👍

For now, I figured I can load the entire dataset and then use ds_train = ds.take(100) and ds_test = ds.skip(100) to split it into train/test sets.
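Rather than hard-coding 100, the split sizes can be derived from the dataset size. The sketch below uses plain Python lists as a stand-in for the dataset, with `islice` playing the role of take/skip; `split_sizes` is a hypothetical helper, not a tfds or tf.data function.

```python
from fractions import Fraction
from itertools import islice


def split_sizes(num_examples, train_fraction):
    """Return (train_size, test_size); the train size is rounded down."""
    train_size = int(num_examples * train_fraction)
    return train_size, num_examples - train_size


examples = list(range(150))  # stand-in for the 150 iris examples
train_size, test_size = split_sizes(len(examples), Fraction(2, 3))

train = list(islice(examples, train_size))       # analogous to ds.take(train_size)
test = list(islice(examples, train_size, None))  # analogous to ds.skip(train_size)

print(train_size, test_size)  # 100 50
```

Using `Fraction(2, 3)` instead of a float keeps the arithmetic exact, so 150 * 2/3 comes out as exactly 100.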

Conchylicultor · 2019-06-13T16:07:30Z

Be careful with this: the data is not guaranteed to be generated in the same order for the train set unless you use as_dataset(shuffle_files=False), so you may end up with overlapping train/test sets.
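To see why an unstable read order matters, here is a pure-Python model of the problem. It does not use tf.data; `read_pass` is a hypothetical stand-in for one pass over the dataset, and it shuffles only when a random generator is supplied.

```python
import random


def read_pass(examples, rng=None):
    """Return one pass over the data, shuffled if `rng` is given."""
    order = list(examples)
    if rng is not None:
        rng.shuffle(order)
    return order


examples = range(150)
rng = random.Random(0)

# Order changes between reads: "take" and "skip" see different permutations.
unstable_train = set(read_pass(examples, rng)[:100])
unstable_test = set(read_pass(examples, rng)[100:])
unstable_overlap = unstable_train & unstable_test

# Order is fixed: both reads see the same sequence, so the subsets are disjoint.
stable_train = set(read_pass(examples)[:100])
stable_test = set(read_pass(examples)[100:])
stable_overlap = stable_train & stable_test

print(len(unstable_overlap), len(stable_overlap))
```

In the unstable case some examples land in both subsets, which leaks training data into the test set; with a fixed order the overlap is empty.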

vmirly · 2019-06-16T00:45:18Z

Yes, thanks.
I shuffle twice: once on the full dataset before splitting it into train/test, and again on the train set after the split. For the first shuffle, I set reshuffle_each_iteration=False:

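import tensorflow as tf  # item.numpy() below assumes eager execution (TF 2.x)

# Shuffle once before splitting; reshuffle_each_iteration=False is intended
# to keep this order fixed across iterations.
ds = tf.data.Dataset.range(15)
ds = ds.shuffle(15, reshuffle_each_iteration=False)

# Split: first 10 elements for training, remaining 5 for testing.
ds_train = ds.take(10)
ds_test = ds.skip(10)

# Shuffling after the split only reorders elements within each subset.
ds_train = ds_train.shuffle(10).repeat(10)
ds_test = ds_test.shuffle(5).repeat(10)

# Collect the distinct values seen in each subset across all repeats.
set_train = {item.numpy() for item in ds_train}
set_test = {item.numpy() for item in ds_test}

print(set_train, set_test)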

If I don't do that, repeating the train and test datasets results in overlapping samples.

kaushikacharya · 2019-07-03T17:23:34Z

> this time, it gives me 116 train and 34 test samples. I have also posted this question on Stackoverflow: https://stackoverflow.com/questions/56553357/splitting-a-tensorflow-dataset-using-tfds-percent

Related issue: #292

vmirly · 2019-07-03T18:02:15Z

Thanks, I have closed the issue then.

vmirly added the bug label Jun 12, 2019

vmirly changed the title ~~Splitting a dataset from tfds results in inaccurate splits~~ Splitting a dataset from tfds results in inaccurate partition sizes Jun 12, 2019

vmirly closed this as completed Jul 3, 2019
