Splitting a dataset from tfds results in inaccurate partition sizes #665
Comments
Thanks for reporting. This is a known issue caused by the implementation, which rounds the percentage when the number of examples in the dataset is not a multiple of 100 (num_examples % 100 != 0).
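To make the rounding effect concrete, here is a small pure-Python model that happens to reproduce the counts reported in this issue for the 150-example iris dataset. It assumes examples are spread round-robin over 100 percent buckets and that a percent slice selects whole buckets. That assumption, and the helper names `bucket_sizes` and `percent_slice_count`, are illustrative only and are not taken from the tfds source.

```python
def bucket_sizes(num_examples, num_buckets=100):
    """Examples per bucket under an ASSUMED round-robin assignment."""
    base, extra = divmod(num_examples, num_buckets)
    # The first `extra` buckets each hold one additional example.
    return [base + 1 if b < extra else base for b in range(num_buckets)]


def percent_slice_count(num_examples, start, stop):
    """Examples selected by the bucket range [start, stop) in this model."""
    return sum(bucket_sizes(num_examples)[start:stop])


# 67% / 33% slice of 150 examples
print(percent_slice_count(150, 0, 67), percent_slice_count(150, 67, 100))  # 117 33
# 66 / 34 bucket split, as a 2:1 weighting might produce
print(percent_slice_count(150, 0, 66), percent_slice_count(150, 66, 100))  # 116 34
```

In this model the first 50 buckets hold two examples each and the remaining 50 hold one, so any slice that starts at 0 is over-represented unless the dataset size is a multiple of 100.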
Ok, cool. That sounds great! 👍 For now, I figured I can load the entire dataset and then use
Be careful with this, as the data is not guaranteed to be generated in the same order for the train set unless
Yes, thanks.

import tensorflow as tf

# Shuffle once; reshuffle_each_iteration=False asks for the same order on every pass.
ds = tf.data.Dataset.range(15)
ds = ds.shuffle(15, reshuffle_each_iteration=False)
# Split into 10 train and 5 test elements.
ds_train = ds.take(10)
ds_test = ds.skip(10)
# Shuffle within each split and repeat it for 10 epochs.
ds_train = ds_train.shuffle(10).repeat(10)
ds_test = ds_test.shuffle(5)
ds_test = ds_test.repeat(10)
# Collect the distinct values seen in each split across all epochs.
set_train = set()
for item in ds_train:
    set_train.add(item.numpy())
set_test = set()
for item in ds_test:
    set_test.add(item.numpy())
print(set_train, set_test)

If I don't do that, repeating the train and test datasets will result in overlapping samples.
Related issue: #292
Thanks, I closed the issue then.
I am trying to split the iris dataset into train/test with 2/3 for training and 1/3 for testing. So, I used the percent as follows:
but this results in 117 samples for training and 33 for test which is not correct. It should be 100 for training and 50 for test. I also tried the subsplit as follows but got similar results:
this time, it gives me 116 train and 34 test samples. I have also posted this question on Stackoverflow: https://stackoverflow.com/questions/56553357/splitting-a-tensorflow-dataset-using-tfds-percent
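For reference, the sizes expected from a 2/3 : 1/3 split can be computed with integer arithmetic, which avoids both percent rounding and floating-point error. The helper name `split_sizes` is hypothetical; the resulting counts could then be passed to `take` and `skip` on a fully loaded dataset.

```python
def split_sizes(num_examples, numerator, denominator):
    """Return (n_train, n_test) for a train fraction of numerator/denominator."""
    # Floor division keeps the result an exact integer; the remainder goes to test.
    n_train = (num_examples * numerator) // denominator
    return n_train, num_examples - n_train


n_train, n_test = split_sizes(150, 2, 3)
print(n_train, n_test)  # 100 50
```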