oxford_flowers102 bad splits #3022

Closed
TeaPearce opened this issue Feb 6, 2021 · 5 comments · Fixed by #3023
Labels
bug Something isn't working contributions welcome

Comments

@TeaPearce

The train/val/test splits in the tfds oxford_flowers102 don't match up with established splits.

Training on the train and val splits, I only achieve around 91% accuracy with fine-tuning. It should be possible to achieve 98%, e.g. Table 6 here. If one reads in the entire dataset and creates a random split, this is achievable. This has also been noted on Stack Overflow here.
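A minimal sketch of the "read in the entire dataset and create a random split" approach mentioned above. The helper names `split_counts` and `load_random_split`, the 80/10/10 fractions, and the seed are illustrative assumptions, not part of TFDS or of the original report. Shuffling with a buffer as large as the whole dataset holds every decoded image in memory, so this is only a starting point.

```python
def split_counts(total, train_frac=0.8, val_frac=0.1):
    """Return (train, val, test) example counts that sum to `total`."""
    n_train = int(total * train_frac)
    n_val = int(total * val_frac)
    return n_train, n_val, total - n_train - n_val


def load_random_split(seed=0, train_frac=0.8, val_frac=0.1):
    """Pool the three official splits, shuffle once, and carve out new ones."""
    # Imported lazily so split_counts() is usable without TFDS installed.
    import tensorflow_datasets as tfds

    ds, info = tfds.load(
        "oxford_flowers102",
        split="train+validation+test",  # concatenate all official splits
        as_supervised=True,
        with_info=True,
    )
    total = sum(info.splits[name].num_examples
                for name in ("train", "validation", "test"))
    n_train, n_val, _ = split_counts(total, train_frac, val_frac)

    # A fixed seed with reshuffle_each_iteration=False keeps the order stable,
    # so the take/skip partitions below do not overlap between iterations.
    ds = ds.shuffle(total, seed=seed, reshuffle_each_iteration=False)
    train = ds.take(n_train)
    val = ds.skip(n_train).take(n_val)
    test = ds.skip(n_train + n_val)
    return train, val, test
```

Results obtained on such a custom split are not directly comparable with numbers reported on the official splits.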

@TeaPearce TeaPearce added the bug Something isn't working label Feb 6, 2021
@jatin-code777
Contributor

jatin-code777 commented Feb 7, 2021

This is not an issue within TFDS itself but perhaps a bug in the dataset itself.
I redownloaded the original dataset and confirmed that the splits given in the dataset match those in TFDS.

These also match Table 6 in the paper:
Table 6

In my opinion, the fix for this should come from the dataset itself.

@vijayphoenix
Contributor

vijayphoenix commented Feb 7, 2021

A workaround would be to do something like this

>>> import tensorflow_datasets as tfds
>>> # Deliberately swapped: the name `train` is bound to the larger official 'test' split.
>>> test, train, validation = tfds.load('oxford_flowers102', split=['train', 'test', 'validation'])

>>> sum(1 for _ in train)
6149

Perhaps, we can add a warning in the dataset description.

For an example of such a warning, see:

WARNING: The integer labels used are defined by the authors and do not match
those from the other ImageNet datasets provided by Tensorflow datasets.
See the original [label list](https://github.com/PatrykChrabaszcz/Imagenet32_Scripts/blob/master/map_clsloc.txt),
and the [labels used by this dataset](https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/image_classification/imagenet_resized_labels.txt).
Additionally, the original authors 1 index there labels which we convert to
0 indexed by subtracting one.

@Conchylicultor
Member

TFDS provides datasets as close as possible to how the original dataset authors released them. As pointed out above, the TFDS splits match the splits defined by the Oxford authors, so I'm marking this bug as working as intended.

Note: Our documentation already provides the number of examples: https://www.tensorflow.org/datasets/catalog/oxford_flowers102

Or programmatically:

builder = tfds.builder('oxford_flowers102')
builder.info.splits['test'].num_examples

Or

test, train, validation = tfds.load('oxford_flowers102', split=['train', 'test', 'validation'])
print(len(train))

@TeaPearce
Author

Thanks all, sounds sensible. I don't know the history of when/how/why the dataset splits evolved, but wanted to document it somewhere.

@vijayphoenix
Contributor

You can find the splitting and slicing doc here
https://www.tensorflow.org/datasets/splits
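
As a rough illustration of the slicing syntax described in that doc, the sketch below trains on most of the larger official 'test' split and holds out the rest. The helper names `percent_slices` and `load_train_holdout` and the 90% cut-off are hypothetical choices for this example. The slices follow the stored example order rather than a fresh shuffle.

```python
def percent_slices(split, pct):
    """Build two TFDS split strings: the first `pct` percent and the remainder."""
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    return f"{split}[:{pct}%]", f"{split}[{pct}%:]"


def load_train_holdout(pct=90):
    """Load the official 'test' split as a (fit, holdout) pair of datasets."""
    # Imported lazily so percent_slices() is usable without TFDS installed.
    import tensorflow_datasets as tfds

    fit_split, holdout_split = percent_slices("test", pct)
    return tfds.load(
        "oxford_flowers102",
        split=[fit_split, holdout_split],  # e.g. ['test[:90%]', 'test[90%:]']
        as_supervised=True,
    )
```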
