[bugfix] Make CUB200 labels 0-indexed. #6702

kdexd · 2022-10-04T18:25:25Z

Bug description:

The CUB200 dataset in the `torchvision.prototype.datasets` module formed labels from file paths. As a result, labels were 1-indexed (1-200) instead of 0-indexed (0-199). A related issue occurred earlier with Flowers102 (`torchvision.datasets` module, #5766).

This PR simply shifts the labels and makes them zero-indexed for consistency with other datasets.
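
To illustrate the kind of shift described here, below is a minimal sketch, not the actual diff of this PR. It assumes the CUB200 class directories are named with a numeric prefix such as `001.Black_footed_Albatross`; the helper name `label_from_path` and the file names are hypothetical.

```python
from pathlib import PurePosixPath


def label_from_path(path: str) -> int:
    """Parse the class number from a CUB200-style path and make it 0-indexed."""
    # The parent directory is assumed to look like "001.Black_footed_Albatross".
    class_dir = PurePosixPath(path).parent.name
    one_indexed = int(class_dir.split(".", 1)[0])
    # Directory prefixes run from 1 to 200, so subtract 1 to get 0 to 199.
    return one_indexed - 1


# Illustrative paths only; the image file names are made up.
first = label_from_path("images/001.Black_footed_Albatross/example_0001.jpg")
last = label_from_path("images/200.Common_Yellowthroat/example_0002.jpg")
```

Without the final subtraction, the largest label would be 200, which is out of range for a 200-way classifier head or loss that expects class indices in `[0, 199]`.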

pmeier

Thanks @kdexd!

kdexd · 2022-10-04T21:50:40Z

The pleasure is mine! On a related note, my ongoing project requires evaluating large vision models on a suite of ~25 image classification datasets, specifically a big subset of datasets used to evaluate CLIP models (see Figure below).

Some datasets are not yet implemented in `torchvision.prototype.datasets` but are present in `torchvision.datasets`. In my development code I simply added a wrapper class that gives them an API consistent with the new datasets (`IterDataPipe`), like so:

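from __future__ import annotations  # allows `str | Path` annotations on Python < 3.10

from functools import partial
from pathlib import Path

import torchvision
# The import paths below are assumed; the original comment omitted them.
from torchdata.datapipes import iter as dpi
from torchdata.datapipes.iter import MapToIterConverter
from torchdata.datapipes.map import SequenceWrapper
from torchvision.prototype.datasets.utils._internal import hint_sharding, hint_shuffling


class TorchvisionWrapIterDataPipe(dpi.IterDataPipe):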
    def __init__(self, name: str, root: str | Path, split: str, **kwargs):
        # Get the dataset from `torchvision.datasets` module.
        DatasetClass = getattr(torchvision.datasets, name)
        self._inner = DatasetClass(str(root), split, download=True, **kwargs)

        # Wrap the dataset as a MapDataPipe, then convert to iterable.
        _dp = MapToIterConverter(SequenceWrapper(self._inner))
        self._dp = hint_sharding(hint_shuffling(_dp))

    def __len__(self):
        return len(self._inner)

    def __iter__(self):
        for image, label in self._dp:
            yield {"image": image, "label": label}


FGVCAircraft = partial(TorchvisionWrapIterDataPipe, "FGVCAircraft")
Flowers102 = partial(TorchvisionWrapIterDataPipe, "Flowers102")
STL10 = partial(TorchvisionWrapIterDataPipe, "STL10", folds=0)
RenderedSST2 = partial(TorchvisionWrapIterDataPipe, "RenderedSST2")
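
The adapter idea in the wrapper above can also be sketched without torchdata or any downloads. This is only an illustration of the pattern: `DictIterWrapper` and `toy_dataset` are hypothetical names, the toy data stands in for a `torchvision.datasets` class, and the sharding and shuffling hints are left out.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple


class DictIterWrapper:
    """Expose a map-style dataset of (image, label) pairs as an iterable of dicts."""

    def __init__(self, inner: Sequence[Tuple[Any, int]]):
        self._inner = inner

    def __len__(self) -> int:
        return len(self._inner)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Index explicitly, as a map-style dataset only guarantees
        # __getitem__ and __len__.
        for index in range(len(self._inner)):
            image, label = self._inner[index]
            yield {"image": image, "label": label}


# Toy stand-in for a real dataset: placeholder strings instead of images.
toy_dataset = [("img_a", 0), ("img_b", 1), ("img_c", 0)]
samples = list(DictIterWrapper(toy_dataset))
```

Each element of `samples` is a dictionary with `"image"` and `"label"` keys, which is the sample format the wrapper above yields.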

Some other datasets, like RESISC45 and HatefulMemes, are not present in either module. I believe that these datasets will eventually be commonly used to evaluate CLIP-like models and large SSL models. Torchvision users would find them useful if they were ready for evaluation out of the box. If there is sufficient interest, would you accept pull requests for these datasets? In that case I will actually re-implement a consistent API for these datasets rather than a quick wrapper like shown above.

pmeier · 2022-10-05T13:19:39Z

> If there is sufficient interest, would you accept pull requests for these datasets? In that case I will actually re-implement a consistent API for these datasets rather than a quick wrapper like shown above.

Thanks a lot for the interest in contributing. Could you open an issue so this won't get lost? As of now, our new API is not finalized, so it makes little sense to port more datasets yet. However, we are working hard to get this done and will notify you when we can start adding new datasets.

Summary: CUB200 dataset in `torchvision.prototype.datasets` module formed labels using file paths. This resulted in labels being 1-indexed (1-200) instead of 0-indexed (0-199). Similar issue occurred with Flowers102 (`torchvision.datasets` module, #5766).

Reviewed By: datumbox

Differential Revision: D40138731

fbshipit-source-id: ce42adf4a3ae8e25110db06f2421b24c5169cfc4

Make CUB200 labels 0-indexed.

a319677

CUB200 dataset in `torchvision.prototype.datasets` module formed labels using file paths. This resulted in labels being 1-indexed (1-200) instead of 0-indexed (0-199). Similar issue occurred with Flowers102 (`torchvision.datasets` module, #5766).

facebook-github-bot added the cla signed label Oct 4, 2022

kdexd changed the title ~~Make CUB200 labels 0-indexed.~~ [bugfix] Make CUB200 labels 0-indexed. Oct 4, 2022

datumbox requested a review from pmeier October 4, 2022 19:05

pmeier approved these changes Oct 4, 2022

datumbox added bug prototype module: datasets labels Oct 4, 2022

datumbox merged commit 71885b0 into pytorch:main Oct 4, 2022

kdexd deleted the patch-1 branch October 4, 2022 22:00
