[bugfix] Make CUB200 labels 0-indexed. #6702

Merged
merged 1 commit into from
Oct 4, 2022

kdexd
Contributor

@kdexd kdexd commented Oct 4, 2022

Bug description:

The CUB200 dataset in the torchvision.prototype.datasets module derived its labels from file paths. As a result, the labels were 1-indexed (1-200) instead of 0-indexed (0-199). Flowers102 in the torchvision.datasets module had a related issue in the past (#5766).

This PR simply shifts the labels and makes them zero-indexed for consistency with other datasets.
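
As a minimal sketch of the idea (not the actual diff in this PR), assume class folders follow an `NNN.class_name` pattern with `NNN` running from 001 to 200. The label can then be parsed from the path and shifted by one. The function name `label_from_path` and the example paths are illustrative only.

```python
from pathlib import PurePosixPath


def label_from_path(path: str) -> int:
    """Return a 0-indexed class label parsed from a CUB200-style image path."""
    # Assumed layout: ".../<NNN>.<class_name>/<image>.jpg", with NNN in 001..200.
    class_dir = PurePosixPath(path).parent.name
    # The numeric prefix before the first dot is the 1-indexed class id.
    one_indexed = int(class_dir.split(".", 1)[0])
    # Shift by one so that labels span 0..199 instead of 1..200.
    return one_indexed - 1


# Illustrative paths; the file names are placeholders.
first = label_from_path("images/001.Black_footed_Albatross/example_0001.jpg")
last = label_from_path("images/200.Common_Yellowthroat/example_0002.jpg")
```

Zero-indexed labels matter in practice because losses such as `torch.nn.CrossEntropyLoss` expect class indices in the range 0 to C-1, so a label of 200 with 200 classes would be out of range.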

@kdexd kdexd changed the title Make CUB200 labels 0-indexed. [bugfix] Make CUB200 labels 0-indexed. Oct 4, 2022
@datumbox datumbox requested a review from pmeier October 4, 2022 19:05
Collaborator

@pmeier pmeier left a comment

Thanks @kdexd!

@kdexd
Contributor Author

kdexd commented Oct 4, 2022

The pleasure is mine! On a related note, my ongoing project requires evaluating large vision models on a suite of ~25 image classification datasets, specifically a large subset of the datasets used to evaluate CLIP models (see the figure below).

[Figure: datasets used to evaluate CLIP models]

Some datasets are not yet implemented in torchvision.prototype.datasets, but are present in torchvision.datasets. In my development code I simply added a wrapper class to have a consistent API with new datasets (IterDataPipe), like so:

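# Imports are not part of the original snippet; the module paths and the
# `dpi` alias below are assumptions about what the author used.
from __future__ import annotations

from functools import partial
from pathlib import Path

import torchdata.datapipes.iter as dpi
import torchvision.datasets
from torchdata.datapipes.iter import MapToIterConverter
from torchdata.datapipes.map import SequenceWrapper
from torchvision.prototype.datasets.utils._internal import hint_sharding, hint_shuffling


class TorchvisionWrapIterDataPipe(dpi.IterDataPipe):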
    def __init__(self, name: str, root: str | Path, split: str, **kwargs):
        # Get the dataset from `torchvision.datasets` module.
        DatasetClass = getattr(torchvision.datasets, name)
        self._inner = DatasetClass(str(root), split, download=True, **kwargs)

        # Wrap the dataset as a MapDataPipe, then convert to iterable.
        _dp = MapToIterConverter(SequenceWrapper(self._inner))
        self._dp = hint_sharding(hint_shuffling(_dp))

    def __len__(self):
        return len(self._inner)

    def __iter__(self):
        for image, label in self._dp:
            yield {"image": image, "label": label}


FGVCAircraft = partial(TorchvisionWrapIterDataPipe, "FGVCAircraft")
Flowers102 = partial(TorchvisionWrapIterDataPipe, "Flowers102")
STL10 = partial(TorchvisionWrapIterDataPipe, "STL10", folds=0)
RenderedSST2 = partial(TorchvisionWrapIterDataPipe, "RenderedSST2")
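
The core of the wrapper above is turning `(image, label)` tuples from a map-style dataset into an iterable of dictionaries. A dependency-free sketch of just that part follows; the class name `DictSampleIterable` is hypothetical, and it deliberately leaves out the sharding and shuffling hints, which need torchdata.

```python
from typing import Any, Dict, Iterator, Sequence, Tuple


class DictSampleIterable:
    """Iterate over a sequence of (image, label) pairs as dictionaries."""

    def __init__(self, samples: Sequence[Tuple[Any, int]]) -> None:
        # Any indexable collection of (image, label) pairs works here,
        # e.g. a map-style dataset or a plain list.
        self._samples = samples

    def __len__(self) -> int:
        return len(self._samples)

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        # Yield the same keys as the wrapper above: "image" and "label".
        for image, label in self._samples:
            yield {"image": image, "label": label}


# Toy stand-in for a dataset: strings in place of real images.
toy = DictSampleIterable([("img_a", 0), ("img_b", 1)])
records = list(toy)
```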

Some other datasets, like RESISC45 and HatefulMemes, are not present in either module. I believe these datasets will eventually be commonly used to evaluate CLIP-like models and large SSL models, and torchvision users would find them useful if they were ready for evaluation out of the box. If there is sufficient interest, would you accept pull requests for these datasets? In that case I would re-implement them with a consistent API rather than a quick wrapper like the one shown above.

@datumbox datumbox merged commit 71885b0 into pytorch:main Oct 4, 2022
@kdexd kdexd deleted the patch-1 branch October 4, 2022 22:00
@pmeier
Collaborator

pmeier commented Oct 5, 2022

> If there is sufficient interest, would you accept pull requests for these datasets? In that case I will actually re-implement a consistent API for these datasets rather than a quick wrapper like shown above.

Thanks a lot for your interest in contributing. Could you open an issue so this doesn't get lost? Our new API is not finalized yet, so it makes little sense to port more datasets at this point. However, we are working hard to get this done and will notify you when we can start adding new datasets.

facebook-github-bot pushed a commit that referenced this pull request Oct 7, 2022
Summary: CUB200 dataset in `torchvision.prototype.datasets` module formed labels using file paths. This resulted in labels being 1-indexed (1-200) instead of 0-indexed (0-199). Similar issue occurred with Flowers102 (`torchvision.datasets` module, #5766).

Reviewed By: datumbox

Differential Revision: D40138731

fbshipit-source-id: ce42adf4a3ae8e25110db06f2421b24c5169cfc4