add prototype for CIFAR datasets #4511
Conversation
Conflicts:
torchvision/prototype/datasets/_builtin/__init__.py
torchvision/prototype/datasets/utils/_internal.py
Thanks for the PR!
Making a high-level comment here: I think we should benchmark our datasets as we add them, to ensure we don't introduce performance regressions (which is probably the case for CIFAR).
Can you add this to the TODO list of things to add for the datasets? It would probably mean running for a few epochs so that tar extraction etc. is taken into account as well.
(_, category_idx), (_, image_array_flat) = data
image_array = image_array_flat.reshape((3, 32, 32)).transpose(1, 2, 0)
image_buffer = image_buffer_from_array(image_array)
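For context, a minimal sketch of the layout change this performs on a raw CIFAR row (illustration only; the dummy array below is not part of the PR):

import numpy as np

# Each CIFAR record stores the image as a flat uint8 vector of 3 * 32 * 32
# values laid out channel-first (all red pixels, then green, then blue).
image_array_flat = np.zeros(3 * 32 * 32, dtype=np.uint8)

# reshape to channels-height-width, then transpose to height-width-channels,
# which is the layout image encoders such as PIL expect.
image_array = image_array_flat.reshape((3, 32, 32)).transpose(1, 2, 0)
assert image_array.shape == (32, 32, 3)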
While this keeps the APIs consistent, it is worth noting that it brings significant overhead for this dataset and makes it much slower than the current CIFAR datasets in torchvision: we encode and then decode every image, even though CIFAR already stores the images in decoded form.
I want us to keep this in mind for the future, and maybe default to skipping the re-encode + decode for speed.
I've changed the default behavior so that we don't need to encode and decode in every step. This only applies to datasets where this is possible (IIRC only CIFAR and MNIST); "normal" image datasets are untouched by this.
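A minimal sketch of the idea (the function name and signature below are hypothetical, not the decoder hook this PR actually adds):

import numpy as np
import torch

def raw_cifar_decoder(image_array: np.ndarray) -> torch.Tensor:
    # CIFAR already ships decoded pixel data, so the HWC uint8 array can be
    # turned into a CHW tensor directly instead of being re-encoded to an
    # image format and decoded again for every sample.
    return torch.from_numpy(image_array).permute(2, 0, 1).contiguous()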
def __iter__(self) -> Iterator[D]:
    for sequence in self.datapipe:
        yield from iter(sequence)
Is this the equivalent of flattening the sequence? If yes, do we have something already available in datapipes?
Should be doable with https://github.com/pytorch/pytorch/blob/2a5116e1599be7d7fa1be9572f47c316716b74c3/torch/utils/data/datapipes/iter/grouping.py#L102. Otherwise we can always send these datapipes upstream.
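If the linked upstream datapipe covers this, usage could look roughly like the sketch below (assuming the current unbatching datapipe API; the exact class and functional names at the linked commit may differ):

from torch.utils.data.datapipes.iter import IterableWrapper

nested_dp = IterableWrapper([[1, 2, 3], [4, 5]])  # yields sequences of samples
flat_dp = nested_dp.unbatch()  # flattens one level, like the custom __iter__ above
assert list(flat_dp) == [1, 2, 3, 4, 5]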
Let's avoid writing new datapipes if they are redundant with building blocks that already exist upstream.
Good point. I'll design a benchmark utility so we can do this for every dataset.
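A rough sketch of what such a utility could measure (names are illustrative, not the eventual torchvision API):

import time

def benchmark(datapipe, epochs: int = 3) -> None:
    # Iterate the full datapipe several times so that archive extraction and
    # decoding are included in the measurement, and report throughput.
    for epoch in range(epochs):
        start = time.perf_counter()
        num_samples = sum(1 for _ in datapipe)
        elapsed = time.perf_counter() - start
        print(f"epoch {epoch}: {num_samples / elapsed:.1f} samples/s")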
archive_dp = TarArchiveReader(archive_dp)
archive_dp: IterDataPipe = Filter(archive_dp, functools.partial(self._is_data_file, config=config))
archive_dp: IterDataPipe = Mapper(archive_dp, self._unpickle)
@ejguan any idea how to appease mypy here, without slapping : IterDataPipe everywhere? Otherwise I'm inclined to blanket-ignore var-annotated here.
The easiest way should be to add an annotation to the variable at the beginning:
Suggested change (replacing the three lines above):

archive_dp: IterDataPipe
archive_dp = resource_dps[0]
archive_dp = TarArchiveReader(archive_dp)
archive_dp = Filter(archive_dp, functools.partial(self._is_data_file, config=config))
archive_dp = Mapper(archive_dp, self._unpickle)
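The upside of this pattern is that the single bare annotation gives mypy a type for archive_dp up front, so every later plain reassignment is checked against IterDataPipe without repeating the annotation on each line or blanket-ignoring var-annotated.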
Summary:
* add prototype for CIFAR datasets

Reviewed By: NicolasHug
Differential Revision: D31505571
fbshipit-source-id: 33a656a3d3c176752491f400aeb0e9327856493e
* add prototype for CIFAR datasets (Conflicts: torchvision/prototype/datasets/_builtin/__init__.py, torchvision/prototype/datasets/utils/_internal.py)
* fix mypy
* cleanup
* more cleanup
* revert unrelated changes
* fix code format
* avoid decoding twice by default
* revert unrelated change
* cleanup
This overlaps with #4510 in some parts, so this will have to be rebased after the other PR is merged.
cc @pmeier @mthrok @bjuncek