[DataPipe] Adding BatcherMapDataPipe #68197

NivekT · 2021-11-11T20:16:29Z

Stack from ghstack:

-> [DataPipe] Adding BatcherMapDataPipe #68197

Implements part of #57031

cc @VitalyFedyunin @ejguan @NivekT

Differential Revision: D32440963

[ghstack-poisoned]

pytorch-probot · 2021-11-11T20:16:32Z

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/d825c1c1220bdb8a80abf6fad5cae5990a33f5e6/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows	Labels (bold enabled)	Status
Triggered Workflows
linux-bionic-cuda11.5-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-bionic-py3.6-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/noarch`, `ciflow/xla`	✅ triggered
linux-vulkan-bionic-py3.6-clang9	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/vulkan`	✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-xenial-py3-clang5-mobile-build	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`	✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static	`ciflow/all`, `ciflow/default`, `ciflow/linux`, `ciflow/mobile`	✅ triggered
linux-xenial-py3.6-clang7-asan	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/sanitizers`	✅ triggered
linux-xenial-py3.6-clang7-onnx	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`, `ciflow/onnx`	✅ triggered
linux-xenial-py3.6-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-xenial-py3.6-gcc7	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
linux-xenial-py3.6-gcc7-bazel-test	`ciflow/all`, `ciflow/bazel`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit	`ciflow/all`, `ciflow/android`, `ciflow/cpu`, `ciflow/default`, `ciflow/linux`	✅ triggered
win-vs2019-cpu-py3	`ciflow/all`, `ciflow/cpu`, `ciflow/default`, `ciflow/win`	✅ triggered
win-vs2019-cuda11.3-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/default`, `ciflow/win`	✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`	🚫 skipped
docker-builds	`ciflow/all`	🚫 skipped
ios-12-5-1-arm64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-custom-ops	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-full-jit	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-arm64-metal	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-x86-64	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-x86-64-coreml	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
ios-12-5-1-x86-64-full-jit	`ciflow/all`, `ciflow/ios`, `ciflow/macos`	🚫 skipped
libtorch-linux-bionic-cuda11.5-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`	🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`	🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`	🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/slow`	🚫 skipped
macos-10-15-py3-arm64	`ciflow/all`, `ciflow/macos`	🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64	`ciflow/all`, `ciflow/macos`	🚫 skipped
macos-11-py3-x86-64	`ciflow/all`, `ciflow/macos`	🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4	`ciflow/all`, `ciflow/cpu`, `ciflow/linux`	🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7	`ciflow/all`, `ciflow/cuda`, `ciflow/libtorch`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`, `ciflow/slow`, `ciflow/slow-gradcheck`	🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7-debug	`ciflow/all`, `ciflow/cuda`, `ciflow/linux`, `ciflow/scheduled`	🚫 skipped
periodic-win-vs2019-cuda11.1-py3	`ciflow/all`, `ciflow/cuda`, `ciflow/scheduled`, `ciflow/win`	🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:

# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

ghstack-source-id: ee40523 Pull Request resolved: #68197

facebook-github-bot · 2021-11-11T20:16:38Z

🔗 Helpful links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/68197
📄 Preview docs built from this PR
📄 Preview C++ docs built from this PR
↩️ [fb-only] Re-run with SSH instructions
🔧 Opt-in to CIFlow to control what jobs run on your PRs

💊 CI failures summary and remediations

As of commit d825c1c (more details on the Dr. CI page):

2/2 failures possibly* introduced in this PR
- 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

linux-bionic-cuda11.5-py3.6-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-12-01T21:15:58.5231832Z Build left local git repository checkout dirty

2021-12-01T21:15:57.4691875Z real	43m7.065s
2021-12-01T21:15:57.4692419Z user	43m41.556s
2021-12-01T21:15:57.4692846Z sys	4m15.200s
2021-12-01T21:15:57.4693278Z + assert_git_not_dirty
2021-12-01T21:15:57.4694714Z + [[ linux-bionic-cuda11.5-py3.6-gcc7-default != *rocm* ]]
2021-12-01T21:15:57.4696104Z + [[ linux-bionic-cuda11.5-py3.6-gcc7-default != *xla* ]]
2021-12-01T21:15:57.4697115Z ++ git status --porcelain
2021-12-01T21:15:58.5229097Z + git_status='?? third_party/flatbuffers/'
2021-12-01T21:15:58.5230092Z + [[ -n ?? third_party/flatbuffers/ ]]
2021-12-01T21:15:58.5231126Z + echo 'Build left local git repository checkout dirty'
2021-12-01T21:15:58.5231832Z Build left local git repository checkout dirty
2021-12-01T21:15:58.5232546Z + echo 'git status --porcelain:'
2021-12-01T21:15:58.5233184Z git status --porcelain:
2021-12-01T21:15:58.5234094Z + echo '?? third_party/flatbuffers/'
2021-12-01T21:15:58.5234657Z ?? third_party/flatbuffers/
2021-12-01T21:15:58.5235372Z + exit 1
2021-12-01T21:15:58.5235737Z + cleanup
2021-12-01T21:15:58.5236570Z + retcode=1
2021-12-01T21:15:58.5236948Z + set +x
2021-12-01T21:15:58.5277209Z ##[error]Process completed with exit code 1.
2021-12-01T21:15:58.5342421Z ##[group]Run # Ensure the working directory gets chowned back to the current user

ci.pytorch.org: 1 failed

Failed: pr/pytorch-linux-bionic-rocm4.3.1-py3.6

This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@VitalyFedyunin

cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]

ghstack-source-id: 269b5fb Pull Request resolved: #68197

ghstack-source-id: 550487d Pull Request resolved: #68197

NivekT · 2021-11-15T21:16:25Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin

cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

ghstack-source-id: 27deb6c Pull Request resolved: #68197

NivekT · 2021-11-16T15:49:36Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ejguan

LGTM, and you may need to change pyi for Dataset.

ejguan · 2021-11-17T21:18:30Z

torch/utils/data/datapipes/map/grouping.py

+        if self.length is not None:
+            return self.length
+        if isinstance(self.datapipe, Sized) and self.unbatch_level == 0:
+            if self.drop_last:
+                self.length = len(self.datapipe) // self.batch_size
+            else:
+                self.length = (len(self.datapipe) + self.batch_size - 1) // self.batch_size
+            return self.length
+        raise TypeError("{} instance doesn't have valid length".format(type(self).__name__))
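
For reference, the two branches of the length computation in the diff above amount to floor division when `drop_last` is set and ceiling division otherwise. A minimal standalone sketch of that arithmetic follows; the helper name `num_batches` is hypothetical and is not part of the PR.

```python
def num_batches(num_items: int, batch_size: int, drop_last: bool) -> int:
    """Return how many batches num_items elements are split into.

    Hypothetical helper for illustration only; it mirrors the arithmetic
    in the diff above but is not code from the PR.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    if drop_last:
        # Floor division: a trailing partial batch is discarded.
        return num_items // batch_size
    # Ceiling division: a trailing partial batch is kept.
    return (num_items + batch_size - 1) // batch_size


print(num_batches(10, 3, drop_last=True))   # 3
print(num_batches(10, 3, drop_last=False))  # 4
```

With 10 elements and a batch size of 3, dropping the last batch gives 3 full batches, while keeping it gives 4 (the last one holding a single element).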


IMO, we should assume __len__ is always implemented for MapDataPipe

I agree, though there are cases where it can be hard to compute (especially when one element in the input DataPipe can expand into many).

The unbatch_level is super annoying here. Otherwise, we don't need to check if self.datapipe is Sized.

I think you still need to check for it before you call len(self.datapipe) right?

In either case, maybe we should get rid of unbatch_level? I feel that users can call .unbatch prior to this if they wanted that operation. I decided to include it only because IterDataPipe's version has that argument and I hesitate to make the two different.

If we assume the prior datapipe has __len__, then we can directly call len(self.datapipe).

> In either case, maybe we should get rid of unbatch_level? I feel that users can call .unbatch prior to this if they wanted that operation. I decided to include it only because IterDataPipe's version has that argument and I hesitate to make the two different.

Yeah. I find unbatch_level confusing even for IterDataPipe. At least based on my experience with Vision and Text, no one really uses it.

If that is the case, I think we should remove it for simplicity. I will make a PR and let the domain users have a quick look. I agree that it is unlikely that people are using it.

Sounds great. Thanks.
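
To make the outcome of this thread concrete, here is a rough sketch of a map-style batcher with no `unbatch_level`, where the source is assumed to always implement `__len__`. The class name `SimpleMapBatcher` is hypothetical, the sketch is plain Python rather than a `MapDataPipe` subclass, and it assumes the source is indexed by integers `0..len-1`. It is not the code merged in this PR.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


class SimpleMapBatcher:
    """Hypothetical map-style batcher; an illustration, not the PR's class."""

    def __init__(self, source: Sequence[T], batch_size: int, drop_last: bool = False) -> None:
        if batch_size <= 0:
            raise ValueError("batch_size must be positive")
        self.source = source
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __len__(self) -> int:
        # Assumption: the source always supports len(), so no Sized check.
        n = len(self.source)
        if self.drop_last:
            return n // self.batch_size
        return (n + self.batch_size - 1) // self.batch_size

    def __getitem__(self, index: int) -> List[T]:
        if not 0 <= index < len(self):
            raise IndexError(f"batch index {index} out of range")
        start = index * self.batch_size
        # Clamp the end so the last batch may be shorter than batch_size.
        stop = min(start + self.batch_size, len(self.source))
        return [self.source[i] for i in range(start, stop)]


batches = SimpleMapBatcher(list(range(10)), batch_size=3)
print(len(batches))  # 4
print(batches[3])    # [9]
```

Without `unbatch_level`, the length depends only on `len(source)`, `batch_size`, and `drop_last`, which is the simplification both reviewers were aiming for.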

@VitalyFedyunin

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

ghstack-source-id: 840a1ff Pull Request resolved: #68197

NivekT · 2021-11-17T22:16:49Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

ghstack-source-id: bf52bdf Pull Request resolved: #68197

NivekT · 2021-11-18T16:25:16Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

NivekT · 2021-12-01T19:54:20Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

ghstack-source-id: 7f6c748 Pull Request resolved: #68197

NivekT · 2021-12-01T19:57:06Z

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@VitalyFedyunin

… removing 'unbatch_level' argument and functionality" Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality is making this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well. I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability to `unbatch` within the `batch` function either. This PR simplifies the DataPipe by removing the argument. cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32532594](https://our.internmc.facebook.com/intern/diff/D32532594) [ghstack-poisoned]

…rgument and functionality (#68594)

Summary: Pull Request resolved: #68594

Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality is making this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well.

I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability to `unbatch` within the `batch` function either. This PR simplifies the DataPipe by removing the argument.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32532594

Pulled By: NivekT

fbshipit-source-id: 7276ce76ba2a3f207c9dfa58803a48e320adefed

[DataPipe] Adding BatcherMapDataPipe

17c7e5b

[ghstack-poisoned]

NivekT mentioned this pull request Nov 11, 2021

[DataPipe] Adding UnBatcherMapDataPipe #68196

Closed

pytorch-probot bot added the ciflow/default label Nov 11, 2021

facebook-github-bot added the cla signed label Nov 11, 2021

NivekT added a commit that referenced this pull request Nov 11, 2021

[DataPipe] Adding BatcherMapDataPipe

7b6c1d4

ghstack-source-id: ee40523 Pull Request resolved: #68197

NivekT marked this pull request as draft November 11, 2021 20:16

NivekT added the module: data torch.utils.data label Nov 11, 2021

Update on "[DataPipe] Adding BatcherMapDataPipe"

721fb3d

cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]

NivekT added a commit that referenced this pull request Nov 13, 2021

[DataPipe] Adding BatcherMapDataPipe

e5b7c0b

ghstack-source-id: 269b5fb Pull Request resolved: #68197

Update on "[DataPipe] Adding BatcherMapDataPipe"

da76b03

cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]

Update on "[DataPipe] Adding BatcherMapDataPipe"

b58abe1

cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]

NivekT added a commit that referenced this pull request Nov 15, 2021

[DataPipe] Adding BatcherMapDataPipe

98f1548

ghstack-source-id: 550487d Pull Request resolved: #68197

NivekT requested review from VitalyFedyunin and ejguan November 15, 2021 21:15

NivekT marked this pull request as ready for review November 15, 2021 21:16

Update on "[DataPipe] Adding BatcherMapDataPipe"

5ba88e4

cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

NivekT added a commit that referenced this pull request Nov 16, 2021

[DataPipe] Adding BatcherMapDataPipe

1b01701

ghstack-source-id: 27deb6c Pull Request resolved: #68197

ejguan approved these changes Nov 17, 2021

Update on "[DataPipe] Adding BatcherMapDataPipe"

7cf65a6

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

NivekT added a commit that referenced this pull request Nov 17, 2021

[DataPipe] Adding BatcherMapDataPipe

ebf6a98

ghstack-source-id: 840a1ff Pull Request resolved: #68197

NivekT mentioned this pull request Nov 18, 2021

[DataPipe] Simplify BatcherIterDataPipe by removing 'unbatch_level' argument and functionality #68594

Closed

Update on "[DataPipe] Adding BatcherMapDataPipe"

7f7a668

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

NivekT added a commit that referenced this pull request Nov 18, 2021

[DataPipe] Adding BatcherMapDataPipe

f2ea806

ghstack-source-id: bf52bdf Pull Request resolved: #68197

Update on "[DataPipe] Adding BatcherMapDataPipe"

f6253e0

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

Update on "[DataPipe] Adding BatcherMapDataPipe"

d825c1c

Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]

NivekT added a commit that referenced this pull request Dec 1, 2021

[DataPipe] Adding BatcherMapDataPipe

2d9cccb

ghstack-source-id: 7f6c748 Pull Request resolved: #68197

facebook-github-bot closed this in 0465f64 Dec 2, 2021

facebook-github-bot deleted the gh/nivekt/20/head branch December 6, 2021 15:17

NivekT added the release notes: dataloader release notes category label Feb 1, 2022
