
Conversation

NivekT
Contributor

@NivekT NivekT commented Nov 11, 2021

Stack from ghstack:

Implements part of #57031

cc @VitalyFedyunin @ejguan @NivekT

Differential Revision: D32440963

@pytorch-probot

pytorch-probot bot commented Nov 11, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/d825c1c1220bdb8a80abf6fad5cae5990a33f5e6/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows Labels (bold enabled) Status
Triggered Workflows
linux-bionic-cuda11.5-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
pytorch-linux-xenial-py3-clang5-android-ndk-r19c-gradle-custom-build-single-full-jit ciflow/all, ciflow/android, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
docker-builds ciflow/all 🚫 skipped
ios-12-5-1-arm64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-custom-ops ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-arm64-metal ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64 ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-coreml ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
ios-12-5-1-x86-64-full-jit ciflow/all, ciflow/ios, ciflow/macos 🚫 skipped
libtorch-linux-bionic-cuda11.5-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
macos-10-15-py3-arm64 ciflow/all, ciflow/macos 🚫 skipped
macos-10-15-py3-lite-interpreter-x86-64 ciflow/all, ciflow/macos 🚫 skipped
macos-11-py3-x86-64 ciflow/all, ciflow/macos 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7-debug ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

NivekT added a commit that referenced this pull request Nov 11, 2021
ghstack-source-id: ee40523
Pull Request resolved: #68197
@facebook-github-bot
Contributor

facebook-github-bot commented Nov 11, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit d825c1c (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 1/2 non-scanned failure(s)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See GitHub Actions build linux-bionic-cuda11.5-py3.6-gcc7 / test (default, 1, 2, linux.4xlarge.nvidia.gpu) (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

2021-12-01T21:15:58.5231832Z Build left local git repository checkout dirty
2021-12-01T21:15:57.4691875Z real	43m7.065s
2021-12-01T21:15:57.4692419Z user	43m41.556s
2021-12-01T21:15:57.4692846Z sys	4m15.200s
2021-12-01T21:15:57.4693278Z + assert_git_not_dirty
2021-12-01T21:15:57.4694714Z + [[ linux-bionic-cuda11.5-py3.6-gcc7-default != *rocm* ]]
2021-12-01T21:15:57.4696104Z + [[ linux-bionic-cuda11.5-py3.6-gcc7-default != *xla* ]]
2021-12-01T21:15:57.4697115Z ++ git status --porcelain
2021-12-01T21:15:58.5229097Z + git_status='?? third_party/flatbuffers/'
2021-12-01T21:15:58.5230092Z + [[ -n ?? third_party/flatbuffers/ ]]
2021-12-01T21:15:58.5231126Z + echo 'Build left local git repository checkout dirty'
2021-12-01T21:15:58.5231832Z Build left local git repository checkout dirty
2021-12-01T21:15:58.5232546Z + echo 'git status --porcelain:'
2021-12-01T21:15:58.5233184Z git status --porcelain:
2021-12-01T21:15:58.5234094Z + echo '?? third_party/flatbuffers/'
2021-12-01T21:15:58.5234657Z ?? third_party/flatbuffers/
2021-12-01T21:15:58.5235372Z + exit 1
2021-12-01T21:15:58.5235737Z + cleanup
2021-12-01T21:15:58.5236570Z + retcode=1
2021-12-01T21:15:58.5236948Z + set +x
2021-12-01T21:15:58.5277209Z ##[error]Process completed with exit code 1.
2021-12-01T21:15:58.5342421Z ##[group]Run # Ensure the working directory gets chowned back to the current user

ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@NivekT NivekT marked this pull request as draft November 11, 2021 20:16
@NivekT NivekT added the module: data torch.utils.data label Nov 11, 2021
NivekT added a commit that referenced this pull request Nov 13, 2021
ghstack-source-id: 269b5fb
Pull Request resolved: #68197
NivekT added a commit that referenced this pull request Nov 15, 2021
ghstack-source-id: 550487d
Pull Request resolved: #68197
@NivekT NivekT marked this pull request as ready for review November 15, 2021 21:16
@NivekT
Contributor Author

NivekT commented Nov 15, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT added a commit that referenced this pull request Nov 16, 2021
ghstack-source-id: 27deb6c
Pull Request resolved: #68197
@NivekT
Contributor Author

NivekT commented Nov 16, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

@ejguan ejguan left a comment

LGTM, though you may need to update the `.pyi` stub for `Dataset` as well.

Comment on lines 64 to 72
        if self.length is not None:
            return self.length
        if isinstance(self.datapipe, Sized) and self.unbatch_level == 0:
            if self.drop_last:
                self.length = len(self.datapipe) // self.batch_size
            else:
                self.length = (len(self.datapipe) + self.batch_size - 1) // self.batch_size
            return self.length
        raise TypeError("{} instance doesn't have valid length".format(type(self).__name__))
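For reference, the two branches above are floor and ceiling division of the source length by the batch size. A standalone sketch of that arithmetic (plain Python, independent of the DataPipe classes; `batch_count` is a hypothetical helper name used only for illustration):

```python
def batch_count(source_len: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches produced from source_len items."""
    if drop_last:
        # Incomplete trailing batch is discarded: floor division.
        return source_len // batch_size
    # Incomplete trailing batch is kept: ceiling division.
    return (source_len + batch_size - 1) // batch_size

print(batch_count(10, 3, drop_last=True))   # 3
print(batch_count(10, 3, drop_last=False))  # 4
```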
Contributor
IMO, we should assume `__len__` is always implemented for `MapDataPipe`.
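For reference, the `isinstance(self.datapipe, Sized)` guard in the snippet above is exactly a runtime test for whether `__len__` is defined; a minimal illustration (class names are made up for the example):

```python
from collections.abc import Sized

class WithLen:
    def __len__(self) -> int:
        return 5

class WithoutLen:
    pass

# isinstance(obj, Sized) is True exactly when __len__ is defined,
# via Sized's __subclasshook__ structural check.
print(isinstance(WithLen(), Sized))     # True
print(isinstance(WithoutLen(), Sized))  # False
```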

Contributor Author

I agree, though there are cases where it can be hard to compute (especially when one element in the input DataPipe can expand into many).

Contributor

The `unbatch_level` argument is the annoyance here. Without it, we wouldn't need to check whether `self.datapipe` is `Sized`.

Contributor Author

I think you still need to check for it before calling `len(self.datapipe)`, right?

In either case, maybe we should get rid of `unbatch_level`? Users can call `.unbatch` before this if they want that operation. I included it only because `IterDataPipe`'s version has that argument, and I hesitated to make the two differ.
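To illustrate why a separate `.unbatch` step suffices, here is a plain-Python sketch of the two operations composed (hypothetical helper functions, not the actual DataPipe API):

```python
from typing import Iterable, Iterator, List

def unbatch(batches: Iterable[List[int]]) -> Iterator[int]:
    # Flatten one level of batching.
    for batch_items in batches:
        yield from batch_items

def batch(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    # Group items into lists of batch_size (last batch may be short).
    buf: List[int] = []
    for item in items:
        buf.append(item)
        if len(buf) == batch_size:
            yield buf
            buf = []
    if buf:
        yield buf

# Re-batching [[1, 2], [3, 4], [5]] into batches of 3 via unbatch-then-batch:
print(list(batch(unbatch([[1, 2], [3, 4], [5]]), 3)))  # [[1, 2, 3], [4, 5]]
```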

Contributor

If we assume the prior DataPipe has `__len__`, then we can directly call `len(self.datapipe)`.

> In either case, maybe we should get rid of `unbatch_level`? I feel that users can call `.unbatch` prior to this if they wanted that operation. I decided to include it only because `IterDataPipe`'s version has that argument and I hesitate to make the two different.

Yeah. I find `unbatch_level` confusing even for `IterDataPipe`. At least in my experience with Vision and Text, no one really uses it.

Contributor Author

If that is the case, I think we should remove it for simplicity. I will make a PR and let the domain users have a quick look. I agree that it is unlikely that people are using it.

Contributor

Sounds great. Thanks!

Implements part of #57031

cc @VitalyFedyunin @ejguan @NivekT

Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Nov 17, 2021
ghstack-source-id: 840a1ff
Pull Request resolved: #68197
@NivekT
Contributor Author

NivekT commented Nov 17, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT added a commit that referenced this pull request Nov 18, 2021
ghstack-source-id: bf52bdf
Pull Request resolved: #68197
@NivekT
Contributor Author

NivekT commented Nov 18, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@NivekT
Contributor Author

NivekT commented Dec 1, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT added a commit that referenced this pull request Dec 1, 2021
ghstack-source-id: 7f6c748
Pull Request resolved: #68197
@NivekT
Contributor Author

NivekT commented Dec 1, 2021

@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

NivekT added a commit that referenced this pull request Dec 1, 2021
… removing 'unbatch_level' argument and functionality"


Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality makes this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well.

I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability to `unbatch` within the `batch` function either.

This PR simplifies the DataPipe by removing the argument.

cc @VitalyFedyunin ejguan @NivekT

Differential Revision: [D32532594](https://our.internmc.facebook.com/intern/diff/D32532594)

[ghstack-poisoned]
NivekT added a commit that referenced this pull request Dec 1, 2021
…ch_level' argument and functionality"


Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality makes this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well.

I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability to `unbatch` within the `batch` function either.

This PR simplifies the DataPipe by removing the argument.

cc @VitalyFedyunin ejguan @NivekT

Differential Revision: [D32532594](https://our.internmc.facebook.com/intern/diff/D32532594)

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this pull request Dec 2, 2021
…rgument and functionality (#68594)

Summary:
Pull Request resolved: #68594

Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality makes this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well.

I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability to `unbatch` within the `batch` function either.

This PR simplifies the DataPipe by removing the argument.

cc VitalyFedyunin ejguan NivekT

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D32532594

Pulled By: NivekT

fbshipit-source-id: 7276ce76ba2a3f207c9dfa58803a48e320adefed
@facebook-github-bot facebook-github-bot deleted the gh/nivekt/20/head branch December 6, 2021 15:17
@NivekT NivekT added the release notes: dataloader release notes category label Feb 1, 2022

Labels

cla signed module: data torch.utils.data release notes: dataloader release notes category
