-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[DataPipe] Adding BatcherMapDataPipe #68197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
[ghstack-poisoned]
CI Flow Status⚛️ CI FlowRuleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands: # ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow For more information, please take a look at the CI Flow Wiki. |
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit d825c1c (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]
cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]
cc @VitalyFedyunin ejguan @NivekT [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, and you may need to change pyi for Dataset.
if self.length is not None: | ||
return self.length | ||
if isinstance(self.datapipe, Sized) and self.unbatch_level == 0: | ||
if self.drop_last: | ||
self.length = len(self.datapipe) // self.batch_size | ||
else: | ||
self.length = (len(self.datapipe) + self.batch_size - 1) // self.batch_size | ||
return self.length | ||
raise TypeError("{} instance doesn't have valid length".format(type(self).__name__)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, we should assume __len__
is always implemented for MapDataPipe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, though there are cases where it can be hard to compute (especially when one element in the input DataPipe can expands into many).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The unbatch_level is super annoying here. Otherwise, we don't need to check if self.datapipe is Sized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you still need to check for it before you call len(self.datapipe)
right?
In either case, maybe we should get rid of unbatch_level
? I feel that users can call .unbatch
prior to this if they wanted that operation. I decided to include it only because IterDataPipe
's version has that argument and I hesitate to make the two different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we assume prior datapipe has __len__
, then we can directly call len(self.datapipe
.
In either case, maybe we should get rid of unbatch_level? I feel that users can call .unbatch prior to this if they wanted that operation. I decided to include it only because IterDataPipe's version has that argument and I hesitate to make the two different.
Yeah. I feel like unbatch_level
making me confused even for IterDataPipe. At least based on my experience with Vision and Text, no one really uses it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that is the case, I think we should remove it for simplicity. I will make a PR and let the domains users have a quick look. I agree that it is unlikely that people are using it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sound great. Thanks
Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Implements part of #57031 cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32440963](https://our.internmc.facebook.com/intern/diff/D32440963) [ghstack-poisoned]
@NivekT has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
… removing 'unbatch_level' argument and functionality" Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality is making this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well. I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability the `unbatch` within the `batch` function either. This PR simplifies the DataPipe by removing the argument. cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32532594](https://our.internmc.facebook.com/intern/diff/D32532594) [ghstack-poisoned]
…ch_level' argument and functionality" Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality is making this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well. I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability the `unbatch` within the `batch` function either. This PR simplifies the DataPipe by removing the argument. cc @VitalyFedyunin ejguan @NivekT Differential Revision: [D32532594](https://our.internmc.facebook.com/intern/diff/D32532594) [ghstack-poisoned]
…rgument and functionality (#68594) Summary: Pull Request resolved: #68594 Based on my conversation with ejguan [here](#68197 (review)), we both believe that having the `unbatch_level` argument and functionality is making this DataPipe unnecessarily complicated, because users can call `.unbatch` before `.batch` if they would like to do so. That will likely be cleaner as well. I also checked other libraries (for example, [TensorFlow](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#unbatch)), and I do not see them provide the ability the `unbatch` within the `batch` function either. This PR simplifies the DataPipe by removing the argument. cc VitalyFedyunin ejguan NivekT Test Plan: Imported from OSS Reviewed By: ejguan Differential Revision: D32532594 Pulled By: NivekT fbshipit-source-id: 7276ce76ba2a3f207c9dfa58803a48e320adefed
Stack from ghstack:
Implements part of #57031
cc @VitalyFedyunin @ejguan @NivekT
Differential Revision: D32440963