[DataPipe] Add RoundRobinDemux #903
ee71bfc to 13a0798
Failed tests should be fixed once the PR in PyTorch Core has landed.
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Failed tests should be fixed by #905
LGTM % some noob/minor comments
@functional_datapipe("round_robin_demux")
class RoundRobinDemultiplexerIterDataPipe(IterDataPipe):
Curious: why is demux in PyTorch core (https://github.com/pytorch/pytorch/blob/7a2930b357a4e62bb0bab53bb0d23c607b6ede38/torch/utils/data/datapipes/iter/combining.py#L305-L306) but round_robin_demux in TorchData?
Historical reasons. We put lots of DataPipes in core, but we plan to move them to torchdata eventually. To reduce the amount of work, I think it's better to put round_robin_demux in torchdata directly.
datapipe = datapipe.enumerate()
container = _RoundRobinDemultiplexerIterDataPipe(datapipe, num_instances, buffer_size=buffer_size)
return [_ChildDataPipe(container, i).map(_drop_index) for i in range(num_instances)]
As a Python noob, TIL that X(...) in Python may not return an instance of X:
https://www.geeksforgeeks.org/__new__-in-python/
"... because other class constructors can be called by new method or it can simply return other objects as an instance of this class."
Will _ChildDataPipe.__init__ be called for each instance in the list?
Yeah.
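To illustrate the exchange above, here is a minimal, self-contained sketch of a class whose __new__ returns a list of other objects. Splitter and Child are hypothetical stand-ins, not the torchdata classes. Each Child(...) call is an ordinary constructor call, so Child.__init__ runs once per element, while Splitter.__init__ is skipped because __new__ did not return a Splitter instance.

```python
class Child:
    """Hypothetical stand-in for a per-instance child object."""

    def __init__(self, container, instance_id):
        self.container = container
        self.instance_id = instance_id
        self.initialized = True


class Splitter:
    """Hypothetical class whose constructor call returns a list of Child objects."""

    init_calls = 0  # counts how often Splitter.__init__ actually runs

    def __new__(cls, source, num_instances):
        container = list(source)
        # Each Child(...) here goes through Child.__new__ and Child.__init__ as usual.
        return [Child(container, i) for i in range(num_instances)]

    def __init__(self, source, num_instances):
        # Not reached: Python only calls __init__ when __new__ returns an instance of cls.
        Splitter.init_calls += 1


children = Splitter(range(6), 3)
```

Here children is a plain list of three initialized Child objects, and Splitter.init_calls stays at zero.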
return datapipe

datapipe = datapipe.enumerate()
container = _RoundRobinDemultiplexerIterDataPipe(datapipe, num_instances, buffer_size=buffer_size)
Curious: can we just do
_DemultiplexerIterDataPipe(
    datapipe,
    num_instances,
    partial(_round_robin_fn, num_instances),
    False,
    buffer_size
)
where _round_robin_fn is:
def _round_robin_fn(num_instances: int, idx_data) -> int:
    idx, _ = idx_data
    return idx % num_instances
Is that because you want to override _DemultiplexerIterDataPipe.get_length_by_instance?
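The classification idea in this comment can be sketched in plain Python. This is an eager, list-based illustration only, not the torchdata implementation (which is lazy and buffered); round_robin_fn and round_robin_split are hypothetical names. Note that functools.partial takes the function first and the bound arguments after it.

```python
from functools import partial


def round_robin_fn(num_instances, idx_data):
    """Classify an (index, item) pair by its index modulo num_instances."""
    idx, _ = idx_data
    return idx % num_instances


def round_robin_split(items, num_instances):
    """Eager stand-in for: enumerate -> demux by index -> drop the index."""
    classify = partial(round_robin_fn, num_instances)  # binds num_instances
    buckets = [[] for _ in range(num_instances)]
    for idx_data in enumerate(items):
        # Route the item (without its index) to the bucket chosen by classify.
        buckets[classify(idx_data)].append(idx_data[1])
    return buckets


parts = round_robin_split(list("abcdefg"), 3)
# parts == [['a', 'd', 'g'], ['b', 'e'], ['c', 'f']]
```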
Yeah. demux doesn't provide a valid length, but this DataPipe should have a deterministic result. Let me know if you have a better way to do so.
Noob question: what does "deterministic result" mean here?
In particular, I was under the impression that in general we try to avoid querying len(datapipe) for a large datapipe, since it probably needs to read all the data and is expensive?
If the prior DataPipe provides a length, we should be able to get the length ahead of time.
For demux, which uses a classify_fn, there is no way for us to know the length, and the length can vary across epochs.
"in general try to avoid query len(datapipe) for large datapipe, since it probably needs to read all data and it's expensive?"
That's true. But we do provide a DataPipe that can manually assign a length to the DataPipe:
@functional_datapipe("set_length")
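The point about knowing lengths ahead of time can be sketched as follows. This assumes items are dealt to instances strictly in index order (instance i receives items i, i + num_instances, ...); round_robin_lengths is a hypothetical helper for illustration, not the PR's get_length_by_instance. Under that assumption, each child's length follows from the source length alone, without reading any data.

```python
def round_robin_lengths(source_length, num_instances):
    """Per-instance lengths when items are dealt out in round-robin order."""
    base, remainder = divmod(source_length, num_instances)
    # The first `remainder` instances each receive one extra item.
    return [base + (1 if i < remainder else 0) for i in range(num_instances)]


lengths = round_robin_lengths(10, 3)
# lengths == [4, 3, 3]
```

With a classify_fn-based demux, no such closed form exists, because the destination of each item depends on its content.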
Unrelated question: Why demux and unzip data pipes are in |
Aha. We categorized DataPipe into a few sectors. See: https://pytorch.org/data/beta/index.html We put them together as they are the counter party of When #293 is done, we probably need to make them into |
1b169ed to 9c8fa69
9c8fa69 to 2ea1de0
The only reason I added this DataPipe instead of directly using enumerate().demux().drop_index() is that this RoundRobinDemux should provide a valid length. So, this PR needs pytorch/pytorch#89216 to land first.
And, this DataPipe will be used for Proto2RS.
Side change: Move Unzip to combining.py as well.