Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DataPipe] Add RoundRobinDemux #903

Closed
wants to merge 3 commits into from
Closed

Conversation

ejguan
Copy link
Contributor

@ejguan ejguan commented Nov 17, 2022

The only reason I added this DataPipe without directly using enumereate().demux().drop_index() is this RoundRobinDemux should provide a valid length. So, this PR needs pytorch/pytorch#89216 landed

And, this DataPipe will be used for Proto2RS.

Side change: Move Unzip to combining.py as well.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 17, 2022
@ejguan
Copy link
Contributor Author

ejguan commented Nov 17, 2022

Failed tests should be fixed the the PR in PyTorch Core landed.

@ejguan ejguan requested a review from NivekT November 17, 2022 21:58
@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Copy link
Contributor Author

ejguan commented Nov 18, 2022

Failed tests should be fixed by #905

@ejguan ejguan requested a review from wenleix November 18, 2022 16:20
Copy link
Contributor

@wenleix wenleix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM % some noob/minor comments



@functional_datapipe("round_robin_demux")
class RoundRobinDemultiplexerIterDataPipe(IterDataPipe):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historical reason. We put lots of DataPipe in core but we plan to move them to torchdata eventually. To reduce the amount of work, I think it's better to put round_robin_demux to torchdata directly.

torchdata/datapipes/iter/util/combining.py Outdated Show resolved Hide resolved

datapipe = datapipe.enumerate()
container = _RoundRobinDemultiplexerIterDataPipe(datapipe, num_instances, buffer_size=buffer_size)
return [_ChildDataPipe(container, i).map(_drop_index) for i in range(num_instances)]
Copy link
Contributor

@wenleix wenleix Nov 19, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a Python noob, TIL I learnt X(...) in Python may not return a instance of X

https://www.geeksforgeeks.org/__new__-in-python/

... because other class constructors can be called by new method or it can simply return other objects as an instance of this class.

Will _ChildDataPipe.__init__ be called for each instance in the list? ~

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah.

torchdata/datapipes/iter/util/combining.py Outdated Show resolved Hide resolved
return datapipe

datapipe = datapipe.enumerate()
container = _RoundRobinDemultiplexerIterDataPipe(datapipe, num_instances, buffer_size=buffer_size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious: can we just do

_DemultiplexerIterDataPipe(
    datapipe, 
    num_instances, 
    partial(num_instances, _round_robin_fn), 
    False, 
    buffer_size
)

where _round_robin_fn is:

def _round_robin_fn(num_instances: int, idx_data) -> int:
        idx, _ = idx_data
        return idx % num_instances

Is that because you want to override _DemultiplexerIterDataPipe.get_length_by_instance ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. demux doesn't provide a valid length but this DataPipe should have a deterministic result. Let me know if you have better way to do so.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noob question : what's "deterministic result" here mean?

Especially I was somehow under the impression that in general try to avoid query len(datapipe) for large datapipe , since it probably needs to read all data and it's expensive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the prior DataPipe provides a length, we should be able to get the length ahead of time.
For demux which uses a classify_fn, there is no way for us to know the length and the length can vary across epoches.

in general try to avoid query len(datapipe) for large datapipe , since it probably needs to read all data and it's expensive?

It's true. But, we do provide DataPipe that can manually assign length to the DataPipe

@functional_datapipe("set_length")

@wenleix
Copy link
Contributor

wenleix commented Nov 19, 2022

Unrelated question: Why demux and unzip data pipes are in combining.py ? (shouldn't they be in splitting.py)?

@ejguan
Copy link
Contributor Author

ejguan commented Nov 21, 2022

Why demux and unzip data pipes are in combining.py ? (shouldn't they be in splitting.py)?

Aha. We categorized DataPipe into a few sectors. See: https://pytorch.org/data/beta/index.html We put them together as they are the counter party of mux, zip, etc., and we don't have such a mount of DataPipes.

When #293 is done, we probably need to make them into combining.py and splitting.py.

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan added the topic: new feature topic category label Dec 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. topic: new feature topic category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants