Skip to content

Conversation

wenleix
Copy link
Contributor

@wenleix wenleix commented Feb 6, 2023

Summary:
Move ShardingFilterIterDataPipe into a dedicated file.

Also, propose to have a dedicated parent class (_ShardingIterDataPipe) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable Callable[[Iterable], Iterable]. But open to other discussions.

Open question: Should
ShardingRoundRobinDispatcherIterDataPipe also be considered as a _ShardingIterDataPipe? (e.g. this sharding is executed by replicating (the metadata), while ShardingRoundRobinDispatcherIterDataPipe hints too expensive to replicate so requires round robin data exchange/dispatch).

Differential Revision:
D43014692

X-link: pytorch/pytorch#94095

D43014692

@facebook-github-bot facebook-github-bot added CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported labels Feb 6, 2023
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

wenleix added a commit to wenleix/data that referenced this pull request Feb 6, 2023
…rch#987)

Summary:
Pull Request resolved: meta-pytorch#987

X-link: pytorch/pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

fbshipit-source-id: 223a9503fc5c3c78b2aebb30fe9665bb081f37ee
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

1 similar comment
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

wenleix added a commit to wenleix/data that referenced this pull request Feb 6, 2023
…rch#987)

Summary:
Pull Request resolved: meta-pytorch#987

X-link: pytorch/pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

fbshipit-source-id: 74713ac091b1dcb215687a9f572dc96d79d89c7e
wenleix added a commit to wenleix/data that referenced this pull request Feb 6, 2023
…rch#987)

Summary:
Pull Request resolved: meta-pytorch#987

X-link: pytorch/pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

fbshipit-source-id: 917937c78c6839a43b8595f649ada9fe829a01fd
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

wenleix added a commit to wenleix/data that referenced this pull request Feb 6, 2023
…rch#987)

Summary:
Pull Request resolved: meta-pytorch#987

X-link: pytorch/pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

fbshipit-source-id: a1d3bcb02309ca7ee4d991a31945d4550f4174b9
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

wenleix added a commit to wenleix/data that referenced this pull request Feb 6, 2023
…rch#987)

Summary:
Pull Request resolved: meta-pytorch#987

X-link: pytorch/pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

Reviewed By: ejguan

fbshipit-source-id: d7b450b89989f0a05c6e70fb8d7290893918a32a
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

…rch#987)

Summary:
Pull Request resolved: meta-pytorch#987

X-link: pytorch/pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

Reviewed By: seemethere

fbshipit-source-id: 79a2ee53d279a4dc8da84eb2af3e5c801ff3d1ea
@facebook-github-bot
Copy link
Contributor

This pull request was exported from Phabricator. Differential Revision: D43014692

wenleix added a commit to wenleix/pytorch that referenced this pull request Feb 7, 2023
Summary:
X-link: meta-pytorch/data#987

Pull Request resolved: pytorch#94095

Differential Revision:
D43014692

Move `ShardingFilterIterDataPipe` into a dedicated file.

Also, propose to have a dedicated parent class (`_ShardingIterDataPipe`) for sharding data pipe, as this seems more like a "system/engine-level" datapipe that gives strong hints to RS on how to execute, and needs first-class citizen treatment in RS (compared with other "user-level" datapipe that are mostly composable `Callable[[Iterable], Iterable]`.  But open to other discussions.

### Open question
Should
[ShardingRoundRobinDispatcherIterDataPipe](https://github.com/pytorch/data/blob/01fc76200354501b057bb439b43a1f05f609dd0a/torchdata/datapipes/iter/util/sharding.py#L16-L17) also be considered as a `_ShardingIterDataPipe`? (e.g. this sharding is executed by replicating (the metadata), while `ShardingRoundRobinDispatcherIterDataPipe` hints too expensive to replicate so requires round robin data exchange/dispatch).

D43014692

Test Plan:
sandcastle and CI

How to run unit tests in buck related to such changes? :)

Reviewed By: seemethere

fbshipit-source-id: fcd2a4e57b15fdf7411c959a38c377ac114f0ecb
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 914d3dd.

@wenleix wenleix deleted the export-D43014692 branch February 7, 2023 09:06
@pmeier
Copy link
Contributor

pmeier commented Feb 8, 2023

This is BC breaking. The original module (torch.utils.data.datapipes.iter.grouping) as well as the class (ShardingFilterIterDataPipe) were public. For example, torchvision uses it here.

For us it is not a large problem to change, but other libraries that rely on that might break in the upcoming release.

@ejguan
Copy link
Contributor

ejguan commented Feb 8, 2023

@pmeier You are right. How about we reference those two classes back to grouping from sharding.py and add a deprecation warning? @wenleix

@pmeier
Copy link
Contributor

pmeier commented Feb 8, 2023

Just to give some unsolicited advice: I would push to make all modules below torch.utils.data.datapipes.iter or equivalent torchdata.datapipes.iter private while you are still in beta. You can expose all the functionality in the top namespaces and don't have to worry about BC for the structure in the long run. For an example, have a look at how we are doing it for our new functionality, e.g. https://github.com/pytorch/vision/tree/main/torchvision/prototype/transforms

Using such an architecture would have prevented this, because the datapipe would still be exposed under torch.utils.data.datapipes.iter and you are free to move it around as you like.

@ejguan
Copy link
Contributor

ejguan commented Feb 8, 2023

@pmeier
For ShardingFilter, it is exposed at parent level as torch.utils.data.datapipes.iter.ShardingFilter.
https://github.com/pytorch/pytorch/blob/fe0e28ab87ddda38a82efe74dd3eaf05e926f117/torch/utils/data/datapipes/iter/__init__.py#L57

@ejguan
Copy link
Contributor

ejguan commented Feb 8, 2023

@pmeier
Copy link
Contributor

pmeier commented Feb 8, 2023

For ShardingFilter, it is exposed at parent level as torch.utils.data.datapipes.iter.ShardingFilter.

Huh, I guess we never looked for that given that the module it is in is public. But fair enough, I'm gonna change that on our side.

@wenleix
Copy link
Contributor Author

wenleix commented Feb 9, 2023

How about we reference those two classes back to grouping from sharding.py and add a deprecation warning? @wenleix

Sounds good. Let me import them back . Thanks for reporting this!

wenleix added a commit to wenleix/pytorch that referenced this pull request Feb 10, 2023
…eprecated (pytorch#94527)

Summary:
Pull Request resolved: pytorch#94527

pytorch#94095 moves this into `tud.datapipes.iter.sharding`. However, since previously this is a public API, this is a BC break change.

As discussed in meta-pytorch/data#987 (comment), we will have backward compatbile support but with deprecated warning.

Test Plan:
```
buck test mode/opt //caffe2/test:datapipe
```

Reviewed By: ejguan

Differential Revision: D43161015

fbshipit-source-id: 82049979981b83aefd247843fbc2c004a7e7a90a
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Feb 10, 2023
…eprecated (#94527)

Summary:
#94095 moves this into `tud.datapipes.iter.sharding`. However, since previously this is a public API, this is a BC break change.

As discussed in meta-pytorch/data#987 (comment), we will have backward compatbile support but with deprecated warning.

Differential Revision: D43161015

Pull Request resolved: #94527
Approved by: https://github.com/ejguan, https://github.com/NivekT
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported Merged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants