
Implement DistributedReadingService #727

Closed
wants to merge 2 commits from the distributed_rs branch

Conversation

ejguan
Contributor

@ejguan ejguan commented Aug 11, 2022

Add DistributedReadingService (a usage sketch follows the lists below):

  • Single process
  • Share shuffle seeds across distributed processes
  • Automatic distributed sharding

Add tests for both DataLoader2 and DataLoader.

  • Spawn processes
  • Elastic training
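
A minimal usage sketch, assuming the API shape introduced in this PR (DistributedReadingService from torchdata.dataloader2) with one process per rank; the datapipe, the gloo backend, and the launcher setup are illustrative only:

```python
# Hedged sketch: per-rank DataLoader2 + DistributedReadingService usage, assuming
# RANK/WORLD_SIZE and MASTER_* are provided by the launcher (torchrun, mp.spawn, etc.).
import torch.distributed as dist
from torchdata.datapipes.iter import IterableWrapper
from torchdata.dataloader2 import DataLoader2, DistributedReadingService

dist.init_process_group("gloo")  # env:// init; rank/world size come from the launcher

# shuffle() will use a seed shared across ranks; sharding_filter() marks where the
# reading service applies distributed sharding, so each rank sees a disjoint shard.
dp = IterableWrapper(range(100)).shuffle().sharding_filter()

dl = DataLoader2(dp, reading_service=DistributedReadingService())
for item in dl:
    pass  # per-rank training step goes here

dl.shutdown()
dist.destroy_process_group()
```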

@facebook-github-bot added the CLA Signed label on Aug 11, 2022
@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@VitalyFedyunin
Contributor

Just to be clear, does the first version assume 1 process per rank?

Contributor

@NivekT NivekT left a comment

Questions about the distributed design in general:

  1. Are we expecting users to always use something like torch.multiprocessing.spawn to start distributed training? And will that properly start/clean up all the processes? (A sketch of this launch pattern follows the list.)
  2. To what extent is the optimization from #555 (second prototype to do pre-sharding work in a single process) compatible with this? Maybe it can work for every node?
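
For reference, a spawn-based launch (question 1) might look roughly like the sketch below; the worker() function, port, backend, and world size are placeholders, not part of this PR:

```python
# Hedged sketch of launching one process per rank with torch.multiprocessing.spawn.
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "localhost"  # placeholder rendezvous settings
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    # ... build DataLoader2 with DistributedReadingService and iterate here ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    # spawn() starts nprocs processes (one per rank) and joins them on exit
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```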

Review threads (resolved): test/test_distributed.py; torchdata/dataloader2/reading_service.py (3 outdated threads)
@ejguan
Contributor Author

ejguan commented Aug 17, 2022

Just to be clear, does the first version assume 1 process per rank?

Yeah. I think our next step is to support a mixed reading service (distributed reading service + multiprocessing reading service).

@ejguan
Contributor Author

ejguan commented Aug 17, 2022

  1. Are we expecting users to always use something like torch.multiprocessing.spawn to start distributed training? And will that properly start/clean up all the processes?

Nope. I can add a different test for elastic training.
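
For context, an elastic launcher such as torchrun makes an explicit spawn unnecessary: it starts one process per rank and exports the rendezvous settings as environment variables, so the worker script only needs the default env:// init. A small illustration (these are the standard PyTorch launcher variables; the values printed depend on the launcher, nothing here is mandated by this PR):

```python
# Hedged illustration: under an elastic launcher these variables are already set,
# so dist.init_process_group("gloo") can be called with no explicit rank/world_size.
import os

for var in ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT"):
    print(var, "=", os.environ.get(var))
```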

@ejguan
Contributor Author

ejguan commented Aug 18, 2022

This PR should be ready for review. cc: @NivekT @VitalyFedyunin
In terms of multiprocessing + distributed, I will figure out the design of mixed reading services first and then add that feature.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Contributor Author

ejguan commented Aug 18, 2022

Actually, there is one thing left: documentation and a tutorial for DistributedReadingService. I will do a separate PR to document DataLoader2 and DistributedReadingService.

@ejguan
Contributor Author

ejguan commented Aug 19, 2022

I will wait until pytorch/pytorch#83741 is landed and released into the nightly build, because this PR also uses the updated API.

@ejguan ejguan force-pushed the distributed_rs branch 4 times, most recently from e4cd5fe to 43b06eb, on September 15, 2022 13:55
@ejguan
Contributor Author

ejguan commented Sep 15, 2022

The DataPipe tests are failing because the nightly binaries for macOS haven't been updated yet.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan requested a review from NivekT September 20, 2022 19:14
Contributor

@NivekT NivekT left a comment

LGTM!

Nit:
IIUC, the difference between this and _test_distributed_rs is DL1 vs DL2?

  • If so, we can potentially remove the duplicate code (not urgent).
  • Alternatively, we can just label them _test_distributed_dl and _test_distributed_dataloader (and elastic_dl/elastic_training) so it is obvious at first glance that those two are mostly the same but test different versions of the DataLoader.

Review thread (resolved, outdated): test/test_distributed.py
@ejguan
Contributor Author

ejguan commented Sep 20, 2022

IIUC, the difference between this and _test_distributed_rs is DL1 vs DL2?

  • If so, we can potentially remove the duplicate code (not urgent).
  • Alternatively, we can just label them _test_distributed_dl and _test_distributed_dataloader (and elastic_dl/elastic_training) so it is obvious at first glance that those two are mostly the same but test different versions of the DataLoader.

Fixed

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan ejguan force-pushed the distributed_rs branch 3 times, most recently from b2ad947 to 8cad29d, on September 21, 2022 15:26
@ejguan ejguan force-pushed the distributed_rs branch 2 times, most recently from 77ad0ef to 4aa1272, on September 21, 2022 17:24
@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ejguan
Contributor Author

ejguan commented Sep 22, 2022

I will land this PR once PyTorch nightly is updated.

@facebook-github-bot
Contributor

@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
