Add fake process group #102180

ezyang · 2023-05-24T17:40:19Z

Stack from ghstack (oldest at bottom):

-> Add fake process group #102180

Signed-off-by: Edward Z. Yang ezyang@meta.com

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

pytorch-bot · 2023-05-24T17:40:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102180


✅ No Failures

As of commit ac40616:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 9b3fb29acddb13eb81946ef5d23abe3b3ce253ca Pull Request resolved: #102180

wanchaol

looks great, thanks!

wanchaol · 2023-05-24T17:44:27Z

torch/testing/_internal/distributed/fake_pg.py

+class FakeProcessGroup(dist.ProcessGroup):
+    pass
+
+class FakeStore(dist.Store):


We probably don't need a FakeStore; we can just use HashStore instead, I suppose. But that could be in a separate PR.

ezyang · 2023-05-24

Can't use HashStore. For example, FSDP will attempt to do a barrier. The barrier blocks until enough writes into the store have happened. If we're using the fake PG, there will be no other writes and you'll deadlock. It's best to have the store error out if you try to do anything with it, and to route around it differently.
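The deadlock described above can be sketched without PyTorch. This is a toy illustration, not HashStore or c10d code: `CountingStore` and `store_barrier` are hypothetical names, and the barrier is assumed to work by having each rank increment a shared counter and then wait until the counter reaches the world size. A timeout stands in for the hang so the example terminates.

```python
import threading
import time


class CountingStore:
    """Toy in-memory store with an atomic add (hypothetical stand-in for a real store)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def add(self, key, amount):
        with self._lock:
            self._data[key] = self._data.get(key, 0) + amount
            return self._data[key]

    def get(self, key):
        with self._lock:
            return self._data.get(key, 0)


def store_barrier(store, world_size, timeout_s, poll_s=0.01):
    """Write once, then wait until world_size writes are visible.

    Returns True if the barrier completed and False if it timed out.
    """
    store.add("barrier", 1)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if store.get("barrier") >= world_size:
            return True
        time.sleep(poll_s)
    return False


# One process claiming to be one rank of two: no other rank ever writes,
# so the counter stays at 1 and the barrier can only time out.
single_process_result = store_barrier(CountingStore(), world_size=2, timeout_s=0.2)

# Two real participants sharing one store: both writes arrive and both return.
shared_store = CountingStore()
results = []


def participant():
    results.append(store_barrier(shared_store, world_size=2, timeout_s=5.0))


threads = [threading.Thread(target=participant) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a real blocking barrier there is no timeout, so the single-process case would wait forever, which is the argument for a store that fails loudly instead.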

wanchaol · 2023-05-24

Hmmm, I think the FakePG barrier should be a no-op, and as for the store-based barrier in init_process_group, it seems we already skip it, so it won't write anything to the HashStore. So I feel either fake_store or hash_store could work (I can give it a try and see whether that's feasible).

ezyang · 2023-05-24

FSDP does a barrier, which is why I ended up doing FakeStore. But yeah, try some stuff out; the goal is to be able to run FSDP end-to-end with the fake group with only one node.
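The two ideas in this thread, a store that errors on any use and a group whose barrier is a no-op, can be sketched in plain Python. This is only an illustration under stated assumptions: `RaisingStore` and `NoOpGroup` are hypothetical names, the method set is chosen for the example, and none of this is the code in torch/testing/_internal/distributed/fake_pg.py.

```python
class RaisingStore:
    """Hypothetical store that refuses every operation instead of blocking."""

    def _refuse(self, op):
        raise RuntimeError(f"store.{op}() called on a fake store")

    def set(self, key, value):
        self._refuse("set")

    def get(self, key):
        self._refuse("get")

    def add(self, key, amount):
        self._refuse("add")

    def wait(self, keys):
        self._refuse("wait")


class NoOpGroup:
    """Hypothetical single-process stand-in for a process group."""

    def __init__(self, rank, world_size):
        self.rank = rank
        self.world_size = world_size

    def barrier(self):
        # No other ranks exist, so there is nothing to wait for.
        return None

    def all_reduce(self, values):
        # No data is exchanged; the local buffer is returned unchanged.
        return values


group = NoOpGroup(rank=1, world_size=4)
barrier_result = group.barrier()
reduced = group.all_reduce([1.0, 2.0])

# Any attempt to coordinate through the store fails immediately
# rather than hanging while waiting for ranks that do not exist.
try:
    RaisingStore().add("barrier", 1)
    store_error = None
except RuntimeError as exc:
    store_error = str(exc)
```

The design trade-off is that a raising store surfaces unexpected store traffic as an immediate error, while a working in-memory store only stays safe as long as nothing ever waits on writes from other ranks.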

test/distributed/test_fake_pg.py

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 1f5983b9437c2ab2b0f218ab82c27a81f3b385ae Pull Request resolved: #102180

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 9fce552f9de4f4986ab881d32771d047c328ea54 Pull Request resolved: #102180

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 7a14fc8e5a4edcef98acf748ccceac03bb1ff276 Pull Request resolved: #102180

ezyang · 2023-05-24T19:28:17Z

@pytorchbot merge

pytorchmergebot · 2023-05-24T19:30:15Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).


albanD

Very cool!
Do you have a small script showing the right way to use this?

ezyang · 2023-06-05T15:39:07Z

the test cases are pretty good

albanD · 2023-06-05T22:14:11Z

Ok!
But can I actually forward/backward with that Module? What should I expect the memory behavior of this thing to be (representative of what would happen with nccl or not?)

wanchaol · 2023-06-06T18:32:47Z

Ok! But can I actually forward/backward with that Module? What should I expect the memory behavior of this thing to be (representative of what would happen with nccl or not?)

@albanD the memory behavior should be representative of each rank's behavior, as if you were running a real multiprocessing job with an nccl process group. I.e., if you specify rank=1 and call torch.cuda.set_device(1) when you initialize the fake pg, the memory behavior should be similar to what rank 1 of a real multiprocessing job would see (under the hood, each rank still allocates the same amount of data for c10d collectives).

insujang · 2023-07-03T15:42:51Z

May I ask why you added a separate FakeProcessGroup implementation when we already have MockProcessGroup in torch/testing/_internal/distributed/distributed_utils.py? I understand that MockProcessGroup just mocks a process group and does nothing, while FakeProcessGroup aims to provide fake communication as well, but FakeProcessGroup is a superset, so there seems to be no reason to maintain MockProcessGroup. Will you deprecate MockProcessGroup later?

Add fake process group

6d6979a

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

ezyang requested review from mrshenli, zhaojuanmao, rohan-varma, H-Huang, awgu, kwen2501, wanchaol, fegin, kiukchung and d4l3k as code owners May 24, 2023 17:40

pytorch-bot bot added the release notes: distributed (c10d) release notes category label May 24, 2023

github-actions bot requested review from albanD, antoniojkim, bdhirsh, jbschlosser, miladm, SherlockNoMad, voznesenskym and wconstab May 24, 2023 17:40

ezyang added a commit that referenced this pull request May 24, 2023

Add fake process group

75af2b9

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 9b3fb29acddb13eb81946ef5d23abe3b3ce253ca Pull Request resolved: #102180

wanchaol approved these changes May 24, 2023


wanchaol added the ciflow/trunk Trigger trunk jobs on your pull request label May 24, 2023

wanchaol reviewed May 24, 2023


Update on "Add fake process group"

737a3fa

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

ezyang added a commit that referenced this pull request May 24, 2023

Add fake process group

2bee7ea

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 1f5983b9437c2ab2b0f218ab82c27a81f3b385ae Pull Request resolved: #102180

Update on "Add fake process group"

64a7627

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

ezyang added a commit that referenced this pull request May 24, 2023

Add fake process group

7ccfbd0

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 9fce552f9de4f4986ab881d32771d047c328ea54 Pull Request resolved: #102180

Update on "Add fake process group"

ac40616

Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]

ezyang added a commit that referenced this pull request May 24, 2023

Add fake process group

377f585

Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 7a14fc8e5a4edcef98acf748ccceac03bb1ff276 Pull Request resolved: #102180

pytorchmergebot added the merging label May 24, 2023

pytorchmergebot added Merged and removed merging labels May 24, 2023

pytorchmergebot closed this in c903b12 May 24, 2023

albanD reviewed Jun 5, 2023


facebook-github-bot deleted the gh/ezyang/2113/head branch June 8, 2023 17:08
