Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fake process group #102180

Closed
wants to merge 4 commits into from
Closed

Add fake process group #102180

wants to merge 4 commits into from

Conversation

ezyang
Copy link
Contributor

@ezyang ezyang commented May 24, 2023

Stack from ghstack (oldest at bottom):

Signed-off-by: Edward Z. Yang ezyang@meta.com

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

[ghstack-poisoned]
@pytorch-bot
Copy link

pytorch-bot bot commented May 24, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102180

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ac40616:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the release notes: distributed (c10d) release notes category label May 24, 2023
ezyang added a commit that referenced this pull request May 24, 2023
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

ghstack-source-id: 9b3fb29acddb13eb81946ef5d23abe3b3ce253ca
Pull Request resolved: #102180
Copy link
Contributor

@wanchaol wanchaol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks!

class FakeProcessGroup(dist.ProcessGroup):
pass

class FakeStore(dist.Store):
Copy link
Contributor

@wanchaol wanchaol May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably don't need a FakeStore and instead we can just use HashStore I suppose. But that could be in a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use HashStore. For example, FSDP will attempt to do a barrier. The barrier will block you until enough writes into the store have happened. If we're doing fake PG there will be no other writes and you'll deadlock. It's best to have the store error if you try to do anything with it and route around it differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I think FakePG barrier should be a no-op, and in terms of init_processs_group store based barrier it seems we already skip the barrier so it won't write anything to the HashStore. So either fake_store or hash_store could work I feel (I can give it a try and see if that's feasible or not)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FSDP does a barrier which is why I ended up doing FakeStore. But yeah, try some stuff out, the goal is to be able to run FSDP end-to-end with the fake group with only one node.

@wanchaol wanchaol added the ciflow/trunk Trigger trunk jobs on your pull request label May 24, 2023
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request May 24, 2023
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

ghstack-source-id: 1f5983b9437c2ab2b0f218ab82c27a81f3b385ae
Pull Request resolved: #102180
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request May 24, 2023
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

ghstack-source-id: 9fce552f9de4f4986ab881d32771d047c328ea54
Pull Request resolved: #102180
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

[ghstack-poisoned]
ezyang added a commit that referenced this pull request May 24, 2023
Signed-off-by: Edward Z. Yang <ezyangmeta.com>

ghstack-source-id: 7a14fc8e5a4edcef98acf748ccceac03bb1ff276
Pull Request resolved: #102180
@ezyang
Copy link
Contributor Author

ezyang commented May 24, 2023

@pytorchbot merge

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

Copy link
Collaborator

@albanD albanD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!
Do you have a small script showing the right way to use this?

@ezyang
Copy link
Contributor Author

ezyang commented Jun 5, 2023

the test cases are pretty good

@albanD
Copy link
Collaborator

albanD commented Jun 5, 2023

Ok!
But can I actually forward/backward with that Module? What should I expect the memory behavior of this thing to be (representative of what would happen with nccl or not?)

@wanchaol
Copy link
Contributor

wanchaol commented Jun 6, 2023

Ok! But can I actually forward/backward with that Module? What should I expect the memory behavior of this thing to be (representative of what would happen with nccl or not?)

@albanD the memory behavior should be representative of each rank behavior as if you are running a real multiprocessing job with nccl process group. i.e. if specify rank=1 and torch.cuda.set_device(1) when you initialize the fake pg, all of the memory behaviors should be similar as if you are in rank 1 of a real multiprocessing job. (as under the hood each rank still allocate the same amount of data for c10d collectivese)

@facebook-github-bot facebook-github-bot deleted the gh/ezyang/2113/head branch June 8, 2023 17:08
@insujang
Copy link

insujang commented Jul 3, 2023

May I ask why you implemented additional FakeProcessGroup implementation while we have MockProcessGroup in torch/testing/_internal/distributed/distributed_utils.py? I understand that MockProcessGroup is just mocking a process group and does nothing while FakeProcessGroup aims to provide fake communication as well, but it is a superset and seems no reason to maintain MockProcessGroup. Will you deprecate MockProcessGroup later?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged release notes: distributed (c10d) release notes category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants