Add fake process group #102180
Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/102180
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit ac40616: This comment was automatically generated by Dr. CI and updates every 15 minutes.
Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 9b3fb29acddb13eb81946ef5d23abe3b3ce253ca Pull Request resolved: #102180
looks great, thanks!
class FakeProcessGroup(dist.ProcessGroup):
    pass


class FakeStore(dist.Store):
we probably don't need a FakeStore and instead we can just use HashStore I suppose. But that could be in a separate PR.
Can't use HashStore. For example, FSDP will attempt to do a barrier. The barrier will block you until enough writes into the store have happened. If we're doing fake PG there will be no other writes and you'll deadlock. It's best to have the store error if you try to do anything with it and route around it differently.
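To make the deadlock argument concrete, here is a minimal pure-Python sketch. `ToyHashStore`, `ToyFakeStore`, and `store_barrier` are hypothetical stand-ins invented for illustration, not the real `HashStore` or `FakeStore` from `torch.distributed`, and the short timeout stands in for what would otherwise be an indefinite block.

```python
import threading


class ToyHashStore:
    """Hypothetical in-memory counter store (NOT torch's HashStore)."""

    def __init__(self):
        self._counts = {}
        self._cond = threading.Condition()

    def add(self, key, amount):
        # Increment a counter and wake up any waiters.
        with self._cond:
            self._counts[key] = self._counts.get(key, 0) + amount
            self._cond.notify_all()
            return self._counts[key]

    def wait_for(self, key, target, timeout):
        # Block until the counter reaches `target`; return False on timeout.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._counts.get(key, 0) >= target, timeout
            )


class ToyFakeStore:
    """Hypothetical store that errors on any use instead of blocking."""

    def add(self, key, amount):
        raise RuntimeError("ToyFakeStore does not support add()")

    def wait_for(self, key, target, timeout):
        raise RuntimeError("ToyFakeStore does not support wait_for()")


def store_barrier(store, world_size, timeout=0.1):
    """Store-based barrier: each rank writes once, then waits for all writes."""
    store.add("barrier", 1)
    return store.wait_for("barrier", world_size, timeout)
```

With a single real process pretending to be one rank of two, `store_barrier(ToyHashStore(), world_size=2)` never sees the second write and only returns because of the timeout, whereas the erroring store fails immediately and makes the unsupported call site obvious.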
Hmmm I think the FakePG barrier should be a no-op, and for the init_process_group store-based barrier, it seems we already skip the barrier, so it won't write anything to the HashStore. So either fake_store or hash_store could work, I feel (I can give it a try and see if that's feasible or not).
FSDP does a barrier which is why I ended up doing FakeStore. But yeah, try some stuff out, the goal is to be able to run FSDP end-to-end with the fake group with only one node.
Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]
Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 1f5983b9437c2ab2b0f218ab82c27a81f3b385ae Pull Request resolved: #102180
Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]
Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 9fce552f9de4f4986ab881d32771d047c328ea54 Pull Request resolved: #102180
Signed-off-by: Edward Z. Yang <ezyang@meta.com> [ghstack-poisoned]
Signed-off-by: Edward Z. Yang <ezyang@meta.com> ghstack-source-id: 7a14fc8e5a4edcef98acf748ccceac03bb1ff276 Pull Request resolved: #102180
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Very cool!
Do you have a small script showing the right way to use this?
the test cases are pretty good
Ok!
@albanD the memory behavior should be representative of each rank's behavior, as if you were running a real multiprocessing job with the nccl process group. i.e. if specify
May I ask why you implemented additional
Stack from ghstack (oldest at bottom):
Signed-off-by: Edward Z. Yang ezyang@meta.com