Make FakeProcessGroup traceable #113314
Conversation
This PR mimics what we have done to trace ProcessGroup. Differential Revision: [D51136463](https://our.internmc.facebook.com/intern/diff/D51136463/)
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/113314. Note: links to docs will display an error until the docs builds have completed. ✅ No failures as of commit 858ef91 with merge base 376217c. This comment was automatically generated by Dr. CI and updates every 15 minutes.
torch/_dynamo/variables/builder.py (outdated)

```diff
@@ -660,6 +661,12 @@ def index_source(key):
                 source=self.source,
                 guards=self.make_guards(GuardBuilder.ID_MATCH),
             )
+        elif FakeProcessGroupVariable.is_process_group(value):
```
Could we make this check directly inside ProcessGroupVariable? I think FakePG is just another type of PG, so we should try to make that work directly.
Good point. I guess we don't actually do anything special for FakePG. Let me change it.
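A torch-free sketch of the direction the reviewer suggests: rather than adding a dedicated branch for the fake class in the builder, fold it into the existing type check so one code path covers both. The class and function names below are illustrative stand-ins, not the actual dynamo internals.

```python
# Stand-in classes: in PyTorch these would be the real
# torch.distributed.ProcessGroup and the test-only FakeProcessGroup.
class ProcessGroup:
    pass

class FakeProcessGroup:
    """Mimics ProcessGroup's interface without real communication."""

def is_process_group(value) -> bool:
    # One check covers both the real and the fake implementation, so the
    # variable builder needs no separate branch for the fake class.
    return isinstance(value, (ProcessGroup, FakeProcessGroup))

assert is_process_group(ProcessGroup())
assert is_process_group(FakeProcessGroup())
assert not is_process_group(object())
```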
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@pytorchbot merge (Initiating merge automatically since Phabricator Diff has merged)
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#113314. Approved by: https://github.com/wanchaol
Stack from ghstack (oldest at bottom):
This PR mimics what we have done to trace ProcessGroup. This allows us to use FakeProcessGroup with torch.compile. FakeProcessGroup lets us use world_size > 1 without creating multiple processes, which enables using PDB to debug the bucketing of DDP allreduce in Inductor. We could theoretically use GLOO with world_size == 1 to achieve the same goal; however, `wait()` seems to be optimized away when the world_size is 1.

Differential Revision: D51136463
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @aakhundov @kadeng