[C10d] Fix Log Prefix in NCCLPG so that each instance gets its own prefix #116520
Dr. CI: ✅ No failures as of commit 95c550c with merge base 4c6e842. Artifacts and rendered test results are at hud.pytorch.org/pr/116520.
… its own prefix"

The log prefix only ever shows ProcessGroup 0 with the global rank. This does not match the code comment, which promises "a prefix that is unique to this process group and rank". This PR fixes the prefix so that it differs across sub-PGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

If the screenshot does not render, it is also available at: https://www.dropbox.com/s/x7ruhnqq7pm544f/Screenshot%202023-12-28%20at%203.26.13%E2%80%AFPM.png?dl=0

cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol wz337 tianyu-l wconstab yf225
I think if we switch to your way, it's not any more unique, it's just a different way. E.g. pg0 rank4 could be == pg1 rank0. Either way is "correct", but which one is better?
RE that comment: agreed, it describes what your way does. But I wrote it badly after thinking of using global rank info for logs. So let's decide and then make them consistent.
This LGTM. I was confused earlier. No need to add an additional global rank.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours).
…efix (#116520)

The log prefix only ever shows ProcessGroup 0 with the global rank. This does not match the code comment, which promises "a prefix that is unique to this process group and rank". This PR fixes the prefix so that it differs across sub-PGs.

The root cause is that the prefix is stored in a static variable, which is shared across all NCCLPG instances, so whichever instance calls this function first bakes its `rank_` and `uid_` into the prefix. PG 0 is always initialized first, which is why every sub-PG logs PG[0] plus the global rank.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

Pull Request resolved: #116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
Stack from ghstack (oldest at bottom):
The log prefix only ever shows ProcessGroup 0 with the global rank. This does not match the code comment, which promises "a prefix that is unique to this process group and rank". This PR fixes the prefix so that it differs across sub-PGs.
The root cause is that the prefix is stored in a static variable, which is shared across all NCCLPG instances, so whichever instance calls this function first bakes its `rank_` and `uid_` into the prefix. PG 0 is always initialized first, which is why every sub-PG logs PG[0] plus the global rank.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @wz337 @tianyu-l @wconstab @yf225
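The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual `ProcessGroupNCCL` code: `FakeProcessGroup`, `buggyPrefix`, `fixedPrefix`, and the exact prefix format are hypothetical names chosen for illustration. It only assumes what the description states, namely that a function-local `static` is initialized once and then shared by every instance.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a process group; NOT the real ProcessGroupNCCL.
class FakeProcessGroup {
 public:
  FakeProcessGroup(int uid, int rank)
      : uid_(uid), rank_(rank), prefix_(makePrefix()) {}

  // Buggy pattern: a function-local static is initialized exactly once,
  // by whichever instance calls this first, and then shared by all instances.
  const std::string& buggyPrefix() const {
    static const std::string prefix = makePrefix();
    return prefix;
  }

  // Fixed pattern: each instance stores its own prefix, built in the constructor.
  const std::string& fixedPrefix() const {
    return prefix_;
  }

 private:
  // The prefix format here is an assumption made for this sketch.
  std::string makePrefix() const {
    return "[PG " + std::to_string(uid_) + " Rank " + std::to_string(rank_) + "] ";
  }

  int uid_;
  int rank_;
  std::string prefix_;  // declared after uid_ and rank_, so they are set first
};

// Small demonstration: the "default" group is created and logs first,
// mirroring the observation that PG 0 is always initialized first.
void demo() {
  FakeProcessGroup defaultPg(/*uid=*/0, /*rank=*/4);
  FakeProcessGroup subPg(/*uid=*/1, /*rank=*/0);

  // Both instances report the same prefix with the static variable...
  assert(defaultPg.buggyPrefix() == subPg.buggyPrefix());
  // ...while the per-instance member keeps them distinct.
  assert(defaultPg.fixedPrefix() != subPg.fixedPrefix());
}
```

Moving the string from a function-local `static` to a per-instance member costs one string per process group, which seems negligible next to the benefit of logs that identify the sub-PG that emitted them.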