Skip to content

Conversation

fduwjj
Copy link
Contributor

@fduwjj fduwjj commented Dec 28, 2023

Stack from ghstack (oldest at bottom):

Somehow the logprefix only have ProcessGroup 0 rank [global rank]. This does not give the expected result as per the comment says "a prefix that is unique to this process group and rank". So this PR fix it and make it different for different subPGs.

The reason is that we set the prefix static which is shared across all NCCLPG instances and whoever calls this function first will set rank_ and uid_ to the prefix. We always initialize PG 0 first that's why we always see PG[0] + global ranks for all subPGs.

image

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @wz337 @tianyu-l @wconstab @yf225

[ghstack-poisoned]
Copy link

pytorch-bot bot commented Dec 28, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/116520

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 95c550c with merge base 4c6e842 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 28, 2023
@fduwjj fduwjj changed the title [C10d] Fix Log Fix [C10d] Fix Log Prefix in NCCLPG so that each instance gets its own prefix Dec 28, 2023
… its own prefix"


Somehow the logprefix only have ProcessGroup 0 rank [global rank]. This does not give the expected result as per the comment says "a prefix that is unique to this process group and rank". So this PR fix it and make it different for different subPGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">


If the screenshot does not work, one can also use the link: https://www.dropbox.com/s/x7ruhnqq7pm544f/Screenshot%202023-12-28%20at%203.26.13%E2%80%AFPM.png?dl=0


cc mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol wz337 tianyu-l wconstab yf225

[ghstack-poisoned]
@wconstab
Copy link
Contributor

wconstab commented Jan 2, 2024

I think if we switch to your way, its not any more unique, its just a different way.

e.g. pg0 rank4 could be == pg1 rank0

either way is ‘correct’ but which one is better?

a prefix that is unique to this process group and rank"

RE that comment, agreed its saying more what your way does. But i wrote it badly after thinking of using global rank info for logs. So lets decide and then make them consistent

Copy link
Contributor

@wconstab wconstab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this LGTM. I was confused earlier. No need to add additional global rank.

@fduwjj
Copy link
Contributor Author

fduwjj commented Jan 2, 2024

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 2, 2024
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@facebook-github-bot facebook-github-bot deleted the gh/fduwjj/132/head branch January 6, 2024 15:22
wconstab pushed a commit that referenced this pull request Jan 26, 2024
…efix (#116520)

Somehow the logprefix only have ProcessGroup 0 rank [global rank]. This does not give the expected result as per the comment says "a prefix that is unique to this process group and rank". So this PR fix it and make it different for different subPGs.

The reason is that we set the prefix static which is shared across all NCCLPG instances and whoever calls this function first will set `rank_` and `uid_` to the prefix. We always initialize PG 0 first that's why we always see PG[0] + global ranks for all subPGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

Pull Request resolved: #116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
wconstab pushed a commit that referenced this pull request Jan 27, 2024
…efix (#116520)

Somehow the logprefix only have ProcessGroup 0 rank [global rank]. This does not give the expected result as per the comment says "a prefix that is unique to this process group and rank". So this PR fix it and make it different for different subPGs.

The reason is that we set the prefix static which is shared across all NCCLPG instances and whoever calls this function first will set `rank_` and `uid_` to the prefix. We always initialize PG 0 first that's why we always see PG[0] + global ranks for all subPGs.

<img width="484" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/7fbb0226-7e25-4306-9cee-22e17b00bc8e">

Pull Request resolved: #116520
Approved by: https://github.com/wconstab
ghstack dependencies: #116218
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/trunk Trigger trunk jobs on your pull request Merged oncall: distributed Add this issue/PR to distributed oncall triage queue release notes: distributed (c10d) release notes category

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants