Enhance `new_group` doc to mention using NCCL concurrently. #48872
Conversation
Using NCCL communicators concurrently is not safe, and this is documented in the NCCL docs. However, it is not documented in PyTorch, and we should add documentation for ProcessGroupNCCL so that users are aware of this limitation.

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)
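To make the limitation concrete, here is a minimal sketch of the safe pattern, assuming a default NCCL process group is already initialized; the group ranks and tensor below are illustrative, not from this PR:

```python
import torch
import torch.distributed as dist

# Assumes the default process group was initialized with the NCCL backend,
# e.g. dist.init_process_group(backend="nccl") under torchrun.

# Every rank must call new_group for every subgroup, even ones it is not in.
group_a = dist.new_group(ranks=[0, 1])
group_b = dist.new_group(ranks=[2, 3])

# Illustrative tensor; a real script would pick the device from the local rank.
tensor = torch.ones(10, device="cuda")

# Safe: each rank uses one NCCL process group at a time.
if dist.get_rank() in (0, 1):
    dist.all_reduce(tensor, group=group_a)
else:
    dist.all_reduce(tensor, group=group_b)

# Unsafe (the case this PR documents): having the *same* rank issue
# collectives on two NCCL process groups concurrently, e.g. from separate
# threads, can deadlock because concurrent use of NCCL communicators is
# not supported.
```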
💊 CI failures summary (Dr. CI): as of commit 411ba6d, there are no failures. 💚
Thanks for clarifying this! I'm assuming "one process group at a time" means that collectives issued by one PG must be completely finished (i.e. not just enqueued, but actually executed by the GPU) before we can kick off a collective via another PG. Is it worth it to make this more explicit?
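A minimal sketch of that stricter reading (the groups and tensor are hypothetical, not from this PR): wait for a collective to actually execute on the GPU before issuing work on another process group:

```python
import torch
import torch.distributed as dist

# Hypothetical setup; assumes dist.init_process_group(backend="nccl").
group_a = dist.new_group(ranks=[0, 1])
group_b = dist.new_group(ranks=[0, 1])

t = torch.ones(4, device="cuda")

# Launch asynchronously on group_a.
work = dist.all_reduce(t, group=group_a, async_op=True)
work.wait()               # makes the current CUDA stream wait on the collective
torch.cuda.synchronize()  # ensures the GPU has actually finished executing it

# Under the "completely finished" interpretation, only now is it safe
# to kick off a collective on a different process group.
dist.all_reduce(t, group=group_b)
```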
Codecov Report
@@                 Coverage Diff                  @@
##   gh/pritamdamania87/189/base    #48872   +/-  ##
===============================================================
- Coverage      80.74%    80.74%   -0.01%
===============================================================
  Files           1868      1868
  Lines         201644    201644
===============================================================
- Hits          162823    162820       -3
- Misses         38821     38824       +3
This pull request has been merged in 7584161.
Stack from ghstack:

- #48872 Enhance `new_group` doc to mention using NCCL concurrently.

Using NCCL communicators concurrently is not safe, and this is documented in the NCCL docs. However, this is not documented in PyTorch, and we should add documentation for ProcessGroupNCCL so that users are aware of this limitation.

Differential Revision: D25351778