
Enhance new_group doc to mention using NCCL concurrently. #48872

Closed

Conversation

pritamdamania87 (Contributor) commented Dec 5, 2020

Stack from ghstack:

Using NCCL communicators concurrently is not safe, and this is documented in the NCCL docs.

However, it is not documented in PyTorch, so we should add documentation for ProcessGroupNCCL to make users aware of this limitation.

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)

@facebook-github-bot added the "cla signed" and "oncall: distributed" labels Dec 5, 2020
pritamdamania87 pushed a commit that referenced this pull request Dec 5, 2020
ghstack-source-id: 117932333
Pull Request resolved: #48872
@dr-ci

dr-ci bot commented Dec 5, 2020

💊 CI failures summary and remediations

As of commit 411ba6d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



rohan-varma (Member) left a comment

Thanks for clarifying this! I'm assuming "one process group at a time" means that collectives issued by one PG must be completely finished (i.e. not just enqueued, but actually executed by the GPU) before we can kick off a collective via another PG. Is it worth it to make this more explicit?
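The serialization rohan-varma describes can be sketched as follows. This is a hypothetical, minimal illustration, not code from the PR: it uses a single process and the `gloo` backend so it runs on CPU, and the function name and group layout are made up. With the `nccl` backend the same rule applies, and you would additionally synchronize the CUDA stream (e.g. `torch.cuda.synchronize()`) before issuing a collective on another group.

```python
import os
import torch
import torch.distributed as dist

def serialized_collectives(backend: str = "gloo") -> torch.Tensor:
    """Issue collectives on two process groups one at a time.

    Single-process "gloo" setup for illustration only; with "nccl",
    communicators are not safe to use concurrently, so the same
    "finish one group's collective before starting another's"
    discipline is required.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend, rank=0, world_size=1)

    group_a = dist.new_group(ranks=[0])
    group_b = dist.new_group(ranks=[0])

    t = torch.ones(4)
    # Kick off a collective on group_a and wait for it to complete
    # (not just be enqueued) before touching group_b.
    work = dist.all_reduce(t, group=group_a, async_op=True)
    work.wait()  # with NCCL, also synchronize the CUDA stream here

    # Only now is it safe to use a different process group.
    dist.all_reduce(t, group=group_b)

    dist.destroy_process_group()
    return t
```

With `world_size=1` each all-reduce is an identity, so the returned tensor is unchanged; the point of the sketch is the ordering, not the arithmetic.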

@codecov

codecov bot commented Dec 5, 2020

Codecov Report

Merging #48872 (411ba6d) into gh/pritamdamania87/189/base (c29f516) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/189/base   #48872      +/-   ##
===============================================================
- Coverage                        80.74%   80.74%   -0.01%     
===============================================================
  Files                             1868     1868              
  Lines                           201644   201644              
===============================================================
- Hits                            162823   162820       -3     
- Misses                           38821    38824       +3     

pritamdamania87 pushed a commit that referenced this pull request Dec 8, 2020
Pull Request resolved: #48872

ghstack-source-id: 118060680

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)
pritamdamania87 pushed a commit that referenced this pull request Dec 9, 2020
Pull Request resolved: #48872

ghstack-source-id: 118148014

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)
facebook-github-bot (Contributor) commented

This pull request has been merged in 7584161.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/189/head branch December 13, 2020 15:17
Labels
cla signed, Merged, oncall: distributed