
Enhance new_group doc to mention using NCCL concurrently. #48872

Closed

Conversation

pritamdamania87 (Contributor) commented Dec 5, 2020

Stack from ghstack:

Using NCCL communicators concurrently is not safe, and this is documented in the NCCL docs.

However, it is not documented in PyTorch, so we should add documentation for ProcessGroupNCCL to make users aware of this limitation.

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)

@facebook-github-bot added the "cla signed" and "oncall: distributed" labels Dec 5, 2020
pritamdamania87 pushed a commit that referenced this pull request Dec 5, 2020
ghstack-source-id: 117932333
Pull Request resolved: #48872
@dr-ci

dr-ci bot commented Dec 5, 2020

💊 CI failures summary and remediations

As of commit 411ba6d (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚



rohan-varma (Member) left a comment

Thanks for clarifying this! I'm assuming "one process group at a time" means that collectives issued by one PG must be completely finished (i.e. not just enqueued, but actually executed by the GPU) before we can kick off a collective via another PG. Is it worth it to make this more explicit?
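The serialization rohan-varma describes can be sketched as follows. This is a hypothetical, minimal illustration, not code from the PR: it uses a single process and the `gloo` backend so it runs on CPU, and the function name and group layout are made up. With the `nccl` backend the same rule applies, and you would additionally synchronize the CUDA stream (e.g. `torch.cuda.synchronize()`) before issuing a collective on another group.

```python
import os
import torch
import torch.distributed as dist

def serialized_collectives(backend: str = "gloo") -> torch.Tensor:
    """Issue collectives on two process groups one at a time.

    Single-process "gloo" setup for illustration only; with "nccl",
    communicators are not safe to use concurrently, so the same
    "finish one group's collective before starting another's"
    discipline is required.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group(backend, rank=0, world_size=1)

    group_a = dist.new_group(ranks=[0])
    group_b = dist.new_group(ranks=[0])

    t = torch.ones(4)
    # Kick off a collective on group_a and wait for it to complete
    # (not just be enqueued) before touching group_b.
    work = dist.all_reduce(t, group=group_a, async_op=True)
    work.wait()  # with NCCL, also synchronize the CUDA stream here

    # Only now is it safe to use a different process group.
    dist.all_reduce(t, group=group_b)

    dist.destroy_process_group()
    return t
```

With `world_size=1` each all-reduce is an identity, so the returned tensor is unchanged; the point of the sketch is the ordering, not the arithmetic.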

@codecov

codecov bot commented Dec 5, 2020

Codecov Report

Merging #48872 (411ba6d) into gh/pritamdamania87/189/base (c29f516) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@                       Coverage Diff                       @@
##           gh/pritamdamania87/189/base   #48872      +/-   ##
===============================================================
- Coverage                        80.74%   80.74%   -0.01%     
===============================================================
  Files                             1868     1868              
  Lines                           201644   201644              
===============================================================
- Hits                            162823   162820       -3     
- Misses                           38821    38824       +3     

pritamdamania87 pushed a commit that referenced this pull request Dec 8, 2020
Pull Request resolved: #48872

ghstack-source-id: 118060680

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)
pritamdamania87 pushed a commit that referenced this pull request Dec 9, 2020
Pull Request resolved: #48872

ghstack-source-id: 118148014

Differential Revision: [D25351778](https://our.internmc.facebook.com/intern/diff/D25351778/)
facebook-github-bot (Contributor) commented

This pull request has been merged in 7584161.

@facebook-github-bot facebook-github-bot deleted the gh/pritamdamania87/189/head branch December 13, 2020 15:17
Labels
cla signed, Merged, oncall: distributed