Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NCCL] Support NCCL Send/Recv #44921

Closed
wants to merge 14 commits into from

Conversation

mingzhe09088
Copy link
Contributor

@mingzhe09088 mingzhe09088 commented Sep 18, 2020

Stack from ghstack:

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.

Differential Revision: D23709848

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
@dr-ci
Copy link

dr-ci bot commented Sep 18, 2020

💊 CI failures summary and remediations

As of commit d98ef7e (more details on the Dr. CI page):


Commit d98ef7e was recently pushed. Waiting for builds...


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 53 times.

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 18, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112391583

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/testing/_internal/distributed/distributed_test.py Outdated Show resolved Hide resolved
torch/testing/_internal/distributed/distributed_test.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/testing/_internal/distributed/distributed_test.py Outdated Show resolved Hide resolved
torch/csrc/distributed/c10d/init.cpp Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/lib/c10d/ProcessGroupNCCL.hpp Outdated Show resolved Hide resolved
torch/testing/_internal/distributed/distributed_test.py Outdated Show resolved Hide resolved
torch/lib/c10d/ProcessGroupNCCL.cpp Outdated Show resolved Hide resolved
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.

Todo: add more tests. 

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
@mingzhe09088 mingzhe09088 changed the title [WIP][NCCL] Support NCCL Send/Recv [NCCL] Support NCCL Send/Recv Sep 21, 2020
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 21, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112545024

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/lib/c10d/ProcessGroupNCCL.cpp Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 23, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112722354

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23709848/)!
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 23, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112735516

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23709848/)!
Copy link
Contributor

@pritamdamania87 pritamdamania87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @jiayisuse can you take another look at the PR too?

torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 24, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112769784

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23709848/)!
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 24, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112835901

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23709848/)!
@codecov
Copy link

codecov bot commented Sep 24, 2020

Codecov Report

Merging #44921 into gh/mingzhe09088/1/base will decrease coverage by 0.15%.
The diff coverage is 22.04%.

Impacted file tree graph

@@                    Coverage Diff                     @@
##           gh/mingzhe09088/1/base   #44921      +/-   ##
==========================================================
- Coverage                   68.32%   68.16%   -0.16%     
==========================================================
  Files                         410      410              
  Lines                       52981    53165     +184     
==========================================================
+ Hits                        36199    36240      +41     
- Misses                      16782    16925     +143     
Impacted Files Coverage Δ
torch/distributed/distributed_c10d.py 27.83% <20.45%> (-0.52%) ⬇️
.../testing/_internal/distributed/distributed_test.py 29.55% <21.98%> (-0.59%) ⬇️
torch/distributed/__init__.py 88.88% <100.00%> (+1.38%) ⬆️
torch/testing/_internal/expecttest.py 78.57% <0.00%> (+1.02%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ab88c3...d98ef7e. Read the comment docs.

torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
torch/distributed/distributed_c10d.py Outdated Show resolved Hide resolved
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
mingzhe09088 pushed a commit that referenced this pull request Sep 25, 2020
Pull Request resolved: #44921

This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.
ghstack-source-id: 112935178

Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23709848/)!
This diff adds support for Process Group point-to-point operations on NCCL backend based on ncclSend/ncclRecv. See #43995 for more context.


Differential Revision: [D23709848](https://our.internmc.facebook.com/intern/diff/D23709848/)

[ghstack-poisoned]
@facebook-github-bot
Copy link
Contributor

This pull request has been merged in 59083d6.

facebook-github-bot pushed a commit that referenced this pull request Sep 24, 2021
#65601)

Summary:
Pull Request resolved: #65601

I believe this feature was supported one year ago:
#44921

#Closes: #65525
ghstack-source-id: 138918961

Test Plan: N/A

Reviewed By: pritamdamania87, mingzhe09088

Differential Revision: D31163535

fbshipit-source-id: 9321a0a5137a3e265e2b54bd78730ac28c7acd55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants