Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1.7 patch] Disallow creation of ProcessGroupNCCL without GPUs. (#45… #46073

Merged
merged 1 commit into from
Oct 12, 2020

Conversation

pritamdamania87
Copy link
Contributor

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls barrier() this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like barrier() the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc

)

Summary:

Note: This PR has been merged into master at b5a2f04 after the 1.7 branch cut
(see original PR: #45642). This PR is to merge it into the 1.7 branch.

---- Original Commit Description Follows ---

Pull Request resolved: #45642

Prior to #45181, initializing a
NCCL process group would work even if no GPUs were present. Although, now since
init_process_group calls `barrier()` this would fail.

In general the problem was that we could initialize ProcessGroupNCCL without
GPUs and then if we called a method like `barrier()` the process would crash
since we do % numGPUs resulting in division by zero.
ghstack-source-id: 113490343

Test Plan: waitforbuildbot

Reviewed By: osalpekar

Differential Revision: D24038839

fbshipit-source-id: a1f1db52cabcfb83e06c1a11ae9744afbf03f8dc
@codecov
Copy link

codecov bot commented Oct 9, 2020

Codecov Report

Merging #46073 into release/1.7 will decrease coverage by 0.00%.
The diff coverage is 62.50%.

Impacted file tree graph

@@               Coverage Diff               @@
##           release/1.7   #46073      +/-   ##
===============================================
- Coverage        68.46%   68.46%   -0.01%     
===============================================
  Files              406      406              
  Lines            52267    52275       +8     
===============================================
+ Hits             35785    35790       +5     
- Misses           16482    16485       +3     
Impacted Files Coverage Δ
torch/testing/_internal/common_distributed.py 67.36% <62.50%> (-0.17%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 653d766...9af59ca. Read the comment docs.

@malfet malfet merged commit 305515d into release/1.7 Oct 12, 2020
@facebook-github-bot facebook-github-bot deleted the nccl_gpu_1.7_cherry_pick branch January 27, 2021 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants