Distribute GPUs in round robin mode for distributed_test #46389
Conversation
The ProcessGroupNCCL::barrier implementation assumes that, when one GPU per rank is used, the GPU index equals the rank. Due to NCCL communicator reuse, rank 0 then keeps using the (somewhat) temporary communicator, while the other ranks may be assigned other GPUs and therefore try to create a new communicator, waiting for rank 0 until it creates a new (potentially unrelated) one. See pytorch#46248 for details.
💊 CI failures summary and remediations
As of commit 4ee880e (more details on the Dr. CI page):
1 failure not recognized by patterns:
💊 CI failures summary and remediations
As of commit 4ee880e (more details on the Dr. CI page):
🚧 1 ongoing upstream failure:
These were probably caused by upstream breakages that are not fixed yet: ci.pytorch.org: 2 failed
visible_devices[i * nGPUs_per_process: (i + 1) * nGPUs_per_process]
)
# Each rank has to get the GPU with the index equal to its rank
i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
Hey @Flamefire
Would it be correct to assume that with world_size=2 and 4 GPUs in total we have the following? Before this change, we have:
rank 0 -> gpu [0, 1]
rank 1 -> gpu [2, 3]
After this change, we have:
rank 0 -> gpu [0, 2]
rank 1 -> gpu [1, 3]
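Concretely, that mapping can be reproduced with a small, self-contained sketch (plain Python; `visible_devices`, `world_size`, and `nGPUs_per_process` mirror the names in the diff above, but the script itself is illustrative and not part of distributed_test):

```python
# Minimal sketch of the two GPU-to-rank mappings for world_size=2 and 4 GPUs.
world_size = 2
nGPUs_per_process = 2
visible_devices = list(range(world_size * nGPUs_per_process))  # [0, 1, 2, 3]

# Old (block) distribution: each rank takes a contiguous slice of the devices.
block = {
    i: visible_devices[i * nGPUs_per_process:(i + 1) * nGPUs_per_process]
    for i in range(world_size)
}

# New (round-robin) distribution: rank i takes GPUs i, i + world_size, ...
round_robin = {
    i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
    for i in range(world_size)
}

print(block)        # {0: [0, 1], 1: [2, 3]}
print(round_robin)  # {0: [0, 2], 1: [1, 3]}
```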
Could you please help me understand why the above change fixes the problem described below? Thx.
Due to NCCL communicator reuse, rank 0 then keeps using the (somewhat) temporary communicator, while the other ranks may be assigned other GPUs and therefore try to create a new communicator, waiting for rank 0 until it creates a new (potentially unrelated) one.
Yes, your assumption is correct. For an in-depth analysis see the issue, where I posted many details; here is only the summary:
During the barrier that happens very early (on creation of the process group), each process creates a communicator with the GPU index equal to its rank:
pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp
Line 1389 in 2e2fe8c
int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
The problem with the old distribution is that rank 1 (in your example) wants to use GPU 2 afterwards and hence needs a new communicator, but rank 0 wants to (continue to) use GPU 0 and hence does not need one. Because creating a communicator is a collective operation, this fails: rank 1 waits for rank 0, which never joins.
Later, rank 0 might want to create a communicator for GPUs 0 and 1 and joins the still-waiting rank 1 in creating one, but now there is a mismatch: rank 0 is already further along and expects 4 GPU ranks in total (2 per process) while rank 1 expects only 2. This leads to a (correct) system error in NCCL code, but the real problem occurs earlier.
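A rough way to see the difference (plain Python, not PyTorch; the reuse rule is simplified here to "a rank can reuse the barrier communicator only if the first GPU it touches afterwards has index equal to its rank"):

```python
# Which ranks must create a new NCCL communicator right after the barrier?
# The barrier cached a communicator on GPU index == rank for every rank;
# under the simplified rule above, a rank reuses it only if its first GPU
# afterwards is that same device.
def ranks_needing_new_comm(mapping):
    return {rank for rank, gpus in mapping.items() if gpus[0] != rank}

block = {0: [0, 1], 1: [2, 3]}        # old (block) distribution
round_robin = {0: [0, 2], 1: [1, 3]}  # new (round-robin) distribution

# Old mapping: only rank 1 enters the collective communicator creation,
# so it waits for rank 0 forever -> hang.
print(ranks_needing_new_comm(block))        # {1}
# New mapping: no rank needs a new communicator right away -> no hang.
print(ranks_needing_new_comm(round_robin))  # set()
```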
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hi @Flamefire! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
@Flamefire Do we still need this PR, since the barrier() in ProcessGroup initialization has been replaced with a store-based barrier in #49930?
I just applied that (as far as possible) to PyTorch 1.7.0 and reran the test: working.
Note that this is only a workaround! The real issue is much harder to solve and might affect more code. See #46248 (comment)