
Distribute GPUs in round robin mode for distributed_test #46389

Closed

Conversation

Flamefire (Collaborator)

The ProcessGroupNCCL::barrier implementation assumes that when one GPU per rank is used, the GPU index equals the rank. Due to NCCL communicator reuse, rank 0 then keeps using the (effectively temporary) communicator created by the barrier, while the other processes may be assigned different GPUs; they try to create a new communicator and wait for rank 0, which only joins once it later creates a new (potentially unrelated) one.

See #46248 for details

Note that this is only a workaround! The real issue is much harder to solve and might affect more than just this test. See #46248 (comment)
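
For illustration, a minimal sketch of the round-robin rank-to-GPU layout this PR switches the test helper to. The comprehension mirrors the changed line; the surrounding variables and the print are illustrative only, not the actual helper.

```python
# Sketch: round-robin rank -> GPU assignment used as the workaround.
# Rank i gets GPUs i, i + world_size, i + 2 * world_size, ..., so the first
# GPU of every rank equals its rank, matching what ProcessGroupNCCL::barrier
# assumes.
world_size = 2
nGPUs_per_process = 2  # assuming 4 visible GPUs in total

rank_to_GPU = {
    i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
    for i in range(world_size)
}
print(rank_to_GPU)  # {0: [0, 2], 1: [1, 3]}
```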

facebook-github-bot added the `oncall: distributed` label (Add this issue/PR to distributed oncall triage queue) on Oct 15, 2020
facebook-github-bot (Contributor) commented Oct 15, 2020

💊 CI failures summary and remediations

As of commit 4ee880e (more details on the Dr. CI page):


• 1/1 failures introduced in this PR

1 failure not recognized by patterns:

| Job | Step | Action |
| --- | --- | --- |
| CircleCI docker-pytorch-linux-xenial-py3-clang5-android-ndk-r19c | Check if image should be built | 🔁 rerun |

This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 3 times.

dr-ci bot commented Oct 15, 2020

💊 CI failures summary and remediations

As of commit 4ee880e (more details on the Dr. CI page):


  • 2/3 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)
  • 1/3 broken upstream at merge base 2d6fd22 since Oct 14

🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet:


ci.pytorch.org: 2 failed


This comment was automatically generated by Dr. CI. Follow this link to opt out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 3 times.

malfet added the `triaged` label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Oct 15, 2020
```python
            visible_devices[i * nGPUs_per_process: (i + 1) * nGPUs_per_process]
        )
        # Each rank has to get the GPU with the index equal to its rank
        i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
```
Contributor

Hey @Flamefire

Would it be correct to assume that with world_size=2 and 4 GPUs in total, we have the following? Before this change, we have:

rank 0 -> gpu [0, 1]
rank 1 -> gpu [2, 3]

After this change, we have:

rank 0 -> gpu [0, 2]
rank 1 -> gpu [1, 3]
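
A quick sanity check of these numbers (sketch only; assumes `visible_devices = [0, 1, 2, 3]`):

```python
# Sketch: reproduce both layouts for world_size=2 with 4 GPUs in total.
world_size, nGPUs_per_process = 2, 2
visible_devices = [0, 1, 2, 3]

before = {i: visible_devices[i * nGPUs_per_process:(i + 1) * nGPUs_per_process]
          for i in range(world_size)}
after = {i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
         for i in range(world_size)}

print(before)  # {0: [0, 1], 1: [2, 3]}
print(after)   # {0: [0, 2], 1: [1, 3]}
```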

Contributor

Could you please help me understand why the above change fixes the problem described below? Thx.

> Due to NCCL communicator reuse, rank 0 then keeps using the (effectively temporary) communicator created by the barrier, while the other processes may be assigned different GPUs; they try to create a new communicator and wait for rank 0, which only joins once it later creates a new (potentially unrelated) one.

Flamefire (Collaborator, Author)

Yes, your assumption is correct. For an in-depth analysis see the issue, where I posted many details; here is only the summary:

During the barrier that happens very early (on creation of the process group), each process creates a communicator on the GPU whose index equals its rank:

```cpp
int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
```

The problem with the old distribution is that rank 1 (in your example) wants to use GPU 2 afterwards and hence needs a new communicator, while rank 0 wants to (continue to) use GPU 0 and hence does not need one. Because creating a communicator is a collective operation, this fails: rank 1 waits for rank 0, which never shows up.

Later, rank 0 might want to create a communicator for GPUs 0 and 1 and joins the still-waiting rank 1 in creating one, but now there is a mismatch: rank 0 is already further along and expects 4 GPU ranks in total (2 per process), while rank 1 expects only 2. This leads to a (correct) system error in the NCCL code, but the real problem occurs earlier.
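
A toy model of this (a sketch, not the actual ProcessGroupNCCL logic; the communicator cache is simplified to "first GPU used after the barrier"):

```python
# Toy model (sketch): the barrier builds each rank's communicator on GPU
# rank % num_visible_gpus. A rank only needs a *new* communicator afterwards
# if the GPU it uses next differs from that barrier GPU; creating one is a
# collective, so a rank that does not need one never joins -> hang.
def ranks_needing_new_comm(rank_to_gpu, num_visible_gpus):
    return [rank for rank, gpus in rank_to_gpu.items()
            if gpus[0] != rank % num_visible_gpus]

old_layout = {0: [0, 1], 1: [2, 3]}  # blocked assignment
new_layout = {0: [0, 2], 1: [1, 3]}  # round-robin assignment

print(ranks_needing_new_comm(old_layout, 4))  # [1] -> rank 1 waits forever for rank 0
print(ranks_needing_new_comm(new_layout, 4))  # []  -> nobody needs a new communicator
```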

facebook-github-bot (Contributor) left a comment

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot (Contributor)

Hi @Flamefire!

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

pritamdamania87 (Contributor)

@Flamefire Do we still need this PR, since the barrier() in ProcessGroup initialization has been replaced with a store-based barrier in #49930?

Flamefire (Collaborator, Author)

I just applied that (as far as possible) to PyTorch 1.7.0 and reran the test: it works.
So I guess that is fine 👍

Flamefire closed this on Jan 13, 2021
Flamefire deleted the fix_test_DistributedDataParallel branch on March 21, 2024