Distribute GPUs in round robin mode for distributed_test #46389
Conversation
The ProcessGroupNCCL::barrier implementation assumes that, when one GPU per rank is used, the GPU index equals the rank. Due to NCCL communicator reuse, rank 0 then keeps using the (somewhat) temporary communicator, while the other ranks may be assigned other GPUs and therefore try to create a new communicator, waiting for rank 0 until it creates a new (potentially unrelated) one. See pytorch#46248 for details.
💊 CI failures summary and remediations
As of commit 4ee880e (more details on the Dr. CI page):
1 failure not recognized by patterns:
💊 CI failures summary and remediations
As of commit 4ee880e (more details on the Dr. CI page):
🚧 1 ongoing upstream failure:
These were probably caused by upstream breakages that are not fixed yet: ci.pytorch.org: 2 failed
visible_devices[i * nGPUs_per_process: (i + 1) * nGPUs_per_process]
)
# Each rank has to get the GPU with the index equal to its rank
i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
Hey @Flamefire
Would it be correct to assume that with world_size=2 and 4 GPUs in total we have the following? Before this change, we have:
rank 0 -> gpu [0, 1]
rank 1 -> gpu [2, 3]
After this change, we have:
rank 0 -> gpu [0, 2]
rank 1 -> gpu [1, 3]
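Concretely, that mapping can be reproduced with a small, self-contained sketch (plain Python; `visible_devices`, `world_size`, and `nGPUs_per_process` mirror the names in the diff above, but the script itself is illustrative and not part of distributed_test):

```python
# Minimal sketch of the two GPU-to-rank mappings for world_size=2 and 4 GPUs.
world_size = 2
nGPUs_per_process = 2
visible_devices = list(range(world_size * nGPUs_per_process))  # [0, 1, 2, 3]

# Old (block) distribution: each rank takes a contiguous slice of the devices.
block = {
    i: visible_devices[i * nGPUs_per_process:(i + 1) * nGPUs_per_process]
    for i in range(world_size)
}

# New (round-robin) distribution: rank i takes GPUs i, i + world_size, ...
round_robin = {
    i: [i + gpu_num * world_size for gpu_num in range(nGPUs_per_process)]
    for i in range(world_size)
}

print(block)        # {0: [0, 1], 1: [2, 3]}
print(round_robin)  # {0: [0, 2], 1: [1, 3]}
```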
Could you please help me understand why the above change fixes the problem described below? Thx.
Due to NCCL communicator reuse, rank 0 then keeps using the (somewhat) temporary communicator, while the other ranks may be assigned other GPUs and therefore try to create a new communicator, waiting for rank 0 until it creates a new (potentially unrelated) one.
Yes, your assumption is correct. For an in-depth analysis see the issue, where I posted many details; here is only the summary:
During the barrier that happens very early (on creation of the process group), each process creates a communicator with the GPU index equal to its rank:
pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp
Line 1389 in 2e2fe8c
int16_t deviceIdx = static_cast<int16_t>(rank_ % numGPUs);
The problem with the old distribution is that rank 1 (in your example) wants to use GPU 2 afterwards and hence needs a new communicator, but rank 0 wants to (continue to) use GPU 0 and hence does not need one. Because creating a communicator is a collective operation, this fails: rank 1 waits for rank 0, which never joins.
Later, rank 0 might want to create a communicator for GPUs 0 and 1 and joins the still-waiting rank 1 in creating one, but now there is a mismatch: rank 0 is already further along and expects 4 GPU ranks in total (2 per process) while rank 1 expects only 2. This leads to a (correct) system error in NCCL code, but the real problem occurs earlier.
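A rough way to see the difference (plain Python, not PyTorch; the reuse rule is simplified here to "a rank can reuse the barrier communicator only if the first GPU it touches afterwards has index equal to its rank"):

```python
# Which ranks must create a new NCCL communicator right after the barrier?
# The barrier cached a communicator on GPU index == rank for every rank;
# under the simplified rule above, a rank reuses it only if its first GPU
# afterwards is that same device.
def ranks_needing_new_comm(mapping):
    return {rank for rank, gpus in mapping.items() if gpus[0] != rank}

block = {0: [0, 1], 1: [2, 3]}        # old (block) distribution
round_robin = {0: [0, 2], 1: [1, 3]}  # new (round-robin) distribution

# Old mapping: only rank 1 enters the collective communicator creation,
# so it waits for rank 0 forever -> hang.
print(ranks_needing_new_comm(block))        # {1}
# New mapping: no rank needs a new communicator right away -> no hang.
print(ranks_needing_new_comm(round_robin))  # set()
```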
@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Hi @Flamefire! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
@Flamefire Do we still need this PR, since the barrier() in ProcessGroup initialization has been replaced with a store-based barrier in #49930?
I just applied that (as far as possible) to PyTorch 1.7.0 and reran the test: working.
Note that this is only a workaround! The real issue is much harder to solve and might affect more code. See #46248 (comment)