Back out "Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"" #66393
…ummy NCCL comms in constructor""

Third try! Fixes:

- test_nccl_timeout can be flaky because of the 1s timeout; bump up the timeout to resolve the flakiness. In general we should not have been relying on time.sleep for this test; filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests not to be run. This has since been fixed.

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)

[ghstack-poisoned]
#include "c10d/ProcessGroup.hpp"
#include "c10d/Types.hpp"
Should these be
#include <c10d/ProcessGroup.hpp>
#include <c10d/Types.hpp>
@@ -19,7 +22,7 @@ using c10d::ProcessGroup;

class NCCLTestBase {
 public:
  NCCLTestBase(const std::string& path) : path_(path) {}
  NCCLTestBase(const std::string& path, const std::chrono::milliseconds pgTimeout = kProcessGroupDefaultTimeout) : path_(path), pgTimeout_(pgTimeout) {}
nit: break long line
// Catch error relating to health check failure
bool error_caught = false;
try {
  test.initialize(timeout ? 0 : -1, worldSize);
curious, why are we using the timeout to control the rank?
`timeout` is a bool in this case and controls whether we should test the timeout error path or the actual exception error path. So if `timeout` is true, we run all threads as rank 0, which results in a timeout when communicators are being initialized. On the other hand, to simulate an exception (`timeout` is false), rank=-1 will result in an error in `getDeviceIdx`, which we can test here.
  test.initialize(timeout ? 0 : -1, worldSize);
} catch (const std::exception &e) {
  std::string errMsg = e.what();
  const std::string kTimeoutErr = "Failed to initialize NCCL communicator on rank";
Is this error raised because multiple threads try to create a communicator on the same device?
In this case yes, but in general it is raised by ProcessGroupNCCL whenever communicators cannot be initialized for a variety of reasons (timeout, hang, etc).
@@ -13,6 +14,7 @@

#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/core/DeviceType.h>
nit: sort includes?
at::Device getDeviceForRank(int rank) {
  TORCH_CHECK(rank >= 0, "Invalid rank ", rank);
  auto numGPUs = at::cuda::getNumGPUs();
  int16_t deviceIdx = static_cast<int16_t>(rank % numGPUs);
hmm, is this always guaranteed to be correct? What if each machine has 8 GPUs, but I only use the first 4 for training? Which device will `ProcessGroupNCCL` use before this change?
For non-barrier collectives, pg nccl will look at the device the tensor is on to pick the device to use. For barrier, we currently use this logic. `barrier` also takes in a `device_ids` argument that can specify which device to use.

> What if each machine has 8 GPUs, but I only use the first 4 for training

In this case, the health check won't be completely accurate if the setup is something like 2 nodes, 8 GPUs each, but only 4 on each node used. It will probably try to create communicators to connect just the first 8 GPUs on machine 0. This still serves some purpose because it checks that GPUs are healthy and that communicators can be created. We can fix this if there is demand, but likely the only way is to have the user call `torch.cuda.set_device()` and respect that if it is set.
@@ -1,4 +1,5 @@
#include <c10d/ProcessGroupNCCL.hpp>
#include <c10/util/Exception.h>
nit: IIUC, this needs to be

#include <c10d/ProcessGroupNCCL.hpp>
#include <c10/util/Exception.h>

And it might make more sense to group `#include <c10/util/Exception.h>` together with the rest of the includes starting from line 15.
#include <c10/core/DeviceType.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/cuda/CUDAGuard.h>
sort?
});
// We don't need to join the thread, just need to verify health check via the
// CV. Hence we detach the thread here.
t.detach(); // NOLINT
curious, since we already wait for the signal from this thread, why do we choose to detach instead of joining thread after wait?
This pull request has been merged in 06fa6c1.
…ummy NCCL comms in constructor"" (#66393)

Summary: Pull Request resolved: #66393

Third try! Fixes:

- test_nccl_timeout can be flaky because of the 1s timeout; bump up the timeout to resolve the flakiness. In general we should not have been relying on time.sleep for this test; filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests not to be run. This has since been fixed.

ghstack-source-id: 140560113
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D31534735
fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2
Our internal nightly test started to fail after this PR landed.

NCCL_DEBUG=INFO python test/distributed/test_c10d_nccl.py -v -k test_default_store_timeout_nccl
test_default_store_timeout_nccl (__main__.TimeoutTest) ... 963fe44156bf:36898:37029 [0] NCCL INFO Bootstrap : Using eth0:192.168.99.3<0>
963fe44156bf:36898:37029 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
963fe44156bf:36898:37029 [0] NCCL INFO P2P plugin IBext
963fe44156bf:36898:37029 [0] NCCL INFO NET/IB : Using [0]mlx5_5:1/RoCE ; OOB eth0:192.168.99.3<0>
963fe44156bf:36898:37029 [0] NCCL INFO Using network IBext
NCCL version 2.11.4+cuda11.5
ERROR
======================================================================
ERROR: test_default_store_timeout_nccl (__main__.TimeoutTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2904, in wrapper
return func(*args, **kwargs)
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2243, in wrapper
return func(*args, **kwargs)
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2904, in wrapper
return func(*args, **kwargs)
File "test/distributed/test_c10d_nccl.py", line 180, in test_default_store_timeout_nccl
self._test_default_store_timeout("nccl")
File "/opt/pytorch/pytorch/test/distributed/test_c10d_common.py", line 117, in _test_default_store_timeout
raise c2p[0]
File "/opt/pytorch/pytorch/test/distributed/test_c10d_common.py", line 72, in _test_store_timeout
c10d.distributed_c10d.init_process_group(
File "/opt/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 584, in init_process_group
default_pg = _new_process_group_helper(
File "/opt/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 720, in _new_process_group_helper
pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL: Health check failure: Failed to initialize NCCL communicator on rank 0

Did you see the same failures previously in your CI?