Back out "Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"" #66393

Closed
wants to merge 4 commits

Conversation

rohan-varma (Member) commented Oct 11, 2021

Stack from ghstack:

Third try!

Fixes:

  • test_nccl_timeout can be flaky because of its 1s timeout; bump the timeout to resolve the flakiness. In general, though, this test should not rely on time.sleep(); filed test_nccl_timeout should not rely on time.sleep() (#66354) to track that.
  • ciflow/all did not actually run the multigpu tests due to a bug; this has since been fixed.

Differential Revision: D31534735

…ummy NCCL comms in constructor""

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)

[ghstack-poisoned]
pytorch-probot bot commented Oct 11, 2021

CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/cf39c7b8461c97bf064bbd8e6e3c0ab4364ac822/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default,ciflow/all

| Workflows | Labels (bold enabled) | Status |
| --- | --- | --- |
| **Triggered Workflows** | | |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | ✅ triggered |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | ✅ triggered |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | ✅ triggered |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-vulkan-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers | ✅ triggered |
| linux-xenial-py3.6-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | ✅ triggered |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | ✅ triggered |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | ✅ triggered |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | ✅ triggered |
| puretorch-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

facebook-github-bot (Contributor) commented Oct 11, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit cf39c7b (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

…CCL] Init dummy NCCL comms in constructor"""

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 11, 2021
…ummy NCCL comms in constructor""

Pull Request resolved: #66393

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.
ghstack-source-id: 140210584

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)
Comment on lines 8 to 9
#include "c10d/ProcessGroup.hpp"
#include "c10d/Types.hpp"
Contributor

Should these be

#include <c10d/ProcessGroup.hpp>
#include <c10d/Types.hpp>

@@ -19,7 +22,7 @@ using c10d::ProcessGroup;

class NCCLTestBase {
public:
NCCLTestBase(const std::string& path) : path_(path) {}
NCCLTestBase(const std::string& path, const std::chrono::milliseconds pgTimeout = kProcessGroupDefaultTimeout) : path_(path), pgTimeout_(pgTimeout) {}
Contributor

nit: break long line

// Catch error relating to health check failure
bool error_caught = false;
try {
test.initialize(timeout ? 0 : -1, worldSize);
Contributor

curious, why are we using the timeout to control the rank?

Member Author

timeout is a bool in this case and controls whether we test the timeout error path or the actual exception error path.

So if timeout is true, we run all threads as rank 0, which results in a timeout when the communicators are being initialized.

On the other hand, to simulate an exception (timeout is false), rank = -1 results in an error in getDeviceIdx, which we can test here.
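To make that control flow concrete, here is a minimal, self-contained sketch. The FakeNCCLTest type, its error strings, and runScenario are illustrative assumptions, not the actual NCCLTestBase fixture from this diff; only the `timeout ? 0 : -1` selection mirrors the test code.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Stand-in for the real test fixture: initialize(rank, worldSize) is assumed
// to throw whenever communicators cannot be set up, mirroring the two error
// paths described above.
struct FakeNCCLTest {
  void initialize(int rank, int /*worldSize*/) {
    if (rank < 0) {
      // rank = -1 exercises the getDeviceIdx error path.
      throw std::runtime_error("Invalid rank " + std::to_string(rank));
    }
    // Every thread claiming rank 0 would stall communicator init; the real
    // test surfaces that as a health-check timeout with a message like this.
    throw std::runtime_error(
        "Failed to initialize NCCL communicator on rank " +
        std::to_string(rank));
  }
};

// timeout == true  -> all ranks are 0, exercising the timeout path.
// timeout == false -> rank -1 exercises the exception path.
bool runScenario(bool timeout, int worldSize) {
  FakeNCCLTest test;
  bool errorCaught = false;
  try {
    test.initialize(timeout ? 0 : -1, worldSize);
  } catch (const std::exception& e) {
    errorCaught = true;
    std::cout << "caught: " << e.what() << "\n";
  }
  return errorCaught;
}

int main() {
  // Both scenarios are expected to surface an error from initialize().
  std::cout << std::boolalpha << runScenario(/*timeout=*/true, 4) << "\n"
            << runScenario(/*timeout=*/false, 4) << "\n";
}
```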

test.initialize(timeout ? 0 : -1, worldSize);
} catch (const std::exception &e) {
std::string errMsg = e.what();
const std::string kTimeoutErr = "Failed to initialize NCCL communicator on rank";
Contributor

Is this error raised because multiple threads are trying to create a communicator on the same device?

Member Author

In this case, yes, but in general it is raised by ProcessGroupNCCL whenever communicators cannot be initialized, for a variety of reasons (timeout, hang, etc.).

@@ -13,6 +14,7 @@

#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/core/DeviceType.h>
Contributor

nit: sort includes?

at::Device getDeviceForRank(int rank) {
TORCH_CHECK(rank >= 0, "Invalid rank ", rank);
auto numGPUs = at::cuda::getNumGPUs();
int16_t deviceIdx = static_cast<int16_t>(rank % numGPUs);
Contributor

hmm, is this always guaranteed to be correct? What if each machine has 8 GPUs, but I only use the first 4 for training? Which device will ProcessGroupNCCL use before this change?

Member Author

For non-barrier collectives, pg nccl will look at the device the tensor is on to pick the device to use.

For barrier, we currently use this logic. barrier also takes a device_ids argument that can specify which device to use.

> What if each machine has 8 GPUs, but I only use the first 4 for training

In this case, the health check won't be completely accurate if the setup is something like 2 nodes with 8 GPUs each but only 4 used on each node. It will probably try to create communicators to connect just the first 8 GPUs on machine 0. This still serves some purpose because it checks that the GPUs are healthy and communicators can be created. We can fix this if there is demand, but likely the only way is to have the user call torch.cuda.set_device() and respect that if it is set.
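As an aside, here is a minimal sketch of the rank-to-device mapping being discussed. Only the `rank % numGPUs` expression comes from the diff; deviceForRank, the 8-GPU assumption, and the 2-node example are illustrative assumptions.

```cpp
#include <cstdint>
#include <iostream>

// Mirrors the rank -> device mapping from the diff: wrap the global rank onto
// the local GPU count.
int16_t deviceForRank(int rank, int numGPUs) {
  return static_cast<int16_t>(rank % numGPUs);
}

int main() {
  const int numGPUs = 8; // assumed GPUs per machine
  // Suppose 2 nodes with 8 GPUs each, but only 4 ranks per node are used
  // (global ranks 0-3 on node 0, ranks 4-7 on node 1).
  for (int rank = 0; rank < 8; ++rank) {
    std::cout << "rank " << rank << " -> cuda:" << deviceForRank(rank, numGPUs)
              << "\n";
  }
  // Ranks 4-7 (node 1) map to devices 4-7 even though that node's job may
  // only intend to use devices 0-3, which is why the health check can probe
  // GPUs the training job never touches unless a user-selected device (e.g.
  // via torch.cuda.set_device()) is respected.
}
```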

…CCL] Init dummy NCCL comms in constructor"""

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 13, 2021
…ummy NCCL comms in constructor""

Pull Request resolved: #66393

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.
ghstack-source-id: 140425736

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)
@@ -1,4 +1,5 @@
#include <c10d/ProcessGroupNCCL.hpp>
#include <c10/util/Exception.h>
Contributor

nit: IIUC, this needs to be

#include <c10d/ProcessGroupNCCL.hpp>

#include <c10/util/Exception.h>

And it might make more sense to group #include <c10/util/Exception.h> together with the rest of imports starting from line 15.

Comment on lines 15 to 18
#include <c10/core/DeviceType.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/cuda/CUDAGuard.h>
Contributor

sort?

});
// We don't need to join the thread, just need to verify health check via the
// CV. Hence we detach the thread here.
t.detach(); // NOLINT
Contributor

curious, since we already wait for the signal from this thread, why do we detach instead of joining the thread after the wait?
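For context, here is a minimal sketch of the "wait on a condition variable with a timeout, then detach" pattern that the code comment above describes. healthCheckWithTimeout and its State struct are illustrative assumptions, not the actual ProcessGroupNCCL implementation; the point is that joining could block on a hung communicator init, while detaching lets the caller proceed as soon as the signal arrives (or the wait times out).

```cpp
#include <chrono>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <thread>

bool healthCheckWithTimeout(std::chrono::milliseconds timeout) {
  // Shared state is heap-allocated so it outlives this function even if the
  // detached worker is still running (e.g. a hung communicator init).
  struct State {
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;
  };
  auto state = std::make_shared<State>();

  std::thread worker([state] {
    // ... would create and destroy dummy NCCL communicators here ...
    {
      std::lock_guard<std::mutex> lk(state->mtx);
      state->done = true;
    }
    state->cv.notify_one();
  });

  std::unique_lock<std::mutex> lk(state->mtx);
  // Wait for the worker's signal, but give up after `timeout` so a hung init
  // does not block the caller forever.
  bool ok = state->cv.wait_for(lk, timeout, [&] { return state->done; });

  // We only need the signal, not the thread's result; joining could block on
  // a hung init, so the thread is detached instead.
  worker.detach();
  return ok;
}
```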

…CCL] Init dummy NCCL comms in constructor"""

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)

[ghstack-poisoned]
rohan-varma added a commit that referenced this pull request Oct 14, 2021
…ummy NCCL comms in constructor""

Pull Request resolved: #66393

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.
ghstack-source-id: 140560113

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)
facebook-github-bot (Contributor)
This pull request has been merged in 06fa6c1.

@facebook-github-bot facebook-github-bot deleted the gh/rohan-varma/426/head branch October 18, 2021 14:44
wconstab pushed a commit that referenced this pull request Oct 20, 2021
…ummy NCCL comms in constructor"" (#66393)

Summary:
Pull Request resolved: #66393

Third try!

Fixes:
- test_nccl_timeout can be flaky because of 1s timeout, bump up the timeout to resolve the flakiness. But in general we should not have been relying on time.sleep for this test, filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests to not be run. This has since been fixed.
ghstack-source-id: 140560113

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D31534735

fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2
ptrblck (Collaborator) commented Oct 25, 2021

Our internal nightly test started to fail after this PR landed.
The error we are seeing is:

NCCL_DEBUG=INFO python test/distributed/test_c10d_nccl.py -v -k test_default_store_timeout_nccl
test_default_store_timeout_nccl (__main__.TimeoutTest) ... 963fe44156bf:36898:37029 [0] NCCL INFO Bootstrap : Using eth0:192.168.99.3<0>
963fe44156bf:36898:37029 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
963fe44156bf:36898:37029 [0] NCCL INFO P2P plugin IBext
963fe44156bf:36898:37029 [0] NCCL INFO NET/IB : Using [0]mlx5_5:1/RoCE ; OOB eth0:192.168.99.3<0>
963fe44156bf:36898:37029 [0] NCCL INFO Using network IBext
NCCL version 2.11.4+cuda11.5
ERROR

======================================================================
ERROR: test_default_store_timeout_nccl (__main__.TimeoutTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2904, in wrapper
    return func(*args, **kwargs)
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2243, in wrapper
    return func(*args, **kwargs)
  File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2904, in wrapper
    return func(*args, **kwargs)
  File "test/distributed/test_c10d_nccl.py", line 180, in test_default_store_timeout_nccl
    self._test_default_store_timeout("nccl")
  File "/opt/pytorch/pytorch/test/distributed/test_c10d_common.py", line 117, in _test_default_store_timeout
    raise c2p[0]
  File "/opt/pytorch/pytorch/test/distributed/test_c10d_common.py", line 72, in _test_store_timeout
    c10d.distributed_c10d.init_process_group(
  File "/opt/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 584, in init_process_group
    default_pg = _new_process_group_helper(
  File "/opt/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 720, in _new_process_group_helper
    pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL: Health check failure: Failed to initialize NCCL communicator on rank 0

Did you see the same failures previously in your CI?
