Back out "Revert D31299350: Back out "Revert D31005792: [NCCL] Init dummy NCCL comms in constructor"" #66393
…ummy NCCL comms in constructor""

Third try! Fixes:

- test_nccl_timeout can be flaky because of the 1s timeout; bump up the timeout to resolve the flakiness. In general we should not have been relying on time.sleep for this test; filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests not to be run. This has since been fixed.

Differential Revision: [D31534735](https://our.internmc.facebook.com/intern/diff/D31534735/)

[ghstack-poisoned]
#include "c10d/ProcessGroup.hpp"
#include "c10d/Types.hpp"
Should these be
#include <c10d/ProcessGroup.hpp>
#include <c10d/Types.hpp>
@@ -19,7 +22,7 @@ using c10d::ProcessGroup;

class NCCLTestBase {
 public:
  NCCLTestBase(const std::string& path) : path_(path) {}
  NCCLTestBase(const std::string& path, const std::chrono::milliseconds pgTimeout = kProcessGroupDefaultTimeout) : path_(path), pgTimeout_(pgTimeout) {}
nit: break long line
// Catch error relating to health check failure
bool error_caught = false;
try {
  test.initialize(timeout ? 0 : -1, worldSize);
curious, why are we using the timeout to control the rank?
`timeout` is a bool in this case and controls whether we should test the timeout error path or the actual exception error path. So if `timeout` is true, we run all threads as rank 0, which results in a timeout when communicators are being initialized. On the other hand, to simulate an exception (`timeout` is false), rank=-1 will result in an error in `getDeviceIdx`, which we can test here.
  test.initialize(timeout ? 0 : -1, worldSize);
} catch (const std::exception &e) {
  std::string errMsg = e.what();
  const std::string kTimeoutErr = "Failed to initialize NCCL communicator on rank";
Is this error raised because multiple threads try to create a communicator on the same device?
In this case yes, but in general it is raised by ProcessGroupNCCL whenever communicators cannot be initialized for a variety of reasons (timeout, hang, etc).
@@ -13,6 +14,7 @@

#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/core/DeviceType.h>
nit: sort includes?
at::Device getDeviceForRank(int rank) {
  TORCH_CHECK(rank >= 0, "Invalid rank ", rank);
  auto numGPUs = at::cuda::getNumGPUs();
  int16_t deviceIdx = static_cast<int16_t>(rank % numGPUs);
hmm, is this always guaranteed to be correct? What if each machine has 8 GPUs, but I only use the first 4 for training? Which device will `ProcessGroupNCCL` use before this change?
For non-barrier collectives, pg nccl will look at the device the tensor is on to pick the device to use. For barrier, we currently use this logic. `barrier` also takes in a `device_ids` argument that can specify which device to use.

> What if each machine has 8 GPUs, but I only use the first 4 for training

In this case, the health check won't be completely accurate if the setup is something like 2 nodes, 8 GPUs each, but only 4 on each node used. It will probably try to create communicators to connect just the first 8 GPUs on machine 0. This still serves some purpose because it checks that GPUs are healthy and that communicators can be created. We can fix this if there is demand, but likely the only way is to have the user call `torch.cuda.set_device()` and respect that if it is set.
@@ -1,4 +1,5 @@
#include <c10d/ProcessGroupNCCL.hpp>
#include <c10/util/Exception.h>
nit: IIUC, this needs to be

#include <c10d/ProcessGroupNCCL.hpp>
#include <c10/util/Exception.h>

And it might make more sense to group `#include <c10/util/Exception.h>` together with the rest of the includes starting from line 15.
#include <c10/core/DeviceType.h>
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGraphsC10Utils.h>
#include <c10/cuda/CUDAGuard.h>
sort?
});
// We don't need to join the thread, just need to verify health check via the
// CV. Hence we detach the thread here.
t.detach(); // NOLINT
curious, since we already wait for the signal from this thread, why do we choose to detach instead of joining thread after wait?
This pull request has been merged in 06fa6c1.
…ummy NCCL comms in constructor"" (#66393)

Summary: Pull Request resolved: #66393

Third try! Fixes:

- test_nccl_timeout can be flaky because of the 1s timeout; bump up the timeout to resolve the flakiness. In general we should not have been relying on time.sleep for this test; filed #66354 to track that.
- ciflow/all did not actually run tests due to a bug causing multigpu tests not to be run. This has since been fixed.

ghstack-source-id: 140560113
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D31534735
fbshipit-source-id: 8b7e0f4fed3972b7a77cbcda28876c9eefb0c7e2
Our internal nightly test started to fail after this PR landed.

NCCL_DEBUG=INFO python test/distributed/test_c10d_nccl.py -v -k test_default_store_timeout_nccl
test_default_store_timeout_nccl (__main__.TimeoutTest) ... 963fe44156bf:36898:37029 [0] NCCL INFO Bootstrap : Using eth0:192.168.99.3<0>
963fe44156bf:36898:37029 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
963fe44156bf:36898:37029 [0] NCCL INFO P2P plugin IBext
963fe44156bf:36898:37029 [0] NCCL INFO NET/IB : Using [0]mlx5_5:1/RoCE ; OOB eth0:192.168.99.3<0>
963fe44156bf:36898:37029 [0] NCCL INFO Using network IBext
NCCL version 2.11.4+cuda11.5
ERROR
======================================================================
ERROR: test_default_store_timeout_nccl (__main__.TimeoutTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2904, in wrapper
return func(*args, **kwargs)
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2243, in wrapper
return func(*args, **kwargs)
File "/opt/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 2904, in wrapper
return func(*args, **kwargs)
File "test/distributed/test_c10d_nccl.py", line 180, in test_default_store_timeout_nccl
self._test_default_store_timeout("nccl")
File "/opt/pytorch/pytorch/test/distributed/test_c10d_common.py", line 117, in _test_default_store_timeout
raise c2p[0]
File "/opt/pytorch/pytorch/test/distributed/test_c10d_common.py", line 72, in _test_store_timeout
c10d.distributed_c10d.init_process_group(
File "/opt/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 584, in init_process_group
default_pg = _new_process_group_helper(
File "/opt/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 720, in _new_process_group_helper
pg = ProcessGroupNCCL(prefix_store, rank, world_size, pg_options)
RuntimeError: ProcessGroupNCCL: Health check failure: Failed to initialize NCCL communicator on rank 0

Did you see the same failures previously in your CI?