This repository has been archived by the owner on Jan 6, 2023. It is now read-only.
Elastic agent doesn't detect worker failures in NCCL #134
Comments

ruipeterpan changed the title from "Question about behavior difference on gloo and nccl" to "Elastic agent doesn't detect worker failures in NCCL" on Nov 16, 2020.
Hey there, I think I have the same trouble. Best regards,

@tchaton Unfortunately I haven't been able to resolve this issue :(

Thanks for the question. Have you tried setting

Hey @kiukchung thanks for the pointer! Setting the environment variable

Thanks again for the quick help! Closing this issue.
Context

I have been using torchelastic for a while to launch fault-tolerant jobs on CPUs using the `gloo` backend. I was switching to GPUs so that I could use `broadcast` and `reduce`. I first made the necessary modifications to move everything onto GPUs. Then, I changed the backend for group initialization from `gloo` to `nccl`, hoping things would work as before. However, with `nccl`, when some workers are killed, the remaining workers stay in the previous rendezvous and hang, whereas the elastic agent should detect the worker failure and halt all workers.

Current Behavior
When using the `nccl` backend, when a worker is killed, the remaining workers hang instead of throwing a RuntimeError during `all_reduce()` as they do when using the `gloo` backend.

The workers that are killed output this (which is expected):
However, for the remaining workers, the elastic agent doesn't declare the process group as failed. Here is the log obtained by using `export NCCL_DEBUG=INFO`:

Expected Behavior
Just like with `gloo`, after some workers are killed, the remaining workers should be able to detect a missing member during `all_reduce()` and throw a RuntimeError, so that the local_elastic_agent can mark the worker group as failed, halt the training, and wait for new workers to join the next rendezvous.

The workers that are killed should output this:

The surviving workers should output this:
More details

I use

dist.init_process_group(backend='gloo', init_method='env://')

to initialize the process group.