Possible deadlock in dist.init_process_group #9696
I'm observing hangs in `dist.init_process_group`. This happens reliably (100% of the time for me) when launching PyTorch distributed training runs on AWS using the official DLAMI. It goes away when I pre-warm the volume. This workaround makes PyTorch startup much faster, hence I suspect the failure is caused by some handshake logic not being robust to variability in distributed worker timings.

Using NCCL version 2.1.15+cuda9.1 and Amazon Deep Learning AMI v11.

Looking at strace, I see some workers stuck in `recvfrom`, while others are waiting on `accept4`.
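For reference, a minimal sketch of the kind of initialization that hangs, assuming the TCP init method with a hypothetical master address, port, and environment-provided rank (the actual launch scripts are not shown in this issue):

```python
import os
import torch.distributed as dist

# Hypothetical values; in the real runs the address, port, rank, and world
# size come from the launch environment.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "2"))

# This is the call that hangs: per the thread below, the master sits in
# accept() while some workers sit in recv() waiting for an acknowledgement.
dist.init_process_group(
    backend="nccl",                      # NCCL 2.1.15+cuda9.1 per the report
    init_method="tcp://10.0.0.1:29500",  # hypothetical master address/port
    world_size=world_size,
    rank=rank,
)
```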
Comments

Could you provide a code snippet that reproduces this?
Sorry for not isolating the issue; this seems to only happen in a large-scale run and I don't have a small repro. The complicated repro is to follow the instructions in

I'll update this bug if I can isolate this further.
Hi @yaroslavvb, what kind of pre-warm did you use? Also, could you give this a try and see if it helps? https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/distributed.py#L72
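The linked line is in `torch/nn/parallel/distributed.py`, i.e. the `DistributedDataParallel` module. A minimal usage sketch, assuming the process group has already been initialized and using a hypothetical local rank:

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# Assumes dist.init_process_group(...) has already succeeded; local_rank is
# a hypothetical per-node GPU index normally supplied by the launcher.
local_rank = 0
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 10).cuda()
ddp_model = DistributedDataParallel(model, device_ids=[local_rank])
```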
My "warm-up" is to attach an existing EBS volume from a previously successful run, instead of initializing new volume from AMI. Another workflow that works is this
|
@yaroslavvb This is a bit hard for us to repro. A smaller repro will definitely help, but if it's not possible, the next time it happens could you get a full gdb bt trace of all processes & threads? For each process,
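In case it helps, one standard way to capture an all-thread backtrace with gdb non-interactively (generic gdb usage with a hypothetical PID, not necessarily the exact steps the maintainer had in mind):

```sh
# Dump backtraces of every thread in one stuck process to a file.
gdb -p <PID> -batch -ex "thread apply all bt" > bt_<PID>.txt 2>&1
```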
Here's an example of things hanging; it looks stuck in
You can test whether your master port is reachable from the worker nodes.
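For example, a quick check from a worker node with a plain socket connect, assuming a hypothetical master address and port (substitute the values actually passed to `init_process_group`):

```python
import socket

# Hypothetical master address/port; replace with the real ones. This raises
# an exception (timeout, connection refused, ...) if the port is unreachable.
with socket.create_connection(("10.0.0.1", 29500), timeout=5):
    print("master port reachable")
```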
@yaroslavvb this is likely because AWS gives you some machines in different subnets, so not all of them are reachable from each other.
I don't think that explains it, because this problem is fixed by killing the Python processes and restarting them on the same instances. I'm explicitly specifying the zone to use and launching all instances into a single placement group. To be safe, I additionally make the master port reachable from the public internet.
Hi @yaroslavvb, do you compile PyTorch from source? If so, could you add a few print statements around
and
The GDB trace indicates the master is still waiting on `accept()` while the worker is waiting to `recv()` a message from the master. Every time the master accepts a connection, it confirms by sending back the worker's address. Since it's hard to repro on our side, if you could tell us at which line the master & workers get stuck respectively, that would be very helpful as well. Thanks!
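To make the described handshake concrete, here is a simplified sketch of the pattern in plain Python sockets (an illustration of the behavior described above, not PyTorch's actual rendezvous code). The master blocks in accept() and acknowledges each worker by sending back its address; each worker blocks in recv() until that acknowledgement arrives, matching the accept4/recvfrom pair seen in strace:

```python
import socket

def master(port=29500, world_size=2):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("", port))
    srv.listen(world_size)
    for _ in range(world_size - 1):
        conn, addr = srv.accept()          # master blocks here (accept4 in strace)
        conn.sendall(repr(addr).encode())  # confirm by sending back the worker's address
        conn.close()
    srv.close()

def worker(master_addr, port=29500):
    with socket.create_connection((master_addr, port)) as s:
        return s.recv(1024)                # worker blocks here (recvfrom in strace)
```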
I'll close this issue for now since I have not seen this error recently; it may have been fixed in master.
How did you solve this? I am having a similar issue but am unable to figure out what it is.

It's easy to reproduce: just run the above for each

For me it's on my own single machine. Related: https://stackoverflow.com/questions/66498045/how-to-solve-dist-init-process-group-from-hanging-or-deadlocks