Torch RPC on multiple nodes with GPU returns an EOF error #95487
Comments
Here's the error in the node log:
"connection reset by peer" means that the peer is exiting due to a failure. Do you know what happened to that peer? |
I have 3 nodes (rank 0: EOF error, rank 2: connection reset). This is the stderr of rank 1:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  In globalIdxForDevice at tensorpipe/channel/cuda_ipc/context_impl.cc:102 "iter == globalUuids.end()Couldn't find GPU with UUID b951cea7-64f5-946d-a7fd-981149841eed"
srun: error: a100-st-p4d24xlarge-13: task 0: Aborted (core dumped)
```
I tried to do a bit more debugging. I also tried to rule out any submitit-specific behaviour (like unpickling the target function) by re-writing my example above without it. So TL;DR: when launching multiple jobs with RPC on different nodes with GPUs, a failure occurs in tensorpipe.
I'm able to reproduce this, and this is the stack trace of the core dump I get on rank 1.

Unfortunately it's pretty deep in the tensorpipe library, which I don't understand. I am not sure why non-CUDA RPC has a CUDA context being set up in tensorpipe.
I can reproduce the issue even with the official demo: https://github.com/pytorch/examples/tree/main/distributed/rpc/ddp_rpc

The core dump occurs at rank 0, where the other ranks try to register with / send membership updates to rank 0. Since the ranks are launched on different machines, rank 0 cannot see the GPUs on the other ranks. During registration/membership updates, tensorpipe tries to set up a CUDA IPC channel between the local devices and the devices on the newly joined rank. The CUDA IPC channel is meant to provide a faster connection between GPUs instead of transferring via the CPU. However, tensorpipe appears to create its GPU UUID list, as well as its p2p support matrix, at initialization and never updates them when a new rank joins. As a result, tensorpipe iterates over its local GPU UUID list trying to find the GPU UUID coming from a remote rank, which by definition does not exist on rank 0.

A quick fix would be to change the behavior of CUDA IPC in tensorpipe. However, the tensorpipe repository is archived. @H-Huang Could you suggest the best approach to patch the archived tensorpipe repository? Or is there any plan to retire tensorpipe in PyTorch?
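Until that is patched, one possible mitigation (untested here, and relying on the private `_transports`/`_channels` options of `TensorPipeRpcBackendOptions`, which are not a stable API) is to prevent the CUDA channels from being registered at all, so the UUID lookup above is never reached. A minimal sketch:

```python
import torch.distributed.rpc as rpc

def init_cpu_only_rpc(rank: int, world_size: int) -> None:
    """Initialize RPC while keeping tensorpipe's CUDA channels unregistered."""
    opts = rpc.TensorPipeRpcBackendOptions(
        num_worker_threads=16,
        _transports=["uv"],   # plain TCP transport (private option)
        _channels=["basic"],  # CPU-only channel; cuda_ipc is never set up
    )
    # Assumes MASTER_ADDR/MASTER_PORT are set (default init_method is "env://").
    rpc.init_rpc(
        name=f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=opts,
    )
```

The trade-off is that GPU tensors would travel over the CPU path, but for workloads that do not pass CUDA tensors through RPC (as in this issue) that should be harmless.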
### 🐛 Describe the bug
When running torch RPC on multiple nodes with submitit (through SLURM), I get an EOF error even though I'm not using the GPUs and I'm not making them available to RPC.
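For context, the jobs are launched roughly like this (a minimal sketch; the folder, resource numbers, and module name are illustrative, not the original setup, and `main` is the script sketched below):

```python
import submitit

from repro import main  # hypothetical module containing the script sketched below

# Hypothetical launcher mirroring the 3-node setup from the comments above.
executor = submitit.AutoExecutor(folder="slurm_logs")
executor.update_parameters(
    nodes=3,
    tasks_per_node=1,
    gpus_per_node=8,   # a p4d.24xlarge node has 8 A100s
    timeout_min=60,
)
job = executor.submit(main)
```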
Here's a script to reproduce:
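The original script did not survive extraction. As a stand-in, here is a minimal sketch of the kind of multi-node RPC initialization the report describes; the rank/world-size plumbing, the `devices` value, and all names are assumptions, and the works/breaks toggle mentioned below is not reconstructed:

```python
import os

import torch.distributed.rpc as rpc

def main() -> None:
    # Under SLURM/submitit, rank and world size come from the environment;
    # MASTER_ADDR/MASTER_PORT are assumed to be exported by the launcher.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])

    opts = rpc.TensorPipeRpcBackendOptions(
        _transports=["uv"],  # the report says this was required on AWS
        devices=[],          # per the report, [] vs. a real list changes nothing
    )
    rpc.init_rpc(
        name=f"worker{rank}",
        rank=rank,
        world_size=world_size,
        rpc_backend_options=opts,
    )
    rpc.shutdown()

if __name__ == "__main__":
    main()
```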
I commented out the line that makes the code break if uncommented (you should comment out the line above it, tagged with `# works`).

### What does not matter

Whether `devices=list_of_devices` or `devices=[]`: the effect is the same.

I had to set `_transports` in the TensorPipe options because I'm running on AWS; without it, it doesn't run.

Here's the error:
### Versions
Latest torch nightly, locally built
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @pietern @jjlilley @mrzzd @lw @beauby