RendezvousConnectionError when use C10d on multi nodes #69197
Labels
oncall: r2p
Add this issue/PR to R2P (elastic) oncall triage queue
triaged
This issue has been looked at a team member, and triaged and prioritized into an appropriate module
馃悰 Bug
When I run the script by torchrun on multi nodes and multi gpus with rdzv_backend of c10d, the node can't create TCP connection with master. Each node can ping to each other and can connect to each other by TCP.
But it is OK if just runs on single node with args standalone.
The tracebacks of all nodes are the same:
To Reproduce
Steps to reproduce the behavior:
Here is what I init the distribution.
Expected behavior
Environment
Node 1:
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27
Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX TITAN X
GPU 1: NVIDIA GeForce GTX TITAN X
Nvidia driver version: 470.74
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.0
[pip3] torchnet==0.0.4
[pip3] torchvision==0.11.1
[conda] Could not collect
Node 2:
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 16.04.5 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23
Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti
Nvidia driver version: 470.63.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/local/cuda-9.0/lib64/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.21.0
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] Could not collect
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang
The text was updated successfully, but these errors were encountered: