RendezvousConnectionError when using C10d on multiple nodes #69197

Open
ghost opened this issue Dec 1, 2021 · 3 comments
Labels
oncall: r2p Add this issue/PR to R2P (elastic) oncall triage queue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ghost

ghost commented Dec 1, 2021

🐛 Bug

When I run the script with torchrun on multiple nodes and multiple GPUs using the c10d rdzv_backend, the nodes cannot establish a TCP connection to the master. The nodes can ping each other and can open TCP connections to each other.

Everything works, however, if I run on a single node with the --standalone argument.
The tracebacks on all nodes are the same:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
RuntimeError: connect() timed out. Original timeout was 60000 ms.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 228, in launch_agent
    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
    handler = creator(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 250, in create_backend
    store = _create_tcp_store(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 175, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
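
For context, here is a minimal sketch (my addition, not part of the original report) of the handshake that _create_tcp_store performs: the agent on the endpoint host starts a TCPStore server on the given port, and every other agent connects to it as a client. Reproducing that handshake by hand, with the host and port taken from the commands below, shows whether the failure is in the store connection itself rather than in the training script. The constants are assumptions taken from this report.

# Hedged sketch: mimic the c10d rendezvous TCPStore handshake.
# Run once with IS_MASTER = True on 192.168.1.65, then with
# IS_MASTER = False on the other node.
from datetime import timedelta

from torch.distributed import TCPStore

ENDPOINT_HOST = "192.168.1.65"   # value of --rdzv_endpoint in this report
ENDPOINT_PORT = 12900
IS_MASTER = False                # set to True only on the endpoint host

store = TCPStore(
    ENDPOINT_HOST,
    ENDPOINT_PORT,
    is_master=IS_MASTER,
    timeout=timedelta(seconds=60),  # same 60 s timeout as in the traceback
)
store.set("ping", "pong")
print("TCPStore reachable, got:", store.get("ping"))

If the client side times out here as well, the problem is at the network/store level and independent of DDP or the training code.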

To Reproduce

Steps to reproduce the behavior:

  1. Run torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=9527 --rdzv_backend=c10d --rdzv_endpoint=192.168.1.65:12900 pytorch_distributed_run.py on the first node.
  2. Run torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=9527 --rdzv_backend=c10d --rdzv_endpoint=192.168.1.65:12900 pytorch_distributed_run.py on the second node.
  3. The RendezvousConnectionError occurs on all nodes after the timeout (a basic port check is sketched below).
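
A quick check worth running here (my suggestion, not something the report shows): ping only proves ICMP reachability, so it can succeed even when TCP traffic to port 12900 is blocked by a firewall or nothing is listening on that port yet. A plain-socket probe from the second node, while the first node is already waiting in torchrun, narrows this down. The address and port are the ones from the commands above.

# Hedged reachability probe: run on the second node while torchrun
# is already waiting on 192.168.1.65.
import socket

with socket.create_connection(("192.168.1.65", 12900), timeout=5) as sock:
    print("TCP connection to the rendezvous endpoint succeeded:", sock.getpeername())

If this raises ConnectionRefusedError or times out, the C10d store is either not listening on that port or not reachable from the second node.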

Here is how I initialize the distributed setup.

import os
import sys

sys.path.append('../')

import torch
import torch.distributed as dist
import torch.distributed.launch
from torch import cuda
from torch.distributed.elastic.multiprocessing.errors import record
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import SGD
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import transforms
from torchvision.datasets import MNIST
from tqdm import trange, tqdm

from distributed.set_args import parser

from distributed.distributed_helpers import get_rank

from mnist_net import Net


@record
def main():
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False

    # torchrun performs the rendezvous and exports MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE, so no explicit init_method is passed here.
    dist.init_process_group(backend='nccl')

    # LOCAL_RANK is set by torchrun for each worker process on a node.
    local_rank = int(os.environ['LOCAL_RANK'])
    cuda.set_device(local_rank)

    model = Net().cuda()
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

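In the failing multi-node run this function is never reached, because the agent fails during rendezvous before spawning any workers; still, a small diagnostic like the following (my addition, not in the report) is handy for confirming, in the working standalone case, which rendezvous results each worker actually received.

# Hedged diagnostic: print the variables the torchrun agent exports to each
# worker after a successful rendezvous. Could be called at the top of main().
import os

def dump_rendezvous_env():
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
        print(f"{key}={os.environ.get(key)}")
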
Expected behavior

Environment

Node 1:
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX TITAN X
GPU 1: NVIDIA GeForce GTX TITAN X

Nvidia driver version: 470.74
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.0
[pip3] torchnet==0.0.4
[pip3] torchvision==0.11.1
[conda] Could not collect

Node 2:
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.5 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23

Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 470.63.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/local/cuda-9.0/lib64/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.0
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 1, 2021
@H-Huang H-Huang added oncall: r2p Add this issue/PR to R2P (elastic) oncall triage queue and removed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Dec 2, 2021
@mrshenli mrshenli added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 8, 2021
@hujunchao

I'm running into the same problem. How can I solve it?

@balcklive

Same problem here.

@chensongcan

I'm running into the same problem. How can I solve it?
