RendezvousConnectionError when using C10d on multiple nodes #69197

Open
ghost opened this issue Dec 1, 2021 · 3 comments
Labels
oncall: r2p Add this issue/PR to R2P (elastic) oncall triage queue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ghost

ghost commented Dec 1, 2021

🐛 Bug

When I run the script with torchrun on multiple nodes and multiple GPUs using the c10d rdzv_backend, the nodes cannot establish a TCP connection to the master. The nodes can ping each other and can open TCP connections to each other.

Everything works, however, if I run on a single node with the --standalone argument.
The tracebacks on all nodes are the same:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
RuntimeError: connect() timed out. Original timeout was 60000 ms.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 228, in launch_agent
    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
    handler = creator(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 250, in create_backend
    store = _create_tcp_store(params)
  File "/home/ubuntu/miniconda3/envs/py39/lib/python3.9/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 175, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
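
For context, here is a minimal sketch (my addition, not part of the original report) of the handshake that _create_tcp_store performs: the agent on the endpoint host starts a TCPStore server on the given port, and every other agent connects to it as a client. Reproducing that handshake by hand, with the host and port taken from the commands below, shows whether the failure is in the store connection itself rather than in the training script. The constants are assumptions taken from this report.

# Hedged sketch: mimic the c10d rendezvous TCPStore handshake.
# Run once with IS_MASTER = True on 192.168.1.65, then with
# IS_MASTER = False on the other node.
from datetime import timedelta

from torch.distributed import TCPStore

ENDPOINT_HOST = "192.168.1.65"   # value of --rdzv_endpoint in this report
ENDPOINT_PORT = 12900
IS_MASTER = False                # set to True only on the endpoint host

store = TCPStore(
    ENDPOINT_HOST,
    ENDPOINT_PORT,
    is_master=IS_MASTER,
    timeout=timedelta(seconds=60),  # same 60 s timeout as in the traceback
)
store.set("ping", "pong")
print("TCPStore reachable, got:", store.get("ping"))

If the client side times out here as well, the problem is at the network/store level and independent of DDP or the training code.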

To Reproduce

Steps to reproduce the behavior:

  1. Run torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=9527 --rdzv_backend=c10d --rdzv_endpoint=192.168.1.65:12900 pytorch_distributed_run.py on the first node.
  2. Run torchrun --nnodes=2 --nproc_per_node=2 --rdzv_id=9527 --rdzv_backend=c10d --rdzv_endpoint=192.168.1.65:12900 pytorch_distributed_run.py on the second node.
  3. The RendezvousConnectionError occurs on all nodes after the timeout (a basic port check is sketched below).
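
A quick check worth running here (my suggestion, not something the report shows): ping only proves ICMP reachability, so it can succeed even when TCP traffic to port 12900 is blocked by a firewall or nothing is listening on that port yet. A plain-socket probe from the second node, while the first node is already waiting in torchrun, narrows this down. The address and port are the ones from the commands above.

# Hedged reachability probe: run on the second node while torchrun
# is already waiting on 192.168.1.65.
import socket

with socket.create_connection(("192.168.1.65", 12900), timeout=5) as sock:
    print("TCP connection to the rendezvous endpoint succeeded:", sock.getpeername())

If this raises ConnectionRefusedError or times out, the C10d store is either not listening on that port or not reachable from the second node.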

Here is how I initialize the distributed setup.

import os
import sys

sys.path.append('../')

import torch
import torch.distributed as dist
import torch.distributed.launch
from torch import cuda
from torch.distributed.elastic.multiprocessing.errors import record
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim import SGD
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import transforms
from torchvision.datasets import MNIST
from tqdm import trange, tqdm

from distributed.set_args import parser

from distributed.distributed_helpers import get_rank

from mnist_net import Net


@record
def main():
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.deterministic = False

    # torchrun performs the rendezvous and exports MASTER_ADDR, MASTER_PORT,
    # RANK and WORLD_SIZE, so no explicit init_method is passed here.
    dist.init_process_group(backend='nccl')

    # LOCAL_RANK is set by torchrun for each worker process on a node.
    local_rank = int(os.environ['LOCAL_RANK'])
    cuda.set_device(local_rank)

    model = Net().cuda()
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)

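In the failing multi-node run this function is never reached, because the agent fails during rendezvous before spawning any workers; still, a small diagnostic like the following (my addition, not in the report) is handy for confirming, in the working standalone case, which rendezvous results each worker actually received.

# Hedged diagnostic: print the variables the torchrun agent exports to each
# worker after a successful rendezvous. Could be called at the top of main().
import os

def dump_rendezvous_env():
    for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
        print(f"{key}={os.environ.get(key)}")
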
Expected behavior

Environment

Node 1:
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.27

Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 9.1.85
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX TITAN X
GPU 1: NVIDIA GeForce GTX TITAN X

Nvidia driver version: 470.74
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchfile==0.1.0
[pip3] torchmetrics==0.4.0
[pip3] torchnet==0.0.4
[pip3] torchvision==0.11.1
[conda] Could not collect

Node 2:
PyTorch version: 1.10.0
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.5 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.23

Python version: 3.9.5 (default, Jun 4 2021, 12:28:51) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.23
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 470.63.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.2.1
/usr/local/cuda-9.0/lib64/libcudnn.so.7
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.0
[pip3] torch==1.10.0
[pip3] torchaudio==0.10.0
[pip3] torchvision==0.11.1
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Dec 1, 2021
@H-Huang H-Huang added oncall: r2p Add this issue/PR to R2P (elastic) oncall triage queue and removed oncall: distributed Add this issue/PR to distributed oncall triage queue labels Dec 2, 2021
@mrshenli mrshenli added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 8, 2021
@hujunchao

I'm running into the same problem. How can I solve it?

@balcklive

Same problem here.

@chensongcan

I'm running into the same problem. How can I solve it?
