
Failed to create Gloo new group after initialized with NCCL #68726

Open
zhuzilin opened this issue Nov 22, 2021 · 9 comments
Labels
module: nccl Problems related to nccl support oncall: distributed Add this issue/PR to distributed oncall triage queue

Comments

zhuzilin (Contributor) commented Nov 22, 2021

🐛 Bug

In our project Tencent/PatrickStar, we need to create both an NCCL comm group and a Gloo comm group in order to use GPU reduce-scatter as well as CPU comm operations. However, it seems that after initializing with NCCL, the new Gloo group does not pick up the master address and master port, and instead uses localhost (127.0.0.1).

I suspect the reason is that the NCCL store and the Gloo store are not compatible with each other, so the new Gloo group cannot read the master address saved by the NCCL group.

The relevant error message is:

Traceback (most recent call last):
  File "test_new_group.py", line 4, in <module>
    cpu_comm = torch.distributed.new_group(backend="gloo")
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2843, in new_group
    pg = _new_process_group_helper(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 668, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: [/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.0.1]:1453: Connection refused
RuntimeError: [/opt/pytorch/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:799] connect [127.0.0.1]:4005: Connection refused
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6027) of binary: /opt/conda/bin/python3

To Reproduce

# test_new_group.py
import torch

torch.distributed.init_process_group(backend="nccl")
cpu_comm = torch.distributed.new_group(backend="gloo")

Run:

python3 -m torch.distributed.launch --nproc_per_node=1 \
    --nnodes=2 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT \
    --node_rank=$NODE_RANK \
    test_new_group.py

Note that the code works when running on a single machine, or when the second group is also created with backend "nccl".

Expected behavior

Users should be able to create two comm groups with different backends.

Environment

I'm using the NGC container: nvcr.io/nvidia/pytorch:21.09-py3

  • PyTorch Version (e.g., 1.0): 1.10.0
  • OS (e.g., Linux): linux
  • Python version: 3.8
  • CUDA/cuDNN version: 11.4
  • GPU models and configuration: A100

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang

jbschlosser added the module: nccl (Problems related to nccl support) and oncall: distributed (Add this issue/PR to distributed oncall triage queue) labels Nov 22, 2021
rohan-varma (Member) commented

Hmm, I am not able to reproduce the issue using the latest version of PyTorch nightly.

I ran the following script:

https://github.com/rohan-varma/torch-script/blob/master/training_script.py

via

srun -p train --nodes=2 -t 5:00:00 --gpus-per-node=1 --cpus-per-task=8 ./test.sh

on a GPU cluster, where test.sh is https://github.com/rohan-varma/torch-script/blob/master/test.sh, and the output was -

[W socket.cpp:634] The server socket on [ip-10-200-91-114.ec2.internal]:29501 is not yet listening (generic error: 111 - Connection refused).
imported
imported
initialized pgs
done
initialized pgs
done
/fsx/users/rvarm1/conda/envs/pytorch_nightly/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated

A significant overhaul of the TCPStore has landed recently which improves the error logging, could you try on the latest/nightly PyTorch so we can get more error details? #68226

rohan-varma (Member) commented

In addition, new_group will use the same default_store as the default process group, so it is quite strange that it would attempt to connect to the wrong address.

Can you print out $MASTER_ADDR and $MASTER_PORT in both of your worker scripts?
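
(For reference, a minimal diagnostic sketch, not from the original report: printing the rendezvous-related environment at the top of the training script shows whether every worker sees the address and port the launcher set.)

# diagnostic sketch: print the rendezvous environment on every rank
# (these variable names are the ones set by torch.distributed.launch)
import os

for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(f"{key}={os.environ.get(key)}")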

zhuzilin (Contributor, Author) commented

@rohan-varma Thank you for your reply! I've double-checked the cluster and found that the real problem is that I could not establish the Gloo comm group at all. The reason is that the hostnames of the nodes were all set to the same value and resolved to 127.0.0.1, so Gloo resolved the hostname to localhost and connected to itself... I've manually changed the hostnames and it works now :)

Thank you for your help! I wonder, is there a way to pass the IP directly to Gloo?
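
(As a quick check for the misconfiguration described above, a hedged sketch: resolve the local hostname on each node and make sure it does not map to the loopback address.)

# sketch: verify the node's hostname does not resolve to loopback,
# which is what made Gloo connect to 127.0.0.1 in this issue
import socket

hostname = socket.gethostname()
resolved = socket.gethostbyname(hostname)
print(f"{hostname} resolves to {resolved}")
if resolved.startswith("127."):
    print("Warning: hostname resolves to loopback; Gloo peers will try to connect to 127.0.0.1")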

weiyx16 commented Oct 2, 2022

I found that if we set TORCH_DISTRIBUTED_DEBUG=INFO, PyTorch will set up a Gloo-backend group. If the cluster doesn't support Gloo communication, setting this environment variable causes the error.
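
(An illustrative sketch based on the observation above, assuming a cluster where Gloo TCP connections cannot be established: with the debug variable set, even an NCCL-only init fails because of the extra Gloo group.)

# sketch: TORCH_DISTRIBUTED_DEBUG must be set before init_process_group runs;
# per the report above, debug mode creates an additional Gloo-backend group,
# so this fails on clusters that cannot establish Gloo TCP connections
import os
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"

import torch.distributed as dist
dist.init_process_group(backend="nccl")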

masip85 commented Jul 17, 2023

> I found that if we set TORCH_DISTRIBUTED_DEBUG=INFO, PyTorch will set up a Gloo-backend group. If the cluster doesn't support Gloo communication, setting this environment variable causes the error.

This is happening to me too. If I want DEBUG info, is there a way to avoid that? Has this issue been identified?

kumpera (Contributor) commented Jul 24, 2023

This is not currently possible.

trias702 commented

@zhuzilin How did you manage to change the hostname so it works? Did you ever find a way to pass an IP directly to gloo?

chestnut-Q commented

> How did you manage to change the hostname so it works? Did you ever find a way to pass an IP directly to Gloo?

@trias702 You can try manually setting the network interface as follows: first, use ifconfig to find the interface corresponding to your IP address, such as em0 or eth0; then set the environment variable os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'.
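
(Concretely, that workaround looks something like the sketch below; eth0 is just an example, substitute the interface reported by ifconfig, and set the variable before the Gloo group is created.)

# sketch: pin Gloo to a specific network interface instead of relying on
# hostname resolution; "eth0" is an example, use the interface that owns
# the node's reachable IP address (see `ifconfig` or `ip addr`)
import os
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"

import torch.distributed as dist
dist.init_process_group(backend="nccl")
cpu_comm = dist.new_group(backend="gloo")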

jbohnslav commented

This is an unfortunate bug. The only time you want to set TORCH_DISTRIBUTED_DEBUG=INFO is if you're having trouble with torch.distributed. That seems like the wrong time to set up an extra process group. In my case, the gloo backend wasn't compatible with my environment and crashed all my jobs.
