-
Notifications
You must be signed in to change notification settings - Fork 24.9k
Open
Labels
oncall: distributedAdd this issue/PR to distributed oncall triage queueAdd this issue/PR to distributed oncall triage queue
Description
🐛 Describe the bug
Initializing torch distributed with NCCL backend:
import torch
torch.distributed.init_process_group(backend="nccl")
Leads to the error of:
Traceback (most recent call last):
File "main_task_caption.py", line 24, in <module>
torch.distributed.init_process_group(backend="nccl")
File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
barrier()
File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370120218/work/torch/lib/c10d/ProcessGroupNCCL.cpp:748, internal error, NCCL version 2.7.8
How should I handle such an issue? Pointers greatly appreciated
Versions
python=3.6.9
conda install pytorch==1.11.0 cudatoolkit=11.0 -c pytorch
NCCL version 2.7.8
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.0 (v11.0.221)
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501
Metadata
Metadata
Assignees
Labels
oncall: distributedAdd this issue/PR to distributed oncall triage queueAdd this issue/PR to distributed oncall triage queue