
torch.distributed.init_process_group(backend="nccl") NCCL version error #78638

@yrf1


🐛 Describe the bug

Initializing torch.distributed with the NCCL backend:

import torch
torch.distributed.init_process_group(backend="nccl")

This raises the following error:

Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370120218/work/torch/lib/c10d/ProcessGroupNCCL.cpp:748, internal error, NCCL version 2.7.8

How should I handle this issue? Any pointers would be greatly appreciated.

Versions

python=3.6.9
conda install pytorch==1.11.0 cudatoolkit=11.0 -c pytorch
NCCL version 2.7.8
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.0 (v11.0.221)
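
The versions above could be cross-checked from inside the failing environment, since the traceback only says "internal error". The following is a minimal diagnostic sketch, not a fix. `report_versions` and `nccl_debug_env` are hypothetical helper names introduced here for illustration. `NCCL_DEBUG` is the environment variable NCCL reads to enable verbose logging.

```python
import os


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL verbose logging enabled.

    Hypothetical helper: pass the result as `env=` when launching the
    training script in a subprocess.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # NCCL prints detailed diagnostics at this level
    return env


def report_versions():
    """Collect the versions that the installed PyTorch reports about itself.

    Hypothetical helper, not a PyTorch API.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}

    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        try:
            info["nccl"] = torch.cuda.nccl.version()
        except Exception:
            # Assumption: some builds ship without NCCL support.
            info["nccl"] = None
    return info
```

Calling `print(report_versions())` in the same conda environment shows which PyTorch build is actually being imported. Rerunning the script with `NCCL_DEBUG=INFO` set should make NCCL log more detail than the bare "internal error".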

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501

Labels: oncall: distributed
