torch.distributed.init_process_group(backend="nccl") NCCL version error

### 🐛 Describe the bug

Initializing `torch.distributed` with the NCCL backend:
```
import torch
torch.distributed.init_process_group(backend="nccl")
```
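For context, here is a minimal sketch of what this call needs around it. With no `init_method` given, `init_process_group` uses the `env://` rendezvous, which reads `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` from the environment. The helper names `rendezvous_env` and `init_nccl`, the address and port defaults, and the single-node GPU assignment are illustrative assumptions, not part of the original script.

```python
import os


def rendezvous_env(rank=0, world_size=1, addr="127.0.0.1", port=29500):
    """Build the variables the default env:// rendezvous reads.

    Hypothetical helper; the address and port defaults are assumptions.
    """
    return {
        "MASTER_ADDR": addr,
        "MASTER_PORT": str(port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_nccl(rank=0, world_size=1):
    """Initialize the default process group with the NCCL backend.

    Assumes a single node with at least one visible GPU.
    """
    import torch  # imported lazily so rendezvous_env works without torch
    import torch.distributed as dist

    for key, value in rendezvous_env(rank, world_size).items():
        os.environ.setdefault(key, value)  # keep values already set by a launcher
    # Bind this rank to one GPU before NCCL creates its communicator.
    torch.cuda.set_device(rank % torch.cuda.device_count())
    dist.init_process_group(backend="nccl")
    return dist
```

If the script is started by a launcher that already exports these variables, `setdefault` leaves them untouched.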

This raises the following error:
```
Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370120218/work/torch/lib/c10d/ProcessGroupNCCL.cpp:748, internal error, NCCL version 2.7.8
```

How should I handle this error? Any pointers would be greatly appreciated.
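Since `internal error` says little on its own, one common first step is to rerun with NCCL's own logging enabled. `NCCL_DEBUG=INFO` is a documented NCCL environment variable; the helper names below and the use of `subprocess` are assumptions for illustration, and the arguments should match however the script is actually launched.

```python
import os
import subprocess
import sys


def nccl_debug_env(base=None):
    """Return a copy of the environment with NCCL logging turned on."""
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # NCCL prints its init and transport details
    return env


def run_with_nccl_debug(args):
    """Run the current Python interpreter with args and NCCL logging enabled.

    Hypothetical helper; returns the child process exit code.
    """
    cmd = [sys.executable, *args]
    return subprocess.run(cmd, env=nccl_debug_env()).returncode
```

For the script in the traceback, this would be called as `run_with_nccl_debug(["main_task_caption.py"])`; the extra log lines before the failure usually narrow down which step of NCCL initialization went wrong.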

### Versions

- Python: 3.6.9
- PyTorch install command: `conda install pytorch==1.11.0 cudatoolkit=11.0 -c pytorch`
- NCCL: 2.7.8
- `nvidia-smi`: NVIDIA-SMI 470.103.01, Driver Version 470.103.01, CUDA Version 11.0 (v11.0.221)
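Because the traceback points at a specific environment's `site-packages`, it may be worth confirming which PyTorch build that interpreter actually imports instead of relying on the install command. A minimal sketch follows; `collect_versions` is a hypothetical helper name, and `torch.cuda.nccl.version()` returns an integer or a tuple depending on the PyTorch release.

```python
import sys


def collect_versions():
    """Gather version info relevant to the NCCL error.

    Hypothetical helper; tolerates torch being absent or built without CUDA.
    """
    info = {"python": sys.version.split()[0]}
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version torch was built with
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu_count"] = torch.cuda.device_count()
        info["nccl"] = torch.cuda.nccl.version()  # NCCL bundled with this build
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running this with the same interpreter that produced the traceback shows whether the reported PyTorch, CUDA, and NCCL versions agree with the ones listed above.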

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501

torch.distributed.init_process_group(backend="nccl") NCCL version error #78638
