torch.distributed.init_process_group(backend="nccl") NCCL version error #78638

Open
yrf1 opened this issue Jun 1, 2022 · 2 comments
Labels
oncall: distributed Add this issue/PR to distributed oncall triage queue

yrf1 commented Jun 1, 2022

🐛 Describe the bug

Initializing torch distributed with NCCL backend:

import torch
torch.distributed.init_process_group(backend="nccl")

This leads to the following error:

Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370120218/work/torch/lib/c10d/ProcessGroupNCCL.cpp:748, internal error, NCCL version 2.7.8

How should I handle this issue? Any pointers would be greatly appreciated.

Versions

python=3.6.9
conda install pytorch==1.11.0 cudatoolkit=11.0 -c pytorch
NCCL version 2.7.8
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.0 (v11.0.221)

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501

facebook-github-bot added the "oncall: distributed" label Jun 1, 2022
Collaborator

wanchaol commented Jun 5, 2022

@yrf1 Can you share the script you use to set up the process group? For example, did you specify MASTER_ADDR and MASTER_PORT? Did you follow the tutorials we shared (e.g. https://pytorch.org/tutorials/intermediate/dist_tuto.html)?

Let me know if you still see failures after following those tutorials, thanks!
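
For reference, here is a minimal sketch of what "specifying MASTER_ADDR and MASTER_PORT" can look like when the default env:// initialization is used. The helper names (rendezvous_env, init_single_process) and the address, port, rank, and world size values are placeholders chosen for illustration, not taken from the reporter's script.

```python
import os


def rendezvous_env(master_addr="127.0.0.1", master_port=29500, rank=0, world_size=1):
    """Build the environment variables read by the env:// init method.

    All default values here are illustrative placeholders.
    """
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),
        "WORLD_SIZE": str(world_size),
    }


def init_single_process(backend="nccl"):
    """Set any missing rendezvous variables, then create the process group."""
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch.distributed as dist

    for key, value in rendezvous_env().items():
        # Keep values that a launcher has already exported.
        os.environ.setdefault(key, value)

    # With no init_method argument, init_process_group uses env://.
    dist.init_process_group(backend=backend)
    return dist
```

When the job is started through a launcher such as torchrun, these variables are normally exported for each worker already, so setdefault leaves them untouched.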

Contributor

kwen2501 commented Jun 7, 2022

Hi, I recommend setting NCCL_DEBUG=INFO to see what the NCCL internal error is about.
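
One way to do this from inside the script, sketched under the assumption that the variable is set before the process group is created (the helper name enable_nccl_debug is hypothetical). Exporting NCCL_DEBUG=INFO in the shell before launching should have the same effect.

```python
import os


def enable_nccl_debug(level="INFO"):
    """Ask NCCL for verbose logging; set this before initializing the process group."""
    os.environ["NCCL_DEBUG"] = level
    return os.environ["NCCL_DEBUG"]


enable_nccl_debug()
# torch.distributed.init_process_group(backend="nccl") would follow here.
```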

Also, NCCL 2.7.8 is pretty old. PyTorch 1.11 shipped with NCCL 2.10.3 at the time of its release.
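
Since the traceback reports NCCL 2.7.8, it may be worth checking which PyTorch, CUDA, and NCCL versions the active environment actually loads. The following is a rough sketch; the function name report_versions is hypothetical, and the format of the NCCL version value can differ between PyTorch releases.

```python
def report_versions():
    """Collect the PyTorch, CUDA, and NCCL versions visible to this interpreter."""
    info = {"torch": None, "cuda": None, "nccl": None}
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return info

    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        from torch.cuda import nccl

        info["nccl"] = nccl.version()
    except Exception:
        # NCCL is not available in this build; leave the entry as None.
        pass
    return info


print(report_versions())
```

If the reported PyTorch version does not match the one that was installed, the script may be running in a different environment than expected.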
