Traceback (most recent call last):
  File "main_task_caption.py", line 24, in <module>
    torch.distributed.init_process_group(backend="nccl")
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/shared/nas/data/users/yifung2/envs/py_univl/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370120218/work/torch/lib/c10d/ProcessGroupNCCL.cpp:748, internal error, NCCL version 2.7.8
How should I handle this issue? Any pointers would be greatly appreciated.
Versions
python=3.6.9
conda install pytorch==1.11.0 cudatoolkit=11.0 -c pytorch
NCCL version 2.7.8
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.0 (v11.0.221)
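🐛 Describe the bug
Initializing torch distributed with the NCCL backend, via the torch.distributed.init_process_group(backend="nccl") call at line 24 of main_task_caption.py, leads to the RuntimeError shown in the traceback above.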
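In case it helps with triage, here is a minimal sketch of how more detail about a failure like this could be gathered. It is not a fix. The helper names `nccl_debug_env` and `collect_versions` are hypothetical and made up for illustration; the only external pieces assumed are NCCL's `NCCL_DEBUG` environment variable and the standard `torch.version` / `torch.cuda` attributes. The idea is to turn on NCCL's own logging before `init_process_group` runs and to print the versions the interpreter actually loads, so they can be compared against the ones listed under Versions.

```python
import os


def nccl_debug_env(base=None):
    """Return a copy of an environment mapping with NCCL logging enabled.

    Hypothetical helper: an existing NCCL_DEBUG value is left untouched.
    """
    env = dict(os.environ if base is None else base)
    # NCCL reads NCCL_DEBUG itself; INFO makes it print setup details.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def collect_versions():
    """Report the versions visible to the running interpreter.

    Hypothetical helper: returns {"torch": None} if torch is not importable.
    """
    try:
        import torch
    except ImportError:
        return {"torch": None}
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }
    if torch.cuda.is_available():
        # NCCL version bundled with this torch build.
        info["nccl"] = torch.cuda.nccl.version()
    return info


if __name__ == "__main__":
    # Set the variable before any call to init_process_group.
    os.environ.update(nccl_debug_env())
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Running this in the same conda environment as main_task_caption.py, and then re-running the failing script with `NCCL_DEBUG=INFO` set, should give output that is more specific than "internal error".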
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @kwen2501