
NCCL watchdog thread terminated with exception #113128

@syngokhan

Description

🐛 Describe the bug

Hello,

I have two H100 GPUs and I'm running a training application via DeepSpeed-Chat. I previously ran Llama2-Chat-hf training three or four times and it finished successfully. Now the training either starts and crashes partway through, or does not start at all; either way I hit the error shared below. The same error occurs with one GPU and with multiple GPUs.

I haven't been able to solve this problem. Do I need to upgrade or downgrade the CUDA version? I would appreciate your help.
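
To isolate whether the failure is in the NCCL/CUDA stack itself rather than in DeepSpeed-Chat, a minimal all_reduce smoke test like the one below can be run first. This is a sketch, not the actual training code; the script name and launch command are illustrative, and it assumes torchrun and the nccl backend.

```python
# smoke_test.py: minimal NCCL check outside DeepSpeed-Chat.
# Launch with: torchrun --nproc_per_node=2 smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # Each rank contributes a distinct tensor; all_reduce sums across GPUs.
    x = torch.full((1024,), float(dist.get_rank() + 1), device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()
    print(f"rank {dist.get_rank()}: all_reduce ok, x[0]={x[0].item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this fails with the same "CUDA error: unknown error", the problem is in the driver/CUDA/NCCL stack rather than in DeepSpeed-Chat.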

Error

[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd75cf92617 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd75cf4d98d in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd806cea9f8 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fd6e8500af0 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd6e8504918 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7fd6e851b15b in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fd6e851b468 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fd72d0dbbf4 in /usr/anaconda3/envs/train/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd80a894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126a40 (0x7fd80a926a40 in /lib/x86_64-linux-gnu/libc.so.6)
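
As the error message itself suggests, re-running with CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the stack trace points at the actual failing call. A sketch of setting it from Python follows; the variable must be set before CUDA is initialized, and NCCL_DEBUG=INFO is an optional extra that I am assuming will add useful logs here.

```python
# Put this at the very top of the training entry point, before torch touches CUDA.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # report CUDA errors at the real call site
os.environ["NCCL_DEBUG"] = "INFO"         # verbose NCCL logging (assumption: useful here)

import torch  # noqa: E402  (imported after setting the variables so they take effect)
```

Equivalently, both variables can be exported in the shell before launching the job.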

Versions

Environment:

Python 3.10.12
CUDA 12.1.1
cuDNN 8.9.2.26
NCCL 2.18.1
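
For completeness, the versions that the installed torch build actually reports at runtime can be checked with standard torch APIs (a quick sketch; the values in the comments are what I would expect given the pip list below):

```python
import torch

print(torch.__version__)               # expected 2.1.0
print(torch.version.cuda)              # CUDA version torch was built against
print(torch.backends.cudnn.version())  # cuDNN as an int, e.g. 8902 for 8.9.2
print(torch.cuda.nccl.version())       # NCCL version tuple, e.g. (2, 18, 1)
print(torch.cuda.get_device_name(0))   # should report an H100
```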

absl-py==2.0.0
accelerate==0.24.1
aiohttp==3.8.6
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
cachetools==5.3.2
certifi==2023.7.22
charset-normalizer==3.3.1
datasets==2.14.6
deepspeed==0.11.1
dill==0.3.7
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.10.0
google-auth==2.23.3
google-auth-oauthlib==1.1.0
grpcio==1.59.2
hjson==3.1.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
Markdown==3.5
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pandas==2.1.2
Pillow==10.1.0
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==13.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.13
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.4.0
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tensorboard==2.15.0
tensorboard-data-server==0.7.2
tokenizers==0.14.1
torch==2.1.0
torchaudio==2.1.0
torchvision==0.16.0
tqdm==4.66.1
transformers==4.35.0
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
Werkzeug==3.0.1
xxhash==3.4.1
yarl==1.9.2

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @rohan-varma

    Labels

    module: nccl (Problems related to nccl support)
    oncall: distributed (Add this issue/PR to distributed oncall triage queue)
    triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
