NCCL watchdog thread terminated with exception #113128
Comments
Hello again, do you have any suggestions or solutions for this issue? I downgraded and upgraded the CUDA, cuDNN, and NCCL versions (I couldn't try them all) but I couldn't find a solution. This problem causes the computer to crash completely and makes it hard to restart; the machine cannot recover for long periods of time. As a note, we have an H100 x 2 workstation. Maybe the problem is specific to this graphics card. Thanks again for your help.
I wonder if you're hitting a crash inside the CUDA driver or have a hardware issue? The stack trace you're referencing here (from the watchdog) may not point at the real source of the error, since CUDA errors are reported asynchronously. You can try rerunning with CUDA_LAUNCH_BLOCKING=1 so the failure is reported at the call that actually triggered it.
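In case it helps, here is a minimal sketch of setting that variable from inside the script instead of on the command line (setting it in the launching shell, e.g. before the deepspeed/torchrun command, works just as well). The tensor names and shapes below are just placeholders; the only important part is setting the variable before anything touches CUDA.

```python
import os

# CUDA_LAUNCH_BLOCKING is read when the CUDA context is initialized,
# so it must be set before the first CUDA call in the process.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# With blocking launches, a failing kernel raises at its own call site
# instead of being reported later by an unrelated call (or the NCCL watchdog).
x = torch.randn(4, 4, device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y.sum().item())
```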
I still received the same errors with CUDA_LAUNCH_BLOCKING=1 set. It gave results like this. You can correct me if I'm wrong.
How can I tell whether it is a hardware issue or not? What path should I follow? I have CUDA, cuDNN, and nvcc fully installed. I can see the device with nvidia-smi and monitor it while it is being used, but during training it suddenly disappears, and believe me, it takes a long time for the computer to reset itself; at some point it collapses completely. When the crash occurs and I run nvidia-smi, it just returns an error. What I don't understand is that the 4 trainings completed before this had no problems, but now this keeps happening. Maybe my debugging is insufficient, I don't know. I would be happy if you could guide me. I installed the PyTorch 2.2.0 dev build and it now throws an error like this. Should I upgrade to CUDA 12.3, etc.? Other versions too?
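One way to separate a hardware/driver problem from an application/DeepSpeed problem is to run a tiny, self-contained NCCL collective loop on both GPUs in the same environment. The script below is only a sketch (the file name and tensor size are arbitrary): if even this crashes the box, the issue is likely below PyTorch (driver, NCCL, or hardware); if it runs cleanly, the problem is more likely in the training code path.

```python
# nccl_smoke_test.py -- minimal NCCL sanity check (hypothetical file name).
# Launch with: torchrun --nproc_per_node=2 nccl_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    rank = dist.get_rank()

    for _ in range(100):
        # A reasonably large all_reduce across both GPUs, synchronized each step.
        t = torch.ones(64 * 1024 * 1024, device="cuda") * (rank + 1)
        dist.all_reduce(t)
        torch.cuda.synchronize()

    if rank == 0:
        print("all_reduce smoke test finished without errors")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```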
Same issue here with PyTorch 2.2 |
I think this error points to an application error (or a DeepSpeed library error) where a tensor that is given to a CUDA kernel is not aligned. It is surfaced by the NCCL watchdog because the watchdog periodically checks CUDA state, but the error could have come from any CUDA kernel (compute or communication). Because CUDA's API is asynchronous, errors happening during one kernel are not immediately raised to the CPU; they are only noticed later by whoever checks the device state (like the watchdog).
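As a small, self-contained illustration of that asynchrony (unrelated to the specific crash in this issue): an out-of-bounds index inside a CUDA kernel typically does not raise at the indexing call, only at the next point where the host synchronizes with the device.

```python
import torch

x = torch.randn(10, device="cuda")
idx = torch.tensor([100], device="cuda")  # deliberately out of range

# The gather kernel is launched asynchronously, so this line usually returns
# without any error even though the kernel hits a device-side assert.
y = x[idx]

try:
    # The error only surfaces when something synchronizes with the device,
    # which is the same reason the NCCL watchdog can be the one to report it.
    torch.cuda.synchronize()
except RuntimeError as e:
    print("error surfaced at sync, not at the indexing call:", e)
```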
Same issue here. Upgrading torch and torch lightning didn't help. |
Same issue |
same issue |
🐛 Describe the bug
Hello,
I have two H100 devices. I'm running an application via DeepSpeedChat. I ran LLama2-Chat-hf training 3 or 4 times before and it finished successfully. Now the training either starts and crashes in the middle, or it doesn't start at all and throws this error. When I start the training I encounter the error below. I really couldn't solve this problem; what should I do?
Both single-GPU and multi-GPU runs throw this error.
Do I need to upgrade or downgrade the CUDA versions? I would appreciate your help.
#ERROR
[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd75cf92617 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd75cf4d98d in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd806cea9f8 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fd6e8500af0 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd6e8504918 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7fd6e851b15b in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fd6e851b468 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fd72d0dbbf4 in /usr/anaconda3/envs/train/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd80a894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126a40 (0x7fd80a926a40 in /lib/x86_64-linux-gnu/libc.so.6)
Versions
ENV List:
Python 3.10.12
CUDA version = 12.1.1
cuDNN version = 8.9.2.26
NCCL version = 2.18.1
absl-py==2.0.0
accelerate==0.24.1
aiohttp==3.8.6
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
cachetools==5.3.2
certifi==2023.7.22
charset-normalizer==3.3.1
datasets==2.14.6
deepspeed==0.11.1
dill==0.3.7
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.10.0
google-auth==2.23.3
google-auth-oauthlib==1.1.0
grpcio==1.59.2
hjson==3.1.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
Markdown==3.5
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pandas==2.1.2
Pillow==10.1.0
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==13.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.13
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.4.0
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tensorboard==2.15.0
tensorboard-data-server==0.7.2
tokenizers==0.14.1
torch==2.1.0
torchaudio==2.1.0
torchvision==0.16.0
tqdm==4.66.1
transformers==4.35.0
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
Werkzeug==3.0.1
xxhash==3.4.1
yarl==1.9.2
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin