
NCCL watchdog thread terminated with exception #113128

Open
syngokhan opened this issue Nov 7, 2023 · 11 comments
Labels
module: nccl (Problems related to nccl support) · oncall: distributed (Add this issue/PR to distributed oncall triage queue)

Comments

@syngokhan

syngokhan commented Nov 7, 2023

🐛 Describe the bug

Hello,

I have two H100 devices and I'm running an application via DeepSpeedChat. I previously ran LLama2-Chat-hf training three or four times and it finished successfully. Now the training either starts and crashes partway through, or doesn't start at all and throws the error below. I really couldn't solve this problem; what should I do?

The error occurs with both a single GPU and multiple GPUs.

Do I need to upgrade or downgrade the CUDA version? I would appreciate your help.

#ERROR

[E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd75cf92617 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fd75cf4d98d in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fd806cea9f8 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fd6e8500af0 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fd6e8504918 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7fd6e851b15b in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fd6e851b468 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7fd72d0dbbf4 in /usr/anaconda3/envs/train/bin/../lib/libstdc++.so.6)
frame #8: + 0x94ac3 (0x7fd80a894ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: + 0x126a40 (0x7fd80a926a40 in /lib/x86_64-linux-gnu/libc.so.6)

Versions

Environment:

Python 3.10.12
CUDA version = 12.1.1
cuDNN version = 8.9.2.26
NCCL version = 2.18.1

absl-py==2.0.0
accelerate==0.24.1
aiohttp==3.8.6
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
cachetools==5.3.2
certifi==2023.7.22
charset-normalizer==3.3.1
datasets==2.14.6
deepspeed==0.11.1
dill==0.3.7
filelock==3.13.1
frozenlist==1.4.0
fsspec==2023.10.0
google-auth==2.23.3
google-auth-oauthlib==1.1.0
grpcio==1.59.2
hjson==3.1.0
huggingface-hub==0.17.3
idna==3.4
Jinja2==3.1.2
Markdown==3.5
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
multiprocess==0.70.15
networkx==3.2.1
ninja==1.11.1.1
numpy==1.26.1
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.52
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pandas==2.1.2
Pillow==10.1.0
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==13.0.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pydantic==1.10.13
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
regex==2023.10.3
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
safetensors==0.4.0
sentencepiece==0.1.99
six==1.16.0
sympy==1.12
tensorboard==2.15.0
tensorboard-data-server==0.7.2
tokenizers==0.14.1
torch==2.1.0
torchaudio==2.1.0
torchvision==0.16.0
tqdm==4.66.1
transformers==4.35.0
triton==2.1.0
typing_extensions==4.8.0
tzdata==2023.3
urllib3==2.0.7
Werkzeug==3.0.1
xxhash==3.4.1
yarl==1.9.2

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin

@malfet added the oncall: distributed and module: nccl labels Nov 7, 2023
@wconstab
Contributor

wconstab commented Nov 7, 2023

cc @fduwjj @kwen2501 any idea if this is already fixed by a certain combination of nccl version or pytorch version? Any other debug flag we should turn on to get more info?

@syngokhan
Author

Hello again, do you have any suggestions or solutions for this issue? I downgraded and upgraded the CUDA, cuDNN, and NCCL versions (I couldn't try every combination), but I couldn't find a solution. The problem crashes the machine completely and makes it hard to restart; it takes a long time to recover. As a note, this is a workstation with two H100s, so maybe the problem is specific to this GPU. Thanks again for your help.

@wconstab
Contributor

wconstab commented Nov 8, 2023

I wonder if you're hitting a crash inside the cuda driver or have a hardware issue?

The stack trace you're referencing here (frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fd6e8500af0 in /usr/anaconda3/envs/train/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)) points to a background thread we run that waits for asynchronous CUDA communication ops to finish and checks whether they encountered an error during their async execution. In this case, it looks like that thread found that one of these communication ops raised an error. This can happen either due to an error in that operator, or an error in some other CUDA operation that was scheduled before it (since CUDA's API is asynchronous and the CPU side doesn't check for CUDA errors after every operation launch).

You can try rerunning with the CUDA_LAUNCH_BLOCKING=1 env var, which makes the CPU side check for CUDA errors after each kernel launch. That might help pinpoint where the error is actually happening in your program.
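
As an illustration of the two points above, here is a minimal sketch (an editorial example, not code from this thread; it assumes an NCCL process group is already initialized):

```python
# Hedged sketch (not from this thread): CUDA_LAUNCH_BLOCKING must be set before
# torch creates the CUDA context, so that kernel launches become synchronous and
# a failing kernel reports its error at the launch site instead of later in the watchdog.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["NCCL_DEBUG"] = "INFO"  # optional: more verbose NCCL logging

import torch
import torch.distributed as dist

def async_allreduce(tensor: torch.Tensor) -> torch.Tensor:
    # Assumes dist.init_process_group("nccl") has already been called.
    # Async collectives return a Work handle; ProcessGroupNCCL's watchdog polls
    # handles like this one in the background, which is where the quoted
    # isCompleted()/finishedGPUExecutionInternal() frames come from.
    work = dist.all_reduce(tensor, op=dist.ReduceOp.SUM, async_op=True)
    work.wait()  # raises here (or later, in the watchdog) if the CUDA op failed
    return tensor
```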

@syngokhan
Author

I still received the same errors with CUDA_LAUNCH_BLOCKING=1 set. The output looked like this; please correct me if I'm wrong.

[screenshot: MicrosoftTeams-image]

@syngokhan
Author

How can I tell whether it is a hardware issue or not? What path should I follow? I have CUDA, cuDNN, and nvcc fully installed, and I can see and monitor the device with nvidia-smi. While it is in use, the GPU suddenly disappears during training and, believe me, it takes a long time for the machine to reset itself. It collapses at some point.

When I run nvidia-smi after the crash occurs, I get this error:

cmd -> nvidia-smi
Unable to determine the device handle for GPU0000:61:00.0: Unknown Error

What I don't understand is that the four training runs completed before this had no problems, but now this keeps happening. Maybe my debugging is lacking, I don't know. I would be happy if you could guide me.

I installed PyTorch 2.2.0 dev and now it throws an error like this:

Should I upgrade to CUDA 12.3, etc.? Other versions too?

[screenshot: MicrosoftTeams-image (1)]

@minkowski0125

Hi, I'm hitting the same issue here with
torch 2.1.0a0+32f93b1
CUDA Version: 12.2

It only happens when training with bfloat16 (fp16 is fine), and strangely it always fails on the same rank across multiple attempts.

6: [E ProcessGroupNCCL.cpp:852] [Rank 6] NCCL watchdog thread terminated with exception: CUDA error: misaligned address
6: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
6: 
6: Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
6: frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f25883be449 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
6: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f25883790c4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
6: frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f25884567e2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
6: frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x7b (0x7f24ed2de73b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f24ed2e2928 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11b (0x7f24ed2e6f7b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x80 (0x7f24ed2e72f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #7: <unknown function> + 0xdc253 (0x7f2589474253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
6: frame #8: <unknown function> + 0x94ac3 (0x7f258d6a5ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
6: frame #9: <unknown function> + 0x126a40 (0x7f258d737a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
6: 
6: terminate called after throwing an instance of 'std::runtime_error'
6:   what():  [Rank 6] NCCL watchdog thread terminated with exception: CUDA error: misaligned address
6: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
6: 
6: Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
6: frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x99 (0x7f25883be449 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
6: frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7f25883790c4 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
6: frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7f25884567e2 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
6: frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x7b (0x7f24ed2de73b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f24ed2e2928 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x11b (0x7f24ed2e6f7b in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x80 (0x7f24ed2e72f0 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
6: frame #7: <unknown function> + 0xdc253 (0x7f2589474253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
6: frame #8: <unknown function> + 0x94ac3 (0x7f258d6a5ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
6: frame #9: <unknown function> + 0x126a40 (0x7f258d737a40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
6: 
6: scripts/train_ddp.sh: line 52: 1228930 Aborted                 (core dumped) WORLD_SIZE=8 RANK=6 MASTER_ADDR=g0003 MASTER_PORT=10848 LOCAL_RANK=6 LOCAL_WORLD_SIZE=8 python train.py --base configs/model/base.yaml configs/training/pt-256.yaml

wondering if you have managed to solve this :(
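
To isolate the bfloat16-vs-fp16 dependence described above, one option is to flip only the autocast dtype and keep everything else fixed. A hedged sketch of mine; `model`, `batch`, `target`, and `loss_fn` are hypothetical placeholders, not names from this thread:

```python
# Hedged sketch: run the same training step under bf16 and fp16 autocast to see
# whether the "misaligned address" crash really tracks the dtype.
import torch

def run_step(model, batch, target, loss_fn, dtype):
    with torch.autocast(device_type="cuda", dtype=dtype):
        loss = loss_fn(model(batch), target)
    loss.backward()
    return loss.detach()

# run_step(model, batch, target, loss_fn, torch.bfloat16)  # reported to crash
# run_step(model, batch, target, loss_fn, torch.float16)   # reported to run fine
```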

@npuichigo

Same issue here with PyTorch 2.2

@wconstab
Contributor

NCCL watchdog thread terminated with exception: CUDA error: misaligned address

I think this error points to an application error (or a DeepSpeed library error) where a tensor given to a CUDA kernel is not properly aligned.

Also, it is surfaced by the NCCL watchdog because the watchdog periodically checks CUDA state, but the error could have come from any CUDA kernel (compute or communication). Because CUDA's API is async, errors happening during one kernel are not immediately 'raised' to the CPU and are only noticed later by something that checks the state (like the watchdog).

CUDA_LAUNCH_BLOCKING=1 may actually help you pinpoint your issue to a specific kernel launch (it might be different from the original issue posted above).
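
For illustration, a small check of my own (not an official diagnostic; `model` is a placeholder) that flags CUDA tensors whose data pointer breaks the 16-byte alignment many vectorized kernels assume:

```python
# Hedged sketch: report CUDA tensors whose data pointer is not aligned to the
# boundary that vectorized/fused kernels commonly assume (16 bytes here).
import torch

def report_misaligned(named_tensors, alignment: int = 16) -> None:
    for name, t in named_tensors:
        if t.is_cuda and t.data_ptr() % alignment != 0:
            print(f"{name}: misaligned data_ptr "
                  f"(storage_offset={t.storage_offset()}, dtype={t.dtype}, shape={tuple(t.shape)})")

# Usage, assuming a `model` variable: report_misaligned(model.named_parameters())
```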

@mfoglio

mfoglio commented May 2, 2024

Same issue here. Upgrading torch and PyTorch Lightning didn't help.

@shashwat14

Same issue

@lxysl

lxysl commented May 16, 2024

same issue
