
RuntimeError: CUDA error: device-side assert triggered #28

Closed
tekinek opened this issue Nov 30, 2023 · 1 comment

tekinek commented Nov 30, 2023

Hi, PL-BERT training has terminated a few times with the following error. What might be the underlying cause? Is something wrong inside the dataset? It is difficult to debug because the error only occurs every several hours or once a day.

By the way, has anyone tried a regular pretrained language model (like BERT/ALBERT/RoBERTa) as an alternative encoder for StyleTTS? Those models only mask/replace whole tokens, while PL-BERT masks/replaces individual phonemes as well.

Step [649200/1000000], Loss: 2.91462, Vocab Loss: 2.15879, Token Loss: 1.49617
Step [649400/1000000], Loss: 2.95243, Vocab Loss: 2.51928, Token Loss: 1.62138
Step [649600/1000000], Loss: 2.91280, Vocab Loss: 0.92345, Token Loss: 0.68771
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
[W CUDAGuardImpl.h:115] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
Traceback (most recent call last):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/launchers.py", line 175, in notebook_launcher
start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/launch.py", line 554, in call
self.launcher(*args)
File "/home/tts/speechlab/repo/PL-BERT-main/train.py", line 119, in train
for _, batch in enumerate(train_loader):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/data_loader.py", line 460, in iter
current_batch = send_to_device(current_batch, self.device)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 152, in
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 167, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
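
For context, the assertion t >= 0 && t < n_classes is raised by the NLL/cross-entropy kernel when a target index falls outside [0, n_classes). One way to rule the dataset in or out is to scan it offline for indices that exceed the configured vocabulary sizes. The sketch below is only illustrative: the dataset path, the column names "phonemes"/"tokens", and the vocabulary sizes are assumptions to be adapted to the actual PL-BERT preprocessing and config.yml.

# Offline sanity check for out-of-range indices (a sketch, not the actual
# PL-BERT schema; adjust column names, path, and sizes to your setup).
from datasets import load_from_disk

PHONEME_VOCAB_SIZE = 178   # assumed number of phoneme classes from config.yml
TOKEN_VOCAB_SIZE = 30000   # assumed number of word-token classes

dataset = load_from_disk("processed_dataset")  # assumed path to the preprocessed data

bad_samples = []
for idx, sample in enumerate(dataset):
    phonemes = sample["phonemes"]   # assumed column name
    tokens = sample["tokens"]       # assumed column name
    if any(p < 0 or p >= PHONEME_VOCAB_SIZE for p in phonemes) or \
       any(t < 0 or t >= TOKEN_VOCAB_SIZE for t in tokens):
        bad_samples.append(idx)

print(f"{len(bad_samples)} samples with out-of-range indices, e.g. {bad_samples[:10]}")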

yl4579 (Owner) commented Nov 30, 2023

Your dataset contains more tokens or a larger vocabulary than the settings in your model config, and the out-of-range index only occurs occasionally. The easiest way to fix this is to skip the samples where the vocabulary index is out of range.
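
A minimal sketch of that workaround, assuming the training loop iterates train_loader and that each batch can be unpacked into phoneme IDs and word-token labels; the helper name and vocabulary sizes are hypothetical and should be matched to train.py and config.yml. It skips any batch containing an index the embedding or loss cannot handle, instead of letting the CUDA kernel assert:

import torch

def indices_in_range(phonemes: torch.Tensor, tokens: torch.Tensor,
                     num_phonemes: int, num_tokens: int) -> bool:
    # True only if every index is a valid class for the embedding and the loss.
    return (int(phonemes.min()) >= 0 and int(phonemes.max()) < num_phonemes
            and int(tokens.min()) >= 0 and int(tokens.max()) < num_tokens)

# Hypothetical use inside the training loop (adapt names to the real train.py):
# for _, batch in enumerate(train_loader):
#     phonemes, tokens = batch[0], batch[1]
#     if not indices_in_range(phonemes, tokens, num_phonemes=178, num_tokens=30000):
#         continue   # skip the offending sample instead of crashing the run
#     ...            # usual forward/backward pass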
