
RuntimeError: CUDA error: device-side assert triggered #28

Closed
tekinek opened this issue Nov 30, 2023 · 1 comment

tekinek commented Nov 30, 2023

Hi, PL-BERT training has terminated a few times with the following error. What might be the underlying cause? Is something wrong inside the dataset? It is difficult to debug because the error only occurs every several hours or once a day.

By the way, has anyone tried a regular pretrained language model (like BERT/ALBERT/RoBERTa) as an alternative encoder for StyleTTS? Those models only mask/replace whole tokens, while PL-BERT masks/replaces individual phonemes as well.

Step [649200/1000000], Loss: 2.91462, Vocab Loss: 2.15879, Token Loss: 1.49617
Step [649400/1000000], Loss: 2.95243, Vocab Loss: 2.51928, Token Loss: 1.62138
Step [649600/1000000], Loss: 2.91280, Vocab Loss: 0.92345, Token Loss: 0.68771
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [30,0,0] Assertion t >= 0 && t < n_classes failed.
../aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [31,0,0] Assertion t >= 0 && t < n_classes failed.
[W CUDAGuardImpl.h:115] Warning: CUDA warning: device-side assert triggered (function destroyEvent)
Traceback (most recent call last):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/launchers.py", line 175, in notebook_launcher
start_processes(launcher, args=args, nprocs=num_processes, start_method="fork")
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 202, in start_processes
while not context.join():
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 163, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
fn(i, *args)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/launch.py", line 554, in call
self.launcher(*args)
File "/home/tts/speechlab/repo/PL-BERT-main/train.py", line 119, in train
for _, batch in enumerate(train_loader):
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/data_loader.py", line 460, in iter
current_batch = send_to_device(current_batch, self.device)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 151, in send_to_device
return honor_type(
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 83, in honor_type
return type(obj)(generator)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 152, in
tensor, (send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys) for t in tensor)
File "/home/tts/miniconda3/envs/xtts/lib/python3.10/site-packages/accelerate/utils/operations.py", line 167, in send_to_device
return tensor.to(device, non_blocking=non_blocking)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
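
For context, the assertion t >= 0 && t < n_classes is raised by the NLL/cross-entropy kernel when a target index falls outside [0, n_classes). One way to rule the dataset in or out is to scan it offline for indices that exceed the configured vocabulary sizes. The sketch below is only illustrative: the dataset path, the column names "phonemes"/"tokens", and the vocabulary sizes are assumptions to be adapted to the actual PL-BERT preprocessing and config.yml.

# Offline sanity check for out-of-range indices (a sketch, not the actual
# PL-BERT schema; adjust column names, path, and sizes to your setup).
from datasets import load_from_disk

PHONEME_VOCAB_SIZE = 178   # assumed number of phoneme classes from config.yml
TOKEN_VOCAB_SIZE = 30000   # assumed number of word-token classes

dataset = load_from_disk("processed_dataset")  # assumed path to the preprocessed data

bad_samples = []
for idx, sample in enumerate(dataset):
    phonemes = sample["phonemes"]   # assumed column name
    tokens = sample["tokens"]       # assumed column name
    if any(p < 0 or p >= PHONEME_VOCAB_SIZE for p in phonemes) or \
       any(t < 0 or t >= TOKEN_VOCAB_SIZE for t in tokens):
        bad_samples.append(idx)

print(f"{len(bad_samples)} samples with out-of-range indices, e.g. {bad_samples[:10]}")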

yl4579 (Owner) commented Nov 30, 2023

Your dataset contains more tokens or a larger vocabulary than the settings in your model config, and the out-of-range index only occurs occasionally. The easiest way to fix this is to skip the samples where the vocabulary index is out of range.
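
A minimal sketch of that workaround, assuming the training loop iterates train_loader and that each batch can be unpacked into phoneme IDs and word-token labels; the helper name and vocabulary sizes are hypothetical and should be matched to train.py and config.yml. It skips any batch containing an index the embedding or loss cannot handle, instead of letting the CUDA kernel assert:

import torch

def indices_in_range(phonemes: torch.Tensor, tokens: torch.Tensor,
                     num_phonemes: int, num_tokens: int) -> bool:
    # True only if every index is a valid class for the embedding and the loss.
    return (int(phonemes.min()) >= 0 and int(phonemes.max()) < num_phonemes
            and int(tokens.min()) >= 0 and int(tokens.max()) < num_tokens)

# Hypothetical use inside the training loop (adapt names to the real train.py):
# for _, batch in enumerate(train_loader):
#     phonemes, tokens = batch[0], batch[1]
#     if not indices_in_range(phonemes, tokens, num_phonemes=178, num_tokens=30000):
#         continue   # skip the offending sample instead of crashing the run
#     ...            # usual forward/backward pass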
