
nn.CrossEntropyLoss with invalid target corrupts memory, eventually leading to CUDA error: an illegal memory access #106467

Open
johngrabner opened this issue Aug 2, 2023 · 7 comments
Labels
module: loss Problem is related to loss function module: nn Related to torch.nn triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@johngrabner

johngrabner commented Aug 2, 2023

🐛 Describe the bug

The following code will, on occasion, hit the exit(-123) path, i.e., I have bad data.

        # predicted_line_mask: (N=16, C=4, H=512, W=512) logits
        assert predicted_line_mask.shape[0] == 16
        assert predicted_line_mask.shape[1] == 4
        assert predicted_line_mask.shape[2] == 512
        assert predicted_line_mask.shape[3] == 512

        # all_line_mask: (N=16, H=512, W=512) class indices, valid range [0, 3]
        assert all_line_mask.shape[0] == 16
        assert all_line_mask.shape[1] == 512
        assert all_line_mask.shape[2] == 512

        assert not torch.any(torch.isinf(predicted_line_mask))
        # Bail out if the target contains out-of-range class indices.
        if torch.any(all_line_mask > 3):
            print(all_line_mask.dtype)
            print(all_line_mask.min())
            print(all_line_mask.max())
            exit(-123)
        assert not torch.any(all_line_mask > 3)

        torch.cuda.synchronize()

        loss1 = self.criterion_line(predicted_line_mask, all_line_mask)

Typical code does not have all these checks and looks like:

intermediate = self(image)
predicted_line_mask = self.conv(intermediate)
loss1 = self.criterion_line(predicted_line_mask, all_line_mask)

In this case, the code will most often run without reporting any corruption or strange behavior, even though the bad data is present on every run, but I suspect it is corrupting GPU memory. On occasional runs it will report and exit with
RuntimeError: CUDA error: an illegal memory access was encountered
without any indication of where in the code something went wrong, leaving the impression that PyTorch just crashes on occasion.

Note that the bad data is present in all runs, but only occasional runs randomly crash.

Adding torch.cuda.synchronize() calls makes it possible to narrow down where the invalid memory access occurs.
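
(Editorial aside, not part of the original report: a standard alternative to sprinkling torch.cuda.synchronize() calls is to force synchronous kernel launches via the CUDA_LAUNCH_BLOCKING environment variable, so the failing kernel is reported at its call site. A minimal sketch:)

    import os
    # Must be set before torch initializes CUDA, e.g. at the very top of the script
    # (or export CUDA_LAUNCH_BLOCKING=1 in the shell before launching Python).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported after setting the variable so it takes effect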

It would be nice if nn.CrossEntropyLoss could validate the range of its target against the class dimension of its input and report the error in the input data rather than silently corrupting memory.
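
(For illustration, a minimal sketch of the kind of check being requested, done manually on the Python side before calling the loss; the helper name checked_cross_entropy is hypothetical and not part of torch:)

    import torch
    import torch.nn as nn

    def checked_cross_entropy(criterion: nn.CrossEntropyLoss,
                              logits: torch.Tensor,
                              target: torch.Tensor) -> torch.Tensor:
        # The class dimension is dim 1 for both (N, C) and (N, C, H, W) inputs.
        n_classes = logits.shape[1]
        # Valid targets are class indices in [0, n_classes) or the ignore_index.
        invalid = (target != criterion.ignore_index) & ((target < 0) | (target >= n_classes))
        if torch.any(invalid):
            raise ValueError(
                f"target values must be in [0, {n_classes - 1}], "
                f"got min={target.min().item()}, max={target.max().item()}"
            )
        return criterion(logits, target)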

Versions

Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27
Python version: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 525.125.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.13.1
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.22.3 py310hfa59a62_0
[conda] numpy-base 1.22.3 py310h9585f30_0
[conda] pytorch 1.13.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-lightning 1.5.10 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchtext 0.14.1 py310 pytorch
[conda] torchvision 0.14.1 py310_cu116 pytorch

cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki

@awgu awgu added the module: nn Related to torch.nn label Aug 2, 2023
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 3, 2023
@jbschlosser jbschlosser added the module: loss Problem is related to loss function label Aug 4, 2023
@jbschlosser
Contributor

FYI I'm not sure about this specific case, but often for CUDA kernels we don't perform such checks as they can be expensive.

@johngrabner
Author

I have been using PyTorch for a few years and always thought that PyTorch was buggy and crashed on occasion.
Now I know it could be bad data triggering memory corruption.
A sanity pass adding Python asserts to my code is probably advisable if many CUDA routines do not range-check their input. Thanks for the heads-up.

In 1980 I worked at a company where a thousand engineers were put on the task of tracking down a bug that caused memory corruption and occasional crashes, which gave the company a poor-quality reputation. It was a nightmare. I hope a range check can be added to this CUDA kernel and others so PyTorch does not get a tarnished reputation from bad user data.

@mikaylagawarecki
Contributor

mikaylagawarecki commented Aug 4, 2023

Hey @johngrabner, would it be viable to validate the targets (e.g. during your dataloading process) on CPU?

CUDA device-side asserts will likewise lead to a hard crash, so I am not sure whether they would help here.
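
(As an illustration of that suggestion, a minimal sketch of a CPU-side check in the data pipeline; the dataset class, its fields, and num_classes=4 are assumptions based on the shapes in the report:)

    import torch
    from torch.utils.data import Dataset

    class MaskDataset(Dataset):
        def __init__(self, samples, num_classes=4):
            self.samples = samples          # list of (image, mask) CPU tensors
            self.num_classes = num_classes

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            image, mask = self.samples[idx]
            # Validate on CPU, before the tensors ever reach a CUDA kernel.
            lo, hi = int(mask.min()), int(mask.max())
            if lo < 0 or hi >= self.num_classes:
                raise ValueError(
                    f"sample {idx}: mask values must be in [0, {self.num_classes - 1}], "
                    f"got min={lo}, max={hi}"
                )
            return image, mask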

@johngrabner
Author

Yes, I now have exhaustive range checks in my data loader and in my PyTorch Python code. I will also examine my other projects to add these checks.

The issue for you, the PyTorch developers, is that bad data from a user, unknown to the user, will cause PyTorch to crash in obscure ways, leading these people to incorrectly assume that PyTorch is poor quality rather than their data.

@ptrblck
Collaborator

ptrblck commented Aug 14, 2023

The 1.13 release had a few issues with missing device asserts, and newer versions should properly raise the assert again:

>>> import torch
>>> import torch.nn as nn
>>> 
>>> x = torch.randn(10, 10, requires_grad=True, device="cuda")
>>> y = torch.randint(0, 10, (10,), device="cuda")
>>> criterion = nn.CrossEntropyLoss()
>>> 
>>> loss = criterion(x, y)
>>> print(loss)
tensor(2.8109, device='cuda:0', grad_fn=<NllLossBackward0>)
>>> 
>>> y[0] = 10
>>> loss = criterion(x, y)
>>> print(loss)
../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

@mikaylagawarecki
Contributor

mikaylagawarecki commented Aug 14, 2023

@ptrblck Are you using 2.0.1 or main? I can't reproduce the device assert with that snippet when I build on main; does a specific build flag need to be set?

(edit: my bad, I see that I should be compiling with TORCH_USE_CUDA_DSA, rebuilding and retrying)

@ptrblck
Collaborator

ptrblck commented Aug 17, 2023

@mikaylagawarecki I used the 2.0.1 release. If the nightly isn't raising it anymore, I would consider this a resurfacing of the old issues from 1.13. You shouldn't need to rebuild PyTorch with TORCH_USE_CUDA_DSA, as 99+% of users won't be able to do so since they depend on pre-built binaries.
