
nn.CrossEntropyLoss with invalid target corrupts memory, eventually leading to CUDA error: an illegal memory access #106467

Open
johngrabner opened this issue Aug 2, 2023 · 7 comments
Labels
module: loss Problem is related to loss function module: nn Related to torch.nn triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@johngrabner

johngrabner commented Aug 2, 2023

🐛 Describe the bug

The following code will, on occasion, hit the exit(-123) path, i.e., I have bad data.

        # predicted_line_mask: (N=16, C=4, H=512, W=512) logits
        assert predicted_line_mask.shape[0] == 16
        assert predicted_line_mask.shape[1] == 4
        assert predicted_line_mask.shape[2] == 512
        assert predicted_line_mask.shape[3] == 512

        # all_line_mask: (N=16, H=512, W=512) class indices, valid range [0, 3]
        assert all_line_mask.shape[0] == 16
        assert all_line_mask.shape[1] == 512
        assert all_line_mask.shape[2] == 512

        assert not torch.any(torch.isinf(predicted_line_mask))
        # Bail out if the target contains out-of-range class indices.
        if torch.any(all_line_mask > 3):
            print(all_line_mask.dtype)
            print(all_line_mask.min())
            print(all_line_mask.max())
            exit(-123)
        assert not torch.any(all_line_mask > 3)

        torch.cuda.synchronize()

        loss1 = self.criterion_line(predicted_line_mask, all_line_mask)

Typical code does not have all these checks and looks like:

intermediate = self(image)
predicted_line_mask = self.conv(intermediate)
loss1 = self.criterion_line(predicted_line_mask, all_line_mask)

In this case, the code will most often run without reporting any corruption or strange behavior, even though the bad data is present on every run, but I suspect it is corrupting GPU memory. On occasional runs it will report and exit with
RuntimeError: CUDA error: an illegal memory access was encountered
without any indication of where in the code something went wrong, leaving the impression that PyTorch just crashes on occasion.

Note that the bad data is present in all runs, but only occasional runs randomly crash.

Adding torch.cuda.synchronize() calls makes it possible to narrow down where the invalid memory access occurs.
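
(Editorial aside, not part of the original report: a standard alternative to sprinkling torch.cuda.synchronize() calls is to force synchronous kernel launches via the CUDA_LAUNCH_BLOCKING environment variable, so the failing kernel is reported at its call site. A minimal sketch:)

    import os
    # Must be set before torch initializes CUDA, e.g. at the very top of the script
    # (or export CUDA_LAUNCH_BLOCKING=1 in the shell before launching Python).
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # imported after setting the variable so it takes effect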

It would be nice if nn.CrossEntropyLoss could validate the range of its target against the class dimension of its input and report the error in the input data rather than silently corrupting memory.
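
(For illustration, a minimal sketch of the kind of check being requested, done manually on the Python side before calling the loss; the helper name checked_cross_entropy is hypothetical and not part of torch:)

    import torch
    import torch.nn as nn

    def checked_cross_entropy(criterion: nn.CrossEntropyLoss,
                              logits: torch.Tensor,
                              target: torch.Tensor) -> torch.Tensor:
        # The class dimension is dim 1 for both (N, C) and (N, C, H, W) inputs.
        n_classes = logits.shape[1]
        # Valid targets are class indices in [0, n_classes) or the ignore_index.
        invalid = (target != criterion.ignore_index) & ((target < 0) | (target >= n_classes))
        if torch.any(invalid):
            raise ValueError(
                f"target values must be in [0, {n_classes - 1}], "
                f"got min={target.min().item()}, max={target.max().item()}"
            )
        return criterion(logits, target)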

Versions

Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27
Python version: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 525.125.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.13.1
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.22.3 py310hfa59a62_0
[conda] numpy-base 1.22.3 py310h9585f30_0
[conda] pytorch 1.13.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-lightning 1.5.10 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchtext 0.14.1 py310 pytorch
[conda] torchvision 0.14.1 py310_cu116 pytorch

cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki

@awgu awgu added the module: nn Related to torch.nn label Aug 2, 2023
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 3, 2023
@jbschlosser jbschlosser added the module: loss Problem is related to loss function label Aug 4, 2023
@jbschlosser
Contributor

FYI I'm not sure about this specific case, but often for CUDA kernels we don't perform such checks as they can be expensive.

@johngrabner
Author

I have been using PyTorch for a few years and always thought that PyTorch was buggy and crashed on occasion.
Now I know it could be bad data triggering memory corruption.
A sanity pass adding Python asserts to my code is probably advisable if many CUDA routines do not range-check their input. Thanks for the heads-up.

In 1980 I worked at a company where a thousand engineers were put on the task of tracking down a bug that caused memory corruption and occasional crashes, which gave the company a poor-quality reputation. It was a nightmare. I hope a range check can be added to this CUDA kernel and others so PyTorch does not get a tarnished reputation from bad user data.

@mikaylagawarecki
Contributor

mikaylagawarecki commented Aug 4, 2023

Hey @johngrabner, would it be viable to validate the targets (e.g. during your dataloading process) on CPU?

CUDA device-side asserts will likewise lead to a hard crash, so I am not sure whether they would help here.
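
(As an illustration of that suggestion, a minimal sketch of a CPU-side check in the data pipeline; the dataset class, its fields, and num_classes=4 are assumptions based on the shapes in the report:)

    import torch
    from torch.utils.data import Dataset

    class MaskDataset(Dataset):
        def __init__(self, samples, num_classes=4):
            self.samples = samples          # list of (image, mask) CPU tensors
            self.num_classes = num_classes

        def __len__(self):
            return len(self.samples)

        def __getitem__(self, idx):
            image, mask = self.samples[idx]
            # Validate on CPU, before the tensors ever reach a CUDA kernel.
            lo, hi = int(mask.min()), int(mask.max())
            if lo < 0 or hi >= self.num_classes:
                raise ValueError(
                    f"sample {idx}: mask values must be in [0, {self.num_classes - 1}], "
                    f"got min={lo}, max={hi}"
                )
            return image, mask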

@johngrabner
Author

Yes, I now have exhaustive range checks in my data loader and in my PyTorch Python code. I will also examine my other projects to add these checks.

The issue for you, the PyTorch developers, is that bad data from a user, unknown to the user, will cause PyTorch to crash in obscure ways, leading these people to incorrectly assume that PyTorch is poor quality rather than their data.

@ptrblck
Collaborator

ptrblck commented Aug 14, 2023

The 1.13 release had a few issues with missing device asserts, and newer versions should properly raise the assert again:

>>> import torch
>>> import torch.nn as nn
>>> 
>>> x = torch.randn(10, 10, requires_grad=True, device="cuda")
>>> y = torch.randint(0, 10, (10,), device="cuda")
>>> criterion = nn.CrossEntropyLoss()
>>> 
>>> loss = criterion(x, y)
>>> print(loss)
tensor(2.8109, device='cuda:0', grad_fn=<NllLossBackward0>)
>>> 
>>> y[0] = 10
>>> loss = criterion(x, y)
>>> print(loss)
../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

@mikaylagawarecki
Contributor

mikaylagawarecki commented Aug 14, 2023

@ptrblck Are you using 2.0.1 or main? I can't reproduce the device assert with that snippet when I build on main; does a specific build flag need to be set?

(edit: my bad, I see that I should be compiling with TORCH_USE_CUDA_DSA, rebuilding and retrying)

@ptrblck
Collaborator

ptrblck commented Aug 17, 2023

@mikaylagawarecki I used the 2.0.1 release. If the nightly isn't raising it anymore, I would consider this a resurfacing of the old issues from 1.13. You shouldn't need to rebuild PyTorch with TORCH_USE_CUDA_DSA, as 99+% of users won't be able to do so since they depend on pre-built binaries.
