nn.CrossEntropyLoss with an invalid target corrupts memory, eventually leading to "CUDA error: an illegal memory access" #106467
Comments
FYI I'm not sure about this specific case, but often for CUDA kernels we don't perform such checks as they can be expensive.
I have been using PyTorch for a few years, and I always assumed it was simply buggy and crashed on occasion. In 1980 I worked at a company that put a thousand engineers on the task of tracking down a memory-corruption bug whose occasional crashes gave the company a poor-quality reputation. It was a nightmare. I hope a range check can be added to this CUDA kernel (and others) so PyTorch does not get a tarnished reputation from bad user data.
Hey @johngrabner, would it be viable to validate the targets on your side? CUDA device-side asserts will likewise lead to a hard crash, so I am not sure whether they would help here.
Yes, I now have exhaustive range checks in my data loader and in my PyTorch Python code, and I will audit my other projects to add these checks as well. The issue for you, the developers of PyTorch, is that bad data from a user, unknown to the user, will cause PyTorch to crash in obscure ways, leading those people to incorrectly assume that PyTorch is poor quality rather than their data.
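The "range checks in my data loader" mentioned above could be sketched like this (the wrapper class and its names are hypothetical, not from the issue): validate each label against the number of classes at access time, so a bad sample fails with a clear Python error long before it reaches a CUDA kernel.

```python
import torch
from torch.utils.data import Dataset

class CheckedDataset(Dataset):
    """Hypothetical dataset wrapper that range-checks labels on access."""

    def __init__(self, features, labels, num_classes):
        self.features = features
        self.labels = labels
        self.num_classes = num_classes

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        y = int(self.labels[idx])
        # Fail fast with a readable message instead of corrupting GPU memory later.
        if not 0 <= y < self.num_classes:
            raise ValueError(
                f"label {y} at index {idx} is outside [0, {self.num_classes})")
        return self.features[idx], y

ds = CheckedDataset(torch.randn(4, 10),
                    torch.tensor([0, 9, 3, 10]),  # last label is invalid
                    num_classes=10)
_ = ds[0]  # valid sample passes through unchanged
try:
    _ = ds[3]  # label 10 is out of range for 10 classes
except ValueError as e:
    caught = str(e)
```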
The following snippet reproduces the device-side assert:
>>> import torch
>>> import torch.nn as nn
>>>
>>> x = torch.randn(10, 10, requires_grad=True, device="cuda")
>>> y = torch.randint(0, 10, (10,), device="cuda")
>>> criterion = nn.CrossEntropyLoss()
>>>
>>> loss = criterion(x, y)
>>> print(loss)
tensor(2.8109, device='cuda:0', grad_fn=<NllLossBackward0>)
>>>
>>> y[0] = 10
>>> loss = criterion(x, y)
>>> print(loss)
../aten/src/ATen/native/cuda/Loss.cu:240: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
@ptrblck Are you using 2.0.1 or main? I can't reproduce the device assert with that snippet when I build on main, does a specific build flag have to be set? (edit: my bad, I see that I should be compiling with TORCH_USE_CUDA_DSA, rebuilding and retrying)
@mikaylagawarecki I used the
🐛 Describe the bug
The following code will, on occasion, generate an exit(-123), i.e. detect that I have bad data.
Typical code does not have all these checks and looks like:
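The original snippet was not included in this copy of the issue. A minimal sketch (the model, shapes, and optimizer settings are assumed, not from the report) of such a typical training step, with no target validation anywhere between the data loader and the loss:

```python
import torch
import torch.nn as nn

# A typical unchecked training step: nothing here guards against a label
# that is >= the number of classes before it reaches CrossEntropyLoss.
model = nn.Linear(10, 10)               # 10 input features, 10 classes
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10)                  # batch of features
y = torch.randint(0, 10, (8,))          # labels straight from the loader, unchecked

optimizer.zero_grad()
loss = criterion(model(x), y)           # an out-of-range label would fault here on CUDA
loss.backward()
optimizer.step()
```

On CPU an out-of-range target raises an immediate error; the issue is that on CUDA the same mistake can corrupt memory silently instead.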
In this case, the code will most often run without reporting any corruption or strange behavior, even though the bad data is present on every run, but I suspect it is corrupting GPU memory. On occasional runs it will report
RuntimeError: CUDA error: an illegal memory access was encountered
and exit without any indication of where in the code something went bad, leaving the impression that PyTorch just crashes on occasion.
Note that the bad data is present in every run, but only occasional runs crash, and they do so at random.
Adding torch.cuda.synchronize() calls (or running with CUDA_LAUNCH_BLOCKING=1) allows the invalid memory access to be narrowed down to the offending call.
It would be nice if nn.CrossEntropyLoss could validate the range of its target against the dimensions of its input and report the error in the input data, rather than silently corrupting memory.
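Until such a check exists in the kernel, the requested validation can be done on the user side. A sketch (the wrapper function and its name are hypothetical, not a PyTorch API) that fails fast with a clear message instead of letting an out-of-range target reach the CUDA kernel:

```python
import torch
import torch.nn as nn

def checked_cross_entropy(logits, target, criterion=nn.CrossEntropyLoss()):
    """Hypothetical user-side guard around CrossEntropyLoss."""
    n_classes = logits.size(-1)
    # The condition mirrors the kernel's assert: t >= 0 && t < n_classes.
    bad = (target < 0) | (target >= n_classes)
    if bad.any():
        idx = int(bad.nonzero()[0])
        raise ValueError(
            f"target[{idx}] = {int(target[idx])} is outside [0, {n_classes})")
    return criterion(logits, target)

x = torch.randn(10, 10)
y = torch.randint(0, 10, (10,))
loss = checked_cross_entropy(x, y)   # valid targets: behaves like the plain loss
y[0] = 10                            # same corruption trigger as in the repro above
try:
    checked_cross_entropy(x, y)
except ValueError as e:
    msg = str(e)
```

The comparison runs on whatever device the target lives on, so the check works for CUDA tensors too, at the cost of one extra device-to-host sync per call.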
Versions
Collecting environment information...
PyTorch version: 1.13.1
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.27
Python version: 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-76-generic-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.6.124
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000
Nvidia driver version: 525.125.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] pytorch-lightning==1.5.10
[pip3] torch==1.13.1
[pip3] torchelastic==0.2.2
[pip3] torchmetrics==0.11.4
[pip3] torchtext==0.14.1
[pip3] torchvision==0.14.1
[conda] blas 1.0 mkl
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py310h7f8727e_0
[conda] mkl_fft 1.3.1 py310hd6ae3a3_0
[conda] mkl_random 1.2.2 py310h00e6091_0
[conda] numpy 1.22.3 py310hfa59a62_0
[conda] numpy-base 1.22.3 py310h9585f30_0
[conda] pytorch 1.13.1 py3.10_cuda11.6_cudnn8.3.2_0 pytorch
[conda] pytorch-cuda 11.6 h867d48c_1 pytorch
[conda] pytorch-lightning 1.5.10 pypi_0 pypi
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchtext 0.14.1 py310 pytorch
[conda] torchvision 0.14.1 py310_cu116 pytorch
cc @albanD @mruberry @jbschlosser @walterddr @mikaylagawarecki