
Misaligned Address / Lane User Stack Overflow in cunn_SpatialSoftmax #56325

Closed · ptrblck opened this issue Apr 17, 2021 · 5 comments

Labels: high priority · module: cuda (related to torch.cuda and CUDA support in general) · triage review

ptrblck (Collaborator) commented Apr 17, 2021

🐛 Bug

Reported in the forum by cameronb (thanks for reporting this issue!)

To Reproduce

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
import torch
import torch.nn as nn

device = torch.device("cuda")

def make_token_tensor(id, vocab_len, should_squeeze=True):
  # print("Making token tensor of id:", id)
  t = torch.zeros(vocab_len).to(device)
  t[id] = 1
  if should_squeeze:
    return t.unsqueeze(0).unsqueeze(0)
  else:
    return t

h_size = 1536 # The Hidden size that goes into the decoder
o_size = 30522 # 30522 = Vocabulary size of default BERT tokenizer

class DecoderRNN(nn.Module):
    def __init__(self, hidden_size, output_size):
        super(DecoderRNN, self).__init__()
        self.hidden_size = hidden_size

        self.gru = nn.GRU(output_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        output, hidden = self.gru(input, hidden)
        output = self.softmax(self.out(output[0]))
        return output, hidden


decoder = DecoderRNN(h_size, o_size)
decoder.to(device)

# The initial inputs to the decoder
d_hidden = torch.rand((1,1,h_size)).to(device)
prev_token_pred = make_token_tensor(1, o_size) # Has dimensions 1 x 1 x o_size

ans_tokens = [1, 2, 3, 4] # Imagine that in a real model these would be used for teacher forcing
max_len = len(ans_tokens)
seq_preds = []
for i in range(max_len):
  token_pred, d_hidden = decoder(prev_token_pred, d_hidden)
  prev_token_id = torch.argmax(token_pred)
  prev_token_pred = make_token_tensor(prev_token_id, o_size)
  seq_preds.append(token_pred.squeeze(0))
test_preds = torch.stack(seq_preds)

loss = nn.NLLLoss()
input = test_preds
# each element in target has to have 0 <= value < C
target = torch.tensor(ans_tokens).to(device)
output = loss(input, target)
output.backward()

Original error message:

CUDA error: Misaligned Address

$pc info:

(cuda-gdb) x/4i $pc-32
   0x5579fb8d3ab0 <_ZN2at6native78_GLOBAL__N__54_tmpxft_00004b40_00000000_13_SoftMax_compute_86_cpp1_ii_a331004220cunn_SoftMaxBackwardILi4EfffNS1_26LogSoftMaxBackwardEpilogueEEEvPT0_PT2_S7_i+1200>:   SHF.L.U32 R12, R13, 0x2, RZ
   0x5579fb8d3ac0 <_ZN2at6native78_GLOBAL__N__54_tmpxft_00004b40_00000000_13_SoftMax_compute_86_cpp1_ii_a331004220cunn_SoftMaxBackwardILi4EfffNS1_26LogSoftMaxBackwardEpilogueEEEvPT0_PT2_S7_i+1216>:   ISETP.GE.AND P2, PT, R12, R15, PT
=> 0x5579fb8d3ad0 <_ZN2at6native78_GLOBAL__N__54_tmpxft_00004b40_00000000_13_SoftMax_compute_86_cpp1_ii_a331004220cunn_SoftMaxBackwardILi4EfffNS1_26LogSoftMaxBackwardEpilogueEEEvPT0_PT2_S7_i+1232>:   FADD R8, R4, R9
   0x5579fb8d3ae0 <_ZN2at6native78_GLOBAL__N__54_tmpxft_00004b40_00000000_13_SoftMax_compute_86_cpp1_ii_a331004220cunn_SoftMaxBackwardILi4EfffNS1_26LogSoftMaxBackwardEpilogueEEEvPT0_PT2_S7_i+1248>:   FADD R9, R5, R8
(cuda-gdb) info registers $R8 $R4 $R9
R8             0xffff88c5          -30523
R4             0x0                 0
R9             0x0                 0
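
For context, "misaligned address" is the error CUDA raises when a vectorized load or store (for example, the 128-bit accesses used on the ILP code path) goes through a pointer that is not aligned to the access width. A minimal standalone sketch, hypothetical code unrelated to the PyTorch kernel, that triggers the same error class:

// Hypothetical sketch: a float4 load requires 16-byte alignment, so
// offsetting an aligned buffer by one float (4 bytes) makes the
// vectorized load fault with "misaligned address".
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vec_load(const float* p, float* out) {
  float4 v = *reinterpret_cast<const float4*>(p);  // 128-bit load
  *out = v.x + v.y + v.z + v.w;
}

int main() {
  float *buf = nullptr, *out = nullptr;
  cudaMalloc(&buf, 64 * sizeof(float));  // cudaMalloc returns aligned memory
  cudaMalloc(&out, sizeof(float));
  vec_load<<<1, 1>>>(buf + 1, out);      // buf + 1 is only 4-byte aligned
  cudaError_t err = cudaDeviceSynchronize();
  printf("%s\n", cudaGetErrorString(err));  // expect "misaligned address"
  cudaFree(buf);
  cudaFree(out);
  return 0;
}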

After rebuilding with -g -G the error changes to:

CUDA_EXCEPTION_2, Lane User Stack Overflow.

Backtrace:

(cuda-gdb) bt
#0  0x000055abe3dafad0 in void at::native::ReduceOp<float, at::native::ArgMaxOps<float>, unsigned int, long, 4>::run<1>() const ()
#1  0x000055abe292f720 in void at::native::reduce_kernel<512, 1, at::native::ReduceOp<float, at::native::ArgMaxOps<float>, unsigned int, long, 4> >(at::native::ReduceOp<float, at::native::ArgMaxOps<float>, unsigned int, long, 4>)
   <<<(1,1,1),(512,1,1)>>> ()

I'm currently unsure whether the stack overflow is caused by the debug flags or whether it's the real issue.

Anyway, both issues point to cunn_SpatialSoftMax.

Environment

PyTorch version: 1.9.0a0+2ecb2c7
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.19.6

Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 11.3.58
GPU models and configuration:
GPU 0: A100-SXM4-40GB
[...]

Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
[..]
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] nvidia-dlprof-pytorch-nvtx==1.1.0
[pip3] pytorch-quantization==2.1.0
[pip3] pytorch-transformers==1.1.0
[pip3] torch==1.9.0a0+2ecb2c7
[pip3] torchtext==0.10.0a0
[pip3] torchvision==0.9.0a0
[conda] magma-cuda110             2.5.2                         5    local
[conda] mkl                       2019.4                      243
[conda] mkl-include               2019.4                      243
[conda] nomkl                     3.0                           0
[conda] numpy                     1.19.2           py38h6163131_0
[conda] numpy-base                1.19.2           py38h75fe3a5_0
[conda] nvidia-dlprof-pytorch-nvtx 1.1.0                    pypi_0    pypi
[conda] pytorch-quantization      2.1.0                    pypi_0    pypi
[conda] pytorch-transformers      1.1.0                    pypi_0    pypi
[conda] torch                     1.9.0a0+2ecb2c7           dev_0    <develop>
[conda] torchtext                 0.10.0a0                 pypi_0    pypi
[conda] torchvision               0.9.0a0                  pypi_0    pypi

@eqy would you like to take a shot at it?

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ngimel

ngimel added the module: cuda and high priority labels on Apr 19, 2021
ngimel (Collaborator) commented Apr 19, 2021

High priority for a crash.

eqy (Collaborator) commented Apr 19, 2021

After a quick look, the failure appears to be in the call to blockReduce. It doesn't look like sdata is misaligned, so another part of the setup for the reduction may be incorrect. I'll take a deeper look.
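
For reference, blockReduce here is the block-wide reduction over the shared-memory buffer sdata in SoftMax.cu. A simplified sketch of the pattern, assuming the usual shared-memory tree reduction rather than copying the actual implementation:

// Simplified sketch of a block-wide tree reduction over shared memory
// (assumption: illustrates the pattern only, not PyTorch's blockReduce;
// assumes blockDim.x is a power of two).
template <typename T, typename Op>
__device__ T blockReduce(T* sdata, T val, Op op) {
  sdata[threadIdx.x] = val;
  __syncthreads();
  // Halve the active range each step, combining pairs of partial results.
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) {
      sdata[threadIdx.x] = op(sdata[threadIdx.x], sdata[threadIdx.x + s]);
    }
    __syncthreads();
  }
  return sdata[0];
}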

ngimel (Collaborator) commented Apr 19, 2021

ilpReduce should be called with grad_output_shift, and not shift

shift, gradOutput, classes, AddFloat<outscalar_t, accscalar_t>(), accscalar_t(0));
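
In other words, the backward kernel computes a separate alignment shift for each pointer but then reuses gradInput's shift for the gradOutput reduction. A paraphrased sketch of the fix (abbreviated; names follow the comment above rather than the exact code in aten/src/ATen/native/cuda/SoftMax.cu):

// gradInput and gradOutput are separate allocations, so their offsets from a
// 16-byte boundary can differ; each pointer needs its own shift so the ILP
// reduction peels off the right number of unaligned leading elements.
const int shift             = ((uint64_t)gradInput)  % ALIGN_BYTES / sizeof(scalar_t);
const int grad_output_shift = ((uint64_t)gradOutput) % ALIGN_BYTES / sizeof(outscalar_t);

accscalar_t threadSum = ilpReduce<AddFloat, ILP>(
    grad_output_shift,  // bug: `shift` (gradInput's offset) was passed here
    gradOutput, classes, AddFloat<outscalar_t, accscalar_t>(), accscalar_t(0));

Passing gradInput's shift means the wrong number of leading elements gets peeled off gradOutput, so the subsequent vectorized loads can land on unaligned addresses, which matches the misaligned-address trap shown earlier.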

eqy (Collaborator) commented Apr 19, 2021

ilpReduce should be called with grad_output_shift, and not shift

shift, gradOutput, classes, AddFloat<outscalar_t, accscalar_t>(), accscalar_t(0));

Yup, can confirm this fixes the issue on V100.

eqy added a commit to eqy/pytorch that referenced this issue Apr 19, 2021
facebook-github-bot pushed a commit that referenced this issue Apr 20, 2021
Summary:
CC ngimel ptrblck
ref: #56325

Pull Request resolved: #56403

Reviewed By: mruberry

Differential Revision: D27866625

Pulled By: ngimel

fbshipit-source-id: 9dff0e9749f8de57fac6a653f685c14854611a02
ngimel (Collaborator) commented Apr 26, 2021

Fixed in #56304

ngimel closed this as completed on Apr 26, 2021
krshrimali pushed a commit to krshrimali/pytorch that referenced this issue May 19, 2021