Anomaly detection: Error detected in CudnnRnnBackward0 #65301

Open
cowwoc opened this issue Sep 18, 2021 · 18 comments
Labels
module: cuda - Related to torch.cuda, and CUDA support in general
module: cudnn - Related to torch.backends.cudnn, and CuDNN support
module: rnn - Issues related to RNN support (LSTM, GRU, etc.)
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments


cowwoc commented Sep 18, 2021

🐛 Bug

To Reproduce

When I run output, _ = self.gru(output) I get the following traceback:

[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in CudnnRnnBackward0. Traceback of forward call that caused the error:
  File "C:\Users\Gili\AppData\Roaming\JetBrains\IntelliJIdea2021.2\plugins\python\helpers\pydev\pydevd.py", line 2173, in <module>
    main()
  File "C:\Users\Gili\AppData\Roaming\JetBrains\IntelliJIdea2021.2\plugins\python\helpers\pydev\pydevd.py", line 2164, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Users\Gili\AppData\Roaming\JetBrains\IntelliJIdea2021.2\plugins\python\helpers\pydev\pydevd.py", line 1476, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "C:\Users\Gili\AppData\Roaming\JetBrains\IntelliJIdea2021.2\plugins\python\helpers\pydev\pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Users\Gili\AppData\Roaming\JetBrains\IntelliJIdea2021.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 785, in <module>
    main()
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 771, in main
    tune_hyperparameters(java_dataset)
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 743, in tune_hyperparameters
    study.optimize(lambda trial: optimize_train(trial, java_dataset),
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\study.py", line 400, in optimize
    _optimize(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 66, in _optimize
    _optimize_sequential(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 163, in _optimize_sequential
    trial = _run_trial(study, func, catch)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\optuna\study\_optimize.py", line 213, in _run_trial
    value_or_values = func(trial)
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 743, in <lambda>
    study.optimize(lambda trial: optimize_train(trial, java_dataset),
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 708, in optimize_train
    return train(dataset, learning_rate, max_epochs, seq2seq_type, seq2seq_layers, linear_layers,
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 492, in train
    trainer.fit(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 552, in fit
    self._run(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 917, in _run
    self._dispatch()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 995, in run_stage
    return self._run_train()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 100, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 147, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 395, in _optimizer_step
    model_ref.optimizer_step(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\lightning.py", line 1616, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\optimizer.py", line 206, in step
    self.__optimizer_step(closure=closure, profiler_name=profiler_name, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\optimizer.py", line 128, in __optimizer_step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 292, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\precision\native_amp.py", line 57, in pre_optimizer_step
    result = lambda_closure()  # native amp does not support closures
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 235, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 536, in training_step_and_backward
    result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 306, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 193, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 172, in training_step
    return self.model.training_step(*args, **kwargs)
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 308, in training_step
    actual = self(input)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:/Users/Gili/Documents/myproject/aggregator/src/main/python/com.mycompany.ai/predict_outdoor_temperature.py", line 262, in forward
    output, _ = self.gru(output)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\rnn.py", line 849, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
 (function _print_stack)

Unfortunately, I don't have a minimal testcase to share with you, but feel free to ask me for any more information you need.

Environment

PyTorch version: 1.10.0.dev20210918+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19043-SP0
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 471.96
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.7
[pip3] torch==1.10.0.dev20210918+cu113
[pip3] torch-tb-profiler==0.2.1
[pip3] torchmetrics==0.5.1
[pip3] torchvision==0.11.0.dev20210918+cu113
[conda] Could not collect

cc @csarofeen @ptrblck @xwang233 @zou3519 @ngimel


cowwoc commented Sep 18, 2021

Moving from CUDA to CPU, I now get this error: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1065, 618]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!


cowwoc commented Sep 18, 2021

Running the same code against pip3 torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio===0.9.0 on the GPU fails with a different error: Function 'CudnnRnnBackward' returned nan values in its 0th output.

Notice that none of these failures occur at the beginning of training; they occur a bit into an epoch.


cowwoc commented Sep 18, 2021

The Torch 1.10 failure is likely still a genuine issue, but 1.9 might be suffering from #35666.

I started getting these errors around the time I moved from BatchNorm to LayerNorm and 16-bit floats.


cowwoc commented Sep 19, 2021

Interestingly, the error only occurs when DataLoader.batch_size is too large to fit in GPU memory. What makes this problematic is that if I catch the exception, reduce the batch size, and retry, I then get RuntimeError: CUDA error: an illegal memory access was encountered, triggered when I invoke torch.cuda.manual_seed_all(seed), which in turn calls default_generator.manual_seed(seed). This is native code, so I have no idea why it is failing.

Do I need to do something special to reset the GPU state before retrying the operation with a smaller batch size?
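
A minimal sketch of the reset-and-retry pattern in question, for illustration only: gc.collect(), torch.cuda.synchronize(), and torch.cuda.empty_cache() are the only documented cleanup calls used, train_once is a hypothetical callable, and this cleanup is not a confirmed workaround for the illegal memory access described above.

import gc

import torch


def retry_with_smaller_batches(train_once, batch_size: int):
    # train_once(batch_size) is a hypothetical callable that runs one full training attempt.
    while batch_size > 0:
        try:
            return train_once(batch_size)
        except RuntimeError as e:
            message = repr(e)
            if "out of memory" not in message and "CUDNN_STATUS_EXECUTION_FAILED" not in message:
                raise
            print(message)
            # Best-effort cleanup: drop dangling references first, then wait for
            # outstanding kernels and release cached blocks back to the driver.
            gc.collect()
            torch.cuda.synchronize()
            torch.cuda.empty_cache()
            batch_size //= 2
            print(f"Reducing batch_size to {batch_size}")
    raise RuntimeError("batch_size reduced to zero without a successful run")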


cowwoc commented Sep 20, 2021

Here is a minimal testcase for this bug:

import math
import os
from typing import Optional, Tuple

import pytorch_lightning as pl
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.utilities.types import STEP_OUTPUT
from torch import Tensor
from torch.utils.data import DataLoader, Subset, Dataset

DETERMINISTIC = True
DETERMINISTIC_SEED = 41
# Debugging "one of the variables needed for gradient computation has been modified by an inplace operation"
torch.autograd.set_detect_anomaly(True)


class OutdoorTemperatureDataset(Dataset):
    def __init__(self, batch_size: int):
        self.batch_size = batch_size
        self.input_horizon = 60 * 5 * 2
        self.output_horizon = 1
        self.total_horizon = self.input_horizon + self.output_horizon
        # TODO: Why does the crash only occur if the tensor contains at least (batch_size + 134) entries?
        self.outdoor_temperature = torch.tensor([1.0]).repeat(batch_size + 134, self.total_horizon)

    def __getitem__(self, index) -> Tuple[Tensor, Tensor]:
        samples = torch.stack([self.outdoor_temperature[index]])
        # Convert [features, samples] to [samples, features]
        samples = samples.permute(1, 0)
        x = samples[:self.input_horizon, :]
        y = samples[self.input_horizon:, 0]
        return x, y

    def __len__(self):
        return self.outdoor_temperature.shape[0]


class ProcessContext:
    def __init__(self, dataset: OutdoorTemperatureDataset):
        self.input_horizon = dataset.input_horizon
        self.output_horizon = dataset.output_horizon
        train_size = max(1,
                         min(len(dataset) - 1,
                             math.ceil(len(dataset) * 0.9)))
        val_size = len(dataset) - train_size
        assert train_size > 0
        assert val_size > 0
        self.train_dataset, self.val_dataset = torch.utils.data.random_split(
            Subset(dataset, range(0, (train_size + val_size))),
            [train_size, val_size])

    def get_train_dataset(self):
        return self.train_dataset

    def get_validation_dataset(self):
        return self.val_dataset

    def get_model(self, learning_rate: float, max_epochs: int, hidden_layer_size: int, batch_size: int):
        return Predictor(self.train_dataset, self.val_dataset, self.input_horizon, self.output_horizon,
                         learning_rate=learning_rate, max_epochs=max_epochs,
                         hidden_layer_size=hidden_layer_size, batch_size=batch_size)


class Predictor(LightningModule):
    def __init__(self, train_dataset: Dataset, val_dataset: Dataset, input_horizon: int, output_horizon: int,
                 learning_rate: float, max_epochs: int, hidden_layer_size: int, batch_size: int):
        super(Predictor, self).__init__()
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.input_horizon = input_horizon
        self.output_horizon = output_horizon
        self.total_horizon = self.input_horizon + self.output_horizon
        self.max_epochs = max_epochs
        self.learning_rate = learning_rate
        self.hidden_layer_size = hidden_layer_size

        self.input_norm = nn.LayerNorm(1)
        self.layer_norm = nn.LayerNorm(self.hidden_layer_size)
        self.gru = nn.GRU(1, self.hidden_layer_size, 1)

        self.linear_layer = nn.Linear(self.hidden_layer_size, self.output_horizon)
        self.loss_function = F.mse_loss
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(dataset=self.train_dataset, batch_size=self.batch_size, shuffle=True,
                          pin_memory=True)

    def val_dataloader(self):
        return DataLoader(dataset=self.val_dataset, batch_size=self.batch_size, pin_memory=True)

    def configure_optimizers(self):
        return optim.Adam(self.parameters(), lr=self.learning_rate)

    def forward(self, input):
        output = self.input_norm(input)
        # Input shape is [batch, sequence, feature] but lstm/gru expects [sequence, batch, feature]
        output = output.permute(1, 0, 2)
        output, _ = self.gru(output)
        # Extract the hidden layer of the last element of the sequence
        output = output[-1, :, :]
        output = F.relu(output)
        output = self.layer_norm(output)
        output = self.linear_layer(output)
        return output

    def training_step(self, batch, batch_idx) -> STEP_OUTPUT:
        input, expected = batch

        actual = self(input)
        return self.loss_function(actual, expected)

    def validation_step(self, batch, batch_index) -> Optional[STEP_OUTPUT]:
        input, expected = batch

        actual = self(input)
        return self.loss_function(actual, expected)


def train(dataset: OutdoorTemperatureDataset, learning_rate: float, max_epochs: int,
          hidden_layer_size: int) -> float:
    process_context = ProcessContext(dataset)
    model = process_context.get_model(learning_rate, max_epochs, hidden_layer_size, dataset.batch_size)
    model.learning_rate = learning_rate

    trainer = Trainer(gpus=-1, benchmark=not DETERMINISTIC, precision=16, weights_summary=None,
                      max_epochs=max_epochs, deterministic=DETERMINISTIC)
    trainer.fit(model)
    return trainer.logged_metrics["val_loss"]


def main():
    if DETERMINISTIC:
        # https://pytorch.org/docs/stable/notes/randomness.html
        pl.seed_everything(DETERMINISTIC_SEED, workers=True)
        torch.use_deterministic_algorithms(True)
        # https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html#torch.nn.LSTM
        os.putenv("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    LEARNING_RATE = 0.001
    MAX_EPOCHS = 1000
    HIDDEN_LAYER_SIZE = 64
    batch_size = 10240
    dataset = OutdoorTemperatureDataset(batch_size)
    while True:
        try:
            train(dataset, LEARNING_RATE, MAX_EPOCHS, HIDDEN_LAYER_SIZE)
        except RuntimeError as e:
            message = repr(e)
            if "CUDNN_STATUS_EXECUTION_FAILED" in message or "CUDA out of memory" in message:
                print(message)
                batch_size = batch_size // 2
                if batch_size <= 0:
                    raise e
                print(f"Reducing batch_size to {batch_size}")
            else:
                raise e


if __name__ == "__main__":
    main()

My output is:

Global seed set to 41
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\data_loading.py:105: UserWarning: The dataloader, val dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s] Global seed set to 41
C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\data_loading.py:105: UserWarning: The dataloader, train dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 16 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\data_loading.py:326: UserWarning: The number of training samples (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
[W ..\torch\csrc\autograd\python_anomaly_mode.cpp:104] Warning: Error detected in CudnnRnnBackward. Traceback of forward call that caused the error:
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 164, in <module>
    main()
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 150, in main
    train(dataset, LEARNING_RATE, MAX_EPOCHS, HIDDEN_LAYER_SIZE)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 132, in train
    trainer.fit(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 552, in fit
    self._run(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 917, in _run
    self._dispatch()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 995, in run_stage
    return self._run_train()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\epoch\training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 100, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 147, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 395, in _optimizer_step
    model_ref.optimizer_step(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\lightning.py", line 1616, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\optimizer.py", line 206, in step
    self.__optimizer_step(closure=closure, profiler_name=profiler_name, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\optimizer.py", line 128, in __optimizer_step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 292, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\precision\native_amp.py", line 57, in pre_optimizer_step
    result = lambda_closure()  # native amp does not support closures
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 235, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 536, in training_step_and_backward
    result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\loops\batch\training_batch_loop.py", line 306, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 193, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 172, in training_step
    return self.model.training_step(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 114, in training_step
    actual = self(input)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 103, in forward
    output, _ = self.gru(output)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\rnn.py", line 837, in forward
    result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
 (function _print_stack)
RuntimeError('cuDNN error: CUDNN_STATUS_EXECUTION_FAILED')
Reducing batch_size to 5120
Using native 16bit precision.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 164, in <module>
    main()
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 160, in main
    raise e
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 150, in main
    train(dataset, LEARNING_RATE, MAX_EPOCHS, HIDDEN_LAYER_SIZE)
  File "C:\Users\Gili\Documents\myproject\aggregator\src\main\python\ai\testcase.py", line 132, in train
    trainer.fit(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 552, in fit
    self._run(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 873, in _run
    self.accelerator.setup(self, model)  # note: this sets up self.lightning_module
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\gpu.py", line 42, in setup
    return super().setup(trainer, model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 86, in setup
    self.setup_training_type_plugin(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 339, in setup_training_type_plugin
    self.training_type_plugin.setup(model)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\single_device.py", line 67, in setup
    self.model_to_device()
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\plugins\training_type\single_device.py", line 64, in model_to_device
    self._model.to(self.root_device)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\pytorch_lightning\core\mixins\device_dtype_mixin.py", line 109, in to
    return super().to(*args, **kwargs)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 852, in to
    return self._apply(convert)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 530, in _apply
    module._apply(fn)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 552, in _apply
    param_applied = fn(param)
  File "C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\nn\modules\module.py", line 850, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: an illegal memory access was encountered

My (updated) environment is:

Collecting environment information...
PyTorch version: 1.9.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.7 (tags/v3.9.7:1016ef3, Aug 30 2021, 20:19:38) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19043-SP0
Is CUDA available: True
CUDA runtime version: 11.4.120
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3080
Nvidia driver version: 471.96
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] pytorch-lightning==1.4.7
[pip3] torch==1.9.0+cu111
[pip3] torch-tb-profiler==0.2.1
[pip3] torchaudio==0.9.0
[pip3] torchmetrics==0.5.1
[pip3] torchvision==0.10.0+cu111
[conda] Could not collect

I am looking for answers to the following questions:

  1. Why does CudnnRnnBackward fail with CUDNN_STATUS_EXECUTION_FAILED?
  2. Why does a subsequent run fail with CUDA error: an illegal memory access was encountered?
  3. How can one re-run training with a smaller batch_size without running into the "illegal memory access" error?
  4. Why does this error only occur if outdoor_temperature contains at least batch_size + 134 entries?


ptrblck commented Sep 20, 2021

It seems you are running into multiple errors, so could you please try to fix them in order?

  • I don't see the error message in the original post besides the CudnnRnn error, so I'm unsure what exactly is failing:
Warning: Error detected in CudnnRnnBackward0.
  • Did you fix this issue? "Moving from CUDA to CPU I now get this error: one of the variables needed for gradient computation has been modified by"

  • 'CudnnRnnBackward' returned nan values in its 0th output.: if you are using mixed-precision training, enable anomaly detection only after the initial warmup steps, or lower the gradient scaling factor, since the first iterations can create invalid gradients (optimizer.step() will be skipped in that case, assuming you are using the mixed-precision utilities properly); see the sketch at the end of this comment.

What makes this problematic is if I catch the exception, reduce the batch size, and retry then I get a RuntimeError: CUDA error: an illegal memory access was encountered that is triggered when I invoke torch.cuda.manual_seed_all(seed) which invokes default_generator.manual_seed(seed). This is native code so I have no idea why it is failing.

  • This sounds as if your GPU is in a "bad" state, so reset the device or restart your machine to make sure the CUDA context creation can be executed properly.
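
As an illustration of the anomaly-detection warmup and gradient-scaling suggestion above, here is a minimal native-AMP sketch; the model, optimizer, data, and the WARMUP_STEPS value are placeholders invented for this example, not details taken from the issue.

import torch
import torch.nn as nn

# Hypothetical stand-ins; in the real code these come from the LightningModule above.
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(200)]

# Lower the initial gradient-scaling factor (the default is 2**16) so that early
# AMP iterations are less likely to overflow.
scaler = torch.cuda.amp.GradScaler(init_scale=2**10)
WARMUP_STEPS = 100  # arbitrary placeholder

for step, (x, y) in enumerate(loader):
    x, y = x.cuda(), y.cuda()
    if step == WARMUP_STEPS:
        # Enable anomaly detection only after the warmup steps, because the first
        # iterations can legitimately produce inf/nan gradients that GradScaler skips.
        torch.autograd.set_detect_anomaly(True)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()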


cowwoc commented Sep 21, 2021

@ptrblck

Did you fix this issue? "Moving from CUDA to CPU I now get this error: one of the variables needed for gradient computation has been modified by"

I would ignore the original post. That error was based on a nightly build of torch and project code that has since changed (I don't remember the details). I'd rather focus on the above testcase. As you can see, the testcase never modifies any variable in-place, so I assume whatever bugs we are seeing are in PyTorch.

'CudnnRnnBackward' returned nan values in its 0th output.: in case you are using mixed-precision training, enable the anomaly detection after the initial warmup steps or lower the gradient scaling factor as the first iterations could create invalid gradients (the optimizer.step() method will be skipped in this case assuming you are properly using the mixed-precision utilities)

What does "after the initial warmup steps" refer to? Anyway, I think this is a red herring, because I already have anomaly detection enabled at the top of the testcase. If I disable it altogether, I get RuntimeError('cuDNN error: CUDNN_STATUS_EXECUTION_FAILED') instead.

If I decrease the batch size, the error goes away. This leads me to believe that the error is actually triggered by an out-of-memory condition, but I'm happy to investigate further with your guidance to make sure this is correct. Just let me know how to track this down to the source of the problem (one possible check is sketched below).
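
One way to check the out-of-memory theory without guessing, using only documented torch.cuda memory statistics (a sketch; it was not actually run as part of this investigation):

import torch


def log_gpu_memory(tag: str) -> None:
    # torch.cuda reports these statistics in bytes; convert to MiB for readability.
    allocated = torch.cuda.memory_allocated() / 2**20
    reserved = torch.cuda.memory_reserved() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated={allocated:.0f} MiB reserved={reserved:.0f} MiB peak={peak:.0f} MiB")

# For example, call log_gpu_memory("before gru") and log_gpu_memory("after backward")
# inside forward()/training_step() to see how close the run gets to the card's 10 GB.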

This sounds as if your GPU is in a "bad" state, so reset the device or restart your machine to make sure the CUDA context creation can be executed properly.

  1. Restarting the process resolves this error. No need to restart the entire machine.
  2. I'm using Optuna to do hyperparameter tuning. There is no way for me to predict a reasonable batch size ahead of time since related parameters keep on changing. Nor is it realistic for me to manually restart the process every time this error occurs. Surely whatever "bad state" we're in should be resolvable without any restarts given that restarting the process resolves it automagically. Ideally I want to catch the error, reset whatever state needs to be reset, then retry the operation with a lower batch size.

How do we track this to the source of the problem?

Thank you.


cowwoc commented Sep 22, 2021

@ptrblck By the way, I just checked: removing precision=16 makes Warning: Error detected in CudnnRnnBackward go away. The testcase then fails with CUDA out of memory, which is expected, and the recovery code executes as expected. The key point is that I no longer get CUDA error: an illegal memory access was encountered.

If we can figure out how to avoid the error with precision=16 enabled, then I suspect we'll be done. Is it possible that the CUDA RNN implementation has a bug with precision=16?


cowwoc commented Sep 22, 2021

@ptrblck I ran cuda-memcheck against the testcase and it seems to hint at a bug in GRU's native code:

Epoch 0:   0%|          | 0/2 [00:00<?, ?it/s] ========= CUDA-MEMCHECK
========= Invalid __global__ read of size 2
=========     at 0x00000350 in void GRU_elementWise_bp1<__half, __half, float>(int, int, __half*, __half*, __half*, __half*, __half*, __half*, __half*, int, int)
=========     by thread (31,0,0) in block (134,0,0)
=========     Address 0x1264eba53e is out of bounds
=========     Device Frame:void GRU_elementWise_bp1<__half, __half, float>(int, int, __half*, __half*, __half*, __half*, __half*, __half*, __half*, int, int) (void GRU_elementWise_bp1<__half, __half, float>(int, int, __half*, __half*, __half*, __half*, __half*, __half*, __half*, int, int) : 0x350)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x80c58]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x80f71]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x8548a]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll (cuProfilerStop + 0x125dba) [0x366dea]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x185b3d]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll (cuProfilerStop + 0xf8fc2) [0x339ff2]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x3aa4d]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x3af42]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll [0x3b184]
=========     Host Frame:C:\Windows\system32\DriverStore\FileRepository\nvddi.inf_amd64_825a4c41796dab48\nvcuda64.dll (cuEventRecordWithFlags + 0x238) [0x22f498]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll [0x26e6]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll [0x1914]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll (cudnnRNNForwardTrainingEx + 0xd24) [0x2753f4]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll (cudnnRNNForwardTrainingEx + 0x11455) [0x285b25]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll (cudnnRNNForwardTrainingEx + 0x30c9) [0x277799]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\cudnn_adv_train64_8.dll (cudnnRNNBackwardData + 0xa79) [0x291b09]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::_cudnn_rnn_backward + 0x1d0e) [0x4fd60e]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cuda_cpp.dll (at::native::_cudnn_rnn_backward + 0x880) [0x4fc180]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cuda_cu.dll (at::native::set_storage_cuda_ + 0x11b23e) [0x44b785e]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cuda_cu.dll (at::native::set_storage_cuda_ + 0xda5dc) [0x4476bfc]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (at::zeros_outf + 0x1e8c3) [0x6120763]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (at::redispatch::_cudnn_rnn_backward + 0x1cd) [0x614b21d]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::jit::Value::wrap + 0x2df1c) [0x7446ddc]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::jit::Value::wrap + 0x507da) [0x746969a]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (at::native::mkldnn_tanh_ + 0x552d2) [0x5daf6f2]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (at::_cudnn_rnn_backward + 0x59d) [0x5ee351d]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::generated::CudnnRnnBackward::apply + 0x77b) [0x733433b]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::Node::operator() + 0x2ef) [0x731c7ef]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::Engine::add_thread_pool_task + 0x6aa) [0x79f442a]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::Engine::evaluate_function + 0x42c) [0x79f4e0c]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::Engine::thread_main + 0x5cb) [0x79f929b]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::Engine::thread_init + 0xc6) [0x79f8c36]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_python.dll (THPShortStorage_New + 0x3211e) [0x2c9dce]
=========     Host Frame:C:\Users\Gili\Documents\myproject\python\lib\site-packages\torch\lib\torch_cpu.dll (torch::autograd::Engine::get_base_engine + 0x9c4) [0x79f09d4]
=========     Host Frame:C:\Windows\System32\ucrtbase.dll (configthreadlocale + 0x92) [0x21bb2]
=========     Host Frame:C:\Windows\System32\KERNEL32.DLL (BaseThreadInitThunk + 0x14) [0x17034]
=========     Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x52651]

Here is the full output for your review: cuda-memcheck.zip

The only modification I made to the testcase was adding os.putenv("PYTORCH_NO_CUDA_MEMORY_CACHING", "1").

Please let me know what you think.

@soulitzer added the module: cudnn, module: rnn, triaged, and module: cuda labels on Sep 22, 2021
cowwoc changed the title from "Anomaly detection: Error detected in CudnnRnnBackward0" to "Anomaly detection: Error detected in CudnnRnnBackward0 when using AMP / mixed-precision mode" on Sep 24, 2021

cowwoc commented Oct 1, 2021

I stand corrected: in my proprietary project I get this bug even with AMP/precision=16 disabled. Does this bug need to be directed to the PyTorch team or Nvidia? Thanks.

cowwoc changed the title from "Anomaly detection: Error detected in CudnnRnnBackward0 when using AMP / mixed-precision mode" back to "Anomaly detection: Error detected in CudnnRnnBackward0" on Oct 1, 2021

ptrblck commented Oct 1, 2021

Thanks for the updates and sorry for the late reply.
The illegal memory access points to cuDNN, so I'll forward it to the cuDNN team. Are you able to reproduce it using your posted code snippet or did you have to make some changes?
I'll try to reproduce it on an Ampere GPU and will ping you in case I get stuck.


cowwoc commented Oct 1, 2021

I reproduced the illegal memory access using the code snippet unchanged. To run memcheck, I added os.putenv("PYTORCH_NO_CUDA_MEMORY_CACHING", "1").

Keep in mind that this issue might be specific to my hardware, in the sense that my video card (RTX 3080) has 10 GB of memory. I don't think you'll be able to reproduce the problem as easily with a different amount of GPU memory. If you don't have the same amount of GPU memory available, try changing batch_size = 10240 to a different value, but it's not easy to find the exact value that triggers the crash (a rough scaling sketch follows).
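
As a rough starting point on other cards, the testcase's value (batch_size 10240 on a 10 GB card) could be scaled linearly with the available device memory; this is only a guess at a reproduction parameter, not a derived rule.

import torch

# Total memory of GPU 0 in GiB.
total_gib = torch.cuda.get_device_properties(0).total_memory / 2**30

# Scale the testcase's starting value (10240 on a 10 GB RTX 3080) to this card.
batch_size = int(10240 * total_gib / 10)
print(f"GPU memory: {total_gib:.1f} GiB -> starting batch_size = {batch_size}")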

Let me know if you're able to reproduce the problem on your end. Thanks.


ptrblck commented Oct 1, 2021

Perfect, thanks a lot for the code snippet; I can reproduce the issue on a 3080.


cowwoc commented Oct 1, 2021

Phew, that's a relief :) Okay, please let me know what the next steps are and CC me on any new tickets. Thanks.


ptrblck commented Oct 11, 2021

Update: I just verified the fix in the upcoming cuDNN 8.3.0 release.


cowwoc commented Oct 12, 2021

@ptrblck Excellent news. Is cuDNN open-source? Is there a ticket I could subscribe to?


ptrblck commented Oct 12, 2021

No, cuDNN is not open source, and this issue is the public ticket for tracking the bug and its fix.
I can ping you here once 8.3 is released.


cowwoc commented Oct 12, 2021

Sounds good. Thank you for your help.
