torch.autograd.detect_anomaly stops tracing back further into the code #86362

@netw0rkf10w

🐛 Describe the bug

I would like to report what I believe is a serious bug in torch.autograd.detect_anomaly. I found it while debugging a NaN loss in my training run (also posted as a question on the forum).

The bug: torch.autograd.detect_anomaly does not trace the error back to the forward-pass code that caused it, as the output below shows.

Epoch: [0]  [ 120/2669]  eta: 0:50:30  lr: 0.000017  loss: 6.8861 (6.9013)  time: 1.1663  data: 0.0002  max mem: 12126
Epoch: [0]  [ 140/2669]  eta: 0:50:00  lr: 0.000020  loss: 6.8800 (6.8984)  time: 1.1699  data: 0.0002  max mem: 12126
Traceback (most recent call last):
  File "/home/code/train_bert.py", line 367, in <module>
    main(args)
  File "/home/code/train_bert.py", line 301, in main
    train_stats = train_one_epoch(
  File "/home/code/engine.py", line 68, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=max_norm,
  File "/home/code/util/misc.py", line 290, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/home/.conda/envs/cuda11/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/.conda/envs/cuda11/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'BmmBackward0' returned nan values in its 1th output.

The error message names the backward function (BmmBackward0) but does not show which call to bmm in the forward pass actually produced the NaN.
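Until the forward traceback is available, one possible workaround is to label each bmm call site and check the gradients flowing back into it with tensor hooks. The sketch below is only an illustration under assumptions: `checked_bmm` and its `label` argument are hypothetical names (not part of PyTorch), and the NaN is forced artificially with an inf/-inf upstream gradient rather than reproduced from the training code above.

```python
import torch


def checked_bmm(a, b, label):
    """torch.bmm wrapper that names the call site when a NaN gradient
    flows back into one of its inputs.

    `checked_bmm` and `label` are hypothetical helpers for this sketch.
    """
    def make_hook(which):
        def hook(grad):
            if torch.isnan(grad).any():
                raise RuntimeError(
                    f"NaN gradient for {which} of bmm call '{label}'"
                )
        return hook

    inputs = []
    for which, t in (("input 0", a), ("input 1", b)):
        # A fresh view receives only the gradient produced by this bmm call,
        # even if the original tensor is also used elsewhere in the graph.
        v = t.view_as(t)
        if v.requires_grad:
            v.register_hook(make_hook(which))
        inputs.append(v)
    return torch.bmm(*inputs)


# Artificial reproduction: the gradient w.r.t. `b` is a^T @ upstream,
# which here evaluates to inf + (-inf) = nan.
a = torch.ones(1, 2, 1)
b = torch.ones(1, 1, 1, requires_grad=True)
out = checked_bmm(a, b, "attention scores")
upstream = torch.tensor([[[float("inf")], [float("-inf")]]])

try:
    out.backward(upstream)
    message = None
except RuntimeError as err:
    message = str(err)

print(message)
```

Replacing suspicious `torch.bmm` calls with such a wrapper, each with a distinct label, narrows the failure down to one call site. The hooks run independently of detect_anomaly, so the wrapper can be used with anomaly detection turned off.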

Unfortunately, I am unable to provide a minimal working example, but I hope somebody else can.

Versions

PyTorch 1.12.1

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7

Labels: module: autograd, triaged