Closed
Labels
module: autograd (Related to torch.autograd, and the autograd engine in general)
triaged (This issue has been looked at a team member, and triaged and prioritized into an appropriate module)
Description
🐛 Describe the bug
I would like to report what I believe is a fairly serious bug in torch.autograd.detect_anomaly. I found it while debugging a NaN loss in my training run (posted as a question on the forum).
The bug: torch.autograd.detect_anomaly doesn't trace the error back to the forward code that created the failing operation, as shown in the output below.
Epoch: [0] [ 120/2669] eta: 0:50:30 lr: 0.000017 loss: 6.8861 (6.9013) time: 1.1663 data: 0.0002 max mem: 12126
Epoch: [0] [ 140/2669] eta: 0:50:00 lr: 0.000020 loss: 6.8800 (6.8984) time: 1.1699 data: 0.0002 max mem: 12126
Traceback (most recent call last):
  File "/home/code/train_bert.py", line 367, in <module>
    main(args)
  File "/home/code/train_bert.py", line 301, in main
    train_stats = train_one_epoch(
  File "/home/code/engine.py", line 68, in train_one_epoch
    loss_scaler(loss, optimizer, clip_grad=max_norm,
  File "/home/code/util/misc.py", line 290, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/home/.conda/envs/cuda11/lib/python3.9/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/.conda/envs/cuda11/lib/python3.9/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Function 'BmmBackward0' returned nan values in its 1th output.
The message names the backward function (BmmBackward0) but doesn't show which call to bmm in the forward pass actually caused the issue.
Unfortunately, I'm unable to provide a minimal working example, but I hope somebody else can.
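For reference, here is a toy sketch of the mechanism involved. It is not the training code above and does not reproduce the missing traceback. As far as I understand, anomaly mode reports the forward traceback as a Python warning ("Error detected in ...") that is separate from the RuntimeError, so the sketch records warnings around a small bmm call whose backward is fed a NaN gradient on purpose. The helper name run_with_anomaly_report and the function toy_step are hypothetical, made up for this illustration.

```python
import warnings

import torch


def run_with_anomaly_report(fn):
    """Run fn() under detect_anomaly.

    Returns (error_message_or_None, list_of_warning_texts).
    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even if the process filters them elsewhere.
        warnings.simplefilter("always")
        error = None
        try:
            with torch.autograd.detect_anomaly():
                fn()
        except RuntimeError as exc:
            error = str(exc)
        return error, [str(w.message) for w in caught]


def toy_step():
    a = torch.ones(1, 2, 2, requires_grad=True)
    b = torch.ones(1, 2, 2)
    out = torch.bmm(a, b)  # the forward call a report should point at
    # Feed a NaN gradient on purpose so BmmBackward0 returns NaN for `a`.
    out.backward(torch.full_like(out, float("nan")))


error, recorded = run_with_anomaly_report(toy_step)
print(error)
for text in recorded:
    print(text)
```

If the recorded warnings include a forward traceback here but nothing similar shows up in the real training run, one thing worth checking (an assumption on my side, not a confirmed cause) is whether the training script or a library filters Python warnings.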
Versions
1.12.1
cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7