
Debugging feature for "modified by an inplace operation" errors #15803

Closed

gwgundersen opened this issue Jan 7, 2019 · 7 comments
Feature

Better error logging for inplace operations that throw errors in automatic differentiation.

Motivation

In complex computational graphs, locating the operation causing the error is nontrivial. A feature for isolating the offending operation would save a lot of developer time.

Pitch

See here for a more concrete suggestion. To quote:

I wonder whether it might be worth adding a "debug" mode that records the stack of the op in the forward pass and spits it out on error in the backward. That way, it would point to the right line of code directly.

Alternatives

Just being able to log all inplace operations would be useful.

ailzhang added the todo label Jan 8, 2019
fmassa (Member) commented Jan 14, 2019

Question: can't you use the anomaly_detection for that?

Here is an example where it points to the exact line where the error occurs:

import torch


with torch.autograd.set_detect_anomaly(True):
    a = torch.rand(1, requires_grad=True)
    c = torch.rand(1, requires_grad=True)

    b = a ** 2 * c ** 2
    b += 1
    b *= c + a

    d = b.exp_()  # anomaly detection's forward trace points to this line
    d *= 5

    b.backward()  # raises RuntimeError during the backward pass

And the stack trace:

sys:1: RuntimeWarning: Traceback of forward call that caused the error:
  File "tst.py", line 13, in <module>
    d = b.exp_()

Traceback (most recent call last):
  File "tst.py", line 16, in <module>
    b.backward()
  File "/Users/fmassa/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/Users/fmassa/anaconda3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

which points directly to b.exp_(), and indeed, if you change that line to be b.exp(), it all works fine.
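For completeness, a minimal sketch of the fix described above (not part of the original comment): with the out-of-place exp, d is a separate tensor and b's graph no longer goes through the exp node, so per fmassa's report the backward pass completes without the error.

import torch


with torch.autograd.set_detect_anomaly(True):
    a = torch.rand(1, requires_grad=True)
    c = torch.rand(1, requires_grad=True)

    b = a ** 2 * c ** 2
    b += 1
    b *= c + a

    d = b.exp()   # out-of-place: d is a new tensor, b keeps its own grad_fn
    d *= 5        # no longer modifies a tensor saved for b's backward

    b.backward()  # per the comment above, runs without the RuntimeError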

fmassa added triage review and removed the triage review and todo labels Jan 14, 2019
fmassa (Member) commented Jan 14, 2019

Closing as this can be obtained with anomaly_detection.

fmassa closed this as completed Jan 14, 2019

gwgundersen (Author) commented

Great, thanks. This might be a good thing to add to the section on in-place operations in the autograd mechanics tutorial. I'm happy to add a sentence or two, but the contributing doc doesn't mention how to modify tutorials or notes.

fmassa (Member) commented Jan 14, 2019

Maybe check in https://github.com/pytorch/tutorials ?

f0k (Contributor) commented Mar 27, 2019

which points directly to b.exp_(), and indeed, if you change that line to be b.exp(), it all works fine.

To clarify for other readers, the anomaly detection will not necessarily point you at the inplace operation that caused the failure. Instead, it will point you at the operation that could not compute its gradient in the backward pass. The inplace operation to blame may occur anywhere after that, modifying one of the tensors that participated in the line found by the anomaly detection.

Example:

import numpy as np
import torch

x = torch.rand(10, 20, requires_grad=True)
y = torch.rand(10)
z = (x / y[:, np.newaxis])  # anomaly detection will point here
c = y.abs_()  # but the problem is here
z.sum().backward()

The last line will cause a RuntimeError. With anomaly detection enabled, it will point at the line performing the division, but the inplace operation came later.
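Not from the thread, but as an illustrative sketch: making the later operation out-of-place (or cloning y first) leaves the tensor saved by the division untouched, and the backward pass completes.

import numpy as np
import torch

x = torch.rand(10, 20, requires_grad=True)
y = torch.rand(10)
z = (x / y[:, np.newaxis])  # the division saves (a view of) y for its backward
c = y.abs()                 # out-of-place: the saved tensor is never modified
z.sum().backward()          # completes without a RuntimeError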

redwrasse (Contributor) commented

Hi,
As @f0k pointed out in the last comment, the anomaly detection feature does not isolate the line containing the in-place operation. I think the original poster's (@gwgundersen) concerns still stand. Is there any interest in, or are there any new developments toward, debugging this sort of error? Thanks!

wilsonjwcsu commented

Agreed with @redwrasse. Anomaly detection doesn't isolate the problematic in-place operation, which can still be a real pain to track down. Definitely interested in better tools for debugging this problem.
