
Multi-output forward grad codegen is wrong if output values are used #67367

Closed

albanD opened this issue Oct 27, 2021 · 4 comments
Labels
high priority · module: autograd (Related to torch.autograd, and the autograd engine in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@albanD (Collaborator) commented Oct 27, 2021

The current codegen allows the user to re-use the outputs' values in their formulas. But when there are multiple outputs, a previous formula may already have set a forward grad on the output being re-used.

For example, the entry for eigh is:

- name: linalg_eigh(Tensor self, str UPLO="L") -> (Tensor eigenvalues, Tensor eigenvectors)
  self: eigh_backward(grads, self, /*eigenvectors=*/true, eigenvalues, eigenvectors)
  eigenvalues: eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors)
  eigenvectors: eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors)

And so the schematic codegen for it is (with some small edits):

  if (_any_has_forward_grad_eigenvalues) {
      // stuff
      auto eigenvalues_new_fw_grad = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
      eigenvalues._set_fw_grad(eigenvalues_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }
  if (_any_has_forward_grad_eigenvectors) {
      // stuff
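      // By this point eigenvalues already has its fw_grad set by the block
      // above, so the call below itself runs under forward AD at the same level.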
      auto eigenvectors_new_fw_grad = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
      eigenvectors._set_fw_grad(eigenvectors_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }

You can see that by the time the eigenvectors formula uses eigenvalues, that Tensor already has its fw_grad populated. So we end up applying forward AD to eigh_jvp_eigenvectors, and the output is a weird Tensor whose forward grad itself has a forward grad (which should not be possible!).

Note that having such nested forward grads actually leads to a deadlock on destruction of the level (depending on the order in which they are stored in the ordered set, so the deadlock does not reproduce consistently).
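
Such a nested state could be rejected as soon as it is created. A minimal sketch of that kind of guard (the helper name and its placement are assumptions, not the actual implementation; `_fw_grad` and `TORCH_CHECK` are existing internals):

  #include <ATen/ATen.h>

  // Hypothetical guard: a tangent being installed as a forward grad at `level`
  // must not itself carry a forward grad at that same level.
  void check_no_nested_fw_grad(const at::Tensor& tangent, uint64_t level) {
    TORCH_CHECK(
        !tangent._fw_grad(level).defined(),
        "a forward grad must not itself have a forward grad at the same level");
  }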

We should:

  • Properly enforce that a forward grad cannot itself have a forward grad at the same level (along the lines of the guard sketched above). You should not be able to do make_dual(a, make_dual(b, c)) (note that this case does not usually deadlock, due to the order in which the grads are registered on the level).
  • Fix the codegen to compute all the gradients first and only then set them on the outputs. This ensures that no formula is evaluated with a forward grad already set on an output; see the sketch below.
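
Schematically, the second point turns the generated code above into something like this (a sketch, not the exact generated output):

  // Compute every new forward grad first, while none of the outputs has had
  // its fw_grad set yet, so the formulas never themselves run under forward AD.
  c10::optional<at::Tensor> eigenvalues_new_fw_grad;
  c10::optional<at::Tensor> eigenvectors_new_fw_grad;
  if (_any_has_forward_grad_eigenvalues) {
      eigenvalues_new_fw_grad = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
  }
  if (_any_has_forward_grad_eigenvectors) {
      eigenvectors_new_fw_grad = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
  }
  // Only then install them on the outputs.
  if (eigenvalues_new_fw_grad.has_value()) {
      eigenvalues._set_fw_grad(*eigenvalues_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }
  if (eigenvectors_new_fw_grad.has_value()) {
      eigenvectors._set_fw_grad(*eigenvectors_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }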

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7

@albanD added the high priority, module: autograd, and triaged labels on Oct 27, 2021
@albanD (Collaborator, Author) commented Oct 27, 2021

cc @IvanYashchuk @lezcano: this should be fixed at the same time as, or before, we add support for multiple entries in one line for forward mode AD.

@lezcano (Collaborator) commented Oct 28, 2021

Both @nikitaved and @IvanYashchuk had reported that CI hangs when testing AD for some functions. I guess now we know why!

@albanD (Collaborator, Author) commented Oct 28, 2021

Yes, some were fixed by #67360. Fixing this issue will take care of the remaining ones.

@ejguan (Contributor) commented Oct 28, 2021

This may be related, as @soulitzer suggests: #67463

facebook-github-bot pushed a commit that referenced this issue Nov 9, 2021
Summary:
Will hide some of the issues from #67367
This will at least allow us to run gradcheck for now until the above issue is fixed.

For more context, the deadlock happens when we (wrongly) set a forward grad that itself has a forward grad at the same level.
In particular, when exiting the level from https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L23
we take the `all_forward_levels_mutex_` lock and proceed to delete the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L29 (nothing else usually references this object, so it gets deleted as soon as it is removed from the vector). Note that, at this point, we still hold the lock!

In the level destructor, at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L55, we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it).
But in the (bad) case where this Tensor also has a forward grad for this level, the autograd meta clears the fw grads: https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L124
While clearing, we access the level (to de-register this forward grad) via https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L139
But this tries to access the level again in https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L39 and deadlocks.
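
A stripped-down sketch of that locking pattern (the function names here are simplified placeholders, not the actual functions in forward_grad.cpp; only `all_forward_levels_mutex_` is taken from the code above):

  #include <mutex>

  std::mutex all_forward_levels_mutex_;

  void deregister_fw_grad_from_level() {
    // Second lock on the same non-recursive mutex from the same thread:
    // formally undefined behavior, in practice the thread blocks forever.
    std::lock_guard<std::mutex> lock(all_forward_levels_mutex_);
  }

  void exit_dual_level() {
    std::lock_guard<std::mutex> lock(all_forward_levels_mutex_);
    // Erasing the level runs its destructor while the lock is still held.
    // If a forward grad at this level itself has a forward grad, clearing it
    // calls back into the level registry:
    deregister_fw_grad_from_level();  // -> deadlock
  }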

Pull Request resolved: #67995

Reviewed By: soulitzer

Differential Revision: D32250996

Pulled By: albanD

fbshipit-source-id: f6118117effd3114fa90dc8fe22865339445f70c