
Multi-output forward grad codegen is wrong if output values are used #67367

Closed

albanD opened this issue Oct 27, 2021 · 4 comments
Labels
high priority · module: autograd (Related to torch.autograd, and the autograd engine in general) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@albanD (Collaborator) commented Oct 27, 2021

The current codegen allows the user to re-use the outputs' values in their formulas. But when there are multiple outputs, a previous formula may already have set a forward grad on the output being re-used.

For example, the entry for eigh is:

- name: linalg_eigh(Tensor self, str UPLO="L") -> (Tensor eigenvalues, Tensor eigenvectors)
  self: eigh_backward(grads, self, /*eigenvectors=*/true, eigenvalues, eigenvectors)
  eigenvalues: eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors)
  eigenvectors: eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors)

And so the schematic codegen for it is (with some small edits):

  if (_any_has_forward_grad_eigenvalues) {
      // stuff
      auto eigenvalues_new_fw_grad = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
      eigenvalues._set_fw_grad(eigenvalues_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }
  if (_any_has_forward_grad_eigenvectors) {
      // stuff
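      // By this point eigenvalues already has its fw_grad set by the block
      // above, so the call below itself runs under forward AD at the same level.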
      auto eigenvectors_new_fw_grad = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
      eigenvectors._set_fw_grad(eigenvectors_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }

You can see that by the time the eigenvectors formula uses eigenvalues, that Tensor already has its fw_grad populated. So we end up applying forward AD to eigh_jvp_eigenvectors, and the output is a weird Tensor whose forward grad itself has a forward grad (which should not be possible!).

Note that having such nested forward grads actually leads to a deadlock on destruction of the level (depending on the order in which they are stored in the ordered set, so the deadlock does not reproduce consistently).
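
Such a nested state could be rejected as soon as it is created. A minimal sketch of that kind of guard (the helper name and its placement are assumptions, not the actual implementation; `_fw_grad` and `TORCH_CHECK` are existing internals):

  #include <ATen/ATen.h>

  // Hypothetical guard: a tangent being installed as a forward grad at `level`
  // must not itself carry a forward grad at that same level.
  void check_no_nested_fw_grad(const at::Tensor& tangent, uint64_t level) {
    TORCH_CHECK(
        !tangent._fw_grad(level).defined(),
        "a forward grad must not itself have a forward grad at the same level");
  }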

We should:

  • Properly enforce that a forward grad cannot itself have a forward grad at the same level (along the lines of the guard sketched above). You should not be able to do make_dual(a, make_dual(b, c)) (note that this case does not usually deadlock, due to the order in which the grads are registered on the level).
  • Fix the codegen to compute all the gradients first and only then set them on the outputs. This ensures that no formula is evaluated with a forward grad already set on an output; see the sketch below.
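
Schematically, the second point turns the generated code above into something like this (a sketch, not the exact generated output):

  // Compute every new forward grad first, while none of the outputs has had
  // its fw_grad set yet, so the formulas never themselves run under forward AD.
  c10::optional<at::Tensor> eigenvalues_new_fw_grad;
  c10::optional<at::Tensor> eigenvectors_new_fw_grad;
  if (_any_has_forward_grad_eigenvalues) {
      eigenvalues_new_fw_grad = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
  }
  if (_any_has_forward_grad_eigenvectors) {
      eigenvectors_new_fw_grad = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
  }
  // Only then install them on the outputs.
  if (eigenvalues_new_fw_grad.has_value()) {
      eigenvalues._set_fw_grad(*eigenvalues_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }
  if (eigenvectors_new_fw_grad.has_value()) {
      eigenvectors._set_fw_grad(*eigenvectors_new_fw_grad, /* level */ 0, /* is_inplace_op */ false);
  }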

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7

@albanD added the high priority, module: autograd, and triaged labels on Oct 27, 2021
@albanD (Collaborator, Author) commented Oct 27, 2021

cc @IvanYashchuk @lezcano: this should be fixed at the same time as, or before, we add support for multiple entries in one line for forward mode AD.

@lezcano (Collaborator) commented Oct 28, 2021

Both @nikitaved and @IvanYashchuk had reported that CI hangs when testing AD for some functions. I guess now we know why!

@albanD (Collaborator, Author) commented Oct 28, 2021

Yes, some were fixed by #67360. Fixing this issue will take care of the remaining ones.

@ejguan (Contributor) commented Oct 28, 2021

This may be related, as @soulitzer suggests: #67463

facebook-github-bot pushed a commit that referenced this issue Nov 9, 2021
Summary:
Will hide some of the issues from #67367
This will at least allow us to run gradcheck for now until the above issue is fixed.

For more context, the deadlock happens when we (wrongly) set a forward grad that itself has a forward grad at the same level.
In particular, when exiting the level from https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L23
we take the `all_forward_levels_mutex_` lock and proceed to delete the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L29 (nothing else usually references this object, so it gets deleted as soon as it is removed from the vector). Note that, at this point, we still hold the lock!

In the level destructor, at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L55, we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it).
But in the (bad) case where this Tensor also has a forward grad for this level, the autograd meta clears the fw grads: https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L124
While clearing, we access the level (to de-register this forward grad) via https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L139
But this tries to access the level again in https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L39 and deadlocks.
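
A stripped-down sketch of that locking pattern (the function names here are simplified placeholders, not the actual functions in forward_grad.cpp; only `all_forward_levels_mutex_` is taken from the code above):

  #include <mutex>

  std::mutex all_forward_levels_mutex_;

  void deregister_fw_grad_from_level() {
    // Second lock on the same non-recursive mutex from the same thread:
    // formally undefined behavior, in practice the thread blocks forever.
    std::lock_guard<std::mutex> lock(all_forward_levels_mutex_);
  }

  void exit_dual_level() {
    std::lock_guard<std::mutex> lock(all_forward_levels_mutex_);
    // Erasing the level runs its destructor while the lock is still held.
    // If a forward grad at this level itself has a forward grad, clearing it
    // calls back into the level registry:
    deregister_fw_grad_from_level();  // -> deadlock
  }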

Pull Request resolved: #67995

Reviewed By: soulitzer

Differential Revision: D32250996

Pulled By: albanD

fbshipit-source-id: f6118117effd3114fa90dc8fe22865339445f70c