Multi-output forward grad codegen is wrong if output values are used #67367
Labels
high priority
module: autograd
Related to torch.autograd, and the autograd engine in general
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
Comments
albanD added the high priority, module: autograd, and triaged labels on Oct 27, 2021
cc @IvanYashchuk @lezcano this should be fixed at the same time as, or before, we add support for multiple entries in one line for forward mode AD.
Both @nikitaved and @IvanYashchuk had reported that CI hangs when testing AD for some functions. I guess now we know why!
Yes, some were fixed by #67360
This can be related, as @soulitzer suggests: #67463
facebook-github-bot pushed a commit that referenced this issue on Nov 9, 2021
Summary: This will hide some of the issues from #67367 and at least allow us to run gradcheck for now, until the above issue is fixed.

For more context, the deadlock happens when we (wrongfully) set a forward grad that itself has a forward grad of the same level. In particular, when exiting the level from https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L23 we take the `all_forward_levels_mutex_` lock and proceed to delete the level at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L29 (nothing else usually references this object, so it gets deleted as soon as it is removed from the vector). Note that, at this point, we still hold the lock!

In the level destructor at https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L55 we delete the forward grad, which triggers the deletion of the grad Tensor and everything it holds (assuming nothing else references it). But in the (bad) case where this Tensor also has a forward grad for this level, the autograd meta clears the fw grads: https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L124 While clearing, we access the level (to de-register this forward grad) via https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.h#L139 but this tries to access the level again in https://github.com/pytorch/pytorch/blob/191b48b12f33e1e9525882da0c62b68686d69e42/torch/csrc/autograd/forward_grad.cpp#L39 which takes the same lock, and deadlocks.

Pull Request resolved: #67995
Reviewed By: soulitzer
Differential Revision: D32250996
Pulled By: albanD
fbshipit-source-id: f6118117effd3114fa90dc8fe22865339445f70c
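For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of the deadlock described above. All names are hypothetical stand-ins, not the actual PyTorch code: a non-recursive mutex is held while the level is erased, and the level's destructor tries to take the same mutex (undefined behavior for `std::mutex`, which typically manifests as a hang).

```cpp
#include <memory>
#include <mutex>
#include <vector>

std::mutex all_levels_mutex;

struct Level {
  ~Level() {
    // De-registering the nested forward grad needs the same lock that
    // release_level() below is still holding: a self-deadlock.
    std::lock_guard<std::mutex> guard(all_levels_mutex);
  }
};

std::vector<std::shared_ptr<Level>> all_levels;

void release_level(size_t idx) {
  std::lock_guard<std::mutex> guard(all_levels_mutex);
  // If this is the last reference, erase() runs ~Level() while the
  // lock is still held, and the destructor blocks forever above.
  all_levels.erase(all_levels.begin() + idx);
}

int main() {
  all_levels.push_back(std::make_shared<Level>());
  release_level(0);  // hangs
}
```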
The current codegen allows a formula to re-use the values of the op's outputs. But when there are multiple outputs, an output read by a later formula may already have been modified by an earlier formula; in particular, its forward grad may already be set.
For example, eigh is:
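(A sketch of the derivatives.yaml entry; the exact formula signatures here are assumptions.)

```yaml
- name: linalg_eigh(Tensor self, str UPLO="L") -> (Tensor eigenvalues, Tensor eigenvectors)
  self: linalg_eigh_backward(grads, self, eigenvalues, eigenvectors)
  # Both forward-AD formulas read the *outputs* eigenvalues/eigenvectors:
  eigenvalues: eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors)
  eigenvectors: eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors)
```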
And so the schematic codegen for it is (with some small edits):
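(A sketch of that generated code; helper names such as `isFwGradDefined` and `toNonOptFwGrad` follow the codegen's conventions, but the exact generated code may differ.)

```cpp
// Schematic generated code for eigh (simplified):
auto [eigenvalues, eigenvectors] = at::linalg_eigh(self, UPLO);
if (isFwGradDefined(self)) {
  auto self_t = toNonOptFwGrad(self);
  auto eigenvalues_t = eigh_jvp_eigenvalues(self_t, eigenvalues, eigenvectors);
  eigenvalues._set_fw_grad(eigenvalues_t, /*level=*/0, /*is_inplace_op=*/false);
  // BUG: at this point `eigenvalues` already carries a forward grad, so
  // forward AD is applied to this call and `eigenvectors_t` ends up being
  // a dual Tensor itself (a forward grad that has a forward grad).
  auto eigenvectors_t = eigh_jvp_eigenvectors(self_t, eigenvalues, eigenvectors);
  eigenvectors._set_fw_grad(eigenvectors_t, /*level=*/0, /*is_inplace_op=*/false);
}
```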
You can see that by the time the eigenvectors formula uses eigenvalues, it already has its fw_grad populated. And so we're going to apply forward AD to eigh_jvp_eigenvectors, and the output will be a weird Tensor whose forward grad also has a forward grad (which should not be possible!).

Note that having such nested forward grads actually leads to a deadlock on destruction of the level (depending on the order in which they are stored in the ordered set, so the deadlock is not consistently reproducible).
We should:
- fix the codegen so that a formula never sees an output whose forward grad has already been set by an earlier formula
- raise an error when the tangent being set is itself a dual Tensor, which a user can also construct directly via make_dual(a, make_dual(b, c)) (note that this one does not usually deadlock due to the order in which they are registered on the level); see the repro sketch after this list

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @albanD @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7
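A minimal sketch of that user-level construction, using the public forward-AD API (illustrative only; on an affected build this may hang on level exit instead of raising an error):

```python
import torch
import torch.autograd.forward_ad as fwAD

a, b, c = torch.randn(3), torch.randn(3), torch.randn(3)

with fwAD.dual_level():
    # The tangent passed to the outer make_dual is itself a dual Tensor,
    # so a's forward grad has a forward grad of the same level -- the
    # nested state this issue argues should be rejected with an error.
    bad = fwAD.make_dual(a, fwAD.make_dual(b, c))
# Exiting the dual level destroys the level while nested grads exist.
```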