Fix deadlock for multi-output forward AD #67995
Conversation
This is great!
@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Nice!
Makes sense to me. Thanks for the write-up!
This will hide some of the issues from #67367. It will at least allow us to run gradcheck for now, until the above issue is properly fixed.
For more context, the deadlock happens when we (wrongfully) set a forward grad that itself has a forward grad of the same level. Concretely (all line numbers at commit 191b48b):

1. When exiting the level, we take the all_forward_levels_mutex_ lock (pytorch/torch/csrc/autograd/forward_grad.cpp, line 23) and proceed to delete the level (forward_grad.cpp, line 29).
2. This runs the level destructor (forward_grad.cpp, line 55).
3. In the (bad) case where one of the Tensors being cleaned up also has a forward grad for this very level, the autograd meta clears the fw grads (pytorch/torch/csrc/autograd/forward_grad.h, line 124).
4. While clearing, we access the level to de-register this forward grad (forward_grad.h, line 139).
5. But this tries to access the level again (forward_grad.cpp, line 39), which needs all_forward_levels_mutex_; the same thread is still holding it, so we deadlock.
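To make the cycle concrete, here is a minimal, self-contained C++ sketch of the locking pattern, together with one common way to break it. Apart from all_forward_levels_mutex_, every name here (Level, deregister, release_level_*) is a hypothetical stand-in for illustration, not PyTorch's actual ForwardADLevel machinery:

```cpp
#include <memory>
#include <mutex>
#include <vector>

struct Level;

// Mirrors the registry lock taken when exiting a level (forward_grad.cpp,
// line 23). This is the only name borrowed from the real code.
std::mutex all_forward_levels_mutex_;
std::vector<std::shared_ptr<Level>> all_forward_levels_;

void deregister(Level* lvl);

struct Level {
  ~Level() {
    // Clearing the forward grads held by this level calls back into the
    // registry to de-register them (the forward_grad.h, line 139 step).
    deregister(this);
  }
};

void deregister(Level* /*lvl*/) {
  // Second lock attempt on a non-recursive std::mutex from the thread
  // that already holds it: undefined behavior, a hang in practice.
  std::lock_guard<std::mutex> guard(all_forward_levels_mutex_);
  // ... erase the grads registered for this level ...
}

// Buggy shape of the exit path: ~Level() runs while the lock is held.
void release_level_deadlocks() {
  std::lock_guard<std::mutex> guard(all_forward_levels_mutex_);
  all_forward_levels_.pop_back();  // ~Level() -> deregister() -> relock
}

// One way to break the cycle: keep the level alive past the critical
// section so its destructor runs after the lock has been released.
void release_level_safe() {
  std::shared_ptr<Level> keep_alive;
  {
    std::lock_guard<std::mutex> guard(all_forward_levels_mutex_);
    keep_alive = all_forward_levels_.back();
    all_forward_levels_.pop_back();
  }
  // keep_alive is destroyed here; deregister() can now take the mutex.
}

int main() {
  all_forward_levels_.push_back(std::make_shared<Level>());
  release_level_safe();  // fine: destruction happens outside the lock
  // release_level_deadlocks() would hang here instead.
}
```

The safe variant simply moves ownership out of the registry while the lock is held and lets the destructor (and with it the re-entrant deregister() call) run after the lock is released; whether the fix in this PR takes exactly that shape is not shown here.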