[1.5 Release][Dist Autograd][Better Engineering] Notify Workers on Failure during Distributed Autograd #34638
This pull request was exported from Phabricator. Differential Revision: D20164420
1f61dec to f943c42
💊 CircleCI build failures summary and remediations
As of commit 94cb64f (more details on the Dr. CI page):
🚧 1 upstream failure: These were probably caused by upstream breakages.
This comment was automatically generated by Dr. CI. This comment has been revised 37 times.
f943c42 to 3103de8
3103de8 to 4890a80
4890a80 to 7c01788
Awesome work!
torch/csrc/distributed/autograd/rpc_messages/autograd_backward_failure_req.cpp
7c01788 to d26f333
Can we add the number of DIST_AUTOGRAD_FAILURE_REQ messages received per context id to debug info and then verify here that each node gets one of these messages for all the other context ids? Can add this as a separate PR.
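A rough sketch of what such a counter could look like, written as plain Python rather than against the real debug info plumbing. `FailureReqCounter`, `record`, `counts`, and `received_exactly_once` are hypothetical names used only for illustration; none of them are PyTorch APIs.

```python
from collections import Counter


class FailureReqCounter:
    """Hypothetical per-node counter of failure messages, keyed by context id."""

    def __init__(self):
        self._counts = Counter()

    def record(self, context_id):
        # Called once for every DIST_AUTOGRAD_FAILURE_REQ-style message received.
        self._counts[context_id] += 1

    def counts(self):
        # Plain dict snapshot, suitable for merging into a debug info map.
        return dict(self._counts)


def received_exactly_once(counter, context_ids):
    """True if the node saw exactly one failure message for each given context id."""
    seen = counter.counts()
    return all(seen.get(cid, 0) == 1 for cid in context_ids)
```

A test could then call `received_exactly_once` on each node with the context ids it expects to have been notified about.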
d26f333 to 17b1cd8
…ilure during Distributed Autograd (pytorch#34638)
Summary:
Pull Request resolved: pytorch#34638
Fixes: pytorch#27643
This PR manages notifying workers in the event of a failure during distributed autograd. Gracefully handles propagating errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.
Test Plan: Added 2 new tests checking errors when they are thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.
Differential Revision: D20164420
fbshipit-source-id: 5aada5544ed12cd7e24053ba3b93f8b9b38ba021
17b1cd8 to 94cb64f
This pull request has been merged in 5f67c92.
this broke lint
Summary:
Fixes: #27643
This PR notifies workers when a failure occurs during distributed autograd. It gracefully propagates errors across all nodes in the backward pass and sets state in the local autograd engines accordingly.
Test Plan: Added 2 new tests that check error handling when an error is thrown in an intermediate node during distributed autograd. Ensured that all existing distributed autograd tests pass.
Differential Revision: D20164420
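To illustrate the idea in the summary, here is a toy model of failure propagation in pure Python. It is only a sketch under stated assumptions, not the implementation in this PR: `Worker`, `known_peers`, `handle_failure`, and `make_chain` are hypothetical names, and the real change works through RPC messages and the C++ autograd engine.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical stand-in for a node taking part in a backward pass."""
    name: str
    # context_id -> error message recorded for that backward pass
    errors: dict = field(default_factory=dict)
    # context_id -> names of peers known to share that context
    known_peers: dict = field(default_factory=dict)

    def handle_failure(self, context_id, message, cluster):
        if context_id in self.errors:
            return  # already notified; this stops notifications from looping
        self.errors[context_id] = message  # mark the local state as failed
        for peer in self.known_peers.get(context_id, ()):
            cluster[peer].handle_failure(context_id, message, cluster)


def make_chain(names, context_id):
    """Build workers where each one knows only its neighbours in the chain."""
    cluster = {n: Worker(n) for n in names}
    for a, b in zip(names, names[1:]):
        cluster[a].known_peers.setdefault(context_id, set()).add(b)
        cluster[b].known_peers.setdefault(context_id, set()).add(a)
    return cluster


# An error raised on the intermediate node reaches both ends of the chain.
cluster = make_chain(["worker0", "worker1", "worker2"], context_id=7)
cluster["worker1"].handle_failure(7, "backward failed", cluster)
```

After the last call, every worker in `cluster` holds the error for context 7, which mirrors the scenario the new tests describe: a failure in an intermediate node becomes visible on all participating nodes.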