Add ProcessGroupAgent termination detection algorithm #26984
Closed
Closes #26944.

In the existing implementation, each worker exits when it sees no pending send/recv tasks. However, as we add support for nested calls, one RPC can trigger more RPCs in the UDF or in the response callback. As a result, even if a worker sees no send/recv tasks right now, more may still arrive later.

This commit adds counters for all sent and received messages between each pair of nodes and uses allgather to collect those counters, so that all workers have the same view of the global state. Workers exit only when all sent messages have been received and processed.
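The exit condition described above can be sketched in plain C++ without any PyTorch dependency. This is a minimal illustration, not the code from this PR: the `Counters` alias, the `allSendsReceived` function, and the buffer layout (first half sent counts, second half received counts) are assumptions made for the example, and the `gathered` vector merely stands in for the result of an allgather.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical layout (an assumption, not confirmed by the PR): each worker
// contributes 2 * worldSize counters. Entry [dst] is the number of messages
// this worker sent to dst; entry [worldSize + src] is the number of messages
// this worker received (and processed) from src.
using Counters = std::vector<int64_t>;

// `gathered` stands in for the allgather output: one Counters row per worker,
// identical on every worker. Returns true only if every message sent from
// src to dst has also been counted as received by dst.
bool allSendsReceived(const std::vector<Counters>& gathered) {
  const std::size_t worldSize = gathered.size();
  for (std::size_t src = 0; src < worldSize; ++src) {
    assert(gathered[src].size() == 2 * worldSize);
    for (std::size_t dst = 0; dst < worldSize; ++dst) {
      const int64_t sent = gathered[src][dst];
      const int64_t received = gathered[dst][worldSize + src];
      if (sent != received) {
        return false;  // A message is still in flight or unprocessed.
      }
    }
  }
  return true;
}
```

Because every worker evaluates the same gathered rows, all workers reach the same decision. A worker that gets `false` would presumably keep serving requests and repeat the collection later; how that retry is scheduled is outside this sketch.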
pytorchbot added the oncall: distributed label (Add this issue/PR to distributed oncall triage queue) on Sep 27, 2019
mrshenli added a commit that referenced this pull request on Sep 27, 2019
ghstack-source-id: 1d8f567faa72fd2e3cb68a24fff417d292160dd1
Pull Request resolved: #26984
Differential Revision: [D17633456](https://our.internmc.facebook.com/intern/diff/D17633456)
mrshenli added a commit that referenced this pull request on Sep 28, 2019
ghstack-source-id: e5c7dff9964b0015d27b8789158d254ecf771148
Pull Request resolved: #26984
pietern reviewed on Oct 1, 2019

for (int i = 0; i < worldSize; ++i) {
  outputCnts[0].emplace_back(torch::empty({2 * worldSize}, {torch::kInt64}));
}
We should really have an allgather that does this for you.
pietern approved these changes on Oct 2, 2019
Looking good. Two minor comments. Accepting to unblock.
mrshenli added a commit that referenced this pull request on Oct 2, 2019
ghstack-source-id: 3168763f861ee9be461df6a2e310f3a8b76d72ae
Pull Request resolved: #26984
pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request on Nov 27, 2019
Summary: Pull Request resolved: pytorch#26984, closes pytorch#26944
Test Plan: Imported from OSS
Differential Revision: D17633456
Pulled By: mrshenli
fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request on Feb 4, 2020
Summary: Pull Request resolved: pytorch#26984, closes pytorch#26944
Test Plan: Imported from OSS
Differential Revision: D17633456
Pulled By: mrshenli
fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca