Add ProcessGroupAgent termination detection algorithm #26984

Closed
mrshenli wants to merge 6 commits

@mrshenli (Contributor) commented Sep 27, 2019

Stack from ghstack:

closes #26944

In the existing implementation, each worker exits when it sees
no pending send/recv tasks. However, as we add support for nested calls,
one RPC could trigger more RPCs in the UDF (user-defined function) or in
the response callback. As a result, even if a worker does not see any
send/recv tasks right now, that does not mean there won't be any in the future.

In this commit, we add counters for all sent and received
messages between each pair of nodes, and then use allgather to collect
those counters, so that all workers have the same view of the
global state. The workers only exit when all sent messages have been
received and processed.
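
To make the counter check concrete, here is a minimal standalone sketch of the idea, not the code from this PR. It assumes a hypothetical layout in which each worker contributes 2 * worldSize counters (sent-to-peer counts first, then received-from-peer counts), and it uses a plain vector of vectors as a stand-in for the allgather result. The names `Snapshot` and `hasPendingMessages` are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-worker snapshot of 2 * worldSize counters:
//   [0, worldSize)             messages this worker sent to each peer
//   [worldSize, 2 * worldSize) messages this worker received from each peer
using Snapshot = std::vector<int64_t>;

// `gathered[rank]` stands in for the snapshot that an allgather would
// deliver from `rank`; every worker sees the same `gathered` table.
// Returns true if some message was sent but is not yet counted as
// received by its destination.
bool hasPendingMessages(const std::vector<Snapshot>& gathered) {
  const std::size_t worldSize = gathered.size();
  for (std::size_t rank = 0; rank < worldSize; ++rank) {
    // Each snapshot must follow the assumed layout.
    assert(gathered[rank].size() == 2 * worldSize);
  }
  for (std::size_t from = 0; from < worldSize; ++from) {
    for (std::size_t to = 0; to < worldSize; ++to) {
      const int64_t sent = gathered[from][to];
      const int64_t received = gathered[to][worldSize + from];
      if (sent != received) {
        return true;  // at least one message is still in flight
      }
    }
  }
  return false;
}
```

Under these assumptions, a worker would repeat the gather-and-check round until `hasPendingMessages` returns false, since handling a received message may itself trigger new sends.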

Differential Revision: D17633456

@pytorchbot added the oncall: distributed label Sep 27, 2019
mrshenli added a commit that referenced this pull request Sep 27, 2019
ghstack-source-id: 1d8f567faa72fd2e3cb68a24fff417d292160dd1
Pull Request resolved: #26984
mrshenli added a commit that referenced this pull request Sep 28, 2019
ghstack-source-id: e5c7dff9964b0015d27b8789158d254ecf771148
Pull Request resolved: #26984
torch/csrc/distributed/rpc/process_group_agent.cpp (outdated)

for (int i = 0; i < worldSize; ++i) {
  outputCnts[0].emplace_back(torch::empty({2 * worldSize}, {torch::kInt64}));
}

We should really have an allgather that does this for you.

torch/csrc/distributed/rpc/process_group_agent.cpp (outdated)
torch/csrc/distributed/rpc/process_group_agent.h (outdated)
@mrshenli requested a review from pietern October 2, 2019 15:45
@pietern (Contributor) left a comment

Looking good. Two minor comments. Accepting to unblock.

torch/csrc/distributed/rpc/process_group_agent.cpp (outdated)
mrshenli added a commit that referenced this pull request Oct 2, 2019
ghstack-source-id: 3168763f861ee9be461df6a2e310f3a8b76d72ae
Pull Request resolved: #26984
@facebook-github-bot (Contributor)

@mrshenli merged this pull request in 2491dd5.

@facebook-github-bot deleted the gh/mrshenli/14/head branch October 28, 2019 22:17
pdlive215 pushed a commit to pdlive215/pytorch that referenced this pull request Nov 27, 2019
Summary:
Pull Request resolved: pytorch#26984

closes pytorch#26944

Test Plan: Imported from OSS

Differential Revision: D17633456

Pulled By: mrshenli

fbshipit-source-id: 813a155d3b2daf2226612eb17f6c698512e9beca
thiagocrepaldi pushed a commit to thiagocrepaldi/pytorch that referenced this pull request Feb 4, 2020
Labels: Merged, oncall: distributed
5 participants