-
Notifications
You must be signed in to change notification settings - Fork 25k
Lift rpc_timeout to RpcAgent, for other RpcAgents to reuse. #29341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
#29336 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/) [ghstack-poisoned]
#29336 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/) [ghstack-poisoned]
#29336 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/) [ghstack-poisoned]
Pull Request resolved: #29341 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. ghstack-source-id: 93429217 Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/)
#29336 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/) [ghstack-poisoned]
Pull Request resolved: #29341 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. ghstack-source-id: 93433933 Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pritamdamania87 commented in the original PR #29336 (comment) about why there should be 3 timeouts and I agree. From the client's perspective there is no difference between the phases you list. Allowing different timeouts to be used means it's practically impossible to reason about timeouts (or spot them in log output) anymore.
Looking at the contents of the PR, I see two things:
- Lifting
rpcTimeout
from ProcessGroupAgent into RPCAgent and renaming it - Unrelated formatting changes
We can all agree that lifting the parameter to RPCAgent is a good idea. Renaming is a separate issue and assumes we're going to have a bunch of timeouts, so we should discuss this more before going through with it. The formatting changes should be a separate formatting-only PR so it keeps the history usable (or you'll end up investigating why those lines were touched as part of a timeout change).
I propose the following:
- Remove the rename from this PR and make it only move the timeout field to the base class.
- Submit an issue for the rename so we can discuss what you're proposing.
- Submit a separate PR for the formatting changes.
#29336 - Rename the concept from "rpcTimeout" to "globalProcessTimeout", because 1) this timeout is used to limit the server-side request processing time. 2) This timeout is a global setting in contract to per-RPC setting. Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/) [ghstack-poisoned]
Pull Request resolved: #29341 So that other RpcAgent could use this timeout setting as well. ghstack-source-id: 93471330 Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/)
@@ -93,6 +96,11 @@ class TORCH_API RpcAgent { | |||
|
|||
virtual const WorkerInfo& getWorkerInfo(worker_id_t id) const = 0; | |||
|
|||
// Retrieve the timeout for all RPCs. | |||
virtual const std::chrono::milliseconds& getRpcTimeout() const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this needs to be virtual. In which situation a derived class would like to override this value?
Can we make it inline?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing it.
Marking it as inline.
Turned out pybind11 does not require it to be declared in sub-classes. pybind11 is able to look up this member function in the base class.
torch/distributed/rpc/__init__.py
Outdated
@@ -53,7 +53,10 @@ def init_model_parallel( | |||
self_rank (int): a globally unique id/rank of this node. | |||
init_method(str): backend specific init arguments. | |||
num_send_recv_threads(int): Number of threads for send/recv work. | |||
rpc_timeout (datetime.timedelta): Timeout for RPCs. Defaults to 10 seconds. | |||
rpc_timeout (datetime.timedelta): | |||
Global timeout for server to process an RPC request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before we made the decision on whether we should have finer grain timeout (server, client, etc.), shall we avoid mentioning server here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Removing it for now.
&RpcAgent::sync, | ||
"sync", &RpcAgent::sync, py::call_guard<py::gil_scoped_release>()) | ||
.def( | ||
"_get_rpc_timeout", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, if the RpcAgent will handle timeout internally, why do we need to expose this API in Python and C++? Is this only for test and debugging?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what @rohan-varma mentioned was since we allow users to set timeout, we need to have a getter to allow users to inspect this value as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Approve to unblock. Please make sure tests pass before landing.
#29336 So that other RpcAgent could use this timeout setting as well. Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/) [ghstack-poisoned]
Pull Request resolved: #29341 So that other RpcAgent could use this timeout setting as well. ghstack-source-id: 93481902 Differential Revision: [D5681951](https://our.internmc.facebook.com/intern/diff/D5681951/)
This pull request has been merged in e66626a. |
@xush6528 Thanks! |
@@ -78,6 +79,11 @@ def _init_rpc( | |||
raise RuntimeError("RPC is already initialized") | |||
|
|||
# Initialize RPC. | |||
if not isinstance(rpc_timeout, datetime.timedelta): | |||
raise RuntimeError( | |||
"`rpc_timeout` must be a `datetime.timedelta`." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think we should have backticks (`) in exception messages. This would be displayed on the user's terminal and there is no markdown there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not expecting a markdown format. But want to make clear that it's a python variable. Any other suggestion to make it distinct? or you think plain text is fine?
@@ -962,13 +962,11 @@ def test_call_method_on_rref(self): | |||
|
|||
self.assertEqual(result, sum(vals)) | |||
|
|||
@requires_process_group_agent("PROCESS_GROUP rpc backend specific test, skip") | |||
@dist_init | |||
def test_get_default_rpc_timeout(self): | |||
timeout = rpc.get_rpc_timeout() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the discussion in #28926, we should directly add docstrings to the pybind method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pritamdamania87 currently, ProcessGroupAgent
is not exposed in the docs since I assumed that this is a private class and there's no reason for the user to directly access it (we provide higher level APIs for everything that needs to be done). If we wanted to move this doc to pybind, we'd have to expose documentation for ProcessGroupAgent
, since get_rpc_timeout
is a method on that class - is this something we'd want to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created issue #29628 for tracking thihs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this method from RpcAgent
to torch.distributed.rpc
and the implementation uses the default RpcAgent?
Stack from ghstack:
#29336
So that other RpcAgent could use this timeout setting as well.
Differential Revision: D5681951