-
Notifications
You must be signed in to change notification settings - Fork 24.9k
Enable RRef timeout for tensorpipe #39531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/) [ghstack-poisoned]
Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/) ghstack-source-id: 105273154 Pull Request resolved: #39531
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Please get a stamp from @lw too
@lw Do you mind taking a look at this diff? Thanks! |
@@ -23,6 +24,8 @@ const std::string kNumIdleThreads = "agent.num_idle_threads"; | |||
const std::string kClientActiveCalls = "agent.client_active_calls"; | |||
const std::string kServerActiveCalls = "agent.server_active_calls"; | |||
const std::string kServerActiveAsyncCalls = "agent.server_active_async_calls"; | |||
const std::string kTensorPipeRPCTimeoutErrorStr = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: kTensorPipeRpcTimeoutErrorStr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no need for the TensorPipe
prefix, this is in an anonymous namespace, it's local to this file
@@ -418,7 +421,8 @@ std::shared_ptr<FutureMessage> TensorPipeAgent::send( | |||
{ | |||
std::unique_lock<std::mutex> lock(timeoutMapMutex_); | |||
auto& timeoutFuturesVector = timeoutMap_[expirationTime]; | |||
timeoutFuturesVector.emplace_back(futureResponseMessage); | |||
timeoutFuturesVector.emplace_back( | |||
std::make_pair(futureResponseMessage, timeout)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you might not need to call std::make_pair
with emplace_back
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch!
@@ -2833,7 +2833,6 @@ def test_return_future(self): | |||
return_future, | |||
) | |||
|
|||
@_skip_if_tensorpipe_agent | |||
@dist_init |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Has this been stress-tested? Just checking to ensure we don't introduce more flaky TensorPipe agent tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran it with 500 stresstests and they all passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks!
Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/) [ghstack-poisoned]
Pull Request resolved: #39531 Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. ghstack-source-id: 105377025 Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/)
Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/) [ghstack-poisoned]
💊 CI failures summary and remediationsAs of commit 5661c87 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group. This comment has been revised 3 times. |
Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/) [ghstack-poisoned]
Pull Request resolved: #39531 Enables RRef timeout support in TP agent by having TP agent mark timeout errors with `makeRPCError` API. Also does some refactoring so TP agent can print out the timeout for each future that has timed out. ghstack-source-id: 105461555 Differential Revision: [D21881475](https://our.internmc.facebook.com/intern/diff/D21881475/)
This pull request has been merged in e033db0. |
Stack from ghstack:
Enables RRef timeout support in TP agent by having TP agent mark
timeout errors with
makeRPCError
API. Also does some refactoring so TP agentcan print out the timeout for each future that has timed out.
Differential Revision: D21881475