Fix CUDA RPC Stream Synchronization #50949
When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one.
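To illustrate why chaining through `then` matters, here is a minimal, self-contained sketch. It is a toy model, not the PyTorch implementation: `ToyFuture`, `ToyCudaFuture`, `createInstance`, `chainManually`, and `chainWithThen` are all hypothetical names invented for this example. The assumption it encodes is the one stated in the description: a `then`-style API lets the parent future decide what kind of child to create, whereas a hand-constructed result future is always the plain kind.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a future; NOT the real PyTorch Future class.
class ToyFuture {
 public:
  virtual ~ToyFuture() = default;

  // Reports which kind of future this is so the demo can inspect it.
  virtual std::string kind() const { return "plain"; }

  void markCompleted(int value) {
    value_ = value;
    completed_ = true;
    // Take ownership of the pending callbacks, then run them.
    std::vector<std::function<void()>> pending;
    pending.swap(callbacks_);
    for (auto& cb : pending) cb();
  }

  int value() const {
    assert(completed_);
    return value_;
  }

  void addCallback(std::function<void()> cb) {
    if (completed_) {
      cb();
    } else {
      callbacks_.push_back(std::move(cb));
    }
  }

  // Chains a child future whose kind is chosen by *this (hypothetical hook).
  std::shared_ptr<ToyFuture> then(std::function<int(int)> fn) {
    std::shared_ptr<ToyFuture> child = createInstance();
    // Capturing `this` is safe here: callbacks only run from member functions.
    addCallback([this, child, fn]() { child->markCompleted(fn(value_)); });
    return child;
  }

 protected:
  virtual std::shared_ptr<ToyFuture> createInstance() const {
    return std::make_shared<ToyFuture>();
  }

 private:
  int value_ = 0;
  bool completed_ = false;
  std::vector<std::function<void()>> callbacks_;
};

// Toy stand-in for a stream-aware future; real stream handling is omitted.
class ToyCudaFuture : public ToyFuture {
 public:
  std::string kind() const override { return "cuda"; }

 protected:
  std::shared_ptr<ToyFuture> createInstance() const override {
    return std::make_shared<ToyCudaFuture>();
  }
};

// Old pattern: hand-construct the result future, losing the parent's kind.
std::shared_ptr<ToyFuture> chainManually(const std::shared_ptr<ToyFuture>& parent) {
  auto result = std::make_shared<ToyFuture>();
  std::weak_ptr<ToyFuture> wp = parent;
  parent->addCallback([result, wp]() {
    auto p = wp.lock();
    if (!p) return;  // parent already destroyed; nothing to forward
    result->markCompleted(p->value() + 1);
  });
  return result;
}

// New pattern: let the parent create the chained future.
std::shared_ptr<ToyFuture> chainWithThen(const std::shared_ptr<ToyFuture>& parent) {
  return parent->then([](int v) { return v + 1; });
}
```

In this sketch both chains deliver the same value, but only the one built with `then` inherits the parent's kind, which is the property the fix relies on.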
 std::weak_ptr<JitFuture> wp = messageJitFuture;
-messageJitFuture->addCallback(
-    at::wrapPropagateTLSState<void>([pyJitFuture, wp]() {
+return messageJitFuture->then(
+    at::wrapPropagateTLSState<IValue>([wp]() {
       auto future = wp.lock();
       if (future->hasError()) {
-        pyJitFuture->setError(future->exception_ptr());
+        std::rethrow_exception(future->exception_ptr());
       } else {
-        pyJitFuture->markCompleted(
-            toPyIValue(*future->value().toCustomClass<Message>()));
+        return toPyIValue(*future->value().toCustomClass<Message>());
       }
-    }));
-
-return pyJitFuture;
+    }),
+    PyObjectType::get());
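The hunk above also changes how errors travel: instead of calling `setError` on a separately created future, the callback rethrows and lets the chained future capture the exception. A minimal sketch of that pattern follows, under the assumption that a `then`-style API catches whatever its callback throws and stores it on the child. `ErrFuture`, `convert`, and `errorMessage` are hypothetical names for this example and do not correspond to PyTorch's classes.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy future holding either an int or an exception; NOT PyTorch's Future.
class ErrFuture {
 public:
  void markCompleted(int v) {
    value_ = v;
    finish();
  }
  void setError(std::exception_ptr e) {
    error_ = std::move(e);
    finish();
  }
  bool hasError() const { return error_ != nullptr; }
  std::exception_ptr exception_ptr() const { return error_; }
  int value() const {
    assert(done_ && !hasError());
    return value_;
  }

  void addCallback(std::function<void()> cb) {
    if (done_) {
      cb();
    } else {
      callbacks_.push_back(std::move(cb));
    }
  }

  // Chains a child; anything thrown by fn becomes the child's error.
  std::shared_ptr<ErrFuture> then(std::function<int()> fn) {
    auto child = std::make_shared<ErrFuture>();
    addCallback([child, fn]() {
      int result = 0;
      try {
        result = fn();
      } catch (...) {
        child->setError(std::current_exception());
        return;
      }
      child->markCompleted(result);
    });
    return child;
  }

 private:
  void finish() {
    done_ = true;
    std::vector<std::function<void()>> pending;
    pending.swap(callbacks_);
    for (auto& cb : pending) cb();
  }

  int value_ = 0;
  bool done_ = false;
  std::exception_ptr error_;
  std::vector<std::function<void()>> callbacks_;
};

// Mirrors the shape of the new code: rethrow on error, return the value otherwise.
std::shared_ptr<ErrFuture> convert(const std::shared_ptr<ErrFuture>& message) {
  std::weak_ptr<ErrFuture> wp = message;
  return message->then([wp]() {
    auto future = wp.lock();
    if (!future) throw std::runtime_error("parent future destroyed");
    if (future->hasError()) std::rethrow_exception(future->exception_ptr());
    return future->value() * 2;  // placeholder for the real conversion step
  });
}

// Extracts the message of a stored error, or "" when there is none.
std::string errorMessage(const ErrFuture& f) {
  if (!f.hasError()) return "";
  try {
    std::rethrow_exception(f.exception_ptr());
  } catch (const std::exception& e) {
    return e.what();
  } catch (...) {
    return "unknown error";
  }
}
```

The doubling in `convert` only stands in for a conversion step; the point of the sketch is that the callback no longer needs a handle to the result future to report failure.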
If possible, it would be nice to add some unit tests that would consistently fail without this patch.
It's hard to make it fail consistently, but #50839 fails pretty frequently without this fix. I am not sure whether this is the only bug, but with the fix applied I ran it a few dozen times locally and the error didn't occur.
Good catch, thanks for fixing!
fixes #50881
fixes #50839
Differential Revision: D26020458