Fix CUDA RPC Stream Synchronization #50949

Closed

mrshenli wants to merge 4 commits

Conversation

@mrshenli (Contributor) commented on Jan 22, 2021

Stack from ghstack:

When converting an RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, streams were not
synchronized when calling `rpc_async(...).wait()`. This commit uses the
`Future::then` API to create the chained Future, which creates a
CUDAFuture when the existing Future is a CUDA one.

fixes #50881
fixes #50839

Differential Revision: [D26020458](https://our.internmc.facebook.com/intern/diff/D26020458)
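
For intuition, here is a minimal sketch of the mechanism the description relies on. It is heavily simplified and uses hypothetical names (`createInstance`, `CudaAwareFuture`); it is not PyTorch's actual `Future`/`CUDAFuture` implementation. The point it illustrates: when `then` builds its child future through a virtual hook, a CUDA-aware future chains into another CUDA-aware future whose `wait()` still performs stream synchronization.

```cpp
// Minimal sketch with hypothetical names -- not PyTorch's actual classes.
// Key idea: then() builds its child future via a virtual createInstance()
// hook, so a CUDA-aware future chains into another CUDA-aware future.
#include <functional>
#include <iostream>
#include <memory>
#include <vector>

struct Future {
  virtual ~Future() = default;

  // Subclasses override this so chaining preserves the concrete type
  // (and therefore the subclass's wait() semantics).
  virtual std::shared_ptr<Future> createInstance() {
    return std::make_shared<Future>();
  }

  // Plain future: nothing extra to synchronize on.
  virtual void wait() {}

  void addCallback(std::function<void(int)> cb) {
    callbacks_.push_back(std::move(cb));
  }

  std::shared_ptr<Future> then(std::function<int(int)> cb) {
    auto child = createInstance();  // CUDA-aware parent -> CUDA-aware child
    addCallback([child, cb](int v) { child->markCompleted(cb(v)); });
    return child;
  }

  void markCompleted(int v) {
    value_ = v;
    for (auto& cb : callbacks_) cb(v);
  }

  int value_ = 0;
  std::vector<std::function<void(int)>> callbacks_;
};

struct CudaAwareFuture : Future {
  std::shared_ptr<Future> createInstance() override {
    return std::make_shared<CudaAwareFuture>();
  }
  void wait() override {
    // A real CUDA-aware future would make the caller's current stream wait
    // on CUDA events recorded when the value's tensors were produced.
    std::cout << "CudaAwareFuture::wait(): synchronizing streams\n";
  }
};

int main() {
  auto rpcFut = std::make_shared<CudaAwareFuture>();
  auto chained = rpcFut->then([](int v) { return v + 1; });
  rpcFut->markCompleted(41);
  chained->wait();  // dispatches to CudaAwareFuture::wait(): streams sync
  std::cout << chained->value_ << "\n";  // 42
}
```

Under this model, any code path that constructs the child future by hand as a plain `Future` silently loses that synchronization, which is the bug this PR describes.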

@facebook-github-bot (Contributor) commented on Jan 22, 2021

💊 CI failures summary and remediations

As of commit 01f04f7 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

@mrshenli requested a review from lw on January 22, 2021 at 17:08
mrshenli added a commit that referenced this pull request Jan 22, 2021
Fix CUDA RPC Stream Synchronization
ghstack-source-id: 56c79004a6250bb608d473300260d181a3b11cc9
Pull Request resolved: #50949
mrshenli added a commit that referenced this pull request Jan 22, 2021
Fix CUDA RPC Stream Synchronization
ghstack-source-id: db34a5763a63f1d660c61552e5d3c71ae52e5875
Pull Request resolved: #50949
Comment on lines 141 to +151

       std::weak_ptr<JitFuture> wp = messageJitFuture;
-      messageJitFuture->addCallback(
-          at::wrapPropagateTLSState<void>([pyJitFuture, wp]() {
+      return messageJitFuture->then(
+          at::wrapPropagateTLSState<IValue>([wp]() {
             auto future = wp.lock();
             if (future->hasError()) {
-              pyJitFuture->setError(future->exception_ptr());
+              std::rethrow_exception(future->exception_ptr());
             } else {
-              pyJitFuture->markCompleted(
-                  toPyIValue(*future->value().toCustomClass<Message>()));
+              return toPyIValue(*future->value().toCustomClass<Message>());
             }
-          }));
-
-      return pyJitFuture;
+          }),
+          PyObjectType::get());
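
Reading the diff against the sketch earlier: the removed lines complete a hand-constructed plain `JitFuture` (`pyJitFuture`) from inside an `addCallback` callback, so the future handed back to Python carries no CUDA stream bookkeeping; the added lines let `messageJitFuture->then(...)` create the child future itself, which yields a CUDA-aware future whenever the message future is one. In terms of the hypothetical sketch classes, the pre-fix pattern looks like this:

```cpp
// Pre-fix pattern (hypothetical sketch; drop into the main() of the example
// above): the child future is constructed by hand as a plain Future, so its
// wait() performs no stream synchronization even though the parent is
// CUDA-aware.
auto msgFut = std::make_shared<CudaAwareFuture>();
auto pyFut = std::make_shared<Future>();  // plain child: CUDA-awareness lost
msgFut->addCallback([pyFut](int v) { pyFut->markCompleted(v + 1); });
msgFut->markCompleted(41);
pyFut->wait();  // base Future::wait(): nothing is synchronized
```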
Contributor:

If possible, it would be nice to add some unit tests that would consistently fail without this patch.

mrshenli (Contributor, Author):

It's hard to make it fail consistently, but #50839 fails pretty frequently without this fix. I am not sure whether this is the only bug, but with the fix applied I ran it a few dozen times locally and the error did not occur.

@lw (Contributor) left a comment:

Good catch, thanks for fixing!

@facebook-github-bot commented: @mrshenli merged this pull request in f0e72e5.

@facebook-github-bot deleted the gh/mrshenli/289/head branch on January 26, 2021 at 15:21.
Labels: cla signed, Merged, oncall: distributed

4 participants