DISABLED test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn) #50881
Labels
module: flaky-tests
Problem is a flaky test in CI
module: rpc
Related to RPC, distributed autograd, RRef, and distributed optimizer
oncall: distributed
Add this issue/PR to distributed oncall triage queue
triaged
This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
https://app.circleci.com/pipelines/github/pytorch/pytorch/262382/workflows/f23c73a1-fcd2-46c5-a28c-cf6827d744e2/jobs/10289993/steps
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd
Comments
mrshenli added the oncall: distributed, triaged, module: flaky-tests, and module: rpc labels on Jan 21, 2021
hmmm, this error message does not make sense to me:

This should be a stream synchronization issue; see the tensor comparison in pytorch/torch/testing/__init__.py, lines 129 to 160 in 4cca083.
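For context, here is a minimal sketch of the suspected failure mode (a hypothetical repro, not the disabled test itself): a result tensor is produced on a side CUDA stream and then compared on the default stream. Without an explicit synchronization between the two streams, the comparison can observe stale values and fail with a confusing error message.

```python
import torch

# Hypothetical sketch: produce a result on a side stream, then compare it
# on the default stream. The wait_stream() call is the synchronization
# that, if missing, lets the comparison race with the in-flight kernel.
x = torch.ones(1 << 20, device="cuda")
side = torch.cuda.Stream()
with torch.cuda.stream(side):
    y = x * 2  # enqueued on the side stream, runs asynchronously

torch.cuda.current_stream().wait_stream(side)  # without this, the check may see garbage
assert torch.equal(y, torch.full_like(y, 2.0))
```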
mrshenli added a series of commits referencing this issue on Jan 22, 2021 (pull request #50949, Differential Revision [D26020458](https://our.internmc.facebook.com/intern/diff/D26020458)), all carrying the same message:

When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams were not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one. Fixes #50881, fixes #50839.
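For illustration, here is a minimal sketch of the user-facing pattern the fix addresses (the worker names, device map, and `double` helper are assumptions for this sketch, not taken from the test): an `rpc_async` call returns a CUDA tensor, and `wait()` must synchronize with the streams used to receive that tensor before the caller reads it.

```python
import os
import torch
import torch.distributed.rpc as rpc
from torch.distributed.rpc import TensorPipeRpcBackendOptions

def double(t):
    return t * 2

os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")

# Hypothetical device map: tensors on the caller's cuda:0 are placed on
# the callee's cuda:1.
opts = TensorPipeRpcBackendOptions()
opts.set_device_map("worker1", {0: 1})

# Rank-0 half of a two-process setup; a peer must run
# init_rpc("worker1", rank=1, world_size=2) for this to proceed.
rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=opts)

x = torch.ones(2, 2, device="cuda:0")
# Before the fix, the Future chained while converting the RPC Message to
# Python objects was not a CUDAFuture, so wait() could return before the
# receiving streams finished writing the CUDA result.
ret = rpc.rpc_async("worker1", double, args=(x,)).wait()

rpc.shutdown()
```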