DISABLED test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn) #50881

Closed
mrshenli opened this issue Jan 21, 2021 · 2 comments
Assignees: mrshenli
Labels: module: flaky-tests, module: rpc, oncall: distributed, triaged

Comments


mrshenli commented Jan 21, 2021

https://app.circleci.com/pipelines/github/pytorch/pytorch/262382/workflows/f23c73a1-fcd2-46c5-a28c-cf6827d744e2/jobs/10289993/steps

Jan 21 11:15:43 ======================================================================
Jan 21 11:15:43 ERROR [6.137s]: test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn)
Jan 21 11:15:43 ----------------------------------------------------------------------
Jan 21 11:15:43 Traceback (most recent call last):
Jan 21 11:15:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 282, in wrapper
Jan 21 11:15:43     self._join_processes(fn)
Jan 21 11:15:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 399, in _join_processes
Jan 21 11:15:43     self._check_return_codes(elapsed_time)
Jan 21 11:15:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 435, in _check_return_codes
Jan 21 11:15:43     raise RuntimeError(error)
Jan 21 11:15:43 RuntimeError: Processes 1 exited with error code 10
Jan 21 11:15:43 
Jan 21 11:15:43 ----------------------------------------------------------------------
Jan 21 11:15:08   test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn) ... ERROR:root:Caught exception: 
Jan 21 11:15:08 Traceback (most recent call last):
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 285, in wrapper
Jan 21 11:15:08     fn()
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 99, in wrapper
Jan 21 11:15:08     return func(*args, **kwargs)
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4922, in test_device_maps_multi_gpu_self
Jan 21 11:15:08     self._test_device_maps_multi_gpu(dst)
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4910, in _test_device_maps_multi_gpu
Jan 21 11:15:08     self.assertEqual(rets[0], (torch.zeros(2) + torch.ones(2)).to(1))
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1165, in assertEqual
Jan 21 11:15:08     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/unittest/case.py", line 682, in assertTrue
Jan 21 11:15:08     raise self.failureException(msg)
Jan 21 11:15:08 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 2) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (1.0 vs. 1.0), which occurred at index 0.
Jan 21 11:15:08 exiting process with exit code: 10
Jan 21 11:15:09 [W tensorpipe_agent.cpp:648] RPC agent for worker0 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown)
Jan 21 11:15:09 Process 1 terminated with exit code 10, terminating remaining processes.
Jan 21 11:15:09 ERROR (6.137s)
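
For context, a rough and partly hypothetical reconstruction of what the failing assertion exercises (the real test is `_test_device_maps_multi_gpu` in `torch/testing/_internal/distributed/rpc/rpc_test.py`; the device map values, worker names, and rendezvous settings below are illustrative): the caller sends CUDA tensors through a TensorPipe device map, the callee adds them on its mapped GPU, and the caller compares the returned CUDA tensor against the expected sum.

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    options = rpc.TensorPipeRpcBackendOptions()
    if rank == 0:
        # Illustrative map: caller's cuda:0 <-> callee's cuda:1.
        options.set_device_map("worker1", {0: 1, 1: 0})
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size,
                 rpc_backend_options=options)
    if rank == 0:
        ret = rpc.rpc_sync("worker1", torch.add,
                           args=(torch.zeros(2).to(0), torch.ones(2).to(0)))
        # Mirrors the assertion in the traceback above.
        assert torch.equal(ret.cpu(), torch.zeros(2) + torch.ones(2))
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)  # requires 2 GPUs
```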

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd

mrshenli added the oncall: distributed, triaged, module: flaky-tests, and module: rpc labels on Jan 21, 2021
mrshenli self-assigned this on Jan 21, 2021

mrshenli commented Jan 21, 2021

hmmm, this error message does not make sense to me:

found 0 element(s) (out of 2) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (1.0 vs. 1.0), which occurred at index 0.

@mrshenli

This should be a stream synchronization issue: the allclose comparison ran before the ops enqueued on the CUDA stream had finished executing. That would also explain the confusing debug message above: by the time the debug info was gathered, the pending ops had completed, so every element compared as equal and the greatest difference was 0.0. For reference, this is the tensor comparison code that produces that message:

```python
# All other comparisons use torch.allclose directly
if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
    return (True, None)

# Gathers debug info for failed float tensor comparison
# NOTE: converts to float64 to best represent differences
a_flat = a.to(torch.float64).flatten()
b_flat = b.to(torch.float64).flatten()
diff = torch.abs(a_flat - b_flat)

# Masks close values
# NOTE: this avoids (inf - inf) oddities when computing the difference
close = torch.isclose(a_flat, b_flat, rtol, atol, equal_nan)
diff[close] = 0
nans = torch.isnan(diff)
num_nans = nans.sum()
outside_range = (diff > (atol + rtol * torch.abs(b_flat))) | (diff == math.inf)
count_outside_range = torch.sum(outside_range, dtype=torch.long)
greatest_diff_index = torch.argmax(diff)
debug_msg = ("With rtol={0} and atol={1}, found {2} element(s) (out of {3}) whose "
             "difference(s) exceeded the margin of error (including {4} nan comparisons). "
             "The greatest difference was {5} ({6} vs. {7}), which "
             "occurred at index {8}.".format(rtol, atol,
                                             count_outside_range + num_nans,
                                             a.numel(),
                                             num_nans,
                                             diff[greatest_diff_index],
                                             a_flat[greatest_diff_index],
                                             b_flat[greatest_diff_index],
                                             _unravel_index(greatest_diff_index, a.shape)))
return (False, debug_msg)
```
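
A minimal, hypothetical sketch of the suspected race (not the actual RPC code path): GPU work is enqueued on a side stream, and a comparison on the current stream can read the tensor before that work has finished. Synchronizing the streams before comparing avoids the spurious mismatch.

```python
import torch

side = torch.cuda.Stream()
out = torch.zeros(2, device="cuda")

with torch.cuda.stream(side):
    # Asynchronous GPU work producing the value under test.
    out.add_(torch.ones(2, device="cuda"))

# Without this, torch.allclose below runs on the current (default) stream and
# may observe `out` before the side stream has finished writing it.
torch.cuda.current_stream().wait_stream(side)
assert torch.allclose(out, torch.ones(2, device="cuda"))
```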

mrshenli added a commit that referenced this issue on Jan 22, 2021

When converting an RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams were
not synchronized when calling `rpc_async(...).wait()`. This commit
uses the `Future::then` API to create the chained Future, which will
create a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

Differential Revision: [D26020458](https://our.internmc.facebook.com/intern/diff/D26020458)

Pull Request resolved: #50949
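
The fix itself lives in the C++ Future machinery, but the pattern it describes can be illustrated with the public torch.futures API: derive the converted future via `.then()` on the original future instead of creating a fresh Future and setting its value, so the derived future inherits whatever (CUDA-aware) synchronization the original carries. This is a sketch under that analogy, not the actual implementation; `to_python` is a hypothetical stand-in for the Message-to-Python conversion.

```python
import torch
from torch.futures import Future

def to_python(value):
    # Hypothetical stand-in for deserializing an RPC Message into Python objects.
    return value

# `raw_fut` plays the role of the future returned by the RPC agent.
raw_fut = Future()

# Chained future created from the original one; per the fix, when the original
# future is CUDA-aware, the future returned by then() is CUDA-aware too.
py_fut = raw_fut.then(lambda fut: to_python(fut.wait()))

raw_fut.set_result(torch.zeros(2) + torch.ones(2))
print(py_fut.wait())  # tensor([1., 1.])
```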