DISABLED test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn) #50881

Closed
mrshenli opened this issue Jan 21, 2021 · 2 comments
Assignees: mrshenli
Labels: module: flaky-tests, module: rpc, oncall: distributed, triaged

Comments


mrshenli commented Jan 21, 2021

https://app.circleci.com/pipelines/github/pytorch/pytorch/262382/workflows/f23c73a1-fcd2-46c5-a28c-cf6827d744e2/jobs/10289993/steps

Jan 21 11:15:43 ======================================================================
Jan 21 11:15:43 ERROR [6.137s]: test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn)
Jan 21 11:15:43 ----------------------------------------------------------------------
Jan 21 11:15:43 Traceback (most recent call last):
Jan 21 11:15:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 282, in wrapper
Jan 21 11:15:43     self._join_processes(fn)
Jan 21 11:15:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 399, in _join_processes
Jan 21 11:15:43     self._check_return_codes(elapsed_time)
Jan 21 11:15:43   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 435, in _check_return_codes
Jan 21 11:15:43     raise RuntimeError(error)
Jan 21 11:15:43 RuntimeError: Processes 1 exited with error code 10
Jan 21 11:15:43 
Jan 21 11:15:43 ----------------------------------------------------------------------
Jan 21 11:15:08   test_device_maps_multi_gpu_self (__main__.TensorPipeTensorPipeAgentRpcTestWithSpawn) ... ERROR:root:Caught exception: 
Jan 21 11:15:08 Traceback (most recent call last):
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 285, in wrapper
Jan 21 11:15:08     fn()
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_distributed.py", line 99, in wrapper
Jan 21 11:15:08     return func(*args, **kwargs)
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4922, in test_device_maps_multi_gpu_self
Jan 21 11:15:08     self._test_device_maps_multi_gpu(dst)
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/distributed/rpc/rpc_test.py", line 4910, in _test_device_maps_multi_gpu
Jan 21 11:15:08     self.assertEqual(rets[0], (torch.zeros(2) + torch.ones(2)).to(1))
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/site-packages/torch/testing/_internal/common_utils.py", line 1165, in assertEqual
Jan 21 11:15:08     super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
Jan 21 11:15:08   File "/opt/conda/lib/python3.6/unittest/case.py", line 682, in assertTrue
Jan 21 11:15:08     raise self.failureException(msg)
Jan 21 11:15:08 AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 0 element(s) (out of 2) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (1.0 vs. 1.0), which occurred at index 0.
Jan 21 11:15:08 exiting process with exit code: 10
Jan 21 11:15:09 [W tensorpipe_agent.cpp:648] RPC agent for worker0 encountered error when reading incoming request from worker1: EOF: end of file (this is expected to happen during shutdown)
Jan 21 11:15:09 Process 1 terminated with exit code 10, terminating remaining processes.
Jan 21 11:15:09 ERROR (6.137s)
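
For context, a rough and partly hypothetical reconstruction of what the failing assertion exercises (the real test is `_test_device_maps_multi_gpu` in `torch/testing/_internal/distributed/rpc/rpc_test.py`; the device map values, worker names, and rendezvous settings below are illustrative): the caller sends CUDA tensors through a TensorPipe device map, the callee adds them on its mapped GPU, and the caller compares the returned CUDA tensor against the expected sum.

```python
import os
import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    options = rpc.TensorPipeRpcBackendOptions()
    if rank == 0:
        # Illustrative map: caller's cuda:0 <-> callee's cuda:1.
        options.set_device_map("worker1", {0: 1, 1: 0})
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size,
                 rpc_backend_options=options)
    if rank == 0:
        ret = rpc.rpc_sync("worker1", torch.add,
                           args=(torch.zeros(2).to(0), torch.ones(2).to(0)))
        # Mirrors the assertion in the traceback above.
        assert torch.equal(ret.cpu(), torch.zeros(2) + torch.ones(2))
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2, join=True)  # requires 2 GPUs
```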

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @rohan-varma @jjlilley @osalpekar @jiayisuse @agolynski @SciPioneer @H-Huang @mrzzd

mrshenli added the oncall: distributed, triaged, module: flaky-tests, and module: rpc labels on Jan 21, 2021
mrshenli self-assigned this on Jan 21, 2021

mrshenli commented Jan 21, 2021

hmmm, this error message does not make sense to me:

found 0 element(s) (out of 2) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.0 (1.0 vs. 1.0), which occurred at index 0.

@mrshenli

This should be a stream synchronization issue: the allclose comparison ran before the ops enqueued on the CUDA stream had finished executing. That would also explain the confusing debug message above: by the time the debug info was gathered, the pending ops had completed, so every element compared as equal and the greatest difference was 0.0. For reference, this is the tensor comparison code that produces that message:

```python
# All other comparisons use torch.allclose directly
if torch.allclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan):
    return (True, None)

# Gathers debug info for failed float tensor comparison
# NOTE: converts to float64 to best represent differences
a_flat = a.to(torch.float64).flatten()
b_flat = b.to(torch.float64).flatten()
diff = torch.abs(a_flat - b_flat)

# Masks close values
# NOTE: this avoids (inf - inf) oddities when computing the difference
close = torch.isclose(a_flat, b_flat, rtol, atol, equal_nan)
diff[close] = 0
nans = torch.isnan(diff)
num_nans = nans.sum()
outside_range = (diff > (atol + rtol * torch.abs(b_flat))) | (diff == math.inf)
count_outside_range = torch.sum(outside_range, dtype=torch.long)
greatest_diff_index = torch.argmax(diff)
debug_msg = ("With rtol={0} and atol={1}, found {2} element(s) (out of {3}) whose "
             "difference(s) exceeded the margin of error (including {4} nan comparisons). "
             "The greatest difference was {5} ({6} vs. {7}), which "
             "occurred at index {8}.".format(rtol, atol,
                                             count_outside_range + num_nans,
                                             a.numel(),
                                             num_nans,
                                             diff[greatest_diff_index],
                                             a_flat[greatest_diff_index],
                                             b_flat[greatest_diff_index],
                                             _unravel_index(greatest_diff_index, a.shape)))
return (False, debug_msg)
```
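
A minimal, hypothetical sketch of the suspected race (not the actual RPC code path): GPU work is enqueued on a side stream, and a comparison on the current stream can read the tensor before that work has finished. Synchronizing the streams before comparing avoids the spurious mismatch.

```python
import torch

side = torch.cuda.Stream()
out = torch.zeros(2, device="cuda")

with torch.cuda.stream(side):
    # Asynchronous GPU work producing the value under test.
    out.add_(torch.ones(2, device="cuda"))

# Without this, torch.allclose below runs on the current (default) stream and
# may observe `out` before the side stream has finished writing it.
torch.cuda.current_stream().wait_stream(side)
assert torch.allclose(out, torch.ones(2, device="cuda"))
```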

mrshenli added a commit that referenced this issue on Jan 22, 2021

When converting an RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams were
not synchronized when calling `rpc_async(...).wait()`. This commit
uses the `Future::then` API to create the chained Future, which will
create a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839

Differential Revision: [D26020458](https://our.internmc.facebook.com/intern/diff/D26020458)

Pull Request resolved: #50949
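
The fix itself lives in the C++ Future machinery, but the pattern it describes can be illustrated with the public torch.futures API: derive the converted future via `.then()` on the original future instead of creating a fresh Future and setting its value, so the derived future inherits whatever (CUDA-aware) synchronization the original carries. This is a sketch under that analogy, not the actual implementation; `to_python` is a hypothetical stand-in for the Message-to-Python conversion.

```python
import torch
from torch.futures import Future

def to_python(value):
    # Hypothetical stand-in for deserializing an RPC Message into Python objects.
    return value

# `raw_fut` plays the role of the future returned by the RPC agent.
raw_fut = Future()

# Chained future created from the original one; per the fix, when the original
# future is CUDA-aware, the future returned by then() is CUDA-aware too.
py_fut = raw_fut.then(lambda fut: to_python(fut.wait()))

raw_fut.set_result(torch.zeros(2) + torch.ones(2))
print(py_fut.wait())  # tensor([1., 1.])
```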