
Cuda RPC error when using then() #56244

Open

rohan-varma opened this issue Apr 16, 2021 · 6 comments
Labels
module: tensorpipe Related to Tensorpipe RPC Agent oncall: distributed Add this issue/PR to distributed oncall triage queue

Comments

@rohan-varma (Member)

rohan-varma commented Apr 16, 2021

🐛 Bug

The following results in an error on master (add into rpc/rpc_test.py):

def get_myclass():
    # MyClass is a plain Python (non-TorchScript) class already defined in rpc_test.py
    return MyClass(4)

    # the test below is added as a method of the CUDA RPC test class:
    @skip_if_lt_x_gpu(1)
    def test_rpc_cuda_then(self):
        dst = worker_name((self.rank + 1) % self.world_size)
        options = self.rpc_backend_options
        options.set_device_map(dst, {0: 0})

        rpc.init_rpc(
            name=worker_name(self.rank),
            backend=self.rpc_backend,
            rank=self.rank,
            world_size=self.world_size,
            rpc_backend_options=options,
        )

        if self.rank == 1:
            fut = rpc.rpc_async(dst, get_myclass, args=()).then(lambda res: res)
            print(fut.wait())

        rpc.shutdown()

Stacktrace:

======================================================================
ERROR: test_rpc_cuda_then (test_tensorpipe_agent.TensorPipeTensorPipeAgentCudaRpcTestWithSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/common_distributed.py", line 322, in wrapper
    self._join_processes(fn)
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/common_distributed.py", line 515, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/common_distributed.py", line 558, in _check_return_codes
    raise RuntimeError(error)
RuntimeError: Process 1 exited with error code 10 and exception:
Traceback (most recent call last):
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/common_distributed.py", line 441, in run_test
    getattr(self, test_name)()
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/common_distributed.py", line 324, in wrapper
    fn()
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/common_distributed.py", line 89, in wrapper
    return func(*args, **kwargs)
  File "/data/users/rvarm1/fbsource/fbcode/buck-out/dev/gen/caffe2/test/distributed/rpc/cuda/tensorpipe_agent_cuda#binary,link-tree/torch/testing/_internal/distributed/rpc/rpc_test.py", line 5605, in test_rpc_cuda_then
    print(fut.wait())
RuntimeError: match.success()INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/ivalue.cpp":221, please report a bug to PyTorch. Cannot infer type of <torch._C.Future object at 0x7fa9240fbdf0>
:Only tensors and (possibly nested) tuples of tensors, lists, or dictsare supported as inputs or outputs of traced functions, but instead got value of type Future.

From my understanding, the root cause is as follows (I could be mistaken, though, as I'm not too familiar with CUDA RPC):

  1. In CUDA mode, TensorPipe returns an AtomicJitFuture, which is an RpcCUDAFuture, which is a CUDAFuture.
  2. When processing a response from the server, the RPC client calls markCompleted on the future for that RPC with a Message object, and then extractDataPtrs runs (specifically, the extractDataPtrs defined for RpcCUDAFuture).
  3. If the future in (2) has a then() callback attached to it, the resulting future returned by then() is not a plain c10::ivalue::Future; it's an at::CUDAFuture per https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/core/ivalue_inl.h#L469 (createInstance is overridden by CUDAFuture). Since the future in (2) is of type RpcCUDAFuture, createInstance should dispatch to at::CUDAFuture.
  4. When markCompleted is called on this future from then(), the extractDataPtrs defined in CUDAFuture runs and errors out with the above error. I was confused as to why the extractDataPtrs defined in RpcCUDAFuture wasn't running, but I think this is because the future in (3) is an at::CUDAFuture and not an RpcCUDAFuture. extractDataPtrs uses getSubValues, which throws on non-TorchScript Python objects. (A restated sketch of this chain follows below.)
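
Restating that chain against the repro above, as a sketch only (the comments follow the class names in the steps and are not verified against the exact code paths):

# On the caller (rank 1), per steps 1-4 above:
fut = rpc.rpc_async(dst, get_myclass, args=())  # backed by the CUDA-aware RPC future (RpcCUDAFuture)
child = fut.then(lambda res: res)               # child future created via createInstance(),
                                                # i.e. a CUDAFuture rather than an RpcCUDAFuture
print(child.wait())                             # completing `child` runs CUDAFuture's
                                                # extractDataPtrs, which cannot infer an IValue
                                                # type for the value and hits the assert above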

Additional context

Hit this bug while looking into #55757

cc @osalpekar @jiayisuse @lw @beauby @pritamdamania87 @mrshenli @jjlilley @gqchen @rohan-varma @pietern @zhaojuanmao @satgera @aazzolini @agolynski @SciPioneer @H-Huang @mrzzd @cbalioglu

@rohan-varma rohan-varma added the module: tensorpipe Related to Tensorpipe RPC Agent label Apr 16, 2021
@facebook-github-bot facebook-github-bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Apr 16, 2021
@lw (Contributor)

lw commented Apr 16, 2021

Yeah, the rpc_async(...).then(lambda res: res) call returns a Future[Future[...]], which is currently unsupported. Is this intentional? What do you need it for? What behavior would you expect for that object (especially wrt CUDA streams)?
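
For concreteness, the nesting described here can be seen with plain torch.futures on CPU, independent of RPC (a minimal sketch; the error in this issue is CUDA-specific, so this snippet itself does not fail):

import torch
from torch.futures import Future

parent = Future()
nested = parent.then(lambda fut: fut)            # callback returns the Future itself,
                                                 # so the child's value is a Future
flattened = parent.then(lambda fut: fut.wait())  # child's value is the actual payload

parent.set_result(torch.ones(2))
print(nested.wait())     # a Future object: the Future[Future[...]] case
print(flattened.wait())  # tensor([1., 1.])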

I think we should:

  1. change the preMarkCompletedHook so that it doesn't fatal if it cannot process the object, but instead just sets the child future to an error (see the sketch after this list)
  2. if we reach an agreement on the expected behavior, add support for extracting DataPtrs from Future[Future[...]]
  3. land "Add support for async callbacks in ivalue::Future" #48790 so that there's a way to support callbacks that return futures (if that's what you would have needed here)
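
A rough sketch of what point 1 would mean from the Python side, assuming the same repro setup as above (this shows the proposed behavior, not the current one):

# Under proposal 1, an unsupported callback result would no longer crash with
# an INTERNAL ASSERT; the child future would instead carry the error, which
# surfaces as an ordinary exception when waited on.
fut = rpc.rpc_async(dst, get_myclass, args=()).then(lambda res: res.wait())
try:
    fut.wait()
except RuntimeError as err:
    print("child future completed with error:", err)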

@rohan-varma (Member, Author)

@lw Sorry, I actually intended that lambda to be lambda res: res.wait(), which results in a similar error:

RuntimeError: match.success()INTERNAL ASSERT FAILED at "caffe2/aten/src/ATen/core/ivalue.cpp":221, please report a bug to PyTorch. Cannot infer type of <torch.testing._internal.distributed.rpc.rpc_test.MyClass object at 0x7fae759a3d60>

In fact, a similar error would occur for any Python type that's not a tensor or a list/dict/tuple of tensors (since, as you mentioned, we don't know how to extractDataPtrs for those).

What behavior would you expect for that object (especially wrt CUDA streams)?

Ideally this shouldn't error out: it works fine with CPU RPC and this particular call doesn't use the GPU at all, so it would be surprising if it's not supported out of the box. With respect to CUDA streams I don't really expect any changes in behavior, since this is a CPU-only call (in practical use cases it could be used for things like control messages, metrics collection, etc.). Could we modify extractDataPtrs to pass over objects it doesn't know how to extract tensors from?
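
A hedged illustration of that distinction, reusing dst and get_myclass from the repro above. Per the error message quoted earlier, tensors and (possibly nested) tuples/lists/dicts of tensors should be extractable, while an arbitrary Python object is not:

# Callback results whose tensors can be located should be fine:
ok = rpc.rpc_async(dst, torch.add, args=(torch.ones(1), 1)).then(
    lambda res: {"sum": res.wait()}   # dict of tensors: type can be inferred
)

# A custom Python object (MyClass) as the callback result hits the same
# INTERNAL ASSERT as above under CUDA RPC:
bad = rpc.rpc_async(dst, get_myclass, args=()).then(
    lambda res: res.wait()
)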

@mrshenli (Contributor)

In fact, a similar error would occur for any Python type that's not a tensor or a list/dict/tuple of tensors (since, as you mentioned, we don't know how to extractDataPtrs for those).

This actually relates to our discussion yesterday on adding devices to TensorPipeRpcBackendOptions. I hit the same problem before and that's also the reason why we added RpcCUDAFuture.

@lw mentioned that one candidate solution might be letting CUDAFuture::markCompleted take a list of devices, or something along those lines. Another earlier thought was to run pickling on the returned py::object to extract tensors, as we do for RPC, but this would be costly. Shall we bump up its priority and kick off a discussion on that?
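
A rough sketch of that pickling idea, purely for illustration: this is not RPC's actual internal pickler, and extract_tensors_via_pickle is a hypothetical helper name.

import io
import pickle
import torch

def extract_tensors_via_pickle(obj):
    # Hypothetical helper: pickle the arbitrary Python result and intercept
    # every tensor it reaches, so the tensors' devices and data pointers are
    # known even when the object's type cannot be inferred as an IValue.
    # The cost is a full pickling pass per result, which is the perf concern
    # raised later in this thread.
    tensors = []

    class _Collector(pickle.Pickler):
        def persistent_id(self, o):
            if isinstance(o, torch.Tensor):
                tensors.append(o)
                return ("tensor", len(tensors) - 1)  # record, skip serializing storage
            return None

    _Collector(io.BytesIO()).dump(obj)
    return tensors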

@rohan-varma (Member, Author)

rohan-varma commented Apr 16, 2021

@lw mentioned that one candidate solution might be letting CUDAFuture::markCompleted take a list of devices

This might make sense if we can pass the right devices everywhere in the RPC/gradient-compression logic; then Python futures created with then() would not need to pass in devices, and we wouldn't run into the above failure when trying to extract the type.

In general I'm also concerned that the pattern in this issue might not work for gradient compression (i.e. work.get_future().then(some_intermediate_py_obj)), since that uses the CUDA future, but I haven't verified that. It seems that if we have a hook that can return a custom Python object as an intermediary result, we run into the same error, although this isn't a common use case.
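
For reference, a sketch of the gradient-compression pattern being referred to, following the work.get_future().then(...) shape mentioned above; process_group, grad_tensor, and CompressedResult are assumed/hypothetical names used only for illustration:

# Hypothetical comm-hook shape. If the callback returns a custom Python object
# (e.g. compressed grads plus metadata) instead of tensors, a CUDA-aware
# future would hit the same extractDataPtrs limitation described above.
work = process_group.allreduce([grad_tensor])
fut = work.get_future()
chained = fut.then(lambda f: CompressedResult(f.value()))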

@lw (Contributor)

lw commented Apr 19, 2021

@lw mentioned that one candidate solution might be letting CUDAFuture::markCompleted take a list of devices

Yeah, that's an idea, but I'm not sure it would fully help here. Passing around a list of devices would help us support callbacks that change devices. However, we would still need to extract data ptrs, because we need those data ptrs to record them with the CUDA caching allocator (I've found no way around that). Hence I think we still need to go for that Python-side pickling approach to get it to work. I'm still very worried about the perf hit of this (especially given what we found out here), but I don't know what else we can do.
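
For context on why the data pointers are needed: each tensor in the future's value has to be recorded with the CUDA caching allocator against any stream that consumes it, roughly the Python-level analogue of the sketch below (result_tensor is an assumed name standing in for a tensor extracted from the future's value):

# Without record_stream, the caching allocator could reuse result_tensor's
# memory while the consumer stream is still using it.
consumer = torch.cuda.Stream()
with torch.cuda.stream(consumer):
    out = result_tensor * 2
result_tensor.record_stream(consumer)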

@lw (Contributor)

lw commented Apr 26, 2021

I think this issue has been fixed in #56516. Could you confirm?
