
Enable GPU-to-GPU comm in TensorPipeAgent #44418

Closed
wants to merge 63 commits

Conversation

@mrshenli (Contributor) commented Sep 9, 2020

Stack from ghstack:

This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process, same-machine GPU-to-GPU communication. On the sender
side, TensorPipeAgent grabs a stream for each device used by the
message, lets these streams wait for the current streams, and passes
them to TensorPipe's CudaBuffer. On the receiver side, it likewise
grabs a stream for each device used in the message, and uses these
streams to receive tensors and run the user function. The same
streams are then used to send the response back to the sender. When
receiving the response, the sender grabs a new set of streams and
uses them for TensorPipe's CudaBuffer.

If device maps are provided, TensorPipeAgent::send will return a
class derived from CUDAFuture that is specifically tailored for RPC
messages.

TODOs:

  1. Enable sending CUDA RPC to the same process.
  2. Add a custom CUDA stream pool.
  3. Once TensorPipe has addressed the cudaPointerGetAttributes() error,
     remove the cuda:0 context initialization code in backend_registry.py.
  4. Once TensorPipe can detect the availability of peer access, enable all
     tests on platforms without peer access.

Differential Revision: D23626207
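
For context, a minimal sketch of what this enables from the Python side. It assumes the set_device_map API on TensorPipeRpcBackendOptions; the worker names, mapping, and two-process setup are illustrative, not part of this PR's diff:

  import torch
  import torch.distributed.rpc as rpc

  # Illustrative options: map the caller's cuda:0 onto the callee's cuda:1,
  # so CUDA tensors in RPC arguments travel over the cuda_ipc channel
  # instead of being staged through CPU memory.
  options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
  options.set_device_map("worker1", {0: 1})

  rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)

  x = torch.ones(2, 2, device="cuda:0")
  # With a device map set, send() returns the CUDA-aware future described
  # above; rpc_sync blocks until the result is safe to use on the caller's
  # current streams.
  y = rpc.rpc_sync("worker1", torch.add, args=(x, 1))

  rpc.shutdown()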

mrshenli added a commit that referenced this pull request Sep 9, 2020
ghstack-source-id: 13ac6a41e1eb7de5279827a92080cfadce171913
Pull Request resolved: #44418
dr-ci bot commented Sep 9, 2020

💊 CI failures summary and remediations

As of commit b302be9 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed



@mrshenli changed the title from "Use streams from pool on RPC callees" to "[WIP] Use streams from pool on RPC callees" Sep 9, 2020
@mrshenli (Contributor, Author) commented Sep 9, 2020

Land only after TensorPipe CUDA support is in.

mrshenli added a commit that referenced this pull request Sep 10, 2020
ghstack-source-id: 43ad23d33ad08c65520e4b97f6d3896d95def389
Pull Request resolved: #44418
Comment on lines 646 to 648
streams{std::move(streams)}]() mutable {
  // create guards again as this function runs on a different thread
  auto guards = streamsToGuards(streams);
Contributor:

The streams we receive from pipeRead do, at the moment, only contain streams for the devices on which the input tensors lived. However, the user function may place the result tensors on a different device. I therefore think we should get a stream from the pool for all devices and set them all as current.
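
A Python analogue of this suggestion, just to make the intent concrete (the actual implementation is C++ in tensorpipe_agent.cpp; torch.cuda.Stream and device_count stand in for the stream pool here):

  import contextlib
  import torch

  def run_on_fresh_streams(fn, *args):
      # One fresh stream per visible device, each set as that device's
      # current stream, so wherever fn places its result tensors they
      # land on a stream the agent manages.
      streams = [torch.cuda.Stream(device=d)
                 for d in range(torch.cuda.device_count())]
      with contextlib.ExitStack() as stack:
          for s in streams:
              stack.enter_context(torch.cuda.stream(s))
          return fn(*args), streams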

mrshenli added a commit that referenced this pull request Sep 10, 2020
ghstack-source-id: 400f8d3e1e079947b66667e42c6a8ceda556d6ed
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Sep 11, 2020
ghstack-source-id: 66d6bc3864ec6b4b04760c36abc9f9cc1fec85e7
Pull Request resolved: #44418
@lw (Contributor) left a comment:

Nice idea, the DeviceContext! :)

mrshenli added a commit that referenced this pull request Sep 11, 2020
ghstack-source-id: 630d15fc326b4073841a6a829607c889e1fb4e7e
Pull Request resolved: #44418
mrshenli added a commit to mrshenli/pytorch that referenced this pull request Sep 18, 2020
ghstack-source-id: 630d15fc326b4073841a6a829607c889e1fb4e7e
Pull Request resolved: pytorch#44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 0c2672af0abd403b0bf8750e05a195a25d7eccdf
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: ba90f041a77377b18f805a4911223bd5f1d8da5b
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 7706e1feccba7bb6affc67331c90e9e321d526ce
Pull Request resolved: #44418
@mrshenli (Contributor, Author) left a comment:

lint failure is on a file I didn't touch:

  {
    path: 'aten/src/ATen/cuda/CUDAEvent.h',
    start_line: 30,
    end_line: 30,
    start_column: 3,
    end_column: 3,
    annotation_level: 'failure',
    message: '[clang-analyzer-optin.cplusplus.UninitializedObject] warning: 1 uninitialized field at the end of the constructor call'
  }

mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 5e84caec3bf52b977ff12f8079459f66e09c2bff
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 6e54eb8540433f5d2b40049aba3dec81a2fa598d
Pull Request resolved: #44418
Comment on lines 40 to 50
virtual std::vector<CUDAStream> getReservedStreams() const {
  throw std::runtime_error(
      "Attempting to access CUDA streams, but torch is not built with CUDA");
}
#endif

virtual CUDAStream getStream(c10::DeviceIndex index) {
  throw std::runtime_error(c10::str(
      "Attempting to access CUDA stream of device ",
      index,
      ", but torch is not built with CUDA"));
}
Contributor:

After re-reading this I'm not sure I follow: we define these methods if USE_CUDA is on, but these methods then claim that CUDA is off? I realize that in the subclass we override it, and I understand that we must gate them because otherwise CUDAStream would be undefined. But doesn't this mean we could just leave them unimplemented? (i.e., = 0)

Contributor Author:

This was originally intended to provide a clear error message. When I tried to use a pure virtual function, I realized that the following would also need to be gated, or the call sites that do not provide a ctx would need to change. Will address this in a follow-up PR.

  TORCH_API std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
      Message&& rpcMessage,
      std::vector<c10::DeviceIndex> devices = {},
      const std::shared_ptr<LazyStreamContext>& = std::make_shared<LazyStreamContext>());

s1 = torch.cuda.Stream(device=x.device)
with torch.cuda.stream(s1):
    torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
    z = x + y
Contributor:

Shouldn't there also be a synchronization before the addition? x and y might still be being filled in, and that is happening on the current streams, hence it's only safe to access them from the current stream, or from streams that are explicitly synchronized with it.

Also, we should check that x and y are on the same device, or else we need to also sync with the current stream of y.device.

Contributor Author:

Ah, yes, good catch! It was probably hidden by the _sleep.
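
For reference, a sketch of the suggested fix with the missing synchronization made explicit (assuming x and y are produced on their devices' current streams, as in the test above):

  s1 = torch.cuda.Stream(device=x.device)
  # Order s1 after the streams that may still be writing x and y.
  s1.wait_stream(torch.cuda.current_stream(x.device))
  if y.device != x.device:
      s1.wait_stream(torch.cuda.current_stream(y.device))
  with torch.cuda.stream(s1):
      torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
      z = x + y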

mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 86feb4664a7101318efb1e2ac477ba76e43e38d7
Pull Request resolved: #44418
@mrshenli (Contributor, Author) commented:
ci-all test in #50494

mrshenli added a commit that referenced this pull request Jan 14, 2021
ghstack-source-id: a268f91d3d8f98559588b63fb1b778d8e840a86d
Pull Request resolved: #44418
codecov bot commented Jan 14, 2021

Codecov Report

Merging #44418 (28209c4) into gh/mrshenli/235/base (2c55426) will decrease coverage by 0.76%.
The diff coverage is 90.26%.

@@                   Coverage Diff                    @@
##           gh/mrshenli/235/base   #44418      +/-   ##
========================================================
- Coverage                 81.47%   80.71%   -0.77%     
========================================================
  Files                      1792     1910     +118     
  Lines                    186156   207364   +21208     
========================================================
+ Hits                     151669   167369   +15700     
- Misses                    34487    39995    +5508     

mrshenli added a commit that referenced this pull request Jan 14, 2021
Pull Request resolved: #44418

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23626207/)!
ghstack-source-id: 119821241
facebook-github-bot (Contributor) commented: @mrshenli merged this pull request in 30e45bb.
