
Enable GPU-to-GPU comm in TensorPipeAgent #44418

Closed
wants to merge 63 commits

Conversation

@mrshenli (Contributor) commented Sep 9, 2020

Stack from ghstack:

This commit uses TensorPipe's cuda_ipc channel to conduct
cross-process, same-machine GPU-to-GPU communication. On the sender
side, TensorPipeAgent grabs a stream for each device used by the
message, lets these streams wait for the current streams, and passes
them to TensorPipe's CudaBuffer. On the receiver side, it likewise
grabs a stream for each device used in the message, and uses these
streams to receive tensors and run the user function. The same
streams are then used to send the response back to the sender. When
receiving the response, the sender grabs a new set of streams and
uses them for TensorPipe's CudaBuffer.

If device maps are provided, TensorPipeAgent::send will return a
class derived from CUDAFuture that is specifically tailored for RPC
messages.

TODOs:

  1. Enable sending CUDA RPC to the same process.
  2. Add a custom CUDA stream pool.
  3. Once TensorPipe has addressed the cudaPointerGetAttributes() error,
     remove the cuda:0 context initialization code in backend_registry.py.
  4. Once TensorPipe can detect the availability of peer access, enable all
     tests on platforms without peer access.

Differential Revision: D23626207
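
For context, a minimal sketch of what this enables from the Python side. It assumes the set_device_map API on TensorPipeRpcBackendOptions; the worker names, mapping, and two-process setup are illustrative, not part of this PR's diff:

  import torch
  import torch.distributed.rpc as rpc

  # Illustrative options: map the caller's cuda:0 onto the callee's cuda:1,
  # so CUDA tensors in RPC arguments travel over the cuda_ipc channel
  # instead of being staged through CPU memory.
  options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
  options.set_device_map("worker1", {0: 1})

  rpc.init_rpc("worker0", rank=0, world_size=2, rpc_backend_options=options)

  x = torch.ones(2, 2, device="cuda:0")
  # With a device map set, send() returns the CUDA-aware future described
  # above; rpc_sync blocks until the result is safe to use on the caller's
  # current streams.
  y = rpc.rpc_sync("worker1", torch.add, args=(x, 1))

  rpc.shutdown()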

mrshenli added a commit that referenced this pull request Sep 9, 2020
ghstack-source-id: 13ac6a41e1eb7de5279827a92080cfadce171913
Pull Request resolved: #44418
dr-ci bot commented Sep 9, 2020

💊 CI failures summary and remediations

As of commit b302be9 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed



@mrshenli changed the title from "Use streams from pool on RPC callees" to "[WIP] Use streams from pool on RPC callees" Sep 9, 2020
@mrshenli (Contributor, Author) commented Sep 9, 2020

Land only after TensorPipe CUDA support is in.

mrshenli added a commit that referenced this pull request Sep 10, 2020
ghstack-source-id: 43ad23d33ad08c65520e4b97f6d3896d95def389
Pull Request resolved: #44418
Comment on lines 646 to 648
streams{std::move(streams)}]() mutable {
  // create guards again as this function runs on a different thread
  auto guards = streamsToGuards(streams);
Contributor:

The streams we receive from pipeRead do, at the moment, only contain streams for the devices on which the input tensors lived. However, the user function may place the result tensors on a different device. I therefore think we should get a stream from the pool for all devices and set them all as current.
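
A Python analogue of this suggestion, just to make the intent concrete (the actual implementation is C++ in tensorpipe_agent.cpp; torch.cuda.Stream and device_count stand in for the stream pool here):

  import contextlib
  import torch

  def run_on_fresh_streams(fn, *args):
      # One fresh stream per visible device, each set as that device's
      # current stream, so wherever fn places its result tensors they
      # land on a stream the agent manages.
      streams = [torch.cuda.Stream(device=d)
                 for d in range(torch.cuda.device_count())]
      with contextlib.ExitStack() as stack:
          for s in streams:
              stack.enter_context(torch.cuda.stream(s))
          return fn(*args), streams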

mrshenli added a commit that referenced this pull request Sep 10, 2020
ghstack-source-id: 400f8d3e1e079947b66667e42c6a8ceda556d6ed
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Sep 11, 2020
ghstack-source-id: 66d6bc3864ec6b4b04760c36abc9f9cc1fec85e7
Pull Request resolved: #44418
@lw (Contributor) left a comment:

Nice idea, the DeviceContext! :)

mrshenli added a commit that referenced this pull request Sep 11, 2020
ghstack-source-id: 630d15fc326b4073841a6a829607c889e1fb4e7e
Pull Request resolved: #44418
mrshenli added a commit to mrshenli/pytorch that referenced this pull request Sep 18, 2020
ghstack-source-id: 630d15fc326b4073841a6a829607c889e1fb4e7e
Pull Request resolved: pytorch#44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 0c2672af0abd403b0bf8750e05a195a25d7eccdf
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: ba90f041a77377b18f805a4911223bd5f1d8da5b
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 7706e1feccba7bb6affc67331c90e9e321d526ce
Pull Request resolved: #44418
@mrshenli (Contributor, Author) left a comment:

lint failure is on a file I didn't touch:

  {
    path: 'aten/src/ATen/cuda/CUDAEvent.h',
    start_line: 30,
    end_line: 30,
    start_column: 3,
    end_column: 3,
    annotation_level: 'failure',
    message: '[clang-analyzer-optin.cplusplus.UninitializedObject] warning: 1 uninitialized field at the end of the constructor call'
  }

mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 5e84caec3bf52b977ff12f8079459f66e09c2bff
Pull Request resolved: #44418
mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 6e54eb8540433f5d2b40049aba3dec81a2fa598d
Pull Request resolved: #44418
Comment on lines 40 to 50
virtual std::vector<CUDAStream> getReservedStreams() const {
  throw std::runtime_error(
      "Attempting to access CUDA streams, but torch is not built with CUDA");
}
#endif

virtual CUDAStream getStream(c10::DeviceIndex index) {
  throw std::runtime_error(c10::str(
      "Attempting to access CUDA stream of device ",
      index,
      ", but torch is not built with CUDA"));
}
Contributor:

After re-reading this I'm not sure I follow: we define these methods if USE_CUDA is on, but these methods then claim that CUDA is off? I realize that in the subclass we override it, and I understand that we must gate them because otherwise CUDAStream would be undefined. But doesn't this mean we could just leave them unimplemented? (i.e., = 0)

Contributor Author:

This was originally intended to provide a clear error message. When I tried to use a pure virtual function, I realized that the following would also need to be gated, or the call sites that do not provide a ctx would need to change. Will address this in a follow-up PR.

  TORCH_API std::tuple<tensorpipe::Message, TensorpipeWriteBuffers> tensorpipeSerialize(
      Message&& rpcMessage,
      std::vector<c10::DeviceIndex> devices = {},
      const std::shared_ptr<LazyStreamContext>& = std::make_shared<LazyStreamContext>());

s1 = torch.cuda.Stream(device=x.device)
with torch.cuda.stream(s1):
    torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
    z = x + y
Contributor:

Shouldn't there also be a synchronization before the addition? x and y might still be being filled in, and that is happening on the current streams, hence it's only safe to access them from the current stream, or from streams that are explicitly synchronized with it.

Also, we should check that x and y are on the same device, or else we need to also sync with the current stream of y.device.

Contributor Author:

Ah, yes, good catch! It was probably hidden by the _sleep.
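
For reference, a sketch of the suggested fix with the missing synchronization made explicit (assuming x and y are produced on their devices' current streams, as in the test above):

  s1 = torch.cuda.Stream(device=x.device)
  # Order s1 after the streams that may still be writing x and y.
  s1.wait_stream(torch.cuda.current_stream(x.device))
  if y.device != x.device:
      s1.wait_stream(torch.cuda.current_stream(y.device))
  with torch.cuda.stream(s1):
      torch.cuda._sleep(10 * FIFTY_MIL_CYCLES)
      z = x + y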

mrshenli added a commit that referenced this pull request Jan 13, 2021
ghstack-source-id: 86feb4664a7101318efb1e2ac477ba76e43e38d7
Pull Request resolved: #44418
@mrshenli (Contributor, Author) commented:
ci-all test in #50494

mrshenli added a commit that referenced this pull request Jan 14, 2021
ghstack-source-id: a268f91d3d8f98559588b63fb1b778d8e840a86d
Pull Request resolved: #44418
codecov bot commented Jan 14, 2021

Codecov Report

Merging #44418 (28209c4) into gh/mrshenli/235/base (2c55426) will decrease coverage by 0.76%.
The diff coverage is 90.26%.

@@                   Coverage Diff                    @@
##           gh/mrshenli/235/base   #44418      +/-   ##
========================================================
- Coverage                 81.47%   80.71%   -0.77%     
========================================================
  Files                      1792     1910     +118     
  Lines                    186156   207364   +21208     
========================================================
+ Hits                     151669   167369   +15700     
- Misses                    34487    39995    +5508     

mrshenli added a commit that referenced this pull request Jan 14, 2021
Pull Request resolved: #44418

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D23626207/)!
ghstack-source-id: 119821241
facebook-github-bot (Contributor) commented: @mrshenli merged this pull request in 30e45bb.
