GPUDirect RDMA Out-of-Band Tensor Transport #11392

Merged
48 commits merged on Aug 9, 2017 (changes shown from 46 commits)
Commits
f7f555f
GPU Direct RDMA Out-of-Band Tensor Transport
byronyi May 4, 2017
114b110
[WIP] GPU Direct with customized allocator
byronyi Jul 4, 2017
5b5661b
[WIP] Data race problem
byronyi Jul 7, 2017
dd7e2fc
[WIP] Refactor and add checksum for GDR
byronyi Jul 9, 2017
6c91828
[WIP] Add debug string to checksum check
byronyi Jul 9, 2017
a1323e5
Final piece of host memory fallback
byronyi Jul 9, 2017
44d74bd
Bugfix on memory region management
byronyi Jul 9, 2017
d3b84ee
Add RDMA library headers as third party dependency
byronyi Jul 10, 2017
be585a2
Revert "Add RDMA library headers as third party dependency"
byronyi Jul 11, 2017
2852d34
make buildifier happy
byronyi Jul 11, 2017
9638a5f
fix errors for non-RDMA target
byronyi Jul 11, 2017
4808c65
fix good path
byronyi Jul 11, 2017
f07170e
fix dangling pointer
byronyi Jul 11, 2017
947876e
add compile switch for GDR
byronyi Jul 13, 2017
946afcb
make buildifier happy
byronyi Jul 13, 2017
cdc98e1
tidy source format using clang-format --style=google
byronyi Jul 13, 2017
e6cc8ee
using buildifier to auto format
byronyi Jul 13, 2017
8858452
fix macro
byronyi Jul 13, 2017
d3ae026
fix build config
byronyi Jul 13, 2017
afd3c36
Fix a performance bug (and #11411 hopefully)
byronyi Jul 14, 2017
0cb5019
fix kUnknownNumaNode
byronyi Jul 17, 2017
074b48a
prepare for a cleaned up refactoring
byronyi Jul 31, 2017
915f2b2
remove unnecessary changes
byronyi Jul 31, 2017
2e874df
finishing moving to contrib
byronyi Aug 1, 2017
1e8b668
several quick fixes
byronyi Aug 1, 2017
7739c56
fix tests
byronyi Aug 2, 2017
e7eaf28
remove wrong stop() call in server join
byronyi Aug 2, 2017
0191ccf
fix a init race condition for gdr w.r.t. cuda
byronyi Aug 2, 2017
e4c1a99
better reporting of errors via errno
byronyi Aug 3, 2017
59e7b7d
add check for wildcard and loopback address
byronyi Aug 3, 2017
3c54f86
do not visit gpu allocators if no gpu is active
byronyi Aug 3, 2017
3b8f30c
Revert "do not visit gpu allocators if no gpu is active"
byronyi Aug 3, 2017
054daeb
fix for latest gcc and cpu-only build
byronyi Aug 3, 2017
6720470
complain louder on server side
byronyi Aug 3, 2017
9557643
reduce CPU overhead using event mode
byronyi Aug 3, 2017
71eab60
leave platform-neutral gdr to future work
byronyi Aug 4, 2017
4882418
add docs
byronyi Aug 4, 2017
7240ae3
fix interference with MKL CPU allocator
byronyi Aug 4, 2017
c20e8f4
fix testing if a tensor is on host
byronyi Aug 4, 2017
e841bc8
add several TODOs and checks
byronyi Aug 4, 2017
0a1e5c5
add a readme
byronyi Aug 5, 2017
5af83a5
Update README.md
byronyi Aug 5, 2017
bfb8f24
update readme.md
byronyi Aug 7, 2017
9bb2c8e
fix several issues under VLOG
byronyi Aug 7, 2017
554c3e4
fix typo in readme
byronyi Aug 7, 2017
f9a9d18
fix a race condition of GDR
byronyi Aug 8, 2017
db5501e
merge HEAD from master
byronyi Aug 8, 2017
241c020
remove unintended changes
byronyi Aug 8, 2017
2 changes: 2 additions & 0 deletions configure.py
@@ -911,6 +911,8 @@ def main():
'with_hdfs_support', False)
set_build_var(environ_cp, 'TF_ENABLE_XLA', 'XLA JIT', 'with_xla_support',
False)
set_build_var(environ_cp, 'TF_NEED_GDR', 'GDR', 'with_gdr_support',
False)
set_build_var(environ_cp, 'TF_NEED_VERBS', 'VERBS', 'with_verbs_support',
False)

6 changes: 6 additions & 0 deletions tensorflow/BUILD
@@ -182,6 +182,12 @@ config_setting(
visibility = ["//visibility:public"],
)

config_setting(
name = "with_gdr_support",
values = {"define": "with_gdr_support=true"},
visibility = ["//visibility:public"],
)

config_setting(
name = "with_verbs_support",
values = {"define": "with_verbs_support=true"},
4 changes: 2 additions & 2 deletions tensorflow/c/c_api.cc
@@ -145,7 +145,7 @@ class TF_ManagedBuffer : public TensorBuffer {
void* allocate_tensor(const char* operation, size_t len) {
void* data =
tensorflow::cpu_allocator()->AllocateRaw(EIGEN_MAX_ALIGN_BYTES, len);
- if (tensorflow::LogMemory::IsEnabled()) {
+ if (tensorflow::LogMemory::IsEnabled() && data != nullptr) {
tensorflow::LogMemory::RecordRawAllocation(
operation, tensorflow::LogMemory::EXTERNAL_TENSOR_ALLOCATION_STEP_ID,
len, data, tensorflow::cpu_allocator());
@@ -154,7 +154,7 @@ void* allocate_tensor(const char* operation, size_t len) {
}

void deallocate_buffer(void* data, size_t len, void* arg) {
- if (tensorflow::LogMemory::IsEnabled()) {
+ if (tensorflow::LogMemory::IsEnabled() && data != nullptr) {
tensorflow::LogMemory::RecordRawDeallocation(
"TensorFlow C Api",
tensorflow::LogMemory::EXTERNAL_TENSOR_ALLOCATION_STEP_ID, data,
122 changes: 122 additions & 0 deletions tensorflow/contrib/gdr/BUILD
@@ -0,0 +1,122 @@
# Description:
# GPU Direct RDMA Out-of-Band Tensor transport for TensorFlow.

package(default_visibility = [
"//tensorflow:__subpackages__",
])

licenses(["notice"]) # Apache 2.0

exports_files(["LICENSE"])

filegroup(
name = "all_files",
srcs = glob(
["**/*"],
exclude = [
"**/METADATA",
"**/OWNERS",
],
),
visibility = ["//tensorflow:__subpackages__"],
)

filegroup(
name = "c_srcs",
data = glob([
"**/*.cc",
"**/*.h",
]),
)

load(
"//tensorflow:tensorflow.bzl",
"tf_cuda_library",
)

# For platform specific build config
load(
"//tensorflow/core:platform/default/build_config.bzl",
"tf_proto_library_cc",
)

tf_proto_library_cc(
name = "gdr_proto",
srcs = ["gdr.proto"],
cc_api_version = 2,
visibility = [
"//tensorflow:__subpackages__",
],
)

tf_cuda_library(
name = "gdr_memory_manager",
srcs = ["gdr_memory_manager.cc"],
hdrs = ["gdr_memory_manager.h"],
linkopts = select({
"//tensorflow:with_gdr_support": [
"-libverbs",
"-lrdmacm",
],
"//conditions:default": [],
}),
deps = [
":gdr_proto_cc",
"//tensorflow/core:framework",
"//tensorflow/core:gpu_runtime",
"//tensorflow/core:lib",
],
)

tf_cuda_library(
name = "gdr_worker",
srcs = ["gdr_worker.cc"],
hdrs = ["gdr_worker.h"],
deps = [
":gdr_memory_manager",
"//tensorflow/core:core_cpu_internal",
"//tensorflow/core:framework",
"//tensorflow/core:gpu_runtime",
"//tensorflow/core:lib",
"//tensorflow/core:lib_internal",
"//tensorflow/core/distributed_runtime:graph_mgr",
"//tensorflow/core/distributed_runtime:rendezvous_mgr_interface",
"//tensorflow/core/distributed_runtime:worker",
"//tensorflow/core/distributed_runtime:worker_cache",
"//tensorflow/core/distributed_runtime:worker_env",
"//tensorflow/core/distributed_runtime:worker_session",
"//tensorflow/core/distributed_runtime/rpc:grpc_worker_service",
],
)

cc_library(
name = "gdr_rendezvous_mgr",
srcs = ["gdr_rendezvous_mgr.cc"],
hdrs = ["gdr_rendezvous_mgr.h"],
deps = [
":gdr_memory_manager",
"//tensorflow/core:core_cpu_internal",
"//tensorflow/core:framework",
"//tensorflow/core:lib",
"//tensorflow/core/distributed_runtime:base_rendezvous_mgr",
"//tensorflow/core/distributed_runtime:worker_cache",
"//tensorflow/core/distributed_runtime:worker_env",
"//tensorflow/core/distributed_runtime:worker_interface",
],
)

cc_library(
name = "gdr_server_lib",
srcs = ["gdr_server_lib.cc"],
hdrs = ["gdr_server_lib.h"],
linkstatic = 1, # Seems to be needed since alwayslink is broken in bazel
deps = [
":gdr_memory_manager",
":gdr_rendezvous_mgr",
":gdr_worker",
"//tensorflow/core:lib",
"//tensorflow/core/distributed_runtime:server_lib",
"//tensorflow/core/distributed_runtime/rpc:grpc_server_lib",
],
alwayslink = 1,
)
122 changes: 122 additions & 0 deletions tensorflow/contrib/gdr/README.md
@@ -0,0 +1,122 @@
Introduction
===

This is an implementation of GDR out-of-band transport for the TensorFlow distributed runtime, complementary to the current gRPC transport. It uses gRPC as the control plane to set up a rendezvous for each tensor transmission, and utilizes [GPU Direct RDMA](https://developer.nvidia.com/gpudirect) whenever possible to transmit tensors in remote GPU memory through the network interface card (NIC), bypassing host memory and the CPU entirely. It gracefully falls back to ordinary RDMA, or even gRPC, when GDR is not available.

Design
===

The GDR out-of-band transport is designed to avoid any unnecessary memory copies, especially for large tensors (>100MB). Achieving this typically requires registering tensor buffers with the NIC in an ad-hoc manner, which is rather slow, as described in the design trade-offs of the verbs runtime. The verbs runtime therefore chooses to manage its own NIC-registered buffers and to copy tensors from/to those buffers for every single tensor transfer.

We show, however, that this design trade-off is not always necessary. In this patch, we manage both computation and communication buffers in a unified manner. By pre-registering large buffers with the NIC and allocating small tensors from that buffer pool with a BFC allocator, it is possible to avoid both ad-hoc buffer registration and memory copies altogether.
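
As a rough illustration of this idea (this is not the code in `gdr_memory_manager.cc`), registering one large pool with the NIC up front might look like the sketch below; the pool size, alignment and helper names are made up for the example:

```
// Sketch only: register one large buffer with the NIC once, so that a
// sub-allocator (e.g. BFC) can serve individual tensors from it without
// any per-tensor ibv_reg_mr call or staging copy.
#include <infiniband/verbs.h>

#include <cstddef>
#include <cstdlib>

struct RegisteredPool {
  void* base = nullptr;
  size_t size = 0;
  ibv_mr* mr = nullptr;  // single registration covering the whole pool
};

// `pd` is an already opened protection domain; 1 GiB and 4 KiB alignment
// are arbitrary choices for this example.
RegisteredPool RegisterPool(ibv_pd* pd, size_t size = 1ull << 30) {
  RegisteredPool pool;
  pool.size = size;
  pool.base = std::aligned_alloc(4096, size);
  pool.mr = ibv_reg_mr(pd, pool.base, size,
                       IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                           IBV_ACCESS_REMOTE_WRITE);
  return pool;
}
```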

For the actual tensor transport, we rely on gRPC to transmit the [remote buffer information](gdr.proto). This greatly simplifies our design, and there are only two types of RDMA messages: a single READ to retrieve the tensor data (bypassing the remote CPU), and an invalidate message using WRITE with immediate data (IMM) to release the tensor buffer on the remote side. The remote side only polls for the invalidate message and `Unref`s the tensor buffers that have been read by its peer.
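
To make the two message types concrete, here is a hedged sketch (not the code in this PR) of how a receiver could consume a `RemoteMemoryRegion` with plain libibverbs, assuming an already-connected RC queue pair `qp` and a locally registered destination buffer:

```
// Sketch only: pull the tensor with a one-sided READ, then release it on
// the remote side with a zero-byte WRITE carrying the tensor_key as
// immediate data. Field names follow gdr.proto; everything else is assumed.
#include <arpa/inet.h>
#include <infiniband/verbs.h>

#include <cstdint>

// Fetch `len` bytes from the remote buffer described by (remote_addr, rkey).
int PostTensorRead(ibv_qp* qp, ibv_mr* local_mr, void* local_addr,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len) {
  ibv_sge sge = {};
  sge.addr = reinterpret_cast<uint64_t>(local_addr);
  sge.length = len;
  sge.lkey = local_mr->lkey;

  ibv_send_wr wr = {};
  wr.opcode = IBV_WR_RDMA_READ;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr);
}

// After the READ completes, tell the sender it may release the buffer.
int PostTensorRelease(ibv_qp* qp, uint64_t remote_addr, uint32_t rkey,
                      uint32_t tensor_key) {
  ibv_send_wr wr = {};
  wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
  wr.imm_data = htonl(tensor_key);  // the peer polls for this key
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr);
}
```

On the other end, the sender would presumably poll its completion queue for `IBV_WC_RECV_RDMA_WITH_IMM`, recover the tensor key from the immediate data, and `Unref` the matching buffer.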

Environment
===

To fully utilize GDR, the target environment has to meet 3 conditions:

1. There is an RDMA capable device with the corresponding [OFED package](https://www.openfabrics.org/index.php/overview.html) installed (detailed information is available from your [Infiniband/RoCE](http://www.mellanox.com/page/products_dyn?product_family=116)/[iWarp](http://www.chelsio.com/gpudirect-rdma/) vendor), which can be verified through `ibv_devinfo`, e.g.

```
$ ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.40.7000
node_guid: 248a:0703:00f6:3370
sys_image_guid: 248a:0703:00f6:3370
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x1
board_id: MT_1090110023
phys_port_cnt: 2
Device ports:
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet

port: 2
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
```

2. There is a GDR capable GPU, i.e. of Fermi, Kepler or later architecture, with the [corresponding driver](http://docs.nvidia.com/cuda/gpudirect-rdma/index.html) installed. The PCI-e topology can be confirmed by `nvidia-smi topo -m`. For example, in the following topology, `GPU2` and `GPU3` are adjacent to `mlx4_0`, and tensors on these devices can benefit from GDR in the current implementation.

```
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx4_0 CPU Affinity
GPU0 X PHB SOC SOC SOC 0-5
GPU1 PHB X SOC SOC SOC 0-5
GPU2 SOC SOC X PHB PHB 6-11
GPU3 SOC SOC PHB X PHB 6-11
mlx4_0 SOC SOC PHB PHB X

Legend:

X = Self
SOC = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
```

3. The [`nv_peer_mem`](https://github.com/Mellanox/nv_peer_memory) kernel module is installed.

How to build and run in GDR mode
===

To test it out in a GDR capable environment, enable GDR when prompted by the configure script.

```
Do you wish to build TensorFlow with GDR support? [y/N]: y
GDR support will be enabled for TensorFlow.
```

Change your `protocol` to `grpc+gdr` to enable GDR in your deployment.

```
server = tf.train.Server(cluster, job_name="local", task_index=0, protocol='grpc+gdr') # default protocol is 'grpc'
```

Currently the out-of-band transport service listens on the same IP address and port as specified for gRPC.

A successful initialization looks like this:

```
2017-08-05 19:10:38.601718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:02:00.0)
2017-08-05 19:10:38.601728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:03:00.0)
2017-08-05 19:10:38.601736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:82:00.0)
2017-08-05 19:10:38.601742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:83:00.0)
2017-08-05 19:10:39.591026: I tensorflow/contrib/gdr/gdr_memory_manager.cc:235] RDMA server is listening on 10.40.2.200:5001
2017-08-05 19:10:39.591071: I tensorflow/contrib/gdr/gdr_memory_manager.cc:285] Instrumenting CPU allocator cuda_host_bfc
2017-08-05 19:10:39.591083: I tensorflow/contrib/gdr/gdr_memory_manager.cc:285] Instrumenting CPU allocator cpu_pool
2017-08-05 19:10:39.591095: I tensorflow/contrib/gdr/gdr_memory_manager.cc:285] Instrumenting CPU allocator cpu_rdma_bfc
2017-08-05 19:10:39.591278: I tensorflow/contrib/gdr/gdr_memory_manager.cc:78] NUMA node for device: mlx4_0 is 1
2017-08-05 19:10:39.740253: I tensorflow/contrib/gdr/gdr_memory_manager.cc:296] Instrumenting GPU allocator with bus_id 2
```

The last line indicates that the GPUs with bus id 2 (i.e. those whose PCI bus ids are prefixed with 0000:8), namely `/gpu:2` and `/gpu:3` in this case, will benefit from GDR and host memory bypass.

Caveats
===

In the current implementation, only tensors that reside in host memory, or in the memory of a GPU adjacent to an RDMA capable NIC, are transported via direct RDMA. When RDMA is available but GDR is not, a temporary tensor copy in host memory is used as the RDMA source/destination (and copied from/to the target device). When no RDMA device is present, the transport falls back to the original gRPC runtime. While it is theoretically possible to mix GDR enabled TF with non-GDR deployments in the same job, make sure the environment is set up properly so that GDR mode is used whenever possible (i.e. do not fall back to gRPC unless absolutely necessary).

In the original design (as in the reference), tensor buffers are registered with the NIC only when we can determine that the tensor will be either the source of a Send or the sink of a Recv across a physical machine boundary. However, implementing such precise allocation would require changing all devices to possibly return a NIC compatible allocator. As GDR is currently in contrib, we would like to avoid that code disruption to the TF core, so we allocate all tensors from NIC-registered buffers using a BFC allocator. This behaviour is similar to the effect of enabling the extra GPU option `force_gpu_compatible`, which allocates all host tensors in GPU-registered buffers regardless of whether they will be transferred from/to GPUs.

Reference
===

Bairen Yi, Jiacheng Xia, Li Chen, and Kai Chen. 2017. Towards Zero Copy Dataflows using RDMA. In Proceedings of SIGCOMM Posters and Demos'17, Los Angeles, CA, USA, August 22-24, 2017, 3 pages. https://doi.org/10.1145/3123878.3123907
15 changes: 15 additions & 0 deletions tensorflow/contrib/gdr/gdr.proto
@@ -0,0 +1,15 @@
syntax = "proto3";

package tensorflow;
option cc_enable_arenas = true;

import "google/protobuf/any.proto";

message RemoteMemoryRegion {
string host = 1;
string port = 2;
uint64 addr = 3;
uint32 rkey = 4;
uint32 tensor_key = 5;
uint64 checksum = 6;
}
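
For illustration only (this helper does not exist in the PR), the sender side could populate this message via the C++ API that protoc generates from the file above; all values here are placeholders:

```
// Hypothetical helper: describe a registered tensor buffer so the receiver
// can READ it. Field setters are generated by protoc from gdr.proto.
#include "tensorflow/contrib/gdr/gdr.pb.h"

#include <cstdint>
#include <string>

tensorflow::RemoteMemoryRegion DescribeBuffer(const std::string& host,
                                              const std::string& port,
                                              void* addr, uint32_t rkey,
                                              uint32_t tensor_key,
                                              uint64_t checksum) {
  tensorflow::RemoteMemoryRegion region;
  region.set_host(host);
  region.set_port(port);
  region.set_addr(reinterpret_cast<uint64_t>(addr));
  region.set_rkey(rkey);
  region.set_tensor_key(tensor_key);
  region.set_checksum(checksum);
  return region;
}
```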