GPUDirect RDMA Out-of-Band Tensor Transport #11392
Conversation
Can one of the admins verify this patch?
Jenkins, test this please.
Are you depending on a host RDMA library to be installed?
@drpngx Yes, as in the case of CUDA, we assume an OFED package is installed with all the drivers and user space libraries needed for RDMA to function properly. I agree that we need some sort of compile switch, but I haven't figured out the best way to do it. Alternatively, we could add linux-rdma/rdma-core as a third party compile dependency and detect whether the GDR environment is available only at runtime.
I have done some investigation and would like to propose adding the following. Of course, all of these are Linux specific, so we need an extra compatibility layer for non-Linux platforms. I do know there are RDMA APIs available on Windows, but I don't have a Windows box, so this has to be left to future contributors. For functional testing, we could use Software RDMA over Ethernet as a software emulated RDMA layer that runs on any platform. It requires an upstream Linux kernel of version 4.8 or later, so we do need to set up a separate CI environment for that purpose.
@shamoya Any interest in this patch? I would certainly appreciate any help from Mellanox with testing and setting up a CI environment with GDR (and possibly RoCE).
buildifier checks are failing here, so all the other tests are blocked. Could you look into this failure so that the rest of our CI tests can run?
@gunan Yes, I am looking into it. Will ping you after it's resolved.
I am working on introducing the RDMA user space libraries as third party dependencies, but I am not familiar with converting Makefile or CMake rules into bazel BUILD files. A similar case, #5349 (integrating cURL), seems to be a much larger effort than I expected. I would appreciate it if @jart or @gunan could give some suggestions on this.
Jenkins, test this please.
@drpngx Sorry, I was pushing the wrong commit. Could you ask Jenkins to test it again?
Jenkins, test this please.
Seems a rebase is needed.
Thanks @byronyi
@shamoya Glad to have your prompt reply. The last commit turned out to be an unsuccessful attempt to introduce the RDMA dependency. If you would like to try out the patch on a machine with OFED installed, you could revert the last commit and build from source. Update: I've reverted it myself, so it should work by just checking out this branch.
The libraries in third_party are linked statically to the code, right?
My initial attempt was to include only the RDMA headers, which apparently is not runnable as it isn't linked against the actual library. I haven't figured out how to use bazel to build rdma-core; it uses CMake, which is hard to port to bazel in general. I guess we need to consult someone on the TF team about platform and portability issues. Regarding the version of a specific third party library, I think it is hardcoded into the compilation directives. Google seems to compile everything from source for their projects, so I guess this isn't an issue for them.
Jenkins, test this please.
@drpngx I am still working on resolving the build issues... maybe we should wait for review from @poxvoculi first?
OK. I usually prefer to review when everything builds and tests properly.
Can you summarize what your questions are?
The current CI doesn't have a GDR environment, so we are specifically testing the "fallback to gRPC" behaviour. I've already implemented such logic in my code (here and there), but the GDR environment requires RDMA user space libraries, which are currently not linked into the TF core library. Currently I am stuck on adding them as a third party dependency (porting a CMake based project to bazel). Alternatively, I could make it a separate runtime (like grpc+verbs and grpc+mpi), but that would defeat the whole point of being able to fall back to gRPC.
Or I could just add a compiler directive in
@drpngx I've tested my non-RDMA box and everything works just fine. It does require the box to have the two libraries installed. Is there any way to set up a CI environment with them? They are available on all Linux distros I know of. After we confirm everything builds and tests properly, we can arrange a review of my patch.
There is no source release other than getting the source code from GitHub. So once we merge this, it's available for people to use. However, if someone specifically asks to build the 1.3 branch, they will get the source code which builds the 1.3 binary, which would not include this change.
Thanks for the clarification! I actually just realized that :)
@tensorflow-jenkins test this please
@@ -285,12 +290,12 @@ Status GdrMemoryManager::Init() {
  }

#if GOOGLE_CUDA
  VisitableAllocator::Visitor cuda_alloc_visitor =
      std::bind(&GdrMemoryManager::InsertCUDAMemoryRegion, this, _1, _2);
If I understand what you're doing correctly, you're registering the entire GPU memory to have the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS property. This seems to be a technique that NVIDIA provides because it makes concurrent programming easier, but I would avoid it, because it will likely also damage performance considerably.
It sounds like you suspect a race condition on either the RDMA read, or some use of the final value. For background, here's how we generally handle this issue.
TensorFlow tries hard to use as few synchronous CUDA calls as possible, and to introduce as few sync points as possible. The techniques used rely on careful use of streams. The general discipline is that 4 streams are created for each GPU by the StreamGroupFactory. All compute Ops (i.e. CUDA kernels) are launched on the single compute stream, and the other three are used for memcpys: one for H2D, one for D2H, and one for D2D copies. In the normal case where an op relies only on the completion of prior ops in the same stream, it can be launched async without danger of any other temporal dependency. In the case where it relies on prior completion of an op on another stream, we need to introduce a sync dependency, which can be done in more than one way. If we need to wait for an ordinary compute op prior to e.g. a D2H copy, we can just introduce a wait on the current compute stream prior to the copy, as here. This will cause the i/o stream to wait for all ops pending at the time of the call to complete before launching the next op added to itself.
In the case where we need to wait for a memcpy to terminate on one stream before launching a compute op on another, the technique used is here. We don't want the compute stream to stall until the i/o is done, so we wait for the i/o to complete before queuing the compute op. The wait is accomplished by a call to the EventMgr.
In our internal networking which uses GPUDirect, we use the EventMgr to ensure that we don't start a send before the op writing the buffer to be read has completed. In the other direction, our RPC system provides a callback that executes only after the target buffer area has been written, and we use that callback to trigger subsequent ops that want to read that area.
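The callback-after-event pattern described above can be illustrated with a toy in-process sketch. This is plain Python with hypothetical names, not TensorFlow's actual C++ EventMgr, which polls recorded CUDA events rather than thread events:

```python
import queue
import threading

class ToyEventMgr:
    """Toy stand-in for TensorFlow's EventMgr.

    Callbacks are queued together with an 'event' and fired by a
    background poller once the event is signalled, so the caller
    never blocks waiting for an async copy to finish.
    """

    def __init__(self):
        self._pending = queue.Queue()
        threading.Thread(target=self._poll_loop, daemon=True).start()

    def then_execute(self, event, callback):
        # Returns immediately; the consumer "stream" is never stalled.
        self._pending.put((event, callback))

    def _poll_loop(self):
        while True:
            event, callback = self._pending.get()
            event.wait()  # stand-in for polling a recorded CUDA event
            callback()

# Usage: launch the "compute" only once the async "memcpy" completes.
mgr = ToyEventMgr()
copy_done = threading.Event()        # would be recorded on the copy stream
compute_launched = threading.Event()

mgr.then_execute(copy_done, compute_launched.set)
copy_done.set()                      # the memcpy signals completion
assert compute_launched.wait(timeout=5)
```

The key property mirrored here is that `then_execute` never blocks the caller; the dependency is enforced by the poller, just as the EventMgr keeps the compute stream from stalling on i/o.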
I noticed the stream executor, and I think it is a good design for both pipelining and parallelism. It is just that I have not figured out a viable path to accommodate my current design to it, and I need to make a temporary fix for the race condition issue.
Let me take a closer look and see if I can do better. Thanks for the comment! It is very helpful.
@tensorflow-jenkins test this please
Can one of the admins verify this patch?
I thought I would make some further improvements, but @rmlarsen seems to be eager to merge this. Alright then, I will leave them for the future 😆 Good job guys, and thank you all for the feedback in this thread. Hope we get some users soon so we can hear back from them.
I find that the first GPU's forward propagation is much slower than on the other GPUs on the same machine when training resnet101 using
@byronyi I keep getting an error. No matter what port I've tried, I get that error. The hsw215 is a compute node. This is my cluster spec:
I compiled TF 1.4 with GDR, and I have nv_peer_mem running:
My code runs fine with
RDMA runs well, but I can't get GDR working. Maybe I'm doing something stupid, but it's not clear from the README what else needs to be done. Thanks.
@avolkov1 The GDR runtime binds to whatever IP you assigned, even if it is not an IB device. Please check the IP address first. Sometimes the same host name resolves to different IPs on different machines.
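A quick way to see which address a host name actually resolves to on a given machine is Python's standard socket module (a generic sketch; substitute your own node name such as hsw215 for localhost):

```python
import socket

def resolve_ipv4(hostname):
    """Return the sorted set of IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# The GDR runtime binds to whichever address the cluster spec names,
# so this should be the IP of the RDMA capable interface.
print(resolve_ipv4("localhost"))
```

Run it on each node: if the same name resolves to different interfaces on different machines, the GDR endpoint may end up bound to a non-IB device.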
@byronyi Thank you. Made some progress. I'm getting an error, but I do get the RDMA GDR endpoints working:
I'm running this on slurm. The first number above is the task id. Here's my cluster setup:
The parameter servers run on CPU and each worker has 4 GPUs. Devices list:
ERRORS:
@avolkov1 I suspect it is because of the limit on the page-locked memory size. Check it with ulimit. EDIT: you might need your sysadmin to modify the ulimit. Usually it should be set to unlimited.
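One way to check the locked-memory limit programmatically (a Unix-only sketch using Python's standard resource module, equivalent to inspecting ulimit -l):

```python
import resource

def memlock_limits():
    """Return (soft, hard) RLIMIT_MEMLOCK values in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

soft, hard = memlock_limits()
for name, value in (("soft", soft), ("hard", hard)):
    shown = "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"
    print(f"{name} memlock limit: {shown}")
```

RDMA memory registration pins pages, so a small RLIMIT_MEMLOCK will make registration of large buffers fail; raising it usually requires an entry in /etc/security/limits.conf set by the sysadmin.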
@byronyi Thanks. I looked at it. It seems to be unlimited.
@avolkov1 I failed to reproduce your issue on a fresh installation of vendor provided OFED on my testbed. Would you mind pasting the full log?
@byronyi Will do. Thank you very much. Give me a day or two (busy with work related stuff). I'll post the error log and the code I use to run.
Introduction
This PR implements GDR out-of-band transport for the TensorFlow distributed runtime, complementary to the current gRPC transport. It uses gRPC as the control plane to set up a rendezvous for each tensor transmission, and utilizes GPUDirect RDMA whenever possible to transmit tensors in remote GPU memory through the network interface card (NIC), bypassing host memory and CPU entirely. It gracefully falls back to ordinary RDMA or even gRPC when GDR is not available.
Design
The GDR out-of-band transport is designed to avoid any unnecessary memory copies, especially for large tensors (>100MB). That typically requires registering tensor buffers with the NIC on the fly, which is rather slow, as described in the design trade-off of the verbs runtime. The verbs runtime thus chooses to manage its own NIC-registered buffers and copy tensors from/to those buffers for every single tensor transfer.
We show, however, that such a design trade-off is not always necessary. In this patch, we manage both computation and communication buffers in a unified manner. By pre-registering large buffers with the NIC and allocating small tensors from the buffer pool using a BFC allocator, it is possible to avoid both on-the-fly buffer registration and memory copies altogether.
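The register-once, sub-allocate-many idea can be sketched with a toy first-fit pool. This is a drastic simplification of the actual BFC allocator, and the NIC registration step (where something like ibv_reg_mr would pin the region) is only indicated by a comment:

```python
class RegisteredPool:
    """Toy sub-allocator over a single region registered with the NIC once."""

    def __init__(self, size):
        # Real code would register the whole region with the NIC exactly
        # once here (pinning it and obtaining RDMA keys); every later
        # alloc() is then plain bookkeeping, with no per-tensor
        # registration on the critical path.
        self.free_list = [(0, size)]  # (offset, length) holes

    def alloc(self, nbytes):
        """First-fit allocation; returns an offset into the region."""
        for i, (off, length) in enumerate(self.free_list):
            if length >= nbytes:
                if length == nbytes:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (off + nbytes, length - nbytes)
                return off
        raise MemoryError("pool exhausted; a real impl would grow or copy")

    def release(self, off, nbytes):
        self.free_list.append((off, nbytes))  # toy: no coalescing

pool = RegisteredPool(1 << 20)  # one large pre-registered region
a = pool.alloc(4096)            # offset 0
b = pool.alloc(65536)           # offset 4096
print(a, b)                     # -> 0 4096
```

Unlike this toy, the real BFC allocator coalesces freed chunks and bins them by size, but the cost structure is the same: registration once, allocation cheap.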
For the actual tensor transport, we rely on gRPC to transmit the remote buffer information. This greatly simplifies our design, and there are only 2 types of RDMA messages: a single READ to retrieve the tensor data (bypassing the remote CPU), and an invalidate using WRITE with IMM to release the tensor buffer on the remote side. The remote side only polls for the invalidate message and Unrefs the tensor buffers that have been read by its peer.
Environment
To fully utilize GDR, the target environment has to meet 3 conditions:
1. An RDMA capable device is present and OFED is installed, which can be verified with ibv_devinfo.
2. The GPU is adjacent to an RDMA capable NIC in the PCIe topology, which can be checked with e.g. nvidia-smi topo -m. For example, in the following topology, GPU2 and GPU3 are adjacent to mlx4_0, and tensors on these devices could benefit from GDR in the current implementation.
3. The nv_peer_mem kernel module is installed.
How to build and run in GDR mode
To test it out on a GDR capable environment, choose to enable GDR in your configure script.
Change your protocol to grpc+gdr to enable GDR in your deployment. Currently the out-of-band transport service listens to the same IP and port address as specified in gRPC.
A successful initialization looks like this:
The last line suggests that the GPUs with bus id 2 (mapped to pci bus ids prefixed 0000:8) will benefit from GDR and host memory bypass, which is /gpu:2 and /gpu:3 in this case.
Caveats
In the current implementation, only tensors that reside in host memory, or in GPU memory where the GPU is adjacent to an RDMA capable NIC, will use direct RDMA as their transport. When RDMA is available but GDR is not, a temporary tensor copy in host memory is used as the RDMA source/destination (and copied from/to the target device). When no RDMA device is present, it can even fall back to the original gRPC runtime. While it is theoretically possible to mix GDR-enabled TF with non-GDR deployments in the same job, make sure the environment is properly set up so that GDR mode is enabled whenever possible (i.e. it does not fall back to gRPC unless absolutely necessary).