
GPUDirect RDMA Out-of-Band Tensor Transport #11392

Merged
merged 48 commits into tensorflow:master on Aug 9, 2017

Conversation

byronyi
Contributor

@byronyi byronyi commented Jul 9, 2017

Introduction

This PR implements GDR out-of-band transport for the TensorFlow distributed runtime, complementary to the current gRPC transport. It uses gRPC as the control plane to set up a rendezvous for each tensor transmission, and utilizes GPUDirect RDMA whenever possible to transmit tensors in remote GPU memory through the network interface card (NIC), bypassing host memory and the CPU entirely. It gracefully falls back to ordinary RDMA, or even gRPC, when GDR is not available.

Design

The GDR out-of-band transport is designed to avoid any unnecessary memory copies, especially for large tensors (>100MB). That typically requires registering tensor buffers with the NIC on the fly, which is rather slow, as described in the design trade-offs of the verbs runtime. The verbs runtime therefore chooses to manage its own NIC-registered buffers and to copy tensors from/to those buffers for every single tensor transfer.

We show, however, that this design trade-off is not always necessary. In this patch, we manage both computation and communication buffers in a unified manner. By pre-registering large buffers with the NIC and allocating small tensors from that buffer pool using a BFC allocator, it is possible to avoid both on-the-fly buffer registration and memory copies altogether.
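
To make the idea concrete, here is a minimal sketch of registering one large slab with the NIC up front and carving tensor buffers out of it, so that neither per-tensor ibv_reg_mr calls nor staging copies are needed. This is not the code in this PR; the RegisteredSlab class and its bump allocator are hypothetical stand-ins for the instrumented BFC allocators.

#include <infiniband/verbs.h>

#include <cstdint>
#include <cstdlib>
#include <stdexcept>

// Hypothetical slab allocator: one ibv_reg_mr call covers every tensor that
// is later carved out of the slab, so no registration happens on the fly.
class RegisteredSlab {
 public:
  RegisteredSlab(ibv_pd* pd, size_t bytes) : size_(bytes), offset_(0) {
    base_ = static_cast<char*>(std::malloc(bytes));
    mr_ = ibv_reg_mr(pd, base_, bytes,
                     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (mr_ == nullptr) throw std::runtime_error("ibv_reg_mr failed");
  }

  // Trivial bump allocation for illustration only; the actual patch relies on
  // a BFC allocator to reuse freed chunks inside the registered region.
  void* Allocate(size_t bytes) {
    if (offset_ + bytes > size_) return nullptr;
    void* p = base_ + offset_;
    offset_ += bytes;
    return p;
  }

  // Remote peers need the rkey (delivered over gRPC) to issue RDMA READs.
  uint32_t rkey() const { return mr_->rkey; }
  uint32_t lkey() const { return mr_->lkey; }

 private:
  char* base_;
  size_t size_;
  size_t offset_;
  ibv_mr* mr_;
};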

For the actual tensor transport, we rely on gRPC to transmit the remote buffer information. This greatly simplifies our design, and there are only two types of RDMA messages: a single READ to retrieve the tensor data (bypassing the remote CPU), and an invalidate message using WRITE with immediate (IMM) to release the tensor buffer on the remote side. The remote side only polls for the invalidate message and unrefs the tensor buffers that have been read by its peer.
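
For illustration, the two messages could be posted roughly as follows. This is only a hedged sketch, not the PR's code: the helper names, the buffer_token carried in the immediate value, and the assumption that remote_addr/rkey arrived via the gRPC control message are all illustrative. The remote side posts receive work requests and, upon an IBV_WC_RECV_RDMA_WITH_IMM completion, unrefs the buffer identified by the immediate value.

#include <arpa/inet.h>
#include <infiniband/verbs.h>

#include <cstdint>
#include <cstring>

// Pull the tensor bytes straight from the remote buffer (no remote CPU work).
int PostTensorRead(ibv_qp* qp, void* local_buf, uint32_t lkey, size_t bytes,
                   uint64_t remote_addr, uint32_t rkey, uint64_t wr_id) {
  ibv_sge sge;
  std::memset(&sge, 0, sizeof(sge));
  sge.addr = reinterpret_cast<uint64_t>(local_buf);
  sge.length = static_cast<uint32_t>(bytes);
  sge.lkey = lkey;

  ibv_send_wr wr;
  std::memset(&wr, 0, sizeof(wr));
  wr.wr_id = wr_id;
  wr.opcode = IBV_WR_RDMA_READ;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;

  ibv_send_wr* bad = nullptr;
  return ibv_post_send(qp, &wr, &bad);
}

// Tell the remote side it may unref the tensor buffer: a zero-byte WRITE with
// immediate whose 32-bit value identifies the buffer (token scheme assumed).
int PostInvalidate(ibv_qp* qp, uint64_t remote_addr, uint32_t rkey,
                   uint32_t buffer_token, uint64_t wr_id) {
  ibv_send_wr wr;
  std::memset(&wr, 0, sizeof(wr));
  wr.wr_id = wr_id;
  wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
  wr.imm_data = htonl(buffer_token);
  wr.num_sge = 0;  // no payload; only the immediate value is delivered
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = remote_addr;  // unused for a zero-byte write,
  wr.wr.rdma.rkey = rkey;                // but kept valid to be safe

  ibv_send_wr* bad = nullptr;
  return ibv_post_send(qp, &wr, &bad);
}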

Environment

To fully utilize GDR, the target environment has to meet 3 conditions:

  1. There is an RDMA-capable device with the corresponding OFED package installed (detailed information is available from your InfiniBand/RoCE/iWarp vendor), which can be verified with ibv_devinfo, e.g.
$ ibv_devinfo
hca_id:	mlx4_0
	transport:			InfiniBand (0)
	fw_ver:				2.40.7000
	node_guid:			248a:0703:00f6:3370
	sys_image_guid:			248a:0703:00f6:3370
	vendor_id:			0x02c9
	vendor_part_id:			4099
	hw_ver:				0x1
	board_id:			MT_1090110023
	phys_port_cnt:			2
	Device ports:
		port:	1
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet

		port:	2
			state:			PORT_ACTIVE (4)
			max_mtu:		4096 (5)
			active_mtu:		1024 (3)
			sm_lid:			0
			port_lid:		0
			port_lmc:		0x00
			link_layer:		Ethernet
  2. There is a GDR-capable GPU, i.e. of Fermi, Kepler or later architecture, with the corresponding driver installed. The PCIe topology can be confirmed with nvidia-smi topo -m. For example, in the following topology, GPU2 and GPU3 are adjacent to mlx4_0, and tensors on these devices can benefit from GDR in the current implementation.
$ nvidia-smi topo -m
	GPU0	GPU1	GPU2	GPU3	mlx4_0	CPU Affinity
GPU0	 X 	PHB	SOC	SOC	SOC	0-5
GPU1	PHB	 X 	SOC	SOC	SOC	0-5
GPU2	SOC	SOC	 X 	PHB	PHB	6-11
GPU3	SOC	SOC	PHB	 X 	PHB	6-11
mlx4_0	SOC	SOC	PHB	PHB	 X

Legend:

  X   = Self
  SOC  = Connection traversing PCIe as well as the SMP link between CPU sockets(e.g. QPI)
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing a single PCIe switch
  NV#  = Connection traversing a bonded set of # NVLinks
  3. The nv_peer_mem kernel module is installed.

How to build and run in GDR mode

To test it out in a GDR-capable environment, enable GDR in your configure script.

Do you wish to build TensorFlow with GDR support? [y/N]: y
GDR support will be enabled for TensorFlow.

Change your protocol to grpc+gdr to enable GDR in your deployment.

server = tf.train.Server(cluster, job_name="local", task_index=0, protocol='grpc+gdr') # default protocol is 'grpc'

Currently the out-of-band transport service listens on the same IP address and port as specified for gRPC.

A successful initialization looks like this:

2017-08-05 19:10:38.601718: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:02:00.0)
2017-08-05 19:10:38.601728: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K40m, pci bus id: 0000:03:00.0)
2017-08-05 19:10:38.601736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K40m, pci bus id: 0000:82:00.0)
2017-08-05 19:10:38.601742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K40m, pci bus id: 0000:83:00.0)
2017-08-05 19:10:39.591026: I tensorflow/contrib/gdr/gdr_memory_manager.cc:235] RDMA server is listening on 10.40.2.200:5001
2017-08-05 19:10:39.591071: I tensorflow/contrib/gdr/gdr_memory_manager.cc:285] Instrumenting CPU allocator cuda_host_bfc
2017-08-05 19:10:39.591083: I tensorflow/contrib/gdr/gdr_memory_manager.cc:285] Instrumenting CPU allocator cpu_pool
2017-08-05 19:10:39.591095: I tensorflow/contrib/gdr/gdr_memory_manager.cc:285] Instrumenting CPU allocator cpu_rdma_bfc
2017-08-05 19:10:39.591278: I tensorflow/contrib/gdr/gdr_memory_manager.cc:78] NUMA node for device: mlx4_0 is 1
2017-08-05 19:10:39.740253: I tensorflow/contrib/gdr/gdr_memory_manager.cc:296] Instrumenting GPU allocator with bus_id 2

The last line indicates that the GPUs with bus id 2 (mapped to pci bus ids prefixed 0000:8) will benefit from GDR and host-memory bypass, which is /gpu:2 and /gpu:3 in this case.

Caveats

In the current implementation, only tensors that reside in host memory, or in the memory of a GPU adjacent to an RDMA-capable NIC, use direct RDMA as their transport. When RDMA is available but GDR is not, a temporary tensor copy in host memory is used as the RDMA source/destination (and copied from/to the target device). When no RDMA device is present, it can even fall back to the original gRPC runtime. While it is theoretically possible to mix GDR-enabled TF with non-GDR deployments in the same job, make sure the environment is properly set up so that GDR mode is enabled whenever possible (i.e. do not fall back to gRPC when it is not absolutely necessary).
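
To summarize the fallback order, the transport decision could be sketched like this. This is a hypothetical illustration of the behaviour described above, not the PR's actual code; all names are made up.

enum class Transport { kDirectRdma, kHostStagedRdma, kGrpc };

// Hedged sketch of the caveats above: use direct RDMA (GDR when the tensor is
// on a GPU adjacent to an RDMA-capable NIC, or plain RDMA for host tensors),
// stage through a host copy when the GPU is not adjacent to the NIC, and fall
// back to gRPC when no RDMA device is present.
Transport ChooseTransport(bool has_rdma_nic, bool tensor_on_host,
                          bool gpu_adjacent_to_nic) {
  if (!has_rdma_nic) return Transport::kGrpc;
  if (tensor_on_host || gpu_adjacent_to_nic) return Transport::kDirectRdma;
  return Transport::kHostStagedRdma;
}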

@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@drpngx
Contributor

drpngx commented Jul 9, 2017

Jenkins, test this please.

@drpngx
Contributor

drpngx commented Jul 9, 2017

Are you depending on a host RDMA library to be installed?

tensorflow/core/distributed_runtime/rpc/rdma.cc:13:27: fatal error: rdma/rdma_cma.h: No such file or directory
 #include <rdma/rdma_cma.h>

@byronyi
Contributor Author

byronyi commented Jul 9, 2017

@drpngx Yes, as in the case of CUDA, we assume an OFED package is installed with all the drivers and user-space libraries required for RDMA to function properly. libibverbs and librdmacm are included in any OFED package.

I agree that we need some sort of compile switch, but I haven't figured out the best way to do it. Alternatively, we could add linux-rdma/rdma-core as a third-party compile dependency and detect at runtime whether the GDR environment is available.

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

I have done some investigation and would like to propose adding librdmacm and libibverbs as third-party dependencies (librdmacm depends on libibverbs). Both compile on the latest Linux without any hardware requirements. At runtime, one could simply call ibv_get_device_list(3) and check whether the target environment requirements are met.
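
A minimal sketch of such a runtime probe (assuming libibverbs is linked in; the function name is illustrative):

#include <infiniband/verbs.h>

// Returns true when at least one RDMA-capable device is visible, so callers
// can decide between the RDMA path and the plain gRPC fallback at startup.
bool HasRdmaDevice() {
  int num_devices = 0;
  ibv_device** devices = ibv_get_device_list(&num_devices);
  if (devices == nullptr) return false;  // verbs not usable at all
  ibv_free_device_list(devices);
  return num_devices > 0;
}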

Of course, all of this is Linux-specific, so we need an extra compatibility layer for non-Linux platforms. I do know there are RDMA APIs available on Windows, but I don't have a Windows box, so this has to be left to future contributors.

For functional testing, we could use software RDMA over Ethernet as a software-emulated RDMA layer that runs without RDMA hardware. It does require an upstream Linux kernel of version 4.8 or later to enable that feature, so we would need to set up a separate CI environment for that purpose.

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

@shamoya Any interest in this patch? I would certainly appreciate any help from Mellanox on testing and setting up a CI environment with GDR (and possibly RoCE).

@drpngx
Contributor

drpngx commented Jul 10, 2017 via email

@gunan
Contributor

gunan commented Jul 10, 2017

buildifier checks are failing here, so all the other tests are blocked.

Running do_buildifier on 204 files

tensorflow/core/distributed_runtime/rpc/BUILD # reformat callsort listsort unsafesort sort:tf_cuda_library.deps

buildifier took 1 s

FAIL: buildifier found errors and/or warnings in above BUILD files.
buildifier suggested the following changes:
248a249,254
>     linkopts = select({
>         "//conditions:default": [
>             "-libverbs",
>             "-lrdmacm",
>         ],
>     }),
250d255
<         "//tensorflow/core/distributed_runtime:rdma",
254a260
>         "//tensorflow/core/distributed_runtime:rdma",
256,258d261
<     linkopts = select({
<         "//conditions:default": ["-libverbs", "-lrdmacm"],
<     }),
Please fix manually or run buildifier <file> to auto-fix.

Could you look into this failure, so that the rest of our CI tests can run?

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

@gunan Yes, I am looking into it. Will ping you after it's resolved.

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

I am working on introducing the RDMA user-space libraries as third-party dependencies, but I am not familiar with converting Makefile or CMake rules into Bazel BUILD files. A similar case, #5349 (integrating cURL), seems to be a much larger effort than I expected. I would appreciate it if @jart or @gunan could give some suggestions on this.

@drpngx
Contributor

drpngx commented Jul 10, 2017

Jenkins, test this please.

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

@drpngx Sorry, I pushed the wrong commit. Could you ask Jenkins to test it again?

@drpngx
Contributor

drpngx commented Jul 10, 2017 via email

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

Seems a rebase is needed.

@frankchn frankchn added and then removed the awaiting review label Jul 10, 2017
@shamoya
Contributor

shamoya commented Jul 10, 2017

Thanks @byronyi
Let me take a look at this patch and get back to you soon.
We have thought of GDR only in the context of the verbs code until now, so this is interesting.
Adding the RDMA libraries as third-party dependencies is a good idea; it would also benefit the grpc+verbs protocol flow.

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

@shamoya Glad to have your prompt reply. The last commit turned out to be an unsuccessful attempt to introduce the RDMA dependency. If you would like to try out the patch on a machine with OFED installed, you can revert the last commit and build from source.

Update: I've reverted it myself, so it should work by just checking out this branch.

@shamoya
Contributor

shamoya commented Jul 10, 2017

Are the libraries in third_party linked statically into the code?
In general, how can one provide a different library version than the one in third_party (gRPC, for example)?

@byronyi
Contributor Author

byronyi commented Jul 10, 2017

My initial attempt was to include only the RDMA headers, which apparently is not runnable as it isn't linked against the actual libraries. I haven't figured out how to use Bazel to build rdma-core; it uses CMake, and porting that to Bazel is hard in general. I guess we need to consult someone on the TF team about platform and portability issues.

Regarding the version of a specific third-party library, I think it is hardcoded into the compilation directives. Google seems to compile everything from source for their projects, so I guess this isn't an issue for them.

@drpngx
Contributor

drpngx commented Jul 11, 2017

Jenkins, test this please.

@byronyi
Contributor Author

byronyi commented Jul 11, 2017

@drpngx I am still working on resolving the build issues... maybe we should wait for a review from @poxvoculi first?

@drpngx
Contributor

drpngx commented Jul 11, 2017 via email

@byronyi
Contributor Author

byronyi commented Jul 11, 2017

The current CI doesn't have a GDR environment, so we are specifically testing the "fall back to gRPC" behaviour.

I've already implemented this logic in my code (here and there), but the GDR runtime requires the RDMA user-space libraries, which are currently not linked into the TF core library. I am currently stuck on adding them as a third-party dependency (porting a CMake-based project to Bazel).

Alternatively, I could make it a separate runtime (like grpc+verbs and grpc+mpi), but that would defeat the whole point of being able to fall back to gRPC.

@byronyi
Contributor Author

byronyi commented Jul 11, 2017

Or I could just add a compiler directive in configure and disable everything that's useful.

@byronyi byronyi changed the title GPU Direct RDMA Out-of-Band Tensor Transport [WIP] GPU Direct RDMA Out-of-Band Tensor Transport Jul 11, 2017
@byronyi
Contributor Author

byronyi commented Jul 11, 2017

@drpngx I've tested it on my non-RDMA box and everything works just fine. It does require the box to have librdmacm and libibverbs installed, though. It falls back to gRPC correctly, as suggested.

Is there any way to set up a CI environment with those two libraries? They are available on all Linux distros I know of. After we confirm everything builds and tests properly, we can arrange a review of my patch.

@martinwicke
Member

martinwicke commented Aug 8, 2017 via email

@byronyi
Contributor Author

byronyi commented Aug 8, 2017

Thanks for the clarification! I actually just realized that :)

@rmlarsen
Member

rmlarsen commented Aug 8, 2017

@tensorflow-jenkins test this please

@rmlarsen rmlarsen self-assigned this Aug 8, 2017
@@ -285,12 +290,12 @@ Status GdrMemoryManager::Init() {
}

#if GOOGLE_CUDA
VisitableAllocator::Visitor cuda_alloc_visitor =
std::bind(&GdrMemoryManager::InsertCUDAMemoryRegion, this, _1, _2);
Contributor

If I understand what you're doing correctly, you're registering the entire GPU memory to have the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS property. This seems to be a technique that NVIDIA provides because it makes concurrent programming easier, but I would avoid it because it likely will also damage performance considerably.

It sounds like you suspect a race condition on either the RDMA read, or some use of the final value. For background, here's how we generally handle this issue.

TensorFlow tries hard to use as few synchronous CUDA calls as possible, and to introduce as few sync points as possible. The techniques used rely on careful use of streams. The general discipline is that 4 streams are created for each GPU, by the StreamGroupFactory. All compute Ops (i.e. CUDA kernels) are launched on the single compute stream, and the other three are used for memcpys: one for H2D, one for D2H, and one for D2D copies. In the normal case where an op relies only on the completion of prior ops in the same stream, it can be launched async without danger, since it has no other temporal dependency. In the case where it relies on prior completion of an op on another stream, we need to introduce a sync dependency, which can be done in more than one way. If we need to wait for an ordinary compute op prior to e.g. a D2H copy, we can just introduce a wait on the current compute stream prior to the copy, as here. This will cause the i/o stream to wait for all ops pending on the compute stream at the time of the call to complete before launching the next op added to itself.

In the case where we need to wait for a memcpy to terminate on one stream before launching a compute on another, the technique used is here. We don't want the compute stream to stall until the i/o is done, so we wait for the i/o to complete before queuing the compute op. The wait is accomplished by a call to the EventMgr.

In our internal networking which uses GPUDirect, we use the EventMgr to ensure that we don't start a send before the op writing the buffer to be read has completed. In the other direction, our RPC system provides a callback that executes only after the target buffer area has been written, and we use that callback to trigger subsequent ops that want to read that area.
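
For readers unfamiliar with the pattern, here is the plain CUDA runtime equivalent of that cross-stream dependency (a generic illustration only, not TensorFlow's EventMgr/StreamExecutor API): record an event on the producing stream and have the consuming stream wait on it, instead of synchronizing the whole device.

#include <cuda_runtime.h>

// Make a D2H copy wait only for the work already queued on the compute
// stream, without stalling the host or the rest of the device.
void CopyAfterCompute(cudaStream_t compute_stream, cudaStream_t d2h_stream,
                      void* dst_host, const void* src_device, size_t bytes) {
  cudaEvent_t done;
  cudaEventCreateWithFlags(&done, cudaEventDisableTiming);
  cudaEventRecord(done, compute_stream);     // point where the buffer is valid
  cudaStreamWaitEvent(d2h_stream, done, 0);  // copy stream waits for that point
  cudaMemcpyAsync(dst_host, src_device, bytes, cudaMemcpyDeviceToHost,
                  d2h_stream);
  cudaEventDestroy(done);  // released once the recorded work completes
}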

Contributor Author

I noticed the stream executor, and I think it is a good design for both pipelining and parallelism. It is just that I have not figured out a viable path to accommodate my current design to it, so I needed a temporary fix for the race condition issue.

Let me take a closer look and see if I can do better. Thanks for the comment! It is very helpful.

@rmlarsen
Member

rmlarsen commented Aug 8, 2017

@tensorflow-jenkins test this please

@byronyi byronyi changed the title GPU Direct RDMA Out-of-Band Tensor Transport [WIP] GPU Direct RDMA Out-of-Band Tensor Transport Aug 8, 2017
@rmlarsen rmlarsen merged commit 11e2aef into tensorflow:master Aug 9, 2017
@byronyi byronyi changed the title [WIP] GPU Direct RDMA Out-of-Band Tensor Transport GPU Direct RDMA Out-of-Band Tensor Transport Aug 9, 2017
@tensorflow-jenkins
Collaborator

Can one of the admins verify this patch?

@byronyi
Contributor Author

byronyi commented Aug 9, 2017

I thought I would make some further improvements, but @rmlarsen seems eager to merge this. Alright then, I'll leave them for the future 😆

Good job, guys; thank you all for the feedback in this thread. I hope we get some users soon so we can hear back from them.

@shamoya
Contributor

shamoya commented Aug 9, 2017

Thank you @byronyi for the contribution!
I hope we'll start testing it soon (+ @bkovalev)

@suiyuan2009
Contributor

I find that the first GPU's forward propagation is much slower than that of the other GPUs on the same machine when training ResNet-101 using grpc+gdr on a 4-GPU worker machine (grpc may have the same problem, but grpc is much slower, so it may hide the issue). I have run it many times; the trace file is here.

@byronyi byronyi changed the title GPU Direct RDMA Out-of-Band Tensor Transport GPUDirect RDMA Out-of-Band Tensor Transport Sep 2, 2017
@avolkov1

@byronyi I keep getting an error: UnavailableError: No such device: cannot bind to rdma://hsw215:2302

No matter what port I try, I get that error. hsw215 is a compute node. This is my cluster spec:

CLUSTER_SPEC_DICT: {'ps': ['hsw215:2300'], 'worker': ['hsw215:2301', 'hsw215:2302', 'hsw215:2303', 'hsw215:2304']}

I compiled TF 1.4 with GDR, and I have nv_peer_mem running:

$ ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         10.16.1006
        node_guid:                      e41d:2d03:0006:ae40
        sys_image_guid:                 e41d:2d03:0006:ae40
        vendor_id:                      0x02c9
        vendor_part_id:                 4113
        hw_ver:                         0x0
        board_id:                       MT_1210110019
        phys_port_cnt:                  2
        Device ports:
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               16
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 0
                        port_lid:               65535
                        port_lmc:               0x00
                        link_layer:             InfiniBand


$ service nv_peer_mem status
nv_peer_mem module is loaded.

My code runs fine with protocol='grpc+verbs'. I get messages like:

I tensorflow/contrib/verbs/rdma_mgr.cc:56] connecting to remote node /job:worker/replica:0/task:0

RDMA runs well, but I can't get GDR working. Maybe I'm doing something stupid, but it's not clear from the README what else needs to be done. Thanks.

@byronyi
Contributor Author

byronyi commented Oct 19, 2017

@avolkov1 The GDR runtime binds to whatever IP you assign, even if it is not an IB device. Please check the IP address first. Sometimes the same host resolves to different IPs on different machines.

@avolkov1

@byronyi Thank you. I've made some progress. Now I'm getting Unavailable: Cannot find pinned memory region errors. Have you come across that? I'm trying to run multi-GPU training. Again, this worked with grpc+verbs. Are there some gotchas with parameter servers on CPU?

I get the RDMA GDR endpoints working:

1: 2017-10-19 11:17:07.804151: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
2: 2017-10-19 11:17:07.803164: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw215.ib.cluster:2301
0: 2017-10-19 11:17:08.033571: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw215.ib.cluster:2301
1: 2017-10-19 11:17:08.033677: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
1: 2017-10-19 11:17:21.449746: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw216.ib.cluster:2300
2: 2017-10-19 11:17:21.448892: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
3: 2017-10-19 11:17:21.455124: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw216.ib.cluster:2300
2: 2017-10-19 11:17:21.455214: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
1: 2017-10-19 11:17:22.995750: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw215.ib.cluster:2300
0: 2017-10-19 11:17:22.995812: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
2: 2017-10-19 11:17:29.572063: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw216.ib.cluster:2301
3: 2017-10-19 11:17:29.572255: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
3: 2017-10-19 11:17:29.572637: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw215.ib.cluster:2300
0: 2017-10-19 11:17:29.573710: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
0: 2017-10-19 11:17:38.169948: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw216.ib.cluster:2301
3: 2017-10-19 11:17:38.169127: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
1: 2017-10-19 11:17:49.647909: I tensorflow/contrib/gdr/gdr_memory_manager.cc:333] Accepted new RDMA connection
3: 2017-10-19 11:17:49.646921: I tensorflow/contrib/gdr/gdr_memory_manager.cc:678] RDMA endpoint connected to rdma://hsw215.ib.cluster:2301

I'm running this on Slurm. The first number above is the task id. Here's my cluster setup:

CLUSTER_SPEC_DICT: {'ps': ['hsw215.ib.cluster:2300', 'hsw216.ib.cluster:2300'], 'worker': ['hsw215.ib.cluster:2301', 'hsw216.ib.cluster:2301']}
# slurm task 0 - hsw215.ib.cluster:2300 - ps 0
# slurm task 1 - hsw215.ib.cluster:2301 - worker 0
# slurm task 2 - hsw216.ib.cluster:2300 - ps 1
# slurm task 3 - hsw216.ib.cluster:2301 - worker 1

The parameter servers run on CPU and each worker has 4 GPUs. Device list:

['/job:ps/task:0', '/job:ps/task:1']
['/job:worker/task:0/device:GPU:0', '/job:worker/task:0/device:GPU:1', '/job:worker/task:0/device:GPU:2', '/job:worker/task:0/device:GPU:3', '/job:worker/task:1/device:GPU:0', '/job:worker
/task:1/device:GPU:1', '/job:worker/task:1/device:GPU:2', '/job:worker/task:1/device:GPU:3']

ERRORS:

1:       [[Node: _recv_concatenate_1_target_0_S581 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_in
carnation=1867630968541887687, tensor_name="edge_3354__recv_concatenate_1_target_0", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
1:       [[Node: training/RMSprop/gradients/concatenate_1/concat_grad/Slice_2_S699 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:2", send_device="/job:ps/replica:0/task
:0/device:CPU:0", send_device_incarnation=-1707922374272032114, tensor_name="edge_1755_training/RMSprop/gradients/concatenate_1/concat_grad/Slice_2", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:
0/device:GPU:2"]()]]
1: 2017-10-19 11:17:49.641150: W tensorflow/core/framework/op_kernel.cc:1192] Unavailable: Cannot find pinned memor
1: y region
1:       [[Node: _recv_concatenate_1_target_0_S581 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_in
carnation=1867630968541887687, tensor_name="edge_3354__recv_concatenate_1_target_0", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
1:       [[Node: training/RMSprop/gradients/concatenate_1/concat_grad/Slice_2_S699 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:2", send_device="/job:ps/replica:0/task
:0/device:CPU:0", send_device_incarnation=-1707922374272032114, tensor_name="edge_1755_training/RMSprop/gradients/concatenate_1/concat_grad/Slice_2", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:
0/device:GPU:2"]()]]
1: 2017-10-19 11:17:49.641155: W tensorflow/core/framework/op_kernel.cc:1192] Unavailable: Cannot find pinned memory region
1:       [[Node: _recv_concatenate_1_target_0_S581 = _Recv[client_terminated=false, recv_device="/job:ps/replica:0/task:0/device:GPU:0", send_device="/job:worker/replica:0/task:0/device:CPU:0", send_device_in
carnation=1867630968541887687, tensor_name="edge_3354__recv_concatenate_1_target_0", tensor_type=DT_FLOAT, _device="/job:ps/replica:0/task:0/device:GPU:0"]()]]
1:       [[Node: training/RMSprop/gradients/concatenate_1/concat_grad/Slice_2_S699 = _Recv[client_terminated=false, recv_device="/job:worker/replica:0/task:0/device:GPU:2", send_device="/job:ps/replica:0/task
:0/device:CPU:0", send_device_incarnation=-1707922374272032114, tensor_name="edge_1755_training/RMSprop/gradients/concatenate_1/concat_grad/Slice_2", tensor_type=DT_FLOAT, _device="/job:worker/replica:0/task:
0/device:GPU:2"]()]]

3: 2017-10-19 11:18:25.395597: E tensorflow/contrib/gdr/gdr_rendezvous_mgr.cc:71] Cannot find pinned memory region from allocator cpu_pool
3: 2017-10-19 11:18:25.395852: E tensorflow/contrib/gdr/gdr_rendezvous_mgr.cc:71] Cannot find pinned memory region from allocator cuda_host_bfc
3: 2017-10-19 11:18:25.395913: E tensorflow/contrib/gdr/gdr_rendezvous_mgr.cc:71] Cannot find pinned memory region from allocator cpu_pool
3: 2017-10-19 11:18:25.396150: E tensorflow/contrib/gdr/gdr_rendezvous_mgr.cc:71] Cannot find pinned memory region from allocator cpu_pool
3: 2017-10-19 11:18:25.396200: E tensorflow/contrib/gdr/gdr_rendezvous_mgr.cc:71] Cannot find pinned memory region from allocator cpu_pool

@byronyi
Contributor Author

byronyi commented Oct 20, 2017

@avolkov1 I suspect it is because of the limit on page-locked memory size. Check with ulimit -l or ulimit -a to see the limit for your current user.

EDIT: you might need your sysadmin to modify the ulimit. It is usually set to unlimited by the IB driver installation, but that depends on your vendor. See man limits.conf for an explanation.

@avolkov1

@byronyi Thanks. I looked at it. It seems to be unlimited.

hsw228$ ulimit -l
unlimited
hsw228$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1031351
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

@byronyi
Contributor Author

byronyi commented Oct 23, 2017

@avolkov1 I failed to reproduce your issue on a fresh installation of the vendor-provided OFED on my testbed. Would you mind pasting the full log?

@avolkov1

@byronyi Will do, thank you very much. Give me a day or two (busy with work-related stuff); I'll post the error log and the code I use to run.
