Implement async TensorFromTransportOptions for GDR #24058
Conversation
@poxvoculi I was blocked on #23933 for the past week, and I am glad I can fix it now. I will go back to refactoring the networking plugins in the next couple of days.
return;
}

ibv_mr* mr = FindMemoryRegion(addr, length);

#if GOOGLE_CUDA
if (device->tensorflow_gpu_device_info() && !on_host) {
  Allocator* alloc = GPUProcessState::singleton()->GetCUDAHostAllocator(0);
OK as-is for now. Eventually instead of 0 you should use the numa node of either the GPU you're copying from or the NIC that's going to read the CPU buffer.
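For illustration, a rough sketch of what a NUMA-aware choice could look like. This is not code from this PR: PickHostAllocator is a hypothetical helper, and it assumes the GPU's DeviceLocality carries a numa_node field (treat that accessor chain as an assumption).

// Sketch only: prefer the NUMA node recorded in the device's locality
// (or the NIC's node) over a hard-coded 0, falling back when unset.
Allocator* PickHostAllocator(Device* device) {
  int numa_node = device->attributes().locality().numa_node();  // assumption: locality is populated
  if (numa_node < 0) numa_node = 0;  // no affinity recorded
  return GPUProcessState::singleton()->GetCUDAHostAllocator(numa_node);
}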
Resolved. I believe we had this discussion a long time ago in #10629. Things certainly changed :P
mutable_transport_options->PackFrom(remote_mr);
uint64_t checksum = 0;
if (VLOG_IS_ON(2)) {
  checksum = GPUUtil::Checksum(device, device_context, tensor);
Nit. GPUUtil::Checksum actually copies the tensor to a temporary CPU buffer and then computes the checksum there, since we don't (yet) have a GPU-native checksum function. So you could compute this on host_copy instead.
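A tiny sketch of that suggestion, assuming host_copy is the CPU-side staging tensor already produced above and that GPUUtil::Checksum has a CPU-tensor overload (if it does not, checksumming host_copy's bytes directly would serve the same purpose):

uint64_t checksum = 0;
if (VLOG_IS_ON(2)) {
  // host_copy already holds the tensor bytes on the CPU, so this avoids a
  // second device-to-host copy inside GPUUtil::Checksum.
  checksum = GPUUtil::Checksum(host_copy);
}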
Resolved. Thanks for catching this!
<< "!=" << remote_mr.checksum(); | ||
} | ||
|
||
done(Status::OK()); |
FYI: Be aware that many GPUUtil functions, like CopyCPUTensorToGPU, execute their callbacks in the single polling thread of the EventMgr associated with the GPU. It's the responsibility of the programmer to ensure that these callbacks are not blocking or otherwise long-running. If they are, no further events will be processed for that GPU until the callback closure finally terminates, which can delay other Ops. If the callback needs to trigger expensive processing it must be moved to another thread.
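A minimal sketch of that pattern (the surrounding tensor and variable names are illustrative, not the PR's actual code): the copy callback immediately re-schedules the heavy work onto another thread via Env::Default()->SchedClosure and returns, keeping the EventMgr polling thread free.

// The callback below runs on the GPU's EventMgr polling thread, so it must
// stay cheap: hand any expensive work to another thread right away.
GPUUtil::CopyCPUTensorToGPU(
    &host_copy, device_context, device, &gpu_tensor,
    [done](const Status& s) {
      Env::Default()->SchedClosure([done, s]() {
        // Long-running post-processing goes here, safely off the
        // EventMgr thread.
        done(s);
      });
    });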
Thanks for the heads up. I was reading the EventMgr codebase and it was a good read.
I've rebased to the latest master and cleaned up the code. Note that the checksum part is removed, as it is not guaranteed that the async RDMA client will read the same tensor content over the network as gRPC would. It turns out that sometimes the worker will RDMA-read the next step's weights from the PS after the checksum has been transmitted through transport_options (which was computed with stale weights on the RecvTensor server). I am not sure whether this is a bug or not, but the same trick of sharing the backing tensor_buffer can be found in gRPC as well, so I'll just leave it there. The original purpose of checksumming (i.e. debugging) should be achieved via some other approach (e.g. unit testing).
@Harshini-Gadige Mind kicking off Kokoro testing again?
Done
@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
@googlebot The CLA should be fine. Let me double check.
CLAs look good, thanks!
@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.
Done |
PiperOrigin-RevId: 225428257
Instead of blocking on completion of an RDMA op, the RecvTensor client now posts a work request to the NIC send queue and returns immediately.
The GDR background polling thread handles the callback once the corresponding RDMA op completes, i.e. is polled from the NIC's completion queue. The old epoll-based mechanism is removed, trading higher CPU usage for improved throughput and lower latency for RDMA ops.
The maximum number of work requests (WRs) in the NIC send/recv queues is increased to accommodate the larger number of concurrent RDMA ops. The tensor-size threshold below which we pass the tensor content in metadata is also increased to reduce pressure on the NIC send/recv queues.
This fixes #23933.
Signed-off-by: Bairen Yi <byronyi@clustar.ai>
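For readers who have not worked with the verbs API, here is a bare-bones, illustrative sketch of the post-then-poll pattern the commit message above describes. It assumes a single QP and CQ; the callback bookkeeping and all names are invented for the example (none of it is taken from the GDR code itself), and locking and error handling are omitted.

#include <infiniband/verbs.h>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Issue an RDMA READ without blocking: post the work request and record a
// callback keyed by wr_id; the background poller invokes it once the NIC
// reports completion.
void PostRdmaRead(ibv_qp* qp, uint64_t wr_id, void* local_addr, uint32_t len,
                  uint32_t lkey, uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge = {};
  sge.addr = reinterpret_cast<uint64_t>(local_addr);
  sge.length = len;
  sge.lkey = lkey;

  ibv_send_wr wr = {};
  wr.wr_id = wr_id;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.opcode = IBV_WR_RDMA_READ;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;

  ibv_send_wr* bad_wr = nullptr;
  ibv_post_send(qp, &wr, &bad_wr);  // returns immediately; no blocking wait
}

// Background polling loop: busy-polls the completion queue and runs the
// callback registered for each completed work request.
// NOTE: a real implementation needs locking around `cbs` and must check
// wc[i].status before treating the op as successful.
void PollCompletions(ibv_cq* cq,
                     std::unordered_map<uint64_t, std::function<void()>>* cbs) {
  ibv_wc wc[32];
  while (true) {
    int n = ibv_poll_cq(cq, 32, wc);
    for (int i = 0; i < n; ++i) {
      auto it = cbs->find(wc[i].wr_id);
      if (it != cbs->end()) {
        it->second();  // completion callback for this RDMA op
        cbs->erase(it);
      }
    }
  }
}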