Implement async TensorFromTransportOptions for GDR #24058
Conversation
@poxvoculi I was blocked on #23933 for the past week, and I am glad I can fix it now. I will go back to refactoring the networking plugins in the next couple of days.
return;
}

ibv_mr* mr = FindMemoryRegion(addr, length);

#if GOOGLE_CUDA
if (device->tensorflow_gpu_device_info() && !on_host) {
  Allocator* alloc = GPUProcessState::singleton()->GetCUDAHostAllocator(0);
OK as-is for now. Eventually instead of 0 you should use the numa node of either the GPU you're copying from or the NIC that's going to read the CPU buffer.
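For illustration, a rough sketch of what a NUMA-aware choice could look like. This is not code from this PR: PickHostAllocator is a hypothetical helper, and it assumes the GPU's DeviceLocality carries a numa_node field (treat that accessor chain as an assumption).

// Sketch only: prefer the NUMA node recorded in the device's locality
// (or the NIC's node) over a hard-coded 0, falling back when unset.
Allocator* PickHostAllocator(Device* device) {
  int numa_node = device->attributes().locality().numa_node();  // assumption: locality is populated
  if (numa_node < 0) numa_node = 0;  // no affinity recorded
  return GPUProcessState::singleton()->GetCUDAHostAllocator(numa_node);
}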
Resolved. I believe we had this discussion a long time ago in #10629. Things certainly changed :P
mutable_transport_options->PackFrom(remote_mr);
uint64_t checksum = 0;
if (VLOG_IS_ON(2)) {
  checksum = GPUUtil::Checksum(device, device_context, tensor);
Nit. GPUUtil::Checksum actually copies the tensor to a temporary CPU buffer and then computes the checksum there, since we don't (yet) have a GPU-native checksum function. So you could compute this on host_copy instead.
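A tiny sketch of that suggestion, assuming host_copy is the CPU-side staging tensor already produced above and that GPUUtil::Checksum has a CPU-tensor overload (if it does not, checksumming host_copy's bytes directly would serve the same purpose):

uint64_t checksum = 0;
if (VLOG_IS_ON(2)) {
  // host_copy already holds the tensor bytes on the CPU, so this avoids a
  // second device-to-host copy inside GPUUtil::Checksum.
  checksum = GPUUtil::Checksum(host_copy);
}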
Resolved. Thanks for catching this!
<< "!=" << remote_mr.checksum(); | ||
} | ||
|
||
done(Status::OK()); |
FYI: Be aware that many GPUUtil functions, like CopyCPUTensorToGPU, execute their callbacks in the single polling thread of the EventMgr associated with the GPU. It's the responsibility of the programmer to ensure that these callbacks are not blocking or otherwise long-running. If they are, no further events will be processed for that GPU until the callback closure finally terminates, which can delay other Ops. If the callback needs to trigger expensive processing it must be moved to another thread.
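A minimal sketch of that pattern (the surrounding tensor and variable names are illustrative, not the PR's actual code): the copy callback immediately re-schedules the heavy work onto another thread via Env::Default()->SchedClosure and returns, keeping the EventMgr polling thread free.

// The callback below runs on the GPU's EventMgr polling thread, so it must
// stay cheap: hand any expensive work to another thread right away.
GPUUtil::CopyCPUTensorToGPU(
    &host_copy, device_context, device, &gpu_tensor,
    [done](const Status& s) {
      Env::Default()->SchedClosure([done, s]() {
        // Long-running post-processing goes here, safely off the
        // EventMgr thread.
        done(s);
      });
    });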
Thanks for the heads up. I was reading the EventMgr codebase and it was a good read.
I've rebased to the latest master and cleaned up the code. Note that the checksum part is removed, as it is not guaranteed that the async RDMA client will read the same tensor content over the network as gRPC would. It turns out that sometimes the worker will RDMA-read the next step's weights from the PS after the checksum has been transmitted through transport_options (which was computed with stale weights on the RecvTensor server). I am not sure whether this is a bug or not, but the same trick of sharing the backing tensor_buffer can be found in gRPC as well, so I'll just leave it there. The original purpose of checksumming (i.e. debugging) should be achieved via some other approach (e.g. unit testing).
@Harshini-Gadige Mind kicking off Kokoro testing again?
Done
@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.
We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
@googlebot The CLA should be fine. Let me double check.
CLAs look good, thanks!
@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.
Done |
PiperOrigin-RevId: 225428257
Instead of blocking on completion of an RDMA op, the RecvTensor client now posts a work request to the NIC send queue and returns immediately.
The GDR background polling thread handles the callback once the corresponding RDMA op completes, i.e. is polled from the NIC's completion queue. The old epoll-based mechanism is removed, trading higher CPU usage for improved throughput and lower latency for RDMA ops.
The maximum number of work requests (WRs) in the NIC send/recv queues is increased to accommodate the larger number of concurrent RDMA ops. The tensor-size threshold below which we pass the tensor content in metadata is also increased to reduce pressure on the NIC send/recv queues.
This fixes #23933.
Signed-off-by: Bairen Yi <byronyi@clustar.ai>
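For readers who have not worked with the verbs API, here is a bare-bones, illustrative sketch of the post-then-poll pattern the commit message above describes. It assumes a single QP and CQ; the callback bookkeeping and all names are invented for the example (none of it is taken from the GDR code itself), and locking and error handling are omitted.

#include <infiniband/verbs.h>
#include <cstdint>
#include <functional>
#include <unordered_map>

// Issue an RDMA READ without blocking: post the work request and record a
// callback keyed by wr_id; the background poller invokes it once the NIC
// reports completion.
void PostRdmaRead(ibv_qp* qp, uint64_t wr_id, void* local_addr, uint32_t len,
                  uint32_t lkey, uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge = {};
  sge.addr = reinterpret_cast<uint64_t>(local_addr);
  sge.length = len;
  sge.lkey = lkey;

  ibv_send_wr wr = {};
  wr.wr_id = wr_id;
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.opcode = IBV_WR_RDMA_READ;
  wr.send_flags = IBV_SEND_SIGNALED;
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;

  ibv_send_wr* bad_wr = nullptr;
  ibv_post_send(qp, &wr, &bad_wr);  // returns immediately; no blocking wait
}

// Background polling loop: busy-polls the completion queue and runs the
// callback registered for each completed work request.
// NOTE: a real implementation needs locking around `cbs` and must check
// wc[i].status before treating the op as successful.
void PollCompletions(ibv_cq* cq,
                     std::unordered_map<uint64_t, std::function<void()>>* cbs) {
  ibv_wc wc[32];
  while (true) {
    int n = ibv_poll_cq(cq, 32, wc);
    for (int i = 0; i < n; ++i) {
      auto it = cbs->find(wc[i].wr_id);
      if (it != cbs->end()) {
        it->second();  // completion callback for this RDMA op
        cbs->erase(it);
      }
    }
  }
}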