
Implement async TensorFromTransportOptions for GDR #24058

Merged
merged 2 commits into tensorflow:master from byronyi:fix-23933 on Dec 13, 2018
Conversation

byronyi (Contributor) commented Nov 30, 2018

Instead of blocking on completion of an RDMA op, the RecvTensor client will now post a work request to the NIC send queue and return immediately.

The GDR background polling thread will handle the callback after the corresponding RDMA op completes, i.e. is polled from the completion queue on the NIC. The old epoll-based mechanism is removed to trade higher CPU usage for improved throughput and lower latency for RDMA ops.

The maximum number of work requests (WRs) in the send/recv queues on the NIC is increased to accommodate the increased number of concurrent RDMA ops. The threshold of tensor size below which we pass the tensor content in metadata is also increased, to reduce the pressure on the send/recv queues on the NIC.
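The sketch below illustrates this post-then-poll pattern with plain libibverbs; it is only an illustration, not the GDR implementation, and the helper names (PostRdmaRead, PollCompletions, RdmaOpContext) are made up for the example.

// Minimal sketch of the async pattern described above, using raw libibverbs.
#include <infiniband/verbs.h>

#include <cstdint>
#include <functional>

struct RdmaOpContext {
  std::function<void()> done;  // callback to run once the RDMA op completes
};

// Post a signaled RDMA READ work request and return immediately.
bool PostRdmaRead(ibv_qp* qp, ibv_mr* local_mr, void* local_addr,
                  uint32_t length, uint64_t remote_addr, uint32_t rkey,
                  RdmaOpContext* ctx) {
  ibv_sge sge = {};
  sge.addr = reinterpret_cast<uint64_t>(local_addr);
  sge.length = length;
  sge.lkey = local_mr->lkey;

  ibv_send_wr wr = {};
  wr.wr_id = reinterpret_cast<uint64_t>(ctx);  // recovered from the completion
  wr.sg_list = &sge;
  wr.num_sge = 1;
  wr.opcode = IBV_WR_RDMA_READ;
  wr.send_flags = IBV_SEND_SIGNALED;  // request a completion queue entry
  wr.wr.rdma.remote_addr = remote_addr;
  wr.wr.rdma.rkey = rkey;

  ibv_send_wr* bad_wr = nullptr;
  return ibv_post_send(qp, &wr, &bad_wr) == 0;  // no blocking on completion
}

// Background polling loop: drain the completion queue and run callbacks.
// Busy-polling replaces the old epoll-based wait (higher CPU, lower latency).
void PollCompletions(ibv_cq* cq, const bool* stop) {
  ibv_wc wc[32];
  while (!*stop) {
    int n = ibv_poll_cq(cq, 32, wc);
    for (int i = 0; i < n; ++i) {
      RdmaOpContext* ctx = reinterpret_cast<RdmaOpContext*>(wc[i].wr_id);
      if (wc[i].status == IBV_WC_SUCCESS) ctx->done();
      delete ctx;
    }
  }
}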

This fixes #23933.

Signed-off-by: Bairen Yi <byronyi@clustar.ai>

byronyi (Contributor, Author) commented Nov 30, 2018

@poxvoculi I was blocked on #23933 for the past week, and I am glad I can fix it now.

I will go back to refactoring the networking plugins in the next couple of days.

@Harshini-Gadige Harshini-Gadige self-assigned this Nov 30, 2018
@Harshini-Gadige Harshini-Gadige added the awaiting review Pull request awaiting review label Nov 30, 2018
  return;
}

ibv_mr* mr = FindMemoryRegion(addr, length);

#if GOOGLE_CUDA
if (device->tensorflow_gpu_device_info() && !on_host) {
  Allocator* alloc = GPUProcessState::singleton()->GetCUDAHostAllocator(0);
Contributor
OK as-is for now. Eventually instead of 0 you should use the numa node of either the GPU you're copying from or the NIC that's going to read the CPU buffer.
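A hypothetical sketch of that suggestion, assuming the device's NUMA locality is reachable from this code path; the locality accessor shown here is illustrative, not necessarily the exact call available in this file:

// Hypothetical: use the NUMA node of the device (or NIC) instead of node 0.
int numa_node = device->attributes().locality().numa_node();  // assumed accessor
if (numa_node < 0) numa_node = 0;  // unknown locality: fall back to node 0
Allocator* alloc =
    GPUProcessState::singleton()->GetCUDAHostAllocator(numa_node);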

Contributor Author
Resolved. I believe we had this discussion a long time ago in #10629. Things certainly changed :P

mutable_transport_options->PackFrom(remote_mr);
uint64_t checksum = 0;
if (VLOG_IS_ON(2)) {
  checksum = GPUUtil::Checksum(device, device_context, tensor);
Contributor
Nit: GPUUtil::Checksum actually copies the tensor to a temporary CPU buffer and then computes the checksum there, since we don't (yet) have a GPU-native checksum function. So you could compute this on host_copy instead.
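For illustration, assuming host_copy is the pinned CPU copy already made above and that GPUUtil::Checksum has a CPU-tensor overload, the suggestion would look roughly like:

if (VLOG_IS_ON(2)) {
  // Compute the checksum on the already-copied host buffer instead of
  // triggering another device-to-host copy inside GPUUtil::Checksum.
  checksum = GPUUtil::Checksum(host_copy);
}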

Contributor Author
Resolved. Thanks for catching this!

poxvoculi previously approved these changes Dec 1, 2018
<< "!=" << remote_mr.checksum();
}

done(Status::OK());
Contributor
FYI: Be aware that many GPUUtil functions, like CopyCPUTensorToGPU, execute their callbacks in the one polling thread of the EventMgr associated with the GPU. It's the responsibility of the programmer to ensure that these callbacks are not blocking or otherwise long-running. If they are, no further events will be processed for that GPU until the callback closure finally terminates, which can delay other Ops. If the callback needs to trigger expensive processing, it must be moved to another thread.
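A hedged sketch of that advice (the surrounding variable names are taken from context, and the exact CopyCPUTensorToGPU signature may differ by version): keep the EventMgr callback itself trivial and hand any expensive work to another thread, e.g. via Env::Default()->SchedClosure:

GPUUtil::CopyCPUTensorToGPU(
    &host_copy, device_context, device, tensor,
    [done](const Status& s) {
      // This lambda runs on the EventMgr polling thread: do not block here.
      Env::Default()->SchedClosure([done, s]() {
        done(s);  // any long-running completion work happens off that thread
      });
    });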

Contributor Author
Thanks for the heads up. I was reading the EventMgr codebase and it was a good read.

@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Dec 1, 2018
poxvoculi previously approved these changes Dec 3, 2018
@Harshini-Gadige Harshini-Gadige added the kokoro:force-run Tests on submitted change label Dec 3, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 3, 2018
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 4, 2018
@Harshini-Gadige Harshini-Gadige added the ready to pull PR ready for merge process label Dec 5, 2018
Instead of blocking on completion of an RDMA op, the RecvTensor client
will now post a work request to the NIC send queue and return immediately.
The GDR background polling thread will handle the callback after the
corresponding RDMA op completes, i.e. is polled from the completion
queue on the NIC. The old epoll-based mechanism is removed to trade
higher CPU usage for improved throughput and lower latency for RDMA ops.

The maximum number of work requests (WRs) in the send/recv queues on
the NIC is increased to accommodate the increased number of concurrent
RDMA ops. The threshold of tensor size below which we pass the tensor
content in metadata is also increased, to reduce the pressure on the
send/recv queues on the NIC.

This fixes #23933.

Signed-off-by: Bairen Yi <byronyi@clustar.ai>
byronyi (Contributor, Author) commented Dec 6, 2018

I've rebased onto the latest master and cleaned up the code.

Note that the checksum part is removed, as it is not guaranteed that the async RDMA client will read the same tensor content over the network as gRPC would. It turns out that sometimes the worker will RDMA-read the weights of the next step from the PS after the checksum has been transmitted through transport_options (a checksum that was computed from stale weights on the RecvTensor server). I am not sure whether this is a bug or not, but the same trick of sharing the backing tensor_buffer can be found in gRPC as well, so I'll just leave it there. The original purpose of checksumming (i.e. debugging) should be achieved via other approaches (e.g. unit testing).

byronyi (Contributor, Author) commented Dec 7, 2018

@Harshini-Gadige Mind kicking off Kokoro testing again?

@Harshini-Gadige Harshini-Gadige added kokoro:force-run Tests on submitted change and removed ready to pull PR ready for merge process labels Dec 7, 2018
@Harshini-Gadige

@Harshini-Gadige Mind kicking off Kokoro testing again?

Done

@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Dec 7, 2018
byronyi (Contributor, Author) commented Dec 11, 2018

@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and have the pull request author add another comment and the bot will run again. If the bot doesn't comment, it means it doesn't think anything has changed.

byronyi (Contributor, Author) commented Dec 11, 2018

@googlebot The CLA should be fine. Let me double-check.

@googlebot

CLAs look good, thanks!

byronyi (Contributor, Author) commented Dec 13, 2018

@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.

@Harshini-Gadige Harshini-Gadige added the ready to pull PR ready for merge process label Dec 13, 2018
@Harshini-Gadige

@Harshini-Gadige Ready to pull? It’d be great if I could make it into the 1.13 source release.

Done

@tensorflow-copybara tensorflow-copybara merged commit 2295e1b into tensorflow:master Dec 13, 2018
tensorflow-copybara pushed a commit that referenced this pull request Dec 13, 2018
PiperOrigin-RevId: 225428257
@byronyi byronyi deleted the fix-23933 branch December 13, 2018 23:03
Labels
cla: yes · ready to pull (PR ready for merge process)

Successfully merging this pull request may close these issues.

38c9b12 stucks gdr in ResNet50_v1.5 official benchmark
8 participants