GPUDirect RDMA Out-of-Band Tensor Transport #11392
Conversation
Can one of the admins verify this patch?
Jenkins, test this please.
Are you depending on a host RDMA library to be installed?
@drpngx Yes, as in the case of CUDA, we assume an OFED package is installed with all the drivers and user space libraries needed for RDMA to function properly. I agree that we need some sort of compile switch, but I haven't figured out the best way to do it. Alternatively, we could add linux-rdma/rdma-core as a third party compile dependency and detect whether the GDR environment is available only at runtime.
I have done some investigation and would like to propose adding the following. Of course, all of these are Linux specific, so we need an extra compatibility layer for non-Linux platforms. I do know there are RDMA APIs available on Windows, but I don't have a Windows box, so this has to be left to future contributors. For functional testing, we could use Software RDMA over Ethernet as a software emulated RDMA layer that runs on any platform. It requires an upstream Linux kernel of version 4.8 or later, so we do need to set up a separate CI environment for that purpose.
@shamoya Any interest in this patch? I would certainly appreciate any help from Mellanox with testing and setting up a CI environment with GDR (and possibly RoCE).
buildifier checks are failing here, so all the other tests are blocked. Could you look into this failure so that the rest of our CI tests can run?
@gunan Yes, I am looking into it. Will ping you after it's resolved.
I am working on introducing the RDMA user space libraries as third party dependencies, but I am not familiar with converting Makefile or CMake rules into bazel BUILD files. A similar case, #5349 (integrating cURL), seems to be a much larger effort than I expected. I would appreciate it if @jart or @gunan could give some suggestions on this.
Jenkins, test this please.
@drpngx Sorry, I was pushing the wrong commit. Could you ask Jenkins to test it again?
Jenkins, test this please.
Seems a rebase is needed.
Thanks @byronyi
@shamoya Glad to have your prompt reply. The last commit turned out to be an unsuccessful attempt to introduce the RDMA dependency. If you would like to try out the patch on a machine with OFED installed, you could revert the last commit and build from source. Update: I've reverted it myself, so it should work by just checking out this branch.
The libraries in third_party are linked statically to the code, right?
My initial attempt was to include only the RDMA headers, which apparently is not runnable as it isn't linked against the actual library. I haven't figured out how to use bazel to build rdma-core; it uses CMake, which is hard to port to bazel in general. I guess we need to consult someone on the TF team about platform and portability issues. Regarding the version of a specific third party library, I think it is hardcoded into the compilation directives. Google seems to compile everything from source for their projects, so I guess this isn't an issue for them.
Jenkins, test this please.
@drpngx I am still working on resolving the build issues... maybe we should wait for review from @poxvoculi first?
OK. I usually prefer to review when everything builds and tests properly.
Can you summarize what your questions are?
The current CI doesn't have a GDR environment, so we are specifically testing the "fallback to gRPC" behaviour. I've already implemented such logic in my code (here and there), but the GDR environment requires RDMA user space libraries, which are currently not linked into the TF core library. Currently I am stuck on adding them as a third party dependency (porting a CMake based project to bazel). Alternatively, I could make it a separate runtime (like grpc+verbs and grpc+mpi), but that would defeat the whole point of being able to fall back to gRPC.
Or I could just add a compiler directive in
@drpngx I've tested my non-RDMA box and everything works just fine. It does require the box to have the two libraries installed. Is there any way to set up a CI environment with them? They are available on all Linux distros I know of. After we confirm everything builds and tests properly, we can arrange a review of my patch.
There is no source release other than getting the source code from GitHub. So once we merge this, it's available for people to use. However, if someone specifically asks to build the 1.3 branch, they will get the source code which builds the 1.3 binary, which would not include this change.
Thanks for the clarification! I actually just realized that :)
@tensorflow-jenkins test this please
@@ -285,12 +290,12 @@ Status GdrMemoryManager::Init() {
  }

#if GOOGLE_CUDA
  VisitableAllocator::Visitor cuda_alloc_visitor =
      std::bind(&GdrMemoryManager::InsertCUDAMemoryRegion, this, _1, _2);
If I understand what you're doing correctly, you're registering the entire GPU memory to have the CU_POINTER_ATTRIBUTE_SYNC_MEMOPS property. This seems to be a technique that NVIDIA provides because it makes concurrent programming easier, but I would avoid it, because it will likely also damage performance considerably.
It sounds like you suspect a race condition on either the RDMA read, or some use of the final value. For background, here's how we generally handle this issue.
TensorFlow tries hard to use as few synchronous CUDA calls as possible, and to introduce as few sync points as possible. The techniques used rely on careful use of streams. The general discipline is that 4 streams are created for each GPU by the StreamGroupFactory. All compute Ops (i.e. CUDA kernels) are launched on the single compute stream, and the other three are used for memcpys: one for H2D, one for D2H, and one for D2D copies. In the normal case where an op relies only on the completion of prior ops in the same stream, it can be launched async without danger of any other temporal dependency. In the case where it relies on prior completion of an op on another stream, we need to introduce a sync dependency, which can be done in more than one way. If we need to wait for an ordinary compute op prior to e.g. a D2H copy, we can just introduce a wait on the current compute stream prior to the copy, as here. This will cause the i/o stream to wait for all ops pending at the time of the call to complete before launching the next op added to itself.
In the case where we need to wait for a memcpy to terminate on one stream before launching a compute op on another, the technique used is here. We don't want the compute stream to stall until the i/o is done, so we wait for the i/o to complete before queuing the compute op. The wait is accomplished by a call to the EventMgr.
In our internal networking which uses GPUDirect, we use the EventMgr to ensure that we don't start a send before the op writing the buffer to be read has completed. In the other direction, our RPC system provides a callback that executes only after the target buffer area has been written, and we use that callback to trigger subsequent ops that want to read that area.
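The callback-after-event pattern described above can be illustrated with a toy in-process sketch. This is plain Python with hypothetical names, not TensorFlow's actual C++ EventMgr, which polls recorded CUDA events rather than thread events:

```python
import queue
import threading

class ToyEventMgr:
    """Toy stand-in for TensorFlow's EventMgr.

    Callbacks are queued together with an 'event' and fired by a
    background poller once the event is signalled, so the caller
    never blocks waiting for an async copy to finish.
    """

    def __init__(self):
        self._pending = queue.Queue()
        threading.Thread(target=self._poll_loop, daemon=True).start()

    def then_execute(self, event, callback):
        # Returns immediately; the consumer "stream" is never stalled.
        self._pending.put((event, callback))

    def _poll_loop(self):
        while True:
            event, callback = self._pending.get()
            event.wait()  # stand-in for polling a recorded CUDA event
            callback()

# Usage: launch the "compute" only once the async "memcpy" completes.
mgr = ToyEventMgr()
copy_done = threading.Event()        # would be recorded on the copy stream
compute_launched = threading.Event()

mgr.then_execute(copy_done, compute_launched.set)
copy_done.set()                      # the memcpy signals completion
assert compute_launched.wait(timeout=5)
```

The key property mirrored here is that `then_execute` never blocks the caller; the dependency is enforced by the poller, just as the EventMgr keeps the compute stream from stalling on i/o.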
I noticed the stream executor, and I think it is a good design for both pipelining and parallelism. It is just that I have not figured out a viable path to accommodate my current design to it, and I need to make a temporary fix for the race condition issue.
Let me take a closer look and see if I can do better. Thanks for the comment! It is very helpful.
@tensorflow-jenkins test this please
Can one of the admins verify this patch?
I thought I would make some further improvements, but @rmlarsen seems to be eager to merge this. Alright then, I will leave them for the future 😆 Good job guys, and thank you all for the feedback in this thread. Hope we get some users soon so we can hear back from them.
I find that the first GPU's forward propagation is much slower than on the other GPUs on the same machine when training resnet101 using
@byronyi I keep getting an error. No matter what port I've tried, I get that error. The hsw215 is a compute node. This is my cluster spec:
I compiled TF 1.4 with GDR, and I have nv_peer_mem running:
My code runs fine with
RDMA runs well, but I can't get GDR working. Maybe I'm doing something stupid, but it's not clear from the README what else needs to be done. Thanks.
@avolkov1 The GDR runtime binds to whatever IP you assigned, even if it is not an IB device. Please check the IP address first. Sometimes the same host name resolves to different IPs on different machines.
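A quick way to see which address a host name actually resolves to on a given machine is Python's standard socket module (a generic sketch; substitute your own node name such as hsw215 for localhost):

```python
import socket

def resolve_ipv4(hostname):
    """Return the sorted set of IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

# The GDR runtime binds to whichever address the cluster spec names,
# so this should be the IP of the RDMA capable interface.
print(resolve_ipv4("localhost"))
```

Run it on each node: if the same name resolves to different interfaces on different machines, the GDR endpoint may end up bound to a non-IB device.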
@byronyi Thank you. Made some progress. I'm getting an error, but I do get the RDMA GDR endpoints working:
I'm running this on slurm. The first number above is the task id. Here's my cluster setup:
The parameter servers run on CPU and each worker has 4 GPUs. Devices list:
ERRORS:
@avolkov1 I suspect it is because of the limit on the page-locked memory size. Check it with ulimit. EDIT: you might need your sysadmin to modify the ulimit. Usually it should be set to unlimited.
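One way to check the locked-memory limit programmatically (a Unix-only sketch using Python's standard resource module, equivalent to inspecting ulimit -l):

```python
import resource

def memlock_limits():
    """Return (soft, hard) RLIMIT_MEMLOCK values in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)

soft, hard = memlock_limits()
for name, value in (("soft", soft), ("hard", hard)):
    shown = "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"
    print(f"{name} memlock limit: {shown}")
```

RDMA memory registration pins pages, so a small RLIMIT_MEMLOCK will make registration of large buffers fail; raising it usually requires an entry in /etc/security/limits.conf set by the sysadmin.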
@byronyi Thanks. I looked at it. It seems to be unlimited.
@avolkov1 I failed to reproduce your issue on a fresh installation of vendor provided OFED on my testbed. Would you mind pasting the full log?
@byronyi Will do. Thank you very much. Give me a day or two (busy with work related stuff). I'll post the error log and the code I use to run.
Introduction
This PR implements GDR out-of-band transport for the TensorFlow distributed runtime, complementary to the current gRPC transport. It uses gRPC as the control plane to set up a rendezvous for each tensor transmission, and utilizes GPUDirect RDMA whenever possible to transmit tensors in remote GPU memory through the network interface card (NIC), bypassing host memory and CPU entirely. It gracefully falls back to ordinary RDMA or even gRPC when GDR is not available.
Design
The GDR out-of-band transport is designed to avoid any unnecessary memory copies, especially for large tensors (>100MB). That typically requires registering tensor buffers with the NIC on the fly, which is rather slow, as described in the design trade-off of the verbs runtime. The verbs runtime thus chooses to manage its own NIC-registered buffers and copy tensors from/to those buffers for every single tensor transfer.
We show, however, that such a design trade-off is not always necessary. In this patch, we manage both computation and communication buffers in a unified manner. By pre-registering large buffers with the NIC and allocating small tensors from the buffer pool using a BFC allocator, it is possible to avoid both on-the-fly buffer registration and memory copies altogether.
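The register-once, sub-allocate-many idea can be sketched with a toy first-fit pool. This is a drastic simplification of the actual BFC allocator, and the NIC registration step (where something like ibv_reg_mr would pin the region) is only indicated by a comment:

```python
class RegisteredPool:
    """Toy sub-allocator over a single region registered with the NIC once."""

    def __init__(self, size):
        # Real code would register the whole region with the NIC exactly
        # once here (pinning it and obtaining RDMA keys); every later
        # alloc() is then plain bookkeeping, with no per-tensor
        # registration on the critical path.
        self.free_list = [(0, size)]  # (offset, length) holes

    def alloc(self, nbytes):
        """First-fit allocation; returns an offset into the region."""
        for i, (off, length) in enumerate(self.free_list):
            if length >= nbytes:
                if length == nbytes:
                    del self.free_list[i]
                else:
                    self.free_list[i] = (off + nbytes, length - nbytes)
                return off
        raise MemoryError("pool exhausted; a real impl would grow or copy")

    def release(self, off, nbytes):
        self.free_list.append((off, nbytes))  # toy: no coalescing

pool = RegisteredPool(1 << 20)  # one large pre-registered region
a = pool.alloc(4096)            # offset 0
b = pool.alloc(65536)           # offset 4096
print(a, b)                     # -> 0 4096
```

Unlike this toy, the real BFC allocator coalesces freed chunks and bins them by size, but the cost structure is the same: registration once, allocation cheap.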
For the actual tensor transport, we rely on gRPC to transmit the remote buffer information. This greatly simplifies our design, and there are only 2 types of RDMA messages: a single READ to retrieve the tensor data (bypassing the remote CPU), and an invalidate using WRITE with IMM to release the tensor buffer on the remote side. The remote side only polls for the invalidate message and Unrefs the tensor buffers that have been read by its peer.
Environment
To fully utilize GDR, the target environment has to meet 3 conditions:
1. An RDMA capable device is present and OFED is installed, which can be verified with ibv_devinfo.
2. The GPU is adjacent to an RDMA capable NIC in the PCIe topology, which can be checked with e.g. nvidia-smi topo -m. For example, in the following topology, GPU2 and GPU3 are adjacent to mlx4_0, and tensors on these devices could benefit from GDR in the current implementation.
3. The nv_peer_mem kernel module is installed.
How to build and run in GDR mode
To test it out on a GDR capable environment, choose to enable GDR in your configure script.
Change your protocol to grpc+gdr to enable GDR in your deployment. Currently the out-of-band transport service listens to the same IP and port address as specified in gRPC.
A successful initialization looks like this:
The last line suggests that the GPUs with bus id 2 (mapped to pci bus ids prefixed 0000:8) will benefit from GDR and host memory bypass, which is /gpu:2 and /gpu:3 in this case.
Caveats
In the current implementation, only tensors that reside in host memory, or in GPU memory where the GPU is adjacent to an RDMA capable NIC, will use direct RDMA as their transport. When RDMA is available but GDR is not, a temporary tensor copy in host memory is used as the RDMA source/destination (and copied from/to the target device). When no RDMA device is present, it can even fall back to the original gRPC runtime. While it is theoretically possible to mix GDR-enabled TF with non-GDR deployments in the same job, make sure the environment is properly set up so that GDR mode is enabled whenever possible (i.e. it does not fall back to gRPC unless absolutely necessary).