GPU-resident queue for prefetching over PCIe #4526
Comments
@zheng-xq can tf.assign copy data to the GPU at the same time as the GPU runs computations? BTW, transfers between the TensorFlow and Python runtimes can be pretty slow, so if you are getting data from feed_dict/numpy as in the SO question, that's likely to be the bottleneck and a GPU-resident queue won't help you. Recently I found a case where the data transfer rate was limited to 65 MB/second at the TF/Python boundary.
tf.assign itself doesn't do the copy. The send/recv pair inserted after graph partitioning does, and it runs in parallel with the computation. The data transfer between TF and Python through feed_dict is often hidden behind input queues; a GPU-resident queue could save the additional CPU-to-GPU overhead once the data is already in TF. We have discussed the possibility of a GPU-resident, or GPU-cached, queue in this context over the past few days. However, a naive implementation might be problematic, since it could consume too much GPU memory, which is often far scarcer than its CPU counterpart. If we go with a GPU-cached queue, we have to be very careful: there are a number of tricky issues to iron out so that it doesn't interfere with the rest of the system.
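For illustration, here is a minimal sketch (not from this thread) of the "hide the Python-to-TF transfer behind a CPU-side input queue" pattern mentioned above, assuming TF 1.x graph-mode APIs; the shapes, random data, and feeder thread are placeholders:

```python
import threading
import numpy as np
import tensorflow as tf

batch_ph = tf.placeholder(tf.float32, [256, 1024])
queue = tf.FIFOQueue(capacity=8, dtypes=[tf.float32], shapes=[[256, 1024]])
enqueue_op = queue.enqueue([batch_ph])
next_batch = queue.dequeue()        # the model reads from here, not from feed_dict
loss = tf.reduce_sum(next_batch)    # stand-in for the real model

sess = tf.Session()

def feeder():
    # Background thread keeps the queue filled so training steps rarely block
    # on the Python/TF boundary.
    while True:
        data = np.random.rand(256, 1024).astype(np.float32)
        sess.run(enqueue_op, feed_dict={batch_ph: data})

threading.Thread(target=feeder, daemon=True).start()

for _ in range(10):
    sess.run(loss)   # dequeues a batch that the feeder thread staged earlier
```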
Will close this for now. Feel free to reopen this with more updates. Thanks.
A GPU queue would only need to double-buffer the input; that doesn't seem like a big deal with respect to memory use(?). Is a GPU queue any more complicated than a CPU queue, besides needing a different allocator?
Modern models are very aggressive about queue capacity: to hide latency, they tend to use very large queues.
I think the issues with H2D transfers are quite real. Here's an example from AlexNet, where the TF queues feed the GPUs with data for every mini-batch; the lack of overlap is demonstrated for 2 GPUs (see screenshots). Because AlexNet has a low compute-to-I/O ratio, the problem is more clearly visible than in, say, Inception v3 or ResNet, so even though AlexNet isn't interesting for academic or business applications anymore, it's still interesting for infrastructural/engineering reasons.

Note how for 2 GPUs, H2D transfers cause the GPUs to wait for more than 2 seconds at each iteration, so GPU compute efficiency drops to about 26%. This is on two GTX 1080s with a batch size of 1024 per GPU; with smaller batch sizes the problem is more pronounced. For a single GPU, compute utilization is significantly better (80%), but the kernel pipeline still blocks on the H2D transfer, even though the transfer takes place in a separate CUDA stream.

I'd like to support @benbarsdell's point here. This doesn't look like an optimized prefetch: the compute/memcpy overlap for H2D is rather hard to see.
Here's what I meant when I said that a GPU-resident queue could be done using existing ops:
Essentially you would have a rotating buffer of examples on the GPU, and load new data into it asynchronously.
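A minimal sketch of what such a rotating buffer might look like with existing ops, assuming TF 1.x; the two-slot capacity, shapes, and stand-in model are illustrative, not part of the original suggestion:

```python
import numpy as np
import tensorflow as tf

BATCH, DIM = 256, 1024

with tf.device('/gpu:0'):
    # Two GPU-resident slots that alternate between being filled and being read.
    slots = [tf.Variable(tf.zeros([BATCH, DIM]), trainable=False) for _ in range(2)]

cpu_batch = tf.placeholder(tf.float32, [BATCH, DIM])

# Assigning to a GPU-pinned variable makes the graph partitioner insert the
# send/recv pair that performs the actual H2D copy.
fill_ops = [tf.assign(slot, cpu_batch) for slot in slots]
losses = [tf.reduce_sum(slot) for slot in slots]   # stand-in for the real model

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    data = np.random.rand(BATCH, DIM).astype(np.float32)
    sess.run(fill_ops[0], feed_dict={cpu_batch: data})   # prime slot 0
    for step in range(10):
        use, fill = step % 2, (step + 1) % 2
        # Compute on the slot filled last step while the other slot is refilled.
        sess.run([losses[use], fill_ops[fill]], feed_dict={cpu_batch: data})
```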
@yaroslavvb Yes, basically a backpressure pattern (common tech doing this kind of thing these days includes Akka, RxJava, and Flink, and, for a more dated reference, even flow control in TCP). I think, though, that it would be good for this to be baked into the framework rather than shifting flow-control workarounds onto the user. Data prefetch is pretty much a universal need, since the input has to come from somewhere, hopefully efficiently enough to keep the available compute cycles busy.
We had some discussions about this. It seems a good idea to have a separate GPU queue in this case. We still need a larger CPU queue to hide the latency between Python and TF, but this introduces a separate transfer stage from the CPU to each GPU. We expect the GPU queue on each device to be much smaller, with a capacity of one or two. The downside is that this introduces more client-side threads to drive the new data transfer, but there is a separate effort to migrate those into TF itself, so we will ignore that problem for now.
I also agree with @mkolod and @benbarsdell that the H2D (and D2H) transfers can be problematic. This is related to #2848, where there is further discussion of TensorFlow's GPU transfer scheduling. Part of the problem is that the Send/Recv ops are scheduled immediately and not necessarily in the order of their dependent operations. One suggested solution is to use […]. Thinking on this some more, would multiple […]
Also, the following quote from the documentation suggests one can pin queues to a GPU?
EDIT: It's not possible to pin queues to GPUs yet.
Now that the stage/unstage operators have been added to TF, they do indeed seem to require an extra client-side thread. Can you share the plan you mentioned (or whether there is already a solution)?
The plan is not to have separate Python threads driving them, but to embed the parallelism in the graph. We will publish some scripts demonstrating how to do this in the near future. Stay tuned.
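For reference, here is my own minimal sketch of embedding the parallelism in the graph (not the scripts mentioned above), assuming TF 1.x and the tf.contrib.staging.StagingArea wrapper around the stage/unstage ops; shapes and the stand-in model are illustrative:

```python
import tensorflow as tf

# Stand-ins for a CPU-side input pipeline (e.g. a dequeue from a CPU queue).
images = tf.random_uniform([256, 224, 224, 3])
labels = tf.random_uniform([256], maxval=1000, dtype=tf.int32)

with tf.device('/gpu:0'):
    area = tf.contrib.staging.StagingArea(
        dtypes=[tf.float32, tf.int32],
        shapes=[[256, 224, 224, 3], [256]])
    stage_op = area.put([images, labels])     # triggers the H2D copy into the area
    gpu_images, gpu_labels = area.get()       # returns the batch staged earlier
    loss = tf.reduce_sum(gpu_images)          # stand-in for the real model / train op

with tf.Session() as sess:
    sess.run(stage_op)                        # prime the staging area with one batch
    for _ in range(10):
        # Staging batch i+1 and computing on batch i are issued in the same
        # session.run call, so no extra Python thread drives the transfer.
        sess.run([loss, stage_op])
```

Because the put for the next batch and the compute for the current batch run in the same step, the H2D copy can overlap with the GPU kernels without a dedicated client-side thread.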
Still waiting! |
@Neltherion |
It would improve performance in some cases to be able to asynchronously prefetch data over the PCIe bus while GPU computation is taking place. A GPU-resident queue seems like the natural way to achieve this.
In the SO thread below, @yaroslavvb mentions using Variables pinned to the GPU to achieve the same effect, but I was unable to find a way to get this to work.
Related threads:
https://stackoverflow.com/questions/38751736/understanding-tensorflow-queues-and-cpu-gpu-transfer
#3009 (comment)
#3377 (comment)