
GPU-resident queue for prefetching over PCIe #4526

Closed
benbarsdell opened this issue Sep 22, 2016 · 16 comments

Comments

@benbarsdell
Contributor

It would improve performance in some cases to be able to asynchronously prefetch data over the PCIe bus while GPU computation is taking place. A GPU-resident queue seems like the natural way to achieve this.

In the SO thread below, @yaroslavvb mentions using Variables pinned to the GPU to achieve the same effect, but I was unable to find a way to get this to work.

Related threads:

https://stackoverflow.com/questions/38751736/understanding-tensorflow-queues-and-cpu-gpu-transfer
#3009 (comment)
#3377 (comment)

@yaroslavvb
Contributor

yaroslavvb commented Sep 22, 2016

@zheng-xq can tf.assign copy data to GPU at the same time as the GPU runs computations?

BTW, transfers between the TensorFlow and Python runtimes can be pretty slow, so if you are getting data from feed_dict/numpy as in the SO question, that's likely to be the bottleneck, and a GPU-resident queue won't help you. Recently I found a case where the data transfer rate was limited to 65 MB/second at the TF/Python boundary.

@zheng-xq
Contributor

tf.assign itself doesn't do the copy. The Send/Recv pair inserted during graph partitioning does that, and it runs in parallel with the computation.

The data transfer between TF and Python through feed_dict is often hidden behind input queues. A GPU-resident queue could save the additional CPU-to-GPU transfer overhead once the data is already in TF.

We had discussed the possibility of a GPU-resident queue, or a GPU-cached queue, in this context over the past few days. However, a naive implementation might be problematic, since it could consume too much GPU memory, which is often much scarcer than CPU memory. If we go with a GPU-cached queue, we have to be very careful: there are a number of tricky issues to iron out so that it doesn't interfere with the rest of the system.

@jmchen-g
Contributor

Will close this for now. Feel free to reopen this with more updates. Thanks.

@benbarsdell
Contributor Author

A GPU queue would only need to double-buffer the input; this doesn't seem like a big deal with respect to memory use(?).

Is a GPU queue any more complicated than a CPU queue, apart from needing a different allocator?

@zheng-xq
Contributor

Modern models are very aggressive about queue capacity: to hide latency, they tend to use very large queues.

@mkolod
Contributor

mkolod commented Sep 22, 2016

I think the issues with H2D transfers are quite real. Here's an example from AlexNet where the TF queues are feeding the GPUs with data for every mini-batch. The lack of overlap is shown in the screenshots below for 2 GPUs. Because AlexNet has a low compute-to-I/O ratio, the problem is more clearly visible than in, say, Inception v3 or ResNet, so even though AlexNet is no longer interesting for academic or business applications, it is still interesting for infrastructural/engineering reasons.

Note how for 2 GPUs, H2D transfers cause the GPU to wait for more than 2 seconds at each iteration, so GPU compute efficiency drops to about 26%. This is on two GTX 1080s with a batch size of 1024 per GPU; with smaller batch sizes the problem is more pronounced. For a single GPU, compute utilization is significantly better (80%), but the kernel pipeline still blocks on the H2D transfer, even though the transfer takes place in a separate CUDA stream.

I'd like to support @benbarsdell's point here. This doesn't look like an optimized prefetch: the compute/memcpy overlap for H2D is rather hard to see.

[Screenshot: alexnet_2_gpus (profiler timeline, 2 GPUs)]

@mkolod
Contributor

mkolod commented Sep 22, 2016

Here's the single-GPU example (I previously posted a screenshot for 2 GPUs).

[Screenshot: alexnet_1_gpu (profiler timeline, 1 GPU)]

@yaroslavvb
Contributor

yaroslavvb commented Sep 22, 2016

Here's what I meant when I said that a GPU-resident queue could be built using existing ops:

  • Keep a buffer of examples on the GPU in several Variables
  • Enqueue only example indices into a regular FIFO queue, while the actual example data lives on the GPU
  • A global counter keeps track of the number of examples that have been consumed
  • A separate thread periodically checks the counter and replaces stale variables with new example data using tf.assign

Essentially you would have a rotating buffer of examples on the GPU and load that data into it asynchronously.
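A minimal sketch of this idea, assuming TF 1.x graph mode. It deviates slightly from the description above: instead of a global consumption counter it uses a second index queue to track free slots, and the slot count, shapes, and random data source are hypothetical stand-ins.

```python
import threading

import numpy as np
import tensorflow as tf

NUM_SLOTS = 4                      # size of the GPU-resident ring buffer
EXAMPLE_SHAPE = [256, 256, 3]      # hypothetical example shape

# Rotating buffer of examples pinned to the GPU.
with tf.device('/gpu:0'):
    slots = [tf.Variable(tf.zeros(EXAMPLE_SHAPE), trainable=False)
             for _ in range(NUM_SLOTS)]

# The queues carry only small int32 slot indices; the example payloads
# themselves never pass through them.
ready_q = tf.FIFOQueue(NUM_SLOTS, dtypes=[tf.int32], shapes=[[]])
free_q = tf.FIFOQueue(NUM_SLOTS, dtypes=[tf.int32], shapes=[[]])
init_free = free_q.enqueue_many([list(range(NUM_SLOTS))])

# Producer side: tf.assign triggers the Send/Recv copy onto the GPU.
feed_ph = tf.placeholder(tf.float32, EXAMPLE_SHAPE)
slot_ph = tf.placeholder(tf.int32, [])
refill_ops = [tf.assign(v, feed_ph) for v in slots]
take_free = free_q.dequeue()
mark_ready = ready_q.enqueue([slot_ph])

# Consumer side: pick a ready slot, compute on it, then recycle its index.
next_index = ready_q.dequeue()
with tf.device('/gpu:0'):
    # Sketch only: stacking copies the whole buffer each step; a real
    # implementation would select just the one slot.
    next_example = tf.gather(tf.stack(slots), next_index)
    loss = tf.reduce_sum(next_example)          # stand-in for the model
with tf.control_dependencies([loss]):
    recycle = free_q.enqueue([next_index])
train_step = tf.group(loss, recycle)


def feeder(sess, load_example):
    """Client thread: refill consumed slots with fresh example data."""
    while True:
        slot = sess.run(take_free)
        sess.run(refill_ops[slot], feed_dict={feed_ph: load_example()})
        sess.run(mark_ready, feed_dict={slot_ph: slot})


config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(init_free)
    t = threading.Thread(target=feeder, args=(
        sess, lambda: np.random.rand(*EXAMPLE_SHAPE).astype(np.float32)))
    t.daemon = True
    t.start()
    for _ in range(10):
        sess.run(train_step)       # consumes GPU-resident examples
```

The point is that the index queues and the client thread only move tiny integers and issue tf.assign calls; the example payloads stay resident in the GPU variables.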

@mkolod
Contributor

mkolod commented Sep 22, 2016

@yaroslavvb Yes, basically a backpressure pattern (common tech doing this sort of thing these days includes Akka, RxJava, and Flink, and, for a more dated reference, even TCP flow control). I think, though, that it would be better for this to be baked into the framework rather than shifting flow-control workarounds onto the user. Data prefetch is a pretty much universal need, since the input has to come from somewhere, ideally in a way that makes efficient use of the available compute cycles.

@zheng-xq
Contributor

We had some discussions about this. It seems like a good idea to have a separate GPU queue in this case. We still need a larger CPU queue to hide latency between Python and TF, but this introduces a separate transfer stage from the CPU to each GPU. We expect the GPU queue on each device to be much smaller: one or two in queue capacity.

The downside is that this introduces more client-side threads to drive the new data transfers, but there is a separate effort to migrate those into TF itself, so we will ignore that problem for now.
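A rough sketch of that two-level layout, assuming TF 1.x. Since queues can't be pinned to GPUs (noted later in this thread), the later-added StagingArea stands in for the small per-device GPU queue; the shapes, capacities, and the client-side enqueue thread are hypothetical.

```python
import tensorflow as tf
from tensorflow.python.ops.data_flow_ops import StagingArea

EXAMPLE_SHAPE = [224, 224, 3]      # hypothetical example shape
NUM_GPUS = 2

# Larger CPU-side queue that hides the Python -> TF latency.
with tf.device('/cpu:0'):
    example_ph = tf.placeholder(tf.float32, EXAMPLE_SHAPE)
    cpu_queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32],
                             shapes=[EXAMPLE_SHAPE])
    enqueue_cpu = cpu_queue.enqueue([example_ph])   # driven by client threads

stage_ops = []     # run these to keep the CPU -> GPU copies in flight
gpu_examples = []  # per-GPU, already-resident data for the model to consume
for i in range(NUM_GPUS):
    with tf.device('/cpu:0'):
        cpu_example = cpu_queue.dequeue()           # dequeue stays on the CPU
    with tf.device('/gpu:%d' % i):
        # Small per-GPU buffer: only one or two examples resident at a time.
        area = StagingArea(dtypes=[tf.float32], shapes=[EXAMPLE_SHAPE])
        stage_ops.append(area.put([cpu_example]))   # H2D copy into the buffer
        gpu_examples.append(area.get())             # previously staged example

# Each step, stage_ops are run alongside the per-GPU compute (or from
# dedicated client threads), so the next example's transfer overlaps the
# current example's computation.
```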

@sjperkins
Contributor

sjperkins commented Oct 7, 2016

I also agree with @mkolod and @benbarsdell that the H2D (and D2H) transfers can be problematic. This is related to #2848, where there is further discussion of TensorFlow's GPU transfer scheduling. Part of the problem is that the Send/Recv ops are scheduled immediately, and not necessarily in the order of their dependent operations. One suggested solution is to use tf.identity to order transfers to the GPU; a small sketch follows below.

Thinking on this some more, would multiple towers (in the cifar10 example terminology) per GPU be a solution? Is the scheduler intelligent enough to interleave the GPU transfers and compute for both towers in parallel? I'm trying to come up with something
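A minimal sketch of the tf.identity suggestion above, assuming TF 1.x graph mode; the queue, shape, and stand-in loss are placeholders, and whether this fully resolves the scheduling issue is part of the #2848 discussion.

```python
import tensorflow as tf

# CPU-side input queue (queues can't currently live on the GPU).
queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[256, 256, 3]])
cpu_batch = queue.dequeue()

with tf.device('/gpu:0'):
    # Wrapping the dequeued tensor in an identity op placed on the GPU gives
    # the H2D copy an explicit, schedulable node in the graph, so it can be
    # ordered with control dependencies instead of relying on the implicit
    # Send/Recv timing.
    gpu_batch = tf.identity(cpu_batch)
    loss = tf.reduce_sum(gpu_batch)     # stand-in for the model
```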

@sjperkins
Contributor

sjperkins commented Oct 7, 2016

Also, the following quote from the documentation suggests one can pin queues to a GPU?

N.B. Queue methods (such as q.enqueue(...)) must run on the same device as the queue. Incompatible device placement directives will be ignored when creating these operations.

EDIT: It's not possible to pin queues to GPUs yet.

@ppwwyyxx
Contributor

ppwwyyxx commented Feb 13, 2017

@zheng-xq

The downside is that this introduces more client-side threads to drive the new data transfers, but there is a separate effort to migrate those into TF itself, so we will ignore that problem for now.

Now that the stage/unstage operators have been added to TF, they do seem to require an extra client-side thread. Can you share the plan you mentioned (or a solution, if one already exists)?

@zheng-xq
Contributor

The plan is not to have separate Python threads driving them, but to embed the parallelism in the graph. We will publish some scripts demonstrating how to do this in the near future. Stay tuned.

@Neltherion

Still waiting!

@ppwwyyxx
Contributor

@Neltherion from tensorflow.python.ops.data_flow_ops import StagingArea has been there for a year.
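For reference, a minimal double-buffering sketch with StagingArea, assuming TF 1.x graph mode. Running the put op together with the train step in the same session.run() keeps the next batch's H2D copy in flight while the current batch computes, with no extra client-side thread; the batch shapes and the random make_batch() source are hypothetical.

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.data_flow_ops import StagingArea

IMAGE_SHAPE = [128, 224, 224, 3]   # hypothetical batch of images
LABEL_SHAPE = [128]

# Stand-ins for whatever the CPU-side input pipeline produces.
image_ph = tf.placeholder(tf.float32, IMAGE_SHAPE)
label_ph = tf.placeholder(tf.int32, LABEL_SHAPE)

with tf.device('/gpu:0'):
    area = StagingArea(dtypes=[tf.float32, tf.int32],
                       shapes=[IMAGE_SHAPE, LABEL_SHAPE])
    stage = area.put([image_ph, label_ph])     # copies the next batch to the GPU
    images, labels = area.get()                # the previously staged batch
    train_step = tf.reduce_sum(images)         # stand-in for the real train op


def make_batch():
    # Hypothetical random data source.
    return {image_ph: np.random.rand(*IMAGE_SHAPE).astype(np.float32),
            label_ph: np.random.randint(0, 10, LABEL_SHAPE).astype(np.int32)}


config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(stage, feed_dict=make_batch())    # warm-up: get one batch ahead
    for _ in range(10):
        # One call both consumes the staged batch and stages the next one,
        # so the H2D copy overlaps the computation.
        sess.run([train_step, stage], feed_dict=make_batch())
```

This appears to be the kind of in-graph parallelism described above: the only remaining Python-side work is producing the next batch.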
