
GPU-resident queue for prefetching over PCIe #4526

Closed
benbarsdell opened this issue Sep 22, 2016 · 16 comments

Comments

@benbarsdell
Contributor

It would improve performance in some cases to be able to asynchronously prefetch data over the PCIe bus while GPU computation is taking place. A GPU-resident queue seems like the natural way to achieve this.

In the SO thread below, @yaroslavvb mentions using Variables pinned to the GPU to achieve the same effect, but I was unable to find a way to get this to work.

Related threads:

https://stackoverflow.com/questions/38751736/understanding-tensorflow-queues-and-cpu-gpu-transfer
#3009 (comment)
#3377 (comment)

@yaroslavvb
Contributor

yaroslavvb commented Sep 22, 2016

@zheng-xq can tf.assign copy data to GPU at the same time as the GPU runs computations?

BTW, transfers between the TensorFlow and Python runtimes can be pretty slow, so if you are getting data from feed_dict/numpy as in the SO question, that's likely to be the bottleneck, and a GPU-resident queue won't help you. Recently I found a case where the data transfer rate was limited to 65 MB/second at the TF/Python boundary.

@zheng-xq
Contributor

tf.assign itself doesn't do the copy. The Send/Recv pair inserted during graph partitioning does that, and it runs in parallel with the computation.

The data transfer between TF and Python through feed_dict is often hidden behind input queues. A GPU-resident queue could save the additional CPU-to-GPU transfer overhead once the data is already in TF.

We had discussed the possibility of a GPU-resident queue, or a GPU-cached queue, in this context over the past few days. However, a naive implementation might be problematic, since it could consume too much GPU memory, which is often much scarcer than CPU memory. If we go with a GPU-cached queue, we have to be very careful: there are a number of tricky issues to iron out so that it doesn't interfere with the rest of the system.

@jmchen-g
Contributor

Will close this for now. Feel free to reopen this with more updates. Thanks.

@benbarsdell
Contributor Author

A GPU queue would only need to double-buffer the input; this doesn't seem like a big deal with respect to memory use(?).

Is a GPU queue any more complicated than a CPU queue, apart from needing a different allocator?

@zheng-xq
Contributor

Modern models are very aggressive about queue capacity: to hide latency, they tend to use very large queues.

@mkolod
Contributor

mkolod commented Sep 22, 2016

I think the issues with H2D transfers are quite real. Here's an example from AlexNet where the TF queues are feeding the GPUs with data for every mini-batch. The lack of overlap is shown in the screenshots below for 2 GPUs. Because AlexNet has a low compute-to-I/O ratio, the problem is more clearly visible than in, say, Inception v3 or ResNet, so even though AlexNet is no longer interesting for academic or business applications, it is still interesting for infrastructural/engineering reasons.

Note how for 2 GPUs, H2D transfers cause the GPU to wait for more than 2 seconds at each iteration, so GPU compute efficiency drops to about 26%. This is on two GTX 1080s with a batch size of 1024 per GPU; with smaller batch sizes the problem is more pronounced. For a single GPU, compute utilization is significantly better (80%), but the kernel pipeline still blocks on the H2D transfer, even though the transfer takes place in a separate CUDA stream.

I'd like to support @benbarsdell's point here. This doesn't look like an optimized prefetch: the compute/memcpy overlap for H2D is rather hard to see.

[Screenshot: alexnet_2_gpus (profiler timeline, 2 GPUs)]

@mkolod
Contributor

mkolod commented Sep 22, 2016

Here's the single-GPU example (I previously posted a screenshot for 2 GPUs).

[Screenshot: alexnet_1_gpu (profiler timeline, 1 GPU)]

@yaroslavvb
Contributor

yaroslavvb commented Sep 22, 2016

Here's what I meant when I said that a GPU-resident queue could be built using existing ops:

  • Keep a buffer of examples on the GPU in several Variables
  • Enqueue only example indices into a regular FIFO queue, while the actual example data lives on the GPU
  • A global counter keeps track of the number of examples that have been consumed
  • A separate thread periodically checks the counter and replaces stale variables with new example data using tf.assign

Essentially you would have a rotating buffer of examples on the GPU and load that data into it asynchronously.
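A minimal sketch of this idea, assuming TF 1.x graph mode. It deviates slightly from the description above: instead of a global consumption counter it uses a second index queue to track free slots, and the slot count, shapes, and random data source are hypothetical stand-ins.

```python
import threading

import numpy as np
import tensorflow as tf

NUM_SLOTS = 4                      # size of the GPU-resident ring buffer
EXAMPLE_SHAPE = [256, 256, 3]      # hypothetical example shape

# Rotating buffer of examples pinned to the GPU.
with tf.device('/gpu:0'):
    slots = [tf.Variable(tf.zeros(EXAMPLE_SHAPE), trainable=False)
             for _ in range(NUM_SLOTS)]

# The queues carry only small int32 slot indices; the example payloads
# themselves never pass through them.
ready_q = tf.FIFOQueue(NUM_SLOTS, dtypes=[tf.int32], shapes=[[]])
free_q = tf.FIFOQueue(NUM_SLOTS, dtypes=[tf.int32], shapes=[[]])
init_free = free_q.enqueue_many([list(range(NUM_SLOTS))])

# Producer side: tf.assign triggers the Send/Recv copy onto the GPU.
feed_ph = tf.placeholder(tf.float32, EXAMPLE_SHAPE)
slot_ph = tf.placeholder(tf.int32, [])
refill_ops = [tf.assign(v, feed_ph) for v in slots]
take_free = free_q.dequeue()
mark_ready = ready_q.enqueue([slot_ph])

# Consumer side: pick a ready slot, compute on it, then recycle its index.
next_index = ready_q.dequeue()
with tf.device('/gpu:0'):
    # Sketch only: stacking copies the whole buffer each step; a real
    # implementation would select just the one slot.
    next_example = tf.gather(tf.stack(slots), next_index)
    loss = tf.reduce_sum(next_example)          # stand-in for the model
with tf.control_dependencies([loss]):
    recycle = free_q.enqueue([next_index])
train_step = tf.group(loss, recycle)


def feeder(sess, load_example):
    """Client thread: refill consumed slots with fresh example data."""
    while True:
        slot = sess.run(take_free)
        sess.run(refill_ops[slot], feed_dict={feed_ph: load_example()})
        sess.run(mark_ready, feed_dict={slot_ph: slot})


config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(init_free)
    t = threading.Thread(target=feeder, args=(
        sess, lambda: np.random.rand(*EXAMPLE_SHAPE).astype(np.float32)))
    t.daemon = True
    t.start()
    for _ in range(10):
        sess.run(train_step)       # consumes GPU-resident examples
```

The point is that the index queues and the client thread only move tiny integers and issue tf.assign calls; the example payloads stay resident in the GPU variables.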

@mkolod
Contributor

mkolod commented Sep 22, 2016

@yaroslavvb Yes, basically a backpressure pattern (common tech doing this sort of thing these days includes Akka, RxJava, and Flink, and, for a more dated reference, even TCP flow control). I think, though, that it would be better for this to be baked into the framework rather than shifting flow-control workarounds onto the user. Data prefetch is a pretty much universal need, since the input has to come from somewhere, ideally in a way that makes efficient use of the available compute cycles.

@zheng-xq
Contributor

We had some discussions about this. It seems like a good idea to have a separate GPU queue in this case. We still need a larger CPU queue to hide latency between Python and TF, but this introduces a separate transfer stage from the CPU to each GPU. We expect the GPU queue on each device to be much smaller: one or two in queue capacity.

The downside is that this introduces more client-side threads to drive the new data transfers, but there is a separate effort to migrate those into TF itself, so we will ignore that problem for now.
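A rough sketch of that two-level layout, assuming TF 1.x. Since queues can't be pinned to GPUs (noted later in this thread), the later-added StagingArea stands in for the small per-device GPU queue; the shapes, capacities, and the client-side enqueue thread are hypothetical.

```python
import tensorflow as tf
from tensorflow.python.ops.data_flow_ops import StagingArea

EXAMPLE_SHAPE = [224, 224, 3]      # hypothetical example shape
NUM_GPUS = 2

# Larger CPU-side queue that hides the Python -> TF latency.
with tf.device('/cpu:0'):
    example_ph = tf.placeholder(tf.float32, EXAMPLE_SHAPE)
    cpu_queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32],
                             shapes=[EXAMPLE_SHAPE])
    enqueue_cpu = cpu_queue.enqueue([example_ph])   # driven by client threads

stage_ops = []     # run these to keep the CPU -> GPU copies in flight
gpu_examples = []  # per-GPU, already-resident data for the model to consume
for i in range(NUM_GPUS):
    with tf.device('/cpu:0'):
        cpu_example = cpu_queue.dequeue()           # dequeue stays on the CPU
    with tf.device('/gpu:%d' % i):
        # Small per-GPU buffer: only one or two examples resident at a time.
        area = StagingArea(dtypes=[tf.float32], shapes=[EXAMPLE_SHAPE])
        stage_ops.append(area.put([cpu_example]))   # H2D copy into the buffer
        gpu_examples.append(area.get())             # previously staged example

# Each step, stage_ops are run alongside the per-GPU compute (or from
# dedicated client threads), so the next example's transfer overlaps the
# current example's computation.
```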

@sjperkins
Contributor

sjperkins commented Oct 7, 2016

I also agree with @mkolod and @benbarsdell that the H2D (and D2H) transfers can be problematic. This is related to #2848, where there is further discussion of TensorFlow's GPU transfer scheduling. Part of the problem is that the Send/Recv ops are scheduled immediately, and not necessarily in the order of their dependent operations. One suggested solution is to use tf.identity to order transfers to the GPU; a small sketch follows below.

Thinking on this some more, would multiple towers (in the cifar10 example terminology) per GPU be a solution? Is the scheduler intelligent enough to interleave the GPU transfers and compute for both towers in parallel? I'm trying to come up with something
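A minimal sketch of the tf.identity suggestion above, assuming TF 1.x graph mode; the queue, shape, and stand-in loss are placeholders, and whether this fully resolves the scheduling issue is part of the #2848 discussion.

```python
import tensorflow as tf

# CPU-side input queue (queues can't currently live on the GPU).
queue = tf.FIFOQueue(capacity=100, dtypes=[tf.float32], shapes=[[256, 256, 3]])
cpu_batch = queue.dequeue()

with tf.device('/gpu:0'):
    # Wrapping the dequeued tensor in an identity op placed on the GPU gives
    # the H2D copy an explicit, schedulable node in the graph, so it can be
    # ordered with control dependencies instead of relying on the implicit
    # Send/Recv timing.
    gpu_batch = tf.identity(cpu_batch)
    loss = tf.reduce_sum(gpu_batch)     # stand-in for the model
```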

@sjperkins
Contributor

sjperkins commented Oct 7, 2016

Also, the following quote from the documentation suggests one can pin queues to a GPU?

N.B. Queue methods (such as q.enqueue(...)) must run on the same device as the queue. Incompatible device placement directives will be ignored when creating these operations.

EDIT: It's not possible to pin queues to GPUs yet.

@ppwwyyxx
Contributor

ppwwyyxx commented Feb 13, 2017

@zheng-xq

The downside is that this introduces more client-side threads to drive the new data transfers, but there is a separate effort to migrate those into TF itself, so we will ignore that problem for now.

Now that the stage/unstage operators have been added to TF, they do seem to require an extra client-side thread. Can you share the plan you mentioned (or a solution, if one already exists)?

@zheng-xq
Contributor

The plan is not to have separate Python threads driving them, but to embed the parallelism in the graph. We will publish some scripts demonstrating how to do this in the near future. Stay tuned.

@Neltherion

Still waiting!

@ppwwyyxx
Contributor

@Neltherion from tensorflow.python.ops.data_flow_ops import StagingArea has been there for a year.
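For reference, a minimal double-buffering sketch with StagingArea, assuming TF 1.x graph mode. Running the put op together with the train step in the same session.run() keeps the next batch's H2D copy in flight while the current batch computes, with no extra client-side thread; the batch shapes and the random make_batch() source are hypothetical.

```python
import numpy as np
import tensorflow as tf
from tensorflow.python.ops.data_flow_ops import StagingArea

IMAGE_SHAPE = [128, 224, 224, 3]   # hypothetical batch of images
LABEL_SHAPE = [128]

# Stand-ins for whatever the CPU-side input pipeline produces.
image_ph = tf.placeholder(tf.float32, IMAGE_SHAPE)
label_ph = tf.placeholder(tf.int32, LABEL_SHAPE)

with tf.device('/gpu:0'):
    area = StagingArea(dtypes=[tf.float32, tf.int32],
                       shapes=[IMAGE_SHAPE, LABEL_SHAPE])
    stage = area.put([image_ph, label_ph])     # copies the next batch to the GPU
    images, labels = area.get()                # the previously staged batch
    train_step = tf.reduce_sum(images)         # stand-in for the real train op


def make_batch():
    # Hypothetical random data source.
    return {image_ph: np.random.rand(*IMAGE_SHAPE).astype(np.float32),
            label_ph: np.random.randint(0, 10, LABEL_SHAPE).astype(np.int32)}


config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(stage, feed_dict=make_batch())    # warm-up: get one batch ahead
    for _ in range(10):
        # One call both consumes the staged batch and stages the next one,
        # so the H2D copy overlaps the computation.
        sess.run([train_step, stage], feed_dict=make_batch())
```

This appears to be the kind of in-graph parallelism described above: the only remaining Python-side work is producing the next batch.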
