reading_data/fully_connected_reader.py VERY slow relative to fully_connected_feed.py #837
Comments
I was able to get 68x68 images reading/decoding from TF-examples fast enough to saturate my K40, with an input pipeline using 6 threads on 1 CPU. These tutorials have not been tuned to work efficiently on GPU, so there could be some small ops that are placed on the GPU suboptimally and are causing unnecessary data transfers -- try pinning your input pipeline to the CPU manually. See #838 for an example of suboptimal placement making the GPU version run 100x slower.
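The pinning suggested above can be sketched roughly as follows, using the TF 0.x/1.x-era queue-runner API that these tutorials were written against. The filename, feature keys, and shapes below are illustrative (they mirror the MNIST tutorial's conventions), not a verbatim fix to the script:

```python
# Sketch, assuming the TF 0.x/1.x queue-runner API: pin every input op
# to the CPU so only the compute-heavy training ops end up on the GPU.
import tensorflow as tf

with tf.device('/cpu:0'):
    # Everything built inside this block is placed on the CPU.
    filename_queue = tf.train.string_input_producer(['train.tfrecords'])
    reader = tf.TFRecordReader()
    _, serialized = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized,
        features={'image_raw': tf.FixedLenFeature([], tf.string),
                  'label': tf.FixedLenFeature([], tf.int64)})
    image = tf.decode_raw(features['image_raw'], tf.uint8)
    image.set_shape([784])  # shuffle_batch needs static shapes
    images, labels = tf.train.shuffle_batch(
        [image, features['label']], batch_size=100,
        num_threads=2, capacity=1000, min_after_dequeue=500)

# Model ops built outside the tf.device block use the default placement
# (GPU when available), so no small input ops land on the GPU and force
# host<->device copies for every record.
```

With this placement, the only host-to-device transfer per step is the dequeued batch itself rather than many small per-record tensors.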
Re: "pinning the pipeline to the CPU": here's how I would optimize it -- pin everything to CPU (export
This actually fixes #838. Pinning the pipeline to CPU for

-- Neville Ryant
Oops, off-by-one error on my part.
@ebrevdo: Could you take a look, since it's queue-related?
@josh11b could this be due to a lack of caching in the readers? Not sure if this bug is still relevant given the changes that have been pushed between when this bug was reported and now.
I believe there is now the ability to read batches from a reader, which can reduce overhead, assuming there is no problem with the examples having different dimensions. Also, I recall someone is working on ParseExample performance improvements?
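The batched-read idea above boils down to amortizing a fixed per-read cost over many records. A minimal pure-Python sketch (not the TensorFlow API; the `OP_OVERHEAD` constant is an invented stand-in for per-op dispatch cost) of why that helps:

```python
import time

OP_OVERHEAD = 0.0005  # hypothetical fixed cost paid per read op


def read_one_at_a_time(records):
    # One op launch per record: overhead scales with len(records).
    out = []
    for r in records:
        time.sleep(OP_OVERHEAD)
        out.append(r.upper())
    return out


def read_batched(records, batch_size=32):
    # One op launch per batch: overhead scales with len(records)/batch_size.
    out = []
    for i in range(0, len(records), batch_size):
        time.sleep(OP_OVERHEAD)
        out.extend(r.upper() for r in records[i:i + batch_size])
    return out


if __name__ == "__main__":
    data = ["example"] * 128
    t0 = time.time(); a = read_one_at_a_time(data); t1 = time.time()
    b = read_batched(data); t2 = time.time()
    print(f"single: {t1 - t0:.3f}s, batched: {t2 - t1:.3f}s")
```

The two functions return identical results; the batched version simply pays the fixed cost 4 times instead of 128 times for this input.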
Yes; the ParseExample work will hopefully get checked in within a week or two.
I'm assuming that this has been checked in. Feel free to open a new GitHub issue if the problem still persists in recent versions.
I noticed that when using a data reader to provide minibatches of examples to a model, performance is greatly reduced relative to just supplying the examples via `feed_dict`. For instance, when running `reading_data/fully_connected_reader.py` with the following flags:

it takes 28.7 seconds to process 600 minibatches with a GPU utilization of 13%. If I edit the code so that `num_threads=16` (instead of `num_threads=2`) when `shuffle_batch` is called, these numbers improve to 14.9 seconds and 23% GPU utilization. However, training the same model via `fully_connected_feed.py` takes only 2.63 seconds and achieves a GPU utilization of 55%. This is hardly rigorous, but it seems that the overhead involved in reading the Example protos from the TFRecords file, putting them into a queue, etc. is much higher than I would expect.

These numbers were compiled using 039981f, running on a Titan X card with no other background processes running.