
reading_data/fully_connected_reader.py VERY slow relative to fully_connected_feed.py #837

Closed
nryant opened this issue Jan 22, 2016 · 10 comments

@nryant

nryant commented Jan 22, 2016

I noticed that when using a data reader to provide minibatches of examples to a model, performance is greatly reduced relative to just supplying the examples via feed_dict. For instance, when running reading_data/fully_connected_reader.py with the following flags:

--hidden1 512 --hidden2 512 --batch_size 128

it takes 28.7 seconds to process 600 minibatches with a GPU utilization of 13%. If I edit the code so that num_threads=16 (instead of num_threads=2) when shuffle_batch is called, these numbers improve to 14.9 seconds and 23% GPU utilization. However, training the same model via fully_connected_feed.py takes only 2.63 seconds and achieves a GPU utilization of 55%. This is hardly rigorous, but it seems that the overhead involved in reading the Example protos from the TFRecords file, putting them into a queue, etc., is much higher than I would expect.

These numbers were collected at commit 039981f, running on a Titan X card with no other background processes.
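For reference, here is a minimal sketch of the queue-based pipeline being benchmarked, written against the tf.train queue API; the filename, feature names, and queue capacities are illustrative rather than the tutorial's exact code:

```python
import tensorflow as tf

# Minimal sketch of the queue-based pipeline used by fully_connected_reader.py
# (filename, feature names, and queue sizes are illustrative, not the exact
# tutorial code).
filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

features = tf.parse_single_example(
    serialized_example,
    features={
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
image = tf.decode_raw(features['image_raw'], tf.uint8)
image.set_shape([28 * 28])
image = tf.cast(image, tf.float32) * (1.0 / 255)
label = tf.cast(features['label'], tf.int32)

# num_threads is the knob mentioned above: 2 in the tutorial, 16 in the run
# that cut the time from 28.7 s to 14.9 s.
images, labels = tf.train.shuffle_batch(
    [image, label],
    batch_size=128,
    num_threads=16,
    capacity=1000 + 3 * 128,
    min_after_dequeue=1000)
```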

@nryant
Author

nryant commented Jan 22, 2016

Related to #551, #763?

@yaroslavvb
Contributor

I was able to read and decode 68x68 images from TF Examples fast enough to saturate my K40, with an input pipeline using 6 threads on 1 CPU. These tutorials have not been tuned to work efficiently on GPU, so some small ops may be placed on the GPU suboptimally and cause unnecessary data transfers -- try pinning your input pipeline to the CPU manually. See #838 for an example of suboptimal placement making the GPU version run 100x slower.
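A sketch of what that manual pinning looks like; build_input_pipeline and inference are hypothetical stand-ins for the tutorial's input and model code:

```python
import tensorflow as tf

# Sketch of manual device pinning: keep every reading/decoding/batching op on
# the CPU so small input ops never land on the GPU and force extra
# host<->device copies; only the model itself runs on the GPU.
with tf.device('/cpu:0'):
    # build_input_pipeline is a hypothetical helper wrapping the
    # TFRecordReader / parse_single_example / shuffle_batch ops shown above.
    images, labels = build_input_pipeline('train.tfrecords', batch_size=128)

with tf.device('/gpu:0'):
    # inference is a hypothetical stand-in for the tutorial's two-layer model.
    logits = inference(images, hidden1_units=512, hidden2_units=512)
```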

@yaroslavvb
Contributor

Re: "Pinning the pipeline to the CPU":

Here's how I would optimize it -- pin everything to CPU (export TF_MIN_GPU_MULTIPROCESSOR_COUNT=800) and remove all the non-reading ops. Tweak your pipeline design and number of threads until you get maximum throughput. Then re-enable the GPU and use manual pinning to make sure that your input pipeline throughput is unchanged. Then attach your processing ops (on the GPU).
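A rough sketch of that first step (the environment variable has to be set before TensorFlow is imported so no GPU is picked up; build_input_pipeline is again a hypothetical stand-in, and the 600-step loop mirrors the benchmark above):

```python
import os
# Set before importing TensorFlow: with an absurdly high multiprocessor-count
# threshold, no GPU qualifies, so only the CPU input pipeline is measured.
os.environ['TF_MIN_GPU_MULTIPROCESSOR_COUNT'] = '800'

import time
import tensorflow as tf

# Hypothetical helper returning only the reading/parsing/batching ops.
images, labels = build_input_pipeline('train.tfrecords', batch_size=128)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    start = time.time()
    for _ in range(600):  # same number of minibatches as the benchmark above
        sess.run([images, labels])
    print('input-pipeline-only time: %.2f s' % (time.time() - start))
    coord.request_stop()
    coord.join(threads)
```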

On Fri, Jan 22, 2016 at 2:59 PM, nryant notifications@github.com wrote:

Pinning the pipeline to the CPU helps somewhat, but still is worse than I would expect. For num_threads=2 the time reduces to 12.6 seconds with 13% GPU utilization, and for num_threads=16 to 9.4 seconds with 18% utilization.



@vrv closed this as completed in ebe109b on Jan 23, 2016
@nryant
Author

nryant commented Jan 23, 2016

This actually fixes #838. Pinning the pipeline to CPU for fully_connected_reader.py helps somewhat, but performance still lags fully_connected_feed.py. I did some benchmarking this afternoon before leaving the office, and most of the remaining performance gap seems to come from the fact that reading an epoch's worth of images takes 20-25x as long using a reader (0.2 seconds vs. about 5 seconds; sorry, I'm out of the office and don't have the precise timings with me). For a more realistically sized network, this additional overhead wouldn't be such an issue, so this probably should be closed after fully_connected_reader.py has been modified.

On Friday, January 22, 2016, Vijay Vasudevan notifications@github.com wrote:

Closed #837 via ebe109b.



Neville Ryant
Linguistic Data Consortium
University of Pennsylvania

@vrv reopened this on Jan 23, 2016
@yaroslavvb
Contributor

Oops, off-by-one error on my part.
When you are using a reader, there is more work done at the beginning because of prefetching; could it be that the extra time is due to filling up the queue of examples?
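One way to check that is to exclude the initial queue fill from the timed region, e.g. (a sketch only; train_op stands in for the tutorial's training step, and the warm-up count is arbitrary):

```python
import time
import tensorflow as tf

# Sketch: time only the steady state, after the prefetch threads have had a
# chance to fill the shuffle queue up to min_after_dequeue.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)

    for _ in range(20):        # warm-up steps, excluded from the measurement
        sess.run(train_op)     # train_op: the tutorial's training step (assumed)

    start = time.time()
    for _ in range(600):
        sess.run(train_op)
    print('steady-state time for 600 steps: %.2f s' % (time.time() - start))

    coord.request_stop()
    coord.join(threads)
```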

@girving
Contributor

girving commented Jun 6, 2016

@ebrevdo: Could you take a look since it's queue related?

@ebrevdo
Contributor

ebrevdo commented Aug 10, 2016

@josh11b could this be due to lack of caching in the readers? Not sure if this bug is still relevant given the changes that have been pushed between when this bug was reported and now.

@josh11b
Contributor

josh11b commented Aug 10, 2016

I believe there is now the ability to read batches from a reader that can reduce overhead, assuming there is no problem with the examples having different dimensions. Also I recall someone is working on ParseExample performance improvements?
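A sketch of that batched path with the queue-based API: read_up_to pulls a vector of serialized records per op invocation, and tf.parse_example parses the whole vector in a single call instead of one parse_single_example per record (feature names here are illustrative):

```python
import tensorflow as tf

filename_queue = tf.train.string_input_producer(['train.tfrecords'])
reader = tf.TFRecordReader()

# Pull up to a full batch of serialized Examples per op invocation ...
_, serialized_batch = reader.read_up_to(filename_queue, num_records=128)

# ... and parse them all at once with the batched ParseExample op. This
# assumes every example has the same fixed-length features, per the caveat
# about differing dimensions above.
features = tf.parse_example(
    serialized_batch,
    features={
        'image_raw': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
images = tf.decode_raw(features['image_raw'], tf.uint8)
labels = tf.cast(features['label'], tf.int32)
```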

@ebrevdo
Contributor

ebrevdo commented Aug 11, 2016

Yes; the ParseExample work will hopefully get checked in within a week or two.


@drpngx
Contributor

drpngx commented Jan 24, 2017

I'm assuming that this is checked in. Feel free to open a new GitHub issue if the problem persists in recent versions.

@drpngx closed this as completed on Jan 24, 2017