`map_and_batch` slower than `map` + `batch` #20059
Can you try setting …? /cc @jsimsa
Thanks for the fast reply!
This is TF 1.8. Do I have to try the 1.9 RC?
Ah yes, that argument was only added in bf228e1, so you'd need to upgrade to use it. As a proxy, however, does it speed up your program if you cut …? (Incidentally, the reason we added ….)
Thanks @mrry for the reply! It took me a while to come back with new results, as there were other things running on the machine, so benchmarking was not feasible. I tried with …. I also tried with the v1.9 RC.
That is surprising. I'll assign this to @jsimsa, since he has been working on the performance of ….

@jsimsa The only thing I can think of here is that the parallel copy at tensorflow/tensorflow/core/kernels/data/map_and_batch_dataset_op.cc, lines 309 to 310 in b7300de, might be slower than a sequential concat in some cases. For example, we might be using too many threads to perform each copy, and they could be contending. I'm not convinced that multithreading that copy is always a good idea when we'd expect to have ….

@cipri-tom We might have some difficulty reproducing your workload without more details. As a proxy, would you be able to capture a performance trace using a tool like pprof and share the results when running each version? Also, could you try running with …? Thanks!
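The contention hypothesis above can be illustrated outside TensorFlow. The sketch below is pure Python with illustrative names (it is not the actual kernel code): mapped elements are copied into a preallocated batch either sequentially or with one task per slot on a thread pool. Both produce the same batch, and for cheap copies the threaded version mostly adds coordination overhead.

```python
# Illustrative sketch of the hypothesis above (NOT the TF kernel):
# copy each element into its slot of a preallocated batch, either
# sequentially or via a thread pool. The results are identical; for
# cheap copies the threaded version mostly adds coordination overhead.
from concurrent.futures import ThreadPoolExecutor

def copy_sequential(elements):
    batch = [None] * len(elements)
    for i, elem in enumerate(elements):
        batch[i] = elem
    return batch

def copy_parallel(elements, workers=4):
    batch = [None] * len(elements)
    def copy_slot(i):
        batch[i] = elements[i]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_slot, range(len(elements))))
    return batch

elems = list(range(6))
assert copy_sequential(elems) == copy_parallel(elems)
```

Whether the threaded copy wins depends on the size of each element's copy relative to the cost of dispatching work to threads, which is exactly the trade-off questioned in the comment above.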
Hi @cipri-tom, I evaluated the performance of …. This is the program that I used for my evaluation:
See if you can use it as a starting point to generate an example that reproduces the issue you have encountered. As a side note, since you seem to care about performance, I recommend you build TensorFlow from source with AVX, AVX2, or FMA enabled (assuming your CPU supports these). Doing so will likely benefit the performance of your pipeline.
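jsimsa's actual benchmark program did not survive in the text above. A minimal stand-in harness (an assumption, not his code; `bench` and the sample workloads are hypothetical) would time several runs of each pipeline variant and compare medians, since a single run on a shared machine is too noisy:

```python
# Hypothetical benchmarking harness (NOT jsimsa's program): time
# several runs of a callable and return the median, which is more
# robust to interference from other jobs than a single measurement.
import statistics
import time

def bench(fn, repeats=5):
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()  # one full pass over the pipeline variant under test
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# usage: compare bench(run_map_then_batch) with bench(run_map_and_batch),
# where each callable drains one epoch of the corresponding pipeline
median_seconds = bench(lambda: sum(range(100000)), repeats=3)
```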
@jsimsa thank you for getting back! We have things to keep the GPUs busy until next week, so I can't try anything before that. I'll get back when I have any conclusive results.
@jsimsa Thank you for the tests and the benchmarking program! Indeed, running it with various configurations doesn't reveal any trouble with either pipeline. On my side, there are no conclusive results. I still see the mentioned slowdown, but the causes are very weird and most probably tied to my system/program and not to TF. This is because I ran one very long training on a separate, more performant machine, and during the training I saw the ….

It is interesting that the intervals of performance drop/increase are synchronised with the epochs. In other words, each lasts ~4000 steps, which is the size of one epoch, and I trained for 10 epochs. If you have any suggestions for this, they would be very welcome. Otherwise, it is safe to let this issue die 😀
@cipri-tom thank you for reporting your findings. My best guess is that this is related to either I/O or memory alignment. To better understand what is going on, I would collect and compare pprof traces for epochs that are performing differently.
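The core question in this thread is where parallelism is applied: `map` followed by `batch` parallelises individual map calls, while the reporter believes `map_and_batch` (via `num_parallel_batches`) parallelises at whole-batch granularity. A pure-Python sketch of the two granularities (illustrative stand-ins, not TensorFlow code; `heavy_map` plays the role of an expensive transform such as `augment_data`):

```python
# Pure-Python sketch (NOT TensorFlow) of element-level vs batch-level
# parallelism for a heavy per-element map function.
from concurrent.futures import ThreadPoolExecutor

def heavy_map(x):
    return x * 2  # stand-in for an expensive per-element transform

def map_then_batch(data, batch_size, workers=4):
    # element granularity: every element may run on a different
    # thread, as with map(..., num_parallel_calls=N) then batch()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(heavy_map, data))
    return [mapped[i:i + batch_size] for i in range(0, len(mapped), batch_size)]

def fused_map_and_batch(data, batch_size, parallel_batches=2):
    # batch granularity: each worker builds one whole batch, mapping
    # its elements sequentially, as with num_parallel_batches=N
    batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    def build(batch):
        return [heavy_map(x) for x in batch]
    with ThreadPoolExecutor(max_workers=parallel_batches) as pool:
        return list(pool.map(build, batches))

# both variants yield identical batches; only the threading differs
assert map_then_batch(list(range(8)), 4) == fused_map_and_batch(list(range(8)), 4)
```

With a heavy map function and a large batch (such as the `batch_size=512` in the report below), the element-granularity variant keeps many more threads busy while a single batch is being produced.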
System information
TensorFlow version: v1.8.0-0-g93bc2e2072
Describe the problem
Using `map_and_batch` in my use case results in a slower input pipeline than using a normal `map` followed by `batch`, with `batch_size=512`.

Here is my code. The `augment_data` and `padding_inputs_width` functions are quite heavy.

While I use the same number of parallel calls, I think the difference comes from the fact that the map function is heavy, and when using `map_and_batch` only one thread is used for producing each batch.

How much slower?
It is hard to quantify. With `map_and_batch` I just see lower numbers for GPU utilisation, even reaching zero at times. I tried increasing the `prefetch` to 4 to make up for this, but no improvement. Here I ran with the first input pipeline for a bit and then with `map_and_batch`. You can see a difference of about 30%.

Feature request
The reason for this issue is that the documentation for `map_and_batch` says this fusion will be done automatically in future versions. I think that in its current version this can be a regression, as shown above. I believe (though I'm most probably wrong) that there should be a parameter in `map_and_batch` controlling the number of threads for the `map` operation, and another one for `num_parallel_batches`. Or something along those lines...

Edit: Python version is 3.5.2.