
Unexpected performance changes as a function of batch size #4132

Closed
tillahoffmann opened this issue Aug 31, 2016 · 8 comments

@tillahoffmann
Contributor

I am observing unexpected performance from tensorflow as I change the batch size that I feed to the session.

[Plot: execution time as a function of batch size, from the notebook]

I have created a small Jupyter notebook to demonstrate the issue. Error bars correspond to the standard deviation of the mean over multiple runs.

In some of our more complex models, the jump in runtime occurs at small batch sizes (around 200 images of 40 by 80 pixels).
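For reference, the benchmark in the notebook boils down to something like the following sketch (simplified here; the shapes and the op are illustrative, not the exact ones from the notebook):

```python
import time
import numpy as np
import tensorflow as tf

# Simplified stand-in for the notebook: time a cheap op fed through a
# placeholder at several batch sizes and report the mean per-element time.
x = tf.placeholder(tf.float32, shape=[None, 40 * 80])
y = tf.reduce_sum(tf.square(x))

with tf.Session() as sess:
    for batch_size in [32, 128, 512, 2048, 8192]:
        data = np.random.rand(batch_size, 40 * 80).astype(np.float32)
        times = []
        for _ in range(20):  # repeat runs to estimate the mean and its error
            start = time.time()
            sess.run(y, feed_dict={x: data})
            times.append((time.time() - start) / batch_size)
        times = np.array(times)
        # standard deviation of the mean over the repeated runs
        print(batch_size, times.mean(), times.std() / np.sqrt(len(times)))
```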

@poxvoculi
Contributor

I tried your simple example and got a plot that doesn't show the trend reversal to the same degree, only a very slight one.

[Plot: execution time as a function of batch size, from poxvoculi's run]

I ran on a Google server, which likely explains the difference from your results.

It's not surprising that there would be some kind of U-shaped curve in this experiment. Intuitively, amortized per-element computation costs should decrease as the batch size increases, but that trend can't continue forever. Eventually you exceed some limit that imposes additional costs, likely related to memory or thread management. So, in practice, the ideal batch size is finite and you'll need to tune for it.

poxvoculi added the stat:awaiting response label on Aug 31, 2016
@tillahoffmann
Contributor Author

I agree that the trend should not continue forever, but an increase in computation time as sharp as the one in my experiments seems to hint at a problem. I observe the problem even when the amount of memory per batch is only a few megabytes, smaller than a typical photo taken on a cell phone.

I am using a 48-core server with 128 GB of memory running CentOS.
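For a sense of scale, the per-batch payload in the more complex model is indeed tiny, assuming float32 and a single channel (the channel count is an assumption on my part):

```python
# 200 images of 40 x 80 pixels, float32, one channel (assumed)
batch_bytes = 200 * 40 * 80 * 4
print(batch_bytes / 1e6)  # ~2.56 MB
```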

poxvoculi self-assigned this on Sep 1, 2016
poxvoculi added the stat:awaiting tensorflower label and removed the stat:awaiting response label on Sep 1, 2016
@poxvoculi
Contributor

I'm now quite puzzled. The chart I posted above came from a Python notebook (not Jupyter, but something similar) with an unknown TensorFlow version and configuration state. I was able to reproduce your results by switching to a different TensorFlow version that I thought was built from a recent sync of the code tree. The execution times I was seeing were 10x too slow for large tensors (batches, in your program), and the problem seemed to be related to the use of the placeholder: if I substituted a Variable or constant for the fed placeholder, the slowdown disappeared. While looking deeper I did a recompile, which made the anomaly disappear entirely.
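Roughly, the substitution I tried looks like this (a sketch, not your notebook code; the shapes and the op are illustrative):

```python
import time
import numpy as np
import tensorflow as tf

data = np.random.rand(2048, 3200).astype(np.float32)

# Variant 1: the value arrives through a fed placeholder.
x_ph = tf.placeholder(tf.float32, shape=data.shape)
y_ph = tf.reduce_sum(tf.square(x_ph))

# Variant 2: the same value is baked into the graph as a constant,
# so the Python-to-runtime feed path is taken out of the picture.
x_const = tf.constant(data)
y_const = tf.reduce_sum(tf.square(x_const))

with tf.Session() as sess:
    for name, fetch, feed in [("placeholder", y_ph, {x_ph: data}),
                              ("constant", y_const, {})]:
        start = time.time()
        for _ in range(50):
            sess.run(fetch, feed_dict=feed)
        print(name, (time.time() - start) / 50)
```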

I then retested your program using the most recent tensorflow-0.10.0rc0, and again I don't see the problem. What version are you using?

poxvoculi added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Sep 1, 2016
@tillahoffmann
Contributor Author

tillahoffmann commented Sep 1, 2016

Great to hear back from you. It's interesting that you could reproduce the problem with some versions of tensorflow. I think we installed ours with pip and are using 0.10.0rc0. I will run some experiments tomorrow and get back to you.

Regarding the variables: could you not reproduce the problem at all when you were using variables, or was it still reproducible if you fed the variable values via the feed_dict?

@tillahoffmann
Contributor Author

I have now rerun the benchmark with the convolutional network in a range of settings and could reproduce the problem in all of them. In particular, I used a Python 3.5.2 environment installed with conda and tried the following tensorflow installations:

  • pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl
  • pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl
  • conda install tensorflow from the conda-forge channel
  • a build from master, compiled without GPU support

Interestingly, I cannot reproduce the problem on my MacBook, which has fewer resources than the big server.

@poxvoculi
Contributor

I'm going to look at this a bit more, but first, in case you're willing to do some more experiments, let me give my thoughts.

I'm suspicious of the interaction between python and the backend graph execution environment. I'm much more familiar with that backend environment and how it actually executes large tensor Ops, and nothing I know of there seems like a plausible cause for the problem. Due to the hybrid python/compiled nature of the binary, my usual profiling tools are not much good, so it's difficult to identify where the time is going, but I doubt it's actually the tensor Op execution.

I wonder whether we might sometimes be seeing a slow data transfer from python to feed the placeholder. Grasping at straws, maybe data alignment is an issue. If you see reproducible differences between python versions (and/or SWIG?), that would be interesting.
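If you do run more experiments, one cheap thing to check (pure speculation on my part) is whether the array you feed is C-contiguous and aligned, since an awkward buffer layout could force an extra copy or a slow path when the placeholder is fed:

```python
import numpy as np

data = np.random.rand(2048, 3200).astype(np.float32)

# Report the layout flags of the array that gets passed to feed_dict.
print(data.flags['C_CONTIGUOUS'], data.flags['ALIGNED'])

# Normalising the layout before feeding is cheap to try.
data = np.ascontiguousarray(data, dtype=np.float32)
```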

poxvoculi added the bug label and removed the stat:awaiting response label on Sep 2, 2016
@tillahoffmann
Contributor Author

We were able to resolve the problem by updating our linux kernel.

Old: 3.10.0-327.13.1.el7.x86_64

New: 3.10.0-327.28.3.el7.x86_64

poxvoculi removed the bug label on Sep 6, 2016
@poxvoculi
Contributor

I've spent some more time on this, and I have not been able to reproduce your original results since last week. I've seen highly variable behavior in the execution times (on the order of ~2x, not 10x) and it looks like that's mainly due to threadpool contention (at least in my environment).

Recap: your model basically tests one wide pairwise Op against a sequential series of narrow pairwise Ops, where the inner loop performs the same number of atomic ops in aggregate. In the wide configuration there is one iteration over the entire input; in the narrow configuration there are many iterations over shorter segments. In the narrow configuration only a few threads are active in the inner loop, while in the wide configuration there may be as many threads active as there are cores.
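In simplified terms (a stand-in for your benchmark, not the actual notebook code), the two configurations amount to:

```python
import time
import numpy as np
import tensorflow as tf

data = np.random.rand(8192, 3200).astype(np.float32)
x = tf.placeholder(tf.float32, shape=[None, 3200])
y = tf.reduce_sum(tf.square(x))

with tf.Session() as sess:
    # Wide: one run over the entire input; the op can keep as many
    # threads busy as there are cores.
    start = time.time()
    sess.run(y, feed_dict={x: data})
    print("wide  ", time.time() - start)

    # Narrow: many runs over short segments; only a few threads are
    # active during each inner step.
    start = time.time()
    for i in range(0, len(data), 64):
        sess.run(y, feed_dict={x: data[i:i + 64]})
    print("narrow", time.time() - start)
```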

What I'm seeing is that sometimes there's little thread contention on my test machine and the wide configuration runs slightly faster than narrow, as expected, with low latency. However, sometimes there's contention from other processes, and the wide configuration is much more vulnerable to having one or more closures delayed, so the execution variance is much higher and the mean time also drifts up above that of the narrow configuration.

I'm going to close this issue unless evidence surfaces that there's a fixable problem.
