Unexpected performance changes as a function of batch size #4132
I tried your simple example and got a plot that doesn't show the trend reversal to the same degree, only a very slight one. I ran on a Google server, which likely explains the difference from your results. It's not surprising that there would be some kind of U-shaped curve in this experiment. Intuitively, amortized per-element computation cost should decrease as the batch size increases, but that trend can't continue forever. Eventually you'll exceed some limit that imposes extra costs, likely related to memory or thread management. So in practice the ideal batch size is finite, and you'll need to tune for it.
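A minimal way to probe that U-shape is to time a fixed operation across batch sizes and compare amortized per-element cost. The sketch below uses a NumPy matmul as a stand-in for the graph op (the actual model and dimensions are not shown in this thread, so `dim` and the batch sizes are illustrative assumptions):

```python
import time
import numpy as np

def per_element_time(batch_size, dim=128, repeats=5):
    """Best wall time for one batched matmul, divided by batch size."""
    x = np.random.rand(batch_size, dim).astype(np.float32)
    w = np.random.rand(dim, dim).astype(np.float32)
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        _ = x @ w
        best = min(best, time.perf_counter() - t0)
    return best / batch_size

# Sweep batch sizes; the amortized cost typically falls at first
# and may rise again once some resource limit is hit.
costs = {b: per_element_time(b) for b in (1, 16, 128, 1024)}
```

Plotting `costs` against batch size on a log axis is usually enough to see where the curve turns, if it does.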
I agree that the trend should not continue forever, but an increase in computation time as sharp as the one observed in my experiments seems to hint at a problem. I observe problems even when the amount of memory per batch is only a few megabytes, smaller than a typical image taken on a cell phone. I am using a 48-core server with 128 GB of memory running CentOS.
I'm now quite puzzled. The chart I posted above was from running in a python notebook (not jupyter, but something similar) with unknown tensorflow version and configuration state. I was able to reproduce your results by switching to a different tensorflow version that I thought was built from a recent code tree sync. The execution times I was seeing were 10x too slow for large tensors (batches, in your program), and the problem seemed to be related to the use of placeholder: e.g., if I substituted a Variable or constant for the fed placeholder, the slowdown disappeared. In the process of looking deeper I did a recompile, which caused the anomaly to disappear entirely. I then retested your program using the most recent tensorflow-0.10.0rc0, and again I don't see the problem. What version are you using?
Great to hear back from you. It's interesting that you could reproduce the problem with some versions of tensorflow. I think we pip installed ours and are using 0.10.0rc0. I will run some experiments tomorrow and will get back to you. Regarding the variables: could you not reproduce the problem at all when you were using variables, or was it reproducible if you replaced the variable values in the
I have now rerun the benchmark with the convolutional network in a range of settings and could reproduce the problem in all of them. In particular, I used a Python 3.5.2 environment installed with conda and tried the following tensorflow installations:
Interestingly, I cannot reproduce the problem on my MacBook, which has fewer resources than the big server.
I'm going to look at this a bit more, but first, in case you're willing to do some more experiments, let me give my thoughts. I'm suspicious of the interaction between python and the backend graph execution environment. I'm much more familiar with the backend environment and how it actually executes large tensor Ops, and there's nothing I know of there that seems like a plausible cause for the problem. Due to the hybrid python/compiled nature of the binary, my usual profiling tools are not much good, so it's difficult to identify where the time is going, but I doubt it's actually the tensor Op execution. I wonder whether we might sometimes be seeing a slow data transfer from python to feed the placeholder. Grasping at straws, maybe data alignment is an issue. If you see reproducible differences between python versions (and/or SWIG?), that would be interesting.
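One rough way to bound the suspected feed-transfer cost is to time the host-side copy a feed implies, independently of any graph execution. The sketch below is not TensorFlow code; it uses NumPy to measure the copy and to inspect the buffer's base-address alignment (the payload shape is a made-up assumption):

```python
import time
import numpy as np

# Hypothetical feed payload: a batch of 256 images, 40x80 float32.
batch = np.random.rand(256, 40, 80).astype(np.float32)

# Feeding a placeholder implies at least one host-side copy of the payload;
# timing that copy gives a floor on the feed overhead.
t0 = time.perf_counter()
copied = np.array(batch, copy=True)
copy_s = time.perf_counter() - t0

# Alignment check: the buffer's base address, and whether it is
# 16-byte aligned (relevant if SIMD paths care about alignment).
addr = batch.__array_interface__["data"][0]
aligned_16 = (addr % 16 == 0)
```

If `copy_s` is a negligible fraction of the observed step time, slow feeds alone cannot explain a 10x slowdown, which would point back toward scheduling or threading instead.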
We were able to resolve the problem by updating our Linux kernel. Old: 3.10.0-327.13.1.el7.x86_64
I've spent some more time on this, and I have not been able to reproduce your original results since last week. I've seen highly variable behavior in the execution times (on the order of ~2x, not 10x), and it looks like that's mainly due to threadpool contention (at least in my environment). Recap: your model basically tests one wide pairwise Op versus a sequential series of narrow pairwise Ops, where the inner loop performs the same number of atomic ops in aggregate. In the wide configuration there's one iteration over the entire input; in the narrow configuration there are many iterations over shorter segments. In the narrow configuration only a few threads are active in the inner loop; in the wide configuration there may be as many threads active as cores available. What I'm seeing is that sometimes there's little thread contention on my test machine and the wide configuration runs slightly faster than narrow, as expected, with low latency. However, sometimes there's contention from other processes, and the wide configuration is much more vulnerable to having one or more closures delayed, so the execution variance is much higher and the mean time also drifts up above that of the narrow configuration. I'm going to close this issue unless evidence surfaces that there's a fixable problem.
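The wide-versus-narrow setup described above can be sketched with NumPy standing in for the pairwise Op (the element count and segment count are illustrative assumptions, not values from the original benchmark):

```python
import time
import numpy as np

N, k = 1_000_000, 100          # total elements, number of narrow segments
a = np.random.rand(N).astype(np.float32)
b = np.random.rand(N).astype(np.float32)

# Wide: one pairwise op over the entire input.
t0 = time.perf_counter()
wide = a + b
t_wide = time.perf_counter() - t0

# Narrow: k sequential pairwise ops over N // k element segments;
# the same number of atomic adds in aggregate.
seg = N // k
t0 = time.perf_counter()
narrow = np.concatenate([a[i*seg:(i+1)*seg] + b[i*seg:(i+1)*seg]
                         for i in range(k)])
t_narrow = time.perf_counter() - t0
```

Running this repeatedly and comparing the variance of `t_wide` against `t_narrow` is a way to see the contention effect described: the wide path has more threads in flight, so a single delayed worker stalls the whole op.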
I am observing unexpected performance from tensorflow as I change the batch size that I feed to the session.
I have created a small jupyter notebook to demonstrate the issue. Error bars correspond to the standard deviation of the mean over multiple runs.
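The "standard deviation of the mean" used for the error bars is the standard error: sample standard deviation divided by the square root of the run count. A minimal sketch with made-up run times (the actual measurements are in the notebook, not this thread):

```python
import numpy as np

# Hypothetical per-run wall times in seconds (illustrative values only).
run_times = np.array([1.00, 1.20, 0.90, 1.10])

mean = run_times.mean()
# Standard error of the mean: sample std (ddof=1) / sqrt(n).
sem = run_times.std(ddof=1) / np.sqrt(len(run_times))
# mean = 1.05, sem ≈ 0.0645
```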
In some of our more complex models, the jump in runtime occurs at small batch sizes (around 200 images of 40 by 80 pixels).
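A quick back-of-envelope for that case shows how small the per-batch payload is. Single-channel images and float32 (4 bytes per value) are assumptions here; the report does not state dtype or channel count:

```python
# Feed size for the reported case: 200 images of 40x80 pixels,
# assuming 1 channel and float32 (4 bytes per value).
batch, height, width, bytes_per_value = 200, 40, 80, 4
feed_bytes = batch * height * width * bytes_per_value
feed_mb = feed_bytes / 1e6   # 2,560,000 bytes = 2.56 MB
```

At roughly 2.5 MB per batch, the jump in runtime is hard to attribute to memory pressure alone, which is what makes it surprising.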