
Unexpected performance changes as a function of batch size #4132

Closed
tillahoffmann opened this issue Aug 31, 2016 · 8 comments

@tillahoffmann
Contributor

I am observing unexpected performance from tensorflow as I change the batch size that I feed to the session.

[Plot: execution time as a function of batch size, from the notebook]

I have created a small Jupyter notebook to demonstrate the issue. Error bars correspond to the standard deviation of the mean over multiple runs.

In some of our more complex models, the jump in runtime occurs at small batch sizes (around 200 images of 40 by 80 pixels).
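For reference, the benchmark in the notebook boils down to something like the following sketch (simplified here; the shapes and the op are illustrative, not the exact ones from the notebook):

```python
import time
import numpy as np
import tensorflow as tf

# Simplified stand-in for the notebook: time a cheap op fed through a
# placeholder at several batch sizes and report the mean per-element time.
x = tf.placeholder(tf.float32, shape=[None, 40 * 80])
y = tf.reduce_sum(tf.square(x))

with tf.Session() as sess:
    for batch_size in [32, 128, 512, 2048, 8192]:
        data = np.random.rand(batch_size, 40 * 80).astype(np.float32)
        times = []
        for _ in range(20):  # repeat runs to estimate the mean and its error
            start = time.time()
            sess.run(y, feed_dict={x: data})
            times.append((time.time() - start) / batch_size)
        times = np.array(times)
        # standard deviation of the mean over the repeated runs
        print(batch_size, times.mean(), times.std() / np.sqrt(len(times)))
```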

@poxvoculi
Contributor

I tried your simple example and got a plot that doesn't show the trend reversal to the same degree, only a very slight one.

[Plot: execution time as a function of batch size, from poxvoculi's run]

I ran on a Google server, which likely explains the difference from your results.

It's not surprising that there would be some kind of U-shaped curve in this experiment. Intuitively, amortized per-element computation costs should decrease as the batch size increases, but that trend can't continue forever. Eventually you exceed some limit that imposes additional costs, likely related to memory or thread management. So, in practice, the ideal batch size is finite and you'll need to tune for it.

poxvoculi added the stat:awaiting response label on Aug 31, 2016
@tillahoffmann
Contributor Author

I agree that the trend should not continue forever, but an increase in computation time as sharp as the one in my experiments seems to hint at a problem. I observe the problem even when the amount of memory per batch is only a few megabytes, smaller than a typical photo taken on a cell phone.

I am using a 48-core server with 128 GB of memory running CentOS.
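For a sense of scale, the per-batch payload in the more complex model is indeed tiny, assuming float32 and a single channel (the channel count is an assumption on my part):

```python
# 200 images of 40 x 80 pixels, float32, one channel (assumed)
batch_bytes = 200 * 40 * 80 * 4
print(batch_bytes / 1e6)  # ~2.56 MB
```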

poxvoculi self-assigned this on Sep 1, 2016
poxvoculi added the stat:awaiting tensorflower label and removed the stat:awaiting response label on Sep 1, 2016
@poxvoculi
Contributor

I'm now quite puzzled. The chart I posted above came from a Python notebook (not Jupyter, but something similar) with an unknown TensorFlow version and configuration state. I was able to reproduce your results by switching to a different TensorFlow version that I thought was built from a recent sync of the code tree. The execution times I was seeing were 10x too slow for large tensors (batches, in your program), and the problem seemed to be related to the use of the placeholder: if I substituted a Variable or constant for the fed placeholder, the slowdown disappeared. While looking deeper I did a recompile, which made the anomaly disappear entirely.
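Roughly, the substitution I tried looks like this (a sketch, not your notebook code; the shapes and the op are illustrative):

```python
import time
import numpy as np
import tensorflow as tf

data = np.random.rand(2048, 3200).astype(np.float32)

# Variant 1: the value arrives through a fed placeholder.
x_ph = tf.placeholder(tf.float32, shape=data.shape)
y_ph = tf.reduce_sum(tf.square(x_ph))

# Variant 2: the same value is baked into the graph as a constant,
# so the Python-to-runtime feed path is taken out of the picture.
x_const = tf.constant(data)
y_const = tf.reduce_sum(tf.square(x_const))

with tf.Session() as sess:
    for name, fetch, feed in [("placeholder", y_ph, {x_ph: data}),
                              ("constant", y_const, {})]:
        start = time.time()
        for _ in range(50):
            sess.run(fetch, feed_dict=feed)
        print(name, (time.time() - start) / 50)
```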

I then retested your program using the most recent tensorflow-0.10.0rc0, and again I don't see the problem. What version are you using?

poxvoculi added the stat:awaiting response label and removed the stat:awaiting tensorflower label on Sep 1, 2016
@tillahoffmann
Contributor Author

tillahoffmann commented Sep 1, 2016

Great to hear back from you. It's interesting that you could reproduce the problem with some versions of tensorflow. I think we installed ours with pip and are using 0.10.0rc0. I will run some experiments tomorrow and get back to you.

Regarding the variables: could you not reproduce the problem at all when you were using variables, or was it still reproducible if you fed the variable values via the feed_dict?

@tillahoffmann
Contributor Author

I have now rerun the benchmark with the convolutional network in a range of settings and could reproduce the problem in all of them. In particular, I used a Python 3.5.2 environment installed with conda and tried the following tensorflow installations:

  • pip install https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl
  • pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp35-cp35m-linux_x86_64.whl
  • conda install tensorflow from the conda-forge channel
  • a build from master, compiled without GPU support

Interestingly, I cannot reproduce the problem on my MacBook, which has fewer resources than the big server.

@poxvoculi
Contributor

I'm going to look at this a bit more, but first, in case you're willing to do some more experiments, let me give my thoughts.

I'm suspicious of the interaction between python and the backend graph execution environment. I'm much more familiar with that backend environment and how it actually executes large tensor Ops, and nothing I know of there seems like a plausible cause for the problem. Due to the hybrid python/compiled nature of the binary, my usual profiling tools are not much good, so it's difficult to identify where the time is going, but I doubt it's actually the tensor Op execution.

I wonder whether we might sometimes be seeing a slow data transfer from python to feed the placeholder. Grasping at straws, maybe data alignment is an issue. If you see reproducible differences between python versions (and/or SWIG?), that would be interesting.
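If you do run more experiments, one cheap thing to check (pure speculation on my part) is whether the array you feed is C-contiguous and aligned, since an awkward buffer layout could force an extra copy or a slow path when the placeholder is fed:

```python
import numpy as np

data = np.random.rand(2048, 3200).astype(np.float32)

# Report the layout flags of the array that gets passed to feed_dict.
print(data.flags['C_CONTIGUOUS'], data.flags['ALIGNED'])

# Normalising the layout before feeding is cheap to try.
data = np.ascontiguousarray(data, dtype=np.float32)
```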

poxvoculi added the bug label and removed the stat:awaiting response label on Sep 2, 2016
@tillahoffmann
Contributor Author

We were able to resolve the problem by updating our linux kernel.

Old: 3.10.0-327.13.1.el7.x86_64

New: 3.10.0-327.28.3.el7.x86_64

poxvoculi removed the bug label on Sep 6, 2016
@poxvoculi
Contributor

I've spent some more time on this, and I have not been able to reproduce your original results since last week. I've seen highly variable behavior in the execution times (on the order of ~2x, not 10x) and it looks like that's mainly due to threadpool contention (at least in my environment).

Recap: your model basically tests one wide pairwise Op against a sequential series of narrow pairwise Ops, where the inner loop performs the same number of atomic ops in aggregate. In the wide configuration there is one iteration over the entire input; in the narrow configuration there are many iterations over shorter segments. In the narrow configuration only a few threads are active in the inner loop, while in the wide configuration there may be as many threads active as there are cores.
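In simplified terms (a stand-in for your benchmark, not the actual notebook code), the two configurations amount to:

```python
import time
import numpy as np
import tensorflow as tf

data = np.random.rand(8192, 3200).astype(np.float32)
x = tf.placeholder(tf.float32, shape=[None, 3200])
y = tf.reduce_sum(tf.square(x))

with tf.Session() as sess:
    # Wide: one run over the entire input; the op can keep as many
    # threads busy as there are cores.
    start = time.time()
    sess.run(y, feed_dict={x: data})
    print("wide  ", time.time() - start)

    # Narrow: many runs over short segments; only a few threads are
    # active during each inner step.
    start = time.time()
    for i in range(0, len(data), 64):
        sess.run(y, feed_dict={x: data[i:i + 64]})
    print("narrow", time.time() - start)
```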

What I'm seeing is that sometimes there's little thread contention on my test machine and the wide configuration runs slightly faster than narrow, as expected, with low latency. However, sometimes there's contention from other processes, and the wide configuration is much more vulnerable to having one or more closures delayed, so the execution variance is much higher and the mean time also drifts up above that of the narrow configuration.

I'm going to close this issue unless evidence surfaces that there's a fixable problem.
