
No performance improvement at batch size 128? #30

Closed
zhaoerchao opened this issue Jun 12, 2017 · 5 comments

Comments

@zhaoerchao

I ran the script as follows:

python tf_cnn_benchmarks.py --local_parameter_device=gpu --num_gpus=4 --batch_size=128 --model=resnet50 --variable_update=replicated --nodistortions --nccl True --trace_file ~/timeline.json

But there is no improvement at all; the speed is the same as with batch size 64:

Step Img/sec loss
1 images/sec: 725.2 +/- 0.0 (jitter = 0.0) 7.463
10 images/sec: 736.7 +/- 1.4 (jitter = 2.9) 7.180
20 images/sec: 731.7 +/- 2.4 (jitter = 6.7) 7.048
30 images/sec: 723.7 +/- 2.6 (jitter = 19.3) 6.971
40 images/sec: 719.0 +/- 2.4 (jitter = 15.9) 6.929
50 images/sec: 716.0 +/- 2.1 (jitter = 12.2) 6.898

What's the reason? And why does the speed keep getting slower as the step count increases?

@ekelsen

ekelsen commented Jun 13, 2017

The underlying convolution routines won't get any faster when the batch size goes from 64 to 128, so it isn't surprising that the overall training doesn't speed up either.
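A quick way to see this for yourself is to time a single convolution at both batch sizes and compare images/sec rather than step time. This is my own illustrative sketch, not part of tf_cnn_benchmarks; it is written against the TF 2.x API rather than the TF 1.x build used above, and the layer shape and iteration count are arbitrary choices.

```python
# Time one ResNet-style conv layer at batch 64 and 128 and compare images/sec.
# If the GPU is already saturated at batch 64, images/sec stays roughly flat.
import time
import tensorflow as tf

conv = tf.keras.layers.Conv2D(64, 3, padding="same")

@tf.function
def step(x):
    return conv(x)

for batch in (64, 128):
    x = tf.random.normal([batch, 56, 56, 64])
    step(x)  # warm-up: builds the layer and traces the graph for this shape
    iters = 20
    start = time.time()
    for _ in range(iters):
        y = step(x)
    _ = y.numpy()  # block until the queued GPU work has finished
    elapsed = time.time() - start
    print(f"batch {batch}: {batch * iters / elapsed:.1f} images/sec")
```

If the step time roughly doubles when the batch doubles, images/sec (and therefore overall training throughput) stays where it was.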

@ilovechai

@zhaoerchao @ekelsen These benchmark programs report, for example, around 570 images/sec, whereas running the same model normally gives about half of what the benchmark programs report. Why is that?

@zhaoerchao (Author)

@cryptox31 Did you run the program on the same GPU with the same version of TF?

@ilovechai

@zhaoerchao I am currently running the Inception v3 model with TensorFlow 1.0.1 on 4 Tesla P100 GPUs, and I am not getting optimal results.

@tfboyd (Member)

tfboyd commented Jul 25, 2017

Increasing the batch size will not always increase performance. I am far from an expert, but from my testing, each model and hardware combination has a point where, even with memory to spare, increasing the batch size no longer helps. Increasing the batch size normally helps when the step time is very fast: the larger batch slows the step down just enough to hide the transfer times and other calculations that are impossible to "hide" behind a very fast step. I know that is not a very technical explanation.

One good example is AlexNet: "everyone" runs it at a batch size of 512 or more now, but it used to be run with much smaller batches. I have not been working with ML very long, but if you test AlexNet with 32, 128, 256, and then 512 on most ML platforms you will see a significant speedup as the batch size increases; if I remember correctly, even more so on multi-GPU. (A rough batch-size sweep along these lines is sketched after this comment.)

Finally, the goal is normally to converge to the best possible top_1 accuracy. I know people are training ResNet with large total batch sizes, but I have not seen anyone training with 128 per GPU. Of course, with so much happening in the field, it has likely been done and I just did not see it.

Closing, as this is more or less expected. If you are seeing unexpected results with batch size 64 or 32, please let me know and I will see if I can figure it out.
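For reference, here is a rough way to observe the effect described above. This is my own illustrative sketch, not the benchmark script: the toy model, image size, epoch counts, and batch sizes are arbitrary placeholders, and the absolute numbers depend entirely on your hardware.

```python
# Sweep batch sizes on a tiny CNN and report training images/sec.
# On a fast GPU, small batches leave the device under-utilised; throughput
# grows with batch size until the GPU saturates, after which it flattens out.
import time
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(64, 7, strides=2, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="sgd",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

num_images = 2048  # same synthetic dataset for every batch size
x = np.random.rand(num_images, 64, 64, 3).astype("float32")
y = np.random.randint(0, 10, size=num_images)

for batch in (32, 64, 128, 256, 512):
    model.fit(x, y, batch_size=batch, epochs=1, verbose=0)  # warm-up
    start = time.time()
    model.fit(x, y, batch_size=batch, epochs=3, verbose=0)
    elapsed = time.time() - start
    print(f"batch {batch}: {3 * num_images / elapsed:.1f} images/sec")
```

Where the images/sec curve flattens is the point at which further increases in batch size stop helping, which matches the behaviour reported in this issue for ResNet-50 at 64 vs 128 per GPU.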

tfboyd closed this as completed Jul 25, 2017
freedomtan pushed a commit to freedomtan/benchmarks that referenced this issue Apr 18, 2018
Merge internal changes into public repository (change 168924045)
shengfuintel pushed a commit to Intel-tensorflow/benchmarks that referenced this issue May 23, 2018
Adding modules/functions common to Q2 POR development