Poor Scalability of TensorFlow MultiGPU Training on a Single Machine [Performance Bug] #10723

Closed
GD06 opened this Issue Jun 15, 2017 · 12 comments

GD06 commented Jun 15, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian 3.16.7

  • TensorFlow installed from (source or binary): source, hash: b1e174e

  • TensorFlow version (use command below): v1.1.0-rc2-119-gb1e174e 1.1.0-rc2

  • Bazel version (if compiling from source): 0.4.5-jdk7

  • CUDA/cuDNN version: CUDA 8.0, cudnn-v5.1

  • GPU model and memory: Titan X with 12 GiB

  • Exact command to reproduce: ./tf_multiGPU.py

  • tensorflow/benchmark version: source, hash: 9165a70

Describe the problem

Recently we have been testing TensorFlow's multi-GPU scalability on a single machine. We use the official scripts from the tensorflow/benchmarks GitHub repository and run them as described on the official website, on a machine equipped with 8 Titan X GPUs. We test the VGG16 model with a batch size of 64.

The results are shown in the following table:

| Num of GPUs | Throughput (images/sec) |
|---|---|
| 1 | 83.13 |
| 2 | 155.06 |
| 3 | 211.8 |
| 4 | 278.51 |
| 5 | 265.53 |
| 6 | 268.19 |
| 7 | 272.8 |
| 8 | 302.27 |

We are surprised to find that the total throughput with 5 GPUs is lower than with 4 GPUs, which suggests that TensorFlow incurs significant overhead once more than 4 GPUs are used. Since I do not know whether this performance issue belongs to the tensorflow/tensorflow repository or the tensorflow/benchmarks repository, I am submitting it here and looking forward to an official response. The script for reproducing this issue, together with the results, can be found in the Source code / logs section.

Scalability is strongly related to the topology of the GPU interconnect. Our machine has a tree topology for the GPUs; the details can be found in nv-topo-matrix.txt.

We believe this is a performance bug: even if adding GPUs cannot increase throughput, they should at least achieve similar throughput.

Source code / logs

Source code: tf_multiGPU.py
The log after executing the script above: https://drive.google.com/file/d/0Bw3_V-EwBVToYXVqZDkzQkN6c3M/view?usp=sharing

aselle (Member) commented Jun 15, 2017

@tfboyd , could you take a look and offer some suggestions?

tfboyd (Member) commented Jun 16, 2017

I would have to look closer, but I do not find this completely surprising, although your numbers are lower than I would expect given my experience running the benchmarks. VGG16 has a lot of parameters. Did you try different parameter-server approaches? On the DGX-1, replicated (with NCCL) was the best approach for VGG16; parameter_server on GPU was only a good option on AWS, where GPUDirect is available. I would try the following options for VGG16:

# My guess is that with your environment this will most likely give you the best scaling for VGG16:
- variable_update = replicated  local_parameter_device = gpu  use_nccl = true
or
# This is a good fallback in most environments:
- variable_update = parameter_server  local_parameter_device = cpu

If your Titan X setup is similar in topology to a DGX-1 (I no longer have access to one and have not researched your topology), then the NCCL setup will work best. If peering is not set up perfectly, then CPU is often the best option. Based on your single-GPU number, I would guess you should be able to get over 500 images/sec with whichever option works best for your topology.
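For reference, here is roughly how those options map onto the tf_cnn_benchmarks command line (flag names from memory, so double-check against --help in your checkout):

```
# Likely best on a well-peered 8x Titan X box: replicated variables, NCCL all-reduce
python tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --num_gpus=8 \
    --variable_update=replicated --local_parameter_device=gpu --use_nccl=True

# Reasonable fallback in most environments: variables on the CPU
python tf_cnn_benchmarks.py --model=vgg16 --batch_size=64 --num_gpus=8 \
    --variable_update=parameter_server --local_parameter_device=cpu
```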

Also, VGG16 does not scale as well as the other models as GPUs get faster. Looking at the scaling graph, with real data the curve starts to dip and you only hit something like a 5.8x speedup. On K80s (much slower GPUs) the scaling is over 6x on GCE. I also do not see many people publishing distributed VGG16 numbers, and from what I have read it is for the same reason: the large number of parameters makes it difficult to scale.
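To put a rough number on that (back-of-the-envelope, not a measurement): VGG16 has on the order of 138M parameters, so the gradients are roughly 550 MB per step in fp32. At your single-GPU rate of ~83 images/sec with batch size 64, each GPU finishes about 1.3 steps per second, so with 8 towers the aggregation device has to move several GB/s of gradient traffic in each direction. That can easily become the bottleneck on a shared PCIe tree, which is why the variable placement and all-reduce choice matters so much for this model.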

NCCL is supposed to "magically" solve these types of problems, but that does not seem to be the case currently. My testing showed that getting the best performance requires trying a few options.

I have run on other platforms, and this can happen there as well when using a variable placement option that does not suit the hardware.

I want to stress that these are guesses based on my experience testing on a variety of systems, but not remotely all of them. When it is a system I can access, it is much easier to speak in absolutes. I hope you get a better speedup with the other options.

tfboyd self-assigned this Jun 16, 2017

ppwwyyxx (Contributor) commented Jun 16, 2017

With " variable_update = replicated local_parameter_device = gpu use_nccl = true" I can get:
4GPU: 270im/s
8GPU: 530im/s
with Tesla M40s (roughly same speed as TitanX).

With "parameter_server" I can still get 4GPU:270im/s, 8GPU:470im/s which is faster than what's reported above. But my GPU topology may be better connected.

tfboyd (Member) commented Jun 16, 2017

Thanks @ppwwyyxx, it has been a while since we were on the same thread. :-) Off topic, but Datasets is great if you are going multi-GPU and do not need/want to feed your data via feed_dict. It is in contrib but is moving to core very soon.

Oh, did you use synthetic or real data? Just curious; I want to use your number as a mental data point if people ask about M40 performance. I have access to 4x Titan X (Maxwell) cards but not a box of M40s.

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/data

tfboyd (Member) commented Jun 17, 2017

Closing, please let me know if you still have issues. One of the options I suggested should get you much closer to the published performance numbers.

tfboyd closed this Jun 17, 2017

GD06 commented Jun 19, 2017

@tfboyd @ppwwyyxx
Hi Toby & Yuxin,

Thanks a lot for your patient explanations. I ran the multi-GPU tests with three argument sets; the results (throughput in images/sec) are shown in the table below.
The abbreviations for the parameter sets are as follows:

  1. replicated_gpu_nccl: variable_update=replicated local_parameter_device=gpu use_nccl=True
  2. parameter_server_cpu: variable_update=parameter_server local_parameter_device=cpu use_nccl=False
  3. parameter_server_gpu_nccl: variable_update=parameter_server local_parameter_device=gpu use_nccl=True
| Number of GPUs | replicated_gpu_nccl | parameter_server_cpu | parameter_server_gpu_nccl |
|---|---|---|---|
| 1 | 82.31 | 76.53 | 83.14 |
| 2 | 154.59 | 140.35 | 155.53 |
| 3 | 214.84 | 219.67 | 212.39 |
| 4 | 280.17 | 276.82 | 266.04 |
| 5 | 338.93 | 302.22 | 267.23 |
| 6 | 405.59 | 370.58 | 275.08 |
| 7 | 470.66 | 410.25 | 273.08 |
| 8 | 517.31 | 373.5 | 308.48 |

From the table above, I conclude that the parameter set in the second column (replicated_gpu_nccl) is the best practice for my system and platform. Do the results in that column achieve the expected speedup for VGG16 on 8 GPUs?

Thank you again for your helpful suggestions! I am still a little confused about the two modes for variable_update. What is the difference between "replicated" and "parameter_server"? Comparing the second and fourth columns of the table above, the two modes show a significant performance difference even though both use the GPU as the local parameter device and reduce gradients through NCCL. Can anyone help me?

Thanks,
Xinfeng

byronyi (Contributor) commented Jun 19, 2017

You should read through the benchmarks page. It contains detailed explanations of the differences between the variable update strategies.

GD06 commented Jun 19, 2017

@byronyi
Thanks for the kind reminder! I am sorry that I didn't state the problem clearly. To be more specific: what difference between the two modes causes such a large performance gap? To my (limited) understanding, the runs behind both the second and fourth columns aggregate gradients on one GPU and use NCCL for the reduction, so they should transfer the same amount of data. The run using replicated variable updates has to transfer gradients from each GPU and the aggregated gradients back to each GPU, while the run using a parameter server has to transfer gradients from each GPU and the updated variables back from the parameter server. The only difference is whether each GPU keeps a local copy of the weights. I think the multi-GPU overhead mainly comes from communication, yet with the same amount of data transferred, the two runs still show a large performance difference.

That is exactly what confuses me. I suspect my understanding of these two modes is incorrect. Thanks for your patience!

tfboyd (Member) commented Jun 19, 2017

First, I am really excited that having the benchmarks public gave us a reference point to figure this out (maybe a pat on the back, but it justifies the time and money we spend running these tests). Second, I am glad you were able to try multiple configurations; we now have another semi-reference point suggesting that M40s and decently connected Titan X (Maxwell) cards most likely behave like a DGX-1.

My attempt (others have already given possibly better answers and I will add a few links).

One of your configs did not do what you might have expected:

parameter_server_gpu_nccl: variable_update=parameter_server local_parameter_device=gpu use_nccl=True

In the above config, what happened is that the variables were spread across the 8 GPUs and NCCL was not involved. NCCL is only a valid option if you set variable_update=replicated and local_parameter_device=gpu. The settings can be confusing, and we do not do many checks, because this is a tool and there are so many options.

Deeply understanding the differences is beyond me. We had a long internal thread discussing why variable_update=parameter_server with local_parameter_device=cpu is faster on the DGX-1 for Inception and ResNet. At my level of understanding, it seems that the more parameters there are, the more likely NCCL will be the better option. I also know that VGG16 does not scale very well in distributed mode (across servers), which points in the same direction.
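As a very rough sketch of the data-movement difference (simplified, not the benchmark code; it assumes tf.contrib.nccl is available and uses random tensors as stand-ins for per-tower gradients):

```python
import tensorflow as tf
from tensorflow.contrib import nccl  # assumes a TF 1.x build with contrib NCCL ops

# Stand-ins for the gradient of one variable computed on each of two towers.
per_gpu_grads = []
for i in range(2):
    with tf.device('/gpu:%d' % i):
        per_gpu_grads.append(tf.random_normal([1000]))

# parameter_server + local_parameter_device=cpu:
# every tower's gradient is copied to the CPU, averaged there, and the variables
# (which live on the CPU) are updated in one place, so all traffic funnels
# through that single device.
with tf.device('/cpu:0'):
    ps_style_avg = tf.add_n(per_gpu_grads) / float(len(per_gpu_grads))

# replicated + use_nccl=True:
# every GPU keeps its own full copy of the variables; only the gradients move,
# GPU to GPU, via an all-reduce, and each GPU applies the same reduced gradient
# to its local copy, so no single device becomes the hot spot.
nccl_sums = nccl.all_sum(per_gpu_grads)  # one reduced tensor lands on every GPU
```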

I am sorry I cannot provide you a deeper technical explanation. I would rather say I do not know than make something up. :-)

GD06 commented Jun 19, 2017

@tfboyd
Hi Toby,

Thank you for your patient explanations and helpful information! At least I have confirmed that I had a misunderstanding of these modes. It's great that TensorFlow provides such flexible options. Understanding system performance is always hard, TensorFlow included. XD
We are making progress toward understanding TensorFlow's performance, and it is wonderful to get such insightful guidance from TensorFlow experts!

Thanks,
Xinfeng

kazemSafari commented Oct 3, 2018

I was wondering if you could provide a tutorial to follow for training a model on a local machine with multiple GPUs using the NCCL library.

The only one I have found so far is in Chinese:
http://openresearch.ai/t/nccl-efficient-tensorflow-multigpu-training/159

tfboyd (Member) commented Oct 3, 2018

@kazemSafari
I have asked the team to post better examples. The new distribution strategies make it very easy to take a model and scale it well across multiple GPUs with very few changes if you are using Estimator. While CIFAR-10 ResNet is not going to scale, it is easier to follow because it trains quickly.

https://github.com/tfboyd/models/tree/resnet_perf_tweaks/official/resnet
This pull request has the commands I used when testing.

The example is decent, but you have to look around for the code because it was made overly modular, with lots of functions passed around.

This page explains the UI; I strongly suggest using the Estimator section unless you are already using Keras.

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/README.md

There is an effort as we speak to update tensorflow.org to make this easier to find.
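In the meantime, here is a minimal sketch of the Estimator path (toy model and input pipeline just for illustration; it assumes a TF 1.x build where tf.contrib.distribute is available):

```python
import tensorflow as tf

def input_fn():
    # Toy in-memory dataset; replace with your real input pipeline.
    ds = tf.data.Dataset.from_tensors(({"x": [1.0]}, [1.0]))
    return ds.repeat().batch(64)

def model_fn(features, labels, mode):
    # Toy linear model standing in for a real network.
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# The only multi-GPU-specific lines: build a MirroredStrategy and hand it to RunConfig.
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=8)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, steps=100)
```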
