How to train multi-gpu on tensorflow using nccl library? #22692
Description
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: Local custom build PC
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 1.10
- Python version: 3.5
- Bazel version (if compiling from source): No
- GCC/Compiler version (if compiling from source): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
- CUDA/cuDNN version: 9.0/7.2
- GPU model and memory: Two GTX 1080 Ti each with 11G of memory
- Exact command to reproduce: IDK
Describe the problem:
I was wondering if you could provide a tutorial on how to train a simple CNN on multiple GPUs on the MNIST or CIFAR dataset, one that also explains how to use the NCCL library.
The only tutorials available are the following: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py#L101
http://blog.s-schoener.com/2017-12-15-parallel-tensorflow-intro/
https://proteusmaster.urcf.drexel.edu/urcfwiki/index.php/Job_Script_Example_09_TensorFlow_MNIST_Multi-GPU-CNN
None of them explains how to use NCCL.
I also searched for an answer on many websites, but the only one I found is this post on openresearch.ai, in Chinese:
http://openresearch.ai/t/nccl-efficient-tensorflow-multigpu-training/159
Although 1/6 of the Earth's population speaks Chinese, the other 5/6 does not! So I was wondering if someone on the TensorFlow team who knows English, and perhaps Chinese, could help with this issue. Thank you. I would also appreciate it if you did not refer me to Stack Overflow. Thanks again.