How to train multi-gpu on tensorflow using nccl library? #22692

@kazemSafari

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
    Yes

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
    Linux Ubuntu 16.04

  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
    N/A (custom-built local PC)

  • TensorFlow installed from (source or binary):
    binary

  • TensorFlow version (use command below):
    1.10

  • Python version:
    3.5

  • Bazel version (if compiling from source):
    No

  • GCC/Compiler version (if compiling from source):
    gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

  • CUDA/cuDNN version:
    9.0/7.2

  • GPU model and memory:
    Two GTX 1080 Ti each with 11G of memory

  • Exact command to reproduce:
    IDK

Describe the problem:

Describe the problem clearly here. Be sure to convey here why it's a bug in TensorFlow or a feature request.

I was wondering if you could provide a tutorial on how to train a simple CNN on multiple GPUs, using the MNIST or CIFAR dataset, that also explains how to use the NCCL library.
The only tutorials I have found are the following:
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py#L101
http://blog.s-schoener.com/2017-12-15-parallel-tensorflow-intro/
https://proteusmaster.urcf.drexel.edu/urcfwiki/index.php/Job_Script_Example_09_TensorFlow_MNIST_Multi-GPU-CNN
None of them explains how to use NCCL.
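
For context, the step such a tutorial would need to cover is the gradient all-reduce: in data-parallel training each GPU computes gradients on its own slice of the batch, and an all-reduce leaves every GPU holding the same averaged gradient. The sketch below simulates a ring all-reduce, one common way to implement that collective, in plain Python. It is not TensorFlow or NCCL code; the function name `ring_allreduce_mean` and the list-of-lists representation of per-GPU gradients are assumptions made purely for illustration.

```python
def ring_allreduce_mean(per_device_grads):
    """Simulate a ring all-reduce that averages one gradient vector per device.

    per_device_grads: list of equal-length lists of floats, one per simulated GPU.
    Returns a new list of lists; every device ends up with the same mean vector.
    """
    n = len(per_device_grads)
    length = len(per_device_grads[0])
    # Work on copies so the caller's lists are left untouched.
    bufs = [list(g) for g in per_device_grads]
    # Split the vector indices into n contiguous chunks (some may be empty).
    bounds = [(length * k) // n for k in range(n + 1)]
    chunks = [range(bounds[k], bounds[k + 1]) for k in range(n)]

    # Phase 1 (reduce-scatter): each device passes one chunk to its right-hand
    # neighbour, which adds it in. After n-1 steps device d holds the complete
    # sum for chunk (d + 1) % n.
    for step in range(n - 1):
        sends = [((d - step) % n, d) for d in range(n)]
        payloads = [[bufs[d][i] for i in chunks[c]] for c, d in sends]
        for (c, d), data in zip(sends, payloads):
            dst = (d + 1) % n
            for i, v in zip(chunks[c], data):
                bufs[dst][i] += v

    # Phase 2 (all-gather): the completed chunks travel around the ring,
    # overwriting the partial sums on every other device.
    for step in range(n - 1):
        sends = [((d + 1 - step) % n, d) for d in range(n)]
        payloads = [[bufs[d][i] for i in chunks[c]] for c, d in sends]
        for (c, d), data in zip(sends, payloads):
            dst = (d + 1) % n
            for i, v in zip(chunks[c], data):
                bufs[dst][i] = v

    # Turn the sums into means, as data-parallel SGD usually does.
    return [[v / n for v in b] for b in bufs]


if __name__ == "__main__":
    # Two simulated GPUs, each with its own gradient vector.
    print(ring_allreduce_mean([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
```

In a real multi-GPU setup the per-device lists would be gradient tensors living on different GPUs, and the two loops would be replaced by a single collective call provided by the communication library.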

I also searched for an answer on many websites, but the only one I found is this post on openresearch.ai, which is in Chinese:
http://openresearch.ai/t/nccl-efficient-tensorflow-multigpu-training/159

Although 1/6 of the world's population speaks Chinese, the other 5/6 does not! So I was wondering if someone on the TensorFlow team who knows English, and perhaps Chinese, could help with this issue. Thank you. I would also appreciate it if you did not refer me to Stack Overflow. Thanks again.
