motivation for this project #9
Just to throw in some discussion: I think it's simply because Distributed TensorFlow doesn't use MPI. MPI is convenient and powerful for high-performance computing, and this project brings MPI to TensorFlow.
The primary motivation for us was to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster. This has two aspects: (1) how many modifications one has to make to a program to make it distributed, and how easy it is to run; and (2) how much faster it runs in distributed mode.

Internally, we found that it's much easier for people to understand the MPI model, which requires minimal changes to source code (as described in the README), than to set up regular Distributed TensorFlow. To give some perspective, this commit to our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers, tf.Server(), tf.ClusterSpec(), SyncReplicasOptimizer, tf.replica_device_setter(), and so on.

We also found that the performance MPI and NCCL 2 deliver together is quite good. We are still working on a publishable benchmark, but I can say that on 16 GPUs deployed across 4 servers connected with InfiniBand we got 15.6x scaling for Inception V3 and 13.8x scaling for VGG-16. We got slightly worse numbers over TCP.

All in all, we wanted to give back to the TensorFlow community something that can help them train their models faster and with less effort.
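For readers unfamiliar with the MPI model mentioned above, the core collective it relies on is allreduce: every worker contributes its local gradients and each receives the element-wise average. Below is a toy, in-process Python sketch of that operation (no real MPI or NCCL involved; `allreduce_average` is a hypothetical helper name used only for illustration):

```python
# Toy illustration of the allreduce averaging at the heart of the MPI model.
# Real implementations (MPI_Allreduce, NCCL) perform this across processes
# or GPUs; here we simulate N workers in-process with plain Python lists.

def allreduce_average(worker_grads):
    """Element-wise average across workers; every worker gets the same result."""
    num_workers = len(worker_grads)
    # Sum each gradient element across all workers, then divide by worker count.
    summed = [sum(vals) for vals in zip(*worker_grads)]
    averaged = [s / num_workers for s in summed]
    # Each worker ends up with an identical copy of the averaged gradients.
    return [list(averaged) for _ in range(num_workers)]

if __name__ == "__main__":
    # Four simulated workers, each holding local gradients for 3 parameters.
    grads = [
        [1.0, 2.0, 3.0],
        [3.0, 2.0, 1.0],
        [2.0, 2.0, 2.0],
        [2.0, 2.0, 2.0],
    ]
    results = allreduce_average(grads)
    print(results[0])  # [2.0, 2.0, 2.0]
```

After this step, every worker applies the same averaged gradients, which is what keeps replicas in sync without parameter servers.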
Added motivation to README as per #10. Closing this issue.
Hi, I'm curious about the motivation for this project, since TensorFlow itself provides support for distributed training. Is it because the open-sourced distributed TensorFlow is slow? If so, what are the bottlenecks (e.g., gRPC?) and why? Are there benchmarks we can compare? Thanks.