motivation for this project #9

Closed
chentingpc opened this issue Aug 20, 2017 · 3 comments


chentingpc commented Aug 20, 2017

Hi, I'm curious about the motivation for this project, since TensorFlow itself provides support for distributed training. Is it because the open-source Distributed TensorFlow is slow? If that is the case, what are the bottlenecks (e.g., gRPC) and why? Are there benchmarks we can compare against? Thanks.

@rohskopf

Just to throw in some discussion: I think it's because Distributed TensorFlow doesn't use MPI. MPI is convenient and powerful for high-performance computing, and this project brings MPI to TensorFlow.


alsrgv commented Aug 20, 2017

The primary motivation for us was to make it easy to take a single-GPU TensorFlow program and train it faster on many GPUs. This has two aspects: (1) how many modifications does one have to make to a program to make it distributed, and how easy is it to run, and (2) how much faster does it run in distributed mode?

Internally, we found that it's much easier for people to understand the MPI model, which requires minimal changes to source code (as described in the README), than to set up regular Distributed TensorFlow.

To give some perspective on that, this commit to our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers, tf.Server(), tf.ClusterSpec(), SyncReplicasOptimizer, tf.replicas_device_setter(), etc.
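For readers who haven't looked at the README yet, a rough sketch of what those minimal changes look like in a TensorFlow 1.x training script follows; the toy model, learning rate, and step count below are placeholders for illustration, not taken from our benchmarks:

```python
import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod (one MPI process per GPU) and pin each process to its GPU.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Ordinary single-GPU model: a toy linear regression used as a stand-in.
x = tf.random_normal([32, 10])
y = tf.random_normal([32, 1])
w = tf.get_variable('w', [10, 1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# Scale the learning rate by the number of workers, then wrap the optimizer
# so gradients are averaged across ranks with allreduce.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variable states from rank 0 so all workers start identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)
```

The same script runs on one GPU or many; you launch it with MPI, e.g. mpirun -np 4 python train.py, and there is no cluster spec, parameter server, or tower logic to maintain.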

We also found that the performance MPI and NCCL 2 deliver together is pretty good. We are still working on a publishable benchmark, but I can say that on 16 GPUs deployed across 4 servers connected with InfiniBand we got 15.6x scaling for Inception V3 and 13.8x scaling for VGG-16. We got slightly worse numbers over TCP.

All in all, we wanted to give back to the TensorFlow community something that can help them train their models faster and with less effort.


alsrgv commented Aug 21, 2017

Added the motivation to the README as per #10. Closing this issue.

alsrgv closed this as completed Aug 21, 2017