motivation for this project #9
Just to throw in some discussion: I think it's simply because Distributed TensorFlow doesn't use MPI. MPI is convenient and powerful for high-performance computing, and this project brings MPI to TensorFlow.
The primary motivation for us was to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster. This has two aspects: (1) how many modifications one has to make to a program to make it distributed, and how easy it is to run; and (2) how much faster it runs in distributed mode.

Internally, we found that it's much easier for people to understand the MPI model, which requires minimal changes to source code (as described in the README), than to set up regular Distributed TensorFlow. To give some perspective, this commit to our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers, tf.Server(), tf.ClusterSpec(), SyncReplicasOptimizer, tf.replica_device_setter(), and so on.

We also found that the performance MPI and NCCL 2 deliver together is quite good. We are still working on a publishable benchmark, but I can say that on 16 GPUs deployed across 4 servers connected with InfiniBand we got 15.6x scaling for Inception V3 and 13.8x scaling for VGG-16. We got slightly worse numbers over TCP.

All in all, we wanted to give back to the TensorFlow community something that can help them train their models faster and with less effort.
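For readers unfamiliar with the MPI model mentioned above, the core collective it relies on is allreduce: every worker contributes its local gradients and each receives the element-wise average. Below is a toy, in-process Python sketch of that operation (no real MPI or NCCL involved; `allreduce_average` is a hypothetical helper name used only for illustration):

```python
# Toy illustration of the allreduce averaging at the heart of the MPI model.
# Real implementations (MPI_Allreduce, NCCL) perform this across processes
# or GPUs; here we simulate N workers in-process with plain Python lists.

def allreduce_average(worker_grads):
    """Element-wise average across workers; every worker gets the same result."""
    num_workers = len(worker_grads)
    # Sum each gradient element across all workers, then divide by worker count.
    summed = [sum(vals) for vals in zip(*worker_grads)]
    averaged = [s / num_workers for s in summed]
    # Each worker ends up with an identical copy of the averaged gradients.
    return [list(averaged) for _ in range(num_workers)]

if __name__ == "__main__":
    # Four simulated workers, each holding local gradients for 3 parameters.
    grads = [
        [1.0, 2.0, 3.0],
        [3.0, 2.0, 1.0],
        [2.0, 2.0, 2.0],
        [2.0, 2.0, 2.0],
    ]
    results = allreduce_average(grads)
    print(results[0])  # [2.0, 2.0, 2.0]
```

After this step, every worker applies the same averaged gradients, which is what keeps replicas in sync without parameter servers.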
Added motivation to README as per #10. Closing this issue.
Hi, I'm curious about the motivation for this project, since TensorFlow itself provides support for distributed training. Is it because the open-sourced distributed TensorFlow is slow? If so, what are the bottlenecks (e.g., gRPC?) and why? Are there benchmarks we can compare? Thanks.