Horovod support for MXNet framework #542
This PR adds Horovod support for distributed training with the MXNet deep learning framework. The core Allreduce and Broadcast functionality passes unit tests. @yuxihu and @apeforest are currently training ResNet-50 end-to-end, which should tell us whether this setup converges.
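For reviewers less familiar with the two collectives mentioned above, here is a minimal sketch of their semantics. It is a single-process simulation in plain Python, not the code in this PR, and it does not call Horovod or MXNet; the function names `allreduce_sum` and `broadcast` and the list-of-lists "tensors" are assumptions made purely for illustration.

```python
from typing import List

Tensor = List[float]


def allreduce_sum(worker_tensors: List[Tensor]) -> List[Tensor]:
    """Simulate allreduce: every worker ends up with the element-wise sum."""
    total = [sum(values) for values in zip(*worker_tensors)]
    # Each simulated worker receives its own copy of the reduced tensor.
    return [list(total) for _ in worker_tensors]


def broadcast(worker_tensors: List[Tensor], root_rank: int = 0) -> List[Tensor]:
    """Simulate broadcast: every worker ends up with the root worker's tensor."""
    root = worker_tensors[root_rank]
    return [list(root) for _ in worker_tensors]


# Three simulated workers, each holding a two-element gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = allreduce_sum(grads)
synced = broadcast(grads, root_rank=0)
```

In data-parallel training, allreduce is typically used to combine gradients across workers each step, and broadcast to make all workers start from the same initial parameters.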
Our performance results are here. They show throughput (scaling efficiency) with and without hierarchical allreduce (HA) on ResNet-50 with float32.
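As background on what HA changes, the sketch below models hierarchical allreduce as three steps: reduce within each node, combine the per-node results across nodes, then copy the result back to every worker on each node. This is a simplified single-process model in plain Python under that assumption, not Horovod's implementation and not the benchmark code; `flat_allreduce`, `hierarchical_allreduce`, and `gpus_per_node` are hypothetical names used only for illustration.

```python
from typing import List

Tensor = List[float]


def flat_allreduce(tensors: List[Tensor]) -> List[Tensor]:
    """Sum element-wise across all workers in a single step."""
    total = [sum(values) for values in zip(*tensors)]
    return [list(total) for _ in tensors]


def hierarchical_allreduce(tensors: List[Tensor], gpus_per_node: int) -> List[Tensor]:
    """Sum element-wise in three steps: intra-node, inter-node, intra-node copy."""
    # Step 1: group workers by node and reduce within each node.
    nodes = [tensors[i:i + gpus_per_node]
             for i in range(0, len(tensors), gpus_per_node)]
    node_sums = [[sum(values) for values in zip(*node)] for node in nodes]
    # Step 2: combine the per-node partial sums across nodes.
    global_sum = [sum(values) for values in zip(*node_sums)]
    # Step 3: every worker on every node receives a copy of the global sum.
    return [list(global_sum) for _ in tensors]


# Four simulated workers spread over two nodes (two workers per node).
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
flat = flat_allreduce(grads)
hier = hierarchical_allreduce(grads, gpus_per_node=2)
```

Both variants produce the same reduced values; the hierarchical form only changes which links carry the traffic, which is why it can affect throughput without affecting the result.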
We have completed and merged the PR that makes the necessary changes on the MXNet side for Horovod support. You can see the PR here: apache/incubator-mxnet#12666
Thanks a lot for the PR. I did the first pass of review and left a few comments. Could you take a look?