Horovod support for MXNet framework #542

Status: Open. Wants to merge 108 commits into base: master.
Conversation

@ctcyang

ctcyang commented Oct 6, 2018

This is a PR for adding Horovod support to do distributed training with the MXNet deep learning framework. The core Allreduce and Broadcast functionality passes unit tests. @yuxihu and @apeforest are currently training ResNet-50 end-to-end, so that should tell us if this setup converges or not.

Our performance results below show throughput (with scaling efficiency in parentheses), with and without hierarchical allreduce (HA), on ResNet-50 with float32:

| # GPUs | Without HA | With HA |
|-------:|-----------:|--------:|
| 8 | 3072 (NA) | 3078 (NA) |
| 16 | 6027 (98%) | 5859 (95%) |
| 32 | 12030 (98%) | 11675 (95%) |
| 64 | 22346 (83%) | 23166 (94%) |
| 128 | 40938 (84%) | 45972 (93%) |
| 256 | 64998 (66%) | 89858 (91%) |

We have completed and merged the PR that makes the necessary changes to MXNet for Horovod support. You can see the PR here: apache/incubator-mxnet#12666

Design document
Installation instructions
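
For reference, a rough usage sketch of what Horovod-based MXNet training could look like. This is illustrative only; the exact API surface in this PR may differ, and names like `DistributedOptimizer` and `broadcast_parameters` are assumptions based on Horovod's existing TensorFlow/PyTorch bindings.

```python
# Illustrative sketch only; exact function names in this PR may differ.
import mxnet as mx
import horovod.mxnet as hvd

hvd.init()                              # one process per GPU
context = mx.gpu(hvd.local_rank())      # pin each worker to its own GPU

net = mx.gluon.model_zoo.vision.resnet50_v1()
net.initialize(mx.init.Xavier(), ctx=context)

# Wrap the optimizer so gradients are allreduced across workers.
opt = mx.optimizer.create('sgd', learning_rate=0.1 * hvd.size())
opt = hvd.DistributedOptimizer(opt)

params = net.collect_params()
trainer = mx.gluon.Trainer(params, opt, kvstore=None)

# Broadcast initial parameters from rank 0 so all workers start identical
# (assumed helper, mirroring Horovod's other framework bindings).
hvd.broadcast_parameters(params, root_rank=0)
```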

@ctcyang

ctcyang commented Nov 13, 2018

@alsrgv The Amazon legal team has given us the okay, so feel free to start looking over the code.

@alsrgv

I saw a bunch of changes that seem unintended; can they be reverted first?

Resolved review thread on Dockerfile (outdated)

ctcyang added some commits Nov 14, 2018

Resolved review thread on horovod/mxnet/adapter.cc (outdated)

apeforest and others added some commits Nov 15, 2018

Use MXNET API to get lib path (#6)
* fix a bug and replace lib env path by API

* fix library path

* fix lib include

* fix a bug in lib path

* link mx libraries

* fix link error

* refactor in a more pythonic way

rename namespace from MX to mxnet (#5)
* rename namespace to mxnet and resolve naming clash

* remove duplicate typedef

@alsrgv

Thanks a lot for the PR. I did the first pass of review and left a few comments. Could you take a look?

```diff
@@ -2022,6 +2022,10 @@ int horovod_mpi_threads_supported() {
   }
   return horovod_global.mpi_threads_supported ? 1 : 0;
 }
+int horovod_barrier() {
+  return MPI_Barrier(horovod_global.mpi_comm);
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

This implementation of the barrier risks deadlocking with the background thread.

For instance, on rank 1, the order of operations could be:

  1. MPI_Gather (in the background thread)
  2. MPI_Barrier

On rank 2:

  1. MPI_Barrier
  2. MPI_Gather (in the background thread)

Can you describe why barriers are necessary in the first place?
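
For illustration, a minimal mpi4py sketch (not part of this PR) of the ordering mismatch described above: each rank blocks in a different collective, so neither call can complete.

```python
# Minimal mpi4py sketch (not from this PR) of mismatched collective ordering.
# Run with: mpirun -np 2 python deadlock_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.Get_rank() == 0:
    comm.gather(0, root=0)   # rank 0 enters the gather first...
    comm.Barrier()           # ...and only then the barrier
else:
    comm.Barrier()           # rank 1 enters the barrier first
    comm.gather(1, root=0)   # ranks wait on different collectives: deadlock
```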

@ctcyang

ctcyang Nov 29, 2018

I thought it might be necessary/handy to expose this MPI method, but we were able to get everything working without it and haven't seen any use cases, so I'll remove it.

```python
parser.add_argument('--num-epochs', type=int, default=90,
                    help='number of training epochs.')
parser.add_argument('--lr', type=float, default=0.1,
                    help='learning rate. default is 0.1.')
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

nit: use commas before "default"

@apeforest
```python
lr_scheduler = MultiFactorScheduler(step=steps, factor=lr_decay, base_lr=args.lr,
                                    warmup_steps=(args.warmup_epochs * epoch_size),
                                    warmup_begin_lr=args.warmup_lr)

# Two functions for reading data from record file or raw images
def get_data_rec(rec_train, rec_train_idx, rec_val, rec_val_idx, batch_size, data_nthreads):
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

Can we put a link to instructions that describe how to get those record files?

@apeforest
```python
              num_epoch=args.num_epochs,
              kvstore=None,
              batch_end_callback=mx.callback.Speedometer(batch_size, 20),
              #eval_end_callback=eval_end_callbacks,
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

Uncomment? Remove?

@apeforest
```python
optimizer_params = {'wd': args.wd,
                    'momentum': args.momentum,
                    'lr_scheduler': lr_scheduler,
                    'rescale_grad': 1.0/batch_size/num_workers,
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

What's the rationale for putting rescaling here instead of doing gradient averaging in the DistributedOptimizer?

@ctcyang

ctcyang Nov 29, 2018

The reason is that we have seen some performance degradation when doing gradient averaging in the DistributedOptimizer. The performance drop seems to happen because of a wait_to_read() on the NDArray here. This wait_to_read() is necessary because the elementwise operation that follows it (on Line 86) does not get captured in the task graph generated by the MXNet scheduler; without it, the allreduce and the elementwise division by size() enter a race condition, which generates wrong results half the time.

More details in this commit: 0517fa0
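
To make the trade-off concrete, here is a hedged sketch of the two averaging strategies discussed in this thread (the `batch_size` value is illustrative):

```python
# Hedged sketch of the two averaging strategies discussed above.
import horovod.mxnet as hvd

hvd.init()
batch_size = 128  # per-worker batch size, illustrative

# Approach taken in this PR: fold the 1/size() averaging into the optimizer's
# gradient rescaling, so no extra elementwise op runs after the allreduce.
optimizer_params = {
    'rescale_grad': 1.0 / batch_size / hvd.size(),
    # ... other optimizer settings (wd, momentum, lr_scheduler) ...
}

# Alternative (slower per the comment above): sum-allreduce each gradient and
# divide by hvd.size() afterwards; that division is not captured by MXNet's
# dependency engine, so it needs a wait_to_read() to avoid racing the
# allreduce, which costs throughput.
```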

```python
    check_call(MPI_MXNET_LIB_CTYPES.horovod_mxnet_broadcast_async(c_in, c_out, ctypes.c_int(root_rank), name))
    return tensor

# TODO(@ctcyang):
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

Please implement TODOs

setup.py Outdated
```diff
@@ -379,6 +437,7 @@ def get_common_options(build_ext):
     cpp_flags = get_cpp_flags(build_ext)
     link_flags = get_link_flags(build_ext)
     mpi_flags = get_mpi_flags()
+    mxnet_include_dirs = os.environ.get('INCLUDES')
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

I'm not sure this environment variable will be set often. Is there any other way?

@apeforest

apeforest Nov 29, 2018

We will update this soon, once the include path is added to the MXNet pip package.

@yuxihu

yuxihu Dec 1, 2018

We use an API to get the path now.
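
A hedged sketch of what that could look like, assuming a pip-installed mxnet that bundles its headers (`find_include_path()` may not exist on older versions, hence the fallback):

```python
# Hedged sketch: locate MXNet headers without relying on an INCLUDES env var.
import os
import mxnet as mx

def get_mxnet_include_dirs():
    try:
        # Newer mxnet pip packages expose the bundled header location.
        return [mx.libinfo.find_include_path()]
    except AttributeError:
        # Fallback: headers shipped alongside the installed python package.
        return [os.path.join(os.path.dirname(mx.__file__), 'include')]
```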

setup.py Outdated
```python
        build_mx_extension(self, options)
        built_plugins.append(True)
    except Exception as e:
        print(e)
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

Remove print?

@apeforest
setup.py Outdated
```python
        built_plugins.append(True)
    except Exception as e:
        print(e)
        if not os.environ.get('HOROVOD_WITHOUT_MXNET'):
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

Should be HOROVOD_WITH_MXNET
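
For context, a simplified sketch of the optional-plugin build pattern the surrounding setup.py code follows; the function and parameter names here are illustrative, and the env-var name follows the correction above:

```python
# Simplified, illustrative sketch of the optional-framework build pattern.
import os
import traceback

def try_build_plugin(build_fn, built_plugins, require_env='HOROVOD_WITH_MXNET'):
    """Attempt to build one framework extension; only hard-fail if the user
    explicitly required it via the given environment variable."""
    try:
        build_fn()                      # e.g. this PR's build_mx_extension(...)
        built_plugins.append(True)
    except Exception:
        if os.environ.get(require_env):
            raise                       # MXNet support was requested: surface the error
        traceback.print_exc()           # otherwise log and keep building other plugins
        built_plugins.append(False)
```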

@apeforest
```cpp
      GetOpNameHandle("allreduce", name, handle), device,
      [handle, average, output](const Status& status) {
        handle_manager.MarkDone(handle, status);
        handle_manager.ExecuteCallback(handle);
```

@alsrgv

alsrgv Nov 28, 2018

Collaborator

Why not just call cb() here? Do the callbacks need to be synchronized with each other, as you do in ExecuteCallback?

@ctcyang

ctcyang Nov 29, 2018

We were seeing segfaults when they were not synchronized with each other. The segfaults would happen non-deterministically many epochs into the training. @apeforest and @yuxihu tracked the problem to calling cb() directly.

The specific commit: b2ef69e
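
A language-agnostic illustration (in Python, with hypothetical names) of the serialization that ExecuteCallback provides; invoking each callback directly from the worker thread skips this mutual exclusion:

```python
# Illustrative only: serializing completion callbacks behind one lock,
# as opposed to invoking cb() directly from each engine thread.
import threading

class HandleManager:
    def __init__(self):
        self._lock = threading.Lock()
        self._status = {}

    def mark_done(self, handle, status):
        with self._lock:
            self._status[handle] = status

    def execute_callback(self, handle, callback):
        # Holding the lock while the callback runs mirrors the synchronized
        # ExecuteCallback path; per the comment above, the unsynchronized
        # direct cb() call led to intermittent segfaults.
        with self._lock:
            callback(self._status.get(handle))
```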

apeforest and others added some commits Nov 28, 2018

Make mxnet build successful in CPU (#8)
* Make mxnet build successful in CPU

* update required mxnet version

* remove outdated comment

* remove commented line

Fix test_mxnet for CPU instances (#9)
* Make mxnet build successful in CPU

* update required mxnet version

* remove outdated comment

* remove commented line

* fix test in CPU

* refactor

Fix nccl undefined symbol (#10)
* Make mxnet build successful in CPU

* update required mxnet version

* remove outdated comment

* remove commented line

* fix test in CPU

* refactor

* link nccl to mpi_lib for mxnet