karakusc and alsrgv: Parallelize hierarchical allreduce algorithm (#411)
* Change the hierarchical allreduce algorithm to an NCCL ReduceScatter - MPI Allreduce - NCCL Allgather pattern to parallelize inter-node reduction and improve hierarchical allreduce performance

* Remove NCCL_REDUCE and NCCL_BCAST definitions from the header file; update timeline.md accordingly

* Offset buffers for ReduceScatter and Allgather to allow in-place operations; pad buffers before hierarchical allreduce so the data size is a multiple of the number of local ranks

* Do hybrid hierarchical allreduce: first run NCCL ReduceScatter - MPI Allreduce - NCCL Allgather on the part of the data divisible by hvd.local_size(), then NCCL Reduce - MPI Allreduce - NCCL Bcast on the remainder (see the sketch after this list)

* Fix formatting and variable names

* Fix bug in offsetting pointer before operating on remainder data in hierarchical allreduce

* Add synchronization before operating on remainder data to make timeline work properly

* Make hierarchical allreduce steps pipelined: (NCCL ReduceScatter + NCCL Reduce) / single MPI Allreduce / (NCCL Allgather + NCCL Bcast)

* Clean up comments

* Add support for heterogeneous clusters; in the homogeneous case, allocate a fusion buffer whose size is divisible by local_size

* Round up the tensor fusion threshold, even when the environment variable is not set; do a synchronous memory copy to the host buffer to produce a correct timeline

* For hierarchical allreduce, make sure num_elements is divisible by 64 for improved performance (see the sizing sketch at the end)

* Define fusion buffer atomic unit (64) as a constant

* Free the local_sizes buffer after use during initialization
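
The bullets above outline the new data path; the following is a minimal sketch of how the hybrid pattern fits together, not the actual implementation. All names (`HybridHierarchicalAllreduce`, `nccl_comm`, `cross_comm`, `device_buffer`, `host_buffer`) are illustrative assumptions, and the cross-node MPI communicator is assumed to group ranks with the same local rank on each node.

```cpp
#include <cuda_runtime.h>
#include <mpi.h>
#include <nccl.h>

// Sketch of the hybrid hierarchical allreduce path (illustrative names, not
// the actual Horovod identifiers). Assumes float32 data, a fused device
// buffer, a pinned host staging buffer large enough for this rank's share,
// a per-node NCCL communicator over local_size ranks, and a cross-node MPI
// communicator grouping ranks with the same local rank on every node.
void HybridHierarchicalAllreduce(float* device_buffer, float* host_buffer,
                                 size_t num_elements, int local_rank,
                                 int local_size, ncclComm_t nccl_comm,
                                 MPI_Comm cross_comm, cudaStream_t stream) {
  // Split into a part that divides evenly across local ranks and a remainder.
  size_t elements_per_rank = num_elements / local_size;
  size_t divisible_elements = elements_per_rank * local_size;
  size_t remainder_elements = num_elements - divisible_elements;

  // Let the last local rank own the remainder; its ReduceScatter chunk ends
  // exactly where the remainder begins, so its host copy stays contiguous.
  int remainder_root = local_size - 1;
  bool owns_remainder = (local_rank == remainder_root);

  // In-place ReduceScatter: each rank receives its chunk at an offset into
  // the same fused buffer.
  float* my_chunk = device_buffer + local_rank * elements_per_rank;
  ncclReduceScatter(device_buffer, my_chunk, elements_per_rank, ncclFloat32,
                    ncclSum, nccl_comm, stream);

  // Remainder: reduce onto its owner on the same stream, so both NCCL
  // reductions are pipelined ahead of the single MPI allreduce.
  float* remainder = device_buffer + divisible_elements;
  if (remainder_elements > 0) {
    ncclReduce(remainder, remainder, remainder_elements, ncclFloat32, ncclSum,
               remainder_root, nccl_comm, stream);
  }

  // Copy this rank's share to the host (chunk, plus remainder on its owner)
  // and run one MPI allreduce across nodes.
  size_t host_elements =
      elements_per_rank + (owns_remainder ? remainder_elements : 0);
  cudaMemcpyAsync(host_buffer, my_chunk, host_elements * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);
  MPI_Allreduce(MPI_IN_PLACE, host_buffer, static_cast<int>(host_elements),
                MPI_FLOAT, MPI_SUM, cross_comm);
  cudaMemcpyAsync(my_chunk, host_buffer, host_elements * sizeof(float),
                  cudaMemcpyHostToDevice, stream);

  // Gather the reduced chunks back to every local rank, then broadcast the
  // remainder from its owner; both calls are again pipelined on one stream.
  ncclAllGather(my_chunk, device_buffer, elements_per_rank, ncclFloat32,
                nccl_comm, stream);
  if (remainder_elements > 0) {
    ncclBcast(remainder, remainder_elements, ncclFloat32, remainder_root,
              nccl_comm, stream);
  }
  cudaStreamSynchronize(stream);
}
```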
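The sizing-related bullets (fusion buffer divisible by local_size, the 64-element atomic unit, the rounded-up fusion threshold) boil down to a small amount of arithmetic. Below is a sketch under the same caveats; the constant and function names are illustrative, not the source's actual symbols.

```cpp
#include <cstddef>

// Illustrative constant mirroring the "fusion buffer atomic unit (64)"
// mentioned above; the real symbol name may differ.
constexpr std::size_t kFusionBufferAtomicUnit = 64;

// Round a size up to the next multiple of `unit`.
constexpr std::size_t RoundUp(std::size_t value, std::size_t unit) {
  return (value + unit - 1) / unit * unit;
}

// Fusion threshold: round the requested size up so the fused buffer divides
// evenly across local ranks in whole atomic units (homogeneous case).
std::size_t AdjustedFusionThreshold(std::size_t requested_bytes,
                                    int local_size) {
  return RoundUp(requested_bytes, kFusionBufferAtomicUnit * local_size);
}

// Split a fused tensor: the ReduceScatter part is the largest prefix whose
// per-rank share is a whole number of atomic units; the rest is the
// remainder handled by Reduce/Bcast.
void SplitForHierarchicalAllreduce(std::size_t num_elements, int local_size,
                                   std::size_t* divisible_elements,
                                   std::size_t* remainder_elements) {
  std::size_t per_rank = num_elements / local_size / kFusionBufferAtomicUnit *
                         kFusionBufferAtomicUnit;
  *divisible_elements = per_rank * local_size;
  *remainder_elements = num_elements - *divisible_elements;
}
```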