
Build Python Package with distributed execution support powered by NCCL

Distributed execution is only supported on Linux. We tested it on Ubuntu 16.04 LTS.


In addition to the requirements for building NNabla without distributed execution, the build requires:

  • Multiple NVIDIA CUDA-capable GPUs
  • NCCL: NVIDIA's multi-GPU and multi-node collective communication library, optimized for their GPUs
  • Open MPI


To enable distributed training, the build procedure differs from the standard one only in the steps described here.

Download NCCL according to your environment and install it manually. On Ubuntu 16.04:

sudo dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt-get update
sudo apt-get install libnccl2 libnccl-dev

For developers: if you want to use an NCCL build that is not publicly distributed, specify the NCCL_HOME environment variable as follows.

export NCCL_HOME=${path}/build

Here, we assume the following directory structure:

  • ${path}/build/include
  • ${path}/build/lib
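A quick way to sanity-check this layout before running CMake is sketched below; the concrete path is a stand-in for ${path}/build, chosen here purely for illustration:

```shell
# Check that NCCL_HOME points at a build tree with the expected layout
# (include/ and lib/ subdirectories).
# Assumption: /tmp/nccl-demo/build is a demo path standing in for ${path}/build.
NCCL_HOME=/tmp/nccl-demo/build
mkdir -p "$NCCL_HOME/include" "$NCCL_HOME/lib"   # simulate the expected layout

for d in include lib; do
  if [ -d "$NCCL_HOME/$d" ]; then
    echo "found $d"
  else
    echo "missing $d"
  fi
done
```

If either directory is missing, CMake will not be able to locate the NCCL headers or library from NCCL_HOME.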


Distributed training also depends on MPI; install it as follows:

sudo apt-get install libopenmpi-dev

Build and Install

Follow the Build CUDA extension instructions, with one additional CMake option (-DWITH_NCCL=ON):

cmake -DNNABLA_DIR=../../nnabla -DCPPLIB_LIBRARY=../../nnabla/build/lib/ -DWITH_NCCL=ON ..

You can confirm that the NCCL and MPI include directories and libraries were found by inspecting the CMake output:


CUDA libs: /usr/local/cuda-8.0/lib64/;/usr/local/cuda-8.0/lib64/;/usr/local/cuda-8.0/lib64/;/usr/lib/x86_64-linux-gnu/;/usr/lib/openmpi/lib/;/usr/lib/openmpi/lib/;/usr/local/cuda/lib64/
CUDA includes: /usr/local/cuda-8.0/include;/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent;/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include;/usr/lib/openmpi/include;/usr/lib/openmpi/include/openmpi;/usr/local/cuda-8.0/include

Unit test

Follow the unit test section in Build CUDA extension. You should now see the communicator test pass:

communicator/ PASSED

You can now run Data Parallel Distributed Training across multiple GPUs and multiple nodes. See the CIFAR-10 example for usage.
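As a rough sketch, distributed training scripts are typically launched through Open MPI with one process per GPU; the script name below is a placeholder, not the actual CIFAR-10 example's entry point:

```shell
# Build the launch command: one MPI process per GPU on this node.
# Assumption: "train.py" is a placeholder for the real example script.
NPROC=4                                 # number of GPUs to use
CMD="mpirun -n $NPROC python train.py"
echo "$CMD"                             # prints: mpirun -n 4 python train.py
```

For multi-node runs, a hostfile would additionally be passed to mpirun in the usual way.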