Build Python Package with distributed execution support powered by NCCL
Distributed execution is only supported on Linux. We tested it on Ubuntu 16.04 LTS.
In addition to the requirements for building NNabla without distributed execution, the build requires:
- Multiple NVIDIA CUDA-capable GPUs
- NCCL: NVIDIA's multi-GPU and multi-node collective communication library, optimized for their GPUs
- Open MPI
When building for distributed training, the only difference from the standard build is the procedure described here.
Download NCCL according to your environment, then install it manually. On Ubuntu 16.04:

sudo dpkg -i nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt-get update
sudo apt-get install libnccl2 libnccl-dev
For developers: if you want to use an NCCL build that is not publicly distributed, point the NCCL_HOME environment variable at it; the build expects NCCL's include and lib directories under that path.
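A minimal sketch of that setting, assuming a hypothetical private NCCL build under `${HOME}/nccl/build` with the usual include/lib layout:

```shell
# Hypothetical location; replace with wherever your private NCCL build lives.
export NCCL_HOME=${HOME}/nccl/build

# The build is assumed to look for headers and libraries under:
#   ${NCCL_HOME}/include  (nccl.h)
#   ${NCCL_HOME}/lib      (libnccl.so)
```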
Distributed training also depends on MPI, so install it as follows:
sudo apt-get install libopenmpi-dev
Build and Install
Follow Build CUDA extension, with one small change to the CMake options:

cmake -DNNABLA_DIR=../../nnabla -DCPPLIB_LIBRARY=../../nnabla/build/lib/libnnabla.so -DWITH_NCCL=ON ..
You can confirm in the CMake output that the NCCL and MPI includes and libraries were found:
...
CUDA libs: /usr/local/cuda-8.0/lib64/libcudart.so;/usr/local/cuda-8.0/lib64/libcublas.so;/usr/local/cuda-8.0/lib64/libcurand.so;/usr/lib/x86_64-linux-gnu/libnccl.so;/usr/lib/openmpi/lib/libmpi_cxx.so;/usr/lib/openmpi/lib/libmpi.so;/usr/local/cuda/lib64/libcudnn.so
CUDA includes: /usr/local/cuda-8.0/include;/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent;/usr/lib/openmpi/include/openmpi/opal/mca/event/libevent2021/libevent/include;/usr/lib/openmpi/include;/usr/lib/openmpi/include/openmpi;/usr/local/cuda-8.0/include
...
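If you prefer to check the configure log programmatically rather than by eye, here is a small sketch; `cmake_log` is a stand-in for output you capture from the cmake run, and the required library names are taken from the listing above:

```python
# Check a captured CMake configure log for the NCCL and Open MPI libraries.
# "cmake_log" is a stand-in for real output saved from the cmake run.
cmake_log = """
CUDA libs: /usr/lib/x86_64-linux-gnu/libnccl.so;/usr/lib/openmpi/lib/libmpi.so
CUDA includes: /usr/local/cuda-8.0/include;/usr/lib/openmpi/include
"""

def dependencies_found(log, names=("libnccl", "libmpi")):
    """Return True if every required library name appears in the log."""
    return all(name in log for name in names)

print(dependencies_found(cmake_log))  # → True
```

If this prints False, re-check the NCCL and Open MPI installation steps above before building.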
Follow the unit test section in Build CUDA extension. You should see the communicator test pass:
...
communicator/test_data_parallel_communicator.py::test_data_parallel_communicator PASSED
...
You can now run Data Parallel Distributed Training on multiple GPUs and multiple nodes; see the CIFAR-10 example for usage.
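Conceptually, data-parallel training has each GPU compute gradients on its own mini-batch shard and then average them with an all-reduce before the parameter update. A toy pure-Python sketch of that averaging step (not NNabla's API; worker gradients are plain lists here):

```python
def allreduce_mean(grads_per_worker):
    """Average gradients elementwise across workers, as NCCL's all-reduce
    (followed by division by the worker count) effectively does."""
    n = len(grads_per_worker)
    return [sum(g) / n for g in zip(*grads_per_worker)]

# Two workers, each holding gradients for two parameters.
print(allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # → [2.0, 3.0]
```

In practice each worker runs as its own process and the averaging happens over NCCL/MPI; a multi-process training script is typically launched with `mpirun -n <num_gpus>`, as shown in the CIFAR-10 example.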