Theano-MPI

Theano-MPI is a Python framework for distributed training of deep learning models built in Theano. It implements data parallelism in several ways, e.g., Bulk Synchronous Parallel, Elastic Averaging SGD and Gossip SGD. This project is an extension of theano_alexnet, aiming to scale the training framework up to more than 8 GPUs and across nodes. Please take a look at this technical report for an overview of the implementation details. To cite our work, please use the following BibTeX entry.

@article{ma2016theano,
  title = {Theano-MPI: a Theano-based Distributed Training Framework},
  author = {Ma, He and Mao, Fei and Taylor, Graham~W.},
  journal = {arXiv preprint arXiv:1605.08325},
  year = {2016}
}

Theano-MPI is compatible with training models built in other framework libraries, e.g., Lasagne, Keras and Blocks, as long as their model parameters can be exposed as Theano shared variables. Theano-MPI also comes with a lightweight layer library for building customized models. See the wiki for a quick guide on building customized neural networks with it. Check out the examples of building a Lasagne VGGNet, Wasserstein GAN, LS-GAN and Keras Wide-ResNet.
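
For example, with Lasagne the trainable parameters of a network are already Theano shared variables and can be collected with lasagne.layers.get_all_params. The snippet below is a minimal sketch; the toy network itself is made up for illustration and is not part of Theano-MPI:

# Collecting Lasagne parameters as Theano shared variables (illustrative only)
import theano.tensor as T
import lasagne

x = T.tensor4('x')
network = lasagne.layers.InputLayer((None, 3, 32, 32), input_var=x)
network = lasagne.layers.DenseLayer(network, num_units=10)

# a list of Theano shared variables, ready to be used as self.params (see Usage below)
params = lasagne.layers.get_all_params(network, trainable=True)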

Dependencies

Theano-MPI depends on the following libraries and packages. We provide some guidance on installing them in the wiki.

Installation

Once all dependencies are ready, clone Theano-MPI and install it as follows.

 $ python setup.py install [--user]

Usage

To accelerate the training of a Theano model in a distributed way, Theano-MPI needs to identify two components:

  • the iterative update function of the Theano model
  • the parameter sharing rule between instances of the Theano model
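
For reference, the first component is an ordinary compiled Theano update function. The snippet below is a minimal, self-contained sketch of one (a softmax classifier trained with plain SGD); it is purely illustrative and not taken from the Theano-MPI code base:

# A minimal "iterative update function": one SGD step compiled with Theano (illustrative only)
import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(np.zeros((784, 10), dtype=theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

p_y = T.nnet.softmax(T.dot(x, W) + b)
loss = -T.mean(T.log(p_y)[T.arange(y.shape[0]), y])

params = [W, b]
grads = T.grad(loss, params)
updates = [(p, p - 0.1 * g) for p, g in zip(params, grads)]

# calling this function once performs one iterative update of the model
train_iter_fn = theano.function([x, y], loss, updates=updates)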

It is recommended to organize your model and data definition in the following way.

  • launch_session.py or launch_session.cfg
  • models/*.py
    • __init__.py
    • modelfile.py : defines your customized ModelClass
    • data/*.py
      • dataname.py : defines your customized DataClass

Your ModelClass in modelfile.py should at least have the following attributes and methods:

  • self.params : a list of Theano shared variables, i.e. trainable model parameters
  • self.data : an instance of your customized DataClass defined in dataname.py
  • self.compile_iter_fns : a method, your way of compiling train_iter_fn and val_iter_fn
  • self.train_iter : a method, your way of using your train_iter_fn
  • self.val_iter : a method, your way of using your val_iter_fn
  • self.adjust_hyperp : a method, your way of adjusting hyperparameters, e.g., learning rate.
  • self.cleanup : a method, necessary model and data clean-up steps.

Your DataClass in dataname.py should at least have the following attributes:

  • self.n_batch_train : an integer, the number of training batches to go through in an epoch
  • self.n_batch_val : an integer, the number of validation batches to go through during validation
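
To make the layout above concrete, here is a minimal skeleton of the two files. Only the attribute and method names come from the lists above; the constructor arguments, method signatures and bodies are placeholders, so check the Theano-MPI examples for the exact arguments the framework passes:

# models/data/dataname.py (sketch)
class DataClass(object):
    def __init__(self):
        # ... load or index your dataset here ...
        self.n_batch_train = 0   # number of training batches per epoch
        self.n_batch_val = 0     # number of validation batches per validation pass

# models/modelfile.py (sketch)
# from models.data.dataname import DataClass  # import your customized DataClass

class ModelClass(object):
    def __init__(self, config):
        self.data = DataClass()   # an instance of your customized DataClass
        self.params = []          # list of Theano shared variables (trainable parameters)
        # ... build the computation graph and fill self.params here ...

    def compile_iter_fns(self, *args, **kwargs):
        # compile self.train_iter_fn and self.val_iter_fn with theano.function
        self.train_iter_fn = None
        self.val_iter_fn = None

    def train_iter(self, *args, **kwargs):
        # run self.train_iter_fn on the next training batch
        pass

    def val_iter(self, *args, **kwargs):
        # run self.val_iter_fn on the next validation batch
        pass

    def adjust_hyperp(self, epoch):
        # e.g., decay the learning rate
        pass

    def cleanup(self):
        # close data loaders, free resources, etc.
        pass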

After your model definition is complete, you can choose the desired way of sharing parameters among model instances:

  • BSP (Bulk Synchronous Parallel)
  • EASGD (Elastic Averaging SGD)
  • GOSGD (Gossip SGD)

Below is an example launch config file for training a customized ModelClass on two GPUs.

# launch_session.cfg
RULE=BSP
MODELFILE=models.modelfile
MODELCLASS=ModelClass
DEVICES=cuda0,cuda1

Then you can launch the training session by calling the following command:

 $ tmlauncher -cfg=launch_session.cfg

Alternatively, you can launch sessions within python as shown below:

# launch_session.py
from theanompi import BSP

rule=BSP()
# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1'],
          modelfile='models.modelfile',
          modelclass='ModelClass')
rule.wait()

More examples can be found here.

Example Performance

BSP tested on up to eight Tesla K80 GPUs

Training (+communication) time per 5120 images in seconds: [allow_gc = True, using nccl32 on copper]

Model           1 GPU     2 GPUs          4 GPUs         8 GPUs
AlexNet-128b    20.50     10.35+0.78      5.13+0.54      2.63+0.61
GoogLeNet-32b   63.89     31.40+1.00      15.51+0.71     7.69+0.80
VGG16-16b       358.29    176.08+13.90    90.44+9.28     55.12+12.59
VGG16-32b       343.37    169.12+7.14     86.97+4.80     43.29+5.41
ResNet50-64b    163.15    80.09+0.81      40.25+0.56     20.12+0.57

More details on the benchmark can be found in this notebook.

Note

  • Test your single-GPU model with theanompi/models/test_model.py before trying a data-parallel rule.

  • You may want to use the helper functions in /theanompi/lib/opt.py to construct optimizers in order to avoid the common pitfalls mentioned in (#22) and get better convergence.

  • Binding cores according to your NUMA topology may give better performance. Try the -bind option of the launcher (requires the hwloc dependency).

  • Using the launcher script is the preferred way to start training. Starting training from Python currently causes core-binding problems, especially on NUMA systems.

  • Shuffling training examples before asynchronous training makes the loss curve a lot smoother as the model converges.

  • Some known bugs and possible enhancements are listed in Issues. We welcome all kinds of participation (bug reports, discussion, pull requests, etc.) in improving the framework.

License

© Contributors, 2016-2017. Licensed under the ECL-2.0 license.