MPI Parallel framework for training deep learning models built in Theano


Theano-MPI is a Python framework for distributed training of deep learning models built in Theano. It implements data-parallelism in several ways, e.g., Bulk Synchronous Parallel, Elastic Averaging SGD and Gossip SGD. This project is an extension of theano_alexnet, aiming to scale the training framework up to more than 8 GPUs and across nodes. Please take a look at this technical report for an overview of implementation details. To cite our work, please use the following BibTeX entry.

  @article{ma2016theano,
    title = {Theano-MPI: a Theano-based Distributed Training Framework},
    author = {Ma, He and Mao, Fei and Taylor, Graham~W.},
    journal = {arXiv preprint arXiv:1605.08325},
    year = {2016}
  }

Theano-MPI is compatible with models built in different framework libraries, e.g., Lasagne, Keras and Blocks, as long as their model parameters can be exposed as Theano shared variables. Theano-MPI also comes with a lightweight layer library for building customized models. See the wiki for a quick guide on building customized neural networks based on them. Check out the examples of building Lasagne VGGNet, Wasserstein GAN, LS-GAN and Keras Wide-ResNet.


Theano-MPI depends on the following libraries and packages. We provide some guidance on installing them in the wiki.


Once all dependencies are ready, one can clone Theano-MPI and install it as follows:

 $ python setup.py install [--user]


To accelerate the training of Theano models in a distributed way, Theano-MPI tries to identify two components:

  • the iterative update function of the Theano model
  • the parameter sharing rule between instances of the Theano model

It is recommended to organize your model and data definition in the following way:

  • launch_session.py or launch_session.cfg
  • models/*.py : defines your customized ModelClass
  • models/data/*.py : defines your customized DataClass
Your ModelClass in models/*.py should at least have the following attributes and methods:

  • self.params : a list of Theano shared variables, i.e. trainable model parameters
  • self.data : an instance of your customized DataClass defined in models/data/*.py
  • self.compile_iter_fns : a method, your way of compiling train_iter_fn and val_iter_fn
  • self.train_iter : a method, your way of using your train_iter_fn
  • self.val_iter : a method, your way of using your val_iter_fn
  • self.adjust_hyperp : a method, your way of adjusting hyperparameters, e.g., learning rate.
  • self.cleanup : a method, necessary model and data clean-up steps.

Your DataClass in models/data/*.py should at least have the following attributes:

  • self.n_batch_train : an integer, the number of training batches to go through in an epoch
  • self.n_batch_val : an integer, the number of validation batches to go through during validation
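Putting the two interfaces together, a minimal skeleton might look like the following. This is only a sketch: the bodies are placeholders, and anything beyond the attribute and method names listed above (file name, constructor arguments, method signatures) is an assumption, not Theano-MPI's actual API.

```python
# models/my_model.py -- hypothetical file name; only the attribute and
# method names are taken from the lists above, the bodies are placeholders.

class MyData(object):
    def __init__(self):
        self.n_batch_train = 100   # training batches per epoch
        self.n_batch_val = 10      # batches per validation pass

class MyModel(object):
    def __init__(self, config=None):
        self.params = []           # Theano shared variables in a real model
        self.data = MyData()       # your customized DataClass instance

    def compile_iter_fns(self):
        # in a real model: compile train_iter_fn and val_iter_fn
        # with theano.function
        self.train_iter_fn = lambda: 0.0
        self.val_iter_fn = lambda: 0.0

    def train_iter(self):
        return self.train_iter_fn()

    def val_iter(self):
        return self.val_iter_fn()

    def adjust_hyperp(self, epoch):
        pass                       # e.g. decay the learning rate

    def cleanup(self):
        pass                       # close data sources, free buffers
```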

After your model definition is complete, you can choose the desired way of sharing parameters among model instances:

  • BSP (Bulk Synchronous Parallel)
  • EASGD (Elastic Averaging SGD)
  • GOSGD (Gossip SGD)
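To illustrate the difference between these rules, the elastic-averaging idea behind EASGD can be sketched in plain NumPy. This is a schematic of the update rule only, not Theano-MPI's actual implementation; the function name and the alpha coefficient are illustrative assumptions.

```python
import numpy as np

def easgd_round(workers, center, alpha=0.5):
    """One communication round of elastic averaging (schematic).

    Each worker's parameters are pulled toward a shared center
    variable, and the center is pulled toward each worker in turn.
    """
    updated = []
    for w in workers:
        diff = w - center
        updated.append(w - alpha * diff)   # worker moves toward center
        center = center + alpha * diff     # center moves toward worker
    return updated, center
```

In BSP, by contrast, workers exchange updates every iteration and stay exactly synchronized, while GOSGD exchanges parameters between randomly chosen pairs of workers.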

Below is an example launch config file for training a customized ModelClass on two GPUs.

# launch_session.cfg

Then you can launch the training session by calling the following command:

 $ tmlauncher -cfg=launch_session.cfg

Alternatively, you can launch sessions within python as shown below:

from theanompi import BSP

rule = BSP()

# modelfile: the relative path to the model file
# modelclass: the class name of the model to be imported from that file
rule.init(devices=['cuda0', 'cuda1'],
          modelfile='models.modelfile',
          modelclass='ModelClass')

More examples can be found here.

Example Performance

BSP tested on up to eight Tesla K80 GPUs

Training (+communication) time per 5120 images in seconds [allow_gc = True, using nccl32 on copper]:

  Model          1 GPU    2 GPUs         4 GPUs        8 GPUs
  AlexNet-128b   20.50    10.35+0.78     5.13+0.54     2.63+0.61
  GoogLeNet-32b  63.89    31.40+1.00     15.51+0.71    7.69+0.80
  VGG16-16b      358.29   176.08+13.90   90.44+9.28    55.12+12.59
  VGG16-32b      343.37   169.12+7.14    86.97+4.80    43.29+5.41
  ResNet50-64b   163.15   80.09+0.81     40.25+0.56    20.12+0.57
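Assuming the four columns correspond to 1, 2, 4 and 8 GPUs, the scaling behaviour can be read off directly; for example, for the AlexNet-128b row:

```python
# Speedup of AlexNet-128b: single-GPU time divided by multi-GPU
# training+communication time, both per 5120 images.
single_gpu = 20.50
multi_gpu = {2: 10.35 + 0.78, 4: 5.13 + 0.54, 8: 2.63 + 0.61}
speedup = {n: single_gpu / t for n, t in multi_gpu.items()}
# roughly 1.8x on 2 GPUs, 3.6x on 4 GPUs and 6.3x on 8 GPUs
```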

More details on the benchmark can be found in this notebook.


  • Test your single GPU model with theanompi/models/ before trying a data-parallel rule.

  • You may want to use the helper functions in /theanompi/lib/ to construct optimizers, in order to avoid the common pitfalls mentioned in (#22) and get better convergence.

  • Binding cores according to your NUMA topology may give better performance. Try the -bind option with the launcher (needs the hwloc dependency).

  • Using the launcher script is the preferred way to start training. Starting training from python currently causes core-binding problems, especially on a NUMA system.

  • Shuffling training examples before asynchronous training makes the loss surface a lot smoother as the model converges.

  • Some known bugs and possible enhancements are listed in Issues. We welcome all kinds of participation (bug reports, discussion, pull requests, etc.) in improving the framework.


© Contributors, 2016-2017. Licensed under an ECL-2.0 license.