This repository contains the source code implementation for the paper "Group-based Interleaved Pipeline Parallelism for Large-scale DNN Training"
WPipe's runtime implements model parallelism, input pipelining, and communication in PyTorch. It can be fused with data parallelism to give hybrid model and data parallelism with input pipelining.
Image classification task entry point, as well as model splits
NLP task entry point, as well as model splits
Experiment configurations
Scripts for running experiments
Some helper scripts
To run WPipe, you will need an NVIDIA GPU with CUDA 10.1, GPU driver version 418.67, nvidia-docker2, and Python 3, on a Linux server running Ubuntu 16.04.
All dependencies are in the pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime
container, which can be downloaded using:
docker pull pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime
The PyTorch Docker container can then be run using:
nvidia-docker run -it -v /mnt:/mnt --ipc=host --net=host pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime /bin/bash
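The two docker commands above can be wrapped in a small launcher that fails early when nvidia-docker is missing. This is a hypothetical convenience sketch, not part of the repository; the image tag is the one given above.

```shell
#!/bin/sh
# Image tag from the README above.
IMAGE="pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime"

launch_container() {
    # Fail early with a clear message if nvidia-docker2 is not installed.
    if ! command -v nvidia-docker >/dev/null 2>&1; then
        echo "nvidia-docker not found; install nvidia-docker2 first" >&2
        return 1
    fi
    # --ipc=host is needed for PyTorch DataLoader shared memory;
    # --net=host lets multi-process training use host networking.
    nvidia-docker run -it -v /mnt:/mnt --ipc=host --net=host "$IMAGE" /bin/bash
}
```

Call `launch_container` from an interactive shell to drop into the container.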
Before running the WPipe program, run:
cd tool && sh init.sh
We run fine-tuning experiments using the CIFAR-10, CIFAR-100, and Oxford Flowers-102 datasets, and throughput experiments using the Oxford Flowers-102 dataset.
CIFAR-10 and CIFAR-100 can be downloaded from this website. Oxford Flowers-102 can be downloaded from this website.
We run fine-tuning and throughput experiments using a subset of the GLUE benchmark (QQP and MNLI). To download the GLUE dataset, use this script.
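A sketch of fetching only the two GLUE tasks used in the experiments. The `download_glue_data.py` helper name and its flags are an assumption (a commonly used community script); the README's own download script may differ, so the actual call is left commented out.

```shell
#!/bin/sh
# Where to place the data; override with DATA_DIR=... if desired.
DATA_DIR="${DATA_DIR:-glue_data}"
# Only the two tasks the experiments use.
TASKS="QQP,MNLI"

mkdir -p "$DATA_DIR"
echo "downloading GLUE tasks: $TASKS into $DATA_DIR"
# Hypothetical helper invocation; substitute the script linked in the README:
# python download_glue_data.py --data_dir "$DATA_DIR" --tasks "$TASKS"
```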
All experiments can be carried out using scripts in the experiments directory. You can perform an experiment as follows:
sh experiments/cv_throughput_single_node.sh
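To sweep every experiment in one go, the scripts in the experiments directory can be run in a loop with per-script logs. This is a convenience sketch; only cv_throughput_single_node.sh is named in the README, and the log layout is an assumption.

```shell
#!/bin/sh
# Collect one log file per experiment script.
LOG_DIR="logs"
mkdir -p "$LOG_DIR"

for script in experiments/*.sh; do
    name=$(basename "$script" .sh)
    echo "running $name"
    # Uncomment to actually execute each experiment:
    # sh "$script" >"$LOG_DIR/$name.log" 2>&1
done
```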