Source code for "Accelerating Large-Scale Distributed Neural Network Training with SPMD Parallelism"

We recommend using our Docker image to reproduce the experimental results in the paper.

=== Experiment VI.B ===

Experiment Summary: We evaluate HiDup, DeepSpeed and FastMoE in a cluster with up to 64 GPUs.

Software:
    - HiDup (provided in artifacts)
    - DeepSpeed 0.5.10 with patches
    - FastMoE 0.3.0
    - PyTorch 1.10.0
    - Python 3.7.11
    - CUDA 11.4.1
    - NVIDIA GPU driver 470.82.01
    - CUDNN 8.2.4

We patch DeepSpeed by editing line 282 of "moe/" and replacing "min_capacity" with "1". The variable is unused, but without the patch it raises an undefined-variable error. The Docker image provided in the artifacts is already patched.

Hardware: We use 8 ecs.gn6vc8g1.16xlarge instances on Alibaba Cloud, each with:
    - CPU: 64vCPU Intel Xeon(Skylake) Platinum 8163
    - RAM: 256GiB
    - GPU: 8 NVIDIA V100 GPUs
    - Bandwidth: 9.71Gbps (measured using iperf3)

Workflow: The commands are executed in the "exp" directory of HiDup.
    1. Edit "master_addr" and "master_port" in "" according to the cluster configuration.
    2. Run "python" on any machine to get the estimated device flops. Set "profiler_data.device_flops" in "" to this value.
    3. Set "world_size" in "" to the total number of GPUs.
    4. Run "NODERANK=a CPN=b python" on each machine to measure the bandwidth of different collective operations. "a" is the rank of the machine (starts from 0) and "b" is the number of GPUs to use on each machine. Edit each field of "profiler_data" in "" accordingly.
    5. Set "model_name" in "" to any of the following: "Rmoe", "Rswitch", "Vmoe" and "Vswitch". They correspond to "BERT-SGMoE", "BERT-Switch", "ViT-SGMoE" and "ViT-Switch" models, respectively.
    6. Run "python" on any machine. It will print the searched strategy and save it to a file with the name "strategy_[model name]".
    7. Copy the edited "" and the generated strategy file to all machines.
    8. Run "NODERANK=a CPN=b python" on each machine to get the average per-iteration training time for HiDup.
    9. Run "NODERANK=a CPN=b python fastmoe/" on each machine to get the average per-iteration training time for FastMoE.
    10. Run "NODERANK=a CPN=b bash deepspeed/" on each machine to get the average per-iteration training time for DeepSpeed.
    11. Repeat steps 3 to 10 and vary the model and number of GPUs to collect data for Figure 7.
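
The NODERANK and CPN variables used in steps 4 and 8 to 10 identify each machine and the number of GPUs it contributes. The following is a minimal sketch of how such values are commonly mapped to per-process global ranks and to "world_size". The contiguous rank layout and the function names are illustrative assumptions, not taken from the HiDup scripts.

```python
import os
from typing import List


def global_ranks(node_rank: int, cards_per_node: int) -> List[int]:
    """Global ranks of the processes on one machine.

    Assumes ranks are assigned contiguously, machine by machine
    (an assumption; the HiDup launcher may use a different layout).
    """
    first = node_rank * cards_per_node
    return list(range(first, first + cards_per_node))


def expected_world_size(num_nodes: int, cards_per_node: int) -> int:
    """Total number of GPUs, i.e. the value to put in "world_size"."""
    return num_nodes * cards_per_node


if __name__ == "__main__":
    # Read the same environment variables that the commands above set.
    node_rank = int(os.environ.get("NODERANK", "0"))
    cpn = int(os.environ.get("CPN", "1"))
    print("ranks on this machine:", global_ranks(node_rank, cpn))
```

For the full cluster in this experiment (8 machines with 8 GPUs each), "expected_world_size(8, 8)" gives 64, which matches the 64 GPUs in the experiment summary.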

=== Experiment VI.C ===

Experiment Summary: We evaluate HiDup, DeepSpeed, FastMoE, Horovod and PyTorch DDP on a single machine with 8 GPUs.

Software:
    - Horovod 0.24.2
    - Others are the same as in Experiment VI.B

Hardware: Same as in Experiment VI.B.

Workflow:
    1. Follow steps 1 to 10 of Experiment VI.B.
    2. Run "NODERANK=0 CPN=b python" on each machine to get the per-iteration training time for PyTorch DDP. "b" is the number of GPUs to use on each machine.
    3. Run "NODERANK=0 CPN=b python" on each machine to get the per-iteration training time for Horovod.
    4. Repeat steps 1 to 3 and vary the model and number of GPUs to collect data for Figure 8.

=== Experiment VI.D ===

Experiment Summary: We evaluate HiDup, DeepSpeed and FastMoE on 2 machines under different bandwidth levels.

Software:
    - iproute2 4.15.0-2ubuntu1.3
    - Others are the same as in Experiment VI.B

Hardware:
    - CPU: Intel(R) Xeon(R) Gold 6230
    - RAM: 251G
    - GPU: 4 NVIDIA V100 GPUs
    - Bandwidth: 37.5Gbps (measured using iperf3)

Workflow:
    1. Run "tc qdisc add dev eth0 root tbf rate 30gibit latency 60ms burst 60m" on each machine to limit the available bandwidth. Replace "eth0" with the actual network interface; "30gibit" is the bandwidth limit and is varied across runs.
    2. Follow steps 1 to 11 of Experiment VI.B to collect the results under the bandwidth level specified in step 1.
    3. Run "tc qdisc del dev eth0 root" on each machine to reset the bandwidth limit.
    4. Repeat steps 1 to 3 and change the bandwidth limit to collect data for Figure 9.
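
The limit/measure/reset cycle above can be scripted. The sketch below only builds the "tc" command strings from steps 1 and 3 for a list of bandwidth levels; it does not run them. The helper names and the example rates other than "30gibit" are illustrative assumptions, not values from the paper.

```python
from typing import Iterable, List, Tuple


def tc_limit_command(interface: str, rate: str) -> str:
    """Command from step 1, with the interface and rate substituted."""
    return (
        f"tc qdisc add dev {interface} root tbf "
        f"rate {rate} latency 60ms burst 60m"
    )


def tc_reset_command(interface: str) -> str:
    """Command from step 3, which removes the bandwidth limit."""
    return f"tc qdisc del dev {interface} root"


def sweep_commands(interface: str, rates: Iterable[str]) -> List[Tuple[str, str, str]]:
    """One (rate, limit command, reset command) triple per bandwidth level."""
    return [
        (rate, tc_limit_command(interface, rate), tc_reset_command(interface))
        for rate in rates
    ]


if __name__ == "__main__":
    # Example rates are placeholders; choose the levels needed for the figure.
    for rate, limit_cmd, reset_cmd in sweep_commands("eth0", ["30gibit", "20gibit", "10gibit"]):
        print(f"# bandwidth level {rate}")
        print(limit_cmd)
        print("# ... run steps 1 to 11 of Experiment VI.B here ...")
        print(reset_cmd)
```

Printing the commands instead of executing them keeps the sketch safe to run without root privileges; the printed lines can then be run on each machine.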

=== Experiment VI.E ===

Experiment Summary: We evaluate HiDup, DeepSpeed and FastMoE on 2 machines with different batch sizes.

Software: Same as in Experiment VI.D

Hardware:
    - 100Gbps RDMA network (Dell Z9100-ON)
    - Others are the same as in Experiment VI.D

Workflow:
    1. Change "batch_size" in "".
    2. Follow steps 1 to 10 of Experiment VI.B to collect the results with the batch size specified in step 1.
    3. Repeat steps 1 and 2 with different batch sizes to collect data for Figure 10.

=== Experiment VI.F ===

Experiment Summary: We evaluate HiDup, DeepSpeed and FastMoE on 2 machines with mixed precision training.

Software: Same as in Experiment VI.E

Hardware: Same as in Experiment VI.E

Workflow:
    1. Edit "" and set "fp16" to "True".
    2. Follow steps 1 to 10 of Experiment VI.B to collect the results with mixed precision training enabled.

=== Experiment VI.G ===

Experiment Summary: We analyze the per-iteration time of HiDup, DeepSpeed and FastMoE on 2 machines.

Software: Same as in Experiment VI.E

Hardware: Same as in Experiment VI.E

Workflow:
    1. Edit "" and set "trace" to "True".
    2. Follow steps 1 to 10 of Experiment VI.B. After running a training task, a file named "trace.json" will be generated in the folder. It uses the Chrome trace event format.
    3. Analyze the execution trace to make Table III. We treat NCCL operations and MemcpyDtoD as communication and other operations as computation.
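
The classification rule in step 3 can be applied to "trace.json" with a short script. The sketch below assumes the file follows the Chrome trace event format (either a bare list of events or an object with a "traceEvents" list) and that kernel names contain "nccl" or "Memcpy DtoD"; the exact names in a real trace may differ. It sums the durations of complete events ("ph" equal to "X") without accounting for overlap, so it is a rough approximation rather than the exact procedure used for Table III.

```python
import json
from typing import Dict, Iterable, List

# Substrings (lowercase, spaces removed) that mark an event as communication.
# The exact kernel names are an assumption; adjust them to the actual trace.
COMM_MARKERS = ("nccl", "memcpydtod")


def is_communication(name: str) -> bool:
    """True for NCCL operations and device-to-device memory copies."""
    normalized = name.lower().replace(" ", "")
    return any(marker in normalized for marker in COMM_MARKERS)


def load_events(path: str) -> List[dict]:
    """Load events from a Chrome trace file (list form or object form)."""
    with open(path) as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_events(events: Iterable[dict]) -> Dict[str, float]:
    """Sum the durations (microseconds) of complete events per category."""
    totals = {"communication": 0.0, "computation": 0.0}
    for event in events:
        # Only complete events ("ph" == "X") carry a duration.
        if event.get("ph") != "X" or "dur" not in event:
            continue
        key = "communication" if is_communication(event.get("name", "")) else "computation"
        totals[key] += event["dur"]
    return totals
```

A typical use is "summarize_events(load_events("trace.json"))". If the trace mixes CPU-side and GPU-side events, filter the events to GPU kernels first so the same work is not counted twice.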

=== Experiment VI.H ===

Experiment Summary: We evaluate HiDup, DeepSpeed and FastMoE on 2 machines by training BERT-SGMoE until convergence.

Software: Same as in Experiment VI.E

Hardware: Same as in Experiment VI.E

Workflow:
    1. Edit "" and set "run_iter" to a large number.
    2. Follow steps 1 to 10 of Experiment VI.B. Manually terminate the training when the loss stops decreasing.

=== Experiment VI.I ===

Experiment Summary: We evaluate the impact of our computation and communication time estimation on the strategy found by HiDup.

Software: Same as in Experiment VI.E

Hardware: Same as in Experiment VI.E

Workflow:
    1. Change "profile_noise" in "" to the noise level for "x% random noise". For example, "profile_noise = 0.8" corresponds to "80% random noise" in Table IV.
    2. Change "profiler_data" in "" by scaling the bandwidth for "±y% communication". For example, doubling the bandwidth corresponds to "-50% communication" in Table IV.
    3. Follow steps 1 to 8 of Experiment VI.B to collect the results for the strategy based on the noisy estimated times. Repeat 10 times for "x% random noise" under the same noise level and calculate the average relative time.
    4. Repeat steps 1 to 3 with different kinds of noise to collect data for Table IV.
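
The bandwidth scaling in step 2 follows from estimated communication time being inversely proportional to bandwidth: changing the time by y% means multiplying the bandwidth by 1 / (1 + y/100). The sketch below works through that arithmetic and the averaging in step 3. The function names are illustrative, and taking the baseline to be the training time of the strategy found without noise is an assumption.

```python
from typing import Sequence


def bandwidth_factor(comm_change_percent: float) -> float:
    """Factor to multiply each profiled bandwidth by so that the estimated
    communication time changes by the given percentage.

    Example: -50 (half the communication time) gives 2.0 (double bandwidth).
    """
    scale = 1.0 + comm_change_percent / 100.0
    if scale <= 0:
        raise ValueError("communication time must stay positive")
    return 1.0 / scale


def average_relative_time(noisy_times: Sequence[float], baseline_time: float) -> float:
    """Mean of per-run times divided by the baseline time.

    The baseline is assumed to be the per-iteration time of the strategy
    found without noise.
    """
    if not noisy_times:
        raise ValueError("need at least one measurement")
    return sum(t / baseline_time for t in noisy_times) / len(noisy_times)
```

For instance, "bandwidth_factor(-50)" returns 2.0, matching the doubling example in step 2, and "bandwidth_factor(100)" returns 0.5.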

=== Experiment VI.J ===

Experiment Summary: We study how HiDup's strategy search time scales with the number of GPUs and the number of layers for the BERT-SGMoE model.

Software: Same as in Experiment VI.E

Hardware: Same as in Experiment VI.E

Workflow:
    1. Change "nlayers" and "world_size" in "", which are the number of layers and the number of GPUs, respectively.
    2. Run "time python" to record the strategy search time.
    3. Repeat steps 1 and 2 with different numbers of layers and numbers of GPUs to collect data for Figure 12.
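
When collecting many data points, the timing in step 2 can also be done from Python instead of the shell "time" builtin. The following is a minimal sketch using only the standard library; the helper name "time_command" is hypothetical, and the command to time must be supplied by the caller.

```python
import subprocess
import sys
import time
from typing import Sequence


def time_command(argv: Sequence[str]) -> float:
    """Run a command to completion and return its wall-clock time in seconds.

    Raises subprocess.CalledProcessError if the command exits with an error,
    so failed runs are not recorded as valid timings.
    """
    start = time.perf_counter()
    subprocess.run(list(argv), check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder command; replace it with the strategy search invocation.
    elapsed = time_command([sys.executable, "-c", "pass"])
    print(f"elapsed: {elapsed:.3f} s")
```

This measures wall-clock time only, which corresponds to the "real" value reported by "time".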

