# Artifact Evaluation 
> #### STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training (pap381s2)

Deep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN requires high-performance GPU servers that are too expensive to purchase and maintain. Existing solutions for enabling larger DNN training with limited resources are inadequate because they suffer from high training time overhead. 

We present STRONGHOLD, a better approach for enabling large DNN model training by dynamically offloading data to the CPU RAM and using the secondary storage (e.g., an SSD drive). It maintains a working window to overlap the GPU computation with CPU-GPU data movement carefully and exploits the multi-core CPU for optimizer update. Compared to the state-of-the-art offloading-based solutions, STRONGHOLD improves the trainable model size by 1.9x∼6.5x on a 32GB V100 GPU, with 1.2x∼3.7x improvement on the training throughput.
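To give a feel for why overlapping computation with CPU-GPU data movement matters, the following is a toy timing model, not STRONGHOLD's actual scheduler. It assumes identical layers, a one-layer look-ahead, and hypothetical per-layer costs; the function names are illustrative only.

```python
def sequential_time(n_layers, compute_ms, transfer_ms):
    """Each layer is first moved to the GPU and then computed, with no overlap."""
    return n_layers * (transfer_ms + compute_ms)


def overlapped_time(n_layers, compute_ms, transfer_ms):
    """While layer i computes, layer i+1 is fetched in the background.

    The first transfer cannot be hidden. After that, every step costs
    whichever of compute or transfer is slower, and the last layer only
    needs to compute.
    """
    if n_layers == 0:
        return 0.0
    steady = (n_layers - 1) * max(compute_ms, transfer_ms)
    return transfer_ms + steady + compute_ms


# Hypothetical costs chosen only for illustration (not measured values).
print(sequential_time(100, 30.0, 20.0))  # 5000.0
print(overlapped_time(100, 30.0, 20.0))  # 3020.0
```

In this simplified model the transfer cost disappears almost entirely whenever a layer's compute time is at least as long as its transfer time.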

## Preliminaries

This interactive Jupyter notebook showcases the performance evaluation and comparison between STRONGHOLD and the existing state-of-the-art methods (e.g., Megatron-LM, L2L, ZeRO-Offload and ZeRO-Infinity). 


<!-- The main results of our works are to compare the performance of our approach (i.e. STRONGHOLD) with the existing state-of-the-art methods. The evaluation criteria include throughput, model sizes, scalability and inference time.

STRONGHOLD we proposed aims to increase the trainable model size through dynamic offloading. In this artifact, we show the performance of our apporach (i.e. STRONGHOLD) against the existing state-of-the-art methods:  Megatron-LM, L2L, Zero-offload and Zero-Infinity. The evaluation criteria includes throughput, model sizes, scalability and inference time. We will go through all the evalutions on the largest trainable deep neural network (DNN) size on a single GPU (NVIDIA 32GB V100). -->

- **Megatron-LM [1]**: the library supporting Transformer-based models, released by NVIDIA, optimized for tensor parallelism. We choose the `tags/v2.6` version as a reference for the testing throughput and trainable model size.

- **L2L [2]**: an offloading strategy, keeping only one Transformer layer in the GPU memory at a time and offloading model parameters between the GPU memory and CPU RAM sequentially. Since L2L stores the optimizer states on the GPU memory, it is limited mainly by the capacity of GPU memory.

- **ZeRO-Offload [3]**: a training method, statically storing the model states in the GPU memory and optimizer states in the CPU RAM. It uses CPU compute cycles to update the model parameters through a CPU version of the Adam optimizer.

- **ZeRO-Infinity [4]**: a training method based on ZeRO-3 [5], utilizing GPU, CPU RAM and/or NVMe secondary storage. In this notebook, we compare STRONGHOLD against ZeRO-Infinity using only the GPU and CPU RAM, because sustained high NVMe I/O is mistakenly flagged as disk overload by the cloud platform, which causes the underlying hypervisor to kill the process.


The metrics assessed in this notebook include throughput, model size and scalability, as listed below:

- The **largest trainable model size**, corresponding to **Fig.6a** in the paper.
- The **throughput comparison on respective maximum trainable model size** of each baseline, corresponding to **Fig.7a** in the paper.
- The **throughput comparison on the maximum trainable model size** of Megatron-LM, corresponding to **Fig.8a** in the paper.
- The **nearly linear scaling** on iteration time of STRONGHOLD, corresponding to **Fig.8b** in the paper.
- The **impact of GPU working window size** on the performance of STRONGHOLD, corresponding to **Fig.9** in the paper.

<!-- - The inference time of running the different model sizes of STRONGHOLD.
- Changing two key settings of STONGHOLD (i.e **window size** and **multi-stream**) and evaluate the changes of the throughput. -->

PS: This notebook runs on a rented ECS (virtual machine) with one 32GB V100 GPU, 90GB CPU RAM and 12 CPU cores. This hardware configuration differs from the one used in our paper. We therefore reproduce equivalent cases: the absolute values may differ from those reported in the paper, but the relative ratios stay the same.

<!-- The absolute values might be distinct, but the relevant ratio keeps the same. -->


### Reference
> [1] Megatron-LM. https://github.com/NVIDIA/Megatron-LM. <br>
> [2] B. Pudipeddi et al., “Training large neural networks with constant memory using a new execution algorithm,” arXiv, 2020. <br>
> [3] J. Ren et al., “Zero-offload: Democratizing billion-scale model training,” in OSDI, 2021. <br>
> [4] S. Rajbhandari, O. Ruwase et al., “Zero-infinity: Breaking the gpu memory wall for extreme scale deep learning,” in SC, 2021, pp. 1–14. <br>
> [5] S. Rajbhandari, J. Rasley, O. Ruwase et al., “Zero: Memory optimizations toward training trillion parameter models,” in SC, 2020, pp. 1–16.



## Important Notes

**A few bash scripts take more than half an hour to complete. Please wait for the results before executing the next one.**

Heavy load on the machine might lead to a longer wait for results. This may occur if multiple reviewers run the scripts simultaneously. To avoid it, check for running processes with `ps aux` before executing any other script.
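As a rough aid for this check, the sketch below scans `ps aux` output for commands that look like a running evaluation. The helper name and the keyword list are assumptions: `run.sh`, `pretrain_gpt` and `deepspeed` are strings that appear in the commands and logs of this notebook, but other process names may be relevant on your machine.

```python
def find_training_processes(ps_output,
                            keywords=("run.sh", "pretrain_gpt", "deepspeed")):
    """Return the `ps aux` lines whose COMMAND column mentions a keyword."""
    hits = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)        # COMMAND is the 11th column
        if len(fields) < 11:
            continue
        if any(keyword in fields[10] for keyword in keywords):
            hits.append(line)
    return hits


# Example with output in the format printed by `ps aux` (idle container).
sample = (
    "USER  PID %CPU %MEM  VSZ  RSS TTY   STAT START TIME COMMAND\n"
    "root    1  0.0  0.0 2520   72 pts/0 Ss+  Jun28 0:00 sleep infinit\n"
    "root    7  0.0  0.0 4348 2760 pts/1 Ss+  Jun28 0:00 /bin/bash\n"
)
print(find_training_processes(sample))  # [] means nothing is running
```

The `ps_output` string can be captured from a cell, for example with `out = !docker exec aetesting ps aux` followed by `"\n".join(out)`.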

The experiments are customisable, as reviewers can edit the Jupyter notebook on the spot. Edit the docker scripts we provide in the cells and re-run them using **Cell > Run Cells** from the menu.

<!-- All experiments run on a single 32GB V100 GPU in order to briefly show that all of the evaultion results in our paper can be reproduced. Since multi-GPU running is not used, some of the results are not included in this artifact, but researchers can surely follow the similar ways to reproduce those results. -->


### Links to The Paper

**For each step, we indicate which Section or Figure in the submitted paper the current evaluation corresponds to.**

The main results presented in this notebook correspond to the submitted paper's Figures 6, 7, 8, and 9.


# 1. Basic Setup

### 1.1 Let's ensure the docker container (NAME: aetesting) is launched. 

The `docker ps` command shows the currently running docker containers. The next cell in this notebook should output the following information. If it does not produce similar output, please run the `!docker stop aetesting` and `!docker start aetesting` commands in a cell to restart the container.

```
CONTAINER ID   IMAGE                    COMMAND       CREATED         STATUS         PORTS     NAMES
d67abb7f151c   strongh/sc22-ae:latest   "/bin/bash"   3 minutes ago   Up 3 minutes             aetesting
```

In [2]:
!docker ps

CONTAINER ID   IMAGE                    COMMAND       CREATED         STATUS         PORTS     NAMES
d67abb7f151c   strongh/sc22-ae:latest   "/bin/bash"   3 minutes ago   Up 3 minutes             aetesting
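The same check can be scripted. The sketch below is one possible way to do it, assuming the `docker` CLI is available on the host; the two helper names are illustrative. It asks `docker ps` to print only container names and looks for an exact match.

```python
import subprocess


def container_is_running(names_output, name="aetesting"):
    """True if `name` is one of the lines printed by
    `docker ps --format '{{.Names}}'`."""
    return name in (line.strip() for line in names_output.splitlines())


def running_container_names():
    """Ask the docker CLI for the names of all running containers."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Usage on the host (requires docker, so it is left commented out here):
# if not container_is_running(running_container_names()):
#     subprocess.run(["docker", "start", "aetesting"], check=True)
```

Matching whole lines avoids a false positive from a container whose name merely contains `aetesting`.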


### 1.2 Let's check that the runtime environment in the docker container works well. 

`docker exec` executes a command in a running container. The following cell runs a command that should output information similar to the following:

```
root
/home/sys/STRONGHOLD
torch                         1.10.0a0+git71f889c /root/.pyenv/......3.9.10/lib/python3.9/site-packages
torchvision                   0.11.0a0+05eae32
```

In [102]:
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
whoami && pwd && pip list | grep torch && \
\
pyenv deactivate'

root
/home/sys/STRONGHOLD
torch                         1.10.0a0+git71f889c /root/.pyenv/versions/3.9.10/envs/py3.9.10/lib/python3.9/site-packages
torchvision                   0.11.0a0+05eae32


### 1.3 How to run customized commands in the docker container?

To execute customized commands in the docker container, please `copy` the whole command above and `change` the line `whoami && pwd && pip list | grep torch && \` to your own commands, keeping the trailing `&& \`.

PS: Please do not remove the other commands that initialize the environment variables of the current interactive shell session.
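For repeated use, the copy-and-edit step can be wrapped in a small helper. This is only a sketch: `ENV_PRELUDE` and `build_docker_exec` are hypothetical names, and the prelude simply repeats the initialization commands from the cell above. The `-it` flags are left out because no interactive terminal is needed when the command is launched from Python.

```python
# Environment initialization copied from the notebook cell above.
ENV_PRELUDE = [
    "export PYENV_VIRTUALENV_DISABLE_PROMPT=1",
    'export PYENV_ROOT="/root/.pyenv"',
    'export PATH="$PYENV_ROOT/bin:$PATH"',
    'eval "$(pyenv init -)"',
    'eval "$(pyenv virtualenv-init -)"',
    "pyenv activate py3.9.10",
]


def build_docker_exec(user_command,
                      container="aetesting",
                      workdir="/home/sys/STRONGHOLD"):
    """Return an argv list that runs `user_command` inside the container
    with the pyenv environment activated first and deactivated afterwards."""
    script = " && ".join(ENV_PRELUDE + [user_command, "pyenv deactivate"])
    return ["docker", "exec", "-w", workdir, container,
            "/bin/bash", "-c", script]


argv = build_docker_exec("whoami && pwd && pip list | grep torch")
print(argv[-1])
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; passing a list rather than a single string avoids a second layer of shell quoting on the host.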

# 2. Evaluation

Next, we use five cases to test performance separately on the largest trainable model size, throughput and scalability. Each case matches a figure in the submitted paper. 

To reuse the existing log files produced in the previous cases, we recommend running these cases one by one, in order, which reduces the total execution time to **about 5 hours**. 

- 2.1 CASE - The largest trainable model size (Figure 6a in Section VI.A) - around 130 mins
- 2.2 CASE - Throughput  on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B) - around 40 mins
- 2.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B) - around 45 mins
- 2.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B) - around 60 mins
- 2.5 CASE - Impact of working window size (Figure 9 in Section VI.C) - around 50 mins

**All log files will be stored in `/home/sys/STRONGHOLD/results`** in the format `log_[method]_l-[layers]_hs-[hidden size]_bs-[BATCH_SIZE]_ws-[WINDOW_SIZE]_[date].txt`. At the end of each execution, we print the core content of the log files via `grep` and `awk`.
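When post-processing many runs, the configuration can be recovered from the file name. The sketch below is an assumption-laden helper (the name `parse_log_name` and the field names are illustrative) that matches names as they appear in the recorded output of this notebook, for example `log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656464109.txt`. The numeric suffix after the date looks like a Unix timestamp, but that is a guess.

```python
import re

LOG_NAME = re.compile(
    r"log_(?P<method>[a-z0-9-]+)"
    r"_l-(?P<layers>\d+)"
    r"_hs-(?P<hidden>\d+)"
    r"_bs-(?P<batch>\d+)"
    r"_ws-(?P<window>\d+)"
    r"_(?P<date>\d{4}-\d{2}-\d{2})\.(?P<suffix>\d+)\.txt$"
)


def parse_log_name(filename):
    """Split a results file name into its configuration fields.

    Returns a dict, or None if the name does not follow the pattern.
    """
    match = LOG_NAME.search(filename)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("layers", "hidden", "batch", "window", "suffix"):
        info[key] = int(info[key])
    return info


print(parse_log_name(
    "log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656464109.txt"))
```

Because the pattern is searched rather than matched from the start, full paths such as `/home/sys/STRONGHOLD/results/log_...txt` work as well.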

**Launch script** `./examples/run.sh -m [method] -l [layers] -h [hidden size] -b [batch size] -w [window size]` accepts five arguments, where `[method]` takes the values of `megatron-lm`, `l2l`, `zero-offload`, `zero-infinity`, `stronghold` and `all`. Use `all` to evaluate all approaches automatically. Default values for `[layers]`, `[hidden size]`, `[batch size]`, `[window size]` are 16, 2048, 4 and 4, respectively.
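To avoid typos when preparing several runs, the command line can be generated. The helper below is a hypothetical convenience (it is not part of the artifact); it only formats the string, using the flags and default values described above, and rejects unknown method names.

```python
METHODS = ("megatron-lm", "l2l", "zero-offload",
           "zero-infinity", "stronghold", "all")


def run_sh_command(method, layers=16, hidden=2048, batch=4, window=4):
    """Format a `./examples/run.sh` invocation for the given configuration."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return (f'./examples/run.sh -m "{method}" '
            f"-l {layers} -h {hidden} -b {batch} -w {window}")


# Defaults, then a 100-layer STRONGHOLD run.
print(run_sh_command("all"))
print(run_sh_command("stronghold", layers=100))
```

The returned string is meant to be placed in the docker cell in place of the `whoami && ...` line from Section 1.3.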

PS: Some cases take more than half an hour because we have to execute all baselines. Please have a coffee and wait for the output before the next execution.

<!-- > #### Description of the output log
>
> `------------------------ arguments ------------------------` and the below information shows the running parameters in details.
> 
> `>>> done with compiling and loading fused kernels. Compilation time: X seconds` means that all the compiling settings including CUDA linking and loading CUDA module have been successfully done. And it also gives the compilation time..
> 
> `>>> done with compiling and loading strongh utils. Compilation time: X seconds` means all the optimizer modules have been successfully linked and compiled.
> 
> `> building train, validation, and test datasets ...` shows that we are now building the training, validation and test dataloaders. And the following information will show the dataset in details.
> 
> `[before the start of training step] datetime: 2022-06-22 22:22:59` <br>
> `done with setup ...` shows everything works well before starting the actual training. 
> 
> `time (ms) | model-and-optimizer-setup: X | train/valid/test-data-iterators-setup: X` shows each of the setup cost (ms).
> 
> `training ...` and the below log information shows the traing breakdowns. 
> 
> **The below log shows the actual model running results, thus should be the one we pay attention for. And those results correspond to Figure 7(a) and Figure 8(a) in our paper. Please refer to our paper for more details.**
> 
> `iteration       X/      50` shows the iteration level, where `X` is the number represents `X` out of the total iteration (50 in this artifact) .
> 
> `[Rank X] (after 10 iterations) memory (MB) | ...` shows the memory footprint.
>
> `time (ms) | ...` shows the training time breakdowns.
>
> **All the logs will be stored in `/home/sys/STRONGHOLD/results`** -->

In [103]:
# empty the previous log files in `results` folder
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
rm -rf /home/sys/STRONGHOLD/results/*.txt && \
\
pyenv deactivate'

## 2.1 CASE - The largest trainable model size (Figure 6a in Section VI.A)

In this case, we use GPT-like models to determine each method's largest trainable model size. The model size is changed by increasing or decreasing the number of transformer layers.

Here, we evaluate Megatron-LM, L2L, ZeRO-Offload, ZeRO-Infinity and STRONGHOLD on a virtual machine with one 32GB V100, 90GB CPU RAM and 12 CPU cores to find their largest trainable model size and bottleneck. During this process, we set `Heads=16, Sequence Length=1024, Batch Size=4` in all GPT-like models and training setups.

The largest model sizes tested in this notebook are shown in the following table. Please run the following cells to reproduce them.

| Methods | Largest Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Megatron-LM | **1.717 B**| **32** | 2048 | 16 | 1024 | 4 |
| L2L | **4.033 B**| **78** | 2048 | 16 | 1024 | 4 |
| ZeRO-Offload | **2.522 B**| **48** | 2048 | 16 | 1024 | 4 |
| ZeRO-Infinity | **2.522 B**| **48** | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **5.141 B**| **100** | 2048 | 16 | 1024 | 4 |

PS: GPU/CPU OOM errors might be reported with other messages, such as 'can not create XXX'.
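The sizes in the table can be cross-checked from the layer count and hidden size. The sketch below uses the standard GPT-2 style parameter count and assumes a padded vocabulary of 50304 (the value printed in the logs), a learned position embedding for 1024 positions, and an output layer that shares the word-embedding weights; the function name is illustrative.

```python
def gpt_parameter_count(layers, hidden, vocab=50304, seq_len=1024):
    """Approximate parameter count of a GPT-2 style model.

    Per layer: attention (4*h*h weights + 4*h biases), MLP (8*h*h weights
    + 5*h biases) and two layer norms (4*h), i.e. 12*h*h + 13*h.
    """
    per_layer = 12 * hidden * hidden + 13 * hidden
    embeddings = vocab * hidden + seq_len * hidden  # word + position
    final_layernorm = 2 * hidden
    return layers * per_layer + embeddings + final_layernorm


configs = [("Megatron-LM", 32), ("L2L", 78), ("ZeRO-Offload", 48),
           ("ZeRO-Infinity", 48), ("STRONGHOLD", 100)]
for method, layers in configs:
    billions = gpt_parameter_count(layers, 2048) / 1e9
    print(f"{method:14s} layers={layers:3d}  {billions:.3f} B")
```

Under these assumptions the formula yields 1.717 B, 4.033 B, 2.522 B and 5.141 B for 32, 78, 48 and 100 layers, matching the table, so STRONGHOLD's largest model here is roughly 3x the size of Megatron-LM's.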

In [104]:
######
# Check whether other reviewers have launched processes that are still running, to avoid GPU overload.
# Just run it and no need to change anything in this cell.
#
# `ps aux` in docker container. 
######
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c 'export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && export PYENV_ROOT="/root/.pyenv" && export PATH="$PYENV_ROOT/bin:$PATH" && eval "$(pyenv init -)" && eval "$(pyenv virtualenv-init -)" && pyenv activate py3.9.10 && ps aux && pyenv deactivate'


USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2520    72 pts/0    Ss+  Jun28   0:00 sleep infinit
root           7  0.0  0.0   4348  2760 pts/1    Ss+  Jun28   0:00 /bin/bash
root         159  0.0  0.0   4348  2900 pts/2    Ss+  Jun28   0:00 /bin/bash
root       34863  0.0  0.0   3976  3144 pts/3    Ss+  00:55   0:00 /bin/bash -c 
root       35021  0.0  0.0   5892  2844 pts/3    R+   00:55   0:00 ps aux


### The following results correspond to Figure 6a in the submitted paper. Please refer to Section VI.A on page 8 for more details. Running time: around 130 mins.

In [105]:
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/run.sh -m "megatron-lm" -l 32 -h 2048 && \
./examples/run.sh -m "l2l" -l 78 -h 2048 && \
./examples/run.sh -m "zero-offload" -l 48 -h 2048 && \
./examples/run.sh -m "zero-infinity" -l 48 -h 2048 && \
./examples/run.sh -m "stronghold" -l 100 -h 2048 && \
\
pyenv deactivate'



 !!! The training model size in megatron-lm might be much smaller than others, such as zero-offload, stronghold, etc. !!! 

 
cd /home/sys/STRONGHOLD/examples/../Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../Megatron-LM/examples/sc22-gpt-megatron.sh 32 2048 16 1024 4 4 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656464109.txt && cd -
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ..................

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.153 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sys/STRONGHOLD/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of work

  successfully saved checkpoint at iteration      50 to checkpoints/gpt2
time (ms) | save-checkpoint: 66891.40
[exiting program at iteration 50] datetime: 2022-06-29 01:00:50 
/home/sys/STRONGHOLD
cd /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/sc22-gpt-l2l.sh 78 2048 16 1024 4 4 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_l2l_l-78_hs-2048_bs-4_ws-4_2022-06-29.1656464452.txt && cd -
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ..........

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.153 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default nu

`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
[2022-06-29 01:22:11,147] [INFO] [runner.py:398:main] cmd = /root/.pyenv/versions/3.9.10/envs/py3.9.10/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 pretrain_gpt2.py --model-parallel-size 1 --num-layers 48 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save checkpoints/gpt2_ds --load checkpoints/gpt2_ds --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-file /home/sys/STRONGHOLD/data/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-06-29 01:22:16,070] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-06-29 01:22:16,071] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-06-29 01:22:16,071] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.08 GB, percent = 3.4%
beginging get_train_batch_size = <function get_train_batch_size at 0x7f4dcfda71f0>
train_batch = 4, micro_batch=None
[2022-06-29 01:22:16,254] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-06-29 01:22:16,255] [INFO] [utils.py:823:see_memory_usage] MA 9.39 GB        

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.003 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.003 seconds
    total number of samples: 108654
    total 

[2022-06-29 01:26:11,085] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.34 | optimizer_gradients: 270.08 | optimizer_step: 7393.54
[2022-06-29 01:26:11,086] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1763.23 | backward_microstep: 6377.99 | backward_inner_microstep: 6338.63 | backward_allreduce_microstep: 39.27 | step_microstep: 7722.09
[2022-06-29 01:26:11,086] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1763.31 | backward: 6377.99 | backward_inner: 6338.64 | backward_allreduce: 39.27 | step: 7722.10
[2022-06-29 01:26:26,962] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.37 | optimizer_gradients: 269.20 | optimizer_step: 7408.53
[2022-06-29 01:26:26,963] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1760.92 | backward_microstep: 6376.31 | backward_inner_microstep: 6337.84 | backward_allreduce_microstep: 38.37 | step_mi

[2022-06-29 01:29:37,519] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.35 | optimizer_gradients: 264.97 | optimizer_step: 7388.26
[2022-06-29 01:29:37,519] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1759.43 | backward_microstep: 6383.61 | backward_inner_microstep: 6345.15 | backward_allreduce_microstep: 38.37 | step_microstep: 7711.75
[2022-06-29 01:29:37,519] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1759.52 | backward: 6383.61 | backward_inner: 6345.16 | backward_allreduce: 38.37 | step: 7711.75
[2022-06-29 01:29:53,365] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.37 | optimizer_gradients: 264.47 | optimizer_step: 7392.00
[2022-06-29 01:29:53,366] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1755.86 | backward_microstep: 6372.57 | backward_inner_microstep: 6334.04 | backward_allreduce_microstep: 38.44 | step_mi

[2022-06-29 01:33:03,946] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.37 | optimizer_gradients: 265.89 | optimizer_step: 7384.33
[2022-06-29 01:33:03,947] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1755.36 | backward_microstep: 6377.65 | backward_inner_microstep: 6339.19 | backward_allreduce_microstep: 38.36 | step_microstep: 7708.76
[2022-06-29 01:33:03,947] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1755.42 | backward: 6377.65 | backward_inner: 6339.20 | backward_allreduce: 38.37 | step: 7708.76
[2022-06-29 01:33:19,823] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.37 | optimizer_gradients: 269.27 | optimizer_step: 7400.83
[2022-06-29 01:33:19,823] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1754.49 | backward_microstep: 6390.42 | backward_inner_microstep: 6351.86 | backward_allreduce_microstep: 38.47 | step_mi

[2022-06-29 01:36:30,237] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.37 | optimizer_gradients: 271.23 | optimizer_step: 7408.36
[2022-06-29 01:36:30,237] [INFO] [logging.py:69:log_dist] [Rank 0] step=50, skipped=0, lr=[2.34375e-06, 2.34375e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-06-29 01:36:30,237] [INFO] [timer.py:181:stop] 0/50, SamplesPerSec=0.25199842464335126
[2022-06-29 01:36:30,238] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1754.44 | backward_microstep: 6372.73 | backward_inner_microstep: 6334.18 | backward_allreduce_microstep: 38.46 | step_microstep: 7738.42
[2022-06-29 01:36:30,238] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1754.53 | backward: 6372.73 | backward_inner: 6334.18 | backward_allreduce: 38.47 | step: 7738.42
 iteration       50/      50 | elapsed time per iteration (ms): 15866.7 | learning rate: 2.344E-06 | lm loss: 8.758082E+00 | number of skipped iteratio

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-06-29 01:36:46,842] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-06-29 01:36:46,843] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-06-29 01:36:46,843] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.08 GB, percent = 3.4%
beginging get_train_batch_size = <function get_train_batch_size at 0x7fc2986461f0>
train_batch = None, micro_batch=4
[2022-06-29 01:36:54,466] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-06-29 01:36:54,467] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 108654
    total 

[2022-06-29 01:40:53,783] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6228.15
[2022-06-29 01:40:53,784] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2611.08 | backward_microstep: 7522.57 | backward_inner_microstep: 7445.85 | backward_allreduce_microstep: 76.61 | step_microstep: 6273.97
[2022-06-29 01:40:53,784] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2611.09 | backward: 7522.57 | backward_inner: 7445.86 | backward_allreduce: 76.62 | step: 6273.98
[2022-06-29 01:41:10,264] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6256.10
[2022-06-29 01:41:10,265] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2636.03 | backward_microstep: 7540.76 | backward_inner_microstep: 7464.15 | backward_allreduce_microstep: 76.48 | step_microstep: 6299.09
[2022-06-29 01:41:10,265] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2636.0

[2022-06-29 01:44:44,081] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6243.18
[2022-06-29 01:44:44,082] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2622.09 | backward_microstep: 7546.32 | backward_inner_microstep: 7465.72 | backward_allreduce_microstep: 80.50 | step_microstep: 6288.35
[2022-06-29 01:44:44,082] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2622.11 | backward: 7546.32 | backward_inner: 7465.73 | backward_allreduce: 80.51 | step: 6288.35
[2022-06-29 01:45:00,563] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6287.69
[2022-06-29 01:45:00,564] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2617.51 | backward_microstep: 7527.39 | backward_inner_microstep: 7450.77 | backward_allreduce_microstep: 76.51 | step_microstep: 6332.27
[2022-06-29 01:45:00,564] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2617.5

[2022-06-29 01:48:34,283] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6235.07
[2022-06-29 01:48:34,284] [INFO] [logging.py:69:log_dist] [Rank 0] step=40, skipped=0, lr=[1.8749999999999998e-06, 1.8749999999999998e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-06-29 01:48:34,284] [INFO] [timer.py:181:stop] 0/40, SamplesPerSec=0.2432504830307154
[2022-06-29 01:48:34,284] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2640.63 | backward_microstep: 7558.83 | backward_inner_microstep: 7482.02 | backward_allreduce_microstep: 76.66 | step_microstep: 6279.17
[2022-06-29 01:48:34,284] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2640.67 | backward: 7558.83 | backward_inner: 7482.05 | backward_allreduce: 76.70 | step: 6279.18
 iteration       40/      50 | elapsed time per iteration (ms): 16444.1 | learning rate: 1.875E-06 | lm loss: 1.085358E+01 | number of skipped iterations:   0 | number of nan iterations:   

`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
--checkpoint-activations is no longer valid, use --activation-checkpoint-method instead. Defaulting to activation-checkpoint-method=uniform.
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_resid

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.167 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 25268.9 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.315873E+00 | loss scale: 1.0 | grad norm: 397.991 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 6.67 and total parameters 5.141 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 25.268943524360658;  SamplesPerSecond: 0.1582970810055347
time (ms) | e2e-time: 25269.06 | forward-compute: 3714.81 | backward-compute: 21543.24 | backward-embedding-all-reduce: 0.02 | optimizer: 2.02 | batch-generator: 1.38 | offloading-func-call-overhead: 51.58 | offloading-fwd-overhead: 3387.11 | offloading-bwd-overhead: 18918.47 | offloading-fwd-2gpu-overhead: 721.87 | offloading-fwd-2cpu-overhead: 2663.27 | offloading-bwd-2gpu-overhead: 589.25 | offloading-bwd-2cpu-overhead: 18326.52
 iteration       40/      50 | elapsed time per iteration (ms): 25335.5 | learning rate: 1.875E-06 | global batch size:     4 | lm los

In [124]:
# To print the relevant information from log files
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/case1.sh && \
\
pyenv deactivate'


Running:  grep -R 'total parameters' ./results/log_l2l_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656470788.txt ./results/log_l2l_l-78_hs-2048_bs-4_ws-4_2022-06-29.1656464452.txt ./results/log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656464109.txt ./results/log_stronghold_l-100_hs-2048_bs-4_ws-4_2022-06-29.1656467489.txt ./results/log_stronghold_l-16_hs-2048_bs-4_ws-15_2022-06-29.1656476653.txt ./results/log_stronghold_l-24_hs-2048_bs-4_ws-15_2022-06-29.1656476410.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-10_2022-06-29.1656478517.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-12_2022-06-29.1656478904.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-14_2022-06-29.1656479286.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-15_2022-06-29.1656468888.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-2_2022-06-29.1656476815.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656477260.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-6_2022-06-29.1656477696.txt ./resul

## 2.2 CASE - Throughput on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B)

In this case, we use GPT-like models to find the largest trainable model size supported by each baseline, and then compare each baseline's throughput against STRONGHOLD at that model size. The model size is varied by increasing or decreasing the number of transformer layers.

Here, we evaluate Megatron-LM, L2L, ZeRO-Offload and ZeRO-Infinity against STRONGHOLD on a virtual machine with one 32GB V100 GPU, 90GB of CPU RAM and 12 CPU cores. All GPT-like models and training setups use `Heads=16, Sequence Length=1024, Batch Size=4`.
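As a rough illustration of how the layer count determines the model size, the sketch below estimates the parameter count of a GPT-2-style model. It is not taken from the artifact's code: the function name `gpt_param_count` is hypothetical, and the per-layer layout (12·h² weights plus 13·h bias and layer-norm terms, tied input/output embeddings) is an assumption based on the standard GPT-2 architecture. The padded vocabulary size of 50304 and the 1024 positions come from the logs in this notebook. Under these assumptions the estimate rounds to 1.717 B, 2.522 B and 4.033 B parameters for 32, 48 and 78 layers.

```python
def gpt_param_count(num_layers, hidden_size, padded_vocab_size=50304, max_positions=1024):
    """Estimate the parameter count of a GPT-2-style model (assumed layout)."""
    h = hidden_size
    # Assumed per-layer contents:
    #   attention QKV (3h^2 + 3h) + attention output projection (h^2 + h)
    #   MLP up-projection (4h^2 + 4h) + MLP down-projection (4h^2 + h)
    #   two layer norms (2 * 2h)
    per_layer = 12 * h * h + 13 * h
    # Token and position embeddings; the output layer is assumed to share
    # the token embedding, so it adds no parameters.
    embeddings = padded_vocab_size * h + max_positions * h
    final_layernorm = 2 * h
    return num_layers * per_layer + embeddings + final_layernorm


if __name__ == "__main__":
    for layers in (32, 48, 78):
        print(f"{layers} layers: {gpt_param_count(layers, 2048) / 1e9:.3f} B parameters")
```

Because the per-layer term dominates, the model size grows almost linearly with the number of layers at a fixed hidden size.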

The throughput values in the following table were measured in this notebook. Please run the next cells to reproduce them.

| Methods | Throughput (samples/s) | Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Megatron-LM | **0.7496** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.6647** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| | | | | | | | |
| L2L | **0.0529** | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.2271** | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| | | | | | | | |
| ZeRO-Offload | **0.2523** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.3999** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| | | | | | | | |
| ZeRO-Infinity | **0.2439** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.3999** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |

PS: The limited number of CPU cores and the limited bandwidth of the virtual machine slightly reduce the performance of STRONGHOLD.
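To make the comparison in the table above easier to read, the following minimal sketch computes the throughput of STRONGHOLD relative to each baseline at that baseline's largest model size. The numbers are copied from the table; the dictionary `results` and the helper `relative_throughput` are illustrative names, not part of the artifact.

```python
# Throughput (samples/s) copied from the table above: (baseline, STRONGHOLD).
results = {
    "Megatron-LM": (0.7496, 0.6647),
    "L2L": (0.0529, 0.2271),
    "ZeRO-Offload": (0.2523, 0.3999),
    "ZeRO-Infinity": (0.2439, 0.3999),
}


def relative_throughput(baseline, stronghold):
    """Return STRONGHOLD throughput as a multiple of the baseline throughput."""
    return stronghold / baseline


if __name__ == "__main__":
    for name, (baseline, stronghold) in results.items():
        print(f"STRONGHOLD vs. {name}: {relative_throughput(baseline, stronghold):.2f}x")
```

With these measurements, STRONGHOLD reaches roughly 4.3x the throughput of L2L and about 1.6x that of ZeRO-Offload and ZeRO-Infinity, while staying at roughly 0.89x of Megatron-LM on the model that Megatron-LM can still fit.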

In [107]:
######
# Check whether processes launched by other reviewers are still running, to avoid overloading the GPU.
# Just run this cell; there is no need to change anything.
#
# `ps aux` in docker container. 
######
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c 'export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && export PYENV_ROOT="/root/.pyenv" && export PATH="$PYENV_ROOT/bin:$PATH" && eval "$(pyenv init -)" && eval "$(pyenv virtualenv-init -)" && pyenv activate py3.9.10 && ps aux && pyenv deactivate'

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2520    72 pts/0    Ss+  Jun28   0:00 sleep infinit
root           7  0.0  0.0   4348   696 pts/1    Ss+  Jun28   0:00 /bin/bash
root         159  0.0  0.0   4348   804 pts/2    Ss+  Jun28   0:00 /bin/bash
root       35813  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35825  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35855  0.0  0.0      0     0 ?        Z    01:01   0:01 [python] <def
root       35856  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35864  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35865  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       36874  0.0  0.0   3976  3200 pts/3    Ss+  02:14   0:00 /bin/bash -c 
root       37032  0.0  0.0   5892  2964 pts/3    R+   02:14   0:00 ps aux


### The following results correspond to Figure 7a in the submitted paper. Please refer to Section VI.B on page 9 for more details. The run takes around 40 minutes.

In [108]:
# Run STRONGHOLD on the 32-, 48- and 78-layer models (hidden size 2048)
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 48 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 78 -h 2048 -w 15 && \
\
pyenv deactivate'

cd /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/sc22-gpt-sh.sh 32 2048 16 1024 4 15 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_stronghold_l-32_hs-2048_bs-4_ws-15_2022-06-29.1656468888.txt && cd -
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py': No such file or directory
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py': No such file or directory
PYTHONGIL=1 python pretrain_gpt.py --num-layers 32 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --micro-batch-size 4 --global-batch-size 4 --max-position-embeddings 1024 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save ./checkpoints/gpt2 --load ./checkpoints/gpt2 --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-fi

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.156 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6051.8 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.052111E+01 | loss scale: 1.0 | grad norm: 12.352 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 9.29 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.051781535148621;  SamplesPerSecond: 0.6609623921101718
time (ms) | e2e-time: 6051.73 | forward-compute: 852.50 | backward-compute: 5188.55 | backward-embedding-all-reduce: 0.02 | optimizer: 2.45 | batch-generator: 1.21 | offloading-func-call-overhead: 14.16 | offloading-fwd-overhead: 751.53 | offloading-bwd-overhead: 61.74 | offloading-fwd-2gpu-overhead: 348.26 | offloading-fwd-2cpu-overhead: 402.66 | offloading-bwd-2gpu-overhead: 0.76 | offloading-bwd-2cpu-overhead: 60.09
 iteration       40/      50 | elapsed time per iteration (ms): 6017.3 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.036228E+01 |

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 10053.0 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.951887E+00 | loss scale: 1.0 | grad norm: 29079963.862 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.22 and total parameters 2.522 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 10.053036069869995;  SamplesPerSecond: 0.3978897491463718
time (ms) | e2e-time: 10053.03 | forward-compute: 1360.82 | backward-compute: 8681.45 | backward-embedding-all-reduce: 0.02 | optimizer: 2.33 | batch-generator: 1.31 | offloading-func-call-overhead: 23.37 | offloading-fwd-overhead: 1198.38 | offloading-bwd-overhead: 3.80 | offloading-fwd-2gpu-overhead: 533.43 | offloading-fwd-2cpu-overhead: 663.93 | offloading-bwd-2gpu-overhead: 1.71 | offloading-bwd-2cpu-overhead: 0.74
 iteration       40/      50 | elapsed time per iteration (ms): 9953.1 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.822

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.155 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 17616.2 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.012751E+01 | loss scale: 1.0 | grad norm: 521.329 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.5 and total parameters 4.033 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 17.616170144081117;  SamplesPerSecond: 0.22706411026257975
time (ms) | e2e-time: 17616.19 | forward-compute: 2594.27 | backward-compute: 15010.93 | backward-embedding-all-reduce: 0.02 | optimizer: 2.17 | batch-generator: 1.41 | offloading-func-call-overhead: 41.88 | offloading-fwd-overhead: 2329.38 | offloading-bwd-overhead: 221.90 | offloading-fwd-2gpu-overhead: 1134.03 | offloading-fwd-2cpu-overhead: 1193.82 | offloading-bwd-2gpu-overhead: 3.16 | offloading-bwd-2cpu-overhead: 216.40
 iteration       40/      50 | elapsed time per iteration (ms): 17583.2 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.

In [135]:
# To print the relevant information from log files
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/case2.sh && \
\
pyenv deactivate'


Running:  grep -R 'SamplesPerSec' ./results/log_zero-infinity_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656472684.txt ./results/log_zero-infinity_l-48_hs-2048_bs-4_ws-4_2022-06-29.1656466598.txt ./results/log_zero-offload_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656472088.txt ./results/log_zero-offload_l-48_hs-2048_bs-4_ws-4_2022-06-29.1656465726.txt | awk -v FS='[/,_: ]' '{print $5, $6, $4, $22}' | sort 

Running:  grep -R 'SamplesPerSec' ./results/log_l2l_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656470788.txt ./results/log_l2l_l-78_hs-2048_bs-4_ws-4_2022-06-29.1656464452.txt ./results/log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656464109.txt ./results/log_stronghold_l-100_hs-2048_bs-4_ws-4_2022-06-29.1656467489.txt ./results/log_stronghold_l-16_hs-2048_bs-4_ws-15_2022-06-29.1656476653.txt ./results/log_stronghold_l-24_hs-2048_bs-4_ws-15_2022-06-29.1656476410.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-10_2022-06-29.1656478517.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-12_2022-06-2

## 2.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B)

This case shows the throughput of Megatron-LM, L2L, ZeRO-Offload, ZeRO-Infinity and STRONGHOLD on a 1.717 B model, which is the largest trainable model size supported by Megatron-LM. The evaluation is conducted on a virtual machine with one 32GB V100 GPU, 90GB of CPU RAM and 12 CPU cores.

The throughput values in the following table were measured in this notebook. Please run the following cells to reproduce them.

| Methods | Throughput (samples/s) | Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Megatron-LM | **0.7496** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| L2L | **0.1729** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ZeRO-Offload | **0.3711** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ZeRO-Infinity | **0.3587** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.6647** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |

PS: The limited number of CPU cores and the limited bandwidth of the virtual machine slightly reduce the performance of STRONGHOLD.
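The throughput values appear to come from the `SamplesPerSecond` / `SamplesPerSec` entries that the training scripts write to the log files. As a minimal sketch of how such entries could be extracted, the code below parses the two log-line styles seen in this notebook's output. The regular expression and the function name `samples_per_second` are illustrative assumptions; the artifact itself uses the `grep`/`awk` pipelines in `case2.sh` and `case3.sh`, which this sketch does not reproduce.

```python
import math
import re

# Matches both "SamplesPerSecond: 0.66..." and "SamplesPerSec=0.35...".
PATTERN = re.compile(r"SamplesPerSec(?:ond)?\s*[:=]\s*([0-9]*\.?[0-9]+)")


def samples_per_second(line):
    """Return the throughput reported in a log line, or None if absent."""
    match = PATTERN.search(line)
    return float(match.group(1)) if match else None


# Example log lines copied from the output shown in this notebook.
megatron_style = (
    "NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.051781535148621;  "
    "SamplesPerSecond: 0.6609623921101718"
)
deepspeed_style = (
    "[2022-06-29 03:26:12,585] [INFO] [timer.py:181:stop] 0/40, "
    "SamplesPerSec=0.3580743926343268"
)

if __name__ == "__main__":
    print(samples_per_second(megatron_style))
    print(samples_per_second(deepspeed_style))
    # In the first style, the value equals SamplesPerStep / IterationTime.
    print(math.isclose(4 / 6.051781535148621, samples_per_second(megatron_style), rel_tol=1e-6))
```

In the first style the reported value is simply the number of samples per step divided by the iteration time in seconds, so it can be cross-checked directly against the `elapsed time per iteration` entries.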

In [110]:
######
# Check whether processes launched by other reviewers are still running, to avoid overloading the GPU.
# Just run this cell; there is no need to change anything.
#
# `ps aux` in docker container. 
######
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c 'export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && export PYENV_ROOT="/root/.pyenv" && export PATH="$PYENV_ROOT/bin:$PATH" && eval "$(pyenv init -)" && eval "$(pyenv virtualenv-init -)" && pyenv activate py3.9.10 && ps aux && pyenv deactivate'

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2520    72 pts/0    Ss+  Jun28   0:00 sleep infinit
root           7  0.0  0.0   4348   696 pts/1    Ss+  Jun28   0:00 /bin/bash
root         159  0.0  0.0   4348   804 pts/2    Ss+  Jun28   0:00 /bin/bash
root       35813  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35825  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35855  0.0  0.0      0     0 ?        Z    01:01   0:01 [python] <def
root       35856  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35864  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35865  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       38697  0.0  0.0   3976  3200 pts/3    Ss+  02:46   0:00 /bin/bash -c 
root       38855  0.0  0.0   5892  2836 pts/3    R+   02:46   0:00 ps aux


### The following results correspond to Figure 8a in the submitted paper. Please refer to Section VI.B on page 9 for more details. The run takes around 45 minutes.

In [111]:
# Run L2L, ZeRO-Offload and ZeRO-Infinity on the 32-layer model (hidden size 2048)
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/run.sh -m "l2l" -l 32 -h 2048 && \
./examples/run.sh -m "zero-offload" -l 32 -h 2048 && \
./examples/run.sh -m "zero-infinity" -l 32 -h 2048 && \
\
pyenv deactivate'

cd /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/sc22-gpt-l2l.sh 32 2048 16 1024 4 4 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_l2l_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656470788.txt && cd -
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ..............................

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.159 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default nu

 iteration       50/      50 | elapsed time per iteration (ms): 22467.5 | learning rate: 2.344E-06 | global batch size:     4 | lm loss: 1.121118E+01 | loss scale: 1.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 2.5 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 22.467474150657655;  SamplesPerSecond: 0.17803514419011424
time (ms) | forward-compute: 4155.95 | backward-compute: 15576.49 | backward-params-all-reduce: 16.84 | backward-embedding-all-reduce: 0.04 | optimizer: 2700.05 | batch-generator: 1.44
saving checkpoint at iteration      50 to checkpoints/gpt2_345m_ds
  successfully saved checkpoint at iteration      50 to checkpoints/gpt2_345m_ds
time (ms) | save-checkpoint: 69348.96
[exiting program at iteration 50] datetime: 2022-06-29 03:08:05 
/home/sys/STRONGHOLD
cd /home/sys/STRONGHOLD/examples/../DeepSpeedExample-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../DeepSpeedExample-Mega

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-06-29 03:08:16,589] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-06-29 03:08:16,590] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-06-29 03:08:16,590] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.2 GB, percent = 3.5%
beginging get_train_batch_size = <function get_train_batch_size at 0x7febec5d41f0>
train_batch = 4, micro_batch=None
[2022-06-29 03:08:16,729] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-06-29 03:08:16,730] [INFO] [utils.py:823:see_memory_usage] MA 6.39 GB         

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 108654
    total 

after 10 iterations memory (MB) | allocated: 6943.26611328125 | max allocated: 10359.83984375 | reserved: 20720.0 | max reserved: 20720.0
time (ms) | forward: 1284.07 | backward: 4303.04 | backward-backward: 4303.00 | backward-allreduce: 0.00 | optimizer: 6383.62 | batch generator: 1.98
Effective Tera Flops per GPU: 0.47 and total parameters 1.717 B
[2022-06-29 03:10:56,925] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.78 | optimizer_gradients: 184.14 | optimizer_step: 5058.33
[2022-06-29 03:10:56,925] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1190.52 | backward_microstep: 4306.71 | backward_inner_microstep: 4267.75 | backward_allreduce_microstep: 38.86 | step_microstep: 5282.66
[2022-06-29 03:10:56,925] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1190.60 | backward: 4306.71 | backward_inner: 4267.76 | backward_allreduce: 38.87 | step: 5282.66
[2022-06-29 03:11:07,719] [INFO] [logging.p

[2022-06-29 03:13:06,479] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.80 | optimizer_gradients: 182.84 | optimizer_step: 5065.79
[2022-06-29 03:13:06,480] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1191.35 | backward_microstep: 4300.56 | backward_inner_microstep: 4261.50 | backward_allreduce_microstep: 38.98 | step_microstep: 5288.95
[2022-06-29 03:13:06,480] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1191.42 | backward: 4300.56 | backward_inner: 4261.51 | backward_allreduce: 38.98 | step: 5288.95
[2022-06-29 03:13:17,264] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.78 | optimizer_gradients: 185.31 | optimizer_step: 5061.32
[2022-06-29 03:13:17,265] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1192.59 | backward_microstep: 4302.34 | backward_inner_microstep: 4263.34 | backward_allreduce_microstep: 38.91 | step_mi

[2022-06-29 03:15:26,453] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.77 | optimizer_gradients: 182.79 | optimizer_step: 5040.84
[2022-06-29 03:15:26,453] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1192.67 | backward_microstep: 4313.33 | backward_inner_microstep: 4274.22 | backward_allreduce_microstep: 39.02 | step_microstep: 5263.86
[2022-06-29 03:15:26,453] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1192.75 | backward: 4313.33 | backward_inner: 4274.23 | backward_allreduce: 39.02 | step: 5263.86
[2022-06-29 03:15:37,267] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.77 | optimizer_gradients: 183.32 | optimizer_step: 5078.17
[2022-06-29 03:15:37,268] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1192.95 | backward_microstep: 4316.16 | backward_inner_microstep: 4277.07 | backward_allreduce_microstep: 39.01 | step_mi

[2022-06-29 03:17:46,878] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.77 | optimizer_gradients: 183.46 | optimizer_step: 5077.13
[2022-06-29 03:17:46,879] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1191.11 | backward_microstep: 4312.57 | backward_inner_microstep: 4273.44 | backward_allreduce_microstep: 39.04 | step_microstep: 5300.83
[2022-06-29 03:17:46,879] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1191.18 | backward: 4312.57 | backward_inner: 4273.45 | backward_allreduce: 39.05 | step: 5300.83
[2022-06-29 03:17:57,677] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.77 | optimizer_gradients: 185.21 | optimizer_step: 5068.99
[2022-06-29 03:17:57,678] [INFO] [logging.py:69:log_dist] [Rank 0] step=50, skipped=0, lr=[2.34375e-06, 2.34375e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-06-29 03:17:57,678] [INFO] [timer.py:181:stop] 0/50, SamplesPerSec=0.

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-06-29 03:18:11,626] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-06-29 03:18:11,626] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-06-29 03:18:11,627] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.2 GB, percent = 3.5%
beginging get_train_batch_size = <function get_train_batch_size at 0x7f2b55e161f0>
train_batch = None, micro_batch=4
[2022-06-29 03:18:16,790] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-06-29 03:18:16,791] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         M

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 108654
    total 

[2022-06-29 03:20:59,598] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4241.29
[2022-06-29 03:20:59,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1786.28 | backward_microstep: 5081.79 | backward_inner_microstep: 5027.72 | backward_allreduce_microstep: 53.93 | step_microstep: 4270.96
[2022-06-29 03:20:59,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1786.29 | backward: 5081.79 | backward_inner: 5027.74 | backward_allreduce: 53.96 | step: 4270.96
[2022-06-29 03:21:10,740] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4232.15
[2022-06-29 03:21:10,741] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1787.58 | backward_microstep: 5089.63 | backward_inner_microstep: 5035.47 | backward_allreduce_microstep: 54.00 | step_microstep: 4261.46
[2022-06-29 03:21:10,741] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1787.6

[2022-06-29 03:23:36,070] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4259.61
[2022-06-29 03:23:36,071] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1788.22 | backward_microstep: 5090.67 | backward_inner_microstep: 5036.19 | backward_allreduce_microstep: 54.38 | step_microstep: 4289.57
[2022-06-29 03:23:36,071] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1788.24 | backward: 5090.67 | backward_inner: 5036.21 | backward_allreduce: 54.39 | step: 4289.57
[2022-06-29 03:23:47,239] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4240.32
[2022-06-29 03:23:47,240] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1788.54 | backward_microstep: 5107.14 | backward_inner_microstep: 5053.23 | backward_allreduce_microstep: 53.81 | step_microstep: 4269.60
[2022-06-29 03:23:47,240] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1788.5

[2022-06-29 03:26:12,584] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4259.35
[2022-06-29 03:26:12,585] [INFO] [logging.py:69:log_dist] [Rank 0] step=40, skipped=0, lr=[1.8749999999999998e-06, 1.8749999999999998e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-06-29 03:26:12,585] [INFO] [timer.py:181:stop] 0/40, SamplesPerSec=0.3580743926343268
[2022-06-29 03:26:12,585] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1783.62 | backward_microstep: 5074.89 | backward_inner_microstep: 5020.63 | backward_allreduce_microstep: 54.11 | step_microstep: 4289.17
[2022-06-29 03:26:12,585] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1783.65 | backward: 5074.89 | backward_inner: 5020.66 | backward_allreduce: 54.14 | step: 4289.17
 iteration       40/      50 | elapsed time per iteration (ms): 11182.4 | learning rate: 1.875E-06 | lm loss: 9.275822E+00 | number of skipped iterations:   0 | number of nan iterations:   

In [138]:
# Print the relevant information from the log files
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/case3.sh && \
\
pyenv deactivate'


Running:  grep -R 'SamplesPerSec' ./results/log_zero-infinity_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656472684.txt ./results/log_zero-offload_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656472088.txt | awk -v FS='[/,_: ]' '{print $5, $6, $4, $22}' | sort 

Running:  grep -R 'SamplesPerSec' ./results/log_l2l_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656470788.txt ./results/log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656464109.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-10_2022-06-29.1656478517.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-12_2022-06-29.1656478904.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-14_2022-06-29.1656479286.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-15_2022-06-29.1656468888.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-2_2022-06-29.1656476815.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656477260.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-6_2022-06-29.1656477696.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-8_2022-06-

## 2.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B)

In this case, we evaluate performance (elapsed time per iteration, in ms) as the model size increases. As in the previous cases, we vary the model size by changing the number of transformer layers.
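
To make the link between layer count and model size concrete, here is a minimal sketch that counts parameters for a GPT-2 style decoder. The formula is the usual textbook count (weights and biases of attention, MLP and layer norms, plus token and position embeddings); that the training script computes its `total parameters` figure the same way is an assumption. The function name `gpt_param_count` is hypothetical. The padded vocabulary size (50304), hidden size (2048) and sequence length (1024) are taken from the logs in this notebook.

```python
def gpt_param_count(num_layers: int, hidden: int, vocab: int, seq_len: int) -> int:
    """Parameter count of a GPT-2 style decoder (assumed layout, see text)."""
    # Per layer: QKV (3h^2 + 3h), attention output (h^2 + h),
    # MLP up (4h^2 + 4h), MLP down (4h^2 + h), two layer norms (4h).
    per_layer = 12 * hidden * hidden + 13 * hidden
    # Token embedding plus learned position embedding.
    embeddings = vocab * hidden + seq_len * hidden
    # Final layer norm (scale and bias).
    final_layernorm = 2 * hidden
    return num_layers * per_layer + embeddings + final_layernorm


if __name__ == "__main__":
    for layers in (16, 24, 32, 40, 56, 64, 92):
        n = gpt_param_count(layers, hidden=2048, vocab=50304, seq_len=1024)
        print(f"{layers:>3} layers -> {n / 1e9:.3f} B parameters")
```

With these settings, the sketch gives about 1.717 B parameters for 32 layers and 4.738 B for 92 layers, which agrees with the `total parameters` values printed in the logs.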

You should see the `elapsed time per iteration` rise nearly linearly with the number of transformer layers (which represents model size), demonstrating STRONGHOLD's scalability.
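
One way to check the linear trend yourself is to fit a straight line to the measurements. The sketch below uses the iteration-30 values of `elapsed time per iteration` from the sample run recorded in this notebook; pairing each log block with a layer count follows the order of the `run.sh` calls and the logged parameter counts, which is an inference rather than something the logs state directly. The helper `fit_line` is hypothetical and implements an ordinary least-squares fit in plain Python.

```python
# (number of layers, elapsed ms per iteration at iteration 30) from the sample run.
samples = [
    (16, 2730.6),
    (24, 4123.0),
    (40, 7918.5),
    (56, 12019.5),
    (64, 14303.4),
    (92, 21290.7),
]


def fit_line(points):
    """Ordinary least squares; returns (slope, intercept, r_squared)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in points)
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    return slope, intercept, 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    slope, intercept, r2 = fit_line(samples)
    print(f"slope: {slope:.1f} ms per layer, intercept: {intercept:.1f} ms, R^2: {r2:.4f}")
```

An R² close to 1 indicates that the iteration time grows almost proportionally with the number of layers. Your own numbers will differ with hardware and system load.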


In [112]:
######
# Check whether processes launched by other reviewers are still running, to avoid overloading the GPU.
# Just run this cell; there is no need to change anything in it.
#
# `ps aux` in docker container. 
######
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c 'export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && export PYENV_ROOT="/root/.pyenv" && export PATH="$PYENV_ROOT/bin:$PATH" && eval "$(pyenv init -)" && eval "$(pyenv virtualenv-init -)" && pyenv activate py3.9.10 && ps aux && pyenv deactivate'

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2520    72 pts/0    Ss+  Jun28   0:00 sleep infinit
root           7  0.0  0.0   4348   696 pts/1    Ss+  Jun28   0:00 /bin/bash
root         159  0.0  0.0   4348   804 pts/2    Ss+  Jun28   0:00 /bin/bash
root       35813  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35825  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35855  0.0  0.0      0     0 ?        Z    01:01   0:01 [python] <def
root       35856  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35864  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35865  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       39751  0.0  0.0   3976  3092 pts/3    Ss+  03:28   0:00 /bin/bash -c 
root       39909  0.0  0.0   5892  2784 pts/3    R+   03:28   0:00 ps aux


### The following results correspond to Figure 8b in the submitted paper. Please refer to Section VI.B on page 9 for more details. The run takes around 60 minutes.

In [114]:
# Run STRONGHOLD with 92, 64, 56, 40, 24 and 16 transformer layers (hidden size 2048, window size 15)
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/run.sh -m "stronghold" -l 92 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 64 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 56 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 40 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 24 -h 2048 -w 15 && \
./examples/run.sh -m "stronghold" -l 16 -h 2048 -w 15 && \
\
pyenv deactivate'

cd /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/sc22-gpt-sh.sh 92 2048 16 1024 4 15 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_stronghold_l-92_hs-2048_bs-4_ws-15_2022-06-29.1656473293.txt && cd -
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py': No such file or directory
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py': No such file or directory
PYTHONGIL=1 python pretrain_gpt.py --num-layers 92 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --micro-batch-size 4 --global-batch-size 4 --max-position-embeddings 1024 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save ./checkpoints/gpt2 --load ./checkpoints/gpt2 --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-fi

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 21290.7 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.016363E+01 | loss scale: 1.0 | grad norm: inf | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.29 and total parameters 4.738 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 21.290717983245848;  SamplesPerSecond: 0.1878752986699505
time (ms) | e2e-time: 21290.74 | forward-compute: 3177.43 | backward-compute: 18102.28 | backward-embedding-all-reduce: 0.02 | optimizer: 2.04 | batch-generator: 1.41 | offloading-func-call-overhead: 51.72 | offloading-fwd-overhead: 2881.05 | offloading-bwd-overhead: 93.44 | offloading-fwd-2gpu-overhead: 1373.79 | offloading-fwd-2cpu-overhead: 1505.35 | offloading-bwd-2gpu-overhead: 3.89 | offloading-bwd-2cpu-overhead: 87.10
 iteration       40/      50 | elapsed time per iteration (ms): 21211.0 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.961718

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.156 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 14303.4 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.013971E+01 | loss scale: 1.0 | grad norm: 51.268 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.62 and total parameters 3.328 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 14.303354072570801;  SamplesPerSecond: 0.27965468656548914
time (ms) | e2e-time: 14303.18 | forward-compute: 2014.60 | backward-compute: 12277.71 | backward-embedding-all-reduce: 0.02 | optimizer: 2.25 | batch-generator: 1.35 | offloading-func-call-overhead: 35.15 | offloading-fwd-overhead: 1800.35 | offloading-bwd-overhead: 104.04 | offloading-fwd-2gpu-overhead: 877.13 | offloading-fwd-2cpu-overhead: 921.88 | offloading-bwd-2gpu-overhead: 2.51 | offloading-bwd-2cpu-overhead: 98.38
 iteration       40/      50 | elapsed time per iteration (ms): 14053.7 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.906

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 12019.5 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.894191E+00 | loss scale: 1.0 | grad norm: 728.236 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.97 and total parameters 2.925 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 12.019475555419922;  SamplesPerSecond: 0.3327932222630368
time (ms) | e2e-time: 12019.47 | forward-compute: 1702.20 | backward-compute: 10306.51 | backward-embedding-all-reduce: 0.02 | optimizer: 2.27 | batch-generator: 1.35 | offloading-func-call-overhead: 37.43 | offloading-fwd-overhead: 1515.65 | offloading-bwd-overhead: 4.56 | offloading-fwd-2gpu-overhead: 720.90 | offloading-fwd-2cpu-overhead: 793.65 | offloading-bwd-2gpu-overhead: 1.99 | offloading-bwd-2cpu-overhead: 0.86
 iteration       40/      50 | elapsed time per iteration (ms): 12081.2 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.664505

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7918.5 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.984436E+00 | loss scale: 1.0 | grad norm: 3.623 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.77 and total parameters 2.119 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.918460369110107;  SamplesPerSecond: 0.5051487048674246
time (ms) | e2e-time: 7918.43 | forward-compute: 1099.66 | backward-compute: 6808.02 | backward-embedding-all-reduce: 0.02 | optimizer: 2.41 | batch-generator: 1.34 | offloading-func-call-overhead: 19.99 | offloading-fwd-overhead: 967.45 | offloading-bwd-overhead: 198.33 | offloading-fwd-2gpu-overhead: 432.43 | offloading-fwd-2cpu-overhead: 534.23 | offloading-bwd-2gpu-overhead: 1.20 | offloading-bwd-2cpu-overhead: 195.94
 iteration       40/      50 | elapsed time per iteration (ms): 8307.6 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.544877E+00

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 4123.0 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.153189E+01 | loss scale: 1.0 | grad norm: 28.039 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 10.44 and total parameters 1.314 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 4.122998070716858;  SamplesPerSecond: 0.9701678078409888
time (ms) | e2e-time: 4122.98 | forward-compute: 475.72 | backward-compute: 3636.76 | backward-embedding-all-reduce: 0.02 | optimizer: 2.50 | batch-generator: 1.23 | offloading-func-call-overhead: 10.40 | offloading-fwd-overhead: 387.36 | offloading-bwd-overhead: 1.65 | offloading-fwd-2gpu-overhead: 12.51 | offloading-fwd-2cpu-overhead: 374.39 | offloading-bwd-2gpu-overhead: 0.39 | offloading-bwd-2cpu-overhead: 0.48
 iteration       40/      50 | elapsed time per iteration (ms): 4068.7 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.516344E+00 | l

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 2730.6 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.134986E+01 | loss scale: 1.0 | grad norm: inf | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 10.93 and total parameters 0.911 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 2.730643391609192;  SamplesPerSecond: 1.4648562358202202
time (ms) | e2e-time: 2730.59 | forward-compute: 35.96 | backward-compute: 2684.20 | backward-embedding-all-reduce: 0.02 | optimizer: 2.25 | batch-generator: 1.16 | offloading-func-call-overhead: 4.04 | offloading-fwd-overhead: 1.04 | offloading-bwd-overhead: 0.71 | offloading-fwd-2gpu-overhead: 0.02 | offloading-fwd-2cpu-overhead: 0.74 | offloading-bwd-2gpu-overhead: 0.05 | offloading-bwd-2cpu-overhead: 0.26
 iteration       40/      50 | elapsed time per iteration (ms): 2661.4 | learning rate: 1.875E-06 | global batch size:     4 | loss scale: 1.0 | grad norm: nan | 

In [150]:
# Print the relevant information from the log files
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/case4.sh && \
\
pyenv deactivate'


Running:  grep -R 'elapsed time per iteration' ./results/log_stronghold_l-16_hs-2048_bs-4_ws-15_2022-06-29.1656476653.txt ./results/log_stronghold_l-24_hs-2048_bs-4_ws-15_2022-06-29.1656476410.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-15_2022-06-29.1656468888.txt ./results/log_stronghold_l-40_hs-2048_bs-4_ws-15_2022-06-29.1656475948.txt ./results/log_stronghold_l-48_hs-2048_bs-4_ws-15_2022-06-29.1656469235.txt ./results/log_stronghold_l-56_hs-2048_bs-4_ws-15_2022-06-29.1656475269.txt ./results/log_stronghold_l-64_hs-2048_bs-4_ws-15_2022-06-29.1656474476.txt ./results/log_stronghold_l-78_hs-2048_bs-4_ws-15_2022-06-29.1656469802.txt ./results/log_stronghold_l-92_hs-2048_bs-4_ws-15_2022-06-29.1656473293.txt | awk -v FS='[/,_:|]' '{print $5, $6, $8, $4, $12, $13}' | uniq | sort 

l-16 hs-2048 ws-15 stronghold  elapsed time per iteration (ms)  2627.6 
l-16 hs-2048 ws-15 stronghold  elapsed time per iteration (ms)  2661.4 
l-16 hs-2048 ws-15 stronghold  elapsed time per iteration
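
If you prefer to post-process the logs in Python rather than with `grep` and `awk`, the following sketch extracts the same fields. It is not the contents of `case4.sh`; the regular expressions and the helper names `parse_log_name` and `parse_iteration_line` are hypothetical, written against the file-name pattern and log-line format visible in this notebook.

```python
import re

# Matches names such as log_stronghold_l-16_hs-2048_bs-4_ws-15_2022-06-29.1656476653.txt
NAME_RE = re.compile(
    r"log_(?P<method>[a-z0-9-]+)_l-(?P<layers>\d+)_hs-(?P<hidden>\d+)"
    r"_bs-(?P<batch>\d+)_ws-(?P<window>\d+)_"
)

# Matches lines such as " iteration  30/  50 | elapsed time per iteration (ms): 2730.6 | ..."
ITER_RE = re.compile(
    r"iteration\s+(\d+)/\s*(\d+)\s*\|\s*elapsed time per iteration \(ms\):\s*([\d.]+)"
)


def parse_log_name(path):
    """Return the run settings encoded in a log file name, or None if it does not match."""
    m = NAME_RE.search(path)
    if m is None:
        return None
    info = m.groupdict()
    return {k: (v if k == "method" else int(v)) for k, v in info.items()}


def parse_iteration_line(line):
    """Return (iteration, total_iterations, elapsed_ms), or None for other lines."""
    m = ITER_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), float(m.group(3))


if __name__ == "__main__":
    name = "./results/log_stronghold_l-16_hs-2048_bs-4_ws-15_2022-06-29.1656476653.txt"
    line = (" iteration       30/      50 | elapsed time per iteration (ms): 2730.6 |"
            " learning rate: 1.406E-06 | global batch size:     4 |")
    print(parse_log_name(name))
    print(parse_iteration_line(line))
```

To process a whole run, apply `parse_iteration_line` to each line of a log file and keep the non-`None` results.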

## 2.5 CASE - Impact of working window size (Figure 9 in Section VI.C)

The working window size affects throughput. A larger window can better overlap GPU computation with data transfer, leading to higher training throughput, but it also occupies more GPU memory.

This case evaluates the impact of the working window size for STRONGHOLD on a 1.7B-parameter model. You will see that, at first, a larger window brings a clear benefit; beyond a certain point, enlarging the window has no further effect because the current window is already large enough to hide the data transfer.

Note: the bandwidth restriction in the virtual machine might slightly reduce STRONGHOLD's performance.
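
The effect of the window size can be summarized by converting iteration time to throughput and looking at the gain from each step. The sketch below does this for the iteration-30 timings of the sample run recorded in this notebook (batch size 4). Mapping each log block to a window size follows the order of the `run.sh` calls, which is an inference; only window sizes 2 to 10 are included because the remaining outputs are cut off in this excerpt. The helper `samples_per_second` is hypothetical.

```python
BATCH_SIZE = 4  # samples per iteration, as in the logs (global batch size: 4)

# window size -> elapsed ms per iteration at iteration 30 (sample run)
elapsed_ms = {2: 7859.4, 4: 7751.9, 6: 7237.4, 8: 7098.5, 10: 6889.1}


def samples_per_second(ms_per_iteration, batch_size=BATCH_SIZE):
    """Throughput in samples per second for one iteration time in milliseconds."""
    return batch_size / (ms_per_iteration / 1000.0)


if __name__ == "__main__":
    previous = None
    for window in sorted(elapsed_ms):
        throughput = samples_per_second(elapsed_ms[window])
        if previous is None:
            gain = "baseline"
        else:
            gain = f"{(throughput / previous - 1) * 100:+.1f}% vs previous"
        print(f"window {window:>2}: {throughput:.4f} samples/s ({gain})")
        previous = throughput
```

For window size 2 this gives roughly 0.509 samples per second, in line with the `SamplesPerSecond` value in the log. Smaller step-to-step gains at larger window sizes would indicate that data transfer is already mostly hidden behind computation.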

In [116]:
######
# Check whether processes launched by other reviewers are still running, to avoid overloading the GPU.
# Just run this cell; there is no need to change anything in it.
#
# `ps aux` in docker container. 
######
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c 'export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && export PYENV_ROOT="/root/.pyenv" && export PATH="$PYENV_ROOT/bin:$PATH" && eval "$(pyenv init -)" && eval "$(pyenv virtualenv-init -)" && pyenv activate py3.9.10 && ps aux && pyenv deactivate'

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2520    72 pts/0    Ss+  Jun28   0:00 sleep infinit
root           7  0.0  0.0   4348   696 pts/1    Ss+  Jun28   0:00 /bin/bash
root         159  0.0  0.0   4348   804 pts/2    Ss+  Jun28   0:00 /bin/bash
root       35813  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35825  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35855  0.0  0.0      0     0 ?        Z    01:01   0:01 [python] <def
root       35856  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35864  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       35865  0.0  0.0      0     0 ?        Z    01:01   0:00 [python] <def
root       42636  0.0  0.0   3976  3192 pts/3    Ss+  04:26   0:00 /bin/bash -c 
root       42794  0.0  0.0   5892  2824 pts/3    R+   04:26   0:00 ps aux


### The following results correspond to Figure 9 in the submitted paper. Please refer to Section VI.C on page 10 for more details. The run takes around 48 minutes.

In [118]:
# Run STRONGHOLD on the 32-layer model (hidden size 2048) with window sizes 2, 4, 6, 8, 10, 12 and 14
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 2 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 4 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 6 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 8 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 10 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 12 && \
./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 14 && \
\
pyenv deactivate'

cd /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/sc22-gpt-sh.sh 32 2048 16 1024 4 2 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_stronghold_l-32_hs-2048_bs-4_ws-2_2022-06-29.1656476815.txt && cd -
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py': No such file or directory
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py': No such file or directory
PYTHONGIL=1 python pretrain_gpt.py --num-layers 32 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --micro-batch-size 4 --global-batch-size 4 --max-position-embeddings 1024 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save ./checkpoints/gpt2 --load ./checkpoints/gpt2 --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-file

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7859.4 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.046708E+01 | loss scale: 1.0 | grad norm: 3.165 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.16 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.859429907798767;  SamplesPerSecond: 0.5089427664506396
time (ms) | e2e-time: 7859.51 | forward-compute: 1146.16 | backward-compute: 6702.60 | backward-embedding-all-reduce: 0.02 | optimizer: 2.48 | batch-generator: 1.15 | offloading-func-call-overhead: 16.28 | offloading-fwd-overhead: 1059.16 | offloading-bwd-overhead: 5453.51 | offloading-fwd-2gpu-overhead: 313.41 | offloading-fwd-2cpu-overhead: 745.14 | offloading-bwd-2gpu-overhead: 436.98 | offloading-bwd-2cpu-overhead: 5015.58
 iteration       40/      50 | elapsed time per iteration (ms): 7988.5 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.04604

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7751.9 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.526058E+00 | loss scale: 1.0 | grad norm: 2.373 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.26 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.7519118070602415;  SamplesPerSecond: 0.5160017424807263
time (ms) | e2e-time: 7751.91 | forward-compute: 1044.22 | backward-compute: 6697.02 | backward-embedding-all-reduce: 0.02 | optimizer: 2.46 | batch-generator: 1.14 | offloading-func-call-overhead: 15.88 | offloading-fwd-overhead: 935.83 | offloading-bwd-overhead: 5124.38 | offloading-fwd-2gpu-overhead: 171.21 | offloading-fwd-2cpu-overhead: 764.01 | offloading-bwd-2gpu-overhead: 146.77 | offloading-bwd-2cpu-overhead: 4976.73
 iteration       40/      50 | elapsed time per iteration (ms): 7587.9 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.00378

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7237.4 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.025148E+01 | loss scale: 1.0 | grad norm: 6.158 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.77 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.237360525131225;  SamplesPerSecond: 0.5526876802820975
time (ms) | e2e-time: 7237.37 | forward-compute: 953.77 | backward-compute: 6272.98 | backward-embedding-all-reduce: 0.02 | optimizer: 2.48 | batch-generator: 1.15 | offloading-func-call-overhead: 16.57 | offloading-fwd-overhead: 849.37 | offloading-bwd-overhead: 4840.10 | offloading-fwd-2gpu-overhead: 163.49 | offloading-fwd-2cpu-overhead: 685.11 | offloading-bwd-2gpu-overhead: 253.02 | offloading-bwd-2cpu-overhead: 4586.05
 iteration       40/      50 | elapsed time per iteration (ms): 7271.3 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.813936E

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7098.5 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.751514E+00 | loss scale: 1.0 | grad norm: 11.207 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.92 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.09846043586731;  SamplesPerSecond: 0.5635024715766087
time (ms) | e2e-time: 7098.47 | forward-compute: 906.85 | backward-compute: 6181.02 | backward-embedding-all-reduce: 0.02 | optimizer: 2.45 | batch-generator: 1.13 | offloading-func-call-overhead: 17.24 | offloading-fwd-overhead: 798.60 | offloading-bwd-overhead: 4122.61 | offloading-fwd-2gpu-overhead: 154.88 | offloading-fwd-2cpu-overhead: 643.09 | offloading-bwd-2gpu-overhead: 28.08 | offloading-bwd-2cpu-overhead: 4093.53
 iteration       40/      50 | elapsed time per iteration (ms): 7053.0 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.805072E+

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6889.1 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.021579E+01 | loss scale: 1.0 | grad norm: 36.416 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.17 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.889051866531372;  SamplesPerSecond: 0.5806314246860206
time (ms) | e2e-time: 6889.04 | forward-compute: 828.02 | backward-compute: 6050.47 | backward-embedding-all-reduce: 0.02 | optimizer: 2.45 | batch-generator: 1.21 | offloading-func-call-overhead: 18.32 | offloading-fwd-overhead: 715.47 | offloading-bwd-overhead: 1466.04 | offloading-fwd-2gpu-overhead: 133.64 | offloading-fwd-2cpu-overhead: 581.08 | offloading-bwd-2gpu-overhead: 1.11 | offloading-bwd-2cpu-overhead: 1463.91
 iteration       40/      50 | elapsed time per iteration (ms): 6822.4 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.964854E+

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.161 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6584.1 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.049995E+01 | loss scale: 1.0 | grad norm: 22.707 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.54 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.584106826782227;  SamplesPerSecond: 0.6075235571405322
time (ms) | e2e-time: 6584.12 | forward-compute: 825.66 | backward-compute: 5747.81 | backward-embedding-all-reduce: 0.02 | optimizer: 2.46 | batch-generator: 1.23 | offloading-func-call-overhead: 16.53 | offloading-fwd-overhead: 710.61 | offloading-bwd-overhead: 1067.19 | offloading-fwd-2gpu-overhead: 185.33 | offloading-fwd-2cpu-overhead: 524.64 | offloading-bwd-2gpu-overhead: 0.94 | offloading-bwd-2cpu-overhead: 1065.31
 iteration       40/      50 | elapsed time per iteration (ms): 6765.6 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.011405E+

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.160 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6187.5 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.023255E+01 | loss scale: 1.0 | grad norm: 33.184 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 9.09 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.1875091075897215;  SamplesPerSecond: 0.6464636949129531
time (ms) | e2e-time: 6187.61 | forward-compute: 784.88 | backward-compute: 5391.95 | backward-embedding-all-reduce: 0.02 | optimizer: 2.42 | batch-generator: 1.32 | offloading-func-call-overhead: 14.30 | offloading-fwd-overhead: 692.10 | offloading-bwd-overhead: 319.71 | offloading-fwd-2gpu-overhead: 155.65 | offloading-fwd-2cpu-overhead: 535.76 | offloading-bwd-2gpu-overhead: 0.95 | offloading-bwd-2cpu-overhead: 317.85
 iteration       40/      50 | elapsed time per iteration (ms): 6515.8 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.982765E+0
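
The `time (ms) | ...` lines above split each iteration into named phases. The following is a minimal sketch of how such a line could be parsed for further analysis. `parse_timing_line` is a hypothetical helper (it is not part of STRONGHOLD), and the sample line is abbreviated from the first run shown above. Treating `offloading-bwd-overhead` as a portion of `backward-compute` is an assumption that the log itself does not confirm. The sketch also checks that the reported `SamplesPerSecond` matches `SamplesPerStep / IterationTime`.

```python
def parse_timing_line(line):
    """Turn 'time (ms) | key: value | ...' into a {key: value} dict of floats."""
    timings = {}
    # The first '|'-separated field is the 'time (ms)' label, so skip it.
    for field in line.split("|")[1:]:
        key, sep, value = field.partition(":")
        if sep:  # ignore empty or malformed fields
            timings[key.strip()] = float(value)
    return timings


# Abbreviated copy of the first run's timing line (values taken from the log above).
sample = (
    "time (ms) | e2e-time: 7098.47 | forward-compute: 906.85 | "
    "backward-compute: 6181.02 | offloading-fwd-overhead: 798.60 | "
    "offloading-bwd-overhead: 4122.61"
)
timings = parse_timing_line(sample)

# Assumption: the backward offloading overhead is counted inside backward-compute.
bwd_ratio = timings["offloading-bwd-overhead"] / timings["backward-compute"]
print(f"backward offloading overhead / backward compute: {bwd_ratio:.2f}")

# Throughput check: SamplesPerStep / IterationTime (seconds), values from the log.
samples_per_second = 4 / 7.09846043586731
print(f"samples per second: {samples_per_second:.4f}")
```

Applying the same parser to the timing line of each run makes it easy to compare how the `offloading-bwd-overhead` entry changes from one configuration to the next.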

In [159]:
# To print the relevant information from log files
!docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
./examples/case5.sh && \
\
pyenv deactivate'


Running:  grep -R 'elapsed time per iteration' ./results/log_stronghold_l-32_hs-2048_bs-4_ws-10_2022-06-29.1656478517.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-12_2022-06-29.1656478904.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-14_2022-06-29.1656479286.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-15_2022-06-29.1656468888.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-2_2022-06-29.1656476815.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-4_2022-06-29.1656477260.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-6_2022-06-29.1656477696.txt ./results/log_stronghold_l-32_hs-2048_bs-4_ws-8_2022-06-29.1656478111.txt | awk -v FS='[/,_:|]' '{print $8, $4, $12, $13}' | uniq | sort 

ws-10 stronghold  elapsed time per iteration (ms)  6533.4 
ws-10 stronghold  elapsed time per iteration (ms)  6787.8 
ws-10 stronghold  elapsed time per iteration (ms)  6822.4 
ws-10 stronghold  elapsed time per iteration (ms)  6889.1 
ws-10 stronghold  elapsed time per iteration (ms) 
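
The `grep | awk` pipeline above lists one iteration time per line, grouped by the window size encoded in the log file name. As a hedged alternative, the sketch below averages those times per window size in Python. `mean_iteration_time` is a hypothetical helper, and the sample lines are abbreviated `grep`-style records assembled from the `ws-10` file name and the four `ws-10` values printed above; the rest of each log line is omitted.

```python
import re
from collections import defaultdict
from statistics import mean

# Window size as it appears in the log file names, e.g. '..._ws-10_2022-06-29...'.
WS_RE = re.compile(r"_ws-(\d+)_")
# Per-iteration time as it appears in the training log lines.
TIME_RE = re.compile(r"elapsed time per iteration \(ms\):\s*([\d.]+)")


def mean_iteration_time(grep_lines):
    """Map window size -> mean elapsed time per iteration (ms)."""
    by_ws = defaultdict(list)
    for line in grep_lines:
        ws, elapsed = WS_RE.search(line), TIME_RE.search(line)
        if ws and elapsed:  # skip lines missing either field
            by_ws[int(ws.group(1))].append(float(elapsed.group(1)))
    # Integer keys sort numerically, so window size 2 comes before 10.
    return {ws: mean(times) for ws, times in sorted(by_ws.items())}


# Abbreviated sample records (file name and values taken from the output above).
log_name = "./results/log_stronghold_l-32_hs-2048_bs-4_ws-10_2022-06-29.1656478517.txt"
sample_lines = [
    f"{log_name}: elapsed time per iteration (ms): {value} |"
    for value in (6533.4, 6787.8, 6822.4, 6889.1)
]
result = mean_iteration_time(sample_lines)
print(result)
```

Keying the result by an integer window size avoids the lexicographic ordering of the shell `sort`, which places `ws-10` ahead of `ws-2`.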

-----
# The end of this Artifact Evaluation
-----

#### Many thanks for your review, time, and effort on this artifact evaluation.  <br> Thank you also for your understanding and for bearing with the inconveniences in this notebook.

The repository will be made public as soon as possible. Users can reproduce the other experimental figures using the released Docker image or the source code.
