# Artifact Evaluation 
> #### STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training (pap381s2)

Deep neural networks (DNNs) with billion-scale parameters have demonstrated impressive performance in solving many tasks. Unfortunately, training a billion-scale DNN requires high-performance GPU servers that are too expensive to purchase and maintain. Existing solutions for enabling larger DNN training with limited resources are inadequate because they suffer from high training time overhead. 

We present STRONGHOLD, a better approach for enabling large DNN model training by dynamically offloading data to the CPU RAM and using the secondary storage (e.g., an SSD drive). It maintains a working window to overlap the GPU computation with CPU-GPU data movement carefully and exploits the multi-core CPU for optimizer update. Compared to the state-of-the-art offloading-based solutions, STRONGHOLD improves the trainable model size by 1.9x∼6.5x on a 32GB V100 GPU, with 1.2x∼3.7x improvement on the training throughput.

## Preliminaries

This interactive Jupyter notebook showcases the performance evaluation and comparison between STRONGHOLD and the existing state-of-the-art methods (e.g., Megatron-LM, L2L, ZeRO-Offload and ZeRO-Infinity). 


<!-- The main results of our works are to compare the performance of our approach (i.e. STRONGHOLD) with the existing state-of-the-art methods. The evaluation criteria include throughput, model sizes, scalability and inference time.

STRONGHOLD we proposed aims to increase the trainable model size through dynamic offloading. In this artifact, we show the performance of our apporach (i.e. STRONGHOLD) against the existing state-of-the-art methods:  Megatron-LM, L2L, Zero-offload and Zero-Infinity. The evaluation criteria includes throughput, model sizes, scalability and inference time. We will go through all the evalutions on the largest trainable deep neural network (DNN) size on a single GPU (NVIDIA 32GB V100). -->

- **Megatron-LM [1]**: a library for training Transformer-based models, released by NVIDIA and optimized for tensor parallelism. We use the `tags/v2.6` version as the reference for throughput and trainable model size.

- **L2L [2]**: an offloading strategy, keeping only one Transformer layer in the GPU memory at a time and offloading model parameters between the GPU memory and CPU RAM sequentially. Since L2L stores the optimizer states on the GPU memory, it is limited mainly by the capacity of GPU memory.

- **ZeRO-Offload [3]**: a training method that statically stores the model states in the GPU memory and the optimizer states in the CPU RAM. It uses CPU compute cycles to update the model parameters through a CPU version of the Adam optimizer.

- **ZeRO-Infinity [4]**: a training method based on ZeRO-3 [5], utilizing GPU, CPU RAM and/or NVMe secondary storage. In this notebook, we compare STRONGHOLD against ZeRO-Infinity using only the GPU and CPU RAM, because sustained high NVMe I/O is mistakenly flagged as disk overload by the cloud platform, which causes the underlying hypervisor to kill the process.


The metrics assessed in this notebook include throughput, model size and scalability, as follows:

- The **largest trainable model size**, corresponding to **Fig.6a** in the paper.
- The **throughput comparison on respective maximum trainable model size** of each baseline, corresponding to **Fig.7a** in the paper.
- The **throughput comparison on the maximum trainable model size** of Megatron-LM, corresponding to **Fig.8a** in the paper.
- The **nearly linear scaling** on iteration time of STRONGHOLD, corresponding to **Fig.8b** in the paper.
- The **impact of GPU working window size** on the performance of STRONGHOLD, corresponding to **Fig.9** in the paper.

<!-- - The inference time of running the different model sizes of STRONGHOLD.
- Changing two key settings of STONGHOLD (i.e **window size** and **multi-stream**) and evaluate the changes of the throughput. -->

PS: This notebook runs on a rented ECS instance (a virtual machine) with one 32GB V100 GPU, 90GB of CPU RAM and 12 CPU cores. This hardware configuration differs from the one used in our paper. We therefore reproduce equivalent cases: the absolute values may differ from those reported in the paper, but the relative ratios remain the same.

<!-- The absolute values might be distinct, but the relevant ratio keeps the same. -->


### References
> [1] Megatron-LM. https://github.com/NVIDIA/Megatron-LM. <br>
> [2] B. Pudipeddi et al., “Training large neural networks with constant memory using a new execution algorithm,” arXiv, 2020. <br>
> [3] J. Ren et al., “ZeRO-Offload: Democratizing billion-scale model training,” in OSDI, 2021. <br>
> [4] S. Rajbhandari, O. Ruwase et al., “ZeRO-Infinity: Breaking the GPU memory wall for extreme scale deep learning,” in SC, 2021, pp. 1–14. <br>
> [5] S. Rajbhandari, J. Rasley, O. Ruwase et al., “ZeRO: Memory optimizations toward training trillion parameter models,” in SC, 2020, pp. 1–16.



## Important Notes

**A few bash scripts take more than half an hour to complete; please wait for the results before executing the next one.**

Results may take longer when the machine is overloaded, which can happen if multiple reviewers run the scripts at the same time. To avoid this, check for running processes with `ps aux` before executing any other script.

The experiments are customisable: reviewers can edit the Jupyter notebook on the spot. Change the arguments of the docker scripts we provide and re-run the cells using **Cell > Run Cells** from the menu.

<!-- All experiments run on a single 32GB V100 GPU in order to briefly show that all of the evaultion results in our paper can be reproduced. Since multi-GPU running is not used, some of the results are not included in this artifact, but researchers can surely follow the similar ways to reproduce those results. -->


### Links to The Paper

**For each step, we indicate which Section or Figure in the submitted paper the evaluation corresponds to.**

The main results presented in this notebook correspond to the submitted paper's Figures 6, 7, 8, and 9.


# 1. Basic Setup

### 1.1 Let's ensure the docker container (NAME: aetesting) is launched. 

The `docker ps` command shows the currently running docker containers. The next cell in this notebook should output information similar to the following. If it does not, please run the `!docker stop aetesting` and `!docker start aetesting` commands in a cell to restart the container.

```
CONTAINER ID   IMAGE                    COMMAND       CREATED         STATUS         PORTS     NAMES
d67abb7f151c   strongh/sc22-ae:latest   "/bin/bash"   3 minutes ago   Up 3 minutes             aetesting
```

In [1]:
# to define a magic function for launching scripts in this notebook
%alias docker_exec docker exec -w /home/sys/STRONGHOLD -it aetesting /bin/bash -c '\
export PYENV_VIRTUALENV_DISABLE_PROMPT=1 && \
export PYENV_ROOT="/root/.pyenv" && \
export PATH="$PYENV_ROOT/bin:$PATH" && \
eval "$(pyenv init -)" && \
eval "$(pyenv virtualenv-init -)" && \
pyenv activate py3.9.10 && \
\
%l && \
\
pyenv deactivate'

# to show the current running docker containers
!docker ps

CONTAINER ID   IMAGE                    COMMAND       CREATED          STATUS          PORTS     NAMES
f31d08089dcc   strongh/sc22-ae:latest   "/bin/bash"   39 minutes ago   Up 39 minutes             aetesting
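
The check above can also be done programmatically. The following is a minimal sketch, assuming the output of `docker ps` has been captured as text; the helper name `running_container_names` is hypothetical and is not part of the artifact. It relies only on the fact that `NAMES` is the last column of the listing.

```python
from typing import List


def running_container_names(docker_ps_output: str) -> List[str]:
    """Return the NAMES column from the text printed by `docker ps`."""
    lines = [line for line in docker_ps_output.splitlines() if line.strip()]
    # The first non-empty line is the header row; NAMES is the last column,
    # so the last whitespace-separated token of each row is the container name.
    return [line.split()[-1] for line in lines[1:]]


# Sample text copied from the expected output of the cell above.
sample = """\
CONTAINER ID   IMAGE                    COMMAND       CREATED          STATUS          PORTS     NAMES
f31d08089dcc   strongh/sc22-ae:latest   "/bin/bash"   39 minutes ago   Up 39 minutes             aetesting
"""

if "aetesting" in running_container_names(sample):
    print("container aetesting is running")
else:
    print("container aetesting is not running; restart it before continuing")
```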


### 1.2 Let's check that the runtime environment in the docker container works. 

`docker exec` executes a command in a running container. The following cell executes a command whose output should look like the following:

```
root
/home/sys/STRONGHOLD
torch                         1.10.0a0+git71f889c /root/.pyenv/......3.9.10/lib/python3.9/site-packages
torchvision                   0.11.0a0+05eae32
```

In [36]:
%docker_exec whoami && pwd && pip list | grep torch

root
/home/sys/STRONGHOLD
torch                         1.10.0a0+git71f889c /root/.pyenv/versions/3.9.10/envs/py3.9.10/lib/python3.9/site-packages
torchvision                   0.11.0a0+05eae32


### 1.3 How to run customized commands in the docker container?

To run a custom command in the docker container, add `docker_exec` as a prefix to the command. 

For example, to run `nvidia-smi` in the container, type `docker_exec nvidia-smi`.

# 2. Evaluation
> The Jupyter notebook runs on a VM with a 32GB V100 on a public cloud platform. The environment is different from that in the submitted paper. **Hence, the output (absolute values) may differ from the results in the paper, but the relative relationships remain the same!**

Next, we use five cases to test performance separately on the largest trainable model size, throughput and scalability. Each case matches a figure in the submitted paper.

To reuse the log files produced by previous cases, we recommend running these cases one by one in order, which reduces the total execution time to **about 5 hours**. 

- 2.1 CASE - The largest trainable model size (Figure 6a in Section VI.A) - around 130 mins
- 2.2 CASE - Throughput on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B) - around 40 mins
- 2.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B) - around 45 mins
- 2.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B) - around 60 mins
- 2.5 CASE - Impact of working window size (Figure 9 in Section VI.C) - around 50 mins

**All log files will be stored in `/home/sys/STRONGHOLD/results`** with names of the form `log_[method]_l-[layers]_hs-[hidden size]_bs-[BATCH_SIZE]_ws-[WINDOW_SIZE]_[date].txt`. At the end of each execution, we print the core content of the log files for you via `grep` and `awk`.
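
To illustrate the naming scheme, here is a minimal sketch that splits a log file name into its fields. The pattern follows the file names that appear in the run output later in this notebook (which use `hs-` for the hidden size and append a timestamp to the date); the helper `parse_log_name` and the field names are hypothetical and are not part of the artifact.

```python
import re
from typing import Dict, Optional

# Pattern modelled on names such as
# log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-07-02.1656730240.txt
LOG_NAME = re.compile(
    r"log_(?P<method>.+?)_l-(?P<layers>\d+)_hs-(?P<hidden>\d+)"
    r"_bs-(?P<batch>\d+)_ws-(?P<window>\d+)_(?P<date>.+)\.txt$"
)


def parse_log_name(filename: str) -> Optional[Dict[str, object]]:
    """Return the fields encoded in a log file name, or None if it does not match."""
    match = LOG_NAME.search(filename)
    if match is None:
        return None
    fields: Dict[str, object] = dict(match.groupdict())
    # Numeric fields are converted to int; method and date stay as strings.
    for key in ("layers", "hidden", "batch", "window"):
        fields[key] = int(str(fields[key]))
    return fields


info = parse_log_name("log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-07-02.1656730240.txt")
print(info)
```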

**Launch script** `./examples/run.sh -m [method] -l [layers] -h [hidden size] -b [batch size] -w [window size]` accepts five arguments, where `[method]` takes one of the values `megatron-lm`, `l2l`, `zero-offload`, `zero-infinity`, `stronghold` and `all`. Use `all` to evaluate all approaches automatically. The default values for `[layers]`, `[hidden size]`, `[batch size]` and `[window size]` are 16, 2048, 4 and 4, respectively.
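
As a rough sketch of how the arguments and their defaults combine, the helper below assembles the command line for a given method. The function `build_run_command` is hypothetical (it is not shipped with the artifact); the flags, allowed method names and default values are the ones listed above.

```python
import shlex
from typing import List

METHODS = ("megatron-lm", "l2l", "zero-offload", "zero-infinity", "stronghold", "all")


def build_run_command(method: str, layers: int = 16, hidden: int = 2048,
                      batch: int = 4, window: int = 4) -> List[str]:
    """Build the argument list for ./examples/run.sh using the documented defaults."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return [
        "./examples/run.sh",
        "-m", method,
        "-l", str(layers),
        "-h", str(hidden),
        "-b", str(batch),
        "-w", str(window),
    ]


# Print the command as a single shell-safe string.
command = build_run_command("stronghold", layers=100)
print(" ".join(shlex.quote(part) for part in command))
```

The printed string can be pasted after `%docker_exec` in a notebook cell.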

PS: Some cases take over half an hour because we have to execute all baselines. Please have a coffee and wait for the output before starting the next execution.

<!-- > #### Description of the output log
>
> `------------------------ arguments ------------------------` and the below information shows the running parameters in details.
> 
> `>>> done with compiling and loading fused kernels. Compilation time: X seconds` means that all the compiling settings including CUDA linking and loading CUDA module have been successfully done. And it also gives the compilation time..
> 
> `>>> done with compiling and loading strongh utils. Compilation time: X seconds` means all the optimizer modules have been successfully linked and compiled.
> 
> `> building train, validation, and test datasets ...` shows that we are now building the training, validation and test dataloaders. And the following information will show the dataset in details.
> 
> `[before the start of training step] datetime: 2022-06-22 22:22:59` <br>
> `done with setup ...` shows everything works well before starting the actual training. 
> 
> `time (ms) | model-and-optimizer-setup: X | train/valid/test-data-iterators-setup: X` shows each of the setup cost (ms).
> 
> `training ...` and the below log information shows the traing breakdowns. 
> 
> **The below log shows the actual model running results, thus should be the one we pay attention for. And those results correspond to Figure 7(a) and Figure 8(a) in our paper. Please refer to our paper for more details.**
> 
> `iteration       X/      50` shows the iteration level, where `X` is the number represents `X` out of the total iteration (50 in this artifact) .
> 
> `[Rank X] (after 10 iterations) memory (MB) | ...` shows the memory footprint.
>
> `time (ms) | ...` shows the training time breakdowns.
>
> **All the logs will be stored in `/home/sys/STRONGHOLD/results`** -->

**To clean up the previous log files in the `results` folder**

In [3]:
%docker_exec rm -rf /home/sys/STRONGHOLD/results/*.txt /home/sys/STRONGHOLD/results/*.csv

In [4]:
%docker_exec ls /home/sys/STRONGHOLD/results/

## 2.1 CASE - The largest trainable model size (Figure 6a in Section VI.A)

In this case, we use GPT-like models to determine each method's largest trainable model size. We change the model size by increasing or decreasing the number of transformer layers.

Here, we evaluate Megatron-LM, L2L, ZeRO-Offload, ZeRO-Infinity and STRONGHOLD on a virtual machine with one 32GB V100, 90GB CPU RAM and 12 CPU cores to find each method's largest trainable model size and its bottleneck. We set `Heads=16, Sequence Length=1024, Batch Size=4` for all GPT-like models and training setups.

The table below lists the largest model sizes we obtained in this notebook. Please run the following cells to reproduce them.

| Methods | Largest Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Megatron-LM | **1.717 B**| **32** | 2048 | 16 | 1024 | 4 |
| L2L | **4.033 B**| **78** | 2048 | 16 | 1024 | 4 |
| ZeRO-Offload | **2.522 B**| **48** | 2048 | 16 | 1024 | 4 |
| ZeRO-Infinity | **2.522 B**| **48** | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **5.141 B**| **100** | 2048 | 16 | 1024 | 4 |
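
The sizes in the table can be cross-checked from the layer count and hidden size. The sketch below is an assumption on our side rather than code from the artifact: it uses the usual GPT-2 style parameter count (12h² + 13h parameters per transformer layer including biases and layer norms, token and position embeddings, a final layer norm, and an output layer that shares the token embedding), with the padded vocabulary size of 50304 that the logs report. The function name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(layers: int, hidden: int, seq_len: int = 1024,
                    padded_vocab: int = 50304) -> int:
    """Approximate parameter count of a GPT-2 style model (assumed layout)."""
    # Per layer: attention QKV (3h^2 + 3h), attention output (h^2 + h),
    # MLP (4h^2 + 4h and 4h^2 + h), and two layer norms (4h).
    per_layer = 12 * hidden * hidden + 13 * hidden
    # Token embeddings plus learned position embeddings.
    embeddings = padded_vocab * hidden + seq_len * hidden
    # Final layer norm (weight and bias).
    final_layernorm = 2 * hidden
    return layers * per_layer + embeddings + final_layernorm


# Layer counts taken from the table above; hidden size is 2048 throughout.
for method, layers in [("Megatron-LM", 32), ("L2L", 78), ("ZeRO-Offload", 48),
                       ("ZeRO-Infinity", 48), ("STRONGHOLD", 100)]:
    print(f"{method:14s} {gpt_param_count(layers, 2048) / 1e9:.3f} B")
```

Under these assumptions, the computed counts round to the sizes listed in the table.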

PS: GPU/CPU out-of-memory (OOM) errors might be reported with other messages, such as 'can not create XXX'.

**Use `ps aux` to check whether other reviewers have launched processes that are still running, to avoid overloading the GPU.**

In [37]:
%docker_exec ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4240  3512 pts/0    Ss+  07:09   0:00 /bin/bash
root         209  0.0  0.0   4340  3808 pts/1    Ss+  07:10   0:00 /bin/bash
root        3377  0.0  0.0   3976  3224 pts/2    Ss+  07:15   0:00 /bin/bash -c 
root        3535  0.0  0.0   5892  2952 pts/2    R+   07:15   0:00 ps aux
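
If the process list is long, a small filter can help spot a training run that is still in progress. This is a minimal sketch, assuming the `ps aux` output is available as text; the helper `find_training_processes` is hypothetical, and the default keywords (`pretrain_gpt.py` and `run.sh`) are our guess at what a running experiment looks like, based on the commands shown in this notebook.

```python
from typing import List, Sequence


def find_training_processes(ps_aux_output: str,
                            keywords: Sequence[str] = ("pretrain_gpt.py", "run.sh")) -> List[str]:
    """Return the COMMAND column of every process whose command mentions a keyword."""
    hits = []
    for line in ps_aux_output.splitlines()[1:]:  # skip the header row
        # `ps aux` prints 11 columns; COMMAND is the last one and may contain spaces.
        parts = line.split(None, 10)
        if len(parts) < 11:
            continue
        command = parts[10]
        if any(keyword in command for keyword in keywords):
            hits.append(command.strip())
    return hits


# Sample text in the same layout as the output above (idle machine).
idle = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4240  3512 pts/0    Ss+  07:09   0:00 /bin/bash
root        3535  0.0  0.0   5892  2952 pts/2    R+   07:15   0:00 ps aux
"""
print(find_training_processes(idle) or "no training process found")
```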


### The following results correspond to Figure 6a in the submitted paper. Please refer to Section VI.A on page 8 for more details. <br><br>Runs for around 130 mins.


In [6]:
%docker_exec ./examples/run.sh -m "megatron-lm" -l 32 -h 2048
%docker_exec ./examples/run.sh -m "l2l" -l 78 -h 2048
%docker_exec ./examples/run.sh -m "zero-offload" -l 48 -h 2048
%docker_exec ./examples/run.sh -m "zero-infinity" -l 48 -h 2048
%docker_exec ./examples/run.sh -m "stronghold" -l 100 -h 2048



 !!! The training model size in megatron-lm might be much smaller than others, such as zero-offload, stronghold, etc. !!! 

 
cd /home/sys/STRONGHOLD/examples/../Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../Megatron-LM/examples/sc22-gpt-megatron.sh 32 2048 16 1024 4 4 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_megatron-lm_l-32_hs-2048_bs-4_ws-4_2022-07-02.1656730240.txt && cd -
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ..................

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sys/STRONGHOLD/Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default number of work

  successfully saved checkpoint at iteration      50 to checkpoints/gpt2
time (ms) | save-checkpoint: 62657.14
[exiting program at iteration 50] datetime: 2022-07-02 02:56:17 
/home/sys/STRONGHOLD
cd /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/sc22-gpt-l2l.sh 78 2048 16 1024 4 4 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_l2l_l-78_hs-2048_bs-4_ws-4_2022-07-02.1656730580.txt && cd -
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ..........

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.178 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default nu

IOStream.flush timed out


/home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/sc22-gpt-l2l.sh: line 56: 31047 Killed                  PYTHONGIL=1 python pretrain_gpt.py --num-layers ${NLAYERS} --hidden-size ${NHIDDEN} --num-attention-heads ${HEADS} --micro-batch-size ${BATCHSIZE} --global-batch-size ${BATCHSIZE} --seq-length ${SEQ} --max-position-embeddings ${SEQ} --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save $CHECKPOINT_PATH --load $CHECKPOINT_PATH --data-path $DATA_PATH --vocab-file ${VOCAB_PATH} --merge-file ${MERGE_PATH} --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 0.00015 --min-lr 1.0e-5 --lr-decay-style cosine --weight-decay 1e-2 --clip-grad 1.0 --lr-warmup-fraction .01 --activations-checkpoint-method uniform --save-interval 10000 --eval-interval 1000 --eval-iters 1000 --enable-l2l
/home/sys/STRONGHOLD
cd /home/sys/STRONGHOLD/examples/../DeepSpeedExample-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../DeepSpeedExample-Megatron-L

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-07-02 03:14:18,212] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-07-02 03:14:18,213] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-07-02 03:14:18,213] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.06 GB, percent = 3.4%
beginging get_train_batch_size = <function get_train_batch_size at 0x7f046c497310>
train_batch = 4, micro_batch=None
[2022-07-02 03:14:18,403] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-07-02 03:14:18,404] [INFO] [utils.py:823:see_memory_usage] MA 9.39 GB        

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.003 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.003 seconds
    total number of samples: 108654
    total 

[2022-07-02 03:18:13,045] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.35 | optimizer_gradients: 266.75 | optimizer_step: 7368.17
[2022-07-02 03:18:13,045] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1757.15 | backward_microstep: 6368.03 | backward_inner_microstep: 6327.11 | backward_allreduce_microstep: 40.83 | step_microstep: 7693.52
[2022-07-02 03:18:13,046] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1757.22 | backward: 6368.03 | backward_inner: 6327.12 | backward_allreduce: 40.83 | step: 7693.53
[2022-07-02 03:18:28,862] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.35 | optimizer_gradients: 266.73 | optimizer_step: 7355.07
[2022-07-02 03:18:28,863] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1759.44 | backward_microstep: 6373.52 | backward_inner_microstep: 6334.81 | backward_allreduce_microstep: 38.63 | step_mi

[2022-07-02 03:21:38,925] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.34 | optimizer_gradients: 264.91 | optimizer_step: 7361.74
[2022-07-02 03:21:38,926] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1761.91 | backward_microstep: 6377.65 | backward_inner_microstep: 6339.18 | backward_allreduce_microstep: 38.38 | step_microstep: 7685.17
[2022-07-02 03:21:38,926] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1762.00 | backward: 6377.65 | backward_inner: 6339.19 | backward_allreduce: 38.39 | step: 7685.17
[2022-07-02 03:21:54,750] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.36 | optimizer_gradients: 268.84 | optimizer_step: 7367.99
[2022-07-02 03:21:54,751] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1757.15 | backward_microstep: 6369.27 | backward_inner_microstep: 6330.28 | backward_allreduce_microstep: 38.91 | step_mi

[2022-07-02 03:25:04,828] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.35 | optimizer_gradients: 266.55 | optimizer_step: 7398.14
[2022-07-02 03:25:04,829] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1757.12 | backward_microstep: 6374.27 | backward_inner_microstep: 6335.79 | backward_allreduce_microstep: 38.39 | step_microstep: 7723.30
[2022-07-02 03:25:04,829] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1757.21 | backward: 6374.27 | backward_inner: 6335.80 | backward_allreduce: 38.39 | step: 7723.30
[2022-07-02 03:25:20,726] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.37 | optimizer_gradients: 267.64 | optimizer_step: 7447.88
[2022-07-02 03:25:20,726] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1756.63 | backward_microstep: 6363.83 | backward_inner_microstep: 6325.35 | backward_allreduce_microstep: 38.37 | step_mi

[2022-07-02 03:28:30,903] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 52.34 | optimizer_gradients: 268.86 | optimizer_step: 7403.55
[2022-07-02 03:28:30,904] [INFO] [logging.py:69:log_dist] [Rank 0] step=50, skipped=0, lr=[2.34375e-06, 2.34375e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-07-02 03:28:30,904] [INFO] [timer.py:181:stop] 0/50, SamplesPerSec=0.2525706111582965
[2022-07-02 03:28:30,904] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1754.24 | backward_microstep: 6361.98 | backward_inner_microstep: 6323.60 | backward_allreduce_microstep: 38.30 | step_microstep: 7731.42
[2022-07-02 03:28:30,904] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1754.32 | backward: 6361.98 | backward_inner: 6323.61 | backward_allreduce: 38.30 | step: 7731.43
 iteration       50/      50 | elapsed time per iteration (ms): 15846.9 | learning rate: 2.344E-06 | lm loss: 8.758082E+00 | number of skipped iteration

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-07-02 03:28:46,799] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-07-02 03:28:46,799] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-07-02 03:28:46,800] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.05 GB, percent = 3.4%
beginging get_train_batch_size = <function get_train_batch_size at 0x7f68981da310>
train_batch = None, micro_batch=4
[2022-07-02 03:28:54,257] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-07-02 03:28:54,258] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 108654
    total 

[2022-07-02 03:32:52,208] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6144.94
[2022-07-02 03:32:52,209] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2611.37 | backward_microstep: 7534.86 | backward_inner_microstep: 7457.65 | backward_allreduce_microstep: 77.04 | step_microstep: 6189.25
[2022-07-02 03:32:52,209] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2611.43 | backward: 7534.85 | backward_inner: 7457.67 | backward_allreduce: 77.07 | step: 6189.25
[2022-07-02 03:33:08,597] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6175.22
[2022-07-02 03:33:08,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2633.82 | backward_microstep: 7531.74 | backward_inner_microstep: 7454.43 | backward_allreduce_microstep: 77.19 | step_microstep: 6219.24
[2022-07-02 03:33:08,599] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2633.8

[2022-07-02 03:36:41,526] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6192.25
[2022-07-02 03:36:41,527] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2620.18 | backward_microstep: 7522.49 | backward_inner_microstep: 7445.07 | backward_allreduce_microstep: 77.31 | step_microstep: 6236.01
[2022-07-02 03:36:41,527] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2620.19 | backward: 7522.49 | backward_inner: 7445.09 | backward_allreduce: 77.32 | step: 6236.01
[2022-07-02 03:36:57,888] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6161.02
[2022-07-02 03:36:57,889] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2625.05 | backward_microstep: 7526.49 | backward_inner_microstep: 7449.16 | backward_allreduce_microstep: 77.20 | step_microstep: 6205.88
[2022-07-02 03:36:57,889] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2625.1

[2022-07-02 03:40:30,827] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 6135.75
[2022-07-02 03:40:30,828] [INFO] [logging.py:69:log_dist] [Rank 0] step=40, skipped=0, lr=[1.8749999999999998e-06, 1.8749999999999998e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-07-02 03:40:30,828] [INFO] [timer.py:181:stop] 0/40, SamplesPerSec=0.24431760335652652
[2022-07-02 03:40:30,828] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 2645.92 | backward_microstep: 7544.43 | backward_inner_microstep: 7465.60 | backward_allreduce_microstep: 78.71 | step_microstep: 6179.48
[2022-07-02 03:40:30,828] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 2645.94 | backward: 7544.43 | backward_inner: 7465.60 | backward_allreduce: 78.74 | step: 6179.48
 iteration       40/      50 | elapsed time per iteration (ms): 16380.4 | learning rate: 1.875E-06 | lm loss: 1.087060E+01 | number of skipped iterations:   0 | number of nan iterations:  

`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
--checkpoint-activations is no longer valid, use --activation-checkpoint-method instead. Defaulting to activation-checkpoint-method=uniform.
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_query_key_layer_scaling ................... True
  apply_resid

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.160 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 25391.1 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.323940E+00 | loss scale: 1.0 | grad norm: 148.220 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 6.63 and total parameters 5.141 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 25.391076254844666;  SamplesPerSecond: 0.15753566173615002
time (ms) | e2e-time: 25391.07 | forward-compute: 3707.36 | backward-compute: 21672.75 | backward-embedding-all-reduce: 0.02 | optimizer: 2.03 | batch-generator: 1.34 | offloading-func-call-overhead: 51.72 | offloading-fwd-overhead: 3372.11 | offloading-bwd-overhead: 19204.44 | offloading-fwd-2gpu-overhead: 680.60 | offloading-fwd-2cpu-overhead: 2689.65 | offloading-bwd-2gpu-overhead: 586.48 | offloading-bwd-2cpu-overhead: 18615.08
 iteration       40/      50 | elapsed time per iteration (ms): 25445.2 | learning rate: 1.875E-06 | global batch size:     4 | lm lo

### To print/draw the relevant information from log files

> - Extract the useful information from the detailed logs and save it to `./results/case1.csv`. <br>
> - Visualize it as a figure saved to `out/metric_model_scale.png`. <br>
> - Run `docker_exec cat ./results/case1.csv` to inspect the numeric values. <br>

In [31]:
%docker_exec ./examples/case1_extract.sh
%docker_exec python ./examples/case1_draw.py

import random; __counter__ = random.randint(0, int(2e9))  # integer bounds; value only defeats image caching
from IPython.display import HTML, display
display(HTML('<img src="./out/metric_model_scale.png?%d" style="height:180px">' % __counter__))
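
The extraction step above can be approximated in a few lines of Python. This is only a sketch: the contents of `case1_extract.sh` are not shown in this notebook, so the function name `summarize_log` and both regular expressions are assumptions based on the STRONGHOLD log lines printed earlier (`SamplesPerSecond: ...` and `total parameters ... B`). The DeepSpeed-based baselines log throughput in a different format (`SamplesPerSec=`) and are not covered here.

```python
import re
from statistics import mean

# Matches lines such as:
#   NumWorkers: 1; SamplesPerStep: 4; IterationTime: 25.39...;  SamplesPerSecond: 0.157...
THROUGHPUT_RE = re.compile(r"SamplesPerSecond:\s*([0-9.eE+-]+)")
# Matches lines such as: "... and total parameters 5.141 B"
PARAMS_RE = re.compile(r"total parameters\s+([0-9.]+)\s*B")


def summarize_log(text):
    """Return (model size in billions, mean samples/s) for one training log.

    Hypothetical helper, not part of the STRONGHOLD scripts.
    """
    throughputs = [float(m) for m in THROUGHPUT_RE.findall(text)]
    sizes = [float(m) for m in PARAMS_RE.findall(text)]
    if not throughputs or not sizes:
        raise ValueError("log does not contain the expected summary lines")
    return sizes[-1], mean(throughputs)


# Two lines copied from the log output shown earlier in this notebook.
sample = """\
Effective Tera Flops per GPU: 6.63 and total parameters 5.141 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 25.391076254844666;  SamplesPerSecond: 0.15753566173615002
"""
size_b, samples_per_sec = summarize_log(sample)
print(f"{size_b:.3f} B parameters, {samples_per_sec:.4f} samples/s")
```

Averaging over all logged intervals is a design choice of this sketch; the shipped script may select a single interval instead.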

## 2.2 CASE - Throughput on the largest trainable model size supported by each baseline (Figure 7a in Section VI.B)

In this case, we use GPT-like models to find the largest trainable model size supported by each baseline, and then compare each baseline against STRONGHOLD at that model size. The model size is varied by increasing or decreasing the number of transformer layers.

Here, we evaluate Megatron-LM, L2L, ZeRO-Offload and ZeRO-Infinity against STRONGHOLD on a virtual machine with one 32GB V100, 90GB CPU RAM and 12 CPU cores. We set `Heads=16, Sequence Length=1024, Batch Size=4` for all GPT-like models and training setups.
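
To see how the layer count maps to model size, the sketch below applies the usual GPT-2 style parameter count: `12*h^2 + 13*h` per transformer layer (weights, biases and two LayerNorms), plus token and position embeddings and a final LayerNorm. It is an estimate, not code from STRONGHOLD; the function name `gpt_param_count` is hypothetical, and it assumes tied input/output embeddings and the padded vocabulary of 50304 reported in the logs. Under these assumptions it reproduces the sizes printed in the logs for hidden size 2048.

```python
def gpt_param_count(layers, hidden, vocab=50304, seq_len=1024):
    """Estimate the parameter count of a GPT-2 style model.

    Assumes tied input/output embeddings (hypothetical helper, see lead-in).
    """
    # Attention (4*h^2 + 4*h) + MLP (8*h^2 + 5*h) + two LayerNorms (4*h)
    per_layer = 12 * hidden * hidden + 13 * hidden
    # Token embeddings (padded vocabulary) plus learned position embeddings
    embeddings = (vocab + seq_len) * hidden
    final_layernorm = 2 * hidden
    return layers * per_layer + embeddings + final_layernorm


for layers in (32, 48, 78):
    billions = gpt_param_count(layers, 2048) / 1e9
    print(f"{layers} layers -> {billions:.3f} B")
# 32 layers -> 1.717 B
# 48 layers -> 2.522 B
# 78 layers -> 4.033 B
```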

The throughput measured in this notebook is shown in the following table. Please run the next cells to reproduce it. Thanks.

| Methods | Throughput (samples/s) | Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Megatron-LM | **0.7496** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.6647** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| | | | | | | | |
| L2L | **0.0529** | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.2271** | 4.033 B | 78 | 2048 | 16 | 1024 | 4 |
| | | | | | | | |
| ZeRO-Offload | **0.2523** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.3999** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| | | | | | | | |
| ZeRO-Infinity | **0.2439** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.3999** | 2.522 B | 48 | 2048 | 16 | 1024 | 4 |

PS: The limited CPU cores and bandwidth of the virtual machine slightly hurt the performance of STRONGHOLD.
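
The pairwise comparison in the table can be turned into speedup ratios with a short calculation. The snippet is a sketch that only reuses the throughput values listed in the table; the variable names are illustrative.

```python
# (baseline name, baseline samples/s, STRONGHOLD samples/s at the same model size),
# copied from the table above.
rows = [
    ("Megatron-LM", 0.7496, 0.6647),
    ("L2L", 0.0529, 0.2271),
    ("ZeRO-Offload", 0.2523, 0.3999),
    ("ZeRO-Infinity", 0.2439, 0.3999),
]

# Ratio > 1 means STRONGHOLD is faster than the baseline at that model size.
speedups = {name: stronghold / baseline for name, baseline, stronghold in rows}

for name, ratio in speedups.items():
    print(f"STRONGHOLD vs {name}: {ratio:.2f}x")
```

A ratio below 1 for Megatron-LM is expected in this setup, since Megatron-LM keeps the whole 1.717 B model on the GPU and performs no offloading.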

**Use `ps aux` to check whether other reviewers have processes running, to avoid overloading the GPU.**

In [38]:
%docker_exec ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4240  3512 pts/0    Ss+  07:09   0:00 /bin/bash
root         209  0.0  0.0   4340  3808 pts/1    Ss+  07:10   0:00 /bin/bash
root        3556  0.0  0.0   3976  3172 pts/2    Ss+  07:15   0:00 /bin/bash -c 
root        3714  0.0  0.0   5892  2848 pts/2    R+   07:15   0:00 ps aux


### The following results correspond to Figure 7a in the submitted paper. Please refer to Section VI.B on page 9 for more details. <br><br>Runs for around 40 mins.

In [11]:
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 48 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 78 -h 2048 -w 15

cd /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/sc22-gpt-sh.sh 32 2048 16 1024 4 15 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_stronghold_l-32_hs-2048_bs-4_ws-15_2022-07-02.1656734808.txt && cd -
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py': No such file or directory
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py': No such file or directory
PYTHONGIL=1 python pretrain_gpt.py --num-layers 32 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --micro-batch-size 4 --global-batch-size 4 --max-position-embeddings 1024 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save ./checkpoints/gpt2 --load ./checkpoints/gpt2 --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-fi

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 5990.7 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.076840E+01 | loss scale: 1.0 | grad norm: 1363562471.041 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 9.39 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 5.990670108795166;  SamplesPerSecond: 0.6677049357345557
time (ms) | e2e-time: 5990.68 | forward-compute: 857.77 | backward-compute: 5122.16 | backward-embedding-all-reduce: 0.02 | optimizer: 2.44 | batch-generator: 1.17 | offloading-func-call-overhead: 15.44 | offloading-fwd-overhead: 748.51 | offloading-bwd-overhead: 2.13 | offloading-fwd-2gpu-overhead: 348.03 | offloading-fwd-2cpu-overhead: 399.89 | offloading-bwd-2gpu-overhead: 0.78 | offloading-bwd-2cpu-overhead: 0.48
 iteration       40/      50 | elapsed time per iteration (ms): 5883.4 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.037473

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.155 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 10075.5 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.051130E+01 | loss scale: 1.0 | grad norm: 52.520 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.2 and total parameters 2.522 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 10.075518250465393;  SamplesPerSecond: 0.397001911024799
time (ms) | e2e-time: 10075.51 | forward-compute: 1358.87 | backward-compute: 8705.88 | backward-embedding-all-reduce: 0.02 | optimizer: 2.36 | batch-generator: 1.21 | offloading-func-call-overhead: 35.92 | offloading-fwd-overhead: 1192.92 | offloading-bwd-overhead: 3.57 | offloading-fwd-2gpu-overhead: 535.56 | offloading-fwd-2cpu-overhead: 656.47 | offloading-bwd-2gpu-overhead: 1.51 | offloading-bwd-2cpu-overhead: 0.80
 iteration       40/      50 | elapsed time per iteration (ms): 9939.3 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.010002E+01 

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 17625.8 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.023125E+01 | loss scale: 1.0 | grad norm: 1466714772.580 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.5 and total parameters 4.033 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 17.625775718688963;  SamplesPerSecond: 0.22694036641795684
time (ms) | e2e-time: 17625.81 | forward-compute: 2607.50 | backward-compute: 15007.40 | backward-embedding-all-reduce: 0.02 | optimizer: 2.17 | batch-generator: 1.39 | offloading-func-call-overhead: 41.11 | offloading-fwd-overhead: 2361.02 | offloading-bwd-overhead: 609.36 | offloading-fwd-2gpu-overhead: 1161.72 | offloading-fwd-2cpu-overhead: 1197.84 | offloading-bwd-2gpu-overhead: 3.26 | offloading-bwd-2cpu-overhead: 603.55
 iteration       40/      50 | elapsed time per iteration (ms): 17623.1 | learning rate: 1.875E-06 | global batch size:     4 | lm l

### To print/draw the relevant information from log files

> - Extract the useful information from the detailed logs and save it to `./results/case2.csv`. <br>
> - Visualize it as a figure saved to `out/metric_throughput_vs.png`. <br>
> - Run `docker_exec cat ./results/case2.csv` to inspect the numeric values. <br>

In [32]:
%docker_exec ./examples/case2_extract.sh
%docker_exec python ./examples/case2_draw.py

import random; __counter__ = random.randint(0, int(2e9))  # integer bounds; value only defeats image caching
from IPython.display import HTML, display
display(HTML('<img src="./out/metric_throughput_vs.png?%d" style="height:180px">' % __counter__))

Due to the limited CPU cores and bandwidth of the virtual machine, STRONGHOLD may be slightly slower than Megatron-LM, but it still outperforms the other offloading solutions.

## 2.3 CASE - Throughput on the largest trainable model size of Megatron-LM (Figure 8a in Section VI.B)

This case shows the throughput of Megatron-LM, L2L, ZeRO-Offload, ZeRO-Infinity and STRONGHOLD on a 1.717 B model, which is the largest trainable model size supported by Megatron-LM. The evaluation is conducted on a virtual machine with one 32GB V100, 90GB CPU RAM and 12 CPU cores.

The throughput measured in this notebook is shown in the following table. Please run the following cells to reproduce it. Thanks.

| Methods | Throughput (samples/s) | Trainable Size | Layers | Hidden Size | Heads | Sequence Length | Batch Size |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Megatron-LM | **0.7496** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| L2L | **0.1729** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ZeRO-Offload | **0.3711** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| ZeRO-Infinity | **0.3587** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |
| STRONGHOLD | **0.6647** | 1.717 B | 32 | 2048 | 16 | 1024 | 4 |

PS: The limited CPU cores and bandwidth of the virtual machine slightly hurt the performance of STRONGHOLD.
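
The throughput values are in samples per second. The logs in this notebook report `SamplesPerStep`, `IterationTime` and `SamplesPerSecond` on the same line, and the numbers are consistent with throughput being the samples per step divided by the iteration time in seconds. The sketch below checks this against two logged lines; the relationship is inferred from the printed values, not taken from the training scripts, and the function name is illustrative.

```python
def samples_per_second(samples_per_step, iteration_time_s):
    """Throughput as samples processed per second of wall-clock iteration time."""
    return samples_per_step / iteration_time_s


# STRONGHOLD, 32 layers: the log reports SamplesPerSecond: 0.6677049357345557
stronghold = samples_per_second(4, 5.990670108795166)
# L2L, 32 layers: the log reports SamplesPerSecond: 0.17526185991044535
l2l = samples_per_second(4, 22.822991847991943)

print(stronghold, l2l)
```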

**Use `ps aux` to check whether other reviewers have processes running, to avoid overloading the GPU.**

In [15]:
%docker_exec ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4340  2716 pts/0    Ss+  02:10   0:00 /bin/bash
root       36316  0.0  0.0   3976  3164 pts/1    Ss+  04:38   0:00 /bin/bash -c 
root       36474  0.0  0.0   5892  2844 pts/1    R+   04:38   0:00 ps aux


### The following results correspond to Figure 8a in the submitted paper. Please refer to Section VI.B on page 9 for more details. <br><br>Runs for around 45 mins.

In [16]:
%docker_exec ./examples/run.sh -m "l2l" -l 32 -h 2048
%docker_exec ./examples/run.sh -m "zero-offload" -l 32 -h 2048 
%docker_exec ./examples/run.sh -m "zero-infinity" -l 32 -h 2048

cd /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../L2L-Megatron-LM/examples/sc22-gpt-l2l.sh 32 2048 16 1024 4 4 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_l2l_l-32_hs-2048_bs-4_ws-4_2022-07-02.1656736709.txt && cd -
`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
using world size: 1, data-parallel-size: 1, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
using torch.float32 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  activations_checkpoint_method ................... uniform
  activations_checkpoint_num_layers ............... 1
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.999
  adam_eps ........................................ 1e-08
  adlr_autoresume ..............................

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.156 seconds
> compiling and loading fused kernels ...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/sys/STRONGHOLD/L2L-Megatron-LM/megatron/fused_kernels/build/build.ninja...
Building extension module scaled_upper_triang_masked_softmax_cuda...
Allowing ninja to set a default nu

 iteration       50/      50 | elapsed time per iteration (ms): 22823.0 | learning rate: 2.344E-06 | global batch size:     4 | lm loss: 1.121118E+01 | loss scale: 1.0 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 2.46 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 22.822991847991943;  SamplesPerSecond: 0.17526185991044535
time (ms) | forward-compute: 4267.56 | backward-compute: 15703.58 | backward-params-all-reduce: 16.84 | backward-embedding-all-reduce: 0.04 | optimizer: 2815.08 | batch-generator: 1.38
saving checkpoint at iteration      50 to checkpoints/gpt2_345m_ds
  successfully saved checkpoint at iteration      50 to checkpoints/gpt2_345m_ds
time (ms) | save-checkpoint: 69139.43
[exiting program at iteration 50] datetime: 2022-07-02 05:00:03 
/home/sys/STRONGHOLD
cd /home/sys/STRONGHOLD/examples/../DeepSpeedExample-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../DeepSpeedExample-Meg

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-07-02 05:00:14,847] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-07-02 05:00:14,848] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-07-02 05:00:14,848] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.15 GB, percent = 3.5%
beginging get_train_batch_size = <function get_train_batch_size at 0x7f19efdc6310>
train_batch = 4, micro_batch=None
[2022-07-02 05:00:14,993] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-07-02 05:00:14,994] [INFO] [utils.py:823:see_memory_usage] MA 6.39 GB        

[2022-07-02 05:03:36,650] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.75 | optimizer_gradients: 182.28 | optimizer_step: 5001.92
[2022-07-02 05:03:36,651] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1191.34 | backward_microstep: 4301.74 | backward_inner_microstep: 4262.77 | backward_allreduce_microstep: 38.88 | step_microstep: 5224.43
[2022-07-02 05:03:36,651] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1191.40 | backward: 4301.74 | backward_inner: 4262.78 | backward_allreduce: 38.89 | step: 5224.43
[2022-07-02 05:03:47,387] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.74 | optimizer_gradients: 185.19 | optimizer_step: 5003.76
[2022-07-02 05:03:47,388] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1193.15 | backward_microstep: 4311.46 | backward_inner_microstep: 4272.51 | backward_allreduce_microstep: 38.87 | step_mi

[2022-07-02 05:05:56,229] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.76 | optimizer_gradients: 187.43 | optimizer_step: 5011.95
[2022-07-02 05:05:56,230] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1192.35 | backward_microstep: 4311.41 | backward_inner_microstep: 4270.40 | backward_allreduce_microstep: 40.92 | step_microstep: 5239.63
[2022-07-02 05:05:56,230] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1192.44 | backward: 4311.41 | backward_inner: 4270.41 | backward_allreduce: 40.93 | step: 5239.63
[2022-07-02 05:06:06,979] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.77 | optimizer_gradients: 186.48 | optimizer_step: 5017.63
[2022-07-02 05:06:06,980] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1191.27 | backward_microstep: 4311.78 | backward_inner_microstep: 4272.71 | backward_allreduce_microstep: 38.98 | step_mi

[2022-07-02 05:08:15,883] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.75 | optimizer_gradients: 184.50 | optimizer_step: 5012.36
[2022-07-02 05:08:15,884] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1192.01 | backward_microstep: 4309.50 | backward_inner_microstep: 4270.47 | backward_allreduce_microstep: 38.94 | step_microstep: 5237.02
[2022-07-02 05:08:15,884] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1192.07 | backward: 4309.50 | backward_inner: 4270.47 | backward_allreduce: 38.95 | step: 5237.02
[2022-07-02 05:08:26,618] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 35.76 | optimizer_gradients: 183.56 | optimizer_step: 5009.68
[2022-07-02 05:08:26,619] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1190.52 | backward_microstep: 4307.84 | backward_inner_microstep: 4268.75 | backward_allreduce_microstep: 39.00 | step_mi

`fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.
[2022-07-02 05:10:01,594] [INFO] [runner.py:398:main] cmd = /root/.pyenv/versions/3.9.10/envs/py3.9.10/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 pretrain_gpt2.py --model-parallel-size 1 --num-layers 32 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --max-position-embeddings 1024 --batch-size 4 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save checkpoints/gpt2_ds --load checkpoints/gpt2_ds --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-file /home/sys/STRONGHOLD/data/gpt2-merges.txt --data-impl mmap --split 949,50,1 --distributed-backend nccl --lr 1.5e-4 --lr-decay-style cosine --min-lr 1.0e-5 --weight-decay 1e-2 --clip-grad 1.0 --warmup 0.01 --checkpoint-

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
[2022-07-02 05:10:05,998] [INFO] [utils.py:822:see_memory_usage] Before Building Model
[2022-07-02 05:10:05,999] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         Max_MA 0.0 GB         CA 0.0 GB         Max_CA 0 GB 
[2022-07-02 05:10:05,999] [INFO] [utils.py:831:see_memory_usage] CPU Virtual Memory:  used = 3.16 GB, percent = 3.5%
beginging get_train_batch_size = <function get_train_batch_size at 0x7f92ec33a310>
train_batch = None, micro_batch=4
[2022-07-02 05:10:11,102] [INFO] [utils.py:822:see_memory_usage] After Building Model
[2022-07-02 05:10:11,103] [INFO] [utils.py:823:see_memory_usage] MA 0.0 GB         

 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_train_indexmap_200ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.002 seconds
    total number of samples: 3154519
    total number of epochs: 1
 > loading doc-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_doc_idx.npy
 > loading sample-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_sample_idx.npy
 > loading shuffle-idx mapping from /home/sys/STRONGHOLD/data/my-gpt2-en_text_document_valid_indexmap_4000ns_1024sl_1234s_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 108654
    total 

[2022-07-02 05:12:52,954] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4171.79
[2022-07-02 05:12:52,955] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1794.41 | backward_microstep: 5089.87 | backward_inner_microstep: 5034.74 | backward_allreduce_microstep: 54.98 | step_microstep: 4201.52
[2022-07-02 05:12:52,955] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1794.46 | backward: 5089.87 | backward_inner: 5034.76 | backward_allreduce: 55.02 | step: 4201.52
[2022-07-02 05:13:04,019] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4174.32
[2022-07-02 05:13:04,020] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1781.02 | backward_microstep: 5075.60 | backward_inner_microstep: 5020.75 | backward_allreduce_microstep: 54.72 | step_microstep: 4204.49
[2022-07-02 05:13:04,020] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1781.0

[2022-07-02 05:15:28,275] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4190.73
[2022-07-02 05:15:28,276] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1807.17 | backward_microstep: 5084.71 | backward_inner_microstep: 5029.00 | backward_allreduce_microstep: 55.57 | step_microstep: 4220.59
[2022-07-02 05:15:28,276] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1807.22 | backward: 5084.71 | backward_inner: 5029.02 | backward_allreduce: 55.60 | step: 4220.60
[2022-07-02 05:15:39,367] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4183.35
[2022-07-02 05:15:39,368] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1789.76 | backward_microstep: 5085.81 | backward_inner_microstep: 5030.20 | backward_allreduce_microstep: 55.47 | step_microstep: 4213.01
[2022-07-02 05:15:39,368] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1789.7

[2022-07-02 05:18:03,778] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | optimizer_step: 4224.63
[2022-07-02 05:18:03,779] [INFO] [logging.py:69:log_dist] [Rank 0] step=40, skipped=0, lr=[1.8749999999999998e-06, 1.8749999999999998e-06], mom=[(0.9, 0.999), (0.9, 0.999)]
[2022-07-02 05:18:03,779] [INFO] [timer.py:181:stop] 0/40, SamplesPerSec=0.36045088277337134
[2022-07-02 05:18:03,779] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 1795.24 | backward_microstep: 5088.82 | backward_inner_microstep: 5033.60 | backward_allreduce_microstep: 55.08 | step_microstep: 4255.51
[2022-07-02 05:18:03,779] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 time (ms) | forward: 1795.26 | backward: 5088.83 | backward_inner: 5033.62 | backward_allreduce: 55.11 | step: 4255.50
 iteration       40/      50 | elapsed time per iteration (ms): 11113.7 | learning rate: 1.875E-06 | lm loss: 9.305648E+00 | number of skipped iterations:   0 | number of nan iterations:  

### To print/draw the relevant information from log files

> - Extract the useful information from the detailed logs and save it to `./results/case3.csv`. <br>
> - Visualize it as a figure saved to `out/metric_throughput.png`. <br>
> - Run `docker_exec cat ./results/case3.csv` to inspect the numeric values. <br>

In [33]:
%docker_exec ./examples/case3_extract.sh
%docker_exec python ./examples/case3_draw.py

import random; __counter__ = random.randint(0, int(2e9))  # integer bounds; value only defeats image caching
from IPython.display import HTML, display
display(HTML('<img src="./out/metric_throughput.png?%d" style="height:180px">' % __counter__))

## 2.4 CASE - Nearly linear scaling as model size increases (Figure 8b in Section VI.B)

In this case, we evaluate the performance (elapsed time per iteration, in ms) as the model size increases. As in the previous cases, the model size is varied by increasing or decreasing the number of transformer layers.

You should see the `elapsed time per iteration` rise linearly with the number of transformer layers (representing model size), which demonstrates STRONGHOLD's scalability.
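
One way to quantify "nearly linear" is an ordinary least-squares fit of iteration time against layer count. The sketch below uses four STRONGHOLD data points read from the iteration-30 log lines printed in this notebook (32, 48, 78 and 92 layers, all with hidden size 2048 and window size 15). It is an illustrative check rather than the plotting script of this artifact, and the variable names are assumptions.

```python
# (transformer layers, elapsed time per iteration in ms) from the iteration-30 log lines
points = [(32, 5990.7), (48, 10075.5), (78, 17625.8), (92, 20956.8)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sums of squares and cross products around the means
sxx = sum((x - mean_x) ** 2 for x, _ in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
syy = sum((y - mean_y) ** 2 for _, y in points)

slope = sxy / sxx                     # extra ms per added transformer layer
intercept = mean_y - slope * mean_x   # extrapolated time at zero layers
r_squared = sxy ** 2 / (sxx * syy)    # 1.0 would be a perfectly linear relation

print(f"slope = {slope:.1f} ms/layer, intercept = {intercept:.1f} ms, R^2 = {r_squared:.4f}")
```

A coefficient of determination close to 1 means each added layer costs a roughly constant amount of time over the tested range.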


**Use `ps aux` to check whether other reviewers have processes running, to avoid overloading the GPU.**

In [20]:
%docker_exec ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4340  2716 pts/0    Ss+  02:10   0:00 /bin/bash
root       38139  2.0  0.0   3976  3128 pts/1    Ss+  05:20   0:00 /bin/bash -c 
root       38297  0.0  0.0   5892  2896 pts/1    R+   05:20   0:00 ps aux


### The following results correspond to Figure 8b in the submitted paper. Please refer to Section VI.B on page 9 for more details. <br><br>Runs for around 60 mins.

In [21]:
%docker_exec ./examples/run.sh -m "stronghold" -l 92 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 64 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 56 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 40 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 24 -h 2048 -w 15
%docker_exec ./examples/run.sh -m "stronghold" -l 16 -h 2048 -w 15

cd /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/sc22-gpt-sh.sh 92 2048 16 1024 4 15 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_stronghold_l-92_hs-2048_bs-4_ws-15_2022-07-02.1656739206.txt && cd -
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py': No such file or directory
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py': No such file or directory
PYTHONGIL=1 python pretrain_gpt.py --num-layers 92 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --micro-batch-size 4 --global-batch-size 4 --max-position-embeddings 1024 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save ./checkpoints/gpt2 --load ./checkpoints/gpt2 --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-fi

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 20956.8 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.006344E+01 | loss scale: 1.0 | grad norm: 10177.623 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.41 and total parameters 4.738 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 20.95683088302612;  SamplesPerSecond: 0.19086855366284317
time (ms) | e2e-time: 20956.82 | forward-compute: 3162.93 | backward-compute: 17782.95 | backward-embedding-all-reduce: 0.02 | optimizer: 2.06 | batch-generator: 1.37 | offloading-func-call-overhead: 51.73 | offloading-fwd-overhead: 2865.03 | offloading-bwd-overhead: 8.45 | offloading-fwd-2gpu-overhead: 1396.72 | offloading-fwd-2cpu-overhead: 1466.29 | offloading-bwd-2gpu-overhead: 4.31 | offloading-bwd-2cpu-overhead: 1.42
 iteration       40/      50 | elapsed time per iteration (ms): 20949.4 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.97

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.155 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 14004.1 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.054702E+01 | loss scale: 1.0 | grad norm: 10.246 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.79 and total parameters 3.328 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 14.004148483276367;  SamplesPerSecond: 0.28562964787018397
time (ms) | e2e-time: 14004.13 | forward-compute: 1975.78 | backward-compute: 12017.51 | backward-embedding-all-reduce: 0.02 | optimizer: 2.24 | batch-generator: 1.35 | offloading-func-call-overhead: 34.49 | offloading-fwd-overhead: 1761.04 | offloading-bwd-overhead: 282.85 | offloading-fwd-2gpu-overhead: 843.53 | offloading-fwd-2cpu-overhead: 916.28 | offloading-bwd-2gpu-overhead: 2.40 | offloading-bwd-2cpu-overhead: 278.75
 iteration       40/      50 | elapsed time per iteration (ms): 14200.0 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.03

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.156 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 11808.3 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.802625E+00 | loss scale: 1.0 | grad norm: 229509.988 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.12 and total parameters 2.925 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 11.808258152008056;  SamplesPerSecond: 0.3387459817110942
time (ms) | e2e-time: 11808.34 | forward-compute: 1667.69 | backward-compute: 10129.85 | backward-embedding-all-reduce: 0.02 | optimizer: 2.27 | batch-generator: 1.21 | offloading-func-call-overhead: 39.55 | offloading-fwd-overhead: 1475.02 | offloading-bwd-overhead: 141.16 | offloading-fwd-2gpu-overhead: 703.93 | offloading-fwd-2cpu-overhead: 769.98 | offloading-bwd-2gpu-overhead: 1.97 | offloading-bwd-2cpu-overhead: 137.69
 iteration       40/      50 | elapsed time per iteration (ms): 11814.8 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7891.7 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.006158E+01 | loss scale: 1.0 | grad norm: 8.309 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.8 and total parameters 2.119 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.891727471351624;  SamplesPerSecond: 0.5068598750426586
time (ms) | e2e-time: 7891.73 | forward-compute: 1062.78 | backward-compute: 6818.28 | backward-embedding-all-reduce: 0.02 | optimizer: 2.37 | batch-generator: 1.14 | offloading-func-call-overhead: 18.26 | offloading-fwd-overhead: 935.75 | offloading-bwd-overhead: 2.84 | offloading-fwd-2gpu-overhead: 423.85 | offloading-fwd-2cpu-overhead: 511.15 | offloading-bwd-2gpu-overhead: 1.15 | offloading-bwd-2cpu-overhead: 0.62
 iteration       40/      50 | elapsed time per iteration (ms): 7971.0 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.797252E+00 | lo

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.154 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 4121.6 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.004763E+01 | loss scale: 1.0 | grad norm: 4.123 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 10.44 and total parameters 1.314 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 4.121577477455139;  SamplesPerSecond: 0.9705021977337166
time (ms) | e2e-time: 4121.57 | forward-compute: 497.78 | backward-compute: 3613.31 | backward-embedding-all-reduce: 0.01 | optimizer: 2.50 | batch-generator: 1.16 | offloading-func-call-overhead: 10.40 | offloading-fwd-overhead: 409.09 | offloading-bwd-overhead: 1.86 | offloading-fwd-2gpu-overhead: 12.47 | offloading-fwd-2cpu-overhead: 396.16 | offloading-bwd-2gpu-overhead: 0.39 | offloading-bwd-2cpu-overhead: 0.39
 iteration       40/      50 | elapsed time per iteration (ms): 4152.3 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.520592E+00 | lo

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.155 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 2729.6 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.114230E+01 | loss scale: 1.0 | grad norm: inf | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 10.93 and total parameters 0.911 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 2.7295880794525145;  SamplesPerSecond: 1.465422577901314
time (ms) | e2e-time: 2729.54 | forward-compute: 34.53 | backward-compute: 2684.71 | backward-embedding-all-reduce: 0.01 | optimizer: 2.19 | batch-generator: 1.21 | offloading-func-call-overhead: 3.76 | offloading-fwd-overhead: 0.95 | offloading-bwd-overhead: 0.66 | offloading-fwd-2gpu-overhead: 0.01 | offloading-fwd-2cpu-overhead: 0.66 | offloading-bwd-2gpu-overhead: 0.05 | offloading-bwd-2cpu-overhead: 0.24
 iteration       40/      50 | elapsed time per iteration (ms): 2660.2 | learning rate: 1.875E-06 | global batch size:     4 | loss scale: 1.0 | grad norm: nan | 

### To print/draw the relevant information from log files

> - Extract the useful information from the detailed logs and save it to `./results/case4.csv`. <br>
> - Visualize the results as a figure saved to `out/metric_linear_scaling.png`. <br>
> - Run `docker_exec cat ./results/case4.csv` to inspect the numeric values. <br>

In [34]:
%docker_exec ./examples/case4_extract.sh
%docker_exec python ./examples/case4_draw.py

import random; __counter__ = random.randint(0, int(2e9))  # random query string so a cached image is not shown
from IPython.display import HTML, display
display(HTML('<img src="./out/metric_linear_scaling.png?%d" style="height:180px">' % __counter__))
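
The extraction step above relies on two summary lines that each run prints per logging interval. The contents of `case4_extract.sh` are not shown in this notebook, so the following is only a minimal sketch of how such lines could be parsed; the function name `parse_log` and the dictionary keys are hypothetical. It also checks that the logged `SamplesPerSecond` equals `SamplesPerStep / IterationTime`.

```python
import re

# Patterns for two summary lines that appear in the training logs above.
FLOPS_RE = re.compile(
    r"Effective Tera Flops per GPU: (?P<tflops>[\d.]+) "
    r"and total parameters (?P<params_b>[\d.]+) B"
)
STEP_RE = re.compile(
    r"SamplesPerStep: (?P<samples>\d+); "
    r"IterationTime: (?P<seconds>[\d.]+);\s+"
    r"SamplesPerSecond: (?P<throughput>[\d.]+)"
)


def parse_log(text):
    """Return one dict per logged step summary found in `text` (hypothetical helper)."""
    rows = []
    current = {}
    for line in text.splitlines():
        m = FLOPS_RE.search(line)
        if m:
            # Remember the model-level numbers until the matching step line arrives.
            current = {"tflops": float(m["tflops"]), "params_b": float(m["params_b"])}
            continue
        m = STEP_RE.search(line)
        if m:
            row = dict(current)
            row.update(
                samples=int(m["samples"]),
                seconds=float(m["seconds"]),
                throughput=float(m["throughput"]),
            )
            rows.append(row)
            current = {}
    return rows


# Two lines copied from the log output above.
sample = """\
Effective Tera Flops per GPU: 7.41 and total parameters 4.738 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 20.95683088302612;  SamplesPerSecond: 0.19086855366284317
"""
rows = parse_log(sample)
for row in rows:
    # Throughput recomputed from the other two logged fields.
    derived = row["samples"] / row["seconds"]
    print(row["params_b"], row["tflops"], round(derived, 4))
```

Keeping the model size and the throughput in the same row makes it straightforward to write one CSV line per run.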

## 2.5 CASE - Impact of working window size (Figure 9 in Section VI.C)

The working window size affects throughput. A larger window overlaps GPU computation with data transfer more effectively, which raises training throughput, but it also occupies more GPU memory.

This case evaluates the impact of the working window size on STRONGHOLD with a 1.7B model. You will see that, at first, enlarging the window improves performance; beyond a certain size, further enlargement has no effect because the current window already hides the data transfer.
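
To give a feel for the memory side of this trade-off, the sketch below estimates how many weights sit on the GPU for a given window size. It is not taken from the STRONGHOLD source: it uses the common approximation of about `12 * h^2` weights per GPT-style Transformer layer (biases and layer norms ignored), and `bytes_per_param=4` assumes fp32 weights. Gradients, activations, and transfer buffers are not counted, so the real footprint is larger. The function names are hypothetical.

```python
def layer_params(hidden_size):
    """Approximate weight count of one Transformer layer (biases and layer norms ignored)."""
    attention = 4 * hidden_size * hidden_size  # QKV projections plus output projection
    mlp = 8 * hidden_size * hidden_size        # two linear layers with a 4x expansion
    return attention + mlp


def window_weight_bytes(window_size, hidden_size, bytes_per_param=4):
    """Weights resident on the GPU for `window_size` layers (fp32 is an assumption)."""
    return window_size * layer_params(hidden_size) * bytes_per_param


# Configuration used in this case: 32 layers, hidden size 2048,
# padded vocabulary of 50304 tokens, sequence length 1024.
hidden, layers, vocab, seq_len = 2048, 32, 50304, 1024
total = layers * layer_params(hidden) + (vocab + seq_len) * hidden
print(f"estimated model size: {total / 1e9:.3f} B parameters")

for w in (2, 8, 14):
    gib = window_weight_bytes(w, hidden) / 2**30
    print(f"window {w:2d}: ~{gib:.2f} GiB of weights")
```

The estimated total lands close to the `1.717 B` parameters reported in the logs of this case, which suggests the per-layer approximation is reasonable for this model.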

PS: The bandwidth restriction in the virtual machine might slightly hurt the performance of STRONGHOLD.

**Use `ps aux` to check whether other reviewers have processes running, in order to avoid GPU overload.**

In [25]:
%docker_exec ps aux

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4340  2724 pts/0    Ss+  02:10   0:00 /bin/bash
root       43293  0.0  0.0   3976  3160 pts/1    Ss+  06:18   0:00 /bin/bash -c 
root       43451  0.0  0.0   5892  2924 pts/1    R+   06:18   0:00 ps aux
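
The manual check above can also be scripted. The following is a minimal sketch, assuming that a running experiment shows up in `ps aux` with `pretrain_gpt.py` in its command line (as in the training commands logged in this notebook); the helper name `find_training_processes` is hypothetical.

```python
def find_training_processes(ps_output, keyword="pretrain_gpt.py"):
    """Return the `ps aux` rows whose COMMAND column contains `keyword`."""
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)        # COMMAND is the 11th column
        if len(fields) == 11 and keyword in fields[10]:
            matches.append(line)
    return matches


# Rows copied from the `ps aux` output above.
sample = """\
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   4340  2724 pts/0    Ss+  02:10   0:00 /bin/bash
root       43451  0.0  0.0   5892  2924 pts/1    R+   06:18   0:00 ps aux
"""
busy = find_training_processes(sample)
print("GPU likely busy" if busy else "no training process found")
```

If the list is non-empty, it is safer to wait for the other run to finish before launching the experiments below.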


### The following results correspond to Figure 9 in the submitted paper. Please refer to Section VI.C on page 10 for more details. <br><br>Runs for around 48 mins

In [26]:
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 2
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 4
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 6
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 8
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 10
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 12
%docker_exec ./examples/run.sh -m "stronghold" -l 32 -h 2048 -w 14

cd /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/.. && /home/sys/STRONGHOLD/examples/../SHv0-Megatron-LM/examples/sc22-gpt-sh.sh 32 2048 16 1024 4 2 2>&1 | tee /home/sys/STRONGHOLD/examples/../results/log_stronghold_l-32_hs-2048_bs-4_ws-2_2022-07-02.1656742692.txt && cd -
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py': No such file or directory
cp: cannot create regular file '/usr/local/lib/python3.8/dist-packages/deepspeed/ops/adam/cpu_adam.py': No such file or directory
PYTHONGIL=1 python pretrain_gpt.py --num-layers 32 --hidden-size 2048 --num-attention-heads 16 --seq-length 1024 --micro-batch-size 4 --global-batch-size 4 --max-position-embeddings 1024 --train-iters 50 --log-interval 10 --exit-interval 50 --lr-decay-iters 320000 --save ./checkpoints/gpt2 --load ./checkpoints/gpt2 --data-path /home/sys/STRONGHOLD/data/my-gpt2-en_text_document --vocab-file /home/sys/STRONGHOLD/data/gpt2-vocab.json --merge-file

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7862.0 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.068723E+01 | loss scale: 1.0 | grad norm: 8.260 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.15 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.862011671066284;  SamplesPerSecond: 0.508775637502647
time (ms) | e2e-time: 7862.01 | forward-compute: 1143.78 | backward-compute: 6707.56 | backward-embedding-all-reduce: 0.01 | optimizer: 2.48 | batch-generator: 1.12 | offloading-func-call-overhead: 15.54 | offloading-fwd-overhead: 1058.23 | offloading-bwd-overhead: 5319.03 | offloading-fwd-2gpu-overhead: 297.93 | offloading-fwd-2cpu-overhead: 759.71 | offloading-bwd-2gpu-overhead: 405.30 | offloading-bwd-2cpu-overhead: 4912.77
 iteration       40/      50 | elapsed time per iteration (ms): 7983.3 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.063645

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.155 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7702.9 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.430505E+00 | loss scale: 1.0 | grad norm: 3.716 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.3 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.702899551391601;  SamplesPerSecond: 0.5192849748738269
time (ms) | e2e-time: 7702.90 | forward-compute: 1045.38 | backward-compute: 6646.81 | backward-embedding-all-reduce: 0.01 | optimizer: 2.45 | batch-generator: 1.13 | offloading-func-call-overhead: 15.67 | offloading-fwd-overhead: 941.81 | offloading-bwd-overhead: 4778.44 | offloading-fwd-2gpu-overhead: 172.14 | offloading-fwd-2cpu-overhead: 769.09 | offloading-bwd-2gpu-overhead: 137.80 | offloading-bwd-2cpu-overhead: 4639.76
 iteration       40/      50 | elapsed time per iteration (ms): 7540.5 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 9.865253E

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7213.0 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.018981E+01 | loss scale: 1.0 | grad norm: 41.109 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.8 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.212998676300049;  SamplesPerSecond: 0.5545543787693892
time (ms) | e2e-time: 7213.00 | forward-compute: 948.47 | backward-compute: 6253.97 | backward-embedding-all-reduce: 0.01 | optimizer: 2.48 | batch-generator: 1.10 | offloading-func-call-overhead: 15.78 | offloading-fwd-overhead: 840.10 | offloading-bwd-overhead: 4986.02 | offloading-fwd-2gpu-overhead: 157.17 | offloading-fwd-2cpu-overhead: 682.30 | offloading-bwd-2gpu-overhead: 241.93 | offloading-bwd-2cpu-overhead: 4743.20
 iteration       40/      50 | elapsed time per iteration (ms): 7144.7 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.037998E

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.157 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 7306.3 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 9.887196E+00 | loss scale: 1.0 | grad norm: 11.667 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 7.7 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 7.306340432167053;  SamplesPerSecond: 0.5474696993845939
time (ms) | e2e-time: 7306.43 | forward-compute: 895.03 | backward-compute: 6400.73 | backward-embedding-all-reduce: 0.01 | optimizer: 2.46 | batch-generator: 1.16 | offloading-func-call-overhead: 16.62 | offloading-fwd-overhead: 789.98 | offloading-bwd-overhead: 3395.84 | offloading-fwd-2gpu-overhead: 162.99 | offloading-fwd-2cpu-overhead: 626.25 | offloading-bwd-2gpu-overhead: 22.92 | offloading-bwd-2cpu-overhead: 3372.06
 iteration       40/      50 | elapsed time per iteration (ms): 7201.5 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.007727E+

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.159 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6787.7 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.052729E+01 | loss scale: 1.0 | grad norm: 2652389.006 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.29 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.7876777172088625;  SamplesPerSecond: 0.589303170635041
time (ms) | e2e-time: 6787.67 | forward-compute: 813.48 | backward-compute: 5963.67 | backward-embedding-all-reduce: 0.01 | optimizer: 2.45 | batch-generator: 1.14 | offloading-func-call-overhead: 15.22 | offloading-fwd-overhead: 712.31 | offloading-bwd-overhead: 2280.06 | offloading-fwd-2gpu-overhead: 129.19 | offloading-fwd-2cpu-overhead: 581.88 | offloading-bwd-2gpu-overhead: 1.03 | offloading-bwd-2cpu-overhead: 2278.07
 iteration       40/      50 | elapsed time per iteration (ms): 6881.0 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.060

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.159 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6544.1 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.039556E+01 | loss scale: 1.0 | grad norm: 23.211 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.6 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.544146966934204;  SamplesPerSecond: 0.6112332165232401
time (ms) | e2e-time: 6544.15 | forward-compute: 817.64 | backward-compute: 5715.91 | backward-embedding-all-reduce: 0.01 | optimizer: 2.45 | batch-generator: 1.18 | offloading-func-call-overhead: 16.28 | offloading-fwd-overhead: 709.04 | offloading-bwd-overhead: 1019.43 | offloading-fwd-2gpu-overhead: 179.04 | offloading-fwd-2cpu-overhead: 528.76 | offloading-bwd-2gpu-overhead: 0.99 | offloading-bwd-2cpu-overhead: 1017.53
 iteration       40/      50 | elapsed time per iteration (ms): 6414.5 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.039095E+0

 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
>-- rank=0; local_rank=0;
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 50257) with 47 dummy tokens (new size: 50304)
> initializing torch distributed ...
.  > the rank=0 is ready...
.   > rank=0; local_rank=0, device=0
--------distributed env init done ----------
> initializing tensor model parallel with size 1
> initializing pipeline model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
make: Entering directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
make: Nothing to be done for 'default'.
make: Leaving directory '/home/sys/STRONGHOLD/SHv0-Megatron-LM/megatron/data'
> compiling dataset index builder ...
>>> done with dataset index builder. Compilation time: 0.158 seconds
> compiling and loading fused kernels ...

 iteration       30/      50 | elapsed time per iteration (ms): 6491.1 | learning rate: 1.406E-06 | global batch size:     4 | lm loss: 1.045013E+01 | loss scale: 1.0 | grad norm: 39.976 | number of skipped iterations:   0 | number of nan iterations:   0 |
Effective Tera Flops per GPU: 8.67 and total parameters 1.717 B
NumWorkers: 1; SamplesPerStep: 4; IterationTime: 6.491117906570435;  SamplesPerSecond: 0.616226674291515
time (ms) | e2e-time: 6491.14 | forward-compute: 787.34 | backward-compute: 5693.09 | backward-embedding-all-reduce: 0.02 | optimizer: 2.44 | batch-generator: 1.22 | offloading-func-call-overhead: 14.44 | offloading-fwd-overhead: 685.26 | offloading-bwd-overhead: 156.35 | offloading-fwd-2gpu-overhead: 128.67 | offloading-fwd-2cpu-overhead: 555.97 | offloading-bwd-2gpu-overhead: 0.86 | offloading-bwd-2cpu-overhead: 154.60
 iteration       40/      50 | elapsed time per iteration (ms): 6550.0 | learning rate: 1.875E-06 | global batch size:     4 | lm loss: 1.022740E+01 

### To print/draw the relevant information from log files

> - Extract the useful information from the detailed logs and save it to `./results/case5.csv`. <br>
> - Visualize the results as a figure saved to `out/metric_window_size.png`. <br>
> - Run `docker_exec cat ./results/case5.csv` to inspect the numeric values. <br>

In [35]:
%docker_exec ./examples/case5_extract.sh
%docker_exec python ./examples/case5_draw.py

import random; __counter__ = random.randint(0, int(2e9))  # random query string so a cached image is not shown
from IPython.display import HTML, display
display(HTML('<img src="./out/metric_window_size.png?%d" style="height:180px">' % __counter__))

You will see that the time per iteration (ms) is longer at the smaller window sizes. As the window size increases, the iteration time settles at a stable value.
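
The same trend can be read directly from the logs of this case. The sketch below uses the `elapsed time per iteration` and `offloading-bwd-overhead` values of the `iteration 30/50` lines, assuming the log blocks appear in the same order as the `run.sh` commands (`-w 2` through `-w 14`); only the first block carries its window size in the log file name, so this ordering is an assumption.

```python
# Values copied from the "iteration 30/50" log lines of this case.
# Assumption: the log blocks are in the same order as the -w 2 ... -w 14 commands.
window_sizes = [2, 4, 6, 8, 10, 12, 14]
iteration_ms = [7862.0, 7702.9, 7213.0, 7306.3, 6787.7, 6544.1, 6491.1]
bwd_overhead_ms = [5319.03, 4778.44, 4986.02, 3395.84, 2280.06, 1019.43, 156.35]

baseline = iteration_ms[0]
for w, t, o in zip(window_sizes, iteration_ms, bwd_overhead_ms):
    speedup = baseline / t
    print(f"w={w:2d}  iter={t:7.1f} ms  speedup={speedup:.3f}x  bwd-offload={o:8.2f} ms")

# Relative change between the two largest windows, as a rough plateau indicator.
last_gain = (iteration_ms[-2] - iteration_ms[-1]) / iteration_ms[-2]
print(f"gain from w=12 to w=14: {last_gain:.1%}")
```

The backward offloading overhead shrinks as the window grows, while the iteration time changes very little between the two largest windows, which is consistent with the plateau described above.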

-----
# The end of this Artifact Evaluation
-----

#### Many thanks for your review, time and effort on this artifact evaluation.  <br> Many thanks for your understanding and for bearing with the inconveniences of this notebook.

The repository will be made public as soon as possible. Users can reproduce the other experiment figures using the released Docker image or source code.
