In [None]:
############################################################################
##
## Copyright (C) 2022 NVIDIA Corporation.  All rights reserved.
##
## NVIDIA Sample Code
##
## Please refer to the NVIDIA end user license agreement (EULA) associated
## with this source code for terms and conditions that govern your use of
## this software. Any use, reproduction, disclosure, or distribution of
## this software and related documentation outside the terms of the EULA
## is strictly prohibited.
##
############################################################################

## Megatron GPT Pretraining on Tabular Data


### Tensor, Pipeline, and Data Parallelism in Practice

We are ready to pretrain the GPT model. Large models can be difficult to train because of memory constraints. Megatron addresses this by combining tensor parallelism and pipeline parallelism, which makes it possible to train transformer models with billions of parameters. Tensor parallelism and pipeline parallelism are orthogonal to each other. Recall the figure from the previous notebook, which shows how tensor parallelism divides the large model horizontally (intra-layer) and pipeline parallelism divides it vertically (across layers). Both tensor and pipeline parallelism are types of <u>model parallelism</u>.

<br>
<center><img src=images/model_parallelism.png width="50%" height="50%" style="display:block; margin:auto" alt="model parallelism"/></center>
<br>

In addition to model parallelism, we can apply <u>data parallelism</u> to the training to fully utilize all the GPUs in the cluster. This [paper](https://arxiv.org/pdf/2104.04473.pdf) provides a few takeaways about how to set up model parallelism and data parallelism optimally:

1. When considering different forms of model parallelism, tensor model parallelism should generally be used up to degree 𝑔 when using 𝑔-GPU servers, and then pipeline model parallelism can be used to scale up to larger models across servers.
2. When using data and model parallelism, a total model-parallel size of 𝑀 = 𝑡 · 𝑝 should be used so that the model’s parameters and intermediate metadata fit in GPU memory; data parallelism can be used to scale up training to more GPUs. <br>In this case, 𝑡 is the tensor parallel size, and 𝑝 is the pipeline parallel size.
3. The optimal micro batch size 𝑏 depends on the throughput and memory footprint characteristics of the model, as well as the pipeline depth 𝑝, data-parallel size 𝑑, and batch size 𝐵.
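The relationship between these sizes can be checked with a little arithmetic. The sketch below is only an illustration, not Megatron code: the function name `parallel_layout` and its arguments are hypothetical. It assumes that the GPUs left over after model parallelism form the data-parallel replicas, so that 𝑑 = 𝑁 / (𝑡 · 𝑝) for 𝑁 GPUs, and that each replica processes 𝐵 / (𝑏 · 𝑑) micro batches per iteration.

```python
def parallel_layout(world_size, tensor_parallel, pipeline_parallel,
                    micro_batch, global_batch):
    """Return (data_parallel_size, micro_batches_per_iteration).

    Hypothetical helper for illustration; it only does the bookkeeping
    described in the list above.
    """
    model_parallel = tensor_parallel * pipeline_parallel  # M = t * p
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by t * p")
    data_parallel = world_size // model_parallel          # d = N / M

    samples_per_pass = micro_batch * data_parallel        # b * d
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by b * d")
    num_micro_batches = global_batch // samples_per_pass  # B / (b * d)
    return data_parallel, num_micro_batches


# Single-GPU-sized model on 8 GPUs, micro batch 4, global batch 32.
print(parallel_layout(8, 1, 1, micro_batch=4, global_batch=32))

# Assumed larger setup: 64 GPUs, t = 8 within a server, p = 2 across servers.
print(parallel_layout(64, 8, 2, micro_batch=4, global_batch=32))
```

With 𝑡 = 𝑝 = 1, every GPU holds a full copy of the model and all of the parallelism is data parallelism.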

In our experiment, we only train a model that fits into a single GPU, so we set both the tensor model parallel size and the pipeline model parallel size to 1. Here is the script we used for the pretraining task.

```bash
#! /bin/bash
source ./model_config.sh

python -m torch.distributed.launch $DISTRIBUTED_ARGS \
       pretrain_gpt.py \
       --num-layers $NUM_LAYERS \
       --hidden-size $HIDDEN_SIZE \
       --num-attention-heads $NUM_HEADS \
       --micro-batch-size 4 \
       --global-batch-size 32 \
       --seq-length $SEQ_LEN \
       --max-position-embeddings $MAX_POS_EMD \
       --train-iters 500000 \
       --lr-decay-iters 320000 \
       --tensorboard-dir $TB_PATH \
       --save $CHECKPOINT_PATH \
       --data-path $DATA_PATH \
       --data-impl mmap \
       --split 949,50,1 \
       --distributed-backend nccl \
       --tensor-model-parallel-size $TENSOR_MP_SIZE \
       --pipeline-model-parallel-size $PIPELINE_MP_SIZE \
       --lr 0.00015 \
       --lr-decay-style cosine \
       --min-lr 1.0e-5 \
       --weight-decay 1e-2 \
       --clip-grad 1.0 \
       --lr-warmup-fraction .01 \
       --checkpoint-activations \
       --log-interval 100 \
       --save-interval 5000 \
       --eval-interval 1000 \
       --eval-iters 10 \
       --load $LOADPATH \
       --vocab-file $VOCAB_FILE \
       --fp16
```
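To get a feel for the learning-rate flags in the script, the following sketch evaluates a generic linear-warmup plus cosine-decay schedule with the script's values (`--lr 0.00015`, `--min-lr 1.0e-5`, `--lr-decay-iters 320000`, `--lr-warmup-fraction .01`). It is not Megatron's scheduler code. It assumes that the warmup length is the warmup fraction times the decay iterations and that the rate stays at the minimum once decay ends; the function name `lr_at` is hypothetical.

```python
import math

MAX_LR = 1.5e-4        # --lr
MIN_LR = 1.0e-5        # --min-lr
DECAY_ITERS = 320_000  # --lr-decay-iters
# Assumption: warmup length = warmup fraction * decay iterations.
WARMUP_ITERS = round(0.01 * DECAY_ITERS)


def lr_at(step):
    """Learning rate at a given iteration under the assumed schedule."""
    if step < WARMUP_ITERS:
        # Linear ramp from 0 up to the peak rate.
        return MAX_LR * step / WARMUP_ITERS
    if step >= DECAY_ITERS:
        # After the decay window, hold the minimum rate.
        return MIN_LR
    # Cosine decay from the peak rate down to the minimum rate.
    progress = (step - WARMUP_ITERS) / (DECAY_ITERS - WARMUP_ITERS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for step in (0, 1_600, 3_200, 160_000, 320_000, 500_000):
    print(f"{step:>7d}  {lr_at(step):.3e}")
```

Under these assumptions, `--train-iters 500000` is larger than `--lr-decay-iters 320000`, so the final 180,000 iterations would run at the minimum learning rate.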

Run the pretraining script below. Note that this is the most time-consuming step: depending on the compute environment, it can take days for training to converge.

In [None]:
# # IF YOU ARE RETRAINING, YOU MAY WANT TO DELETE ALL THE PREVIOUS MODEL CHECKPOINTS
# OUTPUT_PATH='checkpoints'
# import os
# import shutil
# if os.path.isdir(OUTPUT_PATH):
#     shutil.rmtree(OUTPUT_PATH)

In [1]:
!date

Sat Mar 26 03:35:36 UTC 2022


While the cell below runs, the model checkpoints and TensorBoard events are saved to:

```
TOY_MODEL_CHECKPOINT_PATH=checkpoints/gpt_toy_model
TOY_MODEL_TB_PATH=checkpoints/checkpoints/tb/toy_model
```
These paths are defined in the <a href="./model_config.sh">model_config.sh</a> script. The saved files can be used later to view the run in TensorBoard or to run inference.
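While training runs, it can be useful to check which checkpoint was written most recently. The sketch below assumes the usual Megatron-LM checkpoint layout, in which the save directory contains a tracker file named `latest_checkpointed_iteration.txt` and one `iter_NNNNNNN` subdirectory per saved iteration. Verify this against your own `checkpoints/gpt_toy_model` directory; the helper name `latest_checkpoint` is hypothetical.

```python
from pathlib import Path


def latest_checkpoint(save_dir):
    """Return the path of the most recent checkpoint, or None if none exists.

    Assumes a tracker file 'latest_checkpointed_iteration.txt' that holds
    either an iteration number or the word 'release'.
    """
    save_dir = Path(save_dir)
    tracker = save_dir / "latest_checkpointed_iteration.txt"
    if not tracker.is_file():
        return None
    tag = tracker.read_text().strip()
    if tag == "release":
        return save_dir / "release"
    # Iteration directories are assumed to be zero-padded to 7 digits.
    return save_dir / f"iter_{int(tag):07d}"


print(latest_checkpoint("checkpoints/gpt_toy_model"))
```

The function returns `None` until the first checkpoint has been written, which with `--save-interval 5000` happens only after 5000 iterations.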

In [3]:
!./pretrain_step.sh

and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
using world size: 8, data-parallel-size: 8, tensor-model-parallel size: 1, pipeline-model-parallel size: 1 
--checkpoint-activations is no longer valid, use --activation-checkpoint-method instead. Defaulting to activation-checkpoint-method=uniform.
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .....

# Please shut down the Kernel

For example, select `Kernel -> Shut down kernel`, or in JupyterLab, open the `Running Terminals and Kernels` tab on the left sidebar, hover the mouse over this notebook's name in the `KERNELS` section, and select the `X` that appears.