## Uneven pp

### args:

```
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --decoder-first-pipeline-num-layers ${FIRST_PP_LAYERS} \
        --decoder-last-pipeline-num-layers ${LAST_PP_LAYERS} \
```


REF: [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)

![image.png](./images/pp_1f1b.png)

We denote the number of microbatches in a batch as $m$, the number of pipeline stages (number of devices used for pipeline parallelism) as $p$, the ideal time per iteration as $t_{id}$ (assuming perfect or ideal scaling), and the time to execute a single microbatch’s forward and backward pass as $t_{f}$ and $t_{b}$, respectively. In this schedule, the pipeline bubble consists of $p - 1$ forward passes at the start of a batch and $p - 1$ backward passes at the end. The total amount of time spent in the pipeline bubble is then $t_{pb} = (p - 1)\cdot(t_{f}+t_{b})$. The ideal processing time for the batch is $t_{id} = m\cdot(t_{f} + t_{b})$. Therefore, the fraction of ideal computation time spent in the pipeline bubble is:

$$\text{Bubble time fraction (pipeline bubble size)} =  \frac{t_{pb}}{t_{id}}=\frac{p - 1}{m}$$ 
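
The bubble fraction above can be evaluated with a few lines of Python. This is a minimal sketch; the function name `bubble_fraction` and the sample values of $p$ and $m$ are illustrative assumptions, not part of Megatron-LM.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Bubble time as a fraction of ideal compute time: (p - 1) / m."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / num_microbatches


# Hypothetical example: 4 pipeline stages, increasing number of microbatches.
for m in (4, 8, 32):
    print(f"p=4, m={m}: bubble fraction = {bubble_fraction(4, m):.3f}")
```

With $p$ fixed, increasing the number of microbatches is what shrinks the bubble.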

<br>

Ideally, the forward and backward times on each pipeline rank are as balanced as possible, so that the bubble time is minimized. The following figure gives a more intuitive view of the bubble:

![image-2.png](./images/pp_GPipe.png)

<br>

Now consider unbalanced pipeline parallelism (PP), where the stages take different amounts of time per microbatch:

![image-2.png](./images/pp_uneven_GPipe.png)



![image-2.png](./images/pp_vs_uneven_pp_additional_bubbles.png)
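
The effect of unbalanced stages can be approximated with a small simulation. This is a deliberately simplified sketch, not a simulator of the real GPipe or 1F1B schedule: it assumes each microbatch is a single unit of work per stage (forward and backward lumped together) and that a stage can start a microbatch as soon as the previous stage has finished it. The function name `pipeline_makespan` and the stage times are hypothetical.

```python
def pipeline_makespan(stage_times, num_microbatches):
    """Finish time of the last microbatch on the last stage.

    stage_times[i] is the time stage i spends on one microbatch
    (simplification: forward and backward are lumped together).
    """
    # finish[i]: time at which stage i completed its most recent microbatch.
    finish = [0] * len(stage_times)
    for _ in range(num_microbatches):
        prev_stage_done = 0  # the first stage never waits for input
        for i, t in enumerate(stage_times):
            start = max(prev_stage_done, finish[i])
            finish[i] = start + t
            prev_stage_done = finish[i]
    return finish[-1]


# Same total work per microbatch (8 time units), split evenly vs. unevenly.
balanced = pipeline_makespan([2, 2, 2, 2], num_microbatches=8)
uneven = pipeline_makespan([1, 2, 2, 3], num_microbatches=8)
print("balanced:", balanced)
print("uneven:  ", uneven)
print("additional bubble time:", uneven - balanced)
```

In this model the slowest stage sets the pace, so moving work onto one stage lengthens the whole iteration even though the total amount of work is unchanged.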


<br><br>
## TFLOPS Calculation

### ViT Encoder TFLOPS

Formula for the number of floating-point operations (FLOPS) in the convolutional part (forward pass) of the Vision Transformer (ViT):

$\text{FLOPS}_{\text{conv_forward}} = 2Nhcp^{2}$

 
<br>
Where:

$h$ : Hidden size

$p$ : Patch size

$c$ : Image input channels

$N$ : Number of image tokens after patchification

$l$ : Number of transformer layers


Formula for the number of image tokens after patchification:

$N = \left\lfloor\frac{W + p - 1}{p}\right\rfloor\times\left\lfloor\frac{H + p - 1}{p}\right\rfloor$


Formula for the number of floating-point operations (FLOPS) in the Transformer part (forward pass):

$\text{FLOPS}_{\text{transformer_forward}} = (24Nh^{2}+4hN^{2})l$


Formula for the total number of floating-point operations of the ViT (forward pass + backward pass). The backward pass is counted as twice the forward pass, hence the factor of 3. (Note: the classifier head is ignored.)

$\text{FLOPS}_{\text{ViT_(forward+backward)}} = 3\times((24Nh^{2}+4hN^{2})l + 2Nhcp^{2})$


<br>

### LLM TFLOPS

In most cases, $h_2=4h$:

$\text{FLOPS}_{\text{LLM_(forward+backward)}} = 3\times(24Nh^{2}+4hN^{2})L$

$\text{FLOPS}_{\text{one_layer_forward}} = 24Nh^{2}+4hN^{2}$

If $h_2\ne4h$:

$\text{FLOPS}_{\text{one_layer_forward}} = 8Nh^{2}+4hN^{2}+4Nhh_2$

$\text{FLOPS}_{\text{one_layer_forward_backward}} = 3\times(8Nh^{2}+4hN^{2}+4Nhh_2)$

$\text{FLOPS}_{\text{llm_forward_backward}} = 3\times(8Nh^{2}+4hN^{2}+4Nhh_2)L$

Where:

$h$ : Hidden size

$h_2$ : FFN intermediate size

$N$ : Sequence length, including image tokens and text tokens

$L$ : Number of transformer layers

<br><br>
## Uneven pp parameters configuration

```python
def vit_encoder_tflops_calculator(W, H, P, hidden_size, in_channels, L):
    """Forward + backward FLOPs of the ViT encoder, in units of 1e12."""
    # Number of image tokens after patchification (ceiling division per axis).
    N = ((W + P - 1) // P) * ((H + P - 1) // P)
    print(N)
    tflops_conv_forward = 2 * N * hidden_size * in_channels * (P**2)
    tflops_transformer_forward = (24 * N * (hidden_size**2) + 4 * hidden_size * (N**2)) * L
    # The backward pass costs about twice the forward pass, hence the factor of 3.
    tflops = 3 * (tflops_conv_forward + tflops_transformer_forward)
    return tflops / 1e12


W = 224
H = 224
P = 14
hidden_size = 4096
in_channels = 3
L = 28

vit_tflops = vit_encoder_tflops_calculator(W, H, P, hidden_size, in_channels, L)
print("vit_tflops:", vit_tflops)


def llm_encoder_tflops_calculator(hidden_size, L, seq_len, intermediate_size=None):
    """Forward + backward FLOPs of the LLM (total, per layer), in units of 1e12."""
    if intermediate_size is None:
        # FFN intermediate size is 4 * hidden_size.
        tflops_one_transformer_layer_forward = 24 * seq_len * (hidden_size**2) + 4 * hidden_size * (seq_len**2)
    else:
        tflops_one_transformer_layer_forward = (
            8 * seq_len * (hidden_size**2)
            + 4 * hidden_size * (seq_len**2)
            + 4 * seq_len * hidden_size * intermediate_size
        )
    tflops_transformer_forward = tflops_one_transformer_layer_forward * L
    tflops = tflops_transformer_forward * 3 / 1e12
    tflops_one_layer = tflops_one_transformer_layer_forward * 3 / 1e12
    return tflops, tflops_one_layer


hidden_size = 3584
intermediate_size = 18944
seq_len = 1024
L = 28

llm_tflops, llm_tflops_onelayer = llm_encoder_tflops_calculator(hidden_size, L, seq_len, intermediate_size)
print('llm_tflops:', llm_tflops)
print('llm_tflops_onelayer:', llm_tflops_onelayer)

nGPUs = 2

# Per-rank compute budget, expressed in units of one LLM transformer layer.
total_tflops = vit_tflops + llm_tflops
nlayers_per_rank = total_tflops / nGPUs / llm_tflops_onelayer
print("nlayers_per_rank:\t", nlayers_per_rank)
```

Output:

```
256
vit_tflops: 8.75254775808
llm_tflops: 33.462090203136
llm_tflops_onelayer: 1.195074650112
nlayers_per_rank:	 17.66192511792453
```


<br><br>
## GPU Memory Analysis (LLM)


### Parameters

Ref: [weights memory](https://nvidia-my.sharepoint.com/:p:/p/xueh/EQ8ZCgcmONpCjfMw9YZcBfsB31BAFcL1pX9oMS_DAIKn9A?e=keCH0v)

$Parameters_{\text{weights_per_transformer_layer}}=2h+12\frac{h^2}{tp}+4h+\frac{7h}{tp}=6h+12\frac{h^2}{tp}+\frac{7h}{tp}$

$Parameters_{\text{vocab}}=Vh$

where:

$tp$: tensor parallel size

$h$: hidden size

here, 

$h_{\text{intermediate_size}}=4h$

<br>

### Total static memory per transformer layer

Ref: [total memory](https://nvidia-my.sharepoint.com/:p:/p/xueh/EQ8ZCgcmONpCjfMw9YZcBfsB31BAFcL1pX9oMS_DAIKn9A?e=3DIAlm)

Ref: [ZeRO](https://arxiv.org/pdf/1910.02054)

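When training in BF16 with gradients kept in 16-bit precision, the static memory in bytes follows the ZeRO accounting for mixed-precision Adam: 2 bytes per parameter for the weights, 2 bytes for the gradients, and 12 bytes for the optimizer states (FP32 master weights, momentum, and variance):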

$M_{\text{weights}}=2\phi$

$M_{\text{grads}}=2\phi$

$M_{\text{os}}=12\phi$

$M_{\text{total}}=16\phi$

where:

$\phi$: number of parameters

So, the total static memory per transformer layer is: <span style="color:red">$M_{\text{static_memory_per_transformer_layer}}=16\phi=16*(6h+12\frac{h^2}{tp}+\frac{7h}{tp})$</span>

### Activation memory per transformer layer

<span style="color:red">$M_{\text{activation_memory_per_transformer_layer}}=\frac{34bsh}{tp}$</span>

Here $b$ is the micro-batch size and $s$ is the sequence length. To simplify the calculation, we ignore the activations of the vocabulary embedding at the first stage and the LLM head at the last stage.


### Total memory per transformer layer

<span style="color:red">$M_{\text{per_transformer_layer}}=M_{\text{static_memory_per_transformer_layer}} + M_{\text{activation_memory_per_transformer_layer}}$</span>

<span style="color:red">$M_{\text{per_transformer_layer}}=16*(6h+12\frac{h^2}{tp}+\frac{7h}{tp})+\frac{34bsh}{tp}$</span>
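
The per-layer formula can be turned into a small calculator. This is a sketch under stated assumptions: the result is read as bytes, $b$ and $s$ are taken to be the micro-batch size and sequence length, and the function name `llm_layer_memory_bytes` and the sample configuration are hypothetical.

```python
def llm_layer_memory_bytes(h, tp, b, s):
    """Static, activation, and total memory (bytes) for one transformer layer.

    Uses the formulas above: 16 bytes of static memory per parameter
    and 34*b*s*h/tp bytes of activations.
    """
    params = 6 * h + 12 * h**2 / tp + 7 * h / tp
    static = 16 * params
    activation = 34 * b * s * h / tp
    return static, activation, static + activation


# Hypothetical configuration: h=4096, tp=2, micro-batch size 1, sequence length 4096.
static, activation, total = llm_layer_memory_bytes(h=4096, tp=2, b=1, s=4096)
gib = 1024**3
print(f"static:     {static / gib:.2f} GiB")
print(f"activation: {activation / gib:.2f} GiB")
print(f"total:      {total / gib:.2f} GiB")
```

Multiplying the total by the number of layers assigned to a stage gives a rough per-stage estimate, which helps check that a FLOPs-balanced layer split also fits in memory.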


<br>

## GPU Memory Analysis (ViT Encoder)

### Parameters (ViT Encoder)

$Parameters_{\text{conv}}=p^2C_{in}C_{out}=p^2ch$

$Parameters_{\text{weights_all_transformer_layers}}=(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l$

$\phi=p^2ch+(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l$

where:

$h$ : Hidden size

$p$ : Patch size

$c$ : Image input channels

$N$ : Number of image tokens after patchification

$l$ : Number of transformer layers

$\phi$ : Parameters of ViT Encoder

### Total static memory (ViT Encoder)

$M_{\text{static_memory_vit_encoder}}=16\phi=16*(p^2ch+(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l)$

### Activation memory (ViT Encoder)

$M_{\text{activation_memory_vit_encoder}}=xxx+\frac{34bshl}{tp}$

### Total memory (ViT Encoder)

$M_{\text{total_vit_encoder}}=M_{\text{static_memory_vit_encoder}}+M_{\text{activation_memory_vit_encoder}}$

<span style="color:red">$M_{\text{total_vit_encoder}}=16*(p^2ch+(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l)+xxx+\frac{34bshl}{tp}$ </span>



## GPU Memory Analysis (MLP Adaptor/projector/merger)

This part mainly aims to align the features from the Vision Transformer (ViT) with the text features. It generally consists of one or two linear layers. Here, we take a single linear layer as an example to analyze the GPU memory usage.

### Parameters (MLP/Linear projector)

$Parameters_{\text{projector}}=h_{\text{vit}}h_{\text{llm}}$
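
As a rough illustration, the projector's parameter count and static memory can be estimated as follows. This sketch assumes a single bias-free linear layer that is not sharded by tensor parallelism, and reuses the 16 bytes per parameter from the LLM analysis. The function name `projector_static_memory_bytes` is hypothetical, and the hidden sizes are the ones used in the FLOPs example earlier in this document.

```python
def projector_static_memory_bytes(h_vit, h_llm, bytes_per_param=16):
    """Parameter count and static memory (bytes) of a single linear projector."""
    params = h_vit * h_llm
    return params, params * bytes_per_param


params, static_bytes = projector_static_memory_bytes(h_vit=4096, h_llm=3584)
print("projector parameters:", params)
print(f"projector static memory: {static_bytes / 1024**2:.0f} MiB")
```

Under these assumptions the projector is small next to a single LLM transformer layer.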



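```python
def weights_memory_calculation(hidden_size, num_layers, vocab_size=32000, intermediate_size=None):
    """Parameter count in billions (tp = 1), including input and output embeddings."""
    h = hidden_size
    l = num_layers
    # With the default FFN width of 4h this equals (6h + 12h^2 + 7h) per layer.
    # Support for other widths is an assumption: a two-matrix FFN with biases.
    ffn = 4 * h if intermediate_size is None else intermediate_size
    per_layer = 6 * h + 4 * (h**2) + 2 * h * ffn + 3 * h + ffn
    parameters = per_layer * l + 2 * vocab_size * h
    return parameters / 1e9


hidden_size = 4096
num_layers = 32

weights = weights_memory_calculation(hidden_size, num_layers)

print('weights:', weights)


H = 4096
L = 32
n = 32
d = 128
I = 11008
V = 32000

print(V * H + L * (H * H + n * d * H + 2 * H * I) + H * V)
```

Output:

```
weights: 6.70629888
4221566976
```

```python
print(32000 * 4096 + 32 * (4096 * 4096 + 2 * 32 * 128 * 4096 + 2 * 4096 * 11008) + 4096 * 32000)
```

Output:

```
4758437888
```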

<br><br>
## How to config Uneven PP parameters?
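
One possible approach is to balance the estimated FLOPs per pipeline stage, using the numbers computed in the "Uneven pp parameters configuration" section. The sketch below assumes that the ViT encoder runs entirely on the first pipeline stage and that compute, not memory, is the quantity to balance. The helper `split_llm_layers` is hypothetical and not part of Megatron-LM.

```python
def split_llm_layers(vit_tflops, llm_tflops_per_layer, num_llm_layers, num_stages):
    """Return the number of LLM layers per pipeline stage, balancing estimated FLOPs.

    Assumption: the ViT encoder is placed entirely on the first stage.
    """
    if num_llm_layers < num_stages:
        raise ValueError("need at least one LLM layer per stage")
    if num_stages == 1:
        return [num_llm_layers]

    total = vit_tflops + llm_tflops_per_layer * num_llm_layers
    target = total / num_stages  # compute budget per stage

    rest = num_stages - 1
    # LLM layers that fit on the first stage next to the ViT encoder.
    first = round((target - vit_tflops) / llm_tflops_per_layer)
    first = max(1, min(num_llm_layers - rest, first))

    # Spread the remaining layers over the other stages; extras go to the last ones.
    base, extra = divmod(num_llm_layers - first, rest)
    return [first] + [base + (1 if i >= rest - extra else 0) for i in range(rest)]


# Values taken from the output of the FLOPs calculation above (2 GPUs, 28 LLM layers).
layers = split_llm_layers(
    vit_tflops=8.75254775808,
    llm_tflops_per_layer=1.195074650112,
    num_llm_layers=28,
    num_stages=2,
)
print("layers per stage:", layers)
```

For the two-GPU example, the first and last entries of the returned list are candidates for `FIRST_PP_LAYERS` and `LAST_PP_LAYERS`. The split should still be checked against the memory estimates, since a FLOPs-balanced split is not necessarily memory-balanced.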