# Uneven pp

## Parameters involved in the configuration file:

```
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --decoder-first-pipeline-num-layers ${FIRST_PP_LAYERS} \
        --decoder-last-pipeline-num-layers ${LAST_PP_LAYERS} \
```


<br>

## What scenarios are suitable for enabling Uneven pp?

<span style="color:red">**When the computation time is unbalanced among different PP ranks, Uneven PP can be enabled. Moreover, the more unbalanced time consumed on different ranks, the more obvious the benefits of enabling Uneven PP will be.**</span>

When PP is enabled, in the default configuration, the Transformer layers of the LLM Decoder are evenly distributed across different ranks.

For example, with $L_{\text{num_transformer_layers}}=32$ and `--pipeline-model-parallel-size=4`, each rank is allocated 8 Transformer layers by default.

In this case, the first and the last PP ranks will be assigned more modules. Besides the 8 Transformer layers, the first PP rank will also have a ViT Encoder and an MLP Adaptor. Similarly, in addition to the 8 Transformer layers, the last rank will have an LLM Head.

In comparison, when the Transformer layers of the LLM Decoder are evenly distributed, the first PP rank is assigned more additional modules, which leads to significantly longer computation time and causes more bubbles on other PP ranks.

Why does this happen? You can refer to the following analysis of the principle:

<br><br>

## Principle: Why does GPU utilization decrease when the computation time on different PP ranks is uneven?

<br>

REF: [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://arxiv.org/abs/2104.04473)

![image.png](./images/pp_1f1b.png)

We denote the number of microbatches in a batch as $m$, the number of pipeline stages (number of devices used for pipeline parallelism) as $p$, the ideal time per iteration as $t_{id}$ (assuming perfect or ideal scaling), and the time to execute a single microbatch’s forward and backward pass as $t_{f}$ and $t_{b}$. In this schedule, the pipeline bubble consists of $p - 1$ forward passes at the start of a batch, and $p - 1$ backward passes at the end. The total amount of time spent in the pipeline bubble is then $t_{pb} = (p - 1)\cdot(t_{f}+t_{b})$. The ideal processing time for the batch is $t_{id} = m\cdot(t_{f} + t_{b})$. Therefore, the fraction of ideal computation time spent in the pipeline bubble is:

$$\text{Bubble time fraction (pipeline bubble size)} =  \frac{t_{pb}}{t_{id}}=\frac{p - 1}{m}$$ 
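As a quick numeric check of the formula above, the following minimal sketch evaluates the bubble fraction for a balanced pipeline. The function name `bubble_fraction` is a hypothetical helper introduced here for illustration; it is not part of Megatron-LM or `uneven_pp_utils.py`.

```python
def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Return t_pb / t_id = (p - 1) / m for a balanced pipeline."""
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    return (num_stages - 1) / num_microbatches


# More microbatches per batch amortize the same (p - 1) bubble slots.
for m in (4, 8, 32):
    print(f"p=4, m={m}: bubble fraction = {bubble_fraction(4, m):.3f}")
```

The fraction only shrinks as $m$ grows relative to $p$; it says nothing yet about unbalanced stages, which is the case analyzed next.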

<br>

<span style="color:red">**Ideally, the forward and backward times on each PP rank are as balanced as possible, so that the bubble time is minimized.**</span> We can take a more intuitive look at the bubble from the following figure:

![image-2.png](./images/pp_GPipe.png)

<br>

Let's analyze the situation of unbalanced Pipeline Parallelism (PP):

![image-2.png](./images/pp_uneven_GPipe.png)



![image-2.png](./images/pp_vs_uneven_pp_additional_bubbles.png)


<span style="color:red">**From the above comparison, we can see that since the calculation time of the first PP stage is significantly longer than that of other PP ranks, compared with the ideal situation, additional bubbles are introduced on other ranks. Moreover, the longer the calculation time of the first PP stage, the larger the bubbles on other ranks will be. Therefore, the calculation time on each rank should be made as identical or balanced as possible.**</span>
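The effect described above can be reproduced with a small timing model. This is only a sketch under simplifying assumptions that are not from the original text: a flush-style schedule (all forward passes, then all backward passes), zero communication cost, and identical microbatches. The function name `gpipe_iteration_time` is hypothetical.

```python
def gpipe_iteration_time(fwd, bwd, num_microbatches):
    """Iteration time of an all-forward-then-all-backward pipeline.

    fwd[s] / bwd[s]: per-microbatch forward / backward time on stage s.
    Communication time is ignored (assumption).
    """
    p, m = len(fwd), num_microbatches

    # Forward: microbatch j on stage s needs stage s-1 to finish microbatch j
    # and stage s to finish microbatch j-1.
    f_done = [[0.0] * m for _ in range(p)]
    for s in range(p):
        for j in range(m):
            ready = f_done[s - 1][j] if s > 0 else 0.0
            free = f_done[s][j - 1] if j > 0 else 0.0
            f_done[s][j] = max(ready, free) + fwd[s]

    # Backward: flows from the last stage to the first, and a stage starts
    # its backward work only after finishing all of its forward passes.
    b_done = [[0.0] * m for _ in range(p)]
    for s in reversed(range(p)):
        for j in range(m):
            ready = b_done[s + 1][j] if s < p - 1 else f_done[p - 1][m - 1]
            free = b_done[s][j - 1] if j > 0 else f_done[s][m - 1]
            b_done[s][j] = max(ready, free) + bwd[s]
    return b_done[0][m - 1]


# Same total work per microbatch (6 time units), split evenly vs. unevenly.
balanced = gpipe_iteration_time([1.0, 1.0], [2.0, 2.0], num_microbatches=4)
uneven = gpipe_iteration_time([1.5, 0.5], [3.0, 1.0], num_microbatches=4)
print("balanced:", balanced, "uneven:", uneven)
```

In this toy setting the balanced split finishes in 15 time units and the uneven split in 19.5, although the total work is identical: the slower first stage dictates the pace and the second stage idles.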

<br><br>

## How to find the optimal hyperparameter settings to enable Uneven pp

We need to make the calculation time on each pipeline parallel (PP) rank as balanced as possible. Before the formal training run, we can preliminarily estimate the GPU calculation time through TeraFLOPS (tflops). Therefore, making the calculation time on each PP rank as balanced as possible can be simplified to making the tflops of the modules on each PP rank as balanced as possible.

Note: For the first PP rank, the time includes not only the model computation but also data loading. Data loading is a non-negligible part of the time when configuring Uneven PP, but it cannot be estimated from floating-point operation counts. For simplicity, we ignore it in the calculation. To compensate, in the final result we round down the number of LLM transformer layers assigned to the first PP rank and round up the number assigned to the other ranks.

<br>

$TFLOPS_{\text{total}} = TFLOPS_{\text{vit_encoder}} + TFLOPS_{\text{adaptor}} + TFLOPS_{\text{llm_decoder}}$

<br>

For an even distribution, the TFLOPS on each pipeline parallel (PP) rank is:

$TFLOPS_{\text{per_pp_rank}}=\frac{TFLOPS_{\text{total}}}{\text{pp}}$

Correspondingly, the number of LLM decoder transformer layers that should be assigned to each intermediate rank is:

$L_{\text{per_middle_pp_rank}} = \lceil{\frac{TFLOPS_{\text{per_pp_rank}}}{TFLOPS_{\text{per_llm_decoder_one_transformer_layer}}}}\rceil$

<br>

Therefore, the most important parameter in Uneven PP, `--decoder-first-pipeline-num-layers`, is computed as:

$L_{\text{first_decoder_pp_rank}}=Layers_{\text{llm_decoder}}-L_{\text{per_middle_pp_rank}}*(pp-1)$ 

<br>

The parameters mentioned above are explained as follows:

$pp$: pipeline-model-parallel-size

$Layers_{\text{llm_decoder}}$: total number of LLM decoder transformer layers

$L_{\text{per_middle_pp_rank}}$: number of transformer layers allocated to each PP rank other than the first

$L_{\text{first_decoder_pp_rank}}$: decoder-first-pipeline-num-layers

For the detailed calculation process of TFLOPS, you can refer to: [TFLOPS Calculation Part](http://10.23.206.92:7888/notebooks/UnevenPP-chinese.ipynb#TFLOPS-Calculation)

<br>

To sum up, we can calculate `--decoder-first-pipeline-num-layers` as follows:


$L_{\text{first_decoder_pp_rank}}=Layers_{\text{llm_decoder}}-L_{\text{per_middle_pp_rank}}*(pp-1)$ 

<br>


<span style="color:red">$L_{\text{first_decoder_pp_rank}}=Layers_{\text{llm_decoder}}-\left\lceil\frac{TFLOPS_{\text{total}}}{pp\cdot TFLOPS_{\text{per_llm_decoder_one_transformer_layer}}}\right\rceil*(pp-1)$ </span>
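The layer-count formulas above can be sketched in a few lines of Python. This is not the implementation in `uneven_pp_utils.py`; the function name `uneven_pp_layer_split` is hypothetical, and the clamping of the first-rank count at 0 (together with the cap on the other ranks) is an assumption added here to cover the case where the ViT Encoder alone exceeds the per-rank budget.

```python
import math


def uneven_pp_layer_split(total_flops, flops_per_llm_layer, num_llm_layers, pp):
    """Return (layers_on_first_rank, layers_per_other_rank) for pp >= 2."""
    if pp < 2:
        raise ValueError("uneven PP needs at least 2 pipeline stages")
    flops_per_pp_rank = total_flops / pp
    # Round up on the non-first ranks, as in the formula for L_per_middle_pp_rank.
    layers_per_other_rank = math.ceil(flops_per_pp_rank / flops_per_llm_layer)
    # Assumption: never ask the other ranks for more layers than exist.
    layers_per_other_rank = min(layers_per_other_rank,
                                math.ceil(num_llm_layers / (pp - 1)))
    # Assumption: the first rank cannot hold a negative number of layers.
    first = max(num_llm_layers - layers_per_other_rank * (pp - 1), 0)
    return first, layers_per_other_rank


# Forward FLOP counts obtained by plugging the case 1 settings
# (vit_hidden_size=1280, 28 LLM layers, pp=2) into the appendix formulas.
print(uneven_pp_layer_split(
    total_flops=11_448_016_699_392,
    flops_per_llm_layer=398_358_216_704,
    num_llm_layers=28,
    pp=2,
))
```

Because only the ratio of total to per-layer cost matters, it makes no difference whether forward-only or forward-plus-backward FLOPs are used, as long as both inputs use the same convention.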

<br>

For example, here is how to set the Uneven PP hyperparameters in three scenarios:
1. Case 1: General Vision Transformer (ViT) settings.
2. Case 2: To better observe the advantageous scenarios of Uneven PP, change $\text{vit_hidden_size}$ from 1280 to 4096.
    * In this case, `--decoder-first-pipeline-num-layers` should be set to a smaller number of transformer layers.
3. Case 3: The TFLOPS of the ViT Encoder alone exceeds the TFLOPS that an even split would assign to each pipeline parallel (PP) rank.
    * In this case, set `--decoder-first-pipeline-num-layers=0`, and the transformer layers of the large language model (LLM) are distributed among the remaining PP ranks.
    
<br>


You can use the `uneven_pp_parameters_generator` function in [uneven_pp_utils.py](./uneven_pp_utils.py) to quickly estimate the hyperparameter configuration:

<br>

```python
from uneven_pp_utils import uneven_pp_parameters_generator

# vit encoder parameters shared by all three cases
image_w = 224
image_h = 224
image_in_channels = 3
patch_size = 14
vit_num_layers = 28

# llm decoder parameters shared by all three cases
# (the adaptor maps vit_hidden_size to llm_hidden_size)
decoder_seq_len = 1024
llm_hidden_size = 3584
llm_intermediate_size = 18944
llm_num_layers = 28

pp_size = 2

# Only vit_hidden_size differs between the three cases.
cases = [
    ('case1 (vit ~500M):', 1280),
    ('case2: (vit ~5.6G)', 4096),
    ('case3: (vit ~21G)', 8000),
]

for label, vit_hidden_size in cases:
    vit_intermediate_size = vit_hidden_size * 4

    num_layers_first_pp_rank, num_layers_last_pp_rank = uneven_pp_parameters_generator(
        image_w, image_h, image_in_channels, patch_size, vit_num_layers, vit_hidden_size, vit_intermediate_size,
        decoder_seq_len, llm_num_layers, llm_hidden_size, llm_intermediate_size, pp_size)

    print('\n' + label)
    print('num_layers_first_pp_rank:', num_layers_first_pp_rank)
    print('num_layers_last_pp_rank:', num_layers_last_pp_rank)
```


```
case1 (vit ~500M):
num_layers_first_pp_rank: 13
num_layers_last_pp_rank: 15

case2: (vit ~5.6G)
num_layers_first_pp_rank: 10
num_layers_last_pp_rank: 18

case3: (vit ~21G)
num_layers_first_pp_rank: 0
num_layers_last_pp_rank: 28
```


***

<br>

## GPU memory estimation


Estimate whether the GPU memory usage is reasonable under the current configuration. (Note: The following estimates do not take into account various memory optimization techniques, and the actual memory usage should be based on the real situation.)

For the analysis of GPU memory usage of each part, you can refer to the **GPU Memory Analysis (LLM)** section.

You can use the `memory_first_pp_rank_calculator` function in [uneven_pp_utils.py](./uneven_pp_utils.py) to quickly estimate whether the GPU memory usage is reasonable.

If you adopt the configuration of case 2 with `pp = 2` and `tp = 1`, the estimated memory on the first PP rank is about 123 GB, which exceeds the 96 GB memory capacity of the H20 GPU.

So we try the configuration of case 2 again with `pp = 2` and `tp = 2`. The estimated memory on the first PP rank is now about 61 GB, which fits within the 96 GB memory capacity of the H20 GPU.

Similarly, we need to verify whether the memory on other PP ranks meets the requirements. Finally, the configuration of case 2 with `pp = 2` and `tp = 2` can meet the GPU memory requirements. Therefore, we will conduct experiments with `pp = 2` and `tp = 2` next.

<br>

```python
from uneven_pp_utils import memory_vit_encoder_calculator, memory_first_pp_rank_calculator

# case 2: memory analysis on the first PP rank with pp=2, for tp=1 and tp=2

# vit encoder parameters
image_w = 224
image_h = 224
image_in_channels = 3
patch_size = 14
vit_hidden_size = 4096
vit_intermediate_size = vit_hidden_size * 4
vit_num_layers = 28

bs = 1

# llm decoder parameters
decoder_seq_len = 1024
llm_hidden_size = 3584
llm_intermediate_size = 18944
llm_num_layers = 28

# uneven pp
num_layers_first_pp_stage = 10

for tp in (1, 2):
    print(f'\ncase 2: memory analysis, tp={tp}, pp=2')

    memory_vit_encoder_calculator(image_w, image_h, image_in_channels, patch_size, vit_num_layers, vit_hidden_size, vit_intermediate_size, bs, tp)

    memory_first_pp_rank = memory_first_pp_rank_calculator(image_w, image_h, image_in_channels, patch_size, vit_num_layers, vit_hidden_size, vit_intermediate_size, decoder_seq_len, llm_num_layers, llm_hidden_size, llm_intermediate_size, num_layers_first_pp_stage, bs, tp)
```

```
case 2: memory analysis, tp=1, pp=2
total_memory_vit_encoder:91.255 GB
memory_llm_decoder_on_first_pp_rank:31.392 GB
memory on first pp rank:122.647 GB

case 2: memory analysis, tp=2, pp=2
total_memory_vit_encoder:45.653 GB
memory_llm_decoder_on_first_pp_rank:15.698 GB
memory on first pp rank:61.350 GB
```


<br><br>

****

# Tips

The parameter configuration of Uneven PP can be determined through the following steps:
1. Balance the compute allocation. Calculate the Uneven PP configuration based on TFLOPS. In the example, this gives `--decoder-first-pipeline-num-layers=10`, `--pipeline-model-parallel-size=2`, and `--tensor-model-parallel-size=1`.
2. Ensure that the GPU memory meets the requirements. In the example, `--decoder-first-pipeline-num-layers=10`, `--pipeline-model-parallel-size=2`, and `--tensor-model-parallel-size=1` exceeds the memory capacity of the H20-96G. Therefore, the final selection is `--decoder-first-pipeline-num-layers=10`, `--pipeline-model-parallel-size=2`, and `--tensor-model-parallel-size=2`.


<br><br>

****

# Examples: Uneven pp enabled vs. Uneven pp disabled

Taking case 2 as an example, let's look at the performance improvement that enabling Uneven PP can bring:

Reference script: [run_mcore_qwen_h20.sh](./run_mcore_qwen_h20.sh)

## exp1: uneven pp disabled

```
sh run_mcore_qwen_h20.sh  \
dsw  \
7B   \
1    \
32 \
1e-5   \
1e-6   \
1024  \
1024  \
bf16  \
2   \
2  \
1 \
false \
false \
true   \
false \
false \
100000  \
/workspace/data/mm/LLaVA-Pretrain/wds   \
/workspace/data/mm/LLaVA-Pretrain/wds   \
/workspace/data/mm/model/qwen2-vl-ckpts/Qwen2-VL-2B-Instruct \
20000  \
200   \
./output_mcore_qwen2vl_pretrain \
14 \
0
```

<br>

## exp2: uneven pp enabled

```
sh run_mcore_qwen_h20.sh  \
dsw  \
7B   \
1    \
32 \
1e-5   \
1e-6   \
1024  \
1024  \
bf16  \
2   \
2  \
1 \
false \
false \
true   \
false \
false \
100000  \
/workspace/data/mm/LLaVA-Pretrain/wds   \
/workspace/data/mm/LLaVA-Pretrain/wds   \
/workspace/data/mm/model/qwen2-vl-ckpts/Qwen2-VL-2B-Instruct \
20000  \
200   \
./output_mcore_qwen2vl_pretrain \
10 \
0
```

<br>
With Uneven PP enabled, the speedup obtained is as follows:

# Benchmark: Uneven pp enabled vs. Uneven pp disabled on case 2

|config|Time per iteration (ms)|Speedup|GPU|mbs/gbs|
|:-:|:-:|:-:|:-:|:-:|
|Uneven pp disabled|28388.7||H20|1/32|
|Uneven pp enabled (10, 18)|19607.1|1.448|H20|1/32|

<br>



<br><br>
***

# Timeline (Uneven pp disabled vs. Uneven pp enabled)

From the timeline below, we can clearly see that:
1. When Uneven PP is enabled (`--decoder-first-pipeline-num-layers=12`, `--decoder-last-pipeline-num-layers=16`, `--pipeline-model-parallel-size=2`), compute is distributed more evenly between the two PP ranks (32.9% vs. 34.1%);
2. When Uneven PP is disabled (`--decoder-first-pipeline-num-layers=14`, `--decoder-last-pipeline-num-layers=14`, `--pipeline-model-parallel-size=2`), compute is distributed unevenly between the two PP ranks (34.1% vs. 17.4%);
3. Comparing the two, when Uneven PP is disabled, PP rank 1 spends longer waiting on communication and shows more communication-wait bubbles.

<br>

![](./images/nsys_timeline_even_pp_vs_uneven_pp.png)

<br><br><br><br>

***
***


# Appendix


***

# TFLOPS Calculation

### ViT Encoder TFLOPS

Formula for the number of floating-point operations (FLOPS) in the convolutional part (forward pass) of the Vision Transformer (ViT):

$\text{FLOPS}_{\text{conv_forward}} = 2Nhcp^{2}$

 
<br>
Where:

$h$ : Hidden size

$p$ : Patch size

$c$ : Image input channels

$N$ : Number of image tokens after patchification

$l$ : Number of transformer layers


Formula for the number of image tokens after patchification:

$N = \left\lfloor\frac{W + p - 1}{p}\right\rfloor\times\left\lfloor\frac{H + p - 1}{p}\right\rfloor$


Formula for the number of floating-point operations (FLOPS) in the Transformer part (forward pass):

$\text{FLOPS}_{\text{transformer_forward}} = (24Nh^{2}+4hN^{2})l$


Formula for the total number of floating-point operations (FLOPs) of the ViT (forward pass + backward pass, with the backward pass counted as twice the forward pass, hence the factor of 3). Note: The classifier head is ignored:

$\text{FLOPS}_{\text{ViT_(forward+backward)}} = 3\times((24Nh^{2}+4hN^{2})l + 2Nhcp^{2})$


<br><br>

### Adaptor/Projector TFLOPS

This part mainly aims to align the features from the Vision Transformer (ViT) with the text features. It generally consists of one or two linear layers. Here, we take a single linear layer as an example to calculate the TFLOPs.


$\text{FLOPS}_{\text{projector_forward}} = 2bs_{vit}h_{vit}h_{llm}$

<br><br>

### LLM Decoder TFLOPS

In most cases, $h_2=4h$:

$\text{FLOPS}_{\text{LLM_(forward+backward)}} = 3\times(24Nh^{2}+4hN^{2})l$

$\text{FLOPS}_{\text{one_layer_forward}} = 24Nh^{2}+4hN^{2}$

If $h_2\ne4h$:

$\text{FLOPS}_{\text{one_layer_forward}} = 8Nh^{2}+4hN^{2}+4Nhh_2$

$\text{FLOPS}_{\text{one_layer_forward_backward}} = 3\times(8Nh^{2}+4hN^{2}+4Nhh_2)$

$\text{FLOPS}_{\text{llm_forward_backward}} = 3\times(8Nh^{2}+4hN^{2}+4Nhh_2)l$

Where:

$h$ : Hidden size

$h_2$ : FFN intermediate size

$N$ : Sequence length, including both image tokens and text tokens

$l$ : Number of transformer layers

<br>

<span style="color:red">**Note: The hyperparameters $l$, $N$, $h$ of the LLM are different from those of the ViT. To distinguish them, we will use $l_{\text{vit}}$, $N_{\text{vit}}$, $h_{\text{vit}}$, $l_{\text{llm}}$, $N_{\text{llm}}$, $h_{\text{llm}}$ to replace them respectively.**</span>
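The forward-pass formulas in this section can be written down directly as code. The sketch below is an illustration under the formulas as stated here, not the code in `uneven_pp_utils.py`; all function names are hypothetical, and the projector is assumed to be a single linear layer with batch size 1 in the usage lines.

```python
def num_image_tokens(image_w, image_h, patch_size):
    """N = floor((W + p - 1) / p) * floor((H + p - 1) / p)."""
    return ((image_w + patch_size - 1) // patch_size) * \
           ((image_h + patch_size - 1) // patch_size)


def vit_forward_flops(n, h, num_layers, channels, patch_size):
    """Patch-embedding convolution plus all ViT transformer layers."""
    conv = 2 * n * h * channels * patch_size ** 2
    transformer = (24 * n * h ** 2 + 4 * h * n ** 2) * num_layers
    return conv + transformer


def projector_forward_flops(batch, s_vit, h_vit, h_llm):
    """Single linear layer mapping ViT features to the LLM hidden size."""
    return 2 * batch * s_vit * h_vit * h_llm


def llm_layer_forward_flops(n, h, h2):
    """One decoder layer with FFN intermediate size h2 (general case)."""
    return 8 * n * h ** 2 + 4 * h * n ** 2 + 4 * n * h * h2


# Case 1 settings from the example above.
n_vit = num_image_tokens(224, 224, 14)
vit = vit_forward_flops(n_vit, 1280, 28, 3, 14)
proj = projector_forward_flops(1, n_vit, 1280, 3584)
llm_layer = llm_layer_forward_flops(1024, 3584, 18944)
total = vit + proj + 28 * llm_layer
print("per-layer LLM FLOPs:", llm_layer)
print("total forward FLOPs:", total)
```

Multiplying every term by 3 gives the forward-plus-backward counts; the ratio between total and per-layer cost, which is what the layer split depends on, is unchanged.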

<br><br>
****

# GPU memory Analysis


## GPU Memory Analysis (Transformer Layer)


### Parameters

Ref: [weights memory](https://nvidia-my.sharepoint.com/:p:/p/xueh/EQ8ZCgcmONpCjfMw9YZcBfsB31BAFcL1pX9oMS_DAIKn9A?e=keCH0v)

$Parameters_{\text{weights_per_transformer_layer}}=2h+12\frac{h^2}{tp}+4h+\frac{7h}{tp}=6h+12\frac{h^2}{tp}+\frac{7h}{tp}$

$Parameters_{\text{vocab}}=Vh$

where:

$tp$: tensor parallel size

$h$: hidden size

here, 

$h_{\text{intermediate_size}}=4h$

<br>

### Total static memory per transformer layer

Ref: [total memory](https://nvidia-my.sharepoint.com/:p:/p/xueh/EQ8ZCgcmONpCjfMw9YZcBfsB31BAFcL1pX9oMS_DAIKn9A?e=3DIAlm)

Ref: [ZeRO](https://arxiv.org/pdf/1910.02054)

When training in BF16 with gradients kept in 16-bit precision, the static memory occupation (in bytes) is:

$M_{\text{weights}}=2\phi$

$M_{\text{grads}}=2\phi$

$M_{\text{os}}=12\phi$

$M_{\text{total}}=16\phi$

where:

$\phi$: parameters number

So, the total static memory per transformer layer is: <span style="color:red">$M_{\text{static_memory_per_transformer_layer}}=16\phi=16*(6h+12\frac{h^2}{tp}+\frac{7h}{tp})$</span>

### Activation memory per transformer layer

<span style="color:red">$M_{\text{activation_memory_per_transformer_layer}}=\frac{34bsh}{tp}$</span>

To simplify the calculation, we ignore the activations of the vocabulary embedding at the first stage and of the LLM Head at the last stage.


### Total memory per transformer layer

$M_{\text{per_transformer_layer}}=M_{\text{static_memory_per_transformer_layer}} + M_{\text{activation_memory_per_transformer_layer}}$

<span style="color:red">$M_{\text{per_transformer_layer}}=16*(6h+12\frac{h^2}{tp}+\frac{7h}{tp})+\frac{34bsh}{tp}$</span>
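The per-layer estimate above translates into a short helper. This is a sketch of the formula as written, assuming the result is in bytes (16 bytes of static state per parameter, 16-bit activations); the function name `transformer_layer_memory_bytes` is hypothetical and is not taken from `uneven_pp_utils.py`.

```python
def transformer_layer_memory_bytes(h, b, s, tp):
    """Static plus activation memory of one transformer layer, in bytes.

    h: hidden size, b: micro batch size, s: sequence length,
    tp: tensor parallel size. Assumes intermediate size 4h, as stated above.
    """
    params = 6 * h + 12 * h ** 2 / tp + 7 * h / tp
    static = 16 * params              # weights + grads + optimizer states
    activation = 34 * b * s * h / tp  # activations kept for the backward pass
    return static + activation


# LLM decoder settings from the example (h=3584, s=1024, b=1).
for tp in (1, 2):
    gb = transformer_layer_memory_bytes(3584, 1, 1024, tp) / 1e9
    print(f"tp={tp}: {gb:.3f} GB per transformer layer")
```

Note that the $6h$ term is not divided by $tp$, so doubling $tp$ does not exactly halve the per-layer memory.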


<br>

## GPU Memory Analysis (ViT Encoder)

### Parameters (ViT Encoder)

$Parameters_{\text{conv}}=p^2C_{in}C_{out}=p^2ch$

$Parameters_{\text{weights_all_transformer_layers}}=(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l$

$\phi=p^2ch+(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l$

where:

$h$ : Hidden size

$p$ : Patch size

$c$ : Image input channels

$N$ : Number of image tokens after patchification

$l$ : Number of transformer layers

$\phi$ : Parameters of ViT Encoder

### Total static memory (ViT Encoder)

$M_{\text{static_memory_vit_encoder}}=16\phi=16*(p^2ch+(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l)$

### Activation memory (ViT Encoder)

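$M_{\text{activation_memory_vit_encoder}}=2WHc+\frac{34bshl}{tp}$

Here $W$ and $H$ are the input image width and height (so $2WHc$ is the input image stored in 16-bit precision, not a term in the hidden size), and $s$ is the number of image tokens $N$.

### Total memory (ViT Encoder)

$M_{\text{total_vit_encoder}}=M_{\text{static_memory_vit_encoder}}+M_{\text{activation_memory_vit_encoder}}$

<span style="color:red">$M_{\text{total_vit_encoder}}=16*(p^2ch+(6h+12\frac{h^2}{tp}+\frac{7h}{tp})l)+2WHc+\frac{34bshl}{tp}$ </span>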
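The ViT Encoder total can be evaluated with a few lines of Python. This is a sketch, not the `memory_vit_encoder_calculator` from `uneven_pp_utils.py`; the function name is hypothetical. It assumes that the first activation term is 2 x image width x image height x channels, that $s$ is the number of image tokens, and that 1 GB = $10^9$ bytes. Under these assumptions it gives the same ViT Encoder figures as the case 2 output shown earlier.

```python
def vit_encoder_memory_bytes(image_w, image_h, channels, patch_size,
                             h, num_layers, b, tp):
    """Static plus activation memory of the ViT Encoder, in bytes."""
    # Number of image tokens after patchification (ceil division per axis).
    n = ((image_w + patch_size - 1) // patch_size) * \
        ((image_h + patch_size - 1) // patch_size)
    # Patch-embedding convolution plus all transformer layers.
    params = patch_size ** 2 * channels * h + \
        (6 * h + 12 * h ** 2 / tp + 7 * h / tp) * num_layers
    static = 16 * params
    # Input image in 16-bit precision plus per-layer activations.
    activation = 2 * image_w * image_h * channels + \
        34 * b * n * h * num_layers / tp
    return static + activation


# Case 2 ViT settings (vit_hidden_size=4096, 28 layers, batch size 1).
for tp in (1, 2):
    gb = vit_encoder_memory_bytes(224, 224, 3, 14, 4096, 28, 1, tp) / 1e9
    print(f"tp={tp}: {gb:.3f} GB")
```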


<br>

## GPU Memory Analysis (MLP Adaptor/projector)

This part mainly aims to align the features from the Vision Transformer (ViT) with the text features. It generally consists of one or two linear layers. Here, we take a single linear layer as an example to analyze the GPU memory usage.

### Parameters (Linear projector)

$Parameters_{\text{projector}}=h_{\text{vit}}h_{\text{llm}}$

### Total static memory (Linear projector)

$M_{\text{static_memory_projector}}=16\phi=16h_{\text{vit}}h_{\text{llm}}$

### Activation memory (Linear projector)

$M_{\text{activation_memory_projector}}=2b{s_{\text{vit}}}{h_{\text{vit}}}$

### Total memory (Linear projector)

$M_{\text{total_projector}}=M_{\text{static_memory_projector}}+M_{\text{activation_memory_projector}}$

<span style="color:red">$M_{\text{total_projector}}=16h_{\text{vit}}h_{\text{llm}}+2b{s_{\text{vit}}}{h_{\text{vit}}}$ </span>



<br>

## GPU memory on different pp ranks

Note: The ViT Encoder and the LLM Decoder use different values of $h$, $s$, and $l$. To distinguish them, we write $h_{\text{vit}}$, $s_{\text{vit}}$, $l_{\text{vit}}$ for the ViT Encoder and $h_{\text{llm}}$, $s_{\text{llm}}$ for the LLM Decoder. $W$ and $H$ denote the input image width and height.

The first pp rank:

$M_{\text{first_pp_rank}}=M_{\text{total_vit_encoder}}+M_{\text{total_projector}}+M_{\text{total_llm_on_first_rank}}$

<span style="color:red">$M_{\text{first_pp_rank}}=16(p^2ch_{\text{vit}}+(6h_{\text{vit}}+12\frac{h_{\text{vit}}^2}{tp}+\frac{7h_{\text{vit}}}{tp})l_{\text{vit}})+2WHc+\frac{34bs_{\text{vit}}h_{\text{vit}}l_{\text{vit}}}{tp} \\ \hspace{4cm} +  16h_{\text{vit}}h_{\text{llm}}+2b{s_{\text{vit}}}{h_{\text{vit}}} \\ \hspace{4cm} + (16(6h_{\text{llm}}+12\frac{h_{\text{llm}}^2}{tp}+\frac{7h_{\text{llm}}}{tp})+\frac{34bs_{\text{llm}}h_{\text{llm}}}{tp})l_{\text{decoder_first_pipeline_num_layers}}$</span>
    
 
<span style="color:red">$M_{\text{other_pp_rank}}=(16(6h_{\text{llm}}+12\frac{h_{\text{llm}}^2}{tp}+\frac{7h_{\text{llm}}}{tp})+\frac{34bs_{\text{llm}}h_{\text{llm}}}{tp})l_{\text{num_layers_per_other_pp_rank}}$</span>
    
<span style="color:red">$M_{\text{last_pp_rank}}=(16(6h_{\text{llm}}+12\frac{h_{\text{llm}}^2}{tp}+\frac{7h_{\text{llm}}}{tp})+\frac{34bs_{\text{llm}}h_{\text{llm}}}{tp})l_{\text{num_layers_per_other_pp_rank}} + 16{\frac{h_{\text{llm}}V}{tp}}+8{\frac{bs_{\text{llm}}h_{\text{llm}}}{tp}}$</span>
    
    