# Model-Parallel Deep Learning Efficient DL, Episode VI '25

Yandex Research



# Dealing with large models Model-Parallel Deep Learning Efficient DL, Episode III '25

Yandex Research





#### Recap: large models



Image Classification ImageNet Machine Translation average over WMT

Source: https://arxiv.org/abs/1811.06965

#### Recap: Ring allreduce

Bonus quest: you can only send data between adjacent gpus



Ring topology



*Image: graphcore ipu server* 

Answer & more: tinyurl.com/ring-allreduce-blog

#### Recap: All-Reduce SGD

arxiv.org/abs/1706.02677

Idea: get rid of the host, each gpu runs its own computation Q: why will weights be equal after such step?



Q: What if a model is larger than GPU?

Q: What if a model is larger than GPU? easy mode: cannot fit the right batch size hard mode: cannot fit a single sample expert mode: not even parameters!

Q: What if a model is larger than GPU? easy mode: cannot fit the right batch size hard mode: cannot fit a single sample expert mode: not even parameters!



Q: What if a model is larger than GPU? easy mode: cannot fit the right batch size

hard mode: cannot fit a single sample expert mode: not even parameters!

**Solution:** accumulate grads from several training batches

```
[ ] 1 optimizer.zero_grad()
2 for i in range(B):
3   loss = model(**next_batch())
4   (loss / B).backward()
5 optimizer.step()
```

Q: What if a model is larger than GPU? easy mode: cannot fit the right batch size hard mode: cannot fit one training sample expert mode: not even parameters!

aka rematerialization



Paper (DL): arxiv.org/pdf/1604.06174.pdf

#### **Normal backprop**



Paper (DL): arxiv.org/pdf/1604.06174.pdf

#### **Full rematerialization**



Paper (DL): arxiv.org/pdf/1604.06174.pdf

#### Single checkpoint

checkpoint



Paper (DL): arxiv.org/pdf/1604.06174.pdf

#### Single checkpoint



Paper (DL): arxiv.org/pdf/1604.06174.pdf

Q: What if a model is larger than GPU? easy mode: cannot fit batch size 1 expert mode: not even parameters!

You still have one GPU... (but not only a GPU)







L2L: https://arxiv.org/abs/2002.05645

#### **EPS with L2L execution**



- Initialize all layers on CPU
- Move k layers at a time to GPU
- Remove layers after computation
- Fetch k+1-st layer while k-th runs
- Still 20-50% overhead

L2L: https://arxiv.org/abs/2002.05645

| Метнор           | UBATCH<br>SIZE | DEVICE<br>BATCH SIZE | #Layer    | #PARAMETERS | MEMORY<br>(GB) |
|------------------|----------------|----------------------|-----------|-------------|----------------|
| BASELINE         | 2 2            | 2                    | 24        | 300 MILLION | 9.23           |
| BASELINE         |                | <b>2</b>             | <b>48</b> | 600 MILLION | <b>OOM</b>     |
| L2L-STASH ON GPU | 64             | 64                   | 24        | 300 MILLION | 5.22           |
| L2L-STASH ON GPU | 64             | 64                   | 48        | 600 MILLION | 6.76           |
| L2L-STASH ON GPU | 64             | 64                   | 96        | 1.2 BILLION | 9.83           |
| L2L-STASH ON CPU | 64             | 64                   | 24        | 300 MILLION | 3.69           |
| L2L-STASH ON CPU | 64             | 64                   | 96        | 1.2 BILLION | 3.69           |
| L2L-STASH ON CPU | 64             | 64                   | 384       | 4.8 BILLION | 3.69           |



- Offload in parallel with computation
- Use gradient checkpointing
- Delayed parameter update



- Offload in parallel with computation
- Use gradient checkpointing
- Delayed parameter update



**Figure 6:** Delayed parameter update during the training process.

- Offload in parallel with computation
- Use gradient checkpointing
- Delayed parameter update





Q: What if a model is larger than GPU? easy mode: cannot fit batch size 1 expert mode: not even parameters!

# Can we do it better with multiple GPUs?









## Model-parallel training

**Q:** What if a model is larger than GPU?



#### Model-parallel training

Q: What if a model is larger than GPU?



## **Pipelining**

**GPipe:** arxiv.org/abs/1811.06965 – good starting point, *not* the 1<sup>st</sup> paper

Idea: split data into micro-batches and form a pipeline (right)



model size: O(n)

Gradients

throughput: O(n) – with caveats

## **Pipelining**

**GPipe:** arxiv.org/abs/1811.06965 – good starting point, *not* the 1<sup>st</sup> paper

Idea: split data into micro-batches and form a pipeline (right)



model size: O(n)

Gradients

throughput: O(n) – with caveats

Q: Even faster?

#### Reducing the bubble

GPipe: arxiv.org/abs/1811.06965





... to be improved in a moment

Note: backward takes longer than forward in practice

E.g. linear forward has one matmul, backward has two matmuls (dW and dX)

#### Reducing the bubble

1F1B pipeline from Megatron: https://arxiv.org/abs/2104.04473



## Reducing the bubble (further)

1F1B pipeline from Megatron: https://arxiv.org/abs/2104.04473



## Reducing the bubble (furtherer)

ZB1P: "almost zero bubble" https://arxiv.org/abs/2401.10241



#### Idea: split backward into two ops:

- w.r.t inputs and w.r.t. weights

Grad w.r.t. weights doesn't block backward pass to prev stage





Figure 3: Handcrafted pipeline schedules, top: ZB-H1; bottom: ZB-H2

## Reducing the bubble (furtherer yet)

Deepseek V1 schedule: https://arxiv.org/abs/2412.19437



Figure 5 | Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication.

## **Asynchronous Pipelining**

PipeDream: arxiv.org/abs/1806.03377

**Idea:** apply gradients with every microbatch for maximum throughput

#### Also neat:

- Automatically partition layers to GPUs via dynamic programming
- Store k past weight versions to reduce gradient staleness
- Aims at high latency



## Pipelining Recap

#### When to use:

- model doesn't fit on GPU; have multiple GPUs
- if model fits, but not the activations: ???
- if model doesn't fit, but you only have one GPU: ???

#### How to use:

(just a moment...)

## Pipelining Recap

#### When to use:

- model doesn't fit on GPU; have multiple GPUs
- if model fits, but not the activations: just do grad checkpointing!
- if model doesn't fit, but you only have one GPU: offloading!

#### How to use:

- Basic implementation (GPipe): github.com/kakaobrain/torchgpipe

```
from torchgpipe import GPipe
model = nn.Sequential(a, b, c, d) 
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)
```

## Pipelining Recap

#### When to use:

- model doesn't fit on GPU; have multiple GPUs
- if model fits, but not the activations: just do grad checkpointing!
- if model doesn't fit, but you only have one GPU: offloading!

#### How to use:

- Basic implementation (GPipe): github.com/kakaobrain/torchgpipe
- PyTorch built-in: pytorch.org/tutorials/intermediate/pipelining tutorial.html

```
from torch.distributed.pipelining import ScheduleGPipe

# Create a schedule
schedule = ScheduleGPipe(stage, n_microbatches)
```

uses torch.distributed (torchrun) | supports GPipe, 1F1B, extendable!

# Pipelining Recap

#### When to use:

- model doesn't fit on GPU; have multiple GPUs
- if model fits, but not the activations: just do grad checkpointing!
- if model doesn't fit, but you only have one GPU: offloading!

#### How to use:

- Basic implementation (GPipe): github.com/kakaobrain/torchgpipe
- PyTorch built-in: pytorch.org/tutorials/intermediate/pipelining\_tutorial.html
- DeepSpeed: https://deepspeed.readthedocs.io/en/latest/pipeline.html

#### Custom pipelines in many applications

- Megatron-LM: <a href="https://github.com/NVIDIA/Megatron-LM">https://github.com/NVIDIA/Megatron-LM</a> (transformer-specific)
- Megablocks: https://github.com/databricks/megablocks (mixture-of-experts)

# Pipelining Recap

#### When to use:

- model doesn't fit on GPU; have multiple GPUs
- if model fits, but not the activations: just do grad checkpointing!
- if model doesn't fit, but you only have one GPU: offloading!

#### How to use:

- Basic implementation (GPipe): github.com/kakaobrain/torchgpipe
- PyTorch built-in: pytorch.org/tutorials/intermediate/pipelining\_tutorial.html
- DeepSpeed: https://deepspeed.readthedocs.io/en/latest/pipeline.html

#### **Problems:**

- Bubbles = wasted compute time (duh)
- What if model layers aren't symmetric? (e.g. LLM "head" or any ViT)

  Balancing a pipeline is a world of hurt.

[short break]

How else can we run a large model

over multiple GPUs / hosts?

# Tensor-parallel training

https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks



Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

See also: DP + TP https://arxiv.org/abs/1404.5997

# Tensor-parallel training



#### Q: find AllReduce op here



### Q: find AllReduce op here



# Tensor-parallel training

https://arxiv.org/pdf/2104.04473

Mix and match parallelism directions to reduce synchronization

#### MLP: split over neurons



Figure 5: Blocks of transformer model partitioned with tensor model parallelism (figures borrowed from Megatron [40]). f and g are conjugate. f is the identity operator in the forward pass and all-reduce in the backward pass, while g is the reverse.

#### Attention: split over heads



(b) Self-Attention

# Tensor-parallel training

https://arxiv.org/pdf/2104.04473

Mix and match parallelism directions to reduce synchronization

#### MLP: split over neurons



Figure 5: Blocks of transformer model partitioned with tensor model parallelism (figures borrowed from Megatron [40]). f and g are conjugate. f is the identity operator in the forward pass and all-reduce in the backward pass, while g is the reverse.

#### **Attention: split over heads**



(b) Self-Attention.

## Sequence Parallelism

https://arxiv.org/abs/2309.14509

Avoid storing the all activations on every device



Figure 2: DeepSpeed sequence parallelism (DeepSpeed-Ulysses) design

# [MOAR] Sequence Parallelism

Early mention of parallelism over sequences

https://arxiv.org/abs/2105.05720

**DeepSpeed Ulysses – the method from previous slide** 

https://arxiv.org/abs/2309.14509

Ring Attention – compute attention dot / softmax in parallel https://arxiv.org/abs/2310.01889

FLUX – overlap computation and communication with custom kernels https://arxiv.org/abs/2406.06858

source: https://sites.google.com/view/icml-2022-big-model



source: https://sites.google.com/view/icml-2022-big-model

Classic view

Data parallelism

Model parallelism

New view (this tutorial)

Inter-op parallelism

Intra-op parallelism

source: https://sites.google.com/view/icml-2022-big-model

#### Data parallelism

#### Model parallelism



source: https://sites.google.com/view/icml-2022-big-model

#### Data and model parallelism

- Two pillars: data and model.
- Variation of the second of the
- ? "Model parallelism" is vague.
- ? The view creates ambiguity for methods that neither partitions data nor the model computation.

#### **New:** Inter-op and Intra-op parallelism.

- Two pillars: computational graph and device cluster
- This view is based on their computing characteristics.
- This view facilitates the development of new parallelism methods.

source: https://sites.google.com/view/icml-2022-big-model

$$egin{aligned} heta^{(t+1)} &= fig( heta^{(t)},\, 
abla_Lig( heta^{(t)},\, D^{(t)}ig)ig) \ L &= ext{MSE}(w_2 \cdot ext{ReLU}(w_1x),\, y) \quad heta = \{w_1,w_2\},\, D = \{(x,y)\} \ f( heta, 
abla_L) &= heta - 
abla_L \end{aligned}$$

Operator / its output tensor → Data flowing direction



source: https://sites.google.com/view/icml-2022-big-model

# Compute graph

# Device cluster





source: https://sites.google.com/view/icml-2022-big-model

Q: How to partition the graph on the device cluster?





source: https://sites.google.com/view/icml-2022-big-model



### Strategy 1



### Strategy 2



## Strategy 3



## Strategy 4



source: https://sites.google.com/view/icml-2022-big-model



Device 1

Device 2

**Q:** have you seen S1/2/3/4 before?

### Strategy 1



### Strategy 2



### Strategy 3



## Strategy 4



source: https://sites.google.com/view/icml-2022-big-model



### Pipeline MP



### Tensor-parallel v1



## DP with offloading or PS



## Tensor-parallel v2



source: https://sites.google.com/view/icml-2022-big-model



# Tensor-parallel v2





source: https://sites.google.com/view/icml-2022-big-model



Device 1



#### Inter-op parallelism



#### Trade-off

|                  | Inter-operator<br>Parallelism | Intra-operator<br>Parallelism |
|------------------|-------------------------------|-------------------------------|
| Communication    | Less                          | More                          |
| Device Idle Time | More                          | Less                          |

#### Intra-op parallelism



source: https://sites.google.com/view/icml-2022-big-model



# **RL-based partitioning**

https://people.csail.mit.edu/hongzi/content/publications/placeto-neurips19.pdf

State: Device assignment plan for a computational graph.

Action: Modify the device assignment of a node.

Reward: Latency difference between the new and old placements.

Trained with **policy gradient** algorithm.



# Optimization-based partitioning

https://arxiv.org/abs/2006.16423

min

## Integer Linear Programming:

Variable: Decision variable vector for each operator, representing device assignment.

Minimize: Maximum finishing time of all operators.

Constraint: Execution dependency & memory capacity of each device.

TotalLatency  $\sum_{i=0}^k x_{vi} = 1$ s.t. subgraph  $\{v \in V : x_{vi} = 1\}$  is contiguous  $M \geq \sum_{v} m_v \cdot x_{vi}$  $CommIn_{ui} \ge x_{vi} - x_{ui}$  $CommOut_{ui} \geq x_{ui} - x_{vi}$  $TotalLatency \geq Latency$  $SubgraphStart_i \geq Latency_v \cdot CommIn_{vi}$  $\text{SubgraphFinish}_i = \text{SubgraphStart}_i + \sum_{v} \text{CommIn}_{vi} \cdot c_v$  $+\sum_{v} x_{vi} \cdot p_v^{\mathrm{acc}} + \sum_{v} \mathrm{CommOut}_{vi} \cdot c_v$ Latency<sub>v</sub>  $\geq x_{v0} \cdot p_v^{\text{cpu}}$ Latency,  $\geq x_{v0} \cdot p_v^{\text{cpu}} + \text{Latency}_u$  $Latency_v \geq x_{vi} \cdot SubgraphFinish_i$  $x_{vi} \in \{0, 1\}$ 

https://arxiv.org/abs/2201.12023

## Whole Search Space



## **Alpa Hierarchical Space**



https://arxiv.org/abs/2201.12023



https://arxiv.org/abs/2201.12023



More details of each pass:

https://sites.google.com/view/icml-2022-big-model

https://arxiv.org/abs/2201.12023

Not the first algorithm for auto-parallelism... but the first one that is usable\* (\* - most of the time)

(benchmarks on AWS V100)

#### GPT (up to 39B)



Match specialized manual systems.

#### GShard MoE (up to 70B)



Outperform the manual baseline by up to 8x.

#### Wide-ResNet (up to 13B)



Generalize to models without manual plans.

https://arxiv.org/abs/2201.12023

Not the first algorithm for auto-parallelism... but the first one that is usable\* (\*-most of the time)

```
# Define the training step. The body of this function is the same as the
# ``train step`` above. The only difference is to decorate it with
# ``alpa.paralellize``.
@alpa.parallelize auto best strategy
def alpa_train_step(state, batch):
    def loss_func(params):
        out = state.apply_fn(params, batch["x"])
       loss = jnp.mean((out - batch["y"])**2)
        return loss works in jax
    grads = jax.grad(loss_func)(state.params)
    new_state = state.apply_gradients(grads=grads)
    return new state
# Test correctness
actual_state = alpa_train_step(state, batch)
assert allclose(expected state.params, actual state.params, atol=5e-3)
```

# </part 2>

- + model larger than GPU
- + faster for small
- \* typical size: 2-8 gpus
- model partitioning is tricky tensor parallelism is easier, but requires ultra low latency
- latency is critical, go buy nvlink except for PipeDream
- often combined with gradient checkpointing

#### **Tutorials:**

- Simple pipelining in PyTorch tinyurl.com/pytorch-pipelining
- Distributed model-parallel with torch RPC https://tinyurl.com/torch-rpc
- Minimalistic tensor parallelism pip install tensor\_parallel

# </part 2>

- + model larger than GPU
- + faster for small
- \* typical size: 2-8 gpus
- model partitioning is tricky tensor parallelism is easier, but requires ultra low latency
- latency is critical, go buy nvlink except for PipeDream
- often combined with gradient checkpointing

#### **Tutorials:**

- Simple pipelining in PyTorch tinyurl.com/pytorch-pipelining
- Distributed model-parallel with torch RPC https://tinyurl.com/torch-rpc
- Automatic tensor parallelism pip install tensor parallel

Q: what if you have 1024 GPUs, but the model fits on 8?

# </part 2>

- + model larger than GPU
- + faster for small
- \* typical size: 2-8 gpus
- model partitioning is tricky tensor parallelism is easier, but requires ultra low latency
- latency is critical, go buy nvlink except for PipeDream
- often combined with gradient checkpointing

#### **Tutorials:**

- Simple pipelining in PyTorch tinyurl.com/pytorch-pipelining
- Distributed model-parallel with torch RPC https://tinyurl.com/torch-rpc
- Automatic tensor parallelism pip install tensor parallel

#### Large-scale training: combine model- and data-parallel

So far we've been trying to partition for existing models...

Perhaps there are models that are easier to partition?

# **Expert Parallelism**

Sparsely gated MoE: https://arxiv.org/pdf/1701.06538.pdf



# MoE Variant: Switch Transformer

Switch: https://arxiv.org/pdf/2101.03961.pdf

#### Terminology

- Experts: Split across devices, each having their own unique parameters. Perform standard feedforward computation.
- Expert Capacity: Batch size of each expert. Calculated as
- (tokens\_per\_batch / num\_experts) \* capacity\_factor
- Capacity Factor: Used when calculating expert capacity. Expert capacity allows more buffer to help mitigate token overflow during routing.





# MoE Variant: Switch Transformer

Switch: https://arxiv.org/pdf/2101.03961.pdf

#### MLM pre-training objective [BERT-like]





### MoE Variant: Switch Transformer

Switch: https://arxiv.org/pdf/2101.03961.pdf

#### **Pre-training vs downstream quality**





# Alternative: FSDP

Source: microsoft



# DeepSpeed Inference

Paper: https://arxiv.org/abs/2207.00032

- Same techniques, but for inference
- Offloading, tensor- & pipeline-parallel
- ... and a ton of hacks



# </ZeRO>

#### **Multi-GPU strategies:**

- \* Pipeline model-parallel allocate layers on different GPUs
- \* Sharded data-parallel split optimizer state and/or parameters

#### **Single GPU strategies:**

- \* Small model gradient checkpointing & virtual batch
- \* Large model optimizer state sharding (keep parameters on GPU)

#### **Implementations:**

- DeepSpeed— sharded DP, offload, tensor parallelism, active development
  - Offload https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html
- FSDP most of DeepSpeed features with native PyTorch API
- Model-specific implementations— https://github.com/NVIDIA/Megatron-LM

# If we have time... *(if not, skip)*

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect

16GB model and optimizer, 128GB activations (batch 32) → ???

# </le>

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect

16GB model and optimizer, 128GB activations (batch 32) → grad accumulation

16GB model and optimizer, 16GB activations (batch 1) - ???

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect

16GB model and optimizer, 128GB activations (batch 32) → grad accumulation

16GB model and optimizer, 16GB activations (batch 1) → grad checkpointing

32GB model and optimizer, 1GB activations  $\rightarrow$  ???

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect 16GB model and optimizer, 128GB activations (batch 32) → **grad accumulation** 16GB model and optimizer, 16GB activations (batch 1) → **grad checkpointing** 32GB model and optimizer, 1GB activations → **it depends...** 

DDP + offloading | FSDP (ZeRO) | Pipeline-parallel | Tensor-parallel

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect 16GB model and optimizer, 128GB activations (batch 32) → **grad accumulation** 16GB model and optimizer, 16GB activations (batch 1) → **grad checkpointing** 32GB model and optimizer, 1GB activations → **it depends...** 

DDP + offloading | FSDP (ZeRO) | Pipeline-parallel | Tensor-parallel | e.g. if too few GPUs | for other methods

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect 16GB model and optimizer, 128GB activations (batch 32) → **grad accumulation** 16GB model and optimizer, 16GB activations (batch 1) → **grad checkpointing** 32GB model and optimizer, 1GB activations → **it depends...** 

DDP + offloading | FSDP (ZeRO) | Pipeline-parallel | Tensor-parallel
 e.g. if too few GPUs no custom model code, best for large batches

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect 16GB model and optimizer, 128GB activations (batch 32) → **grad accumulation** 16GB model and optimizer, 16GB activations (batch 1) → **grad checkpointing** 

DDP + offloading | FSDP (ZeRO) | Pipeline-parallel | Tensor-parallel

32GB model and optimizer, 1GB activations  $\rightarrow$  it depends...

e.g. if too few GPUs for other methods

no custom model code, communication-efficient best for large batches sequential model

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect

16GB model and optimizer, 128GB activations (batch 32) → grad accumulation

16GB model and optimizer, 16GB activations (batch 1) → grad checkpointing

32GB model and optimizer, 1GB activations → it depends...

#### DDP + offloading | FSDP (ZeRO) | Pipeline-parallel | Tensor-parallel

e.g. if too few GPUs no custom model code, communication-efficient minimal latency for other methods best for large batches sequential model non-symmetric model

Mix and match: TP within one server, minimal PP between servers, DDP between groups Parallel code: manual (e.g. Megatron-LM) vs automated (alpa, FSDP, tensor\_parallel) Unconventional hardware: hivemind, petals, varuna, etc

#### **Example configuration:**

Several GPU w/ 24GB memory | 128GB system memory | 16GBps interconnect 16GB model and optimizer, 128GB activations (batch 32) → **grad accumulation** 16GB model and optimizer, 16GB activations (batch 1) → **grad checkpointing** 

**32GB model and optimizer, 1GB activations** → it depends...

If the model does not fit, you can also quantize it into submission! (more on model compression in a future lecture)

# That's all Folks.