### Demystifying Pipeline Parallelism: Building It From Scratch!

In the world of neural networks, size matters. As scaling laws suggest, larger models tend to perform better. But when you have a giant model that won't fit in the memory of a single device, things get complicated. This is where pipeline parallelism comes into play, acting as an assembly line for large neural network models. In this blog post, we will walk through the concept and build a toy GPipe-style pipeline parallelism implementation from scratch.

### Naive Pipeline Parallelism vs. GPipe

Pipeline parallelism is a process that can be distilled down to a few core steps:

- Step 1: **Partition the Model**: Our big model is divided into smaller partitions. Each partition corresponds to a section of the neural network and runs on a separate device.
- Step 2: **Micro-Batching**: We split our training data mini-batch into several smaller micro-batches.
- Step 3: **Forward and Backward Passes**: These partitions and micro-batches go through both the forward and backward computation passes.
- Step 4: **Gradient Averaging**: Once the whole pipeline finishes, we collect the gradients and average them to update the model.
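As a minimal sketch of Step 2, here is one way to split a mini-batch into micro-batches. It uses a plain Python list as a stand-in for a tensor, and the helper name `split_into_microbatches` is made up for this illustration; a real implementation would split tensors along the batch dimension instead.

```python
import math

def split_into_microbatches(mini_batch, n_microbatches):
    """Split a mini-batch (here: a list of samples) into at most n_microbatches chunks."""
    if not mini_batch:
        return []
    # Round up so that every sample lands in some chunk.
    chunk_size = math.ceil(len(mini_batch) / n_microbatches)
    return [mini_batch[i:i + chunk_size] for i in range(0, len(mini_batch), chunk_size)]

mini_batch = list(range(10))  # ten toy "samples"
micro_batches = split_into_microbatches(mini_batch, 5)
print(micro_batches)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

When the mini-batch size is not divisible by the number of micro-batches, the last chunk is simply smaller, and you may end up with fewer chunks than requested.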

To illustrate, let's imagine we have a big model with 10 layers, like a Transformer model, and we've got five devices to run our model. We want to split this model into five parts, or 'partitions', and each part will run on one device.
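To make the split concrete, here is a small sketch that divides 10 layers into 5 contiguous partitions. The layers are just strings and `partition_layers` is a hypothetical helper for this example; it assumes an even split by layer count, whereas in practice you might balance partitions by compute or memory cost.

```python
def partition_layers(layers, n_partitions):
    """Split layers into n_partitions contiguous groups of near-equal size."""
    base, extra = divmod(len(layers), n_partitions)
    partitions, start = [], 0
    for i in range(n_partitions):
        # The first `extra` partitions take one additional layer each.
        size = base + (1 if i < extra else 0)
        partitions.append(layers[start:start + size])
        start += size
    return partitions

layers = [f"layer_{i}" for i in range(10)]
partitions = partition_layers(layers, 5)
for device_idx, part in enumerate(partitions):
    print(f"device {device_idx}: {part}")
# device 0: ['layer_0', 'layer_1']
# device 1: ['layer_2', 'layer_3']
# device 2: ['layer_4', 'layer_5']
# device 3: ['layer_6', 'layer_7']
# device 4: ['layer_8', 'layer_9']
```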

There's a catch, though. In a Transformer model, each layer needs the result from the previous layer before it can do its work. It's like a relay race: you can't start running until you've got the baton from the runner before you. So if we split our model into 5 parts, the second part can't start until the first part is done, the third part can't start until the second part is done, and so on. That means that most of the time, most of our devices are just sitting around doing nothing. That's a bummer!

So what can we do? Here's where GPipe comes in. Instead of feeding a big batch of data to our model all at once, GPipe splits that batch into smaller chunks, which we're gonna call 'micro-batches'. And here's the trick: while one micro-batch is being processed by the second part of our model, the next micro-batch can start being processed by the first part of the model.

This way, there's always something for each part of the model to do. It's like a factory assembly line. As soon as one car is done with one station, it moves to the next station and a new car moves into the first station. This keeps all our devices busy (although they might still have some idle time, like when a worker in the factory is waiting for the next car to arrive).

GPipe's scheduler orchestrates this process. It works in 'clock cycles', figuring out which partitions should be active and which micro-batch each partition should work on in each clock cycle.

### Cracking the Schedule Algorithm

A "clock cycle" is like a unit of time for our pipeline. Each clock cycle activates a new partition and passes it a micro-batch.

```python
n_partitions = 3
n_microbatches = 5

n_clock_cycles = n_partitions + n_microbatches - 1
n_clock_cycles  # 7
```

If we have m micro-batches and n partitions, it'll take `n_partitions + n_microbatches - 1` clock cycles to get everything through the pipeline. Why is that? It takes m clock cycles for all micro-batches to enter the first partition, one per cycle. Once the last micro-batch has entered the first partition, it still needs to go through the remaining n-1 partitions, which takes n-1 additional clock cycles. That gives m + (n-1) = m + n - 1 clock cycles in total.
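The same count also tells us how busy the devices are. The sketch below assumes every (micro-batch, partition) task takes exactly one clock cycle, which is a simplification, and `pipeline_utilization` is a name invented for this example. With a single micro-batch (the naive case), each device works in only one of the n clock cycles; with more micro-batches the busy fraction grows.

```python
def pipeline_utilization(n_partitions, n_microbatches):
    """Fraction of (device, clock cycle) slots that do useful work.

    Assumes each (micro-batch, partition) task takes exactly one clock cycle.
    """
    n_clock_cycles = n_partitions + n_microbatches - 1
    total_slots = n_partitions * n_clock_cycles
    busy_slots = n_partitions * n_microbatches  # every partition sees every micro-batch once
    return busy_slots / total_slots

print(round(pipeline_utilization(3, 1), 3))  # 0.333 (naive: one big batch)
print(round(pipeline_utilization(3, 5), 3))  # 0.714 (5 micro-batches)
```

Under this assumption the busy fraction simplifies to m / (m + n - 1), so adding micro-batches shrinks the idle time but never removes it completely.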

In pipeline parallelism, each clock cycle activates one more partition until all of them are running. If we are currently at `clock_idx` (counting from 0), it means that `clock_idx` partitions have already been activated.

The current cycle brings that count to `clock_idx + 1`. However, we cannot exceed the total number of partitions (`n_partitions`), so we use the `min` function to limit the range: `min(clock_idx + 1, n_partitions)`.

So, what happens in each clock cycle? Good question! In each clock cycle, we determine which partitions are active and what they should be working on. Our scheduler assigns tasks in the form of `(microbatch_index, partition_index)` for each clock cycle. This basically tells each device what chunk of the neural network it should process and with which micro-batch.

For example, `(4, 1)` means the 5th micro-batch is going through the second partition (both indices start at 0).
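Putting the pieces together, here is a minimal sketch of such a scheduler. It follows the description above, but the function name `clock_cycles` and the lower bound computed with `max` are choices made for this sketch: the text only mentions the `min` cap, and the `max` bound is added here so that partitions that have already seen the last micro-batch drop out of the schedule.

```python
def clock_cycles(n_microbatches, n_partitions):
    """Yield, per clock cycle, a list of (microbatch_index, partition_index) tasks."""
    n_clock_cycles = n_partitions + n_microbatches - 1
    for clock_idx in range(n_clock_cycles):
        # Partitions below `first` have already processed the last micro-batch.
        first = max(clock_idx + 1 - n_microbatches, 0)
        # At most clock_idx + 1 partitions have been activated, capped at n_partitions.
        last = min(clock_idx + 1, n_partitions)
        # Partition p works on the micro-batch that entered the pipeline p cycles ago.
        yield [(clock_idx - p, p) for p in range(first, last)]

for clock_idx, tasks in enumerate(clock_cycles(n_microbatches=5, n_partitions=3)):
    print(clock_idx, tasks)
# 0 [(0, 0)]
# 1 [(1, 0), (0, 1)]
# 2 [(2, 0), (1, 1), (0, 2)]
# 3 [(3, 0), (2, 1), (1, 2)]
# 4 [(4, 0), (3, 1), (2, 2)]
# 5 [(4, 1), (3, 2)]
# 6 [(4, 2)]
```

The schedule has 7 clock cycles, matching `n_partitions + n_microbatches - 1`, and the task `(4, 1)` shows up in clock cycle 5.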

### Behind the Scenes: Worker Threads