Pipeline parallelism is a parallelization strategy that distributes the execution of a neural network's forward and backward passes across multiple devices. It does this by dividing the network into smaller subnetworks, or partitions, and assigning each partition to a different device. This spreads the workload across devices and can improve training efficiency, provided the partitions are reasonably balanced.

In pipeline parallelism, the forward and backward passes are decomposed into smaller tasks based on the micro-batches and partitions of the network. Let's break down how the forward and backward passes are decomposed.

Forward Pass Decomposition:

1. Divide the input batch $x$ into $m$ micro-batches $x_1, \cdots, x_m$.
2. For each micro-batch $x_i$, execute the partitions $f^1, \cdots, f^n$ in sequence. Task $F_{i, j}$ computes $x_i^j = f^j(x_i^{j-1})$, starting from $x_i^0 = x_i$, for $i = 1, \cdots, m$ and $j = 1, \cdots, n$.
3. Assemble the output $f(x)$ by concatenating the outputs of the last partition, $x_i^n = f(x_i)$, over all micro-batches.
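
The forward decomposition can be sketched in plain Python. This is a minimal illustration, not a real pipeline: the three partitions are hypothetical elementwise functions on lists, `split_batch` and `pipelined_forward` are names introduced here, and the tasks run one after another on a single process rather than on separate devices. It only shows that running every task $F_{i, j}$ and concatenating the results reproduces $f(x)$.

```python
def split_batch(batch, m):
    """Split a list of samples into m contiguous micro-batches of near-equal size."""
    size, extra = divmod(len(batch), m)
    chunks, start = [], 0
    for i in range(m):
        end = start + size + (1 if i < extra else 0)
        chunks.append(batch[start:end])
        start = end
    return chunks


def pipelined_forward(partitions, batch, m):
    """Run every task F[i, j] (partition j applied to micro-batch i) sequentially."""
    outputs = []
    for x_i in split_batch(batch, m):
        for f_j in partitions:      # x_i^j = f^j(x_i^{j-1})
            x_i = f_j(x_i)
        outputs.extend(x_i)         # concatenate x_i^n over all micro-batches
    return outputs


# Hypothetical partitions (assumption: each acts elementwise on a list of numbers).
partitions = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 3 for x in xs],
]
batch = [0, 1, 2, 3, 4, 5, 6, 7]
print(pipelined_forward(partitions, batch, m=4))
```

Because each partition here treats samples independently, the result matches applying the three functions to the whole batch at once. Layers that mix information across the batch dimension would not have this property.
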

Backward Pass Decomposition:

1. Compute the gradient of the loss with respect to each micro-batch output, $dx_i^n$.
2. For each micro-batch, propagate the gradient backward through the partitions, from $f^n$ down to $f^1$. Task $B_{i, j}$ computes the input gradient $dx_i^{j-1} = \partial_x f^j(dx_i^j)$ and the parameter gradient $g_i^j = \partial_{\theta^j} f^j(dx_i^j)$ for $i = 1, \cdots, m$ and $j = n, \cdots, 1$.
3. Sum the per-micro-batch gradients to obtain the gradient of the loss with respect to each partition's parameters, $g^j = \sum_{i=1}^m g_i^j$.
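
The accumulation step $g^j = \sum_{i=1}^m g_i^j$ can be checked on a toy example. The sketch below assumes two hypothetical scalar partitions, $f^1(x) = w_1 x$ and $f^2(h) = w_2 h$, and a summed squared-error loss; the function names are illustrative and the gradients are written out by hand rather than computed by an autograd library.

```python
def backward(w1, w2, xs, ys):
    """Return (g^1, g^2) for the summed squared error on one (micro-)batch."""
    hs = [w1 * x for x in xs]                          # forward through f^1
    outs = [w2 * h for h in hs]                        # forward through f^2
    d_outs = [2 * (o - y) for o, y in zip(outs, ys)]   # dx_i^n: gradient at the output
    g2 = sum(d * h for d, h in zip(d_outs, hs))        # B[i, 2]: parameter gradient for w2
    d_hs = [d * w2 for d in d_outs]                    # B[i, 2]: gradient passed back to f^1
    g1 = sum(d * x for d, x in zip(d_hs, xs))          # B[i, 1]: parameter gradient for w1
    return g1, g2


def accumulated_gradients(w1, w2, micro_batches):
    """Sum the per-micro-batch gradients g_i^j over all micro-batches i."""
    g1 = g2 = 0
    for xs, ys in micro_batches:
        g1_i, g2_i = backward(w1, w2, xs, ys)
        g1 += g1_i
        g2 += g2_i
    return g1, g2


xs, ys = [1, 2, 3, 4], [2, 3, 5, 9]
w1, w2 = 2, 3
micro_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]
print(accumulated_gradients(w1, w2, micro_batches))  # accumulated over two micro-batches
print(backward(w1, w2, xs, ys))                      # full batch in one pass
```

Both calls give the same pair of gradients because the loss is a sum over samples. With a mean-reduced loss, each micro-batch gradient would need to be weighted by its share of the batch before summing.
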

Pipeline parallelism takes advantage of the structure of these tasks. Each partition $f^j$ is placed on its own device, so tasks that differ in both micro-batch and partition index, such as $F_{2, 1}$ and $F_{1, 2}$, can run on different devices at the same time. Note that there are data dependencies between tasks: $F_{i, j}$ needs the output of $F_{i, j-1}$, and $B_{i, j}$ needs the gradient produced by $B_{i, j+1}$. Tasks must therefore be executed in an order that ensures the required data is available when needed.
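
One way to make the ordering constraint concrete is to write the data dependencies down and check a proposed execution order against them. The sketch below is an illustration under simple assumptions: tasks are tuples `(kind, i, j)` with 1-based indices, and `dependencies` and `is_valid_order` are hypothetical helper names. Only direct data dependencies are modeled; the need for $B_{i, j}$ to follow $F_{i, j}$ is implied transitively through the chain ending at the last partition.

```python
def dependencies(task, n):
    """Direct data dependencies of a task (kind, i, j) in a network with n partitions."""
    kind, i, j = task
    if kind == "F":
        # F[i, j] consumes the output of F[i, j-1]; the first partition reads the input.
        return [("F", i, j - 1)] if j > 1 else []
    # B[i, j] consumes the gradient from B[i, j+1]; the last partition
    # needs the forward output F[i, n] to compute the loss gradient.
    return [("B", i, j + 1)] if j < n else [("F", i, n)]


def is_valid_order(order, n):
    """Return True if every task appears after all of its direct dependencies."""
    done = set()
    for task in order:
        if any(dep not in done for dep in dependencies(task, n)):
            return False
        done.add(task)
    return True


good = [
    ("F", 1, 1), ("F", 2, 1), ("F", 1, 2), ("F", 2, 2),
    ("B", 1, 2), ("B", 2, 2), ("B", 1, 1), ("B", 2, 1),
]
bad = [("F", 1, 2), ("F", 1, 1)]  # runs partition 2 before its input exists

print(is_valid_order(good, n=2))
print(is_valid_order(bad, n=2))
```
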

In summary, pipeline parallelism decomposes the forward and backward passes into smaller tasks based on micro-batches and partitions, and places each partition on its own device so that tasks on different devices can run in parallel. This enables efficient training of large neural networks across multiple devices.

##### Example 1

In [None]:
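def clock_cycle(m, n):
    # For each clock cycle k, yield the 0-based (micro-batch, partition) tasks that can run in parallel.
    for k in range(m+n-1):
        yield [(k-j, j) for j in range(max(1+k-m, 0), min(1+k, n))]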

In [None]:
m = 4
n = 3

In [None]:
# Print each clock cycle's tasks as 1-based (micro-batch, partition) pairs.
for k in range(m + n - 1):
    print([(k - j + 1, j + 1) for j in range(max(1 + k - m, 0), min(1 + k, n))])

[(1, 1)]
[(2, 1), (1, 2)]
[(3, 1), (2, 2), (1, 3)]
[(4, 1), (3, 2), (2, 3)]
[(4, 2), (3, 3)]
[(4, 3)]
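
The schedule above also shows how much of the time devices sit idle: in the first and last cycles only one of the three partitions has work. The sketch below counts those empty slots. It assumes one device per partition and that every task takes exactly one clock cycle, which real workloads only approximate; `forward_schedule` and `idle_fraction` are names introduced here for illustration.

```python
from fractions import Fraction


def forward_schedule(m, n):
    """For each clock cycle, list the 1-based (micro-batch, partition) tasks that run."""
    return [
        [(k - j + 1, j + 1) for j in range(max(1 + k - m, 0), min(1 + k, n))]
        for k in range(m + n - 1)
    ]


def idle_fraction(m, n):
    """Fraction of (device, cycle) slots with no task, assuming one device per partition."""
    cycles = forward_schedule(m, n)
    busy = sum(len(tasks) for tasks in cycles)
    total = n * len(cycles)
    return Fraction(total - busy, total)


print(idle_fraction(4, 3))   # the m = 4, n = 3 schedule: prints 1/3
print(idle_fraction(32, 3))  # more micro-batches, same partitions: prints 1/17
```

Under these assumptions the idle share works out to $(n - 1) / (m + n - 1)$, so increasing the number of micro-batches $m$ for a fixed number of partitions $n$ shrinks it.
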


##### Draft 2

In [None]:
import torch