## 4. Divvying up GPUs

To run computations across multiple devices, each node launches multiple processes, and each process is tied to one GPU. This way, each process can send computational tasks directly to its designated GPU. To make this work, Megatron-LM organizes GPUs into three kinds of groups:

- Data Parallel Groups: Each GPU in a group holds the same part of the model but works on a different mini-batch. During backpropagation, each GPU calculates the gradients for its part of the model. These gradients are then averaged across the group to get the overall gradient for updating the model’s parameters.
- Tensor Parallel Groups: Each GPU in a group handles a different part of the same layer (or of multiple layers). Each GPU computes the output for its designated part, and these partial outputs are combined to get the complete output of the layer.
- Pipeline Parallel Groups: Each GPU in a group holds a different stage of the model, so the forward and backward passes flow through the group one stage at a time.
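
To make the first two ideas concrete, here is a toy sketch in plain Python lists, with no real GPUs or collectives involved. The function names are made up for illustration, and splitting the weight by output columns is just one way the partial outputs of a layer can be combined.

```python
# Toy illustration only: plain lists stand in for tensors on different GPUs.

def average_gradients(per_replica_grads):
    """Data parallelism: element-wise mean of the gradients from each replica."""
    num_replicas = len(per_replica_grads)
    return [sum(values) / num_replicas for values in zip(*per_replica_grads)]


def column_parallel_linear(x, weight_shards):
    """Tensor parallelism: each shard holds some output columns of the weight.

    x is a vector of length d_in; each shard is a d_in x d_out_shard matrix.
    Every "GPU" computes its slice of the output, then the slices are concatenated.
    """
    output = []
    for shard in weight_shards:
        num_cols = len(shard[0])
        partial = [
            sum(x[k] * shard[k][c] for k in range(len(x)))
            for c in range(num_cols)
        ]
        output.extend(partial)
    return output


# Two replicas, each with its own gradient for the same two parameters.
grads = average_gradients([[1.0, 2.0], [3.0, 6.0]])

# A 2x4 weight split column-wise across two "GPUs".
full_weight = [[1, 2, 3, 4], [5, 6, 7, 8]]
shards = [[row[:2] for row in full_weight], [row[2:] for row in full_weight]]
y = column_parallel_linear([1, 1], shards)
```

The concatenated result `y` matches what a single device would compute with the full weight matrix.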

### Data Parallel Groups



Megatron-LM uses three variables to set up these parallel groups:

- `tensor_model_parallel_size`: the number of GPUs across which a layer is split in tensor parallelism.
- `pipeline_model_parallel_size`: the number of stages in the pipeline.
- `data_parallel_size`: the number of model replicas in data parallelism.

Each process then keeps one variable per kind of parallel group to track which group it belongs to.

In [None]:
world_size = 30 # the total number of workers
tensor_model_parallel_size = 2
pipeline_model_parallel_size = 3


In pipeline parallelism, a model is split into `pipeline_model_parallel_size` stages.

Because Megatron-LM combines tensor parallelism with pipeline parallelism, each stage has `tensor_model_parallel_size` GPUs to parallelize the tensor operations in that stage. So the total number of GPUs required for one copy of the model is:


In [None]:
num_workers_for_each_model = tensor_model_parallel_size * pipeline_model_parallel_size
num_workers_for_each_model


Then, to calculate the number of model replicas in data parallelism, we divide the total number of GPUs (`world_size`) by the number of GPUs used for each model (`num_workers_for_each_model`):


In [None]:
data_parallel_size = world_size // num_workers_for_each_model
data_parallel_size
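
One thing to watch: floor division quietly drops leftover GPUs when `world_size` is not a multiple of the GPUs per model. A minimal sketch of a guarded version follows; the helper name `compute_data_parallel_size` is made up here and is not part of Megatron-LM.

```python
def compute_data_parallel_size(
    world_size, tensor_model_parallel_size, pipeline_model_parallel_size
):
    """Return the number of model replicas, refusing uneven splits."""
    gpus_per_model = tensor_model_parallel_size * pipeline_model_parallel_size
    if world_size % gpus_per_model != 0:
        raise ValueError(
            f"world_size ({world_size}) is not divisible by "
            f"tensor_model_parallel_size * pipeline_model_parallel_size "
            f"({gpus_per_model})"
        )
    return world_size // gpus_per_model


# Same numbers as above: 30 workers, 2-way tensor and 3-way pipeline parallelism.
replicas = compute_data_parallel_size(30, 2, 3)
```

With 32 workers instead of 30, the same call raises a `ValueError` rather than silently leaving two GPUs unused.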


Alright, so we will have five model replicas. Next, let’s set up all the data parallel groups. [[Megatron's Data Parallel Groups]](https://github.com/NVIDIA/Megatron-LM/blob/3316e811cc5335ee24c2d203416d864edcf2f7a8/megatron/core/parallel_state.py#L54)

In [None]:
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size # 10

data_parallel_groups = []

for i in range(pipeline_model_parallel_size):
    start_rank = i*num_pipeline_model_parallel_groups
    end_rank = (i+1)*num_pipeline_model_parallel_groups
    print(f"stage={i}, start_rank={start_rank}, end_rank={end_rank}")

    for j in range(tensor_model_parallel_size):
        ranks = list(range(start_rank+j, end_rank, tensor_model_parallel_size))
        data_parallel_groups.append(ranks)
        print(f"partition {j}, ranks={ranks}")

    print("-------")

data_parallel_groups


![Data Parallel Groups](/images/megatron/megatron-gpus-allocation.jpg)

Alright, stay calm. Let’s break it down:

- `for i in range(pipeline_model_parallel_size)`: We iterate through all the stages in the pipeline
- `for j in range(tensor_model_parallel_size)`: Within each stage, a layer is divided into `tensor_model_parallel_size` partitions. There will be `tensor_model_parallel_size` data parallel groups in each stage.
- `range(start_rank + j, end_rank, tensor_model_parallel_size)`: The `j`-th group in a stage starts at GPU `start_rank + j`. Since each layer is divided into `tensor_model_parallel_size` parts and each part is assigned to a different GPU, the GPUs holding the same part of the model in different replicas are `tensor_model_parallel_size` ranks apart. So, by using a step size of `tensor_model_parallel_size`, we collect exactly the GPUs that hold the same part of the model.
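
A quick way to convince yourself the loop is right is to wrap it in a function and check that the groups cover every rank exactly once. This is only a sanity-check sketch; the function name `build_data_parallel_groups` is made up for illustration.

```python
def build_data_parallel_groups(
    world_size, tensor_model_parallel_size, pipeline_model_parallel_size
):
    """Same loop as above, returning the rank lists instead of printing them."""
    num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size
    groups = []
    for i in range(pipeline_model_parallel_size):
        start_rank = i * num_pipeline_model_parallel_groups
        end_rank = (i + 1) * num_pipeline_model_parallel_groups
        for j in range(tensor_model_parallel_size):
            groups.append(
                list(range(start_rank + j, end_rank, tensor_model_parallel_size))
            )
    return groups


groups = build_data_parallel_groups(
    world_size=30, tensor_model_parallel_size=2, pipeline_model_parallel_size=3
)

# Flatten and sort: if the groups partition the ranks, this is 0..29 exactly.
all_ranks = sorted(rank for group in groups for rank in group)
```

For the running example this gives six groups (three stages times two tensor partitions), each with five ranks, one per model replica.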

Since walking through the setup of the tensor parallel and pipeline parallel groups would make this too long, let’s assume that we have already set up all three kinds of groups. Now, the question is: how do we assign processes to GPUs?

### Allocate workers to processors



So, here’s the deal: each node starts up multiple processes. Each of these processes gets tied to a GPU because GPUs are way faster at deep learning tasks than CPUs. So, each process sends its work over to its assigned GPU. But how does a process get tied to a specific GPU?

Well, it’s done in a round-robin way across all available GPUs: the process with rank `rank` gets GPU `rank % num_gpus`. This keeps things flexible if you change the number of processes or GPUs, and it also works when there are more processes than GPUs.

In [None]:
num_gpus = 4
process_to_gpu = []

for rank in range(world_size):
    process_to_gpu.append(rank % num_gpus)

for rank, gpu in enumerate(process_to_gpu):
    print(f"rank: {rank} -> gpu: {gpu}")
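
To see what "more processes than GPUs" means in practice, it can help to flip the mapping around and list the ranks that end up on each GPU. A small sketch, with a made-up helper name:

```python
from collections import defaultdict


def ranks_per_gpu(world_size, num_gpus):
    """Invert the round-robin mapping: GPU index -> list of ranks using it."""
    mapping = defaultdict(list)
    for rank in range(world_size):
        mapping[rank % num_gpus].append(rank)
    return dict(mapping)


# 30 processes on a node with 4 GPUs: several ranks share each GPU.
sharing = ranks_per_gpu(world_size=30, num_gpus=4)
for gpu, ranks in sharing.items():
    print(f"gpu {gpu}: {ranks}")
```

Since 30 is not a multiple of 4, the load is slightly uneven: some GPUs serve one more process than others.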

### MPU (Model Parallel Unit)



![3D Parallelism](/images/megatron/3d-parallelism.png)


The MPU class is the one that handles all this GPU allocation. It puts each GPU into the right parallel groups: tensor parallel, pipeline parallel, and data parallel. [[Megatron's MPU]](https://github.com/NVIDIA/Megatron-LM/blob/9288b6b73dbc6bbc58c616a2f06b38381711e847/megatron/mpu/initialize.py#L62)

In a distributed training setting, all processes run the same code. So, this GPU allocation script gets executed by every process on every node in the cluster. PyTorch sets up the communication channels based on the `RANK` environment variable of each process. After setting up the parallel groups, the MPU class keeps track of which parallel group a process belongs to by storing it in an attribute on the instance.
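
As a rough sketch of where those values come from: launchers such as `torchrun` export `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` for every process they start. The helper below and its single-process defaults are assumptions for illustration, not something Megatron-LM defines.

```python
import os


def read_distributed_env(environ=os.environ):
    """Read the per-process settings a launcher exports.

    The fallback values describe a single-process run and are an assumption
    made for this sketch.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
    }


# A plain dict stands in for os.environ so the example runs anywhere.
example_env = {"RANK": "7", "WORLD_SIZE": "30", "LOCAL_RANK": "3"}
config = read_distributed_env(example_env)
```

Every process runs this same function but sees different values, which is how identical code ends up placing each process in a different set of groups.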

Now the parallel groups need to be set up. Since `data_parallel_size` follows from `world_size` and the number of GPUs per model, we only need two variables to initialize the MPU: `tensor_model_parallel_size` and `pipeline_model_parallel_size`. Now let’s put it all together:

In [None]:
import os
import torch

class MPU:
    def __init__(
        self,
        rank,
        world_size,
        tensor_model_parallel_size,
        pipeline_model_parallel_size,
        master_addr,
        master_port,
        backend
    ):
        if not torch.distributed.is_initialized():
            os.environ["MASTER_ADDR"] = str(master_addr)
            os.environ["MASTER_PORT"] = str(master_port)

            self.set_device(rank)
            torch.distributed.init_process_group(
                rank=rank,
                world_size=world_size,
                backend=backend,
            )

        current_rank = torch.distributed.get_rank()
        world_size = torch.distributed.get_world_size()
        self.debug = True

        self.num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size
        self._data_parallel_group = None

        # init data parallel group
        self.init_data_parallel_group(
            rank=current_rank,
            tensor_model_parallel_size=tensor_model_parallel_size,
            pipeline_model_parallel_size=pipeline_model_parallel_size
        )
        # init tensor parallel and pipeline parallel groups

In [None]:
def set_device(self, rank):
    num_gpus = torch.cuda.device_count()
    if num_gpus > 0:
        device = rank % num_gpus
        torch.cuda.set_device(device)


def init_data_parallel_group(
    self,
    rank,
    tensor_model_parallel_size,
    pipeline_model_parallel_size
):
    for i in range(pipeline_model_parallel_size):
        start_rank = i * self.num_pipeline_model_parallel_groups
        end_rank = (i + 1) * self.num_pipeline_model_parallel_groups

        for j in range(tensor_model_parallel_size):
            ranks = list(range(
                start_rank+j,
                end_rank,
                tensor_model_parallel_size
            ))

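            # new_group() is a collective call: every process must call it for
            # every group, in the same order, even if it is not a member.
            group = torch.distributed.new_group(ranks=ranks)
            if rank in ranks:
                self._data_parallel_group = group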


MPU.set_device = set_device
MPU.init_data_parallel_group = init_data_parallel_group
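
The constructor above leaves the tensor parallel and pipeline parallel groups as a comment. Here is a minimal sketch of what their rank lists could look like, assuming the layout implied by the data parallel loop: adjacent ranks form a tensor parallel group, and each pipeline stage occupies a contiguous block of ranks. The function names are made up for illustration.

```python
def build_tensor_parallel_groups(world_size, tensor_model_parallel_size):
    """Adjacent ranks share a layer: [0, 1], [2, 3], ... for a size of 2."""
    return [
        list(range(start, start + tensor_model_parallel_size))
        for start in range(0, world_size, tensor_model_parallel_size)
    ]


def build_pipeline_parallel_groups(world_size, pipeline_model_parallel_size):
    """One rank from each stage block, all at the same offset within the block."""
    num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size
    return [
        list(range(i, world_size, num_pipeline_model_parallel_groups))
        for i in range(num_pipeline_model_parallel_groups)
    ]


tensor_groups = build_tensor_parallel_groups(30, 2)
pipeline_groups = build_pipeline_parallel_groups(30, 3)
```

In a real run, each rank list would be passed to `torch.distributed.new_group` on every process, and each process would keep only the groups that contain its own rank, the same way `init_data_parallel_group` does.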