##### Example 1

![image.png](attachment:3d67b7f4-e3db-4960-869b-6a8939e60466.png)

Data parallel groups: `[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]`

Question 1: Why does GPU 0 need to communicate with GPU 2 even though they belong to two different model replicas?

##### Example 2

In [None]:
num_gpus = 4

In [None]:
world_size = 16

In [None]:
world_size, num_gpus

(16, 4)

Reimplement how Megatron-LM assigns the available GPUs to processes. Explain your code (name 2 benefits).

In [None]:
process_to_gpu = []

In [None]:
for rank in range(world_size):
    process_to_gpu.append(rank % num_gpus)

**Explain**

- `rank` is the unique process ID, ranging from `0` to `world_size - 1`.
- `num_gpus` is the total number of GPUs available.
- The `%` modulo operator gives the remainder after division.

The modulo maps ranks onto the available GPUs in round-robin order, so processes are spread evenly across the devices. This has two benefits: it still works when the number of processes exceeds the number of GPUs, and the mapping needs no changes when either number changes.
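As a quick check of the round-robin behaviour described above, the sketch below wraps the same modulo mapping in a helper and counts how many processes land on each GPU. The name `assign_gpus` is introduced here for illustration only; it is not a Megatron-LM function.

```python
from collections import Counter


def assign_gpus(world_size, num_gpus):
    """Map each process rank to a local GPU index in round-robin order.

    Hypothetical helper for illustration, not part of Megatron-LM.
    """
    return [rank % num_gpus for rank in range(world_size)]


# More processes than GPUs: each GPU receives the same number of processes.
mapping = assign_gpus(world_size=16, num_gpus=4)
load = Counter(mapping)
print(load)

# Changing either number requires no other code change.
print(assign_gpus(world_size=6, num_gpus=4))
```

When `world_size` is not a multiple of `num_gpus`, the lower-numbered GPUs simply receive one extra process each.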

In [None]:
process_to_gpu

[0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

In [None]:
for rank, gpu in enumerate(process_to_gpu):
    print(f"rank: {rank} > gpu: {gpu}")

rank: 0 > gpu: 0
rank: 1 > gpu: 1
rank: 2 > gpu: 2
rank: 3 > gpu: 3
rank: 4 > gpu: 0
rank: 5 > gpu: 1
rank: 6 > gpu: 2
rank: 7 > gpu: 3
rank: 8 > gpu: 0
rank: 9 > gpu: 1
rank: 10 > gpu: 2
rank: 11 > gpu: 3
rank: 12 > gpu: 0
rank: 13 > gpu: 1
rank: 14 > gpu: 2
rank: 15 > gpu: 3

##### Example 3

In [None]:
world_size = 16
tensor_model_parallel_size = 2
pipeline_model_parallel_size = 4

In [None]:
world_size

16

In [None]:
tensor_model_parallel_size, pipeline_model_parallel_size

(2, 4)

In [None]:
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size

In [None]:
num_pipeline_model_parallel_groups

4

Allocate these GPUs into data parallel groups the way Megatron-LM does. Explain your code.

In [None]:
data_parallel_groups = []

In [None]:
for i in range(pipeline_model_parallel_size):
    start_rank = i*num_pipeline_model_parallel_groups
    end_rank = (i+1)*num_pipeline_model_parallel_groups
    print(f"stage={i}, start_rank={start_rank}, end_rank={end_rank}") # ignore
    
    for j in range(tensor_model_parallel_size):
        ranks = list(range(start_rank+j, end_rank, tensor_model_parallel_size))
        data_parallel_groups.append(ranks)
        print(f"partition {j}, ranks={ranks}") # ignore
    
    print("-------") # ignore

stage=0, start_rank=0, end_rank=4
partition 0, ranks=[0, 2]
partition 1, ranks=[1, 3]
-------
stage=1, start_rank=4, end_rank=8
partition 0, ranks=[4, 6]
partition 1, ranks=[5, 7]
-------
stage=2, start_rank=8, end_rank=12
partition 0, ranks=[8, 10]
partition 1, ranks=[9, 11]
-------
stage=3, start_rank=12, end_rank=16
partition 0, ranks=[12, 14]
partition 1, ranks=[13, 15]
-------


**Explain**

A data parallel group contains the ranks that hold the same model partition in different model replicas; its purpose is to average the gradients of that partition across replicas. Megatron divides a model into pipeline stages (groups of layers), and the layers in each stage are further divided into partitions (tensor parallelism). So one model replica has `pipeline_model_parallel_size * tensor_model_parallel_size` partitions in total.

- `for i in range(pipeline_model_parallel_size)`: This iterates through the different stages across the pipeline.

- `for j in range(tensor_model_parallel_size)`: Within each stage, a layer is divided into `tensor_model_parallel_size` partitions. There will be `tensor_model_parallel_size` data parallel groups in each stage.

- `range(start_rank + j, end_rank, tensor_model_parallel_size)`: Because each layer has `tensor_model_parallel_size` partitions, the rank distance for a partition in one model to the same partition in a different model is `tensor_model_parallel_size` ranks. Therefore, we use a spacing of `tensor_model_parallel_size` to get the same partition in different models.
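The three points above can be checked with a small self-contained sketch. It repackages the same loop as a function (`build_data_parallel_groups` is a hypothetical name, not Megatron-LM's API) and asserts the properties the explanation relies on: every rank appears in exactly one group, and all ranks in a group share the same pipeline stage and the same tensor partition index.

```python
def build_data_parallel_groups(
    world_size, tensor_model_parallel_size, pipeline_model_parallel_size
):
    """Return the data parallel groups as lists of ranks (illustrative helper)."""
    ranks_per_stage = world_size // pipeline_model_parallel_size
    groups = []
    for stage in range(pipeline_model_parallel_size):
        start_rank = stage * ranks_per_stage
        end_rank = (stage + 1) * ranks_per_stage
        for partition in range(tensor_model_parallel_size):
            # Step by the tensor parallel size to pick the same partition
            # in every model replica within this stage.
            groups.append(
                list(range(start_rank + partition, end_rank, tensor_model_parallel_size))
            )
    return groups


groups = build_data_parallel_groups(16, 2, 4)

# Every rank belongs to exactly one data parallel group.
assert sorted(r for g in groups for r in g) == list(range(16))
# Ranks in a group sit in the same pipeline stage (4 ranks per stage here)...
assert all(len({r // 4 for r in g}) == 1 for g in groups)
# ...and hold the same tensor partition (same remainder modulo 2).
assert all(len({r % 2 for r in g}) == 1 for g in groups)
print(groups)
```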

In [None]:
len(data_parallel_groups)

8

In [None]:
data_parallel_groups

[[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]

##### Example 4

In [None]:
world_size = 16
tensor_model_parallel_size = 2
pipeline_model_parallel_size = 4

In [None]:
num_pipeline_model_parallel_groups = world_size // pipeline_model_parallel_size

In [None]:
for i in range(num_pipeline_model_parallel_groups):
    print(f"group {i}")
    ranks = list(range(i, world_size, num_pipeline_model_parallel_groups))
    print(f"ranks: {ranks}")
    print("---------")

group 0
ranks: [0, 4, 8, 12]
---------
group 1
ranks: [1, 5, 9, 13]
---------
group 2
ranks: [2, 6, 10, 14]
---------
group 3
ranks: [3, 7, 11, 15]
---------


##### Example 5

In [None]:
import torch

In [None]:
def _broadcast(inputs):
    return inputs.clone()

In [None]:
class Broadcast(torch.autograd.Function):
    @staticmethod
    def forward(ctx, inputs):
        return _broadcast(inputs)
    
    @staticmethod
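    def backward(ctx, grad_output):
        # NOTE: `_reduce` is not defined in this notebook; it is expected to
        # all-reduce the gradient across the tensor model parallel group.
        return _reduce(grad_output)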

##### Example 5

In [None]:
_tensor_model_parallel_group = None
_pipeline_model_parallel_group = None
_data_parallel_group = None

In [None]:
import os

In [None]:
os.environ["RANK"] = str(2)
os.environ["WORLD_SIZE"] = str(16)

In [None]:
TENSOR_MODEL_PARALLEL_GROUP = None

In [None]:
def _initialize_tensor_model_parallel_group(
    current_rank,
    tensor_model_parallel_size,
    num_tensor_model_parallel_groups
):
    global TENSOR_MODEL_PARALLEL_GROUP
    for i in range(num_tensor_model_parallel_groups):
        start_rank = i*tensor_model_parallel_size
        end_rank = (i+1)*tensor_model_parallel_size
        ranks = list(range(start_rank, end_rank))
        # new_group must be called by every process, even non-members
        group = torch.distributed.new_group(ranks)

        if current_rank in ranks:
            TENSOR_MODEL_PARALLEL_GROUP = group

In [None]:
import os
import torch

In [None]:
os.getenv("RANK"), os.getenv("WORLD_SIZE")

('2', '16')

Write a script that can be distributed to all processes to initialize a distributed environment in Megatron-LM.

The script should meet the following requirements:
- Map each process to a CUDA device.
- Each process maintains a local variable that stores its tensor model parallel group (`ProcessGroup`).

**Hints**: `os.environ[x]`: `MASTER_ADDR`, `MASTER_PORT`
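Before involving `torch.distributed`, the rank arithmetic behind the second requirement can be checked in plain Python. This is a minimal sketch, assuming tensor model parallel groups consist of consecutive ranks, as in the `_initialize_tensor_model_parallel_group` cell above; `tensor_model_parallel_ranks` is a hypothetical helper name.

```python
def tensor_model_parallel_ranks(rank, tensor_model_parallel_size):
    """Return the ranks that share a tensor model parallel group with `rank`.

    Assumes groups are blocks of consecutive ranks (illustrative helper).
    """
    start_rank = (rank // tensor_model_parallel_size) * tensor_model_parallel_size
    return list(range(start_rank, start_rank + tensor_model_parallel_size))


# Rank 2 with a tensor model parallel size of 2 shares its group with rank 3.
print(tensor_model_parallel_ranks(rank=2, tensor_model_parallel_size=2))
```

In a real run, every process would still have to call `torch.distributed.new_group` for each of these rank lists, and keep only the group that contains its own rank.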

In [None]:
class MPU:
    def __init__(
        self,
        tensor_model_parallel_size,
        master_addr,
        master_port,
        backend
    ):
        if not torch.distributed.is_initialized():
            self.initialize_distributed(
                master_addr,
                master_port,
                backend
            )
        
        current_rank = torch.distributed.get_rank()
        world_size = torch.distributed.get_world_size()
        # initialize tensor, pipeline and data parallel group
        
    def initialize_distributed(
        self,
        master_addr,
        master_port,
        backend
    ):
        if not torch.distributed.is_initialized():
            # Defaults belong inside getenv; int(x, 0) would set the base
            rank = int(os.getenv("RANK", "0"))
            world_size = int(os.getenv("WORLD_SIZE", "1"))
            os.environ["MASTER_ADDR"] = str(master_addr)
            os.environ["MASTER_PORT"] = str(master_port)
            self._set_device(rank)
            torch.distributed.init_process_group(
                backend=backend,
                world_size=world_size,
                rank=rank
            )
    
    def _set_device(self, rank):
        device_count = torch.cuda.device_count()
        if device_count > 0:
            device = rank % device_count
            torch.cuda.set_device(device)

In [None]:
mpu = MPU(
    tensor_model_parallel_size=2,
    master_addr="localhost",
    master_port="12355",
    backend="gloo"
)

##### Example 6