Source: https://nbviewer.org/github/tunib-ai/large-scale-lm-tutorials/blob/main/notebooks/03_distributed_programming.ipynb

In [None]:
import torch

##### Example 1

In [None]:
def func(n):
    print(f"n={n}")

Execute `func` in four processes using the spawn start method, where `n` is the index of each process

**Hint**: Use `mp.set_start_method()`

In [None]:
import torch.multiprocessing as mp

In [None]:
mp.set_start_method("spawn")

In [None]:
for n in range(4):
    # args must be a tuple, hence the trailing comma
    process = mp.Process(target=func, args=(n,))
    process.start()
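
The loop in this example starts the workers but never waits for them to finish. A minimal sketch of one way to do that is shown here. It assumes a hypothetical helper named `launch` (not part of the original notebook) and uses `mp.get_context("spawn")` rather than the global `mp.set_start_method`, which raises a `RuntimeError` if the start method has already been set:

```python
import torch.multiprocessing as mp


def func(n):
    # Same body as the exercise: report which worker is running.
    print(f"n={n}")


def launch(num_procs):
    """Hypothetical helper: start num_procs spawned workers and wait for them."""
    # A context object carries its own start method, so the global
    # setting of the interpreter is left untouched.
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=func, args=(n,)) for n in range(num_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # block until this worker has exited
    # An exit code of 0 means the worker finished without an error.
    return [p.exitcode for p in procs]
```

In a script, call `launch(4)` under an `if __name__ == "__main__":` guard, since the spawn method re-imports the main module in every child process.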

##### Example 2

In [None]:
import torch.multiprocessing as mp

In [None]:
def fn(rank, param1, param2):
    print(f"{param1}, {param2} - rank: {rank}")

In [None]:
# mp.spawn(
#     fn=fn,
#     args=("A0", "B1"),
#     nprocs=4,
#     join=True,
#     daemon=True,
#     start_method="spawn"
# )
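
The commented call can be read as follows: `mp.spawn` starts `nprocs` processes and calls `fn` in each one, passing the process index as the first argument and the entries of `args` after it. A small sketch under that reading is given here. The wrapper `launch` is hypothetical, and the return value of `fn` is added only so the formatting can be checked directly:

```python
import torch.multiprocessing as mp


def fn(rank, param1, param2):
    # `rank` is supplied by mp.spawn (0 .. nprocs - 1);
    # param1 and param2 come from the `args` tuple.
    message = f"{param1}, {param2} - rank: {rank}"
    print(message)
    return message


def launch(nprocs=4):
    """Hypothetical wrapper around the call from the cell above."""
    # join=True blocks until every worker has exited.
    mp.spawn(fn, args=("A0", "B1"), nprocs=nprocs, join=True)
```

As with any spawned workers, `launch()` should be called from under an `if __name__ == "__main__":` guard when run as a script.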

##### Example 2.1

In [None]:
def say_hello():
    print("hello world")

In [None]:
say_hello()

hello world


In [None]:
world_size = 3

Run the function `say_hello` using three new processes

In [None]:
from torch.multiprocessing import Process

In [None]:
for rank in range(world_size):
    p = Process(target=say_hello)
    p.start()

##### Example 2.2

In [None]:
def say_hello(rank):
    print(f"hello from rank={rank}")

In [None]:
say_hello(rank=69)

hello from rank=69


In [None]:
world_size = 3

Run the function `say_hello` using three new processes, as below

In [None]:
from torch.multiprocessing import Process

In [None]:
for rank in range(world_size):
    p = Process(target=say_hello, args=(rank,))
    p.start()

##### Example 2.3

In [None]:
config

('gloo', 'tcp://127.0.0.1:23456')

In [None]:
world_size = 4

Launch four new processes and establish distributed communication between them using the `config` parameter

In [None]:
import torch.distributed as dist
from torch.multiprocessing import Process

In [None]:
def init_communication(rank, world_size, config):
    dist.init_process_group(*config, rank=rank, world_size=world_size)
    print(f"hello from rank={rank}, world_size={world_size}")

In [None]:
for rank in range(world_size):
    p = Process(target=init_communication, args=(rank, world_size, config))
    p.start()
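
The cells in this example leave the process group open and never join the workers. A hedged sketch of a fuller lifecycle follows. It assumes `config` is a `(backend, init_method)` pair as shown earlier, and the helper `launch` is hypothetical:

```python
import torch.distributed as dist
import torch.multiprocessing as mp


def init_communication(rank, world_size, config):
    """Join the default process group, report in, then leave it."""
    backend, init_method = config
    dist.init_process_group(
        backend, init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        # These queries only work once the group has been initialized.
        seen = (dist.get_rank(), dist.get_world_size())
        print(f"hello from rank={seen[0]}, world_size={seen[1]}")
        return seen
    finally:
        # Tear the group down so its resources are released.
        dist.destroy_process_group()


def launch(world_size, config):
    """Hypothetical helper: one spawned process per rank, then join all."""
    ctx = mp.get_context("spawn")
    procs = [
        ctx.Process(target=init_communication, args=(rank, world_size, config))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return [p.exitcode for p in procs]
```

From a script, `launch(4, ("gloo", "tcp://127.0.0.1:23456"))` would be called under an `if __name__ == "__main__":` guard.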

##### Example 3

In [None]:
import torch.distributed as dist

In [None]:
def fn(global_rank, world_size):
    dist.init_process_group(
        backend="nccl",
        rank=global_rank,
        world_size=world_size
    )
    group = dist.new_group(list(range(world_size)))

In [None]:
fn(global_rank=69, world_size=world_size)

##### Example 

In [None]:
import torch.distributed as dist

In [None]:
ranks = [0, 1, 3, 6]

In [None]:
x = torch.tensor([69, 69, 69])

In [None]:
x

tensor([69, 69, 69])

Write a script that runs on every accelerator and sends tensor `x` from process `0` to processes `1`, `3`, and `6`.

**Explain**

The process with rank `0` wants to send the tensor to processes `1`, `3`, and `6`, so all four ranks need to be in the same group.

In [None]:
ranks = [0, 1, 3, 6]

Then retrieve the rank of the current process. The rank is used to check whether the current process is in the process group.

In [None]:
rank = dist.get_rank()

In [None]:
group = None

Create a new process group that includes the processes with the ranks specified in `ranks`. Every process in the default group must call `dist.new_group`, even if it will not be a member of the new group, so the call is not guarded by a rank check. This process group will be used for the broadcast operation.

In [None]:
group = dist.new_group(ranks=ranks)

Broadcast the tensor `x` from the source process (rank `0`) to all the other processes in the process group `group`. Only the members of the group take part in the broadcast.

In [None]:
if rank in ranks:
    dist.broadcast(tensor=x, src=0, group=group)
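
To make the effect of the subgroup broadcast visible, the following sketch gives every rank except `0` a zero-filled buffer before the call. It is an illustration under stated assumptions, not the notebook's own solution: the function name `worker`, its parameters, and the use of the `gloo` backend are all choices made here.

```python
import torch
import torch.distributed as dist


def worker(rank, world_size, init_method, ranks):
    """Hypothetical per-process entry point for a subgroup broadcast."""
    dist.init_process_group(
        "gloo", init_method=init_method, rank=rank, world_size=world_size
    )
    try:
        # Only rank 0 starts with the payload; all others hold zeros.
        if rank == 0:
            x = torch.tensor([69, 69, 69])
        else:
            x = torch.zeros(3, dtype=torch.long)

        # Every rank calls new_group, member or not.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            # Members overwrite x in place with the copy held by rank 0.
            dist.broadcast(tensor=x, src=0, group=group)

        print(f"rank={rank}, x={x.tolist()}")
        return x
    finally:
        dist.destroy_process_group()
```

A launch along the lines of `torch.multiprocessing.spawn(worker, args=(7, "tcp://127.0.0.1:23456", [0, 1, 3, 6]), nprocs=7)` should leave ranks `1`, `3`, and `6` holding the values from rank `0`, while ranks `2`, `4`, and `5` keep their zeros.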

### P2P Communication

##### Example 1

Write a script that will run on every process

In [None]:
x = torch.ones(2, 2)

In [None]:
tensor_will_be_received_data = torch.zeros(2, 2)

In [None]:
x

tensor([[1., 1.],
        [1., 1.]])

In [None]:
tensor_will_be_received_data

tensor([[0., 0.],
        [0., 0.]])

In [None]:
x.shape == tensor_will_be_received_data.shape

True

Write a script to run on every process that sends tensor `x` from process `0` to process `1`. Process `1` then receives the data into `tensor_will_be_received_data`

In [None]:
import torch.distributed as dist

In [None]:
if dist.get_rank() == 0:
    dist.send(x, dst=1)
elif dist.get_rank() == 1:
    dist.recv(tensor_will_be_received_data, src=0)

In [None]:
tensor_will_be_received_data

tensor([[1., 1.],
        [1., 1.]])

##### Example 2

In [None]:
x = torch.ones(2, 2)

In [None]:
tensor_will_be_received_data = torch.zeros(2, 2)

In [None]:
x.shape == tensor_will_be_received_data.shape

True

In [None]:
x

tensor([[1., 1.],
        [1., 1.]])

In [None]:
tensor_will_be_received_data

tensor([[0., 0.],
        [0., 0.]])

Write a script to run on every process that sends tensor `x` from process `69` to process `42` **asynchronously**. Process `42` then receives the data into `tensor_will_be_received_data`

In [None]:
import torch.distributed as dist

In [None]:
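if dist.get_rank() == 69:
    request = dist.isend(x, dst=42)
    request.wait()  # block until the send has completed
elif dist.get_rank() == 42:
    request = dist.irecv(tensor_will_be_received_data, src=69)
    request.wait()  # the buffer is only safe to read after this returns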

In [None]:
tensor_will_be_received_data

tensor([[1., 1.],
        [1., 1.]])

##### Example 3

### Collective Communication

##### Example 1

In [None]:
x = torch.zeros(2, 2)

In [None]:
import torch

In [None]:
x

tensor([[0., 0.],
        [0., 0.]])

Write a script to run on every process. The script sends the tensor `x` from process `0` to all other processes

In [None]:
rank = torch.distributed.get_rank()

In [None]:
# broadcast is a collective call: every rank must make it, not only the source
torch.distributed.broadcast(x, src=0)

##### Example 2

In [None]:
import torch.distributed as dist

In [None]:
import torch

In [None]:
torch.distributed

<module 'torch.distributed' from '/Users/education/miniforge3/envs/gym/lib/python3.8/site-packages/torch/distributed/__init__.py'>