In [None]:
from transformers import AutoModel
from torch import optim

In [None]:
model = AutoModel.from_pretrained("gpt2")

In [None]:
optimizer = optim.Adam(model.parameters())

##### Example 1

In [None]:
optimizer

Adam (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.999)
    capturable: False
    differentiable: False
    eps: 1e-08
    foreach: None
    fused: False
    lr: 0.001
    maximize: False
    weight_decay: 0
)

In [None]:
param_list = optimizer.param_groups[0]["params"]

In [None]:
len(param_list)

148

In [None]:
world_size = 16

In [None]:
world_size

16

In [None]:
type(param_list), len(param_list)

(list, 148)

In the context of ZeRO, re-implement the partitioning of the model parameters in `param_list` across 16 accelerators using a greedy algorithm, so that memory usage is balanced as evenly as possible across all accelerators. Explain your code.

**Hint**:
- `x.numel()`
- `sorted(data, key=..., reverse=True)`
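
As a quick illustration of the two hints, the sketch below sorts a few tensor shapes by element count in descending order. It uses plain tuples and `math.prod` as a stand-in for `x.numel()` (the element count of a tensor is the product of its dimensions). The dictionary keys are hypothetical labels chosen for this example, not names taken from the model.

```python
import math

# Hypothetical labels mapped to illustrative tensor shapes.
shapes = {
    "small_vector": (768,),
    "big_matrix": (50257, 768),
    "square_matrix": (768, 768),
}

def numel(shape):
    """Stand-in for tensor.numel(): the product of all dimensions."""
    return math.prod(shape)

# `key` and `reverse` must be passed as keyword arguments to sorted().
ordered = sorted(shapes, key=lambda name: numel(shapes[name]), reverse=True)
sizes = [numel(shapes[name]) for name in ordered]

print(ordered)
print(sizes)
```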

In [None]:
params_per_rank = [[] for _ in range(world_size)]

In [None]:
numel_per_rank = [0 for _ in range(world_size)]

In [None]:
sorted_params = sorted(param_list, key=lambda x: x.numel(), reverse=True)

In [None]:
sorted_params[0].numel(), sorted_params[-1].numel()

(38597376, 768)

In [None]:
for param in sorted_params:
    rank_to_go = numel_per_rank.index(min(numel_per_rank))
    params_per_rank[rank_to_go].append(param)
    numel_per_rank[rank_to_go] += param.numel()

**Explain**
- Initialize two lists, `params_per_rank` and `numel_per_rank`, each with one entry per accelerator. `params_per_rank` starts as a list of empty lists and stores the parameters assigned to each accelerator, while `numel_per_rank` starts at zero and tracks the total number of elements (numel) assigned to each accelerator.
- Sort the parameters by `numel()` in descending order, so that the largest tensors are placed first.
- Iterate over the sorted parameters, and for each parameter:
    + a. Find the accelerator (rank) with the fewest elements assigned so far using `numel_per_rank.index(min(numel_per_rank))`. This assigns the parameter to the accelerator with the least memory usage so far; ties go to the lowest rank.
    + b. Append the parameter to `params_per_rank[rank_to_go]`, the list for the selected accelerator.
    + c. Add the parameter's `numel()` to `numel_per_rank[rank_to_go]`.
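
The steps above can be traced on a small example. The sketch below applies the same greedy rule to plain integers instead of tensors; `greedy_partition` and `toy_sizes` are hypothetical names introduced only for this illustration.

```python
def greedy_partition(sizes, world_size):
    """Assign each size to the currently lightest rank, largest size first."""
    buckets = [[] for _ in range(world_size)]
    loads = [0] * world_size
    for size in sorted(sizes, reverse=True):
        rank = loads.index(min(loads))  # first rank with the smallest load
        buckets[rank].append(size)
        loads[rank] += size
    return buckets, loads

toy_sizes = [8, 7, 6, 5, 4, 3, 2, 1]  # hypothetical element counts
buckets, loads = greedy_partition(toy_sizes, world_size=3)

print(buckets)  # [[8, 3, 2], [7, 4, 1], [6, 5]]
print(loads)    # [13, 12, 11]
```

Placing the largest items first matters: small items are left for the end, where they can fill the gaps between ranks. The rule is a heuristic, so the result is usually close to balanced but is not guaranteed to be optimal.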

In [None]:
numel_per_rank[:3]

[38597376, 5505024, 5898240]
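
Rank 0 holds exactly 38597376 elements, which is the size of the single largest parameter: once that tensor is placed, rank 0 is never the lightest rank again. No partitioning of whole tensors can do better for that rank, since one tensor already exceeds what the other ranks receive. A minimal sketch for quantifying the imbalance follows; `balance_report` is a hypothetical helper, and it is applied here only to the three values printed above. In the notebook you would pass the full `numel_per_rank` list instead.

```python
def balance_report(loads):
    """Summarize how evenly a list of per-rank element counts is spread."""
    total = sum(loads)
    ideal = total / len(loads)  # load per rank if the split were perfectly even
    return {
        "total": total,
        "ideal": ideal,
        "max": max(loads),
        "min": min(loads),
        "max_over_ideal": max(loads) / ideal,
    }

# Only ranks 0-2, copied from the output above; not the full 16-rank list.
first_three = [38597376, 5505024, 5898240]
report = balance_report(first_three)
print(report)
```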

In [None]:
params_per_rank[:2]

[[Parameter containing:
  tensor([[-0.1101, -0.0393,  0.0331,  ..., -0.1364,  0.0151,  0.0453],
          [ 0.0403, -0.0486,  0.0462,  ...,  0.0861,  0.0025,  0.0432],
          [-0.1275,  0.0479,  0.1841,  ...,  0.0899, -0.1297, -0.0879],
          ...,
          [-0.0445, -0.0548,  0.0123,  ...,  0.1044,  0.0978, -0.0695],
          [ 0.1860,  0.0167,  0.0461,  ..., -0.0963,  0.0785, -0.0225],
          [ 0.0514, -0.0277,  0.0499,  ...,  0.0070,  0.1552,  0.1207]],
         requires_grad=True)],
 [Parameter containing:
  tensor([[ 0.0942,  0.0982, -0.0321,  ..., -0.1783,  0.1474,  0.0706],
          [-0.1265, -0.0671,  0.0305,  ...,  0.1966, -0.1203, -0.0628],
          [ 0.0496, -0.0373, -0.0483,  ...,  0.0655, -0.0714,  0.0826],
          ...,
          [ 0.0480,  0.1575,  0.0014,  ..., -0.3987,  0.0889,  0.0240],
          [ 0.0324,  0.1249, -0.0426,  ..., -0.1934,  0.1272, -0.0405],
          [-0.0316,  0.0010, -0.0491,  ..., -0.0406,  0.0536,  0.1896]],
         requires_grad=Tr