# Part 3: Algorithmic Optimizations

We made the model train faster, but now we want to improve its performance. Here, we will try the hyperparameters reported in the GPT-2 and GPT-3 papers to train our model, and see whether we get better performance.


In [None]:
# Imports
import torch
import time
import math
import inspect


1. Adam hyperparameters: $\beta_1 = 0.9$, $\beta_2 = 0.95$, $\epsilon = 10^{-8}$
2. Clipping the global norm of the gradients to 1. Gradient clipping is generally done to handle the problem of exploding gradients. The intuition is that a bad batch of data gives a high loss and therefore large gradients, and you don't want that one batch to be reflected too strongly in the model weights. If the norm of the gradients exceeds some fixed threshold $c$, the gradients are rescaled so that their norm equals $c$: $g \leftarrow c \cdot \frac{g}{||g||}$. In GPT-3, the global norm was clipped at 1. The norm is also worth monitoring: if it keeps increasing or you get a sudden spike, training is becoming unstable (in the initial few iterations the norm can be very high, which is fine).

I'm not entirely sure about two things, which I need to clarify with someone:
- Are we clipping each parameter tensor individually, or treating all parameter gradients as one big concatenated tensor and clipping that? Most likely we're taking the *global* norm over all parameter gradients together.
- Models like GPT-2 have a very large number of parameters. If we clip the global norm to 1, wouldn't the weight updates be *really* small, since the scaled gradients are multiplied again by the learning rate? And if they are that small, why doesn't training in lower precision cause problems?
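
Both questions can be probed with a small pure-Python sketch. It assumes the global-norm interpretation, which is how the PyTorch documentation describes `clip_grad_norm_` (one norm over all gradients, as if concatenated into a single vector). The helper names `clip_global_norm` and `adam_first_step` and the toy gradient values are hypothetical, and the Adam part only looks at the very first update for a single scalar, so it is a hint rather than a full answer.

```python
import math

def clip_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradient lists by one shared factor so that the norm of
    every entry taken together is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

def adam_first_step(g, lr=3e-4, beta1=0.9, beta2=0.95, eps=1e-8):
    """Size of the first Adam update for a single scalar gradient g."""
    m = (1 - beta1) * g          # first moment after one step
    v = (1 - beta2) * g * g      # second moment after one step
    m_hat = m / (1 - beta1)      # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Two hypothetical "parameter tensors" with large gradients
grads = [[3.0, -4.0], [12.0, 0.5, -0.5]]
clipped, norm_before = clip_global_norm(grads)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))

# Adam divides by the gradient magnitude, so rescaling barely changes the update
update_raw = adam_first_step(grads[0][0])
update_clipped = adam_first_step(clipped[0][0])

print(f"norm before: {norm_before:.4f}, norm after: {norm_after:.4f}")
print(f"update raw: {update_raw:.3e}, update clipped: {update_clipped:.3e}")
```

In this sketch every gradient is scaled by the same factor, so the direction of the overall gradient is preserved. The first Adam update has a magnitude of roughly `lr` with or without clipping, because Adam normalizes by the running gradient magnitude. That suggests clipping does not by itself make the updates vanishingly small. Note also that under `torch.autocast` the parameters themselves stay in float32; only selected forward operations run in bfloat16.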


In [None]:
# ... model initialization as before. The None placeholders below only stand in for that code,
# which would be a lot of copying. Initialize model, train_loader and device as in the previous notebooks.
model = None
train_loader = None
device = 'cpu'

# Only optimizer and training loop changes
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8)

for i in range(50):
    t0 = time.time()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)

    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits, loss = model(x, y)

    loss.backward()
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Gradient clipping

    optimizer.step()
    if 'cuda' in device:
        torch.cuda.synchronize() # wait for queued GPU work to finish so the timing is accurate

    t1 = time.time() 

    dt = t1 - t0
    tokens_processed = train_loader.B * train_loader.T

    tokens_per_sec = tokens_processed / dt

    print(f"Step {i:4d} | Loss: {loss.item():.6f} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec}")



3. GPT-3 uses a cosine decay learning rate schedule with warmup. The maximum learning rate for GPT-3 125M is 6e-4. With this schedule, the learning rate starts near zero (not exactly zero), increases linearly up to the maximum learning rate (linear warmup), and then decays along a *cosine* curve until it reaches the specified minimum learning rate, which they set to 10% of the maximum. GPT-3 was trained on 300B tokens. The decay reaches the minimum learning rate at 260B tokens, and the remaining 40B tokens are trained at that minimum. So most of the training happens at learning rates above the final, "decayed" value.

Learning rate scheduling is an active area of research, and people have come up with many different schedules.
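
Written as a formula, the schedule implemented by `get_lr` in the next cell looks roughly as follows. The symbols are introduced here only for illustration: $t$ is the step (`it`), $T_w$ is `warmup_steps`, $T$ is `max_steps`, and $\eta_{\max}$, $\eta_{\min}$ are `max_lr` and `min_lr`.

$$
\eta(t) =
\begin{cases}
\eta_{\max} \cdot \dfrac{t + 1}{T_w} & \text{if } t < T_w \\
\eta_{\min} & \text{if } t > T \\
\eta_{\min} + \dfrac{1}{2}\left(1 + \cos\left(\pi \cdot \dfrac{t - T_w}{T - T_w}\right)\right)\left(\eta_{\max} - \eta_{\min}\right) & \text{otherwise}
\end{cases}
$$

With the values used in the next cell ($\eta_{\max} = 6 \times 10^{-4}$, $\eta_{\min} = 6 \times 10^{-5}$, $T_w = 10$, $T = 50$), this gives $\eta(0) = 6 \times 10^{-5}$, $\eta(10) = 6 \times 10^{-4}$, $\eta(30) = 3.3 \times 10^{-4}$ (halfway through the decay) and $\eta(50) = 6 \times 10^{-5}$.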

In [None]:
max_steps = 50 # maximum optimization "steps"
max_lr = 6e-4
min_lr = max_lr * 0.1 # Min LR is 10% of max lr
warmup_steps = 10

# PyTorch ships learning rate schedulers, but here we implement the schedule by hand
def get_lr(it):
    # 1) Linear warmup for the first warmup_steps iterations
    if it < warmup_steps:
        return max_lr * (it + 1) / warmup_steps # (it + 1) to ensure we don't start at zero i.e. when it=0
    # 2) Past the end of the decay, stay at the minimum learning rate
    if it > max_steps:
        return min_lr
    # 3) In between, cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)

# ... model initialization as before. The None placeholders below only stand in for that code,
# which would be a lot of copying. Initialize model, train_loader and device as in the previous notebooks.
model = None
train_loader = None
device = 'cpu'

# Only optimizer and training loop changes.
# The lr passed here is overwritten by get_lr() at every step.
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, betas=(0.9, 0.95), eps=1e-8)

for step in range(max_steps):
    t0 = time.time()
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)

    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        logits, loss = model(x, y)

    lr = get_lr(step)

    # In PyTorch optimizer, there are different param_groups & you iterate over them to set LR.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    loss.backward()
    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # Gradient clipping

    optimizer.step()
    if 'cuda' in device:
        torch.cuda.synchronize() # wait for queued GPU work to finish so the timing is accurate

    t1 = time.time() 

    dt = t1 - t0
    tokens_processed = train_loader.B * train_loader.T 

    tokens_per_sec = tokens_processed / dt

    print(f"Step {step:4d} | Loss: {loss.item():.6f} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec:.2f} | lr: {lr:.4e}")



4. **Gradual Batch Size Increase:** GPT-3 starts with a small batch size and then ramps it up linearly. The intuition is that the early, easy gains you get by driving some probabilities to zero don't need a big batch. In the early stages the gradients are highly correlated: if all gradients are nearly the same, a small batch already gives you the same information as a big one. So you start with a small batch, and only later in training do you need a bigger one. We have not implemented this because it complicates the arithmetic we do on batches, and it probably doesn't have a major impact on performance, although it may speed up training.

5. **Data Sampling:** GPT-3 sampled data randomly without replacement. We already sample without replacement in the DataLoader, because it iterates over the dataset and so a token, once seen, is not seen again until the next epoch.

6. **Weight Decay for Regularization:** GPT-3 uses a weight decay of 0.1 for regularization. You generally want some types of weights to stay close to zero, for example the weights of matrix multiplications and embeddings. In practice, parameters that are 2D or higher are decayed, while 1D parameters and scalars, such as biases and layer norm parameters, are not. We use weight decay because it pushes the optimizer to use *more* of the weights, i.e. to distribute the work, and doesn't allow any single weight to dominate.
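
The 2D-versus-1D rule can be illustrated without PyTorch. The parameter names and shapes below are hypothetical, loosely modelled on one GPT-2 style block, and only serve to show which side of the split each kind of parameter lands on.

```python
import math

# Hypothetical parameter names and shapes (not read from a real model)
param_shapes = {
    "wte.weight": (50257, 768),             # token embedding: 2D, decayed
    "h.0.attn.c_attn.weight": (2304, 768),  # linear layer weight: 2D, decayed
    "h.0.attn.c_attn.bias": (2304,),        # bias: 1D, not decayed
    "h.0.ln_1.weight": (768,),              # layer norm scale: 1D, not decayed
    "h.0.ln_1.bias": (768,),                # layer norm shift: 1D, not decayed
}

# Same rule as in the text: decay everything with 2 or more dimensions
decay_names = [name for name, shape in param_shapes.items() if len(shape) >= 2]
no_decay_names = [name for name, shape in param_shapes.items() if len(shape) < 2]

# Count how many individual values fall into each group
num_decay = sum(math.prod(param_shapes[name]) for name in decay_names)
num_no_decay = sum(math.prod(param_shapes[name]) for name in no_decay_names)

print(f"decayed tensors: {len(decay_names)}, with {num_decay:,} parameters")
print(f"non-decayed tensors: {len(no_decay_names)}, with {num_no_decay:,} parameters")
```

Even in this toy example almost all parameters sit in the decayed group, since the 2D matrices are far larger than the 1D biases and layer norm vectors.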

In [None]:
import inspect
import torch

# ... Inside the GPT class


def configure_optimizer(self, weight_decay, lr, device):
    # Get all params that require gradient i.e. will be updated by optimizer
    param_dict = {pn:p for pn, p in self.named_parameters()}
    param_dict = {pn:p for pn, p in param_dict.items() if p.requires_grad}

    decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
    non_decay_params = [p for n, p in param_dict.items() if p.dim() < 2]
    
    # AdamW expects a list of parameter groups (a set of dicts would raise a TypeError)
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': non_decay_params, 'weight_decay': 0.0}
    ]

    # Newer versions of PyTorch offer a fused AdamW; older versions do not.
    # When it is available, all parameters are updated in one fused kernel, which runs faster.
    # It is not used by default.
    fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and 'cuda' in device
    # Only pass `fused` when the installed PyTorch accepts it
    extra_args = {'fused': True} if use_fused else {}

    optimizer = torch.optim.AdamW(optim_groups, lr=lr, betas=(0.9, 0.95), eps=1e-8, **extra_args)
    return optimizer


# ... optimizer in the training loop
optimizer = model.configure_optimizer(weight_decay=0.1, lr=6e-4, device='cpu')


There are relationships between the betas, learning rate, weight decay, and batch size, but the topic is quite deep. Refer to the notes from the Deep Learning course for some hints. At the moment, we're just copying the hyperparameters from GPT-3.

## Gradient Accumulation

For the GPT-3 125M model, they used a batch size of 0.5 million tokens. We can't do that directly because such a batch will not fit in GPU memory. But we do need a bigger batch size, because it is tied to the learning rate and the other hyperparameters we copied. So we need some way of *simulating* the larger batch size. For that we have *Gradient Accumulation*: you keep accumulating gradients until you reach your desired batch size (in number of tokens), and only then do the update using the optimizer.

Consider the following simple math:
```python
max_tokens_in_batch = 524288  # tokens we want to process in one batch 2^19
B = 16
T = 1024 # B*T = tokens we are going to pass in the loop

# (2^19) / (16 * 1024) = 32 = number of times we would run the loop to accumulate the gradients and only then update.
# i.e. we would accumulate gradients for 32 'batches', and only then update the weights and reset gradients to simulate desired batch size.

```

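But there is one subtle issue here. Cross-entropy loss is calculated as an average over the batch, so the dividing factor is the number of tokens in the batch. With gradient accumulation, you have to be careful about this dividing factor: each micro batch of $B \times T$ tokens is averaged only over its own tokens, and `loss.backward()` *adds* the resulting gradients together. Summing 32 such gradients gives 32 times the gradient of the mean over the full batch.

One simple fix would be:
Divide the loss of each micro batch by 32 (the number of accumulation steps) *before* calling `loss.backward()`.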
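
The scaling issue can be checked on a toy problem. The sketch below is a minimal pure-Python example with a single scalar weight and a mean-squared-error loss, chosen only because its gradient is easy to write by hand; the helper `mse_grad` and the data are hypothetical, but the averaging behaviour is the same as for cross-entropy.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

micro_batch_size = 2
n_micro = len(xs) // micro_batch_size  # 4 accumulation steps

# Reference: gradient of the mean loss over the full batch
full_grad = mse_grad(w, xs, ys)

unscaled = 0.0  # accumulate micro-batch gradients as they are
scaled = 0.0    # accumulate micro-batch gradients divided by n_micro
for k in range(n_micro):
    lo, hi = k * micro_batch_size, (k + 1) * micro_batch_size
    g = mse_grad(w, xs[lo:hi], ys[lo:hi])
    unscaled += g
    scaled += g / n_micro

print(full_grad)  # -76.5
print(unscaled)   # -306.0, i.e. n_micro times too large
print(scaled)     # -76.5, matches the full-batch gradient
```

Dividing each micro-batch loss by the number of accumulation steps has the same effect as dividing each micro-batch gradient, since the gradient is linear in the loss.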

In [None]:
total_batch_size = 524288 # 2^19 tokens in one batch
B = 16 
T = 1024
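assert total_batch_size % (B * T) == 0, "total_batch_size must be divisible by B * T"
grad_accum_steps = total_batch_size // (B * T) # integer division, because range() below needs an int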
for step in range(max_steps):
    t0 = time.time()
    optimizer.zero_grad()

    accumulated_loss = 0.0
    # Accumulate gradients for some time before you update
    for micro_step in range(grad_accum_steps):
        x, y = train_loader.next_batch()
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            logits, loss = model(x, y) 

        loss = loss / grad_accum_steps
        accumulated_loss += loss.detach() # To keep track of how much loss we accumulated over micro batches for printing
        loss.backward() # accumulate gradients

    # Rest of the loop stays same
    lr = get_lr(step)
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

    norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) 

    optimizer.step()
    if 'cuda' in device:
        torch.cuda.synchronize() # wait for queued GPU work to finish so the timing is accurate

    t1 = time.time() 

    dt = t1 - t0
    tokens_processed = train_loader.B * train_loader.T * grad_accum_steps # You process more tokens in one batch

    tokens_per_sec = tokens_processed / dt

    # Print accumulated loss
    print(f"Step {step:4d} | Loss: {accumulated_loss.item():.6f} | norm: {norm:.4f} | dt: {dt*1000:.2f}ms | tok/sec: {tokens_per_sec:.2f} | lr: {lr:.4e}")


Set $B$ as high as the GPU memory allows. The larger the micro batch, the fewer accumulation steps are needed per update, and the faster the optimization.
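
As a rough illustration of this trade-off, the sketch below computes the number of accumulation steps for a few micro-batch sizes at the same total batch size of $2^{19}$ tokens. The helper name `accum_steps_for` and the candidate values of $B$ are hypothetical; which $B$ actually fits depends on the GPU.

```python
def accum_steps_for(total_batch_size, B, T):
    """Number of micro batches needed to reach total_batch_size tokens."""
    tokens_per_micro_batch = B * T
    if total_batch_size % tokens_per_micro_batch != 0:
        raise ValueError("total_batch_size must be divisible by B * T")
    return total_batch_size // tokens_per_micro_batch

total_batch_size = 524288  # 2^19 tokens per optimizer update
T = 1024

# Larger B means fewer sequential forward/backward passes per update
steps_by_B = {B: accum_steps_for(total_batch_size, B, T) for B in (4, 16, 64)}
for B, steps in steps_by_B.items():
    print(f"B = {B:3d} -> {steps:3d} accumulation steps")
```

The total number of tokens per update is the same in every case; only the number of sequential micro-steps changes.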