In [1]:
import pandas as pd
import numpy as np
import torch

# 1 Assignment Overview
## 1.1 Profiling and Benchmarking
### 1.1.3 End-to-End Benchmarking

We start with forward and backward passes, using 5 warmup steps and 10 benchmark steps at context length 256. Variance is low: the standard deviation is under 1% of the mean for every model size.

In [None]:
# context_length=256
results_forwardandbackward_w5_n10 = {'small': {'forward_only': False, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.038934001384768636), 'std': np.float64(0.00019227064981187754)}, 'medium': {'forward_only': False, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.09081623799866065), 'std': np.float64(0.0005792887672763797)}, 'large': {'forward_only': False, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.2036572418292053), 'std': np.float64(0.000794752277815363)}, 'xl': {'forward_only': False, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.3937284264015034), 'std': np.float64(0.0009801636726452544)}, '2.7B': {'forward_only': False, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.5568876736913808), 'std': np.float64(0.0001973441361972124)}}
df = pd.DataFrame(results_forwardandbackward_w5_n10).T
print(df.to_markdown())

|        | forward_only   |   warmup_steps |   benchmark_steps |       avg |         std |
|:-------|:---------------|---------------:|------------------:|----------:|------------:|
| small  | False          |              5 |                10 | 0.038934  | 0.000192271 |
| medium | False          |              5 |                10 | 0.0908162 | 0.000579289 |
| large  | False          |              5 |                10 | 0.203657  | 0.000794752 |
| xl     | False          |              5 |                10 | 0.393728  | 0.000980164 |
| 2.7B   | False          |              5 |                10 | 0.556888  | 0.000197344 |
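
The warmup-then-measure procedure behind numbers like these can be sketched as follows. This is a minimal illustration and not the script that produced the table: the `benchmark` function, its `step` and `sync` arguments, and the stand-in workload are all hypothetical names, and it uses the sample standard deviation, which may differ from what was used above.

```python
import statistics
import time
from typing import Callable, Optional


def benchmark(
    step: Callable[[], None],
    warmup_steps: int = 5,
    benchmark_steps: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> dict:
    """Time `step` after discarding `warmup_steps` untimed calls."""

    def run_once() -> None:
        step()
        if sync is not None:
            sync()  # wait for queued asynchronous work (e.g. GPU kernels)

    # Warmup calls are executed but never timed.
    for _ in range(warmup_steps):
        run_once()

    times = []
    for _ in range(benchmark_steps):
        start = time.perf_counter()
        run_once()
        times.append(time.perf_counter() - start)

    return {
        "warmup_steps": warmup_steps,
        "benchmark_steps": benchmark_steps,
        "avg": statistics.mean(times),
        "std": statistics.stdev(times) if len(times) > 1 else 0.0,
    }


# Stand-in CPU workload so the sketch runs without a GPU.
result = benchmark(lambda: sum(i * i for i in range(10_000)))
print(result)
```

When timing a CUDA model, `sync=torch.cuda.synchronize` would be the natural choice, since kernel launches return before the GPU has finished the work.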


Now we time only the forward pass. The times are roughly 1/3 of the forward-plus-backward times (between 0.31 and 0.41 across model sizes), which is consistent with the backward pass costing about twice the FLOPs of the forward pass.

In [None]:
results_forwardonly_w5_n10 = {'small': {'forward_only': True, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.01579690339276567), 'std': np.float64(0.00012533750080744773)}, 'medium': {'forward_only': True, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.030950926861260088), 'std': np.float64(0.0006028892862904648)}, 'large': {'forward_only': True, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.06545800276799127), 'std': np.float64(7.117279307500996e-05)}, 'xl': {'forward_only': True, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.12748205178650096), 'std': np.float64(0.00012059490126578412)}, '2.7B': {'forward_only': True, 'warmup_steps': 5, 'benchmark_steps': 10, 'avg': np.float64(0.17424825453199447), 'std': np.float64(6.339024647792802e-05)}}
df = pd.DataFrame(results_forwardonly_w5_n10).T
print(df.to_markdown())

|        | forward_only   |   warmup_steps |   benchmark_steps |       avg |         std |
|:-------|:---------------|---------------:|------------------:|----------:|------------:|
| small  | True           |              5 |                10 | 0.0157969 | 0.000125338 |
| medium | True           |              5 |                10 | 0.0309509 | 0.000602889 |
| large  | True           |              5 |                10 | 0.065458  | 7.11728e-05 |
| xl     | True           |              5 |                10 | 0.127482  | 0.000120595 |
| 2.7B   | True           |              5 |                10 | 0.174248  | 6.33902e-05 |
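
As a quick check of the 1/3 estimate, the ratio of forward-only time to forward-plus-backward time can be computed per model. The `avg` values below are copied from the two 5-warmup tables; the variable names are just for illustration. If the backward pass cost exactly twice the forward pass and time tracked FLOPs perfectly, every ratio would be 1/3.

```python
# avg values copied from the two warmup_steps=5 tables
full_step = {
    "small": 0.038934,
    "medium": 0.0908162,
    "large": 0.203657,
    "xl": 0.393728,
    "2.7B": 0.556888,
}
forward_only = {
    "small": 0.0157969,
    "medium": 0.0309509,
    "large": 0.065458,
    "xl": 0.127482,
    "2.7B": 0.174248,
}

# Fraction of the full forward + backward time spent in the forward pass.
ratios = {name: forward_only[name] / full_step[name] for name in full_step}
for name, ratio in ratios.items():
    print(f"{name:>6}: forward / (forward + backward) = {ratio:.3f}")
```

The small model sits furthest from 1/3, which would fit fixed per-step overhead mattering more when the kernels themselves are short, though that is a guess rather than something measured here.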


Without warmup, the standard deviation is far higher (it exceeds the mean for the small model) and the averages are inflated, most of all for the smaller models, which are also benchmarked first. This is likely one-time overhead in the first couple of runs. It does not matter in the long run, which is what we care about.

In [5]:
results_forwardandbackward_w0_n10 = {'small': {'forward_only': False, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.0712938507203944), 'std': np.float64(0.09671228201616748)}, 'medium': {'forward_only': False, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.10467507569119334), 'std': np.float64(0.0287743982342857)}, 'large': {'forward_only': False, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.2092732895980589), 'std': np.float64(0.011061761055204388)}, 'xl': {'forward_only': False, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.3976040959940292), 'std': np.float64(0.007775187865667317)}, '2.7B': {'forward_only': False, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.5620815886883065), 'std': np.float64(0.014736830905364812)}}
df = pd.DataFrame(results_forwardandbackward_w0_n10).T
print(df.to_markdown())

|        | forward_only   |   warmup_steps |   benchmark_steps |       avg |        std |
|:-------|:---------------|---------------:|------------------:|----------:|-----------:|
| small  | False          |              0 |                10 | 0.0712939 | 0.0967123  |
| medium | False          |              0 |                10 | 0.104675  | 0.0287744  |
| large  | False          |              0 |                10 | 0.209273  | 0.0110618  |
| xl     | False          |              0 |                10 | 0.397604  | 0.00777519 |
| 2.7B   | False          |              0 |                10 | 0.562082  | 0.0147368  |


In [6]:
results_forwardonly_w0_n10 = {'small': {'forward_only': True, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.04331547362962738), 'std': np.float64(0.07872015609390083)}, 'medium': {'forward_only': True, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.04426585491746664), 'std': np.float64(0.03240265923601385)}, 'large': {'forward_only': True, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.07094462101813406), 'std': np.float64(0.011289025703075812)}, 'xl': {'forward_only': True, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.13053130987100303), 'std': np.float64(0.007699488997647922)}, '2.7B': {'forward_only': True, 'warmup_steps': 0, 'benchmark_steps': 10, 'avg': np.float64(0.17898933509131892), 'std': np.float64(0.013183273384352073)}}
df = pd.DataFrame(results_forwardonly_w0_n10).T
print(df.to_markdown())

|        | forward_only   |   warmup_steps |   benchmark_steps |       avg |        std |
|:-------|:---------------|---------------:|------------------:|----------:|-----------:|
| small  | True           |              0 |                10 | 0.0433155 | 0.0787202  |
| medium | True           |              0 |                10 | 0.0442659 | 0.0324027  |
| large  | True           |              0 |                10 | 0.0709446 | 0.011289   |
| xl     | True           |              0 |                10 | 0.130531  | 0.00769949 |
| 2.7B   | True           |              0 |                10 | 0.178989  | 0.0131833  |


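Even two warmup steps recover nearly all of the benefit: the averages match the 5-warmup runs to within about 1%, and the standard deviations drop by a factor of 20 or more compared with no warmup.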

In [7]:
results_forwardandbackward_w2_n10 = {'small': {'forward_only': False, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.039019133895635605), 'std': np.float64(0.0001920012991025753)}, 'medium': {'forward_only': False, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.0909191724145785), 'std': np.float64(0.0009752176591215706)}, 'large': {'forward_only': False, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.20466322761494665), 'std': np.float64(0.0004757270254612791)}, 'xl': {'forward_only': False, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.3947079855017364), 'std': np.float64(0.00036701886755863614)}, '2.7B': {'forward_only': False, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.5580237713409588), 'std': np.float64(0.0002497751298431857)}}
df = pd.DataFrame(results_forwardandbackward_w2_n10).T
print(df.to_markdown())

|        | forward_only   |   warmup_steps |   benchmark_steps |       avg |         std |
|:-------|:---------------|---------------:|------------------:|----------:|------------:|
| small  | False          |              2 |                10 | 0.0390191 | 0.000192001 |
| medium | False          |              2 |                10 | 0.0909192 | 0.000975218 |
| large  | False          |              2 |                10 | 0.204663  | 0.000475727 |
| xl     | False          |              2 |                10 | 0.394708  | 0.000367019 |
| 2.7B   | False          |              2 |                10 | 0.558024  | 0.000249775 |


In [8]:
results_forwardonly_w2_n10 = {'small': {'forward_only': True, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.01597115001641214), 'std': np.float64(0.00028466750118917957)}, 'medium': {'forward_only': True, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.03091297169448808), 'std': np.float64(0.0006388669542270524)}, 'large': {'forward_only': True, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.06533648789627478), 'std': np.float64(0.0001878294531717484)}, 'xl': {'forward_only': True, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.1273713317932561), 'std': np.float64(0.000170664876891159)}, '2.7B': {'forward_only': True, 'warmup_steps': 2, 'benchmark_steps': 10, 'avg': np.float64(0.17423666048562153), 'std': np.float64(6.371911033421974e-05)}}
df = pd.DataFrame(results_forwardonly_w2_n10).T
print(df.to_markdown())

|        | forward_only   |   warmup_steps |   benchmark_steps |       avg |         std |
|:-------|:---------------|---------------:|------------------:|----------:|------------:|
| small  | True           |              2 |                10 | 0.0159712 | 0.000284668 |
| medium | True           |              2 |                10 | 0.030913  | 0.000638867 |
| large  | True           |              2 |                10 | 0.0653365 | 0.000187829 |
| xl     | True           |              2 |                10 | 0.127371  | 0.000170665 |
| 2.7B   | True           |              2 |                10 | 0.174237  | 6.37191e-05 |
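
To make the warmup comparison concrete, the sketch below computes the standard deviation relative to the mean for the forward-plus-backward runs at 0, 2, and 5 warmup steps. The `(avg, std)` pairs are copied from the tables above; the dictionary layout and names are illustrative only.

```python
# (avg, std) for forward + backward, copied from the tables above,
# keyed by the number of warmup steps
runs = {
    0: {
        "small": (0.0712939, 0.0967123),
        "medium": (0.104675, 0.0287744),
        "large": (0.209273, 0.0110618),
        "xl": (0.397604, 0.00777519),
        "2.7B": (0.562082, 0.0147368),
    },
    2: {
        "small": (0.0390191, 0.000192001),
        "medium": (0.0909192, 0.000975218),
        "large": (0.204663, 0.000475727),
        "xl": (0.394708, 0.000367019),
        "2.7B": (0.558024, 0.000249775),
    },
    5: {
        "small": (0.038934, 0.000192271),
        "medium": (0.0908162, 0.000579289),
        "large": (0.203657, 0.000794752),
        "xl": (0.393728, 0.000980164),
        "2.7B": (0.556888, 0.000197344),
    },
}

# Relative standard deviation (std / avg) per warmup setting and model.
rel_std = {
    warmup: {model: std / avg for model, (avg, std) in models.items()}
    for warmup, models in runs.items()
}

for warmup, models in rel_std.items():
    row = ", ".join(f"{model}: {100 * value:.2f}%" for model, value in models.items())
    print(f"warmup={warmup}: {row}")
```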


### 1.1.4 Nsight Systems Profiler

**A.** For the forward and backward passes, the times reported by Nsight Systems closely match the Python benchmark timings from 1.1.3.

We only ran out of memory on the 1024 context length XL and 2.7B model (without optimizer step).

**B.** The `sm80_xmma_gemm_f32f32_f32f32_f32_tn_n_tilesize128x128x8_stage3_warpsize2x2x1_ffma_aligna4_alignc4_execute_kernel__5x_cublas` kernel takes the most time during the forward pass of the large model at context length 1024. It is called 145 times and accounts for 46.8% of the time. The same kernel also takes the most runtime when forward and backward passes are profiled together, although its share drops to 17.0%.

**C.** There are a few element-wise kernels, but those max out at 4% of the time.

**D.** With the AdamW step included, the runtime breakdown looks similar to the forward-plus-backward profile; each kernel's share is just slightly smaller. AdamW itself mostly uses element-wise kernels.

**E.** The attention matmuls take only around 20% more time than softmax, even though softmax involves far fewer FLOPs. This is surprising if runtime is assumed to track FLOPs.
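
One possible explanation, which the profile above does not confirm, is that softmax is limited by memory bandwidth rather than by arithmetic, so its runtime tracks bytes moved instead of FLOPs. The rough estimate below compares FLOPs per byte for the attention matmuls and for softmax. Everything in it is an assumption made for illustration: the function name, the shapes (batch 4, 16 heads, head dimension 64), the figure of 5 FLOPs per softmax element, and the simplification that each operand is read or written exactly once.

```python
def attention_cost_estimate(batch, heads, seq_len, head_dim,
                            bytes_per_element=4, softmax_flops_per_element=5):
    """Rough FLOP and memory-traffic estimates for the core attention ops."""
    score_elements = batch * heads * seq_len * seq_len
    qkvo_elements = 4 * batch * heads * seq_len * head_dim  # Q, K, V and the output

    # Q @ K^T and P @ V each cost 2 * head_dim FLOPs per score element.
    matmul_flops = 2 * (2 * head_dim * score_elements)
    # The matmuls read Q, K, V, write the output, write the scores,
    # and read the probabilities.
    matmul_bytes = (qkvo_elements + 2 * score_elements) * bytes_per_element

    # Softmax: a few FLOPs per element, one read and one write of the scores.
    softmax_flops = softmax_flops_per_element * score_elements
    softmax_bytes = 2 * score_elements * bytes_per_element

    return {
        "flop_ratio": matmul_flops / softmax_flops,
        "matmul_intensity": matmul_flops / matmul_bytes,      # FLOPs per byte
        "softmax_intensity": softmax_flops / softmax_bytes,   # FLOPs per byte
    }


# Hypothetical shapes, chosen only to illustrate the orders of magnitude.
estimate = attention_cost_estimate(batch=4, heads=16, seq_len=1024, head_dim=64)
for key, value in estimate.items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions the matmuls do tens of times more FLOPs than softmax, but softmax performs well under one FLOP per byte it touches, which would make similar runtimes less surprising.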

### 1.1.5 Mixed Precision

In [3]:
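# Accumulate float32 values in a float32 accumulator
s = torch.tensor(0,dtype=torch.float32)
for i in range(1000):
    s += torch.tensor(0.01,dtype=torch.float32)
print(s)

# Accumulate float16 values in a float16 accumulator
s = torch.tensor(0,dtype=torch.float16)
for i in range(1000):
    s += torch.tensor(0.01,dtype=torch.float16)
print(s)

# Accumulate float16 values in a float32 accumulator (the in-place add keeps s in float32)
s = torch.tensor(0,dtype=torch.float32)
for i in range(1000):
    s += torch.tensor(0.01,dtype=torch.float16)
print(s)

# Same as above, but upcast each float16 value to float32 explicitly
s = torch.tensor(0,dtype=torch.float32)
for i in range(1000):
    x = torch.tensor(0.01,dtype=torch.float16)
    s += x.type(torch.float32)
print(s)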

tensor(10.0001)
tensor(9.9531, dtype=torch.float16)
tensor(10.0021)
tensor(10.0021)
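
The last two results agree because the error there comes from storing 0.01 in float16, not from the accumulation. The pure-Python sketch below illustrates this using the standard library's half-precision `struct` format instead of torch. The helper name `to_fp16` is made up for this example, and rounding the running sum after every add is only an approximation of what the float16 loop above does.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# 0.01 is not exactly representable in float16; it rounds slightly upward.
x16 = to_fp16(0.01)
print(f"0.01 stored in float16: {x16!r}")
print(f"1000 * that value:      {1000 * x16:.4f}")

# Accumulate entirely in float16: round the running sum after every add.
s16 = 0.0
for _ in range(1000):
    s16 = to_fp16(s16 + x16)
print(f"float16 accumulator:    {s16}")
```

With a float32 accumulator the only visible error is the representation error of 0.01 itself, scaled by 1000. With a float16 accumulator, each add is also rounded to the coarser float16 grid as the sum grows, so the total drifts away from 10.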


**=== Benchmarking Mixed Precision ===**

**A.** Code below

In [14]:
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 10, bias=False)
        self.ln = nn.LayerNorm(10)
        self.fc2 = nn.Linear(10, out_features, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Print dtype of model parameters within the autocast context
        print(f"dtype of fc1.weight: {self.fc1.weight.dtype}")
        print(f"dtype of ln.weight: {self.ln.weight.dtype}")
        print(f"dtype of fc2.weight: {self.fc2.weight.dtype}")
        x = self.fc1(x)
        print(f"dtype of first FF layer (fc1 output): {x.dtype}")
        x = self.relu(x)
        x = self.ln(x)
        print(f"dtype after layernorm: {x.dtype}")
        x = self.fc2(x)
        return x

# Model parameters are initialized in FP32
model = ToyModel(5, 20).to('cuda')
ins = torch.arange(5, dtype=torch.float32, device='cuda').unsqueeze(0)  # Add batch dimension for LN

# We'll compute a dummy loss as well
target = torch.zeros((1, 20), dtype=torch.float32, device='cuda')

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(ins)
    print(f"dtype of logits (model output): {out.dtype}")
    # Use MSELoss as dummy loss function, will also print dtype
    loss_fn = nn.MSELoss()
    loss = loss_fn(out, target)
    print(f"dtype of loss: {loss.dtype}")

# Backward to get gradients
loss.backward()
# Print gradients dtypes for all parameters
for name, param in model.named_parameters():
    # param.grad may be None if unused
    grad_dtype = param.grad.dtype if param.grad is not None else None
    print(f"dtype of gradient for {name}: {grad_dtype}")

print(f"Output shape: {out.shape}")

# Summary comment (for reference, not printed in code):
# - Model parameters: FP32 throughout
# - fc1 output: BF16 (autocast active)
# - layernorm output: FP32 (autocast runs LayerNorm in FP32)
# - Model logits (fc2 output): BF16 (autocast active)
# - Loss: FP32 (reduction with FP32 accumulation)
# - Model gradients: FP32 (gradients are stored in the parameters' dtype)

dtype of fc1.weight: torch.float32
dtype of ln.weight: torch.float32
dtype of fc2.weight: torch.float32
dtype of first FF layer (fc1 output): torch.bfloat16
dtype after layernorm: torch.float32
dtype of logits (model output): torch.bfloat16
dtype of loss: torch.float32
dtype of gradient for fc1.weight: torch.float32
dtype of gradient for ln.weight: torch.float32
dtype of gradient for ln.bias: torch.float32
dtype of gradient for fc2.weight: torch.float32
Output shape: torch.Size([1, 20])


Only the matmuls are downcast.

**B.** LayerNorm is special because it computes a variance (a sum of squares), which can overflow the maximum float16 value (65504). BF16 keeps the 8-bit exponent of float32, which avoids the overflow. However, BF16 has only 7 explicit mantissa bits, so the mean and variance may then lose too much precision.

**C.** Mixed precision gives a large speedup on the 2.7B model at context length 256, but it actually seems to be slightly slower than full precision on the small and medium models. Perhaps the matmuls there take so little time that the savings do not cover the cost of the dtype conversions.
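
The range-versus-precision trade-off in B can be illustrated with a little arithmetic on the bit layouts. The helper `float_format_stats` and the example activations (2560 values of magnitude 8) are hypothetical, chosen only to show a sum of squares that exceeds the float16 range.

```python
def float_format_stats(exponent_bits: int, mantissa_bits: int) -> dict:
    """Largest finite value and spacing at 1.0 for an IEEE-style binary format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent
    epsilon = 2.0 ** -mantissa_bits  # gap between 1.0 and the next value
    return {"max": max_value, "eps": epsilon}


formats = {
    "float16": float_format_stats(exponent_bits=5, mantissa_bits=10),
    "bfloat16": float_format_stats(exponent_bits=8, mantissa_bits=7),
    "float32": float_format_stats(exponent_bits=8, mantissa_bits=23),
}

# Hypothetical activations: 2560 features, each of magnitude 8.
sum_of_squares = 2560 * 8.0 ** 2

for name, stats in formats.items():
    overflows = sum_of_squares > stats["max"]
    print(f"{name:>8}: max={stats['max']:.4g}, eps={stats['eps']:.3g}, "
          f"sum of squares overflows: {overflows}")
```

float16 overflows on this sum, while bfloat16 does not but has a spacing at 1.0 that is eight times coarser than float16's.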

### 1.1.6 Profiling Memory

**A.** There are clear separations between forward pass, backward pass, and optimizer. 

**B.** Peak memory usage during the forward pass (2.7B model) by context length: 128: 46 GB; 256: 53 GB; 512: 70 GB. In the full training step, usage always goes to 55 GB by the optimizer step.

**C.** With mixed precision, the peak at context length 512 drops to 67 GB. Optimizer memory does not change. The saving is modest (about 3 GB) but noticeable.

**D.** Size of a Transformer residual stream tensor: B x C x D = 4 x 512 x 2560 = 5,242,880 elements. At 4 bytes per element, that is 5,242,880 x 4 / 1024^2 = 20 MiB.

**E.** Most of the allocations seem to come from autograd. Attention, including softmax, is also a major contributor.
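
The residual-size arithmetic in D is easy to reproduce for other shapes and dtypes. The helper below is a hypothetical convenience function; the shape values are the ones used in D.

```python
def tensor_size_mib(batch_size: int, context_length: int, d_model: int,
                    bytes_per_element: int = 4) -> float:
    """Size in MiB of a (batch, context, d_model) activation tensor."""
    num_elements = batch_size * context_length * d_model
    return num_elements * bytes_per_element / 1024 ** 2


# Shape from D: batch 4, context length 512, d_model 2560.
fp32_mib = tensor_size_mib(4, 512, 2560, bytes_per_element=4)
bf16_mib = tensor_size_mib(4, 512, 2560, bytes_per_element=2)
print(f"float32:  {fp32_mib} MiB")
print(f"bfloat16: {bf16_mib} MiB")
```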