<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Code repository: <a href="https://github.com/rasbt/LLMs-from-scratch">https://github.com/rasbt/LLMs-from-scratch</a>
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>

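## FLOPs Analysis

- FLOPs (floating-point operations) measure the computational cost of a neural network model by counting the floating-point operations executed in a pass through the model; this is different from FLOPS (floating-point operations per second), which is a hardware throughput rate
- A higher FLOP count means more computation per pass and, on the same hardware, typically more time and energy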
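
To make the counting concrete, here is a minimal sketch for a single matrix multiplication. It assumes the usual convention that one multiply-accumulate (MAC) counts as two FLOPs; the helper name `matmul_flops` is made up for this illustration.

```python
def matmul_flops(m, k, n):
    """FLOPs for multiplying an (m x k) matrix by a (k x n) matrix."""
    macs = m * k * n  # each of the m*n outputs needs k multiply-accumulates
    return 2 * macs   # assumption: 1 MAC = 1 multiply + 1 add = 2 FLOPs


# A 768 -> 3072 linear layer applied to 1024 tokens (bias ignored)
print(f"{matmul_flops(1024, 768, 3072):.1e} FLOPs")
```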

In [1]:
# pip install -r requirements-extra.txt

In [2]:
from importlib.metadata import version

pkgs = [
    "thop",
    "torch",
]
for p in pkgs:
    print(f"{p} version: {version(p)}")

thop version: 0.1.1-2209072238
torch version: 2.2.1+cu121


&nbsp;
# Simple benchmark with fixed batch size

In [3]:
import torch
from thop import profile

from previous_chapters import GPTModel


BASE_CONFIG = {
    "vocab_size": 50257,     # Vocabulary size
    "context_length": 1024,  # Context length
    "drop_rate": 0.0,        # Dropout rate
    "qkv_bias": True         # Query-key-value bias
}

model_configs = {
    "gpt-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 2
input_tensor = torch.randint(0, 50257, (batch_size, 1024)).to(device)

for size in model_configs:
    BASE_CONFIG.update(model_configs[size])
    
    model = GPTModel(BASE_CONFIG).bfloat16()
    model.to(device)

    # MACS = multiply-accumulate operations
    # MACS are typically counted as two FLOPS (one multiply and one accumulate)
    macs, params = profile(model, inputs=(input_tensor,), verbose=False)
    flops = 2*macs
    print(f"{size:18}: {flops:.1e} FLOPS")
    
    del model
    torch.cuda.empty_cache()

gpt-small (124M)  : 5.1e+11 FLOPS
gpt-medium (355M) : 1.4e+12 FLOPS
gpt-large (774M)  : 3.2e+12 FLOPS
gpt-xl (1558M)    : 6.4e+12 FLOPS
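
As a rough cross-check of these numbers, the forward-pass FLOPs can be estimated by hand. The sketch below is a back-of-the-envelope estimate under stated assumptions: a GPT-2-style block with a 4x feed-forward expansion, and only the linear-layer matrix multiplications counted (attention-score matmuls, biases, normalization, and activations are ignored). The function name `estimate_forward_flops` is hypothetical, and this is not a description of how `thop` counts operations internally.

```python
def estimate_forward_flops(emb_dim, n_layers, vocab_size, n_tokens):
    # Per block and token: Q, K, V, and output projections (4 * d^2 MACs)
    # plus a feed-forward network with 4x expansion (2 * 4 * d^2 MACs)
    macs_per_block = 12 * emb_dim**2
    # Output head projecting from emb_dim to vocab_size
    macs_head = emb_dim * vocab_size
    macs_per_token = n_layers * macs_per_block + macs_head
    return 2 * macs_per_token * n_tokens  # assumption: 1 MAC = 2 FLOPs


n_tokens = 2 * 1024  # batch size 2, context length 1024
for name, (d, n) in {
    "gpt-small (124M)": (768, 12),
    "gpt-medium (355M)": (1024, 24),
    "gpt-large (774M)": (1280, 36),
    "gpt-xl (1558M)": (1600, 48),
}.items():
    print(f"{name:18}: {estimate_forward_flops(d, n, 50257, n_tokens):.1e} FLOPs")
```

Under these assumptions, the estimates round to the same values as the profiler output above (5.1e+11, 1.4e+12, 3.2e+12, and 6.4e+12).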


&nbsp;
# Simple benchmark with automatic batch size finding

In [4]:
for size in model_configs:
    print(f"\nProcessing {size}")
    config = BASE_CONFIG.copy()
    config.update(model_configs[size])

    min_batch_size = 1
    max_batch_size = None
    max_possible_batch_size = 4096

    while min_batch_size <= max_possible_batch_size:
        batch_size = (min_batch_size + max_possible_batch_size) // 2
        try:
            input_tensor = torch.randint(
                0, config["vocab_size"],
                (batch_size, config["context_length"]),
                device=device
            )

            model = GPTModel(config).bfloat16().to(device)

            # MACS = multiply-accumulate operations
            # MACS are typically counted as two FLOPS (one multiply and one accumulate)
            macs, params = profile(model, inputs=(input_tensor,), verbose=False)
            flops = 2 * macs
            print(f"  Batch size {batch_size}: {flops:.1e} FLOPS")

            # If successful, try a larger batch size
            min_batch_size = batch_size + 1
            max_batch_size = batch_size

            # Clean up
            del model, input_tensor
            torch.cuda.empty_cache()

        except RuntimeError as e:
            if "out of memory" in str(e):
                # Try smaller batch size
                max_possible_batch_size = batch_size - 1

                # Clean up
                try:
                    del model, input_tensor
                    torch.cuda.empty_cache()
                except NameError:
                    pass
            else:
                raise e


Processing gpt-small (124M)
  Batch size 128: 3.2e+13 FLOPS
  Batch size 160: 4.0e+13 FLOPS
  Batch size 176: 4.5e+13 FLOPS
  Batch size 184: 4.7e+13 FLOPS
  Batch size 186: 4.7e+13 FLOPS

Processing gpt-medium (355M)
  Batch size 128: 9.3e+13 FLOPS
  Batch size 136: 9.8e+13 FLOPS
  Batch size 140: 1.0e+14 FLOPS
  Batch size 142: 1.0e+14 FLOPS
  Batch size 143: 1.0e+14 FLOPS

Processing gpt-large (774M)
  Batch size 128: 2.0e+14 FLOPS

Processing gpt-xl (1558M)
  Batch size 64: 2.0e+14 FLOPS
  Batch size 96: 3.1e+14 FLOPS
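
The batch size search above is a binary search: a successful run raises the lower bound, and an out-of-memory error lowers the upper bound. A minimal, GPU-free sketch of the same idea is shown below; `find_max_batch_size` and the `fits` callback are hypothetical names, and `fits` stands in for "the forward pass ran without an out-of-memory error".

```python
def find_max_batch_size(fits, low=1, high=4096):
    """Return the largest batch size in [low, high] for which fits() is True,
    or None if no batch size fits. Assumes fits() is monotonic."""
    best = None
    while low <= high:
        mid = (low + high) // 2
        if fits(mid):
            best = mid      # mid works; try something larger
            low = mid + 1
        else:
            high = mid - 1  # mid failed; try something smaller
    return best


# Toy stand-in for a memory limit: pretend anything above 186 runs out of memory
print(find_max_batch_size(lambda b: b <= 186))
```

Because the search halves the interval at every step, at most 13 trials are needed to cover batch sizes from 1 to 4096.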


&nbsp;
# Benchmark with automatic batch size finding and Model FLOPs Utilization (MFU)

- Model FLOPs Utilization (MFU) explanation from the [PaLM paper](https://arxiv.org/abs/2204.02311)

> We propose a new metric for efficiency that is implementation-independent and permits a cleaner comparison of system efficiency, called model FLOPs utilization (MFU). This is the ratio of the observed throughput (tokens-per-second) relative to the theoretical maximum throughput of a system operating at peak FLOPs. Crucially, the “theoretical maximum” throughput only accounts for the required operations to compute the forward+backward passes, and not rematerialization.


$$\text{MFU} = \frac{\text{Observed Tokens per Second}}{\text{Theoretical Max Tokens per Second}}$$

where 

$$\text{Theoretical Max Tokens per Second} = \frac{\text{Max FLOPs per Second}}{\text{Total FLOPs per Token}}$$

and

$$\text{Observed Tokens per Second} = \frac{\text{Batch Size} \times \text{Sequence Length}}{\text{Total Time}}$$
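
The three formulas can be combined into a few lines of Python. The following is a minimal sketch with made-up numbers, not measurements; the function name `compute_mfu` and its parameter names are chosen for this illustration.

```python
def compute_mfu(batch_size, seq_len, total_time_s, flops_per_token, peak_flops_per_s):
    # Observed throughput
    tokens_per_second = batch_size * seq_len / total_time_s
    # Throughput if the hardware ran at its peak FLOPs per second
    max_tokens_per_second = peak_flops_per_s / flops_per_token
    return tokens_per_second / max_tokens_per_second


# Hypothetical values: 2 sequences of 1024 tokens processed in 1 second,
# 1e9 FLOPs per token, on hardware with a peak of 4.096e12 FLOPs per second
mfu = compute_mfu(2, 1024, 1.0, 1e9, 4.096e12)
print(f"MFU: {mfu:.2f}")  # 2048 observed vs. 4096 theoretical tokens/s -> 0.50
```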

In [5]:
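# Approximate peak FLOPs per second taken from GPU vendor datasheets.
# Some FP16/BF16 tensor-core figures may include structured sparsity, so
# check them against the datasheet for your exact GPU before relying on the MFU.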

flops_per_second = {
    "H100": {
        torch.float32: 60e12,  # 60 TFLOPs for FP32 on NVIDIA H100
        torch.float16: 1.979e15,  # 1979 TFLOPs for FP16 on NVIDIA H100
        torch.bfloat16: 1.979e15
    },
    "L4": {
        torch.float32: 15e12,  # 15 TFLOPs for FP32 on NVIDIA L4
        torch.float16: 30e12,   # 30 TFLOPs for FP16 on NVIDIA L4
        torch.bfloat16: 30e12 
    },
    "T4": {
        torch.float32: 8.1e12,  # 8.1 TFLOPs for FP32 on NVIDIA T4
        torch.float16: 130e12,  # 130 TFLOPs for FP16 on NVIDIA T4
        torch.bfloat16: 130e12
    },
    "A10G": {
        torch.float32: 15.6e12,  # 15.6 TFLOPs for FP32 on NVIDIA A10G
        torch.float16: 78e12,    # 78 TFLOPs for FP16 on NVIDIA A10G
        torch.bfloat16: 78e12
    },
    "A100": {
        torch.float32: 19.5e12,  # 19.5 TFLOPs for FP32 on NVIDIA A100
        torch.float16: 1.248e15, # 1248 TFLOPs for FP16 on NVIDIA A100
        torch.bfloat16: 1.248e15
    },
    "H200": {
        torch.float32: 70e12,    # 70 TFLOPs for FP32 on NVIDIA H200
        torch.float16: 1.2e15,   # Assuming 1200 TFLOPs for FP16 on NVIDIA H200
        torch.bfloat16: 1.2e15
    },
    "RTX_3080": {
        torch.float32: 29.8e12,  # 29.8 TFLOPs for FP32 on NVIDIA RTX 3080
        torch.float16: 59.6e12,  # 59.6 TFLOPs for FP16 on NVIDIA RTX 3080
        torch.bfloat16: 59.6e12
    },
    "RTX_3090": {
        torch.float32: 35.6e12,  # 35.6 TFLOPs for FP32 on NVIDIA RTX 3090
        torch.float16: 71.2e12,  # 71.2 TFLOPs for FP16 on NVIDIA RTX 3090
        torch.bfloat16: 71.2e12
    },
    "GTX_1080": {
        torch.float32: 8.9e12,   # 8.9 TFLOPs for FP32 on NVIDIA GTX 1080
        torch.float16: 8.9e12,   # No dedicated FP16 performance; using FP32 value
        torch.bfloat16: 8.9e12
    },
    "GTX_1080Ti": {
        torch.float32: 11.3e12,  # 11.3 TFLOPs for FP32 on NVIDIA GTX 1080Ti
        torch.float16: 11.3e12,  # No dedicated FP16 performance; using FP32 value
        torch.bfloat16: 11.3e12
    },
    "GTX_1660": {
        torch.float32: 5e12,     # 5 TFLOPs for FP32 on NVIDIA GTX 1660
        torch.float16: 5e12,     # No dedicated FP16 performance; using FP32 value
        torch.bfloat16: 5e12
    },
    "GTX_1660Ti": {
        torch.float32: 5.5e12,   # 5.5 TFLOPs for FP32 on NVIDIA GTX 1660Ti
        torch.float16: 5.5e12,   # No dedicated FP16 performance; using FP32 value
        torch.bfloat16: 5.5e12
    }
}


In [10]:
import time

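def get_gpu_model(flops_per_second_dict):
    if not torch.cuda.is_available():
        return "Unknown"  # No CUDA device to query
    # Drop spaces/underscores so "GTX_1080Ti" matches "NVIDIA GeForce GTX 1080 Ti"
    device_name = torch.cuda.get_device_name(0).replace(" ", "")
    # Check longer keys first so "GTX_1080Ti" is preferred over "GTX_1080"
    for model in sorted(flops_per_second_dict, key=len, reverse=True):
        if model.replace("_", "") in device_name:
            return model
    return "Unknown"  # Default if no matching model is found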


gpu_model = get_gpu_model(flops_per_second)
print("GPU Model:", gpu_model)

if gpu_model != "Unknown":

    for size in model_configs:
        print(f"\nProcessing {size}")
        config = BASE_CONFIG.copy()
        config.update(model_configs[size])

        min_batch_size = 1
        max_batch_size = None
        max_possible_batch_size = 4096

        while min_batch_size <= max_possible_batch_size:
            batch_size = (min_batch_size + max_possible_batch_size) // 2
            try:
                input_tensor = torch.randint(
                    0, config["vocab_size"],
                    (batch_size, config["context_length"]),
                    device=device
                )

                model = GPTModel(config).bfloat16().to(device)
                model.train()

                # Start timing
                torch.cuda.synchronize()
                start_time = time.time()

                # Forward & backward pass
                output = model(input_tensor)
                loss = output.sum()  # Compute a dummy loss 
                loss.backward()

                # End timing
                torch.cuda.synchronize()
                end_time = time.time()

                total_time_seconds = end_time - start_time

                # Calculate FLOPs for forward pass
                macs, params = profile(model, inputs=(input_tensor,), verbose=False)
                flops_forward = 2 * macs  # Assuming one MAC equals two FLOPs

                # Estimate FLOPs for backward pass (typically 2x forward FLOPs)
                flops_backward = 2 * flops_forward

                # Total FLOPs for forward + backward passes
                total_flops = flops_forward + flops_backward  # Or total_flops = flops_forward * 3

                data_type = next(model.parameters()).dtype
                max_flops_per_second = flops_per_second[gpu_model].get(data_type, 0)

                # Compute tokens per second
                tokens_processed = batch_size * config["context_length"]
                tokens_per_second = tokens_processed / total_time_seconds

                # Compute FLOPs per token
                flops_per_token = total_flops / tokens_processed

                # Compute theoretical max tokens per second
                if flops_per_token > 0:
                    theoretical_max_tokens_per_second = max_flops_per_second / flops_per_token
                else:
                    theoretical_max_tokens_per_second = 0  # Avoid division by zero

                # Compute MFU
                if theoretical_max_tokens_per_second > 0:
                    mfu = tokens_per_second / theoretical_max_tokens_per_second
                else:
                    mfu = 0  # Avoid division by zero

                print(f"  Batch size {batch_size}: Tokens/sec: {tokens_per_second:.2f}, MFU: {mfu:.4f}")

                # If successful, try a larger batch size
                min_batch_size = batch_size + 1
                max_batch_size = batch_size

                # Clean up
                del model, input_tensor, output, loss
                torch.cuda.empty_cache()

            except RuntimeError as e:
                if "out of memory" in str(e).lower():
                    # Try smaller batch size
                    max_possible_batch_size = batch_size - 1

                    # Clean up
                    try:
                        del model, input_tensor
                        torch.cuda.empty_cache()
                    except NameError:
                        pass
                else:
                    raise e

else:
    print("Unknown GPU model. Please update the flops_per_second dictionary with your GPU information.")

GPU Model: L4

Processing gpt-small (124M)
  Batch size 8: Tokens/sec: 14488.21, MFU: 0.3580
  Batch size 12: Tokens/sec: 15378.16, MFU: 0.3799

Processing gpt-medium (355M)
  Batch size 2: Tokens/sec: 6493.81, MFU: 0.4591
  Batch size 3: Tokens/sec: 6328.82, MFU: 0.4474

Processing gpt-large (774M)
  Batch size 4: Tokens/sec: 3130.38, MFU: 0.4834

Processing gpt-xl (1558M)
  Batch size 2: Tokens/sec: 1896.17, MFU: 0.5897


- Note that the batch sizes are smaller than in the previous section because we also run the backward pass here, which needs additional memory for the stored activations and the gradients