# Lesson 1 Demo: Profiling Baseline LLM Performance

### Goal
This notebook demonstrates how to establish a baseline performance profile for a Large Language Model (LLM) using `torch.profiler`. We will measure key metrics like latency, throughput, and memory usage on both CPU and GPU.

### Learning Objectives
*   Set up `torch.profiler` for both CPU and GPU activities.
*   Measure and understand end-to-end latency.
*   Calculate and interpret single-stream throughput (tokens/second).
*   Measure the model's memory footprint.
*   Appreciate the performance difference between CPU and GPU for LLM inference.

## Step 0: Setup

This demo needs `torch` and `transformers` to be installed. Once they are available, we import our dependencies.

### Import Dependencies
Now we import all the necessary modules. We'll use `torch` for the core operations, `transformers` to get our model, `time` for basic timing, and `gc` for garbage collection to get cleaner memory readings.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch.profiler
import time
import gc

## Step 1: Model and Data Preparation

Next, we'll load our pre-trained model and tokenizer, and prepare the input prompt.

### 1.1. Load Model and Tokenizer
We'll use `gpt2-medium`, a moderately sized model that's good for demonstration purposes. It's large enough to show a significant CPU vs. GPU difference but small enough to run on most modern GPUs.

In [3]:
model_name = "gpt2-medium"
try:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    print(f"Loaded model and tokenizer for: {model_name}")
except Exception as e:
    print(f"Error loading model {model_name}. Make sure you have internet connection "
          f"and the model name is correct. Error: {e}")
    # In a real script, you might exit or raise the error
    # exit()

Loaded model and tokenizer for: gpt2-medium


### 1.2. Prepare Inputs
We define our prompt and set the number of new tokens we want to generate. It's also important to handle the `pad_token` for models that don't have one set by default, which is common for decoder-only models like GPT-2.

In [4]:
# Handle padding token for open-ended generation
if tokenizer.pad_token is None:
    print("Tokenizer does not have a pad token, setting it to eos_token.")
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

# Prepare input text
prompt = "Alan Turing was a pioneering computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist."
inputs = tokenizer(prompt, return_tensors="pt")
num_tokens_in_prompt = inputs['input_ids'].shape[1]
num_new_tokens_to_generate = 50
total_tokens_for_calc = num_new_tokens_to_generate # For tokens/sec, we only care about *newly* generated tokens

print(f"Input prompt: '{prompt}' (Length: {num_tokens_in_prompt} tokens)")
print(f"Will generate {num_new_tokens_to_generate} new tokens.")

Tokenizer does not have a pad token, setting it to eos_token.
Input prompt: 'Alan Turing was a pioneering computer scientist, mathematician, logician, cryptanalyst, philosopher, and theoretical biologist.' (Length: 23 tokens)
Will generate 50 new tokens.


## Step 2: Define Utility Functions

To keep our profiling code clean and reusable, we'll create a couple of helper functions: one to run inference and another to measure memory usage.

In [5]:
def run_inference(model_to_run, input_data, max_tokens, pad_id):
    """Runs model.generate within a torch.no_grad() context to disable gradients."""
    with torch.no_grad():
        outputs = model_to_run.generate(
            input_ids=input_data["input_ids"],
            attention_mask=input_data.get("attention_mask"),
            max_new_tokens=max_tokens,
            pad_token_id=pad_id,
            # We set eos_token_id to an unused value to ensure the model generates
            # the full number of requested tokens for consistent benchmarking.
            eos_token_id=-1, 
        )
    return outputs

def get_memory_usage_gb(device_type="cpu", device_index=0):
    """Returns (allocated, reserved) memory for a device, in GB computed as bytes / 1024**3."""
    gc.collect() # Force garbage collection for more accurate readings
    
    if device_type == "cuda" and torch.cuda.is_available():
        torch.cuda.empty_cache() # Clear unused memory from PyTorch's cache
        mem_allocated_bytes = torch.cuda.memory_allocated(device_index)
        mem_reserved_bytes = torch.cuda.memory_reserved(device_index)
        return mem_allocated_bytes / (1024**3), mem_reserved_bytes / (1024**3)
    elif device_type == "cpu":
        # Approximation: counts the model's parameters only (no buffers,
        # activations, or other process memory). Reads the global `model`
        # loaded in Step 1. For a full process memory profile, external
        # tools are better.
        model_mem_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
        return model_mem_bytes / (1024**3), -1 # -1 indicates no 'reserved' memory concept here
    return 0, 0
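
The CPU branch above is plain arithmetic: the number of parameters times the bytes per parameter, divided by `1024**3`. As a rough, self-contained sketch of that calculation (the helper name `param_memory_gib` and the parameter count of about 355 million for `gpt2-medium` are assumptions for illustration, not values read from the model):

```python
def param_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Estimate parameter memory in GiB (bytes / 1024**3)."""
    return num_params * bytes_per_param / (1024 ** 3)

# Assumed, approximate parameter count for gpt2-medium.
approx_params = 355_000_000

fp32_gib = param_memory_gib(approx_params, 4)  # float32: 4 bytes per parameter
fp16_gib = param_memory_gib(approx_params, 2)  # float16: 2 bytes per parameter

print(f"float32: {fp32_gib:.3f} GiB")
print(f"float16: {fp16_gib:.3f} GiB")
```

Under these assumptions the float32 estimate lands near 1.32, which is in line with the parameter memory this notebook reports for the CPU. Halving the bytes per parameter halves the estimate.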

## Step 3: Profile on CPU

Let's establish our baseline by running the model on the CPU. We'll measure the memory footprint, latency, and throughput.

In [6]:
print("--- Profiling on CPU ---")
cpu_device = torch.device("cpu")

# Ensure model and inputs are on the CPU
model.to(cpu_device)
inputs_cpu = {k: v.to(cpu_device) for k, v in inputs.items()}

# Measure memory
cpu_model_load_mem_allocated_gb, _ = get_memory_usage_gb(device_type="cpu")
print(f"CPU Model Parameter Memory (Approx.): {cpu_model_load_mem_allocated_gb:.3f} GB")

# Profile the inference call
print("Running inference on CPU and capturing profile...")
start_time_cpu_wall = time.time()

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=True
) as prof_cpu:
    with torch.profiler.record_function("model_inference_cpu"):
        generated_outputs_cpu = run_inference(model, inputs_cpu, num_new_tokens_to_generate, tokenizer.pad_token_id)

end_time_cpu_wall = time.time()

--- Profiling on CPU ---
CPU Model Parameter Memory (Approx.): 1.322 GB
Running inference on CPU and capturing profile...


STAGE:2025-06-15 23:51:54 7687:7687 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
`eos_token_id` should consist of positive integers, but is tensor([-1]). Your generation will not stop until the maximum length is reached. Depending on other flags, it may even crash.
STAGE:2025-06-15 23:52:01 7687:7687 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2025-06-15 23:52:01 7687:7687 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


### 3.1. Analyze CPU Results
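Now we calculate the latency and throughput from our CPU run. Wall-clock time (`time.time()`) is sufficient for measuring synchronous CPU operations, although `time.perf_counter()` is the better choice for timing intervals. Note that the timer in the previous cell starts before the profiler context is entered and stops after it exits, so the latency below also includes profiler start-up and post-processing overhead.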

In [7]:
# Calculate Latency
cpu_e2e_latency_s = end_time_cpu_wall - start_time_cpu_wall
print(f"CPU End-to-End Latency: {cpu_e2e_latency_s:.4f} seconds")

# Calculate Throughput
cpu_tokens_per_second = 0
if cpu_e2e_latency_s > 0:
    cpu_tokens_per_second = total_tokens_for_calc / cpu_e2e_latency_s
    print(f"CPU Single-Stream Throughput: {cpu_tokens_per_second:.2f} tokens/second")

print("CPU Profiling data captured in 'prof_cpu' object. You can analyze it further with prof_cpu.key_averages().")

CPU End-to-End Latency: 18.6154 seconds
CPU Single-Stream Throughput: 2.69 tokens/second
CPU Profiling data captured in 'prof_cpu' object. You can analyze it further with prof_cpu.key_averages().
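
`key_averages()` groups the recorded events by name, and its `table()` method renders them sorted by a chosen column. A minimal sketch on a tiny stand-in layer follows. The `torch.nn.Linear` module and the `tiny_forward` label are placeholders for illustration; the same calls should apply to the `prof_cpu` object.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Tiny stand-in workload so the example runs quickly.
layer = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("tiny_forward"):
        with torch.no_grad():
            layer(x)

# Aggregate events by name and show the most expensive entries first.
averages = prof.key_averages()
print(averages.table(sort_by="cpu_time_total", row_limit=10))

# Each aggregated row exposes its name through the `key` attribute.
event_names = [evt.key for evt in averages]
```

For the run in this notebook, `print(prof_cpu.key_averages().table(sort_by="cpu_time_total", row_limit=10))` would list the operators that dominate the generation loop.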


## Step 4: Profile on GPU (if available)

Now, let's do the same on a GPU and see the difference. This section will only run if a CUDA-enabled GPU is detected.

In [8]:
# Initialize GPU metrics to default values
gpu_e2e_latency_s_event = -1.0
gpu_model_load_mem_allocated_gb = -1.0
gpu_model_load_mem_reserved_gb = -1.0
gpu_tokens_per_second = 0
prof_gpu = None

if torch.cuda.is_available():
    print("--- Profiling on GPU ---")
    gpu_device = torch.device("cuda")
    print(f"CUDA device found: {torch.cuda.get_device_name(gpu_device)}")
else:
    print("\nCUDA not available. Skipping GPU profiling.")

--- Profiling on GPU ---
CUDA device found: Tesla T4


### 4.1. GPU Memory Usage

First, we move the model to the GPU and measure its memory footprint. On GPUs, we look at two metrics:
- **Memory Allocated:** The memory actively used by tensors.
- **Memory Reserved:** The total memory pool reserved by the PyTorch caching allocator. This is often larger than the allocated memory.

In [9]:
if torch.cuda.is_available():
    print("Moving model and inputs to GPU...")
    model.to(gpu_device)
    inputs_gpu = {k: v.to(gpu_device) for k, v in inputs.items()}

    gpu_model_load_mem_allocated_gb, gpu_model_load_mem_reserved_gb = get_memory_usage_gb(device_type="cuda")
    print(f"GPU Model Load Memory (Allocated): {gpu_model_load_mem_allocated_gb:.3f} GB")
    print(f"GPU Model Load Memory (Reserved by PyTorch): {gpu_model_load_mem_reserved_gb:.3f} GB")

Moving model and inputs to GPU...
GPU Model Load Memory (Allocated): 1.345 GB
GPU Model Load Memory (Reserved by PyTorch): 1.348 GB
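
These figures describe the model at rest. Generation also allocates activations and a key/value cache, so peak usage during inference is higher. One way to capture that is PyTorch's peak-memory counters, sketched here under the assumption that a callable wraps the work to measure (the helper name `peak_gpu_memory_gib` is hypothetical):

```python
import torch

def peak_gpu_memory_gib(fn, device_index=0):
    """Run fn() and return peak allocated GPU memory in GiB, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    torch.cuda.synchronize(device_index)
    torch.cuda.reset_peak_memory_stats(device_index)  # start a fresh peak measurement
    fn()
    torch.cuda.synchronize(device_index)  # make sure all queued GPU work has run
    return torch.cuda.max_memory_allocated(device_index) / (1024 ** 3)

# Placeholder workload; fn is only called when CUDA is available.
peak = peak_gpu_memory_gib(lambda: torch.ones(1024, 1024, device="cuda"))
print("Peak GPU memory (GiB):", peak)
```

In this notebook the callable could wrap the `run_inference` call, which would show how much memory generation needs on top of the weights.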


### 4.2. GPU Warm-up and Accurate Timing

**Warm-up:** The first GPU run pays one-time costs such as CUDA context creation, lazy loading of kernels and libraries like cuBLAS, and the caching allocator's initial memory requests. This adds significant overhead. We perform a "warm-up" run with a small number of tokens to get these costs out of the way, so that the actual benchmark reflects steady-state performance.

**Accurate Timing:** GPU operations are asynchronous: the CPU can queue a kernel and move on before the GPU has finished it. Reading `time.time()` without synchronizing first can therefore stop the clock too early. Instead, we use `torch.cuda.Event` to record timestamps directly on the GPU's execution stream, and call `torch.cuda.synchronize()` so that both events have completed before we read the elapsed time.

In [10]:
if torch.cuda.is_available():
    print("Performing GPU warm-up run...")
    try:
        # A small, quick run to absorb one-time CUDA initialization costs
        _ = run_inference(model, inputs_gpu, 5, tokenizer.pad_token_id)
        torch.cuda.synchronize() # Wait for the warm-up to finish
        print("Warm-up complete.")
    except Exception as e:
        print(f"Error during GPU warm-up: {e}")

Performing GPU warm-up run...


`eos_token_id` should consist of positive integers, but is tensor([-1], device='cuda:0'). Your generation will not stop until the maximum length is reached. Depending on other flags, it may even crash.


Warm-up complete.
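
A single timed run can be noisy, so a common pattern is to discard a few warm-up iterations and report the median of several timed ones. A generic sketch of that pattern follows. The helper `benchmark` and its parameters are illustrative, and on a GPU the timed callable would also need to call `torch.cuda.synchronize()` before returning.

```python
import statistics
import time

def benchmark(fn, warmup=2, repeats=5):
    """Call fn() `warmup` times untimed, then return the median of `repeats` timed calls in seconds."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Placeholder CPU workload standing in for a generation call.
median_s = benchmark(lambda: sum(i * i for i in range(100_000)))
print(f"Median latency: {median_s * 1000:.2f} ms")
```

The median is used rather than the mean so that one unusually slow iteration does not skew the result.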


### 4.3. Run GPU Profiler
Now we perform the actual profiled run on the GPU, measuring both CPU and CUDA activities.

In [11]:
if torch.cuda.is_available():
    print("Running inference on GPU and capturing profile...")
    
    # --- Timing with CUDA Events for accurate latency ---
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU, 
            torch.profiler.ProfilerActivity.CUDA
        ],
        record_shapes=True
    ) as prof_gpu:
        with torch.profiler.record_function("model_inference_gpu"):
            start_event.record()
            generated_outputs_gpu = run_inference(model, inputs_gpu, num_new_tokens_to_generate, tokenizer.pad_token_id)
            end_event.record()
            torch.cuda.synchronize() # Wait for all GPU work to complete
            
    # Calculate latency from events
    gpu_e2e_latency_s_event = start_event.elapsed_time(end_event) / 1000.0 # Convert ms to s

STAGE:2025-06-15 23:52:26 7687:7687 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
`eos_token_id` should consist of positive integers, but is tensor([-1], device='cuda:0'). Your generation will not stop until the maximum length is reached. Depending on other flags, it may even crash.


Running inference on GPU and capturing profile...


STAGE:2025-06-15 23:52:28 7687:7687 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2025-06-15 23:52:28 7687:7687 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


### 4.4. Analyze GPU Results
Finally, we calculate the throughput based on our accurate latency measurement.

In [12]:
if torch.cuda.is_available() and gpu_e2e_latency_s_event > 0:
    print(f"GPU End-to-End Latency (CUDA Event): {gpu_e2e_latency_s_event:.4f} seconds")
    # Throughput
    gpu_tokens_per_second = total_tokens_for_calc / gpu_e2e_latency_s_event
    print(f"GPU Single-Stream Throughput: {gpu_tokens_per_second:.2f} tokens/second")
    print("GPU Profiling data captured in 'prof_gpu' object. You can analyze it further with prof_gpu.key_averages().")
else:
    print("GPU metrics not calculated due to error or CUDA unavailability.")

GPU End-to-End Latency (CUDA Event): 1.3128 seconds
GPU Single-Stream Throughput: 38.09 tokens/second
GPU Profiling data captured in 'prof_gpu' object. You can analyze it further with prof_gpu.key_averages().


## Step 5: Final Comparison and Analysis

Let's put the results side-by-side to see the performance impact of using a GPU.

In [13]:
print("\n--- Summary of Baseline Metrics ---")

print(f"\n--- CPU Metrics ---")
print(f"Model Memory (Approx. Parameters): {cpu_model_load_mem_allocated_gb:.3f} GB")
print(f"E2E Latency:                     {cpu_e2e_latency_s:.4f} s")
print(f"Tokens/sec (single-stream):      {cpu_tokens_per_second:.2f} tokens/s")

if torch.cuda.is_available() and gpu_e2e_latency_s_event > 0:
    print(f"\n--- GPU Metrics ---")
    print(f"Model Memory (Allocated):        {gpu_model_load_mem_allocated_gb:.3f} GB")
    print(f"Model Memory (Reserved):         {gpu_model_load_mem_reserved_gb:.3f} GB")
    print(f"E2E Latency (CUDA Event):        {gpu_e2e_latency_s_event:.4f} s")
    print(f"Tokens/sec (single-stream):      {gpu_tokens_per_second:.2f} tokens/s")
    
    print(f"\n--- Performance Speedup ---")
    speedup_latency = cpu_e2e_latency_s / gpu_e2e_latency_s_event
    speedup_throughput = gpu_tokens_per_second / cpu_tokens_per_second
    print(f"Latency Speedup (CPU / GPU):     {speedup_latency:.2f}x")
    print(f"Throughput Speedup (GPU / CPU):  {speedup_throughput:.2f}x")


--- Summary of Baseline Metrics ---

--- CPU Metrics ---
Model Memory (Approx. Parameters): 1.322 GB
E2E Latency:                     18.6154 s
Tokens/sec (single-stream):      2.69 tokens/s

--- GPU Metrics ---
Model Memory (Allocated):        1.345 GB
Model Memory (Reserved):         1.348 GB
E2E Latency (CUDA Event):        1.3128 s
Tokens/sec (single-stream):      38.09 tokens/s

--- Performance Speedup ---
Latency Speedup (CPU / GPU):     14.18x
Throughput Speedup (GPU / CPU):  14.18x
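
The two speedup figures are identical by construction: throughput is the token count divided by latency, and both runs generate the same 50 tokens, so the ratio of throughputs equals the inverse ratio of latencies. A short check using the latencies reported above (the helper name `tokens_per_second` is illustrative):

```python
def tokens_per_second(num_new_tokens: int, latency_s: float) -> float:
    """Single-stream throughput: newly generated tokens divided by end-to-end latency."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return num_new_tokens / latency_s

# Values copied from the summary output above.
num_new_tokens = 50
cpu_latency_s = 18.6154
gpu_latency_s = 1.3128

cpu_tps = tokens_per_second(num_new_tokens, cpu_latency_s)
gpu_tps = tokens_per_second(num_new_tokens, gpu_latency_s)

latency_speedup = cpu_latency_s / gpu_latency_s
throughput_speedup = gpu_tps / cpu_tps

print(f"CPU: {cpu_tps:.2f} tokens/s, GPU: {gpu_tps:.2f} tokens/s")
print(f"Latency speedup: {latency_speedup:.2f}x, throughput speedup: {throughput_speedup:.2f}x")
```

The two ratios would only diverge if the runs generated different numbers of tokens, for example when generation stops early at an end-of-sequence token.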


## Step 6: Conclusion & Key Takeaways

In this demonstration, we successfully established a performance baseline for the `gpt2-medium` model.

**Key Takeaways:**
- **Profiling is Essential:** We now have concrete numbers for latency, throughput, and memory. This is the starting point for any optimization effort.
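- **GPUs Provide Massive Speedups:** As expected, the GPU (a Tesla T4 in this run) outperformed the CPU by a significant margin, roughly 14x on single-stream generation, which illustrates why GPUs are the usual choice for efficient LLM inference. Treat the exact ratio as specific to this setup: the CPU run had no warm-up and its timer wraps the whole profiler context, while the GPU run was warmed up and timed inside the profiler.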
- **Accurate Measurement Matters:** We saw the importance of using the right tools (`torch.cuda.Event`) and techniques (warm-up runs) to get reliable performance data, especially on asynchronous hardware like GPUs.
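- **Memory Footprint is Similar:** The model takes roughly the same space on both devices (1.322 GB of parameters on CPU vs. 1.345 GB allocated on GPU). The two figures are not measured identically: the CPU number counts parameters only, while the GPU number covers every tensor allocated on the device. GPU memory management is also more complex (allocated vs. reserved).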

With this baseline, we are now ready to explore various optimization techniques in future lessons to see how we can improve these metrics even further.