# Lesson 1: Profiling Your First LLM (GPT-2)

Welcome! In this lesson, we'll walk through the fundamental process of profiling a large language model (LLM). Profiling is the first and most critical step in understanding and optimizing model performance.

Our goals for this exercise are:
1.  Set up the environment with the necessary libraries.
2.  Load a pre-trained LLM (`gpt2`) from the Hugging Face Hub.
3.  Use `torch.profiler` to measure the model's performance on a **CPU**.
4.  Use `torch.profiler` to measure the model's performance on a **GPU**.
5.  Analyze and compare the results to identify computational bottlenecks.

## Step 2: Importing Necessary Libraries

With the packages installed, let's import the specific modules we'll need for this exercise. We'll be using `torch` for core tensor operations, `transformers` for our model, and `torch.profiler` for the performance analysis.

In [2]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch.profiler
import time

## Step 3: Loading the Language Model and Tokenizer

Now, let's load our model. We'll use **`gpt2`**, the smallest GPT-2 checkpoint (about 124 million parameters, a 548 MB download). It is well suited to a demonstration because it loads quickly.

-   **Model (`AutoModelForCausalLM`):** This is the neural network itself, the "brain" that generates text.
-   **Tokenizer (`AutoTokenizer`):** This utility translates human-readable text (strings) into the numerical IDs the model works with, and vice versa.

We also set a `pad_token`. GPT-2's tokenizer does not define one, so the code below reuses the end-of-sequence token. A pad token lets shorter inputs in a batch be padded to a common length, which batched tensor operations require.

In [3]:
# Define the model name
model_name = "gpt2"

print(f"Loading model and tokenizer for: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Add a padding token if the tokenizer doesn't have one.
# This is good practice for batching inputs.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

print("Model and tokenizer loaded successfully.")

Loading model and tokenizer for: gpt2...



Model and tokenizer loaded successfully.
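To make the role of the pad token concrete, here is a minimal pure-Python sketch of what padding a batch involves. The `pad_batch` helper and the token IDs are made up for illustration; only the pad ID 50256 (GPT-2's end-of-text token) reflects the real tokenizer. In practice the tokenizer does this for you when called with `padding=True`.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad token-ID lists to a common length and build attention masks.

    Hypothetical helper for illustration; not part of transformers.
    """
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)   # 0 = position the model should ignore
        real = [1] * len(seq)           # 1 = real token
        if side == "left":
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + real)
        else:
            input_ids.append(list(seq) + padding)
            attention_mask.append(real + mask_pad)
    return input_ids, attention_mask


# Made-up token IDs for two prompts of different lengths.
batch = [[464, 2003, 286], [15496]]
ids, mask = pad_batch(batch, pad_id=50256)
print(ids)   # [[464, 2003, 286], [50256, 50256, 15496]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

The sketch pads on the left by default because decoder-only models such as GPT-2 continue generating from the last position, so the real tokens should sit at the end of each row.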


## Step 4: Preparing the Input Prompt

Next, we'll prepare a sample prompt. The tokenizer converts our prompt string into a tensor of `input_ids` that the model can process.

In [4]:
# Prepare a sample prompt
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt")
num_new_tokens_to_generate = 50

print(f"Input prompt: '{prompt}'")
print(f"Task: Generate {num_new_tokens_to_generate} new tokens.")

Input prompt: 'The future of artificial intelligence is'
Task: Generate 50 new tokens.


## Step 5: Profiling Model Inference on CPU

It's time for our first performance test. We will profile the model's text generation process on the CPU.

Here's the plan:
1.  Ensure the model and inputs are on the CPU.
2.  Define a simple inference function `run_cpu_inference`.
3.  Use `torch.no_grad()` to disable gradient calculations, which are unnecessary for inference and would add overhead.
4.  Wrap the function call in the `torch.profiler.profile` context manager to capture performance data.
5.  Print the results, sorted by the operations that took the most time.
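The same pattern can be tried on a toy workload before pointing it at the full model. The sketch below profiles three explicit `torch.addmm` calls on small random tensors (the sizes are arbitrary choices, not taken from GPT-2) and reads the call count back from `key_averages()`.

```python
import torch
from torch.profiler import profile, ProfilerActivity

torch.manual_seed(0)
x = torch.randn(4, 8)        # 4 input vectors of width 8 (toy sizes)
weight = torch.randn(8, 16)  # projection matrix
bias = torch.randn(16)

# Capture CPU activity for everything inside the context manager.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        # addmm(bias, x, weight) computes bias + x @ weight in one call.
        fused = torch.addmm(bias, x, weight)

# Check the fused result against the two-step computation.
reference = bias + x @ weight
same = torch.allclose(fused, reference, atol=1e-4)

# key_averages() groups recorded events by operator name.
addmm_rows = [e for e in prof.key_averages() if e.key == "aten::addmm"]
addmm_calls = addmm_rows[0].count
print(same, addmm_calls)
```

Each row returned by `key_averages()` also exposes fields such as `self_cpu_time_total`, which is what the `sort_by` argument of `.table()` orders on.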

In [5]:
print("--- Profiling on CPU ---")
model.to("cpu")
inputs_cpu = {k: v.to("cpu") for k, v in inputs.items()}

def run_cpu_inference(input_data, max_tokens):
    # Use torch.no_grad() to disable gradients for efficiency
    with torch.no_grad():
        model.generate(
            input_data["input_ids"],
            attention_mask=input_data.get("attention_mask"),
            max_new_tokens=max_tokens,
            pad_token_id=tokenizer.pad_token_id
        )

print("Running inference on CPU and capturing profile...")
start_time_cpu = time.time()

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=False, # Set to True if shape info is needed, but adds overhead
    profile_memory=False # Set to True for memory profiling, but adds significant overhead
) as prof_cpu:
    with torch.profiler.record_function("model_inference_cpu"):
        run_cpu_inference(inputs_cpu, num_new_tokens_to_generate)

end_time_cpu = time.time()
print(f"CPU Wall clock time: {end_time_cpu - start_time_cpu:.4f} seconds\n")

print("CPU Profiler Analysis (Top 5 Operators by Self CPU Time):")
print(prof_cpu.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))

--- Profiling on CPU ---
Running inference on CPU and capturing profile...


STAGE:2025-06-16 00:02:30 17122:17122 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2025-06-16 00:02:32 17122:17122 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2025-06-16 00:02:32 17122:17122 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


CPU Wall clock time: 5.8407 seconds

CPU Profiler Analysis (Top 5 Operators by Self CPU Time):
---------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls  
---------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                  aten::addmm        42.91%     726.872ms        43.87%     743.046ms     309.603us          2400  
                          model_inference_cpu        23.05%     390.360ms       100.00%        1.694s        1.694s             1  
                                     aten::mm        19.79%     335.248ms        19.79%     335.248ms       6.705ms            50  
                                    aten::cat         2.65%      44.967ms         2.85%      48.254ms      37.817

### Analyzing the CPU Profile

After running the cell above, look at the output table.
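-   **Wall Clock Time:** This is the elapsed real time for the whole profiled block. The timer starts before the profiler context is entered and stops after it exits, so the reading includes profiler startup and post-processing. That is why the 5.84 s reading is much larger than the 1.694 s the profiler attributes to `model_inference_cpu`.
-   **`aten::addmm`:** You'll likely see this operation at or near the top. It computes a matrix product and adds a bias in a single call (`bias + x @ W`), which is how the linear projections in each Transformer block are evaluated. In the run above it accounts for 42.91% of self CPU time across 2400 calls.
-   **`aten::mm`:** A plain matrix multiplication with no bias term. It is called 50 times here, matching the 50 generated tokens, and each call averages 6.705 ms, far longer than a single `addmm` call (about 0.31 ms).
-   **`aten::_scaled_dot_product_flash_attention`:** If an operator with a name like this appears in your table, PyTorch has dispatched attention to a fused scaled-dot-product kernel instead of separate matrix-multiply and softmax operations. Whether it appears, and under what exact name, depends on your PyTorch and `transformers` versions.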
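The call counts in the table can be cross-checked with a little arithmetic. The sketch below assumes the standard GPT-2 small layout of 12 Transformer blocks, each with four biased linear projections (attention input, attention output, and two MLP layers), and one forward pass per generated token. Those layer counts are an assumption; they are not shown in the profiler output.

```python
# Values copied from the CPU profiler table above.
addmm_calls = 2400
addmm_cpu_total_ms = 743.046
mm_calls = 50
new_tokens = 50

# Assumed GPT-2 small layout (not shown in the profiler output).
num_blocks = 12
biased_projections_per_block = 4

# One forward pass per generated token.
expected_addmm_calls = num_blocks * biased_projections_per_block * new_tokens
print(expected_addmm_calls == addmm_calls)  # True: 12 * 4 * 50 = 2400

# "CPU time avg" is the total time divided by the number of calls.
addmm_avg_us = addmm_cpu_total_ms * 1000 / addmm_calls
print(f"{addmm_avg_us:.1f} us per addmm call")  # 309.6 us per addmm call

# aten::mm runs once per token, consistent with a single bias-free
# projection per forward pass (most likely the output layer over the vocabulary).
print(mm_calls // new_tokens)  # 1
```

Under these assumptions, `addmm` dominates the table because of its call count, while the much slower individual `mm` calls are what one would expect from a single very wide projection.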

## Step 6: Profiling Model Inference on GPU

Now, let's see how much a GPU can speed things up. The process is similar, but with a few key differences for GPU profiling:

1.  **Check for GPU:** We first check if `torch.cuda.is_available()`.
2.  **Move to GPU:** We move the model and input tensors to the `"cuda"` device.
3.  **Warm-up Run:** The first time you run an operation on a GPU, it has to perform setup tasks (like loading CUDA kernels) that take extra time. We'll do a "warm-up" run first so these one-time costs don't pollute our actual measurement.
4.  **Synchronize:** GPU operations are asynchronous. The CPU gives a command to the GPU and immediately moves on. To get accurate timing, we must use `torch.cuda.synchronize()` to make the CPU wait for the GPU to finish its work.
5.  **Profile:** We include `ProfilerActivity.CUDA` in the profiler's activities list.
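Points 3 and 4 can be folded into a small timing helper. This is a sketch, not part of PyTorch: `time_call` is a hypothetical name, and its `sync` argument is where `torch.cuda.synchronize` would be passed when timing GPU work.

```python
import time


def time_call(fn, warmup=1, repeats=3, sync=None):
    """Return the best wall-clock time (seconds) of `fn` over `repeats` runs.

    `sync` is an optional zero-argument callable invoked just before the
    timer starts and just before it stops, so queued asynchronous work from
    the timed call is included and earlier work is excluded.
    """
    for _ in range(warmup):          # one-time setup costs land here
        fn()
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()                   # drain earlier work before starting
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                   # wait for this run's work to finish
        timings.append(time.perf_counter() - start)
    return min(timings)


# CPU-only demonstration with a stand-in workload.
calls = []
best = time_call(lambda: calls.append(sum(range(10_000))), warmup=2, repeats=3)
print(len(calls))  # 5: two warm-up runs plus three timed runs
```

On a GPU, the same helper could be called with `sync=torch.cuda.synchronize` and a function that runs the model, such as the `run_gpu_inference` function defined in the next cell.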

In [6]:
# --- GPU Profiling only if CUDA is available ---
if torch.cuda.is_available():
    print("--- Profiling on GPU ---")
    device = "cuda"
    model.to(device)
    inputs_gpu = {k: v.to(device) for k, v in inputs.items()}

    def run_gpu_inference(input_data, max_tokens):
        with torch.no_grad():
            # Synchronize before starting to get an accurate start time
            torch.cuda.synchronize()
            outputs = model.generate(
                input_data["input_ids"],
                attention_mask=input_data.get("attention_mask"),
                max_new_tokens=max_tokens,
                pad_token_id=tokenizer.pad_token_id
            )
            # Synchronize again to wait for the generation to finish
            torch.cuda.synchronize()

    # Perform a warm-up run to load CUDA kernels, etc.
    print("Performing GPU warm-up run...")
    run_gpu_inference(inputs_gpu, num_new_tokens_to_generate)

    print("Running inference on GPU and capturing profile...")
    start_time_gpu = time.time()

    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        record_shapes=False,
        profile_memory=False
    ) as prof_gpu:
        with torch.profiler.record_function("model_inference_gpu"):
            run_gpu_inference(inputs_gpu, num_new_tokens_to_generate)

    end_time_gpu = time.time()
    print(f"GPU Wall clock time: {end_time_gpu - start_time_gpu:.4f} seconds\n")

    print("GPU Profiler Analysis (Top 5 Operators by Self CUDA Time):")
    print(prof_gpu.key_averages().table(sort_by="self_cuda_time_total", row_limit=5))

else:
    print("\nCUDA not available on this system, skipping GPU profiling.")

--- Profiling on GPU ---
Performing GPU warm-up run...
Running inference on GPU and capturing profile...


STAGE:2025-06-16 00:02:40 17122:17122 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
STAGE:2025-06-16 00:02:41 17122:17122 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2025-06-16 00:02:41 17122:17122 ActivityProfilerController.cpp:324] Completed Stage: Post Processing


GPU Wall clock time: 4.7810 seconds

GPU Profiler Analysis (Top 5 Operators by Self CUDA Time):
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                            aten::addmm         9.55%      62.230ms        12.36%      80.494ms      33.539us      71.839ms        56.47%      73.135ms      30.473us          2400  
std::enable_if<!(false), void>::type internal::gemvx...         0.00%       0.000us         0.00%       0.000us

### Analyzing the GPU Profile

Now, examine the GPU profiler output.
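-   **Wall Clock Time:** Compare this to the CPU time. In the run recorded above it drops from 5.84 s to 4.78 s, roughly a 1.2x speedup. Both readings include profiler overhead, and generating 50 tokens from a small model with a single prompt typically gives the GPU little work per step, so the gap is modest here.
-   **Top Operators:** The table now mixes `aten::` operators (`aten::addmm` is still at the top) with the low-level CUDA kernels they launch. Kernel names containing `gemv` or `gemm` (General Matrix-Vector/Matrix-Matrix multiplication) are common. These are the highly optimized functions that run on the GPU hardware.
-   **`Self CUDA %`:** This column is now the most important one. It gives the share of total GPU kernel time spent in that row; `aten::addmm` accounts for 56.47% in the run above.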
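Because each percentage is a row's share of a column total, the totals can be recovered from any single row. The sketch below does this for the `aten::addmm` row of the GPU table and compares the result with the 1.694 s the CPU run attributed to `model_inference_cpu`. The figures are approximate, since the table's percentages are rounded, and the helper name `total_from_share` is made up.

```python
def total_from_share(value_ms, percent):
    """Recover a column total from one row's value and its percentage."""
    return value_ms / (percent / 100.0)


# aten::addmm row of the GPU table above.
gpu_kernel_total_ms = total_from_share(71.839, 56.47)   # Self CUDA column
gpu_run_cpu_total_ms = total_from_share(62.230, 9.55)   # Self CPU column

# CPU total of model_inference_cpu in the CPU table.
cpu_run_total_ms = 1694.0

print(f"GPU kernel time:     ~{gpu_kernel_total_ms:.0f} ms")
print(f"GPU run, host time:  ~{gpu_run_cpu_total_ms:.0f} ms")
print(f"Profiled-time ratio: ~{cpu_run_total_ms / gpu_run_cpu_total_ms:.1f}x")
print(f"Wall-clock ratio:    ~{5.8407 / 4.7810:.1f}x")
```

This works out to roughly 127 ms of GPU kernel time inside about 650 ms of profiled host time. That suggests the GPU was busy for only a fraction of the run, which would be consistent with per-step overhead dominating at this scale, though a single profile is not enough to be sure.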

## Step 7: Conclusion and Key Takeaways

Congratulations! You've successfully profiled an LLM on both CPU and GPU. Let's summarize what we've learned.

**1. Performance Comparison:**
- Wall-clock time for the profiled generation fell from 5.84 s on the CPU to 4.78 s on the GPU. The gain is modest at this scale (a small model, one prompt, 50 tokens, with profiler overhead included in both timings), but it illustrates the benefit of parallel hardware for deep learning workloads.

**2. Identifying Bottlenecks:**
- On both CPU and GPU, the profiling results pointed to the same root cause of computational load: **matrix multiplications** (`addmm`, `mm` on CPU; `gemm`/`gemv` kernels on GPU). This is the fundamental building block of Transformer models, and it's where most of the time is spent.

**3. The Power of Profiling:**
- This exercise demonstrates that before you can optimize anything, you must first measure it. The PyTorch Profiler is an indispensable tool that gives you a detailed view of where your program is spending its time.

**4. Next Steps:**
- Now that we can identify bottlenecks, we can start exploring ways to fix them. Future lessons will cover techniques like quantization, using optimized kernels like Flash Attention, and other strategies to speed up these expensive matrix multiplication operations.