  # Chapter 2: Generating Text with a Pre-Trained LLM

  ## Learning Objectives
  - Understand how tokenizers convert text to numbers and back
  - Load and use a pre-trained language model (Qwen3-0.6B)
  - Implement basic text generation with greedy decoding
  - Build streaming text generation for real-time output
  - Optimize inference speed using KV (Key-Value) cache
  - Accelerate models using PyTorch compilation (torch.compile)
  - Compare performance between standard, cached, and compiled generation

  ## Implementation Notes
  - Implementation details for both the LLM architecture and optimization techniques (KV cache, torch.compile) are 
  abstracted in the book
  - Focus remains on understanding text generation fundamentals and performance improvement measurements

In [9]:
import sys
import os

# Find repo root
def find_repo_root(marker_file="requirements.txt"):
    prev, curr = None, os.path.abspath(os.getcwd())
    while prev != curr:
        if os.path.exists(os.path.join(curr, marker_file)):
            return curr
        prev, curr = curr, os.path.dirname(curr)
    return None

repo_root = find_repo_root()
print(f"Repo root: {repo_root}")

# Check if reasoning_from_scratch folder exists
if repo_root:
    rfs_path = os.path.join(repo_root, "reasoning_from_scratch")
    print(f"\nLooking for: {rfs_path}")
    print(f"Exists: {os.path.exists(rfs_path)}")
    
    if os.path.exists(rfs_path):
        print(f"\nContents:")
        for item in os.listdir(rfs_path):
            print(f"  {item}")

# Download only the tokenizer for the base Qwen3-small model
  # - kind="base": Downloads the base model (not fine-tuned)
  # - tokenizer_only=True: Only downloads tokenizer files, not model weights
  # - out_dir="qwen3": Saves tokenizer files to ./qwen3/ directory
  
from reasoning_from_scratch.qwen3 import download_qwen3_small
download_qwen3_small(kind="base", tokenizer_only=True, out_dir="qwen3")

Repo root: /Users/types/Documents/reasoning-from-scratch

Looking for: /Users/types/Documents/reasoning-from-scratch/reasoning_from_scratch
Exists: True

Contents:
  appendix_c.py
  ch06.py
  ch02.py
  appendix_f.py
  ch03.py
  qwen3_optimized.py
  ch02_ex.py
  __init__.py
  __pycache__
  utils.py
  qwen3_batched.py
  qwen3.py
  ch04.py
  ch05.py


In [10]:
#Load the tokenizer settings from the tokenizer file into the Qwen3Tokenizer
from pathlib import Path
from reasoning_from_scratch.qwen3 import Qwen3Tokenizer
 
tokenizer_path = Path("qwen3") / "tokenizer-base.json"
tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)

In [11]:
prompt = "Explain large language models."
input_token_ids_list = tokenizer.encode(prompt)
print(f"The input tokens are: {input_token_ids_list}")

The input tokens are: [840, 20772, 3460, 4128, 4119, 13]


In [12]:
text = tokenizer.decode(input_token_ids_list)
print(f"The decoded text is: {text}")

The decoded text is: Explain large language models.


In [13]:
for i in input_token_ids_list:
    print(f"{i} ---> {tokenizer.decode([i])}")

840 ---> Ex
20772 ---> plain
3460 --->  large
4128 --->  language
4119 --->  models
13 ---> .


In [14]:
def get_device(enable_tensor_cores=True):
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using NVIDIA CUDA GPU")
        
        if enable_tensor_cores:
            major, minor = map(int, torch.__version__.split(".")[:2])
            if (major, minor) >= (2, 9):
                torch.backends.cuda.matmul.fp32_precision = "tf32"
                torch.backends.cudnn.conv.fp32_precision = "tf32"
            else:
                torch.backends.cuda.matmul.allow_tf32 = True
                torch.backends.cudnn.allow_tf32 = True
 
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print("Using Apple Silicon GPU (MPS)")
 
    elif torch.xpu.is_available():
        device = torch.device("xpu")
        print("Using Intel GPU")
 
    else:
        device = torch.device("cpu")
        print("Using CPU")
 
    return device  
    


In [15]:
device = get_device()

Using Apple Silicon GPU (MPS)


In [16]:
#device = torch.device("cpu")

In [17]:
#Download the base Qwen3 0.6B weights
download_qwen3_small(kind="base", tokenizer_only=False, out_dir="qwen3")

✓ qwen3/qwen3-0.6B-base.pth already up-to-date


In [18]:
from reasoning_from_scratch.qwen3 import Qwen3Model, QWEN_CONFIG_06_B
 
model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
model = Qwen3Model(QWEN_CONFIG_06_B)  #A
model.load_state_dict(torch.load(model_path))  #B
model.to(device)  #C
#A Instantiate a Qwen3 model with random weights as placeholders
#B Load the pre-trained weights into the model
#C Transfer the model to the designated device (e.g., "cuda")

Qwen3Model(
  (tok_emb): Embedding(151936, 1024)
  (trf_blocks): ModuleList(
    (0-27): 28 x TransformerBlock(
      (att): GroupedQueryAttention(
        (W_query): Linear(in_features=1024, out_features=2048, bias=False)
        (W_key): Linear(in_features=1024, out_features=1024, bias=False)
        (W_value): Linear(in_features=1024, out_features=1024, bias=False)
        (out_proj): Linear(in_features=2048, out_features=1024, bias=False)
        (q_norm): RMSNorm()
        (k_norm): RMSNorm()
      )
      (ff): FeedForward(
        (fc1): Linear(in_features=1024, out_features=3072, bias=False)
        (fc2): Linear(in_features=1024, out_features=3072, bias=False)
        (fc3): Linear(in_features=3072, out_features=1024, bias=False)
      )
      (norm1): RMSNorm()
      (norm2): RMSNorm()
    )
  )
  (final_norm): RMSNorm()
  (out_head): Linear(in_features=1024, out_features=151936, bias=False)
)

In [19]:
prompt = "Explain large language models."
input_token_ids_list = tokenizer.encode(prompt)
print(f"Number of input tokens: {len(input_token_ids_list)}")

input_tensor = torch.tensor(input_token_ids_list)  
input_tensor_fmt = input_tensor.unsqueeze(0) 
input_tensor_fmt = input_tensor_fmt.to(device)

output_tensor = model(input_tensor_fmt)  
output_tensor_fmt = output_tensor.squeeze(0) 
print(output_tensor_fmt[:])
print(f"Formatted Output tensor shape: {output_tensor_fmt.shape}")
print(tokenizer.decode([output_tensor_fmt[-1].argmax().item()]))

Number of input tokens: 6
tensor([[ 7.4062, 11.4375,  9.2500,  ...,  3.7188,  3.7188,  3.7188],
        [ 9.3750, 10.5625,  7.2500,  ...,  3.2344,  3.2344,  3.2344],
        [10.8750, 10.0000,  7.5938,  ...,  0.1992,  0.1992,  0.1992],
        [ 7.1250,  9.2500,  6.2812,  ..., -1.9844, -1.9844, -1.9844],
        [11.5000, 13.6250, 10.2500,  ...,  1.0000,  1.0000,  1.0000],
        [ 7.3438,  2.0312,  7.9375,  ..., -2.5156, -2.5156, -2.5156]],
       device='mps:0', dtype=torch.bfloat16, grad_fn=<SliceBackward0>)
Formatted Output tensor shape: torch.Size([6, 151936])
 Large


In [20]:
@torch.inference_mode()                                        #A
def generate_text_basic(
    model,
    token_ids,
    max_new_tokens,
    eos_token_id=tokenizer.eos_token_id 
):
    input_length = token_ids.shape[1]
    model.eval()                                               #B
 
    for _ in range(max_new_tokens):
        out = model(token_ids)[:, -1]                          #C
        next_token = torch.argmax(out, dim=-1, keepdim=True)
 
        if (eos_token_id is not None                           #D
                and next_token.item() == eos_token_id):
            break
 
        token_ids = torch.cat(                                 #E
            [token_ids, next_token], dim=1)                    #E
    return token_ids[:, input_length:]                         #F

#A Disable gradient tracking for speed and memory efficiency
#B Switch model to evaluation mode to enable deterministic behavior (best practice)
#C Get the scores of the last token
#D Stop if all sequences in the batch have generated EOS
#E Append the newly predicted token to the sequence
#F Return only the generated tokens (excluding the original input)”



In [21]:
Prompt = "Explain large language models in a single sentence."
input_token_ids_tensor = torch.tensor(
    tokenizer.encode(prompt),
    device=device                            #A
    ).unsqueeze(0)
 
max_new_tokens = 10                    #B
output_token_ids_tensor = generate_text_basic(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
)
output_text = tokenizer.decode(
    output_token_ids_tensor.squeeze(0).tolist()     #C
)
print(output_text) 
    
     #A Transfer the input token IDs onto the same device (CPU, GPU) where the model is located
     #B Let the model generate up to 100 new tokens
     #C Convert output token 
     #IDs from PyTorch tens”



 Large language models are a class of artificial intelligence (


In [22]:
#Streaming token generation
@torch.inference_mode()                                        #A
def generate_text_basic(
    model,
    token_ids,
    max_new_tokens,
    eos_token_id=tokenizer.eos_token_id 
):
    input_length = token_ids.shape[1]
    model.eval()                                               
 
    for _ in range(max_new_tokens):
        out = model(token_ids)[:, -1]                          
        next_token = torch.argmax(out, dim=-1, keepdim=True)
        if (eos_token_id is not None                           
                and next_token.item() == eos_token_id):
            break

        yield next_token

        token_ids = torch.cat(                                 
            [token_ids, next_token], dim=1)                         

Prompt = "Explain large language models in a single sentence."
input_token_ids_tensor = torch.tensor(
    tokenizer.encode(prompt),
    device=device                            
    ).unsqueeze(0)
 
max_new_tokens = 50                     
for token in generate_text_basic(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
):
    token_id = token.squeeze(0).tolist()     
    print(
        tokenizer.decode(token_id), 
        end="", 
        flush=True) 
    



 Large language models are a class of artificial intelligence (AI) models that are trained on vast amounts of text data to understand and generate human-like language. These models are designed to process and analyze large volumes of text, enabling them to perform a wide range

In [23]:
# Non-streaming version for generate_stats
@torch.inference_mode()
def generate_text_basic(
    model,
    token_ids,
    max_new_tokens,
    eos_token_id=None
):
    input_length = token_ids.shape[1]
    model.eval()

    for _ in range(max_new_tokens):
        out = model(token_ids)[:, -1]
        next_token = torch.argmax(out, dim=-1, keepdim=True)

        # Stop if all sequences in the batch have generated EOS
        if (eos_token_id is not None
                and next_token.item() == eos_token_id):
            break

        token_ids = torch.cat([token_ids, next_token], dim=1)
    return token_ids[:, input_length:]

prompt = "Explain large language models in a single sentence."
input_token_ids_tensor = torch.tensor(
    tokenizer.encode(prompt),
    device=device
    ).unsqueeze(0)

max_new_tokens = 100
output_token_ids_tensor = generate_text_basic(
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
)
output_text = tokenizer.decode(
    output_token_ids_tensor.squeeze(0).tolist()
)
print(output_text)

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays, and even creating creative content.<|endoftext|>Human language is a complex and dynamic system that has evolved over millions of years to enable effective communication and social interaction. It is composed of a vast array of symbols, including letters, numbers, and words, which are used to convey meaning and express thoughts and ideas. The structure of human language


In [27]:
def generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time):
    """Print timing statistics and decoded output."""
    elapsed_time = end_time - start_time
    num_tokens = output_token_ids_tensor.numel()
    tokens_per_sec = num_tokens / elapsed_time
    
    print(f"Time: {elapsed_time:.2f} sec")
    print(f"{tokens_per_sec:.0f} tokens/sec")
    
    output_text = tokenizer.decode(output_token_ids_tensor.squeeze(0).tolist())
    print(f"\n{output_text}")

import time

start_time = time.time()
output_token_ids_tensor = generate_text_basic(  # Fixed!
    model=model,
    token_ids=input_token_ids_tensor,
    max_new_tokens=max_new_tokens,
    eos_token_id=tokenizer.eos_token_id
)
end_time = time.time()
generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

Time: 2.99 sec
14 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays, and even creating creative content.


In [28]:
from reasoning_from_scratch.qwen3 import KVCache
 
@torch.inference_mode()
def generate_text_basic_cache(
    model,
    token_ids,
    max_new_tokens,
    eos_token_id=None
):
 
    input_length = token_ids.shape[1]
    model.eval()
    cache = KVCache(n_layers=model.cfg["n_layers"])         #A
    model.reset_kv_cache()
    out = model(token_ids, cache=cache)[:, -1]    #B
 
    for _ in range(max_new_tokens):
        next_token = torch.argmax(out, dim=-1, keepdim=True)
 
        if (eos_token_id is not None
                and next_token.item() == eos_token_id):
            break
 
        token_ids = torch.cat([token_ids, next_token], dim=1)
        out = model(next_token, cache=cache)[:, -1]         #C
 
    return token_ids[:, input_length:]
  
    
     #A Initialize the KV cache
     #B In the first round, the whole input is provided to the model as before
     #C Consequent iterations only feed the next_token to the input

In [31]:
import time

start_time = time.time()
output_token_ids_tensor = generate_text_basic(  
model=model,
token_ids=input_token_ids_tensor,
max_new_tokens=max_new_tokens,
eos_token_id=tokenizer.eos_token_id
)
end_time = time.time()
generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)

Time: 3.19 sec
13 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays, and even creating creative content.


In [32]:
major, minor = map(int, torch.__version__.split(".")[:2])
if (major, minor) >= (2, 8):
    # This avoids retriggering model recompilations 
    # in PyTorch 2.8 and newer
    # if the model contains code like self.pos = self.pos + 1
    torch._dynamo.config.allow_unspec_int_on_nn_module = True
 
model_compiled = torch.compile(model)

In [33]:
for i in range(3): #A
    start_time = time.time()
    output_token_ids_tensor = generate_text_basic(
        model=model_compiled,
        token_ids=input_token_ids_tensor,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()
 
    if i == 0:  #B
        print("Warm-up run")  #B
    else:
        print(f"Timed run {i}:")
    generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)
 
    print(f"\n{30*'-'}\n") 
    
     #A We run the token generation three times
     #B The first run is labeled as "Warm-up run

Warm-up run
Time: 9.61 sec
4 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.

------------------------------

Timed run 1:
Time: 1.38 sec
25 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.

------------------------------

Timed run 2:
Time: 1.43 sec
24 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.

------------------------------



In [None]:
for i in range(3): #A
    start_time = time.time()
    output_token_ids_tensor = generate_text_basic(
        model=model_compiled,
        token_ids=input_token_ids_tensor,
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id
    )
    end_time = time.time()
 
    if i == 0:  #B
        print("Warm-up run")  #B
    else:
        print(f"Timed run {i}:")
    generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)
 
    print(f"\n{30*'-'}\n") 
    
     #A We run the token generation three times
     #B The first run is labeled as "Warm-up run

Warm-up run
Time: 1.01 sec
34 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.

------------------------------

Timed run 1:
Time: 1.02 sec
34 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.

------------------------------

Timed run 2:
Time: 1.04 sec
33 tokens/sec

 Large language models are artificial intelligence systems that can understand, generate, and process human language, enabling them to perform a wide range of tasks, from answering questions to writing essays.

------------------------------



In [1]:


# ============================================
# COMPLETE BENCHMARK SCRIPT
# Measures tokens/second for:
# 1. CPU with cache and compiled
# 2. GPU with cache and compiled
# ============================================

import time
import torch
from pathlib import Path

from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer,
    Qwen3Model,
    QWEN_CONFIG_06_B,
    KVCache
)


# ============================================
# HELPER FUNCTIONS
# ============================================

def generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time):
    """Print timing statistics and decoded output."""
    elapsed_time = end_time - start_time
    num_tokens = output_token_ids_tensor.numel()
    tokens_per_sec = num_tokens / elapsed_time
    
    print(f"Time: {elapsed_time:.2f} sec")
    print(f"{tokens_per_sec:.0f} tokens/sec")
    
    output_text = tokenizer.decode(output_token_ids_tensor.squeeze(0).tolist())
    print(f"\n{output_text}")


@torch.inference_mode()
def generate_text_basic(model, token_ids, max_new_tokens, eos_token_id=None):
    """Basic text generation without cache."""
    input_length = token_ids.shape[1]
    model.eval()

    for _ in range(max_new_tokens):
        out = model(token_ids)[:, -1]
        next_token = torch.argmax(out, dim=-1, keepdim=True)

        if eos_token_id is not None and next_token.item() == eos_token_id:
            break

        token_ids = torch.cat([token_ids, next_token], dim=1)
    
    return token_ids[:, input_length:]


@torch.inference_mode()
def generate_text_basic_cache(model, token_ids, max_new_tokens, eos_token_id=None):
    """Text generation with KV cache for faster inference."""
    input_length = token_ids.shape[1]
    model.eval()
    cache = KVCache(n_layers=model.cfg["n_layers"])
    model.reset_kv_cache()
    out = model(token_ids, cache=cache)[:, -1]

    for _ in range(max_new_tokens):
        next_token = torch.argmax(out, dim=-1, keepdim=True)

        if eos_token_id is not None and next_token.item() == eos_token_id:
            break

        token_ids = torch.cat([token_ids, next_token], dim=1)
        out = model(next_token, cache=cache)[:, -1]

    return token_ids[:, input_length:]


# ============================================
# BENCHMARK FUNCTION
# ============================================

def run_benchmark(device_type="cpu", use_cache=True, use_compile=True, num_runs=3, max_new_tokens=50):
    """
    Run token generation benchmark.
    
    Args:
        device_type: "cpu" or "gpu" (auto-detects CUDA/MPS/XPU)
        use_cache: Whether to use KV cache
        use_compile: Whether to use torch.compile
        num_runs: Number of benchmark runs (first is warmup)
        max_new_tokens: Maximum tokens to generate
    """
    
    print("=" * 60)
    print(f"BENCHMARK: {device_type.upper()}")
    print(f"Cache: {use_cache} | Compile: {use_compile}")
    print("=" * 60)
    
    # ============================================
    # SET DEVICE
    # ============================================
    if device_type == "cpu":
        device = torch.device("cpu")
        print("Using CPU")
    else:
        # Auto-detect GPU
        if torch.cuda.is_available():
            device = torch.device("cuda")
            print("Using NVIDIA CUDA GPU")
            # Enable TF32 for faster computation on Ampere+ GPUs
            torch.backends.cuda.matmul.allow_tf32 = True
            torch.backends.cudnn.allow_tf32 = True
        elif torch.backends.mps.is_available():
            device = torch.device("mps")
            print("Using Apple Silicon GPU (MPS)")
        elif hasattr(torch, 'xpu') and torch.xpu.is_available():
            device = torch.device("xpu")
            print("Using Intel GPU")
        else:
            device = torch.device("cpu")
            print("No GPU found, falling back to CPU")
    
    # ============================================
    # LOAD MODEL AND TOKENIZER
    # ============================================
    print("\nLoading model and tokenizer...")
    
    download_qwen3_small(kind="base", tokenizer_only=False, out_dir="qwen3")
    
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
    
    model_path = Path("qwen3") / "qwen3-0.6B-base.pth"
    model = Qwen3Model(QWEN_CONFIG_06_B)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.to(device)
    
    # ============================================
    # COMPILE MODEL (if requested)
    # ============================================
    if use_compile:
        print("Compiling model with torch.compile...")
        major, minor = map(int, torch.__version__.split(".")[:2])
        if (major, minor) >= (2, 8):
            torch._dynamo.config.allow_unspec_int_on_nn_module = True
        model = torch.compile(model)
    
    # ============================================
    # PREPARE INPUT
    # ============================================
    prompt = "Explain large language models."
    input_token_ids_tensor = torch.tensor(
        tokenizer.encode(prompt),
        device=device
    ).unsqueeze(0)
    
    print(f"\nPrompt: '{prompt}'")
    print(f"Max new tokens: {max_new_tokens}")
    print(f"Number of runs: {num_runs} (first is warmup)\n")
    
    # ============================================
    # SELECT GENERATION FUNCTION
    # ============================================
    if use_cache:
        generate_fn = generate_text_basic_cache
    else:
        generate_fn = generate_text_basic
    
    # ============================================
    # RUN BENCHMARK
    # ============================================
    for i in range(num_runs):
        # Reset cache if using cached generation
        if use_cache:
            model.reset_kv_cache() if hasattr(model, 'reset_kv_cache') else None
        
        start_time = time.time()
        output_token_ids_tensor = generate_fn(
            model=model,
            token_ids=input_token_ids_tensor.clone(),
            max_new_tokens=max_new_tokens,
            eos_token_id=tokenizer.eos_token_id
        )
        end_time = time.time()
        
        if i == 0:
            print("Warm-up run")
        else:
            print(f"Timed run {i}:")
        
        generate_stats(output_token_ids_tensor, tokenizer, start_time, end_time)
        print(f"\n{'-' * 30}\n")
    
    return model, tokenizer


# ============================================
# CPU BENCHMARK: WITH CACHE AND COMPILED
# ============================================

def benchmark_cpu_cache_compiled(num_runs=3, max_new_tokens=50):
    """Run CPU benchmark with KV cache and torch.compile."""
    return run_benchmark(
        device_type="cpu",
        use_cache=True,
        use_compile=True,
        num_runs=num_runs,
        max_new_tokens=max_new_tokens
    )


# ============================================
# GPU BENCHMARK: WITH CACHE AND COMPILED
# ============================================

def benchmark_gpu_cache_compiled(num_runs=3, max_new_tokens=50):
    """Run GPU benchmark with KV cache and torch.compile."""
    return run_benchmark( 
        device_type="gpu",
        use_cache=True,
        use_compile=True,
        num_runs=num_runs,
        max_new_tokens=max_new_tokens
    )


# ============================================
# MAIN: RUN BOTH BENCHMARKS
# ============================================

if __name__ == "__main__":
    print("\n" + "=" * 60)
    print("RUNNING CPU BENCHMARK (Cache + Compiled)")
    print("=" * 60 + "\n")
    benchmark_cpu_cache_compiled(num_runs=3, max_new_tokens=50)
    
    print("\n" + "=" * 60)
    print("RUNNING GPU BENCHMARK (Cache + Compiled)")
    print("=" * 60 + "\n")
    benchmark_gpu_cache_compiled(num_runs=3, max_new_tokens=50)


RUNNING CPU BENCHMARK (Cache + Compiled)

BENCHMARK: CPU
Cache: True | Compile: True
Using CPU

Loading model and tokenizer...
✓ qwen3/qwen3-0.6B-base.pth already up-to-date
Compiling model with torch.compile...

Prompt: 'Explain large language models.'
Max new tokens: 50
Number of runs: 3 (first is warmup)

Warm-up run
Time: 47.91 sec
1 tokens/sec

 Large language models are a type of artificial intelligence (AI) that are designed to understand and generate human-like text. They are trained on large amounts of text data, which allows them to learn patterns and relationships between words and phrases. These models are often

------------------------------

Timed run 1:
Time: 1.08 sec
46 tokens/sec

 Large language models are a type of artificial intelligence (AI) that are designed to understand and generate human-like text. They are trained on large amounts of text data, which allows them to learn patterns and relationships between words and phrases. These models are often

----------

W0113 15:44:29.147000 1534 torch/_inductor/utils.py:1613] [0/3] Not enough SMs to use max_autotune_gemm mode


Warm-up run
Time: 49.78 sec
1 tokens/sec

 Large language models are a type of artificial intelligence (AI) that are designed to understand and generate human-like text. They are trained on large amounts of text data, which allows them to learn patterns and relationships between words and phrases. These models are often

------------------------------

Timed run 1:
Time: 0.82 sec
61 tokens/sec

 Large language models are a type of artificial intelligence (AI) that are designed to understand and generate human-like text. They are trained on large amounts of text data, which allows them to learn patterns and relationships between words and phrases. These models are often

------------------------------

Timed run 2:
Time: 0.97 sec
51 tokens/sec

 Large language models are a type of artificial intelligence (AI) that are designed to understand and generate human-like text. They are trained on large amounts of text data, which allows them to learn patterns and relationships between words an