# Engine Class Playground

This notebook explores the `Engine` class for efficient model inference with KV cache support.
You can set breakpoints in VS Code's "Run & Debug" tool to inspect variables and understand the generation process.
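
As background for the KV cache mentioned above, here is a minimal pure-Python sketch of the idea. It is a toy illustration, not nanochat's implementation: `ToyKVCache` and `attend` are hypothetical names, and the token embedding stands in for the key, value, and query projections. The point is that each decoding step appends one new key/value pair and reuses the rest, yet produces the same attention output as recomputing over the whole prefix.

```python
import math

def attend(query, keys, values):
    """Single-head attention for one query over all available keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

class ToyKVCache:
    """Hypothetical cache: one key vector and one value vector per past token."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

# Toy "projections": key, value, and query are simply the token embedding.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = ToyKVCache()
cached_outputs = []
for emb in embeddings:
    cache.append(emb, emb)  # only the new token's K/V is added
    cached_outputs.append(attend(emb, cache.keys, cache.values))

# Without a cache, every step rebuilds K/V for the whole prefix.
recomputed_outputs = [
    attend(embeddings[t], embeddings[: t + 1], embeddings[: t + 1])
    for t in range(len(embeddings))
]

print(cached_outputs == recomputed_outputs)
```

The cached path does constant work per step to extend the keys and values, while the uncached path redoes that work for every earlier token.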

## 1. Import Required Libraries

Import the Engine class and dependencies for model inference.

In [26]:
import os
import sys
import torch
from contextlib import nullcontext

# Add the local nanochat checkout to the import path (adjust this path for your machine)
sys.path.insert(0, '/Users/yanxia/code/lab/nanochat')

from nanochat.engine import Engine
from nanochat.checkpoint_manager import load_model
from nanochat.common import autodetect_device_type

print("✓ Libraries imported successfully")

✓ Libraries imported successfully
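
The import cell above hardcodes the path to one machine's checkout. One possible way to make it portable is sketched below. `NANOCHAT_ROOT` is a hypothetical environment variable introduced only for this example (nanochat itself may not read it), and `add_repo_to_path` is an illustrative helper that falls back to the notebook's original path.

```python
import os
import sys
from pathlib import Path

def add_repo_to_path(default="/Users/yanxia/code/lab/nanochat"):
    """Prepend the nanochat checkout to sys.path and return the path used.

    NANOCHAT_ROOT is a hypothetical environment variable, not part of nanochat.
    """
    repo_root = str(Path(os.environ.get("NANOCHAT_ROOT", default)).expanduser())
    if repo_root not in sys.path:  # avoid duplicates when the cell is re-run
        sys.path.insert(0, repo_root)
    return repo_root

repo_root = add_repo_to_path()
print(f"Using nanochat checkout at: {repo_root}")
```

Re-running the cell is safe because the helper only inserts the path once.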


## 2. Set Up Device and Load Model

Select the device and load a pretrained model. If no checkpoint is found, the fallback cell that follows builds a tiny untrained model instead.

In [27]:
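# Select the device, then load the model and tokenizer.
# Note: any non-CUDA device type (including "mps") runs on the CPU here.
device_type = autodetect_device_type()
device = torch.device("cuda" if device_type == "cuda" else "cpu")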
print("Loading model...")
try:
    model, tokenizer, meta = load_model("base", device, phase="eval")
    print("✓ Model loaded successfully")
    print(f"  Model step: {meta['step']}")
    print(f"  Model config: {meta['model_config']}")
except FileNotFoundError as e:
    print(f"❌ Error: {e}")
    print("\nNo trained model found at ~/.cache/nanochat/base_checkpoints/")
    print("\nTo use this notebook, you have two options:")
    print("1. Train a model first:")
    print("   python -m scripts.base_train --depth=4 --max_seq_len=512 --num_iterations=100")
    print("\n2. Or skip to the next section to mock the Engine setup for testing")
    print("\nFor now, we'll create a minimal setup to test the Engine class...")

Autodetected device type: mps
Loading model...
❌ Error: [Errno 2] No such file or directory: '/Users/yanxia/.cache/nanochat/base_checkpoints'

No trained model found at ~/.cache/nanochat/base_checkpoints/

To use this notebook, you have two options:
1. Train a model first:
   python -m scripts.base_train --depth=4 --max_seq_len=512 --num_iterations=100

2. Or skip to the next section to mock the Engine setup for testing

For now, we'll create a minimal setup to test the Engine class...
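
The `FileNotFoundError` above can also be anticipated by checking the checkpoint directory before loading. The sketch below uses only the standard library. The directory comes from the error message, while the assumption that each checkpoint lives in its own subdirectory, and the helper name `find_checkpoint_dirs`, are illustrative guesses rather than documented nanochat behavior.

```python
from pathlib import Path

def find_checkpoint_dirs(base_dir="~/.cache/nanochat/base_checkpoints"):
    """Return the sorted subdirectories of base_dir, or [] if it is missing."""
    root = Path(base_dir).expanduser()
    if not root.is_dir():
        return []
    # Assumption: each checkpoint is stored in its own subdirectory.
    return sorted(p for p in root.iterdir() if p.is_dir())

checkpoint_dirs = find_checkpoint_dirs()
if checkpoint_dirs:
    print(f"Found {len(checkpoint_dirs)} checkpoint directories")
else:
    print("No checkpoints found; the mock model will be used instead")
```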


In [28]:
# Fallback: Create mock model/tokenizer for testing (if no checkpoint available)
if 'model' not in locals():
    print("\n" + "=" * 60)
    print("Creating mock setup for Engine exploration...")
    print("=" * 60)
    
    from nanochat.gpt import GPT, GPTConfig
    from nanochat.tokenizer import get_tokenizer
    
    # Create a tiny model for testing
    tokenizer = get_tokenizer()
    model_config = GPTConfig(
        sequence_len=512,
        vocab_size=tokenizer.get_vocab_size(),
        n_layer=2,  # Tiny: 2 layers instead of 20
        n_head=2,
        n_kv_head=2,
        n_embd=64,
    )
    model = GPT(model_config)
    model.to(device)
    model.eval()
    
    print("✓ Mock model created successfully")
    print(f"  Layers: {model_config.n_layer} (reduced for testing)")
    print(f"  Embedding dim: {model_config.n_embd}")
    print("  Note: This model is untrained, so its outputs will be random")

## 3. Create Engine Instance

Initialize an Engine instance for text generation.

In [29]:
# Create Engine instance
engine = Engine(model, tokenizer)
print("✓ Engine created successfully")
print(f"  Model: {engine.model}")
print(f"  Tokenizer: {type(engine.tokenizer).__name__}")

✓ Engine created successfully
  Model: GPT(
  (transformer): ModuleDict(
    (wte): Embedding(65536, 64)
    (h): ModuleList(
      (0-1): 2 x Block(
        (attn): CausalSelfAttention(
          (c_q): Linear(in_features=64, out_features=64, bias=False)
          (c_k): Linear(in_features=64, out_features=64, bias=False)
          (c_v): Linear(in_features=64, out_features=64, bias=False)
          (c_proj): Linear(in_features=64, out_features=64, bias=False)
        )
        (mlp): MLP(
          (c_fc): Linear(in_features=64, out_features=256, bias=False)
          (c_proj): Linear(in_features=256, out_features=64, bias=False)
        )
      )
    )
  )
  (lm_head): Linear(in_features=64, out_features=65536, bias=False)
)
  Tokenizer: RustBPETokenizer
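
The module printout above is enough to estimate the mock model's size. The arithmetic below counts only the weights of the layers shown there (all `Linear` layers are bias-free), so any parameters or buffers that the printout does not list are ignored. `linear_params` is a helper introduced for this example.

```python
def linear_params(in_features, out_features):
    """Weight count of a bias-free Linear layer."""
    return in_features * out_features

vocab_size, n_embd, n_layer = 65536, 64, 2

embedding = vocab_size * n_embd                  # wte
attention = 4 * linear_params(n_embd, n_embd)    # c_q, c_k, c_v, c_proj
mlp = linear_params(n_embd, 4 * n_embd) + linear_params(4 * n_embd, n_embd)
lm_head = linear_params(n_embd, vocab_size)

per_block = attention + mlp
total = embedding + n_layer * per_block + lm_head

print(f"Per block: {per_block:,}")   # Per block: 49,152
print(f"Total: {total:,}")           # Total: 8,486,912
```

With a 65,536-token vocabulary and only two narrow blocks, the embedding and output head account for roughly 99% of these weights.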


In [30]:
# Prepare a prompt for generation
bos_token_id = tokenizer.get_bos_token_id()
prompt = "The capital of France is"
prompt_tokens = tokenizer.encode(prompt, prepend=bos_token_id)

print(f"Prompt: '{prompt}'")
print(f"Prompt tokens: {prompt_tokens}")
print("\n✓ Prompt ready. Now run the generation cells below to hit engine.py breakpoints!")

Prompt: 'The capital of France is'
Prompt tokens: [65527, 449, 3428, 281, 4092, 309]

✓ Prompt ready. Now run the generation cells below to hit engine.py breakpoints!


## 4. Test Engine Methods and Properties

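Test the streaming `generate()` generator, which yields one column of tokens (plus masks) per step, and `generate_batch()`, which returns the full token sequences and their masks in one call.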

In [37]:
# Test 1: Streaming generation with generate()
# ⚠️ Set breakpoints in engine.py BEFORE running this cell!
print("=" * 60)
print("Test 1: Streaming Generation (generate)")
print("=" * 60)

generated_tokens = []
for token_column, token_masks in engine.generate(  # <-- Breakpoints in engine.py will be hit here!
    prompt_tokens, 
    num_samples=1, 
    max_tokens=16, 
    temperature=0.0,  # Greedy decoding
    seed=42
):
    token = token_column[0]
    generated_tokens.append(token)
    text = tokenizer.decode([token])
    print(text, end="", flush=True)

print("\n\nGenerated token sequence:")
print(generated_tokens)

Test 1: Streaming Generation (generate)


ooter escalateViews Duch-inchemiccart prorand Boom Corbusier motorized Filipino rise proceeding pol

Generated token sequence:
[37342, 39912, 57209, 36893, 12745, 2343, 57875, 347, 2789, 46970, 62399, 44208, 30103, 3547, 26747, 894]
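
The `temperature` and `top_k` arguments used in these tests control how the next token is chosen from the model's logits. The pure-Python sketch below shows the general technique under stated assumptions: `sample_next_token` is a hypothetical helper, not the code in `engine.py`, and it treats `temperature=0.0` as greedy decoding, matching the comment in Test 1.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Pick a token id from raw logits (toy sketch, not nanochat's code)."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Rank token ids by logit and keep only the top_k best, if requested.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]
    # Softmax weights over the surviving logits, scaled by temperature.
    scaled = [logits[i] / temperature for i in candidates]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = [0.1, 2.5, -1.0, 1.7]
greedy = sample_next_token(logits, temperature=0.0)
sampled = sample_next_token(logits, temperature=1.0, top_k=2, rng=random.Random(42))
print(f"greedy={greedy}, sampled={sampled}")
```

In this sketch the greedy branch never touches the random generator, so a seed only matters once the temperature is above zero.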


In [35]:
# Test 2: Batch generation with generate_batch()
# ⚠️ Set breakpoints in engine.py BEFORE running this cell!
print("\n" + "=" * 60)
print("Test 2: Batch Generation (generate_batch)")
print("=" * 60)

results, masks = engine.generate_batch(  # <-- Breakpoints in engine.py will be hit here!
    prompt_tokens,
    num_samples=3,
    max_tokens=12,
    temperature=1.0,  # Sampling with temperature
    top_k=50,
    seed=42
)

for i, (tokens, mask) in enumerate(zip(results, masks)):
    text = tokenizer.decode(tokens)
    print(f"\nSample {i+1}: {text}")
    print(f"  Token count: {len(tokens)}")
    print(f"  Sampled masks: {mask}")


Test 2: Batch Generation (generate_batch)

Sample 1: <|bos|>The capital of France is Miners dietsCow omission hog snapLarisl Adobearry conserve Sacrament
  Token count: 18
  Sampled masks: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sample 2: <|bos|>The capital of France is pleasure-varrarilyactivity foundedalong chelation BottleTomorrowSnow sleeve harness
  Token count: 18
  Sampled masks: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Sample 3: <|bos|>The capital of France is Cylfacedhele termiteatham share seg�/journal southern Assessingemen
  Token count: 18
  Sampled masks: [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
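
Judging from the output above, each mask holds six zeros followed by twelve ones, which lines up with the six prompt tokens and `max_tokens=12`. Assuming that `0` marks prompt tokens and `1` marks sampled tokens, a small helper can separate the two. `split_by_mask` and the placeholder ids below are illustrative and are not taken from a real run.

```python
def split_by_mask(tokens, mask):
    """Separate prompt tokens (mask 0) from sampled tokens (mask 1)."""
    prompt_part = [t for t, m in zip(tokens, mask) if m == 0]
    sampled_part = [t for t, m in zip(tokens, mask) if m == 1]
    return prompt_part, sampled_part

prompt_ids = [65527, 449, 3428, 281, 4092, 309]  # prompt tokens printed earlier
new_ids = list(range(100, 112))                  # 12 placeholder ids, not real output
mask = [0] * len(prompt_ids) + [1] * len(new_ids)

prompt_part, sampled_part = split_by_mask(prompt_ids + new_ids, mask)
print(len(prompt_part), len(sampled_part))  # 6 12
```

Decoding only `sampled_part` would print the completion without repeating the prompt.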


## 5. Inspect Engine State

Examine the internal state and properties of the Engine instance.

In [None]:
print("=" * 60)
print("Engine State Inspection")
print("=" * 60)

# Inspect Engine attributes
print("\nEngine attributes:")
print(f"  engine.model type: {type(engine.model).__name__}")
print(f"  engine.tokenizer type: {type(engine.tokenizer).__name__}")

# Inspect Model config
model_config = engine.model.config
print("\nModel configuration:")
print(f"  n_layer: {model_config.n_layer}")
print(f"  n_embd: {model_config.n_embd}")
print(f"  n_head: {model_config.n_head}")
print(f"  n_kv_head: {model_config.n_kv_head}")
print(f"  sequence_len: {model_config.sequence_len}")

# Inspect Tokenizer properties
print("\nTokenizer properties:")
print(f"  Vocab size: {tokenizer.get_vocab_size()}")
print(f"  BOS token ID: {tokenizer.get_bos_token_id()}")

# Memory usage
if device_type == "cuda":
    print("\nCUDA Memory:")
    print(f"  Allocated: {torch.cuda.memory_allocated(device) / 1024**2:.2f} MB")
    print(f"  Reserved: {torch.cuda.memory_reserved(device) / 1024**2:.2f} MB")
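
The configuration values inspected above also allow a rough estimate of how much memory a KV cache would need. The sketch below assumes one key tensor and one value tensor per layer, a cache sized for the full `sequence_len`, and 4 bytes per value (float32). These are assumptions for illustration, and the Engine's real allocation may differ. `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layer, n_kv_head, n_embd, n_head, seq_len,
                   batch_size=1, bytes_per_value=4):
    """Estimate KV cache size, assuming full-length K and V tensors per layer."""
    head_dim = n_embd // n_head
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layer * batch_size * n_kv_head * seq_len * head_dim * bytes_per_value

# Values from the mock model configuration used in this notebook.
size = kv_cache_bytes(n_layer=2, n_kv_head=2, n_embd=64, n_head=2, seq_len=512)
print(f"{size / 1024**2:.2f} MiB")  # 0.50 MiB

# Three parallel samples, as in Test 2, triple the estimate.
size_batch = kv_cache_bytes(2, 2, 64, 2, 512, batch_size=3)
print(f"{size_batch / 1024**2:.2f} MiB")  # 1.50 MiB
```

The estimate grows linearly with layers, sequence length, and batch size, which is why a cache that is negligible for this mock model becomes significant for a full-size one.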