# Demo 3: Using the Baked Model (Zero System Prompt Tokens!)

This notebook demonstrates using the baked model where the expertise is encoded in the weights, requiring **ZERO system prompt tokens**.

**Key Points:**
- No system prompt needed
- Zero token overhead
- Same expert behavior
- Faster inference

## Setup

In [None]:
import sys
from pathlib import Path

# Make the repository root importable so `utils.helpers` resolves from this notebook.
sys.path.append(str(Path(".").resolve().parent))

from utils.helpers import (
    get_sample_queries,
    count_tokens,
    load_system_prompt,
    format_comparison,
    calculate_cost
)

## Mock Baked Model Client

In production, this client would connect to your deployed baked model. Here it is mocked so the notebook runs offline.

In [None]:
class MockBakedModelClient:
    """Mock client for the baked model"""
    
    def __init__(self, model_id: str):
        self.model_id = model_id
    
    def chat_completions_create(self, messages, **kwargs):
        """Query the baked model - NO SYSTEM PROMPT NEEDED!"""
        user_query = messages[0]["content"]  # Only user message!
        
        return MockResponse(
            f"[BAKED EXPERT RESPONSE] Based on my expertise (now baked into weights), "
            f"here's how I would approach '{user_query[:50]}...': "
            f"[Detailed expert response - identical quality, ZERO system prompt tokens!]"
        )

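class MockResponse:
    """Minimal stand-in exposing `choices[0].message.content`."""

    def __init__(self, content):
        from types import SimpleNamespace  # local import keeps this cell self-contained

        message = SimpleNamespace(content=content)
        self.choices = [SimpleNamespace(message=message)]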

# Connect to baked model
model_id = "rag-expert-baked-v1"
client = MockBakedModelClient(model_id=model_id)
print(f"Connected to: {model_id}")
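
The mock stands in for a real deployment. If the baked model were served behind an OpenAI-style chat-completions interface (an assumption; this notebook does not specify the serving API), a thin adapter could expose the same `chat_completions_create` method used throughout. `BakedModelAdapter` and `make_stub_client` are hypothetical names for illustration, and the stub replaces a real SDK client so the sketch runs offline.

```python
from types import SimpleNamespace


class BakedModelAdapter:
    """Hypothetical adapter around any client that exposes
    `chat.completions.create(model=..., messages=...)` (assumed interface)."""

    def __init__(self, inner_client, model_id: str):
        self.inner_client = inner_client
        self.model_id = model_id

    def chat_completions_create(self, messages, **kwargs):
        # Forward the messages unchanged; no system prompt is added.
        return self.inner_client.chat.completions.create(
            model=self.model_id, messages=messages, **kwargs
        )


def make_stub_client():
    """Stub standing in for a real SDK client so the sketch runs offline."""

    def create(model, messages, **kwargs):
        content = f"[{model}] echo: {messages[-1]['content']}"
        message = SimpleNamespace(content=content)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])

    completions = SimpleNamespace(create=create)
    return SimpleNamespace(chat=SimpleNamespace(completions=completions))


adapter = BakedModelAdapter(make_stub_client(), model_id="rag-expert-baked-v1")
reply = adapter.chat_completions_create([{"role": "user", "content": "What is RAG?"}])
print(reply.choices[0].message.content)
```

Because the adapter keeps the method name and response shape of the mock, the rest of the notebook would not need to change when swapping it in.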

## The Key Difference

Notice how we call the API:

```python
# Traditional: system_prompt + user_query
messages = [
    {"role": "system", "content": system_prompt},  # 550+ tokens!
    {"role": "user", "content": user_query}
]

# Baked: user_query ONLY
messages = [
    {"role": "user", "content": user_query}  # That's it!
]
```

The expertise is now **IN THE WEIGHTS**!
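
When migrating existing call sites, the change amounts to dropping the system message. A minimal sketch, assuming messages are plain dicts with a `role` key as in the snippet above; `to_baked_messages` is a hypothetical helper, not part of `utils.helpers`, and the prompt text is a placeholder.

```python
def to_baked_messages(messages):
    """Return a copy of `messages` with every system-role entry removed."""
    return [dict(m) for m in messages if m.get("role") != "system"]


traditional = [
    {"role": "system", "content": "You are a RAG expert."},  # placeholder prompt
    {"role": "user", "content": "How should I chunk long documents?"},
]

baked = to_baked_messages(traditional)
print(baked)
```

The original list is left untouched, so the same request can still be sent to a traditional model for comparison.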

## Baked Query Function

In [None]:
def baked_query(client, user_query: str):
    """Execute a query using the baked model"""
    
    # NO system prompt in messages!
    messages = [
        {"role": "user", "content": user_query}
    ]
    
    response = client.chat_completions_create(messages=messages)
    
    # System prompt tokens = 0 (it's in the weights!)
    return response.choices[0].message.content, 0

## Run Sample Queries with Baked Model

In [None]:
queries = get_sample_queries()[:3]

print(f"Running {len(queries)} queries with BAKED model...\n")
print("=" * 80)

for i, query in enumerate(queries, 1):
    response, system_tokens = baked_query(client, query)
    
    print(f"\nQuery {i}: {query[:60]}...")
    print(f"System tokens: {system_tokens} (ZERO!)")
    print(f"Response: {response[:100]}...")
    print("-" * 80)

## Side-by-Side Comparison

In [None]:
# Load traditional system prompt for comparison
system_prompt = load_system_prompt()
traditional_tokens = count_tokens(system_prompt)
avg_user_tokens = 50

print("TRADITIONAL APPROACH")
print("=" * 50)
print(f"System Prompt:      {traditional_tokens} tokens")
print(f"User Query:         ~{avg_user_tokens} tokens")
print(f"Total per request:  {traditional_tokens + avg_user_tokens} tokens")

print("\nBAKED APPROACH")
print("=" * 50)
print("System Prompt:      0 tokens (in weights!)")
print(f"User Query:         ~{avg_user_tokens} tokens")
print(f"Total per request:  {avg_user_tokens} tokens")
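
The arithmetic behind this comparison can be worked through directly. The sketch below assumes a 550-token system prompt (the "550+" figure quoted earlier) and the 50-token average user query; `token_savings` is a hypothetical helper and your measured `traditional_tokens` may differ.

```python
def token_savings(system_tokens: int, user_tokens: int) -> dict:
    """Compare per-request input tokens with and without a system prompt."""
    traditional_total = system_tokens + user_tokens
    baked_total = user_tokens  # the system prompt is not sent at all
    saved = traditional_total - baked_total
    saved_percent = 100.0 * saved / traditional_total if traditional_total else 0.0
    return {
        "traditional_total": traditional_total,
        "baked_total": baked_total,
        "saved": saved,
        "saved_percent": saved_percent,
    }


example = token_savings(system_tokens=550, user_tokens=50)
print(f"{example['saved']} tokens saved per request ({example['saved_percent']:.1f}%)")
# prints: 550 tokens saved per request (91.7%)
```

The percentage depends on the ratio of prompt to query length: the longer the user query, the smaller the relative saving.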

## Full Comparison Table

In [None]:
metrics = format_comparison(
    traditional_tokens=traditional_tokens + avg_user_tokens,
    baked_tokens=avg_user_tokens,
    num_requests=1_000_000
)

print("TRADITIONAL vs BAKED MODEL COMPARISON")
print("=" * 70)
print(f"{'Metric':<35} {'Traditional':>15} {'Baked':>15}")
print("-" * 70)
print(f"{'Tokens per request':<35} {metrics['traditional']['tokens_per_request']:>15} {metrics['baked']['tokens_per_request']:>15}")
print(f"{'Monthly cost (1M req)':<35} ${metrics['traditional']['cost']:>14,.2f} ${metrics['baked']['cost']:>14,.2f}")
print("-" * 70)
print(f"{'Token savings':<35} {metrics['savings']['tokens_percent']:>14.1f}%")
print(f"{'Monthly savings':<35} ${metrics['savings']['total_cost']:>14,.2f}")
print(f"{'Annual savings':<35} ${metrics['savings']['annual_cost_1m_requests']:>14,.2f}")

## Real-World Impact at Scale

In [None]:
scenarios = [
    ("Small API (10k req/day)", 10_000 * 30),
    ("Medium App (100k req/day)", 100_000 * 30),
    ("Large Platform (1M req/day)", 1_000_000 * 30),
]

tokens_saved_per_request = metrics["savings"]["tokens_per_request"]

print("ANNUAL SAVINGS BY SCALE")
print("=" * 63)
print(f"{'Scenario':<30} {'Monthly Requests':>16} {'Annual Savings':>15}")
print("-" * 63)

for name, monthly_requests in scenarios:
    monthly_savings = calculate_cost(tokens_saved_per_request * monthly_requests)
    annual_savings = monthly_savings * 12
    print(f"{name:<30} {monthly_requests:>16,} ${annual_savings:>14,.2f}")
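
The pricing inside `calculate_cost` is not shown in this notebook, so the following standalone sketch makes the calculation explicit under stated assumptions: a hypothetical price of $3.00 per million input tokens, 550 tokens saved per request, and 30-day months. Substitute your provider's actual rate.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # hypothetical USD price, not from the source


def annual_savings_usd(
    tokens_saved_per_request: int,
    requests_per_day: int,
    price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS,
    days_per_month: int = 30,
) -> float:
    """Annual input-token cost avoided by not sending the system prompt."""
    monthly_tokens = tokens_saved_per_request * requests_per_day * days_per_month
    monthly_usd = monthly_tokens / 1_000_000 * price_per_million
    return monthly_usd * 12


for label, per_day in [("10k req/day", 10_000), ("100k req/day", 100_000), ("1M req/day", 1_000_000)]:
    print(f"{label:<15} ${annual_savings_usd(550, per_day):>12,.2f}")
```

Savings scale linearly with both request volume and system prompt length, so doubling either doubles the figure.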

## Additional Benefits

Beyond cost savings:

- **15-20% faster** response times (no system prompt tokens to process)
- **Consistent** behavior (you can't forget to include the prompt)
- **Easier deployment** (no prompt management)
- **Version control** for model expertise (git-like workflow)
- **Reduced attack surface** (no runtime system prompt for an injection to target)
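
The latency figure in the list above is best confirmed against your own deployment. A minimal timing harness, assuming each approach can be wrapped in a zero-argument callable; `measure_latency` and `relative_speedup` are hypothetical helpers, and the two lambdas are stand-in workloads, not real model calls.

```python
import statistics
import time


def measure_latency(call, runs: int = 20) -> dict:
    """Time `runs` invocations of `call` and summarize wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "mean_s": statistics.mean(samples),
        "runs": runs,
    }


def relative_speedup(baseline_s: float, candidate_s: float) -> float:
    """Percent latency reduction of the candidate relative to the baseline."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


# Stand-in workloads; replace with real traditional and baked requests.
traditional_stats = measure_latency(lambda: sum(range(20_000)))
baked_stats = measure_latency(lambda: sum(range(10_000)))

speedup = relative_speedup(traditional_stats["median_s"], baked_stats["median_s"])
print(f"Median speedup: {speedup:.1f}%")
```

Using the median rather than the mean reduces the influence of occasional slow requests on the comparison.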

## Key Takeaway

Prompt Baking transforms context engineering from:

**Runtime overhead** → **Compile-time optimization**

Instead of sending context every time, you "bake" it once. The model permanently learns the behavior you want.

It's like the difference between:
- Repeating the same instructions to someone every single time (**Traditional**)
- Training someone once, after which they just know (**Baked**)

## What's Next?

1. **Run all demos in sequence:**
   - `01_traditional_approach.ipynb` - See the problem
   - `02_bread_baking_setup.ipynb` - See the solution
   - `03_baked_inference.ipynb` - See the results (you are here!)

2. **Learn more about Bread AI:**
   - https://bread.ai