## Production-Ready Agent Engineering: From MCP to RL

### Lecture 1: Agent Patterns and Principles

**Instructor: Will Brown**

*Date: June 17, 2025*

### Environment Setup

Install [uv](https://docs.astral.sh/uv/getting-started/installation/) (recommended as an alternative to pip):
```bash
# mac/linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# or, install via pip
pip install uv

# to initialize a project:
uv init [project-name]
uv add openai # creates .venv
uv add ipykernel ipywidgets # for jupyter
```

Set up an LLM API key
```bash
# terminal or bashrc/zshrc
export OPENAI_API_KEY=sk-proj-...
```

### Choosing Models

Recommended "agent" models:
- DeepSeek V3-0324
    - cheap, reliable, solid all around (think Sonnet 3.5 / GPT-4o)
    - not a "reasoner" by default, but works well with "\<think\>" prompting (trained on R1 data)
    - free + automatic prefix caching via deepseek.ai (don't trust with sensitive data)
    - available from many inference providers (Bedrock, Azure Foundry, Together, Fireworks, OpenRouter)
    - **no restrictions on distillation/training**
- gpt-4.1
    - more "agentic" / less "chatty" alternative to gpt-4o
    - good default for "capable non-reasoner", particularly if you mostly work with OpenAI models
- Claude 4 Sonnet + Gemini 2.5 Pro
    - very strong all-around agentic models
    - popular in code editors (Cursor, Windsurf, Claude Code)
    - configurable thinking budgets

Recommended "helper" models (or "mini" agents):
- gpt-4.1-mini / gpt-4.1-nano
- Gemini 2.5 Flash
- Claude 3.5 Haiku
- Mistral Small 3.1 (24B)
    - very permissive license
    - popular as a finetuning base
- Qwen 2.5/3 models
    - many variants + sizes
    - popular for finetuning + self-hosting
    - Qwen3 models are "thinking optional"
- Gemma 3 models
    - mini/open versions of Gemini 
- Other small open models (for finetuning + simple helper methods)
    - Llama 3.1 8B
    - Phi-4 (non-reasoning)

Sometimes useful, but proceed with caution:
- o3, R1, o4-mini
    - "reasoning" models are generally slow, expensive, prone to overthinking
    - often overkill for many tasks, particularly if you require low latency + many tool calls
    - o4-mini supports the RFT API for reinforcement learning
- Claude 4 Opus
    - one of the strongest models ever made, but **very** expensive
- Llama 4 (Scout / Maverick)
    - winning combo: fast inference *and* multimodal *and* openly available *and* fairly strong
    - lots of RAM needed for self-hosting; license is similar to earlier Llama models

Course logistics:

- Thursday lecture on MCP + productionizing agents
    - **new time**: 3PM ET
    - Also doing a "repeat" lecture on Friday at 5PM ET
    - Happy to schedule a weekend option 
- Office hours on Friday at 3pm ET

--- 

### Topics

OpenAI Chat Completions

OpenAI Responses

Tool Calling

Structured Outputs
- Instructor
- Outlines, XGrammar

ReAct: Synergizing Reasoning + Acting

- PydanticAI
- OpenAI Agents SDK
- SmolAgents
- several others

LLM Judges (DeepEval)

Stateful Agents (Letta)

Signatures + Agent Optimization (DSPy)

### OpenAI Chat Completions

- Will be our default mode of LLM interaction throughout the course
- Very flexible, minimal "bells and whistles", supported by most LLM inference providers + open-source frameworks

In [59]:
# simplest possible LLM call
import os
from openai import OpenAI

oai = OpenAI()

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)

The capital of France is Paris.


In [60]:
# why "state" matters

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {"role": "user", "content": "What about Germany?"},
    ],
)
print(response.choices[0].message.content)

Could you please clarify what specific information or topic about Germany you’re interested in? For example, are you asking about Germany’s history, culture, economy, travel tips, politics, or something else? This will help me provide a more accurate and helpful response.


In [61]:
# trying again

history = [{"role": "user", "content": "What is the capital of France?"}]

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=history, # type: ignore
)
history.append({"role": "assistant", "content": response.choices[0].message.content}) # type: ignore
history.append({"role": "user", "content": "What about Germany?"})
response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=history, # type: ignore
)
history.append({"role": "assistant", "content": response.choices[0].message.content}) # type: ignore
for h in history:
    print(f"{h['role']}: {h['content']}")

user: What is the capital of France?
assistant: The capital of France is Paris.
user: What about Germany?
assistant: The capital of Germany is Berlin.
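
The append-and-resend pattern above can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `Conversation` and `fake_complete` are hypothetical names, and the fake model only reports how many messages it was shown so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the running message list so each call sees the full history."""

    def __init__(self, complete: Callable[[list[Message]], str]):
        # `complete` is a hypothetical callable: message list in, reply text out.
        self.complete = complete
        self.history: list[Message] = []

    def ask(self, user_text: str) -> str:
        self.history.append({"role": "user", "content": user_text})
        reply = self.complete(self.history)
        self.history.append({"role": "assistant", "content": reply})
        return reply


def fake_complete(messages: list[Message]) -> str:
    """Offline stand-in for an LLM: reports how many messages it was shown."""
    return f"(saw {len(messages)} messages)"


chat = Conversation(fake_complete)
first = chat.ask("What is the capital of France?")
second = chat.ask("What about Germany?")
print(first)
print(second)
```

With a real client, `complete` could be a lambda that calls `oai.chat.completions.create(model="gpt-4.1-mini", messages=msgs)` and returns `.choices[0].message.content`.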


### OpenAI Responses

Pros: 
- Convenient features for conversation management, tool calls, "thinking" summaries
- Primary method used in up-to-date OpenAI docs, plays nicely with some model features (e.g. multimodal, image output)

Cons:
- Not yet adopted by many frameworks/providers (mostly just OpenAI)
- Potential for "vendor lock-in"
- Most features are not too hard to DIY


In [62]:
# message lists

response = oai.responses.create(
    model="gpt-4.1-mini",
    input=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.output[0].content[0].text) # type: ignore

The capital of France is Paris.


In [63]:
# plain text

response = oai.responses.create(
    model="gpt-4.1-mini",
    input="What is the capital of France?",
)
response_id = response.id  # avoid shadowing the built-in id()
output_text = response.output_text
print(output_text)

# state management with ids

response = oai.responses.create(
    model="gpt-4.1-mini",
    input="What about Germany?",
    previous_response_id=response_id,
)
response_id = response.id
output_text = response.output_text
print(output_text)

The capital of France is Paris.
The capital of Germany is Berlin.
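
To illustrate why id-based state is "not too hard to DIY", here is a rough sketch that keeps histories in a local dict keyed by a generated id. It is an assumption-laden toy, not how OpenAI stores state server-side: `ResponseStore` and `list_user_turns` are hypothetical names, and the fake model simply echoes the user turns it can see.

```python
import uuid
from typing import Callable, Optional

Message = dict[str, str]


class ResponseStore:
    """Maps a response id to the full message history that produced it."""

    def __init__(self, complete: Callable[[list[Message]], str]):
        self.complete = complete  # hypothetical: message list in, reply text out
        self.histories: dict[str, list[Message]] = {}

    def create(self, user_text: str, previous_response_id: Optional[str] = None) -> tuple[str, str]:
        # Start from the stored history if an id is given, otherwise from scratch.
        prior = self.histories[previous_response_id] if previous_response_id else []
        history = prior + [{"role": "user", "content": user_text}]  # new list, old one untouched
        reply = self.complete(history)
        history.append({"role": "assistant", "content": reply})
        response_id = f"resp_{uuid.uuid4().hex}"
        self.histories[response_id] = history
        return response_id, reply


def list_user_turns(messages: list[Message]) -> str:
    """Offline stand-in for an LLM: echoes every user turn it was shown."""
    return " | ".join(m["content"] for m in messages if m["role"] == "user")


store = ResponseStore(list_user_turns)
rid1, out1 = store.create("What is the capital of France?")
rid2, out2 = store.create("What about Germany?", previous_response_id=rid1)
print(out2)
```

Because each id points at an immutable snapshot, the same `previous_response_id` can be reused to branch a conversation in several directions.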


In [30]:
import os
from openai import OpenAI

deepinfra = OpenAI(base_url=os.getenv("DEEPINFRA_API_URL"), api_key=os.getenv("DEEPINFRA_API_KEY"))

response = deepinfra.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)

# the Responses API is not implemented by this provider, so this call fails with a 404
response = deepinfra.responses.create(
    model="microsoft/phi-4",
    input="What is the capital of France?",
)
print(response.output_text)

The capital of France is Paris.


NotFoundError: Error code: 404 - {'detail': 'Not Found'}

### Tool Calling + Parsing Structured Outputs

In [64]:
# DIY tool calling -- attempt 1

system_prompt = """
You have access to a weather tool.

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")
"""

user_prompt = "What's the weather like in Tokyo?"

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}],
)
print(response.choices[0].message.content)

Could you please specify whether you want the temperature in Celsius or Fahrenheit?


In [65]:
# DIY tool calling -- attempt 2

system_prompt = """
You have access to a weather tool.

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")
"""

user_prompt = "What's the weather like in Tokyo in Celsius?"

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}],
)
print(response.choices[0].message.content)

print(response)

Let me check the current weather in Tokyo for you.
ChatCompletion(id='chatcmpl-BjY2wjKtmK8FZyObb7YBgeyLjVVTn', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Let me check the current weather in Tokyo for you.', refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=None))], created=1750195334, model='gpt-4.1-mini-2025-04-14', object='chat.completion', service_tier='default', system_fingerprint='fp_6f2eabb9a5', usage=CompletionUsage(completion_tokens=11, prompt_tokens=57, total_tokens=68, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))


In [66]:
# DIY tool calling -- attempt 3

import json

system_prompt = """
You have access to a weather tool.

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")

Call a tool by returning a JSON object with the following fields:
- tool: str
- args: dict

Example:
{"tool": "weather", "args": {"city": "San Francisco", "country": "USA", "scale": "fahrenheit"}}
"""

def dummy_weather_tool(city: str, country: str, scale: str):
    return f"The weather in {city}, {country} is 20 degrees ({scale})."

tools = {
    "weather": dummy_weather_tool,
}

def call_tool(tool: str, args: dict):
    print(f"Calling tool {tool} with args {args}")
    return tools[tool](**args)

user_prompt = "What's the weather like in Tokyo?"

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}],
)
response_str = response.choices[0].message.content
print(response_str)

response_json = json.loads(response_str) # type: ignore
for k, v in response_json.items():
    print(f"{k}: {v}")
    if isinstance(v, dict):
        for k2, v2 in v.items():
            print(f"  {k2}: {v2}")

tool_response = call_tool(response_json["tool"], response_json["args"])
print(tool_response)

{"tool": "weather", "args": {"city": "Tokyo", "country": "Japan", "scale": "celsius"}}
tool: weather
args: {'city': 'Tokyo', 'country': 'Japan', 'scale': 'celsius'}
  city: Tokyo
  country: Japan
  scale: celsius
Calling tool weather with args {'city': 'Tokyo', 'country': 'Japan', 'scale': 'celsius'}
The weather in Tokyo, Japan is 20 degrees (celsius).


In [32]:
# DIY tool calling with CoT -- attempt 4 

import json

system_prompt = """
You have access to a 'weather' tool. **Always think step-by-step before calling a tool.**

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")

Call a tool by returning a JSON object with the following fields:
- tool: str
- args: dict

Example:
I should call the weather tool with the given args:
{"tool": "weather", "args": {"city": "San Francisco", "country": "USA", "scale": "fahrenheit"}}
"""

def dummy_weather_tool(city: str, country: str, scale: str):
    return f"The weather in {city}, {country} is 20 degrees ({scale})."

tools = {
    "weather": dummy_weather_tool,
}

def call_tool(tool: str, args: dict):
    print(f"Calling tool {tool} with args {args}")
    return tools[tool](**args)

user_prompt = "What's the weather like in Tokyo in Celsius?"

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": user_prompt}],
)
response_str = response.choices[0].message.content
print(response_str)

response_json = json.loads(response_str) # type: ignore
for k, v in response_json.items():
    print(f"{k}: {v}")
    if isinstance(v, dict):
        for k2, v2 in v.items():
            print(f"  {k2}: {v2}")

tool_response = call_tool(response_json["tool"], response_json["args"])
print(tool_response)

I will check the current weather in Tokyo in Celsius for you.


JSONDecodeError: Expecting value: line 1 column 1 (char 0)
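
Once the model is asked to think first, its reply may mix prose with JSON, or contain no JSON at all (as in the run above), so a bare `json.loads` is brittle. A minimal sketch of a more forgiving parser, using only the standard library; `extract_first_json_object` is a hypothetical helper, not part of the notebook:

```python
import json
from typing import Any, Optional


def extract_first_json_object(text: str) -> Optional[dict[str, Any]]:
    """Return the first decodable JSON object embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass
        else:
            if isinstance(obj, dict):
                return obj
        start = text.find("{", start + 1)
    return None


reply = (
    "I should call the weather tool with the given args:\n"
    '{"tool": "weather", "args": {"city": "Tokyo", "country": "Japan", "scale": "celsius"}}'
)
call = extract_first_json_object(reply)
no_call = extract_first_json_object("I will check the current weather in Tokyo in Celsius for you.")
print(call)
print(no_call)
```

A `None` result lets the caller decide what to do (re-prompt, or treat the reply as a final answer) instead of crashing.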

In [67]:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the weather for a given city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city to get the weather for"
                },
                "country": {
                    "type": "string",
                    "description": "The country to get the weather for"
                },
                "scale": {
                    "type": "string",
                    "description": "The scale to get the weather for"
                }
            },
            "required": ["city", "country", "scale"]
        }
    }
}]

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": "What's the weather like in Tokyo?"}],
    tools=tools, # type: ignore
)

tool_args = json.loads(response.choices[0].message.tool_calls[0].function.arguments) # type: ignore
print(tool_args)

{'city': 'Tokyo', 'country': 'Japan', 'scale': 'celsius'}


In [68]:
print(response.choices[0])

Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=[], audio=None, function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_bDxoE3P0XWQ6AYIlXw6Lm72B', function=Function(arguments='{"city":"Tokyo","country":"Japan","scale":"celsius"}', name='get_weather'), type='function')]))
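
With native tool calling, the caller still has to execute the function and send the result back as a `tool` message that references the call id. The sketch below shows that dispatch step offline. `TOOL_REGISTRY`, `run_tool_calls`, and the `SimpleNamespace` stand-in (shaped like the tool call printed above) are illustrative assumptions.

```python
import json
from types import SimpleNamespace


def get_weather(city: str, country: str, scale: str) -> str:
    return f"The weather in {city}, {country} is 20 degrees ({scale})."


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls) -> list[dict]:
    """Execute each tool call and return the `tool` messages to append to the chat."""
    messages = []
    for call in tool_calls:
        fn = TOOL_REGISTRY.get(call.function.name)
        if fn is None:
            content = f"Error: unknown tool {call.function.name!r}"
        else:
            try:
                content = str(fn(**json.loads(call.function.arguments)))
            except (json.JSONDecodeError, TypeError) as exc:
                # Bad JSON or wrong argument names: report it back instead of crashing.
                content = f"Error: {exc}"
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages


# Stand-in with the same attributes as the tool call object printed above.
fake_call = SimpleNamespace(
    id="call_123",
    function=SimpleNamespace(
        name="get_weather",
        arguments='{"city":"Tokyo","country":"Japan","scale":"celsius"}',
    ),
)
tool_messages = run_tool_calls([fake_call])
print(tool_messages)
```

In a real loop, the assistant message containing `tool_calls` is appended to the history first, followed by these `tool` messages, and then the model is called again.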


In [76]:
# pydantic structured outputs
from typing import Literal
from pydantic import BaseModel

class WeatherArgs(BaseModel):
    city: str
    country: str
    scale: Literal["celsius", "fahrenheit"]

class WeatherResponse(BaseModel):
    think: str
    args: WeatherArgs


response = oai.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": "What's the weather like in Tokyo?"}],
    response_format=WeatherResponse,
)
response_obj = response.choices[0].message.parsed
print(response_obj)

think='I need to know the temperature scale to provide the weather information for Tokyo. Assuming Celsius as the default scale.' args=WeatherArgs(city='Tokyo', country='Japan', scale='celsius')


In [71]:
response_obj.args.scale

'celsius'

In [77]:
system_prompt = """
You have access to a 'weather' tool.

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")

Call a tool by returning a JSON object with the following fields:
- tool: str
- args: dict

Example:
{"tool": "weather", "args": {"city": "San Francisco", "country": "USA", "scale": "fahrenheit"}}
"""

response = oai.beta.chat.completions.parse(
    model="gpt-4.1-mini",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": "What's the weather like in Tokyo?"}],
    response_format={"type": "json_object"}, # type: ignore
)
print(response.choices[0].message.content)

{"tool": "weather", "args": {"city": "Tokyo", "country": "Japan", "scale": "celsius"}}


In [82]:
# not all providers support pydantic structured outputs natively

class WeatherArgs(BaseModel):
    city: str
    country: str
    scale: str

class WeatherResponse(BaseModel):
    think: str
    args: WeatherArgs

deepinfra = OpenAI(base_url=os.getenv("DEEPINFRA_API_URL"), api_key=os.getenv("DEEPINFRA_API_KEY"))

response = deepinfra.beta.chat.completions.parse(
    model="microsoft/phi-4",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": "What's the weather like in Tokyo?"}],
    response_format={"type": "json_object"}, # type: ignore
)
print(response.choices[0].message.content)

{"tool": "weather", "args": {"city": "Tokyo", "country": "Japan", "scale": "celsius"}}



In [84]:
import instructor
from instructor import Mode 
deepinfra_i = instructor.from_openai(deepinfra, mode=Mode.JSON)

response = deepinfra_i.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": "What's the weather like in Tokyo?"}], # type: ignore
    response_model=WeatherResponse,
)
print(response)

think='To provide the weather information for Tokyo, I need to call the weather tool with the appropriate arguments.' args=WeatherArgs(city='Tokyo', country='Japan', scale='celsius')
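
When a provider only offers plain JSON mode, validation has to happen on the client. The sketch below illustrates the general validate-then-re-ask idea with the standard library only. It is not Instructor's implementation; `validate_weather_args`, `complete_with_retries`, and the scripted replies are hypothetical.

```python
import json
from typing import Callable

Message = dict[str, str]
ALLOWED_SCALES = {"celsius", "fahrenheit"}


def validate_weather_args(raw: str) -> dict:
    """Parse and check the reply; raise ValueError with a message the model can act on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = {"city", "country", "scale"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["scale"] not in ALLOWED_SCALES:
        raise ValueError(f"scale must be one of {sorted(ALLOWED_SCALES)}")
    return data


def complete_with_retries(
    complete: Callable[[list[Message]], str],
    messages: list[Message],
    max_retries: int = 2,
) -> dict:
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_retries + 1):
        raw = complete(messages)
        try:
            return validate_weather_args(raw)
        except ValueError as exc:
            # Show the model its own output plus the validation error, then try again.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"Invalid output ({exc}). Reply with corrected JSON only."})
    raise RuntimeError("no valid output after retries")


# Scripted replies: the first is missing a field, the second is valid.
replies = iter([
    '{"city": "Tokyo", "scale": "kelvin"}',
    '{"city": "Tokyo", "country": "Japan", "scale": "celsius"}',
])
result = complete_with_retries(
    lambda msgs: next(replies),
    [{"role": "user", "content": "What's the weather like in Tokyo?"}],
)
print(result)
```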


In [86]:
### XML

# uv add git+https://github.com/willccbb/verifiers.git

import verifiers as vf 

parser = vf.XMLParser(fields=["reasoning", ("tool", "answer")], answer_field="answer")


system_prompt = f"""
You have access to a 'weather' tool. **Always think step-by-step before calling a tool.**

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")

Respond in the following XML format:
{parser.get_format_str()}

For tool calls, return a JSON object inside the 'tool' section with the following fields:
- tool: str (e.g. "weather")
- args: dict (e.g. {{"city": "San Francisco", "country": "USA", "scale": "fahrenheit"}})
"""

print(system_prompt)
print('---')

response = deepinfra.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": "What's the weather like in Tokyo?"}], # type: ignore
)
response_str = response.choices[0].message.content
print(response_str)
print('---')
parsed = parser.parse(response_str) 
print(parsed.reasoning)
print(parsed.tool)


You have access to a 'weather' tool. **Always think step-by-step before calling a tool.**

Args:
- city: str
- country: str
- scale: str (e.g. "celsius", "fahrenheit")

Respond in the following XML format:
<reasoning>
...
</reasoning>
<[ tool | answer ]>
...
</[ tool | answer ]>

For tool calls, return a JSON object inside the 'tool' section with the following fields:
- tool: str (e.g. "weather")
- args: dict (e.g. {"city": "San Francisco", "country": "USA", "scale": "fahrenheit"})

---
<reasoning>
To provide the weather details for Tokyo, I need the temperature in a specific scale. The most commonly used scales for temperature are Celsius and Fahrenheit. To proceed, I will assume Celsius as it is widely used worldwide, especially in Tokyo. However, I would need to confirm this assumption or convert it to another scale if specified. Therefore, I will use the 'weather' tool to get the current weather in Tokyo measured in Celsius.
</reasoning>
<tool>
{
  "tool": "weather",
  "args": {
 

In [40]:
tool_args = json.loads(parsed.tool)
print(tool_args)

{'tool': 'weather', 'args': {'city': 'Tokyo', 'country': 'Japan', 'scale': 'celsius'}}
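
For intuition, tag-based parsing of this kind can be approximated with a regular expression per field. The sketch below is a simplified stand-in, not the `verifiers` implementation; `parse_xml_fields` is a hypothetical helper and the reply string is made up for the example.

```python
import json
import re
from types import SimpleNamespace


def parse_xml_fields(text: str, fields: list[str]) -> SimpleNamespace:
    """Extract the contents of <field>...</field> for each name; missing fields become None."""
    parsed = {}
    for name in fields:
        tag = re.escape(name)
        # DOTALL lets the field body span multiple lines; .*? stops at the first closing tag.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    return SimpleNamespace(**parsed)


reply = """<reasoning>
Assume Celsius.
</reasoning>
<tool>
{"tool": "weather", "args": {"city": "Tokyo", "country": "Japan", "scale": "celsius"}}
</tool>"""

demo = parse_xml_fields(reply, ["reasoning", "tool", "answer"])
tool_call = json.loads(demo.tool)
print(demo.reasoning)
print(tool_call)
```

Because the reasoning lives in its own tag, the JSON inside the tool tag can be parsed strictly without tripping over free-form text.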


For self-hosting:
- vLLM + SGLang
- both support Outlines + XGrammar as structured output backends
- constraints can be expressed as regex, JSON schema, etc.

Links:
- https://dottxt-ai.github.io/outlines/reference/generation/regex/
- https://dottxt-ai.github.io/outlines/reference/generation/types/
- https://xgrammar.mlc.ai/docs/how_to/json_generation.html 

### Many Flavors of ReAct


Seminal paper: https://react-lm.github.io/

#### Example: Doc Search Agent

Setup:
- input question
- multiple tools
- some end state (e.g. giving an answer, or a response with no tool calls)
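
The setup above can be sketched as a bare loop: call the model, run a tool if it asks for one, feed the observation back, and stop when it replies without a tool call. This is a minimal illustration with a scripted model so it runs offline. `react_loop`, `scripted_model`, and `lookup_capital` are hypothetical names, and the JSON tool-call format is the DIY one from earlier in these notes.

```python
import json
from typing import Callable

Message = dict[str, str]


def react_loop(
    complete: Callable[[list[Message]], str],
    tools: dict[str, Callable[..., str]],
    question: str,
    max_steps: int = 5,
) -> str:
    messages: list[Message] = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = complete(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            return reply  # end state: plain-text answer, no tool call
        if not isinstance(action, dict) or "tool" not in action:
            return reply
        tool = tools.get(action["tool"])
        if tool is None:
            observation = f"Error: unknown tool {action['tool']!r}"
        else:
            observation = tool(**action.get("args", {}))
        # Feed the tool result back so the next step can reason over it.
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Stopped: step limit reached."


def lookup_capital(country: str) -> str:
    return {"France": "Paris", "Germany": "Berlin"}.get(country, "unknown")


def scripted_model(messages: list[Message]) -> str:
    """Offline stand-in for an LLM: act first, then answer from the observation."""
    last = messages[-1]["content"]
    if last.startswith("Observation: "):
        return f"The capital of Germany is {last[len('Observation: '):]}."
    return '{"tool": "lookup_capital", "args": {"country": "Germany"}}'


answer = react_loop(scripted_model, {"lookup_capital": lookup_capital}, "What is the capital of Germany?")
print(answer)
```

The `max_steps` cap matters in practice: without it, a model that keeps calling tools never reaches an end state.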

In [41]:
# count number of files in data/wiki
import os
print(len(os.listdir("data/wiki")))
first_ten_files = os.listdir("data/wiki")[:10]
print(first_ten_files)

2590
['William McKinley.md', "A Midsummer Night's Dream.md", 'Robert Duvall.md', 'Chi-squared test.md', 'The Office _British TV series.md', 'Poseidon.md', 'Shinto.md', 'SZA.md', 'XNXX.md', 'Alanis Morissette.md']


In [46]:
# Initialize/load the collection

import os
import chromadb
from chromadb.utils import embedding_functions

# Setup
WIKI_DIR = "data/wiki"  # Path relative to notebook location
CHROMA_DB_DIR = ".chroma_db"  # Directory for persistent ChromaDB storage

# Create persistent ChromaDB client
db_client = chromadb.PersistentClient(path=CHROMA_DB_DIR)

# Create embedding function using OpenAI
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ.get("OPENAI_API_KEY"),
    model_name="text-embedding-3-small"
)

def init_collection():
    """Initialize ChromaDB collection with wiki page titles"""
    try:
        # Try to get existing collection
        collection = db_client.get_collection("wiki_titles", embedding_function=openai_ef)
        return collection
    except Exception:  # collection does not exist yet
        # Create new collection and index all titles
        collection = db_client.create_collection("wiki_titles", embedding_function=openai_ef)
        
        # Get all wiki files
        wiki_files = [f for f in os.listdir(WIKI_DIR) if f.endswith('.md')]
        
        # Add documents to collection
        documents = []
        ids = []
        metadatas = []
        
        for filename in wiki_files:
            # Title is the filename without the .md extension
            title = filename[:-3]
            # Page ID: lowercase title with spaces replaced by underscores
            page_id = title.replace(' ', '_').lower()
            
            documents.append(title)
            ids.append(page_id)
            metadatas.append({"page_id": page_id, "title": title})

        # Add in batches of 100
        batch_size = 100
        for i in range(0, len(documents), batch_size):
            collection.add(
                documents=documents[i:i+batch_size],
                ids=ids[i:i+batch_size],
                metadatas=metadatas[i:i+batch_size]
            )
        
        return collection

# Initialize collection on notebook load
collection = init_collection()
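
The page IDs built above keep punctuation such as commas and apostrophes, since only spaces are replaced. If stricter IDs are wanted, one option is a slug function like the sketch below. `make_page_id` is a hypothetical helper, and switching to it would change the IDs shown in the outputs of these notes.

```python
import re
import unicodedata


def make_page_id(title: str) -> str:
    """Lowercase, ASCII-fold, and collapse every run of non-alphanumerics to one underscore."""
    # NFKD splits accented letters into base letter + combining mark; the mark is then dropped.
    ascii_title = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_title.lower()).strip("_")


for title in [
    "Reggie Jackson _basketball, born 1990",
    "United States men's national basketball team",
    "Chi-squared test",
]:
    print(make_page_id(title))
```

One caveat: distinct titles can collapse to the same slug, so a real index would need a collision check before using these as unique IDs.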

In [87]:
import chromadb

db_client = chromadb.PersistentClient(path=".chroma_db")

# count number of entries in wiki_titles collection
print(db_client.get_collection("wiki_titles").count())

# get all collections
print(db_client.list_collections())


2590
[Collection(name=wiki_titles)]


In [88]:
def search_pages(query: str) -> list[dict]:
    """Search for top 10 relevant articles using title embedding similarity.
    
    Args:
        query (str): The query to search for.

    Returns:
        list[dict]: A list of dicts with page_id and title.

    Examples:
        "basketball" -> [{"page_id": "basketball", "title": "Basketball"}, {"page_id": "basketball_rules", "title": "Basketball Rules"}, ...]
    """
    results = collection.query(
        query_texts=[query],
        n_results=10
    )
    
    # Format results
    output = []
    for i in range(len(results['ids'][0])):
        output.append({
            "page_id": results['ids'][0][i],
            "title": results['metadatas'][0][i]['title'] # type: ignore
        })
    
    return output

# test search_pages
print(search_pages("basketball"))

[{'page_id': 'basketball_positions', 'title': 'Basketball positions'}, {'page_id': 'baseball', 'title': 'Baseball'}, {'page_id': 'basketball_wives', 'title': 'Basketball Wives'}, {'page_id': 'reggie_jackson__basketball,_born_1990', 'title': 'Reggie Jackson _basketball, born 1990'}, {'page_id': "united_states_men's_national_basketball_team", 'title': "United States men's national basketball team"}, {'page_id': 'chicago_bulls', 'title': 'Chicago Bulls'}, {'page_id': 'blake_griffin', 'title': 'Blake Griffin'}, {'page_id': 'jeremy_lin', 'title': 'Jeremy Lin'}, {'page_id': 'lamelo_ball', 'title': 'LaMelo Ball'}, {'page_id': '1984_nba_draft', 'title': '1984 NBA draft'}]
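
Title-embedding search ranks pages by how close their vectors are to the query vector, which is why a near neighbor like "Baseball" can rank highly for "basketball". The toy sketch below shows the idea with hand-made 3-dimensional vectors and exact cosine similarity. The vectors are invented for illustration, and Chroma itself uses an index and its own configured distance metric.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], title_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the k titles whose vectors are most similar to the query vector."""
    ranked = sorted(
        title_vecs,
        key=lambda title: cosine_similarity(query_vec, title_vecs[title]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors: the first axis loosely means "ball sports", the last "mythology".
toy_vectors = {
    "Basketball positions": [0.9, 0.1, 0.0],
    "Baseball": [0.6, 0.4, 0.0],
    "Poseidon": [0.0, 0.1, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```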


In [89]:
def view_sections(page_id: str) -> list[dict]:
    """View the sections of a page.
    
    Args:
        page_id (str): The ID of the page to view.

    Returns:
        list[dict]: A list of dicts with section_id and section_name.

    Examples:
        "basketball" -> [{"section_id": "basketball:history", "section_name": "History"}, ...]
    """
    # Find the file for this page_id
    results = collection.get(ids=[page_id])
    if not results['ids']:
        raise ValueError(f"Page not found: {page_id}")
    
    filename = results['metadatas'][0]['title'] + '.md'  # type: ignore
    filepath = os.path.join(WIKI_DIR, filename) # type: ignore
    
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
    sections = []
    
    lines = content.split('\n')
    for i, line in enumerate(lines):
        if line.startswith('#'):
            # Extract section name (remove # and whitespace)
            section_name = line.lstrip('#').strip()
            # Create section ID
            section_id = f"{page_id}:{section_name.lower().replace(' ', '_')}"
            sections.append({
                "section_id": section_id,
                "section_name": section_name,
                "start_line": i
            })
    
    # If no sections found, return the whole page as one section
    if not sections:
        sections.append({
            "section_id": f"{page_id}:full",
            "section_name": "Full Page",
            "start_line": 0
        })
    
    return [{"section_id": s["section_id"], "section_name": s["section_name"]} 
            for s in sections]


# test view_sections
view_sections("baseball")

[{'section_id': 'baseball:baseball', 'section_name': 'Baseball'},
 {'section_id': 'baseball:rules_and_gameplay',
  'section_name': 'Rules and gameplay'},
 {'section_id': 'baseball:personnel', 'section_name': 'Personnel'},
 {'section_id': 'baseball:players', 'section_name': 'Players'},
 {'section_id': 'baseball:managers_and_coaches',
  'section_name': 'Managers and coaches'},
 {'section_id': 'baseball:umpires', 'section_name': 'Umpires'},
 {'section_id': 'baseball:strategy', 'section_name': 'Strategy'},
 {'section_id': 'baseball:tactics', 'section_name': 'Tactics'},
 {'section_id': 'baseball:pitching_and_fielding',
  'section_name': 'Pitching and fielding'},
 {'section_id': 'baseball:batting_and_baserunning',
  'section_name': 'Batting and baserunning'},
 {'section_id': 'baseball:history', 'section_name': 'History'},
 {'section_id': 'baseball:in_the_united_states',
  'section_name': 'In the United States'},
 {'section_id': 'baseball:establishment_of_professional_leagues',
  'section_nam

In [90]:
def read_section(section_id: str) -> str:
    """Read a section of a page.
    
    Args:
        section_id (str): The ID of the section to read.

    Returns:
        str: The content of the section.
        
    Examples:
        "baseball:finnish_baseball" -> "Finnish baseball is a sport that is played in Finland..."
    """
    # Parse section_id
    if ':' not in section_id:
        raise ValueError("Invalid section_id format. Expected: page_id:section_name")
    
    page_id, section_name_id = section_id.split(':', 1)
    
    # Get the file
    results = collection.get(ids=[page_id])
    if not results['ids']:
        raise ValueError(f"Page not found: {page_id}")
    
    filename = results['metadatas'][0]['title'] + '.md' # type: ignore
    filepath = os.path.join(WIKI_DIR, filename)
    
    with open(filepath, 'r', encoding='utf-8') as f:
        content = f.read()
    
    lines = content.split('\n')
    
    # Special case for "full" section
    if section_name_id == "full":
        return content
    
    # Find the section
    section_start = None
    section_end = None
    
    for i, line in enumerate(lines):
        if line.startswith('#'):
            current_section = line.lstrip('#').strip().lower().replace(' ', '_')
            if current_section == section_name_id and section_start is None:
                section_start = i
            elif section_start is not None and section_end is None:
                section_end = i
                break
    
    # If section found
    if section_start is not None:
        if section_end is None:
            section_end = len(lines)
        return '\n'.join(lines[section_start:section_end])
    else:
        raise ValueError(f"Section not found: {section_id}")
    
print(read_section("baseball:finnish_baseball"))

#### Finnish baseball

Finnish baseball, known as pesäpallo, is a combination of traditional ball-batting team games and North American baseball, invented by ["Tahko" Pihkala](Lauri)(Lauri Pihkala) in the 1920s. The basic idea of pesäpallo is similar to that of baseball: the offense tries to score by hitting the ball successfully and running through the bases, while the defense tries to put the batter and runners out. One of the most important differences between pesäpallo and baseball is that the ball is pitched vertically, which makes hitting the ball, as well as controlling the power and direction of the hit, much easier. This gives the offensive game more variety, speed, and tactical aspects compared to baseball.



In [92]:
# sample question where the model's default (no-tool) answer disagrees with our wiki docs

response = oai.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "What was the value of Microsoft's acquisition deal for ZeniMax Media in 2020?"}],
)
print(response.choices[0].message.content)

Microsoft's acquisition of ZeniMax Media in 2020 was valued at approximately $7.5 billion.


### Chaining tools via OpenAI Agents

In [93]:
# openai agents sdk
# uv add openai-agents

from agents import Agent, Runner, function_tool

agent = Agent(
    model="gpt-4.1-mini",
    name="wiki_agent",
    tools=[],
)

# no tools
result = Runner.run_sync(agent, "What was the value of Microsoft's acquisition deal for ZeniMax Media in 2020?")
print(result.final_output)

Microsoft's acquisition deal for ZeniMax Media in 2020 was valued at $7.5 billion.


In [95]:
from agents import Agent, Runner, function_tool # type: ignore


# async wrappers exposing the plain functions above as Agents SDK tools
@function_tool
async def search_pages_fn(query: str) -> list[dict]:
    """Search for top 10 relevant articles using title embedding similarity.
    
    Args:
        query (str): The query to search for.

    Returns:
        list[dict]: A list of dicts with page_id and title.

    Examples:
        "basketball" -> [{"page_id": "basketball", "title": "Basketball"}, {"page_id": "basketball_rules", "title": "Basketball Rules"}, ...]
    """
    return search_pages(query)

@function_tool
async def view_sections_fn(page_id: str) -> list[dict]:
    """View the sections of a page.
    
    Args:
        page_id (str): The ID of the page to view.

    Returns:
        list[dict]: A list of dicts with section_id and section_name.

    Examples:
        "basketball" -> [{"section_id": "basketball:history", "section_name": "History"}, ...]
    """
    return view_sections(page_id)

@function_tool
async def read_section_fn(section_id: str) -> str:
    """Read a section of a wiki page.
    
    Args:
        section_id (str): The ID of the section to read.

    Returns:
        str: The content of the section.    

    Examples:
        "basketball:history" -> "The history of basketball..."
    """
    return read_section(section_id)

agent = Agent(
    model="gpt-4.1-mini",
    name="wiki_agent",
    tools=[search_pages_fn, view_sections_fn, read_section_fn],
)

result = Runner.run_sync(agent, "What was the value of Microsoft's acquisition deal for ZeniMax Media in 2020?")
print(result.final_output)


The value of Microsoft's acquisition deal for ZeniMax Media in 2020 was $8.1 billion in cash. The acquisition was intended to expand the library of Xbox Game Pass and XCloud and was completed by March 9, 2021.
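
Decorators like `function_tool` save writing tool schemas by hand: the schema is derived from the function's signature and docstring. The sketch below shows a stripped-down version of that idea using `inspect`. It is an assumption-laden illustration, not the Agents SDK's implementation; `function_to_tool_schema`, `JSON_TYPES`, and `read_section_stub` are hypothetical, and only a few primitive types are mapped.

```python
import inspect
from typing import get_type_hints

# Minimal Python-to-JSON-Schema type mapping; anything else falls back to "string".
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_tool_schema(fn) -> dict:
    """Build a Chat Completions style tool definition from a function's signature and docstring."""
    hints = get_type_hints(fn)
    properties = {}
    required = []
    for name, param in inspect.signature(fn).parameters.items():
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # parameters without defaults are required
    doc = inspect.getdoc(fn) or ""
    summary = doc.splitlines()[0] if doc else ""
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": summary,
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


def read_section_stub(section_id: str, max_chars: int = 2000) -> str:
    """Read a section of a wiki page.

    Args:
        section_id (str): The ID of the section to read.
    """
    return ""


schema = function_to_tool_schema(read_section_stub)
print(schema)
```

The resulting dict has the same shape as the hand-written `get_weather` tool definition earlier in these notes, so it could be passed in a `tools=[...]` list.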


In [53]:
for n in result.new_items:
    print(n)

ToolCallItem(agent=Agent(name='wiki_agent', instructions=None, prompt=None, handoff_description=None, handoffs=[], model='gpt-4.1-mini', model_settings=ModelSettings(temperature=None, top_p=None, frequency_penalty=None, presence_penalty=None, tool_choice=None, parallel_tool_calls=None, truncation=None, max_tokens=None, reasoning=None, metadata=None, store=None, include_usage=None, extra_query=None, extra_body=None, extra_headers=None, extra_args=None), tools=[FunctionTool(name='search_pages_fn', description='Search for top 10 relevant articles using title embedding similarity.', params_json_schema={'properties': {'query': {'description': 'The query to search for.', 'title': 'Query', 'type': 'string'}}, 'required': ['query'], 'title': 'search_pages_fn_args', 'type': 'object', 'additionalProperties': False}, on_invoke_tool=<function function_tool.<locals>._create_function_tool.<locals>._on_invoke_tool at 0x11dd93d80>, strict_json_schema=True, is_enabled=True), FunctionTool(name='view_sec

### HF SmolAgents

In [96]:
from smolagents import OpenAIServerModel, CodeAgent, tool

model = OpenAIServerModel(model_id="gpt-4.1-mini")

@tool
def search_pages_tool(query: str) -> list[dict]:
    """Search for top 10 relevant articles using title embedding similarity.
    
    Args:
        query (str): The query to search for.
    """
    return search_pages(query)

@tool
def view_sections_tool(page_id: str) -> list[dict]:
    """View the sections of a page.
    
    Args:
        page_id (str): The ID of the page to view.
    """
    return view_sections(page_id)

@tool
def read_section_tool(section_id: str) -> str:
    """Read a section of a wiki page.
    
    Args:
        section_id (str): The ID of the section to read.
    """
    return read_section(section_id)

agent = CodeAgent(
    model=model,
    tools=[search_pages_tool, view_sections_tool, read_section_tool],
)

agent.run("What was the value of Microsoft's acquisition deal for ZeniMax Media in 2020?")

'$8.1 billion'

### dspy.ReAct

In [97]:
import dspy

lm = dspy.LM(model="gpt-4.1-mini")
dspy.configure(lm=lm)

react = dspy.ReAct(signature="question->answer", tools=[search_pages, view_sections, read_section])

result = react(question="What was the value of Microsoft's acquisition deal for ZeniMax Media in 2020?")
print(result)

Prediction(
    trajectory={'thought_0': 'I need to find information about Microsoft\'s acquisition of ZeniMax Media in 2020, specifically the value of the deal. I will start by searching for relevant pages about "Microsoft acquisition ZeniMax Media 2020".', 'tool_name_0': 'search_pages', 'tool_args_0': {'query': 'Microsoft acquisition ZeniMax Media 2020'}, 'observation_0': [{'page_id': 'xbox_game_studios', 'title': 'Xbox Game Studios'}, {'page_id': 'xbox', 'title': 'Xbox'}, {'page_id': 'xbox_series_x_and_series_s', 'title': 'Xbox Series X and Series S'}, {'page_id': 'insomniac_games', 'title': 'Insomniac Games'}, {'page_id': 'rockstar_games', 'title': 'Rockstar Games'}, {'page_id': 'frictional_games', 'title': 'Frictional Games'}, {'page_id': 'fallout_76', 'title': 'Fallout 76'}, {'page_id': 'gabe_newell', 'title': 'Gabe Newell'}, {'page_id': 'microsoft_teams', 'title': 'Microsoft Teams'}, {'page_id': "assassin's_creed_valhalla", 'title': "Assassin's Creed Valhalla"}], 'thought_1': 'T

In [98]:
result.answer

"Microsoft's acquisition deal for ZeniMax Media in 2020 was valued at $8.1 billion in cash."

### Letta - Stateful LLM Agents

- https://docs.letta.com/guides/agents/custom-tools

Pros: 
- First-class support for memory, can locally host memory servers

Cons:
- Agents aren't always great at state management
- Can be hard to eval

### Generating Questions

In [41]:
import os
import random
import json
import asyncio
import nest_asyncio
from tqdm.notebook import tqdm
from openai import AsyncOpenAI

# Enable asyncio in Jupyter
nest_asyncio.apply()

# Initialize async OpenAI client
openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Semaphore to limit concurrent requests
semaphore = asyncio.Semaphore(3)

async def generate_questions_for_file(filepath: str, n_questions: int = 5) -> list[dict]:
    """
    Generate N question-answer pairs for a given wiki file using gpt-4.1-mini.
    Returns list of dicts with question, answer, and filename.
    """
    async with semaphore:  # Limit concurrent requests
        # Read file content directly
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read()
        
        filename = os.path.basename(filepath)
        
        # Prompt for the question-generation model
        prompt = f"""Given the following article content, generate {n_questions} question-answer pairs.

Requirements:
- Questions should be one sentence about a specific fact contained in the article
    - They should be framed as a general trivia question (the question reader will not see the article OR title OR any other information about the article)
    - Questions should be "fair game" for advanced pub trivia -- requiring potentially deep obscure knowledge or factual recall or search, but "self-contained" (without making reference to the article)
- Answers should be just a few words (1-5 words typically)
- Return as a JSON object with a "questions" list containing dicts with "question" and "answer" fields

Article content:
{content[:50000]}

Schema: 
{{
    "questions": [
        {{
            "question": "question text",
            "answer": "answer text"
        }},
        ...
    ]
}}

Return ONLY the JSON object, no other text."""
        
        # Call gpt-4.1-mini
        response = await openai_client.chat.completions.create(
            model="gpt-4.1-mini",
            messages=[
                {"role": "system", "content": "You are a helpful assistant that generates factual question-answer pairs from articles."},
                {"role": "user", "content": prompt}
            ],
            response_format={"type": "json_object"},
            temperature=0.7
        )
        
        # Parse response
        try:
            response_content = response.choices[0].message.content
            if not response_content:
                return []
            response_json = json.loads(response_content)
            # Handle different possible JSON structures
            if isinstance(response_json, list):
                qa_pairs = response_json
            elif isinstance(response_json, dict):
                # Try common keys
                qa_pairs = response_json.get('pairs', response_json.get('questions', response_json.get('data', [])))
                if not isinstance(qa_pairs, list):
                    qa_pairs = []
            else:
                qa_pairs = []
        except json.JSONDecodeError:
            return []
        
        # Add metadata to each pair
        results = []
        for pair in qa_pairs:
            results.append({
                "question": pair["question"],
                "answer": pair["answer"],
                "filename": filename
            })
        
        return results

async def generate_random_questions(n_pages: int = 3, questions_per_page: int = 3) -> list[dict]:
    """
    Generate questions for N random wiki pages using parallel processing.
    Works directly with files, no database needed.
    Returns consolidated list of all question-answer pairs.
    """
    # Get all wiki files directly from directory
    wiki_files = [f for f in os.listdir(WIKI_DIR) if f.endswith('.md')]
    
    # Sample random files
    selected_files = random.sample(wiki_files, min(n_pages, len(wiki_files)))
    
    # Create tasks for parallel processing
    tasks = []
    for filename in selected_files:
        filepath = os.path.join(WIKI_DIR, filename)
        task = generate_questions_for_file(filepath, questions_per_page)
        tasks.append(task)
    
    # Execute all tasks in parallel with progress bar
    all_results = []
    with tqdm(total=len(tasks), desc="Generating questions") as pbar:
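        for coro in asyncio.as_completed(tasks):
            try:
                result = await coro
                all_results.append(result)
            except Exception as e:
                # Skip files that fail, but report the error instead of dropping it silently
                print(f"Skipping file after error: {e}")
            finally:
                pbar.update(1)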

    # Flatten results
    all_questions = []
    for questions in all_results:
        all_questions.extend(questions)

    return all_questions

# Example usage
async def main():
    n_pages = 5
    questions = await generate_random_questions(n_pages=n_pages, questions_per_page=5)
    print(f"\nGenerated {len(questions)} total questions")
    
    for i, q in enumerate(questions): 
        print(f"\n{i+1}. Q: {q['question']}")
        print(f"   A: {q['answer']}")
        print(f"   File: {q['filename']}")
    
    return questions

# Run the async function
questions = await main()


Generated 25 total questions

1. Q: Which 1956 album by Harry Belafonte was the first LP to sell over one million copies?
   A: Calypso
   File: Harry Belafonte.md

2. Q: What song is Harry Belafonte best known for that includes the signature lyric 'Day-O'?
   A: The Banana Boat Song
   File: Harry Belafonte.md

3. Q: Which civil rights leader was Harry Belafonte a close confidant of during the 1950s and 1960s?
   A: Martin Luther King Jr.
   File: Harry Belafonte.md

4. Q: In which 2018 Spike Lee film did Harry Belafonte make his final screen appearance?
   A: BlacKkKlansman
   File: Harry Belafonte.md

5. Q: What was Harry Belafonte's birth name?
   A: Harold George Bellanfanti Jr.
   File: Harry Belafonte.md

6. Q: In which film series did Linda Hamilton play the character Sarah Connor?
   A: Terminator
   File: Linda Hamilton.md

7. Q: What medical condition did Linda Hamilton publicly discuss having, which affected her marriages?
   A: Bipolar disorder
   File: Linda Hamilton.md


In [99]:
"Alex Desert" == "Alex Désert"

False
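The comparison above fails only because of the accent in "Désert", which is the kind of near-miss that makes raw exact match a brittle grader. One cheap mitigation is to normalize both strings before comparing. The sketch below is an illustration rather than part of the original notebook: `normalize_answer` and `loose_match` are hypothetical helper names, and the choice of what to strip (accents, ASCII punctuation, case, extra whitespace) is an assumption you would tune for your own data.

```python
import string
import unicodedata

def normalize_answer(text: str) -> str:
    """Lowercase, strip accents and ASCII punctuation, and collapse whitespace."""
    # NFKD splits accented characters into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop ASCII punctuation, then collapse runs of whitespace
    without_punct = without_accents.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punct.casefold().split())

def loose_match(expected: str, actual: str) -> bool:
    """Exact match after normalization (hypothetical helper)."""
    return normalize_answer(expected) == normalize_answer(actual)

print(loose_match("Alex Desert", "Alex Désert"))                       # accent only
print(loose_match("Martin Luther King Jr.", "martin luther king jr"))  # case + punctuation
print(loose_match("$8.1B", "8.1 billion"))                             # paraphrase
```

The first two pairs match after normalization; the third does not, since "$8.1B" and "8.1 billion" differ in wording rather than surface form. Paraphrases like that are where similarity scores or an LLM judge come in.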

### LLM Judges

- Comparing vs. ground truth
- Judging semantic properties
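
As a rough illustration of the first bullet, a ground-truth judge can be as small as a prompt template plus a parser that turns the judge's reply into a score. This is a sketch under assumptions, not the API of any library used in this notebook: `JUDGE_TEMPLATE`, `build_judge_prompt`, and `parse_verdict` are hypothetical names, and the yes/no output format is one possible design. The model call itself is left out so the example runs offline.

```python
JUDGE_TEMPLATE = """You are grading a trivia answer.

Question: {question}
Ground-truth answer: {answer}
Submitted response: {response}

Does the response convey the same answer as the ground truth?
Reply with exactly one word: "yes" or "no"."""

def build_judge_prompt(question: str, answer: str, response: str) -> str:
    """Fill the (hypothetical) judge template for one test case."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer, response=response)

def parse_verdict(judge_output: str) -> float:
    """Map the judge's reply to a score: 1.0 for "yes", 0.0 for anything else."""
    words = judge_output.strip().lower().split()
    first = words[0].strip(".,!\"'") if words else ""
    return 1.0 if first == "yes" else 0.0

prompt = build_judge_prompt(
    question="Which 1956 album by Harry Belafonte was the first LP to sell over one million copies?",
    answer="Calypso",
    response="That was his album Calypso.",
)
print(prompt)
print(parse_verdict("Yes."))  # score for a judge reply of "Yes."
```

Forcing a one-word verdict keeps parsing trivial; treating anything other than "yes" as a failure errs on the side of under-crediting when the judge goes off-format.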

### DeepEval

- https://deepeval.com/docs/metrics-introduction
- G-Eval paper: https://arxiv.org/abs/2303.16634

In [109]:
from deepeval import evaluate, assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams # type: ignore
from deepeval.metrics import GEval # type: ignore

def test_correctness(question, answer, response):
    correctness_metric = GEval(
        name="Correctness",
        model="gpt-4.1-mini",
        criteria="Determine if the 'actual output' is equal to the 'expected output'.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
        threshold=0.5
    )
    test_case = LLMTestCase(
        input=question,
        actual_output=response,
        expected_output=answer
    )
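    # single test case: raises AssertionError if the score falls below the threshold
    assert_test(test_case, [correctness_metric])
    # for many tests (not reached when assert_test raises above)
    evaluate(test_cases=[test_case], metrics=[correctness_metric])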

test_correctness(
    question="What was the value of Microsoft's acquisition deal for ZeniMax Media in 2020?",
    response="$8.1 billion.",
    answer="The value of Microsoft's acquisition deal for ZeniMax Media in 2020 was $8.1 billion."
)

AssertionError: Metrics: Correctness (GEval) (score: 0.0, threshold: 0.5, strict: False, error: None, reason: The actual output does not match the expected output exactly; it is incomplete and missing the full sentence, including the subject, context, and punctuation, resulting in a failure to meet the exact match criteria.) failed.

### verifiers (my own)

In [57]:
import verifiers as vf

system_prompt = """
You are a search agent who has access to the following tools for searching over a set of Wikipedia articles:

{tool_descriptions}

You may make up to 10 tool calls before giving your final answer.

In each turn, respond in the following format:
<think>
[your thoughts here]
</think>
<tool>
{{
    "name": "search_pages", # name of the tool to call
    "args": {{
        "query": "query" # arguments to pass to the tool
    }}
}}
</tool>

When you have found the answer, respond in the following format:
<think>
[your thoughts here]
</think>
<answer>
[final answer here]
</answer>
"""

tools = [
    search_pages,
    view_sections,
    read_section,
]

from datasets import load_dataset # type: ignore
dataset = load_dataset("willcb/wiki-trivia-questions", split="train").select(range(10))

from openai import OpenAI # type: ignore
from verifiers.rubrics.judge_rubric import JudgeRubric
judge_client = OpenAI()
judge_model = "gpt-4.1-nano"
judge_rubric = JudgeRubric(
    judge_client=judge_client,
    judge_model=judge_model
)

vf_env = vf.ToolEnv(
    dataset=dataset,
    system_prompt=system_prompt,
    tools=tools,
    max_turns=11,
)
vf_env.rubric = vf.RubricGroup(rubrics=[judge_rubric, vf_env.rubric])

num_proc must be <= 10. Reducing num_proc to 10 for dataset of size 10.
Setting TOKENIZERS_PARALLELISM=false for forked processes.


return_description: list[dict]: A list of dicts with page_id and title. (list)
return_description: list[dict]: A list of dicts with section_id and section_name. (list)
return_description: str: The content of the section. (str)


2025-06-17 16:48:11 - verifiers.rubrics.RubricGroup - INFO - Initialized RubricGroup with 2 rubrics


In [103]:
results = vf_env.evaluate(
    client=OpenAI(),
    model="gpt-4.1",
    max_turns=11,
    max_concurrent=3,
)

print(results)


2025-06-17 18:15:33 - verifiers.envs.ToolEnv - INFO - eval_dataset is not set, falling back to train dataset


RateLimitError: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1 in organization org-vRvYgphPIJK9CQHj6eRpmU56 on tokens per min (TPM): Limit 30000, Used 29750, Requested 1046. Please try again in 1.592s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
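
The 429 above is a tokens-per-minute limit on the evaluated model. Lowering `max_concurrent` is one option; another general mitigation, independent of `verifiers`, is to retry with exponential backoff and jitter. The sketch below is generic and runs offline: `with_backoff` and `flaky_call` are hypothetical names, and a `RuntimeError` stands in for the real exception. With the OpenAI client you would likely pass `retry_on=(openai.RateLimitError,)` instead.

```python
import random
import time

def with_backoff(fn, retry_on=(Exception,), max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the original error
            # 1x, 2x, 4x, ... the base delay, with up to 10% random jitter
            delay = base_delay * (2 ** attempt) * (1 + 0.1 * random.random())
            sleep(delay)

# Stand-in for an API call that is rate limited twice before succeeding
calls = {"n": 0}

def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429: rate limit reached")
    return "ok"

print(with_backoff(flaky_call, retry_on=(RuntimeError,), base_delay=0.01))
```

The `sleep` parameter is injectable so the wait schedule can be checked in tests without actually sleeping.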

In [10]:
# difflib similarity

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similarity("The value of Microsoft's acquisition deal for ZeniMax Media in 2020 was $8.1 billion.", "The total value of the deal was $8.1 billion."))


0.5692307692307692


In [108]:
# embedding similarity with OAI text-embedding-3-small

from openai import OpenAI
import numpy as np

client = OpenAI()

answer_embedding = client.embeddings.create(
    input="$8.1B",
    model="text-embedding-3-small"
)

response_embedding = client.embeddings.create(
    input="8.1 billion.",
    model="text-embedding-3-small"
)

a = np.array(answer_embedding.data[0].embedding)
b = np.array(response_embedding.data[0].embedding)

# dot product
print(np.dot(a, b))

# cosine similarity -- same as the dot product if the embeddings are already normalized
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# euclidean distance
print(np.linalg.norm(a - b))

0.6368027313804879
0.6368027003436281
0.8522878825351152
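
The three numbers above are closely related. The dot product and the cosine similarity agree to about seven decimal places, which suggests the returned embeddings are very nearly unit length. For unit vectors, the squared Euclidean distance equals 2 minus twice the dot product, so the distance carries the same ranking information as cosine similarity. A quick check using only the values printed above (the tolerance is an arbitrary choice):

```python
import math

dot = 0.6368027313804879        # dot product printed above
euclidean = 0.8522878825351152  # Euclidean distance printed above

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b) = 2 - 2 * (a . b)
predicted = math.sqrt(2 - 2 * dot)

print(predicted, euclidean)
print(math.isclose(predicted, euclidean, abs_tol=1e-6))
```

In practice this means you can pick whichever of the three metrics your vector store supports without changing nearest-neighbor rankings, as long as the embeddings are normalized.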
