# Lab 1: The Anatomy of a Decision
## Prompt Structure and Tool Selection Accuracy

**Series**: Agentic Engineering Crash Course 
**Module**: 1 — Tokenization & Logit Control (Understanding Stochasticity) 
**Prerequisites**: Python 3.10+, OpenAI API key 

---

**Suggested time**: 45–60 min.  
**Experiments**: Baseline (required). Exploration: Experiments 1–3 required; Experiment 4 optional; Experiment 5 bonus.

For definitions of key terms (e.g. logits, token, context window, tool call), see [Glossary](Glossary.md).

### Learning Objectives

By the end of this lab you will be able to:

1. **Explain** why prompt structure changes which tool a language model selects.
2. **Observe** that tool choice is a discrete decision governed by a conditional probability distribution over tokens.
3. **Run controlled experiments** that isolate the effect of definition order, prompt clarity, temperature, and format drift on tool selection.
4. **Connect** these observations to real debugging workflows for EOP (Evidence-Oriented Programming) Agents.

---

## Before You Start

If you are new to LLMs and agents, these minimal concepts will help you follow the lab:

- **Prompt**: The text (and structure) you send to the model. It usually includes a system message (instructions, tool list) and a user message (the current request). The model's reply is *conditioned* on this entire prompt.
- **LLM (Large Language Model)**: A model that takes text as input and produces text (or structured output) as output. It has no memory between calls; each response depends only on what you send in that request.
- **API call**: You use a service (e.g. OpenAI) by sending an HTTP request with your prompt and API key. The service runs the model and returns the completion. This lab uses the OpenAI Python client to make these calls.
- **Token**: The basic unit of text the model processes (roughly words or subwords). Models have a fixed **context window** (max tokens per request); your prompt and response must fit within it.

You do **not** need prior knowledge of neural networks, attention, or softmax to complete this lab.

---
## 1. Theoretical Why: How Prompts Shape Tool Selection

### 1.1 Tool Selection as Conditional Decoding

When an LLM is presented with a set of tools and a user query, it does not "choose" a tool the way a programmer writes an `if/else` branch. Instead, it produces a **probability distribution over the next tokens**, conditioned on the entire prompt context:

$$P(\text{tool}_i \mid \text{system message}, \text{tool definitions}, \text{user query})$$

The "tool selection" is the result of decoding (sampling or argmax) from this distribution. Change the conditioning context — reorder the tool list, rephrase the user query, alter the system message — and you change the distribution.
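
To make the idea of a conditioned distribution concrete, here is a minimal sketch. The scores are invented for illustration and stand in for whatever the model computes internally; a real model scores tokens, not whole tool names, so read this as a toy picture rather than how any particular model works.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into a probability distribution."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - peak) for name, score in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical scores for two different prompt contexts (assumed values).
logits_clear_query = {"get_weather": 4.0, "search_docs": 0.5}
logits_vague_query = {"get_weather": 1.1, "search_docs": 1.3}

for label, logits in [("clear", logits_clear_query), ("vague", logits_vague_query)]:
    probs = softmax(logits)
    best = max(probs, key=probs.get)  # greedy (argmax) decoding
    print(label, {k: round(v, 3) for k, v in probs.items()}, "->", best)
```

The same two tools appear in both cases; only the conditioning context differs, and with it the shape of the distribution and the argmax.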

### 1.2 Key Mechanisms

**Context window and attention.** The model's self-attention layers assign different weights to different parts of the prompt. Tool definitions placed at the end of the system message may receive stronger attention than those buried in the middle (recency bias). Definitions placed first may benefit from primacy effects depending on the model architecture and fine-tuning.

**Format alignment.** Models are fine-tuned on specific prompt formats (e.g., `{"role": "system", "content": "..."}` for chat models). If your prompt deviates from the expected format — for example, by omitting the explicit instruction to respond with a tool call — the model may fall back to free-text generation instead of structured tool invocation. We call this **format drift**.

**Stochasticity.** At temperature > 0, the model *samples* from the distribution rather than taking the argmax. The same prompt can yield different tool selections across runs. Prompt design reduces variance by making the correct tool's probability mass dominant.
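
The effect of temperature on sampling can be sketched in a few lines. The function name `sample_with_temperature` and the toy scores are assumptions made for this illustration; the sketch operates on whole tool names, whereas a real model samples token by token.

```python
import math
import random
from collections import Counter


def sample_with_temperature(
    logits: dict[str, float], temperature: float, rng: random.Random
) -> str:
    """Pick one option; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(logits, key=logits.get)
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the scores.
    scaled = {name: score / temperature for name, score in logits.items()}
    peak = max(scaled.values())
    weights = [math.exp(score - peak) for score in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Hypothetical scores in which get_weather is moderately favored.
toy_logits = {"get_weather": 2.0, "search_docs": 1.0}
rng = random.Random(0)  # fixed seed so reruns give the same tallies

for t in (0.0, 0.3, 0.7, 1.2):
    tally = Counter(sample_with_temperature(toy_logits, t, rng) for _ in range(1000))
    print(f"temp={t}: {dict(tally)}")
```

As the temperature rises, the less likely option should be drawn more often, which is the variance that prompt design tries to suppress.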

### 1.3 Maintenance Connection (Preview)

When an EOP Agent selects the wrong tool in production, the diagnostic checklist starts here:

1. **Prompt structure** — Is the system message well-formed? Are tool definitions ordered and described clearly?
2. **Sampling parameters** — Is temperature set appropriately for this decision point?
3. **Format contract** — Does the prompt enforce the expected output format?

This lab gives you the empirical foundation to reason about each of these.

---
## 2. Setup

In [None]:
# --- Cell: Install dependencies ---
!pip install -q openai

In [None]:
# --- Cell: Imports and API key ---
import os
import json
import re
from collections import Counter
from getpass import getpass

from openai import OpenAI

# Prompt for API key (never hard-code secrets)
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = OpenAI()

MODEL = "gpt-4o-mini"  # Affordable, sufficient for this lab
print(f"Using model: {MODEL}")

---
## 3. Baseline Code: Minimal Tool Selection

We define two simple tools as plain-text descriptions in the system prompt, ask the model to reply with a structured `TOOL: <name>` line, and parse the result. No framework — just raw prompt engineering.

In [None]:
# --- Cell: Tool definitions and prompt template ---

TOOLS = {
    "get_weather": "Retrieve the current weather for a given city. Use when the user asks about weather, temperature, or forecast.",
    "search_docs": "Search internal documentation by keyword. Use when the user asks about policies, procedures, or technical references.",
}


def build_system_prompt(tools: dict[str, str]) -> str:
    """Build a system message listing the available tools."""
    tool_block = "\n".join(
        f"  {i+1}. {name} — {desc}" for i, (name, desc) in enumerate(tools.items())
    )
    return (
        "You are a tool-routing assistant. You have the following tools:\n"
        f"{tool_block}\n\n"
        "Given the user's message, decide which single tool to invoke.\n"
        "Reply with exactly one line in the format:\n"
        "TOOL: <tool_name>\n"
        "Do not include any other text."
    )


def parse_tool_choice(response_text: str) -> str | None:
    """Extract the tool name from a 'TOOL: <name>' response."""
    match = re.search(r"TOOL:\s*(\S+)", response_text, re.IGNORECASE)
    return match.group(1) if match else None


# Preview the system prompt
system_prompt = build_system_prompt(TOOLS)
print(system_prompt)

In [None]:
# --- Cell: Single tool-selection call ---

def select_tool(
    user_message: str,
    tools: dict[str, str],
    temperature: float = 0.0,
    model: str = MODEL,
) -> dict:
    """
    Send a prompt to the model and return the raw response,
    the parsed tool choice, and the prompt used.
    """
    system = build_system_prompt(tools)
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        max_tokens=30,
    )
    # message.content can be None (e.g. on a refusal), so fall back to "".
    text = (response.choices[0].message.content or "").strip()
    return {
        "user_message": user_message,
        "raw_response": text,
        "parsed_tool": parse_tool_choice(text),
        "temperature": temperature,
    }


# --- Baseline run ---
result = select_tool("What's the weather in New York?", TOOLS)
print(f"User message : {result['user_message']}")
print(f"Raw response : {result['raw_response']}")
print(f"Parsed tool  : {result['parsed_tool']}")

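**Expected output**: `TOOL: get_weather`. The prompt is clear, the user query unambiguously matches one tool, and temperature is 0, so decoding is greedy (argmax). Hosted APIs do not guarantee bit-identical outputs even at temperature 0, so treat this as near-deterministic rather than strictly deterministic.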

> **Record**: Note the parsed tool. This is our **control** result.

---
## 4. Exploration Lab: When Decisions Break

Each experiment below isolates a single variable. Run each one, record the results, and compare to the baseline.

### Helper: Batch Runner

We'll reuse this helper to run the same query N times and tally tool selections.

In [None]:
# --- Cell: Batch helper ---

def run_batch(
    user_message: str,
    tools: dict[str, str],
    n: int = 10,
    temperature: float = 0.0,
    label: str = "",
) -> dict:
    """
    Run `select_tool` n times and return a frequency table of parsed tool choices.
    """
    results = [
        select_tool(user_message, tools, temperature=temperature)
        for _ in range(n)
    ]
    choices = [r["parsed_tool"] for r in results]
    freq = Counter(choices)
    print(f"\n--- {label or user_message} (n={n}, temp={temperature}) ---")
    for tool, count in freq.most_common():
        pct = count / n * 100
        print(f"  {tool or 'PARSE_FAIL'}: {count}/{n} ({pct:.0f}%)")
    # show a sample raw response for inspection
    print(f"  Sample raw: {results[0]['raw_response']}")
    return {"label": label, "freq": dict(freq), "sample_raw": results[0]["raw_response"]}

### Experiment 1: Definition Order Sensitivity

**Hypothesis**: Swapping the order of tool definitions in the system prompt may bias the model toward one tool over another, even when the user query is unambiguous.

**Variable**: Order of tool definitions. 
**Control**: Everything else held constant (same user message, temperature = 0).

In [None]:
# --- Cell: Experiment 1 — Order sensitivity ---

# Original order: get_weather first
tools_order_A = {
    "get_weather": TOOLS["get_weather"],
    "search_docs": TOOLS["search_docs"],
}

# Reversed order: search_docs first
tools_order_B = {
    "search_docs": TOOLS["search_docs"],
    "get_weather": TOOLS["get_weather"],
}

query = "What's the weather in New York?"

res_A = run_batch(query, tools_order_A, n=5, temperature=0.0, label="Order A (weather first)")
res_B = run_batch(query, tools_order_B, n=5, temperature=0.0, label="Order B (docs first)")

> **Observe**: With temperature=0 and an unambiguous query, order likely has no effect — both return `get_weather` 100% of the time. This is the **easy case**. Order effects emerge with ambiguous queries (Experiment 2) or higher temperature (Experiment 3).
>
> **Record**: Were the results identical across orders? Document yes/no.
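
One way to answer "were the results identical?" with a number instead of a yes/no is to compare the two frequency tables directly. The sketch below uses total variation distance; the helper names and the tallies are hypothetical, not output from an actual run.

```python
def to_probs(freq: dict) -> dict:
    """Normalize a frequency table (tool -> count) into probabilities."""
    total = sum(freq.values())
    return {tool: count / total for tool, count in freq.items()}


def total_variation(freq_a: dict, freq_b: dict) -> float:
    """Return 0.0 for identical distributions and 1.0 for disjoint ones."""
    p, q = to_probs(freq_a), to_probs(freq_b)
    tools = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in tools)


# Hypothetical tallies shaped like the "freq" field returned by run_batch.
freq_order_a = {"get_weather": 5}
freq_order_b = {"get_weather": 4, "search_docs": 1}

print(total_variation(freq_order_a, freq_order_a))
print(total_variation(freq_order_a, freq_order_b))
```

With only 5 runs per condition, small distances are well within noise; a larger `n` is needed before reading much into a nonzero value.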

### Experiment 2: Vague vs. Clear User Prompt

**Hypothesis**: A vague user query increases ambiguity in the conditional distribution, leading to less consistent (and potentially wrong) tool selections.

**Variable**: User message clarity. 
**Control**: Same tools, same temperature (0.7 to reveal variance).

In [None]:
# --- Cell: Experiment 2 — Vague vs. clear prompt ---

CLEAR_QUERY = "What's the weather in New York?"
VAGUE_QUERY = "Help me"
AMBIGUOUS_QUERY = "I need information"  # Could be weather OR docs

res_clear = run_batch(CLEAR_QUERY, TOOLS, n=10, temperature=0.7, label="Clear query")
res_vague = run_batch(VAGUE_QUERY, TOOLS, n=10, temperature=0.7, label="Vague query")
res_ambig = run_batch(AMBIGUOUS_QUERY, TOOLS, n=10, temperature=0.7, label="Ambiguous query")

> **Observe**: 
> - The **clear query** should still yield `get_weather` in the vast majority of runs, even at temperature 0.7.
> - The **vague query** ("Help me") has no signal favoring either tool. Expect a mixed distribution — sometimes `get_weather`, sometimes `search_docs`, possibly parse failures.
> - The **ambiguous query** ("I need information") may lean toward `search_docs` but with nontrivial variance.
>
> **Record**: Tally each distribution. This demonstrates that prompt clarity is a **variance reducer**.
>
> **Implication for EOP Agents**: If the user-facing interface allows freeform input, the agent system prompt must compensate with disambiguation instructions or a fallback/clarification tool.
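
A fallback tool of the kind mentioned above might look like the following sketch. The `clarify` tool, its description, and the extra instruction line are assumptions added for illustration; they are not part of the lab's baseline prompt.

```python
# The two baseline tools plus a hypothetical "clarify" fallback.
TOOLS_WITH_FALLBACK = {
    "get_weather": "Retrieve the current weather for a given city. Use when the user asks about weather, temperature, or forecast.",
    "search_docs": "Search internal documentation by keyword. Use when the user asks about policies, procedures, or technical references.",
    "clarify": "Ask the user a follow-up question. Use when the request does not clearly match any other tool.",
}


def build_prompt_with_fallback(tools: dict[str, str]) -> str:
    """Like the baseline system prompt, with an explicit escape hatch."""
    tool_block = "\n".join(
        f"  {i}. {name}: {desc}"
        for i, (name, desc) in enumerate(tools.items(), start=1)
    )
    return (
        "You are a tool-routing assistant. You have the following tools:\n"
        f"{tool_block}\n\n"
        "If the user's message is too vague to match a tool, choose clarify.\n"
        "Reply with exactly one line in the format:\n"
        "TOOL: <tool_name>\n"
        "Do not include any other text."
    )


print(build_prompt_with_fallback(TOOLS_WITH_FALLBACK))
```

Rerunning the vague and ambiguous queries against a prompt like this would show whether the extra option absorbs the probability mass that was previously split between the two real tools.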

### Experiment 3: Temperature as a Variance Dial

**Hypothesis**: Increasing temperature spreads probability mass across tool choices, increasing variance. At temperature = 0, decoding is greedy (argmax), so results should be near-deterministic.

**Variable**: Temperature $\in \{0.0, 0.3, 0.7, 1.2\}$.
**Control**: Same tools, same user message (deliberately slightly ambiguous).

In [None]:
# --- Cell: Experiment 3 — Temperature sweep ---

PROBE_QUERY = "I need some information about conditions outside."
# Deliberately ambiguous: "conditions outside" could map to weather or docs.

temps = [0.0, 0.3, 0.7, 1.2]
temp_results = {}

for t in temps:
    temp_results[t] = run_batch(
        PROBE_QUERY, TOOLS, n=10, temperature=t, label=f"temp={t}"
    )

> **Observe**:
> - At `temperature=0.0`, the result should be identical, or nearly so, across all 10 runs (greedy decoding).
> - As temperature increases, you should see the minority tool appearing more frequently.
> - At very high temperature (1.2), parse failures may appear — the model's output becomes less structured.
>
> **Record**: For each temperature, note the majority tool and its percentage. Plot mentally (or literally) the relationship: *temperature vs. selection entropy*.
>
> **Implication for EOP Agents**: Critical routing decisions (e.g., "escalate to human" vs. "auto-resolve") should use low temperature. Creative generation steps can tolerate higher temperature.
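
Selection entropy can be computed directly from a frequency table. The sketch below uses Shannon entropy in bits; the sweep data is hypothetical and only shows the shape of the calculation, not results from a real run.

```python
import math


def selection_entropy(freq: dict) -> float:
    """Shannon entropy (in bits) of a tool -> count frequency table."""
    total = sum(freq.values())
    entropy = 0.0
    for count in freq.values():
        if count:  # skip zero counts to avoid log2(0)
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical tallies; the None key stands for a parse failure.
hypothetical_sweep = {
    0.0: {"get_weather": 10},
    0.7: {"get_weather": 7, "search_docs": 3},
    1.2: {"get_weather": 5, "search_docs": 4, None: 1},
}

for temp, freq in hypothetical_sweep.items():
    print(f"temp={temp}: entropy={selection_entropy(freq):.2f} bits")
```

An entropy of 0 bits means every run agreed; for two tools the maximum is 1 bit, reached when the split is exactly even.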

### Experiment 4: Format Drift

**Hypothesis**: Removing the explicit format instruction (`TOOL: <name>`) from the system prompt causes the model to revert to free-text, breaking the parser.

**Variable**: Presence of the format instruction. 
**Control**: Same tools, same user message, temperature = 0.

In [None]:
# --- Cell: Experiment 4 — Format drift ---

def build_system_prompt_no_format(tools: dict[str, str]) -> str:
    """System prompt WITHOUT the explicit format instruction."""
    tool_block = "\n".join(
        f"  {i+1}. {name} — {desc}" for i, (name, desc) in enumerate(tools.items())
    )
    return (
        "You are a helpful assistant. You have access to the following tools:\n"
        f"{tool_block}\n\n"
        "Please help the user with their request."
        # NOTE: No format instruction. No "Reply with TOOL: <name>".
    )


def select_tool_no_format(
    user_message: str,
    tools: dict[str, str],
    temperature: float = 0.0,
    model: str = MODEL,
) -> dict:
    """Like select_tool but uses the broken prompt."""
    system = build_system_prompt_no_format(tools)
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user_message},
        ],
        max_tokens=100,  # more tokens since free-text may be verbose
    )
    # message.content can be None (e.g. on a refusal), so fall back to "".
    text = (response.choices[0].message.content or "").strip()
    return {
        "user_message": user_message,
        "raw_response": text,
        "parsed_tool": parse_tool_choice(text),  # will likely return None
        "temperature": temperature,
    }


# Run structured vs. unstructured
query = "What's the weather in New York?"

print("=== WITH format instruction ===")
for _ in range(3):
    r = select_tool(query, TOOLS, temperature=0.0)
    print(f"  Parsed: {r['parsed_tool']!r:20s} Raw: {r['raw_response']}")

print("\n=== WITHOUT format instruction ===")
for _ in range(3):
    r = select_tool_no_format(query, TOOLS, temperature=0.0)
    print(f"  Parsed: {r['parsed_tool']!r:20s} Raw: {r['raw_response'][:120]}")

> **Observe**:
> - With the format instruction, the parser succeeds every time.
> - Without it, the model likely responds with a natural-language sentence (e.g., "I can help you check the weather..."). The parser returns `None` — a **silent failure**.
>
> **This is format drift**: the interface contract between the prompt and the parser is broken. The model is not wrong (it understood the user's intent), but the *system* fails because the output is not machine-parseable.
>
> **Implication for EOP Agents**: Every tool-routing prompt needs an explicit, tested format contract. When debugging "the agent didn't call any tool," check for format drift *before* blaming the model.

### Experiment 5 (Bonus): Tool Count Scaling

What happens when we add more tools? Does selection accuracy degrade?

In [None]:
# --- Cell: Experiment 5 — Scaling tool count ---

MANY_TOOLS = {
    "get_weather": "Retrieve the current weather for a given city.",
    "search_docs": "Search internal documentation by keyword.",
    "send_email": "Send an email to a specified recipient.",
    "create_ticket": "Create a support ticket in the ticketing system.",
    "run_query": "Execute a SQL query against the analytics database.",
    "schedule_meeting": "Schedule a calendar meeting with attendees.",
    "translate_text": "Translate text from one language to another.",
    "summarize_page": "Summarize a given Confluence page.",
}

query_weather = "What's the weather in Tokyo?"
query_ticket  = "I need to report a bug in the login flow."
query_vague   = "Can you help me with something?"

print("--- 8 tools, clear weather query ---")
run_batch(query_weather, MANY_TOOLS, n=5, temperature=0.0, label="8 tools / weather")

print("\n--- 8 tools, clear ticket query ---")
run_batch(query_ticket, MANY_TOOLS, n=5, temperature=0.0, label="8 tools / ticket")

print("\n--- 8 tools, vague query, temp=0.7 ---")
run_batch(query_vague, MANY_TOOLS, n=10, temperature=0.7, label="8 tools / vague")

> **Observe**:
> - Clear queries should still route correctly even with 8 tools.
> - Vague queries with many tools at higher temperature will show increased dispersion — the model "spreads its bets" across more candidates.
>
> **Record**: Compare the vague-query distribution between 2 tools (Experiment 2) and 8 tools (here). More tools + vague query = higher entropy.
>
> **Scaling insight**: As an EOP Agent grows from 5 to 50 tools, maintaining selection accuracy requires: (a) clear, non-overlapping tool descriptions, (b) hierarchical routing (categories → tools), and (c) logging tool-selection distributions for regression testing.
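
Hierarchical routing can be sketched as two smaller decisions: first pick a category, then pick a tool within it. The category names and groupings below are invented for illustration, as are the helper names; only the tool names and descriptions come from the 8-tool experiment.

```python
# Tool descriptions copied from the 8-tool experiment.
ALL_TOOLS = {
    "get_weather": "Retrieve the current weather for a given city.",
    "search_docs": "Search internal documentation by keyword.",
    "send_email": "Send an email to a specified recipient.",
    "create_ticket": "Create a support ticket in the ticketing system.",
    "run_query": "Execute a SQL query against the analytics database.",
    "schedule_meeting": "Schedule a calendar meeting with attendees.",
    "translate_text": "Translate text from one language to another.",
    "summarize_page": "Summarize a given Confluence page.",
}

# Hypothetical grouping (an assumption, not part of the lab).
CATEGORIES = {
    "lookup": ["get_weather", "search_docs", "run_query"],
    "communication": ["send_email", "schedule_meeting", "create_ticket"],
    "text_processing": ["translate_text", "summarize_page"],
}


def build_category_prompt(categories: dict[str, list[str]]) -> str:
    """Stage 1: ask the model for a category rather than a tool."""
    lines = "\n".join(
        f"  {i}. {name} (covers: {', '.join(members)})"
        for i, (name, members) in enumerate(categories.items(), start=1)
    )
    return (
        "You are a routing assistant. Pick the category that best fits "
        "the user's message:\n"
        f"{lines}\n\n"
        "Reply with exactly one line in the format:\n"
        "CATEGORY: <category_name>"
    )


def tools_in_category(category: str) -> dict[str, str]:
    """Stage 2 input: the narrowed tool set for the chosen category."""
    return {name: ALL_TOOLS[name] for name in CATEGORIES[category]}


print(build_category_prompt(CATEGORIES))
print(tools_in_category("communication"))
```

The narrowed dictionary from `tools_in_category` has the same shape as `TOOLS`, so it could be passed to the lab's existing `select_tool` for the second stage. The trade-off is an extra model call per request in exchange for fewer candidates at each decision.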

---
## 5. Maintenance Connection: Debugging EOP Agent Tool Selection

You now have empirical evidence for four failure modes in tool selection. Here is the diagnostic protocol for an EOP Agent that selects the wrong tool:

### Diagnostic Checklist

| Step | Check | What to look for | Fix |
|------|-------|------------------|-----|
| 1 | **Prompt structure** | System message well-formed? Tool definitions present and ordered? | Reformat system prompt; ensure tool list is explicit. |
| 2 | **Tool descriptions** | Are two tools' descriptions overlapping or ambiguous? | Rewrite descriptions to be mutually exclusive. |
| 3 | **User input clarity** | Is the user query vague or ambiguous? | Add a disambiguation step or a `clarify` tool. |
| 4 | **Temperature** | Is the routing step using temperature > 0? | Set `temperature=0` for near-deterministic routing. |
| 5 | **Format contract** | Does the prompt enforce a parseable output format? Does the parser handle edge cases? | Add explicit format instructions; add fallback parsing. |
| 6 | **Tool count** | Has the tool set grown beyond ~10 without reorganization? | Introduce hierarchical routing (category → tool). |

### Logging Recommendation

For any production agent, log the following on every tool-selection decision:

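```python
import hashlib

log_entry = {
    "timestamp": ...,
    # Use a stable digest: the built-in hash() of a str changes between
    # Python processes, so it cannot detect prompt template changes.
    "prompt_hash": hashlib.sha256(system_prompt.encode("utf-8")).hexdigest(),
    "user_message": user_message,
    "selected_tool": parsed_tool,
    "raw_response": raw_response,              # for post-hoc analysis
    "temperature": temperature,
    "model": model_name,
}
```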

This enables A/B testing of prompt variants and regression detection when the model is updated.
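
As a rough sketch of how such logs support A/B comparison, the entries can be grouped by prompt hash and tallied per tool. The function name `tally_by_prompt` and the log entries are hypothetical; the short hash strings stand in for real digests.

```python
from collections import Counter, defaultdict


def tally_by_prompt(log_entries: list[dict]) -> dict[str, dict]:
    """Group logged decisions by prompt hash and count selected tools."""
    tallies: dict[str, Counter] = defaultdict(Counter)
    for entry in log_entries:
        tallies[entry["prompt_hash"]][entry["selected_tool"]] += 1
    return {prompt_hash: dict(counts) for prompt_hash, counts in tallies.items()}


# Hypothetical log: prompt variant "v2" routes one weather query to docs.
hypothetical_log = [
    {"prompt_hash": "v1", "selected_tool": "get_weather"},
    {"prompt_hash": "v1", "selected_tool": "get_weather"},
    {"prompt_hash": "v2", "selected_tool": "get_weather"},
    {"prompt_hash": "v2", "selected_tool": "search_docs"},
]

for prompt_hash, counts in tally_by_prompt(hypothetical_log).items():
    print(prompt_hash, counts)
```

A shift in these tallies after a prompt edit or a model update, for the same set of user messages, is the regression signal to investigate.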

### Connection to Module 1 (Tokenization & Logit Control)

In this lab we manipulated tool selection at the **prompt level** by changing what the model sees. Later in this module, we go deeper: examining how the model **tokenizes** tool names and how **logit bias** or **constrained decoding** can force valid tool selections at the token level. The prompt shapes the distribution; logit control narrows it.
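
As a preview of what "narrowing" means, the toy sketch below masks out every candidate that is not a valid tool and renormalizes the rest. It operates on whole strings with invented scores; real constrained decoding works on token-level logits, so treat this as an analogy under those assumptions, not an implementation.

```python
import math


def constrain(scores: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Drop disallowed candidates, then renormalize with a softmax."""
    kept = {name: score for name, score in scores.items() if name in allowed}
    if not kept:
        raise ValueError("no allowed candidate has a score")
    peak = max(kept.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - peak) for name, score in kept.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Hypothetical scores: a free-text continuation outscores both tools.
hypothetical_scores = {"get_weather": 2.0, "search_docs": 1.5, "I can help": 2.4}
valid_tools = {"get_weather", "search_docs"}

constrained = constrain(hypothetical_scores, valid_tools)
print({name: round(p, 3) for name, p in constrained.items()})
```

Without the mask, the free-text option would win under greedy decoding; with it, all probability mass is forced onto valid tool names.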

---
### Connection: EOP Agent Capabilities and What You Learned Here

An **EOP (Evidence-Oriented Programming) Agent** is an AI assistant that helps researchers adopt evidence-oriented practices: it **advocates** the value of EOP/ECF so users want to use it, **restructures** research code into evidence-chain–friendly layouts (e.g. work / output / claim / source), and **assists** with day-to-day coding. The skills you practiced in this lab map directly onto making such an agent reliable.

| EOP Agent capability | How it ties to this lab |
|-----------------------|--------------------------|
| **Advocacy** (explaining EOP so users want to use it) | The agent’s “advocacy” replies are just another kind of **generation conditioned on the prompt**. Getting consistent, on-message advocacy (e.g. when to explain evidence chains versus when to suggest a restructure) depends on **prompt structure**: system message, few-shot examples, and clear boundaries between “explain concept” and “suggest action.” The same principles apply as in tool routing: **order and clarity** in the prompt shape the distribution over outputs. |
| **Restructure code** (reorganizing repos into EOP/ECF layout) | Restructuring requires the agent to **choose the right tools** (e.g. read file, list directory, edit file, suggest layout). Wrong tool ⇒ wrong restructure. Everything you learned applies: **tool definition order** and **non-overlapping descriptions** (e.g. “suggest_directory_layout” vs “edit_file”) reduce misrouting; **format contract** ensures the agent’s suggested changes are parseable and executable. |
| **Assist coding** (helping users write or refactor code) | Again the agent must **route** among tools (read, search, edit, run tests). Vague user requests (“make this cleaner”) lead to **inconsistent tool choice** unless the system prompt includes disambiguation or a clarify step. **Temperature** for these routing steps should be set with the same care as in our experiments — low for critical routing, higher only where creativity is desired. |

**Bottom line:** For an EOP agent, “selecting the right tool” is the same conditional decoding step we studied. When the agent misbehaves (wrong tool, off-topic advocacy, or broken restructure), the **diagnostic checklist from Section 5** applies: inspect prompt structure, tool order and descriptions, user message clarity, temperature, and format contract. This lab is the foundation for debugging and improving that behavior.

---
## 6. Summary

### Three Takeaways

1. **Prompt structure conditions the distribution over tool choices.** The order of tool definitions, the wording of the system message, and the presence of format instructions can all change which tool the model selects, most visibly when the query is ambiguous or temperature is above 0.

2. **Clarity and format consistency reduce variance.** A clear user query and an explicit format contract make tool selection near-deterministic, even at moderate temperature.

3. **Systematic experiments are the right diagnostic method.** When an agent misbehaves, isolate variables (order, clarity, temperature, format) and measure selection distributions — do not guess.

### What's Next

**Lab 2: The Contract of a Tool** — We move from string-based tool definitions to **Pydantic schemas** and the **OpenAI function-calling API**, making tool selection and argument parsing formally structured and validated.

---

*End of Lab 1. Proceed to Lab 2 when ready.*