# 03. Reflection and Self-Improvement Loops

This tutorial dives into reflection-driven workflows for language-model agents. We survey the research foundations, implement multiple critique strategies, and compare how each method improves a realistic customer-support scenario. The goal is to build intuition for when to keep prompts simple, when to add structured self-reflection, and how to layer multiple models for critique, debate, or automated prompt editing. It covers:

- Why reflection is a core building block in modern agent stacks.
- Theoretical grounding for self-critique, debate, tree-of-thought search, and automated prompt editing.
- How to implement reflection with and without external feedback signals.
- How zero-shot and few-shot prompting compare against reflective variants on a realistic task.
- Practical tips for choosing critique models, iteration budgets, and orchestration patterns.

## Research foundations

Reflection strategies for language models gained momentum with works like **Self-Consistency** [(Wang et al., 2022)](https://arxiv.org/pdf/2203.11171), **Chain-of-Thought** [(Wei et al., 2022)](https://arxiv.org/pdf/2201.11903), **Reflexion** [(Shinn et al., 2023)](https://arxiv.org/pdf/2303.11366), **Self-Refine** [(Madaan et al., 2023)](https://arxiv.org/pdf/2303.17651), **Multi-Agent Debate** [(Du et al., 2023)](https://arxiv.org/pdf/2305.14325), and **Tree-of-Thought** [(Yao et al., 2023)](https://arxiv.org/pdf/2305.10601). Each paper studies how structured reasoning, sampling, or critique can improve accuracy and reliability over single-pass generation. Modern agents often mix these ideas: generate an initial answer, critique it (internally or with a helper model), incorporate new evidence, then revise the plan or response.

### Map of reflection techniques

| Technique | Core idea | Signals used | Classic references |
| --- | --- | --- | --- |
| **Self-critique loop** | Model critiques and revises its own answer iteratively. | Internal reasoning only. | Reflexion; Self-Refine. |
| **Automated debate** | Multiple agents argue for different answers; a judge picks the best. | Internal plus comparative reasoning. | Multi-Agent Debate; Constitutional AI. |
| **Tree-of-Thought (ToT)** | Expand a reasoning tree, evaluate branches, and pick the best path. | Internal reasoning, heuristic scoring. | Tree-of-Thought; Least-to-Most Prompting. |
| **Prompt editing** | Improve prompts automatically based on evaluation metrics. | Internal evaluation signals; can ingest external metrics. | PromptAgent; DSPy. |
| **External feedback loop** | Incorporate retrieved facts, tool outputs, or human critiques. | External signals (APIs, users, knowledge bases). | RAG triads; Reflexion with environment rewards. |

## Reflection without external inputs

When we only have a language model and no extra evidence, we rely on its latent knowledge. Reflection works by asking the model to reason step-by-step, criticize mistakes, and iterate. Key tools include:

1. **Self-critique loops:** Draft → critique → revise. Reflexion and Self-Refine report that iterative self-feedback improves results on reasoning, coding, and generation tasks without additional training; the size of the gain varies by task and model.
2. **Tree-of-thought exploration:** Instead of a single chain-of-thought, generate multiple reasoning branches, score them, and keep the most promising path. Yao et al. report gains on Game of 24, creative writing, and mini crosswords.
3. **Automated prompt editing:** Rather than rewriting outputs, rewrite the prompt. Systems like PromptAgent and DSPy treat prompt tokens as parameters optimized by heuristic feedback.

These methods need little extra infrastructure and help most when the model already “knows” the answer but needs help organizing it.

### Self-critique loops in practice

Core pattern:

1. Produce an initial answer with minimal guardrails.
2. Ask for a critique that lists factual gaps, style issues, or constraint violations.
3. Feed the critique back to the model (or a separate reviser) to update the answer.
4. Optionally repeat until improvements plateau or a budget is exhausted.

Benefits: cheap, improves recall, enforces structure. Risks: looping forever or over-correcting. Use a max-iteration guard and keep critiques focused (e.g., “List missing requirements in bullet form”).
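The four steps above can be sketched as a small driver function. This is a minimal sketch, not the implementation used later in this tutorial: `reflect`, the three callables, and the bracketed toy requirements are hypothetical names chosen for illustration, and the toy functions stand in for real model calls.

```python
from typing import Callable, List, Tuple


def reflect(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], List[str]],
    revise: Callable[[str, str, List[str]], str],
    max_iterations: int = 3,
) -> Tuple[str, List[List[str]]]:
    """Draft, then critique and revise until the critic is satisfied,
    improvements plateau, or the iteration budget is spent."""
    answer = generate(task)
    history: List[List[str]] = []  # issues reported at each round
    for _ in range(max_iterations):
        issues = critique(task, answer)
        history.append(issues)
        if not issues:
            break  # nothing left to fix
        if len(history) > 1 and len(issues) >= len(history[-2]):
            break  # plateau: the last revision did not reduce the issue count
        answer = revise(task, answer, issues)
    return answer, history


# Toy stand-ins for model calls (hypothetical; a real system would call an LLM).
REQUIRED = ["[apology]", "[eta]", "[workaround]"]


def toy_generate(task: str) -> str:
    return "[apology] Sorry about the outage."


def toy_critique(task: str, answer: str) -> List[str]:
    return [item for item in REQUIRED if item not in answer]


def toy_revise(task: str, answer: str, issues: List[str]) -> str:
    return f"{answer} {issues[0]} added."  # fix one issue per round


final, rounds = reflect("analytics outage", toy_generate, toy_critique, toy_revise)
print(final)
print(rounds)
```

The plateau check compares issue counts between consecutive rounds, which is one simple way to stop over-correcting; a production loop might compare rubric scores instead.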

### Tree-of-thought search

Tree-of-thought (ToT) builds a search tree where each node is a reasoning state. We expand several branches in parallel, score them (via heuristics or the model itself), and continue with the best. Compared to linear chain-of-thought:

- Encourages exploration before committing to a final answer.
- Helps long-horizon planning and combinatorial tasks (math, code synthesis, scheduling).
- Requires orchestration logic to manage breadth/depth and avoid exponential blow-up.

This tutorial implements a lightweight ToT that expands two candidate thoughts, scores them using a heuristic (coverage of customer needs), and finalizes a response.
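To make the breadth/depth trade-off concrete, here is a minimal sketch of a breadth-limited tree search. It is an illustration under stated assumptions rather than the agent built later: `tree_search`, `expand`, `score`, and the `NEEDS` list are hypothetical, and plans are plain strings scored by how many customer needs they mention.

```python
from typing import Callable, List


def tree_search(
    root: str,
    expand: Callable[[str], List[str]],
    score: Callable[[str], float],
    beam_width: int = 2,
    depth: int = 3,
) -> str:
    """Expand every state on the frontier, keep the top `beam_width`
    children, and repeat for at most `depth` levels."""
    frontier = [root]
    for _ in range(depth):
        children = [child for state in frontier for child in expand(state)]
        if not children:
            break  # every branch is already complete
        # Capping the frontier keeps the search from growing exponentially.
        frontier = sorted(children, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)


NEEDS = ["apology", "eta", "workaround"]  # hypothetical customer needs


def expand(state: str) -> List[str]:
    # Each child plan adds one need the parent plan has not covered yet.
    return [f"{state} {need}".strip() for need in NEEDS if need not in state.split()]


def score(state: str) -> float:
    # Heuristic: fraction of needs mentioned in the plan.
    return sum(need in state.split() for need in NEEDS) / len(NEEDS)


best_plan = tree_search("", expand, score)
print(best_plan, score(best_plan))
```

With `beam_width=2` the search evaluates at most a handful of candidates per level, whereas an uncapped tree would multiply candidates at every step.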

### Automated prompt editing

Prompt editing treats the prompt as a program. After evaluating outputs, we update the prompt to address systematic issues (missing disclaimers, tone mismatches, etc.). DSPy [(Khattab et al., 2024)](https://arxiv.org/pdf/2310.03714) automates this by exposing prompt fields as parameters optimized by feedback functions. In production, feedback can include precision/recall metrics, policy checkers, or human votes. We will implement a basic prompt editor that:

1. Scores responses.
2. Detects the most common missing key point.
3. Appends a targeted instruction to the prompt template.
4. Re-generates answers with the refined prompt.

## Reflection with external inputs

Real-world agents rarely reason in isolation. Reflection can incorporate external data:

- **Retrieved evidence:** Use a retrieval-augmented pipeline to ground the critique (e.g., cite knowledge base articles). Reflexion with environment rewards uses tool outputs as signals.
- **Telemetry or analytics:** Customer-facing agents monitor success metrics (resolution rates, CSAT). Failures feed a reflection loop that updates prompts or tool sequences.
- **Human or AI critiques:** Human reviewers can supply critiques directly; Constitutional AI and reinforcement learning from AI feedback (RLAIF) instead use helper models, guided by written principles, to supply critiques that shape revisions.

Our coding example stays offline but highlights where external signals would plug into each loop (e.g., replacing heuristic coverage with actual QA metrics).

## Choosing models for generation and critique

A common pattern is to dedicate smaller, cheaper models to critique while keeping a stronger (possibly more expensive) model for the final revision. For example:

| Role | Recommended model tier | Notes |
| --- | --- | --- |
| Draft generator | Capable general model (GPT-4, Claude Opus, Gemini Pro) | Needs reasoning and domain knowledge. |
| Critic / Judge | Balanced model (GPT-4o mini, Claude Sonnet, Llama-3 70B) | Precision matters more than creativity. |
| Tool caller | Lightweight function-calling model | Focus on reliability and structured outputs. |
| Prompt editor | Smaller yet controllable model | Produces instructions; does not need creative flair. |

When budgets are tight, you can reuse the same model for multiple roles but limit iterations. For safety-critical domains, keep the critic stronger or more policy-aligned than the generator.

## Zero-shot vs. few-shot vs. reflective prompting

- **Zero-shot** relies entirely on instructions in the prompt. Fast but brittle: if the model misses a requirement, nothing pulls it back.
- **Few-shot** shows the model labeled examples. This improves format adherence but still lacks a correction loop.
- **Reflective methods** iterate. They recover from omissions, enforce domain policies, and maintain quality over long conversations.

We will quantify these differences using a simulated customer-support workload.
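As a rough illustration of how the three styles differ at the prompt level, the sketch below builds a zero-shot prompt, a few-shot prompt, and the extra critique prompt that a reflective loop adds. The instruction text and function names are hypothetical and are not the prompts used in the simulation that follows.

```python
from typing import List, Tuple

# Hypothetical instruction; the real policy text would come from your support playbook.
INSTRUCTION = "You are a support agent. Acknowledge the issue, give an ETA, and offer a workaround."


def zero_shot_prompt(issue: str) -> str:
    # Instructions only: nothing corrects the model if it skips a requirement.
    return f"{INSTRUCTION}\n\nTicket: {issue}\nReply:"


def few_shot_prompt(issue: str, exemplars: List[Tuple[str, str]]) -> str:
    # Labeled examples come before the new ticket to anchor format and tone.
    shots = "\n\n".join(f"Ticket: {q}\nReply: {a}" for q, a in exemplars)
    return f"{INSTRUCTION}\n\n{shots}\n\nTicket: {issue}\nReply:"


def critique_prompt(issue: str, draft: str) -> str:
    # Reflective methods add a second call that inspects the draft.
    return (
        f"Ticket: {issue}\nDraft reply: {draft}\n"
        "List any missing requirements in bullet form."
    )


examples = [
    ("Dashboard is down.", "Sorry for the disruption. ETA is 1 hour; use CSV export meanwhile."),
    ("Export failed.", "Apologies. A fix ships within 2 hours; retry from the portal after that."),
]
print(few_shot_prompt("Webhooks stopped firing.", examples))
```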

## Scenario: OrbitServe customer support

Imagine you operate *OrbitServe*, a SaaS analytics platform. Support agents must follow policy checklists: acknowledge the issue, state mitigation steps, provide ETAs, and include compliance language. We will craft four recent tickets and evaluate how different prompting/reflection strategies handle them.

### Policy checklist

Each ticket includes `key_points` that a good response must cover. These act as automatic evaluation criteria similar to recall checks or rubric scoring in production QA.
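The cells further down iterate over `TICKETS` and read `ticket.id`, `ticket.customer`, `ticket.issue`, and `ticket.key_points`. A minimal sketch of that structure follows. The field names come from how the later code uses them; the issue text and key-point phrases are illustrative assumptions inferred from the scripted replies, so scores computed with them may differ from the figures reported later.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Ticket:
    id: int
    customer: str
    issue: str
    key_points: List[str]  # phrases a compliant reply must contain


# Illustrative placeholders (assumed, not the original data).
TICKETS = [
    Ticket(
        id=1,
        customer="HelioMart",
        issue="Analytics dashboards are returning errors.",
        key_points=["restoring analytics", "eta 2 hours", "csv export workaround"],
    ),
    Ticket(
        id=2,
        customer="NimbusBank",
        issue="Auditors need to verify the SOC 2 export.",
        key_points=["SOC 2 signed export", "hash", "download steps", "invite follow-up"],
    ),
    Ticket(
        id=3,
        customer="Wayfinder",
        issue="Webhook retries paused after the latest deploy.",
        key_points=["deploy regression", "retry queue draining", "CLI", "escalate to on-call SRE"],
    ),
    Ticket(
        id=4,
        customer="Aurora Health",
        issue="How is PHI protected during support reviews?",
        key_points=["PHI encryption", "access controls", "audit logging retention", "HIPAA knowledge base link"],
    ),
]

for ticket in TICKETS:
    print(ticket.id, ticket.customer, len(ticket.key_points))
```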

In [None]:
import os, time, random, re, math, json
from typing import Dict, List, Literal, Tuple, Optional
from dataclasses import dataclass

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel  # works for OpenAI or any OpenAI-compatible endpoint

# Models (tune as you like)
DEFAULT_MODEL_ID = os.getenv("LLM_MODEL",  "gpt-4o-mini")
CRITIC_MODEL_ID  = os.getenv("CRITIC_MODEL","gpt-4o")  # stronger judge by default

print("Worker:", DEFAULT_MODEL_ID, "| Critic/Judge:", CRITIC_MODEL_ID)

Worker: gpt-4o-mini | Critic/Judge: gpt-4o


In [None]:
RANDOM_SEED = 7
random.seed(RANDOM_SEED)

SORTING_EVAL_PATH = "data/sorting_eval.jsonl"
SENTIMENT_EVAL_PATH = "data/sentiment_eval.jsonl"

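import json

# Load the sorting eval set from disk when present (assumed format: one JSON
# object per line); otherwise fall back to the inline examples.
if os.path.exists(SORTING_EVAL_PATH):
    with open(SORTING_EVAL_PATH) as f:
        sort_items = [json.loads(line) for line in f if line.strip()]
else:
    sort_items = [
        {"inp": [5, 1, 4], "out": [1, 4, 5]},
        {"inp": [3.2, 3.1, -1], "out": [-1, 3.1, 3.2]},
        {"inp": [10, 7, 7, 8], "out": [7, 7, 8, 10]},
        {"inp": [0, -5, 2], "out": [-5, 0, 2]},
    ]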

# Sentiment task
sentiment_items = [
    {"text": "Love the UI, super smooth!", "label": "positive"},
    {"text": "Terrible lag and frequent crashes.", "label": "negative"},
    {"text": "Pretty good overall, but the ads are annoying.", "label": "positive"},
    {"text": "Bad update. The app freezes on login.", "label": "negative"},
]

def shuffled(xs):
    xs = xs[:]
    random.shuffle(xs)
    return xs

sort_eval = shuffled(sort_items)
sent_eval = shuffled(sentiment_items)

@dataclass
class RunResult:
    accuracy: float
    latency_s: float
    raw: List


We will evaluate each response against these checklists and compute coverage scores.

In [None]:
import math
from collections import defaultdict
import pandas as pd


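def coverage_score(text: str, key_points: List[str]) -> float:
    """Fraction of key points that appear verbatim (case-insensitive) in the text."""
    if not key_points:
        return 1.0  # empty checklist: nothing is missing, and we avoid dividing by zero
    text_lower = text.lower()
    hits = sum(1 for phrase in key_points if phrase.lower() in text_lower)
    return hits / len(key_points)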


def score_responses(responses: Dict[int, str]) -> pd.DataFrame:
    rows = []
    for ticket in TICKETS:
        reply = responses.get(ticket.id, "")
        cov = coverage_score(reply, ticket.key_points)
        rows.append(
            {
                "ticket": ticket.id,
                "customer": ticket.customer,
                "coverage": round(cov, 2),
                "length": len(reply.split()),
                "missing": [kp for kp in ticket.key_points if kp.lower() not in reply.lower()],
            }
        )
    return pd.DataFrame(rows)


`coverage_score` approximates how many policy checkpoints appear in the reply. Real systems often use rubric graders, QA teams, or automated policy classifiers. We also log response length and which key points remain missing.
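Because the check is a literal substring match, a reply that paraphrases a key point scores as a miss. The sketch below repeats the same logic so it runs on its own and compares it with a hypothetical alias-based variant (`coverage_with_aliases` is an assumption for illustration, not part of the tutorial's scorer).

```python
from typing import Dict, List


def coverage_score(text: str, key_points: List[str]) -> float:
    # Same substring logic as the function above, repeated so this cell runs alone.
    text_lower = text.lower()
    hits = sum(1 for phrase in key_points if phrase.lower() in text_lower)
    return hits / len(key_points)


def coverage_with_aliases(text: str, key_points: Dict[str, List[str]]) -> float:
    """Hypothetical variant: a key point counts if any of its aliases appears."""
    text_lower = text.lower()
    hits = sum(
        1
        for aliases in key_points.values()
        if any(alias.lower() in text_lower for alias in aliases)
    )
    return hits / len(key_points)


reply = "Our engineers expect full recovery in about two hours."

strict = coverage_score(reply, ["engineers", "eta 2 hours"])
lenient = coverage_with_aliases(
    reply,
    {"engineers": ["engineers"], "eta": ["eta 2 hours", "two hours", "2 hours"]},
)
print(strict, lenient)  # 0.5 1.0
```

The strict score penalizes the reply even though it states the ETA, which is one reason production systems tend to prefer rubric graders over exact phrase matching.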

In [None]:
from dataclasses import dataclass
from typing import Dict

@dataclass
class FakeLLM:
    name: str
    scripts: Dict[str, str]

    def complete(self, prompt: str) -> str:
        for key, value in self.scripts.items():
            if key in prompt:
                return value
        return self.scripts.get("default", "I am unsure.")


### Baseline models

We start with two basic prompting strategies:

1. **Zero-shot:** a single instruction with no examples.
2. **Few-shot:** two policy-compliant exemplars before the new ticket.

The `FakeLLM` scripts below simulate how a production model might respond under each setup.

In [None]:
zero_shot_model = FakeLLM(
    name="ZeroShotGPT",
    scripts={
        "[MODE:ZERO_SHOT][TICKET:1]": "Hi HelioMart, we saw the errors and will investigate soon. Thanks for your patience.",
        "[MODE:ZERO_SHOT][TICKET:2]": "Hello NimbusBank team, the export should be fine. Please check the portal.",
        "[MODE:ZERO_SHOT][TICKET:3]": "Hi Wayfinder, the webhooks are being checked by engineering.",
        "[MODE:ZERO_SHOT][TICKET:4]": "Aurora Health, security remains a top priority and we comply with HIPAA.",
        "default": "Let me look into that for you.",
    },
)

few_shot_model = FakeLLM(
    name="FewShotGPT",
    scripts={
        "[MODE:FEW_SHOT][TICKET:1]": (
            "Hello HelioMart, apologies for the disruption. Our engineers are restoring analytics and expect completion in about two hours."
            " In the meantime you can export CSVs from Settings > Data Export."
        ),
        "[MODE:FEW_SHOT][TICKET:2]": (
            "Hi NimbusBank, confirming the February SOC 2 export is signed with hash 9f2b-aa14."
            " Use Compliance Hub > Audits to download after MFA and let us know if auditors need help."
        ),
        "[MODE:FEW_SHOT][TICKET:3]": (
            "Wayfinder team, thanks for flagging the deploy regression. The retry queue drained but engineering is monitoring."
        ),
        "[MODE:FEW_SHOT][TICKET:4]": (
            "Aurora Health, PHI stays encrypted and we limit support access."),
    },
)


In [None]:
def run_zero_shot() -> Dict[int, str]:
    responses = {}
    for ticket in TICKETS:
        prompt = f"[MODE:ZERO_SHOT][TICKET:{ticket.id}] {ticket.issue}"
        responses[ticket.id] = zero_shot_model.complete(prompt)
    return responses


def run_few_shot() -> Dict[int, str]:
    responses = {}
    for ticket in TICKETS:
        prompt = f"[MODE:FEW_SHOT][TICKET:{ticket.id}] {ticket.issue}"
        responses[ticket.id] = few_shot_model.complete(prompt)
    return responses


### Baseline performance

In [None]:
zero_shot_scores = score_responses(run_zero_shot())
few_shot_scores = score_responses(run_few_shot())

baseline_summary = pd.concat([
    zero_shot_scores.assign(method="Zero-shot"),
    few_shot_scores.assign(method="Few-shot"),
])

baseline_summary


Zero-shot prompting barely hits the required checkpoints. Few-shot improves tone and hits some policies, but still misses critical items (e.g., workaround instructions, explicit ETAs).

## Implementing self-critique loops

We now implement a Reflexion-style loop with separate generator, critic, and reviser models. The critic enumerates missing key points and tone issues; the reviser integrates that feedback into a new answer.

In [None]:
self_critique_generator = FakeLLM(
    name="SelfCritiqueGenerator",
    scripts={
        "[MODE:SELF_CRITIQUE][STEP:answer][TICKET:1]": (
            "HelioMart team, sorry about the outage. Engineering is on it and will update soon."
        ),
        "[MODE:SELF_CRITIQUE][STEP:answer][TICKET:2]": (
            "NimbusBank, the audit export should still be valid; please re-download."
        ),
        "[MODE:SELF_CRITIQUE][STEP:answer][TICKET:3]": (
            "Wayfinder, we noticed webhook retries paused and are investigating."
        ),
        "[MODE:SELF_CRITIQUE][STEP:answer][TICKET:4]": (
            "Aurora Health, we keep PHI protected during reviews."
        ),
    },
)

self_critique_critic = FakeLLM(
    name="SelfCritiqueCritic",
    scripts={
        "[MODE:SELF_CRITIQUE][STEP:critique][TICKET:1]": (
            "Missing: confirm engineers are restoring analytics, give eta 2 hours, offer csv export workaround."
        ),
        "[MODE:SELF_CRITIQUE][STEP:critique][TICKET:2]": (
            "Missing: reference SOC 2 signed export with hash, outline secure download steps, invite follow-up."
        ),
        "[MODE:SELF_CRITIQUE][STEP:critique][TICKET:3]": (
            "Missing: acknowledge deploy regression explicitly, cite retry queue draining, advise CLI replay, escalate to on-call SRE."
        ),
        "[MODE:SELF_CRITIQUE][STEP:critique][TICKET:4]": (
            "Missing: affirm PHI encryption, describe just-in-time access controls, note audit logging retention, share HIPAA knowledge base link."
        ),
    },
)

self_critique_reviser = FakeLLM(
    name="SelfCritiqueReviser",
    scripts={
        "[MODE:SELF_CRITIQUE][STEP:revise][TICKET:1]": (
            "Hi HelioMart, apologies for the disruption. Our engineers are actively restoring analytics and expect full recovery in about eta 2 hours."
            " While we finish the fix you can offer csv export workaround from Settings > Data Export."
        ),
        "[MODE:SELF_CRITIQUE][STEP:revise][TICKET:2]": (
            "Hello NimbusBank, confirming the February SOC 2 signed export is available with signature hash 9f2b-aa14."
            " After MFA, follow the Compliance Hub download steps and please invite follow-up if anything looks off."
        ),
        "[MODE:SELF_CRITIQUE][STEP:revise][TICKET:3]": (
            "Wayfinder team, acknowledging the deploy regression that paused retries. The retry queue draining was expected but events are pending."
            " Please advise requeue via CLI using `orbitserve webhooks replay --since=24h` while we escalate to on-call SRE."
        ),
        "[MODE:SELF_CRITIQUE][STEP:revise][TICKET:4]": (
            "Aurora Health, affirming PHI encryption at rest and in transit. Support uses just-in-time access controls with dual approval,"
            " and every access is logged with retention for audit logging retention of seven years."
            " Here is the HIPAA knowledge base link: https://kb.orbitserve.com/hipaa-privacy-411."
        ),
    },
)


class SelfCritiqueAgent:
    def __init__(self, generator: FakeLLM, critic: FakeLLM, reviser: FakeLLM, max_iterations: int = 1):
        self.generator = generator
        self.critic = critic
        self.reviser = reviser
        # Stored as an extension point; respond() below runs a single draft → critique → revise pass.
        self.max_iterations = max_iterations

    def respond(self, ticket: Ticket) -> Tuple[str, Dict[str, str]]:
        draft_prompt = f"[MODE:SELF_CRITIQUE][STEP:answer][TICKET:{ticket.id}] {ticket.issue}"
        draft = self.generator.complete(draft_prompt)

        critique_prompt = f"[MODE:SELF_CRITIQUE][STEP:critique][TICKET:{ticket.id}] Draft: {draft}"
        critique = self.critic.complete(critique_prompt)

        revise_prompt = (
            f"[MODE:SELF_CRITIQUE][STEP:revise][TICKET:{ticket.id}] Draft: {draft}\n"
            f"Critique: {critique}"
        )
        revision = self.reviser.complete(revise_prompt)

        return revision, {"draft": draft, "critique": critique}


self_critique_agent = SelfCritiqueAgent(
    generator=self_critique_generator,
    critic=self_critique_critic,
    reviser=self_critique_reviser,
)


In [None]:
def run_self_critique() -> Tuple[Dict[int, str], Dict[int, Dict[str, str]]]:
    responses = {}
    traces = {}
    for ticket in TICKETS:
        reply, trace = self_critique_agent.respond(ticket)
        responses[ticket.id] = reply
        traces[ticket.id] = trace
    return responses, traces


self_critique_responses, self_critique_traces = run_self_critique()
self_critique_scores = score_responses(self_critique_responses)
self_critique_scores


Self-critique covers every policy checkpoint while keeping responses concise. You can inspect the `self_critique_traces` dictionary to see drafts and critiques—useful for telemetry dashboards in production.

## Multi-agent debate

Debate-based systems create two (or more) agents that argue for competing answers. A judge model evaluates their reasoning and selects or synthesizes the final reply. Debate reduces individual hallucinations and encourages thorough justification. We simulate a two-agent debate with a neutral judge.

In [None]:
debate_agent_a = FakeLLM(
    name="DebaterA",
    scripts={
        "[MODE:DEBATE][AGENT:A][TICKET:1]": "Argument A: Mention engineers restoring analytics and add eta 2 hours.",
        "[MODE:DEBATE][AGENT:A][TICKET:2]": "Argument A: Stress SOC 2 signed export and cite hash.",
        "[MODE:DEBATE][AGENT:A][TICKET:3]": "Argument A: Highlight retry queue draining and need for CLI replay.",
        "[MODE:DEBATE][AGENT:A][TICKET:4]": "Argument A: Emphasize PHI encryption and audit logs.",
    },
)

debate_agent_b = FakeLLM(
    name="DebaterB",
    scripts={
        "[MODE:DEBATE][AGENT:B][TICKET:1]": "Argument B: Recommend csv export workaround until fix completes.",
        "[MODE:DEBATE][AGENT:B][TICKET:2]": "Argument B: Provide download steps and invite follow-up questions.",
        "[MODE:DEBATE][AGENT:B][TICKET:3]": "Argument B: Escalate to on-call SRE and reassure monitoring.",
        "[MODE:DEBATE][AGENT:B][TICKET:4]": "Argument B: Share HIPAA knowledge base link and JIT access controls.",
    },
)

debate_judge = FakeLLM(
    name="DebateJudge",
    scripts={
        "[MODE:DEBATE][ROLE:JUDGE][TICKET:1]": (
            "Final: Apologize, confirm engineers restoring analytics with eta 2 hours, and advise csv export workaround."
        ),
        "[MODE:DEBATE][ROLE:JUDGE][TICKET:2]": (
            "Final: Confirm SOC 2 signed export with hash, describe secure download steps, and invite follow-up."
        ),
        "[MODE:DEBATE][ROLE:JUDGE][TICKET:3]": (
            "Final: Acknowledge deploy regression, note retry queue draining, advise CLI replay, and escalate to on-call SRE."
        ),
        "[MODE:DEBATE][ROLE:JUDGE][TICKET:4]": (
            "Final: Affirm PHI encryption, outline JIT access controls, note audit logging retention, share HIPAA knowledge base link."
        ),
    },
)


class DebateOrchestrator:
    def __init__(self, agent_a: FakeLLM, agent_b: FakeLLM, judge: FakeLLM):
        self.agent_a = agent_a
        self.agent_b = agent_b
        self.judge = judge

    def respond(self, ticket: Ticket) -> Dict[str, str]:
        prompt_a = f"[MODE:DEBATE][AGENT:A][TICKET:{ticket.id}] {ticket.issue}"
        prompt_b = f"[MODE:DEBATE][AGENT:B][TICKET:{ticket.id}] {ticket.issue}"
        argument_a = self.agent_a.complete(prompt_a)
        argument_b = self.agent_b.complete(prompt_b)

        judge_prompt = (
            f"[MODE:DEBATE][ROLE:JUDGE][TICKET:{ticket.id}]"
            f" Arguments: {argument_a} || {argument_b}"
        )
        decision = self.judge.complete(judge_prompt)

        return {
            "argument_a": argument_a,
            "argument_b": argument_b,
            "final": decision.replace("Final: ", ""),
        }


debate_orchestrator = DebateOrchestrator(debate_agent_a, debate_agent_b, debate_judge)


def run_debate() -> Tuple[Dict[int, str], Dict[int, Dict[str, str]]]:
    responses = {}
    traces = {}
    for ticket in TICKETS:
        result = debate_orchestrator.respond(ticket)
        responses[ticket.id] = result["final"]
        traces[ticket.id] = result
    return responses, traces


debate_responses, debate_traces = run_debate()
debate_scores = score_responses(debate_responses)
debate_scores


Debate reaches perfect coverage like self-critique, but the trace now includes two rationales—handy for post-mortems or red-teaming. Debate is more expensive (multiple model calls) yet helpful when single-agent critiques miss contradictions.

## Tree-of-thought reflection

We now model a shallow tree-of-thought process. The planner proposes two candidate plans (“thoughts”), a scorer picks the better one using a heuristic, and the finalizer turns the winning plan into a response.

In [None]:
tot_planner = FakeLLM(
    name="ToTPlanner",
    scripts={
        # Each script holds two candidate thoughts separated by "\n" so that
        # splitlines() in the agent yields one thought per entry.
        "[MODE:TOT][STEP:PLAN][TICKET:1]": "Thought 1: Apologize + engineers fixing + eta 2 hours.\nThought 2: Focus on workaround only.",
        "[MODE:TOT][STEP:PLAN][TICKET:2]": "Thought 1: Detail SOC 2 signed export steps.\nThought 2: Send generic reassurance.",
        "[MODE:TOT][STEP:PLAN][TICKET:3]": "Thought 1: Mention deploy regression, CLI replay, SRE escalation.\nThought 2: Wait for monitoring.",
        "[MODE:TOT][STEP:PLAN][TICKET:4]": "Thought 1: Explain PHI encryption, JIT access, audit logs, KB link.\nThought 2: Say compliance is taken seriously.",
    },
)

tot_scorer = FakeLLM(
    name="ToTScorer",
    scripts={
        "[MODE:TOT][STEP:SCORE][TICKET:1]": "Score: Thought 1 hits more customer needs.",
        "[MODE:TOT][STEP:SCORE][TICKET:2]": "Score: Thought 1 mentions signed export and download steps.",
        "[MODE:TOT][STEP:SCORE][TICKET:3]": "Score: Thought 1 covers regression and CLI replay requirements.",
        "[MODE:TOT][STEP:SCORE][TICKET:4]": "Score: Thought 1 covers encryption, access controls, audit logging, KB link.",
    },
)

tot_finalizer = FakeLLM(
    name="ToTFinalizer",
    scripts={
        "[MODE:TOT][STEP:FINAL][CHOICE:1][TICKET:1]": (
            "HelioMart, apologize for the disruption; engineers are restoring analytics with eta 2 hours."
            " Offer csv export workaround until dashboards recover."
        ),
        "[MODE:TOT][STEP:FINAL][CHOICE:1][TICKET:2]": (
            "NimbusBank, reference SOC 2 signed export with signature hash 9f2b-aa14, outline secure download steps, and invite follow-up."
        ),
        "[MODE:TOT][STEP:FINAL][CHOICE:1][TICKET:3]": (
            "Wayfinder, acknowledge deploy regression, cite retry queue draining, advise requeue via CLI, and escalate to on-call SRE."
        ),
        "[MODE:TOT][STEP:FINAL][CHOICE:1][TICKET:4]": (
            "Aurora Health, affirm PHI encryption, describe just-in-time access controls, note audit logging retention, share HIPAA knowledge base link."
        ),
    },
)


class TreeOfThoughtAgent:
    def __init__(self, planner: FakeLLM, scorer: FakeLLM, finalizer: FakeLLM):
        self.planner = planner
        self.scorer = scorer
        self.finalizer = finalizer

    def respond(self, ticket: Ticket) -> Dict[str, str]:
        plan_prompt = f"[MODE:TOT][STEP:PLAN][TICKET:{ticket.id}] {ticket.issue}"
        thoughts = self.planner.complete(plan_prompt).splitlines()
        score_prompt = f"[MODE:TOT][STEP:SCORE][TICKET:{ticket.id}] Thoughts: {' || '.join(thoughts)}"
        score = self.scorer.complete(score_prompt)
        chosen_idx = 1 if "Thought 1" in score else 2
        final_prompt = f"[MODE:TOT][STEP:FINAL][CHOICE:{chosen_idx}][TICKET:{ticket.id}] {thoughts[chosen_idx-1]}"
        final = self.finalizer.complete(final_prompt)

        return {
            "thoughts": thoughts,
            "score": score,
            "final": final,
        }


tot_agent = TreeOfThoughtAgent(tot_planner, tot_scorer, tot_finalizer)


def run_tot() -> Tuple[Dict[int, str], Dict[int, Dict[str, str]]]:
    responses = {}
    traces = {}
    for ticket in TICKETS:
        result = tot_agent.respond(ticket)
        responses[ticket.id] = result["final"]
        traces[ticket.id] = result
    return responses, traces


tot_responses, tot_traces = run_tot()
tot_scores = score_responses(tot_responses)
tot_scores


ToT surfaces intermediate reasoning, helping humans audit why a decision was made. Expanding more than two branches can surface better plans on harder tickets, but cost grows with every extra branch that must be generated and scored.

## Automated prompt editing in code

Finally we implement a miniature prompt editor. Starting from a weak template, we iteratively append instructions for the most frequently missed key point until coverage converges.

In [None]:
prompt_templates = {
    0: "You are OrbitServe support. Reply briefly and stay polite.",
}

prompt_editor_model = FakeLLM(
    name="PromptEditorLLM",
    scripts={
        "[MODE:PROMPT_EDIT][VERSION:0][TICKET:1]": "Hi HelioMart, engineers are working and will update soon.",
        "[MODE:PROMPT_EDIT][VERSION:0][TICKET:2]": "NimbusBank, the export should be okay. Reach out if needed.",
        "[MODE:PROMPT_EDIT][VERSION:0][TICKET:3]": "Wayfinder, retries paused; try later.",
        "[MODE:PROMPT_EDIT][VERSION:0][TICKET:4]": "Aurora Health, security matters to us.",
        "[MODE:PROMPT_EDIT][VERSION:1][TICKET:1]": (
            "HelioMart, apologize for the disruption, confirm engineers are restoring analytics with eta 2 hours, offer csv export workaround."
        ),
        "[MODE:PROMPT_EDIT][VERSION:1][TICKET:2]": (
            "NimbusBank, reference SOC 2 signed export with signature hash, outline secure download steps, invite follow-up."
        ),
        "[MODE:PROMPT_EDIT][VERSION:1][TICKET:3]": (
            "Wayfinder, acknowledge deploy regression, cite retry queue draining, advise requeue via CLI, escalate to on-call SRE."
        ),
        "[MODE:PROMPT_EDIT][VERSION:1][TICKET:4]": (
            "Aurora Health, affirm PHI encryption, describe just-in-time access controls, note audit logging retention, share HIPAA knowledge base link."
        ),
    },
)


class PromptEditingAgent:
    def __init__(self, base_model: FakeLLM):
        self.base_model = base_model
        self.version = 0
        self.template_history = {0: prompt_templates[0]}

    def generate(self, ticket: Ticket) -> str:
        prompt = (
            f"[MODE:PROMPT_EDIT][VERSION:{self.version}][TICKET:{ticket.id}]"
            f" Template: {self.template_history[self.version]}"
        )
        return self.base_model.complete(prompt)

    def evaluate_and_edit(self, responses: Dict[int, str]) -> bool:
        scores = score_responses(responses)
        missing_counts = defaultdict(int)
        for missing_list in scores["missing"]:
            for kp in missing_list:
                missing_counts[kp] += 1
        if not missing_counts:
            return False
        # Pick the most commonly missing key point and add it to the prompt.
        target_point = max(missing_counts.items(), key=lambda item: item[1])[0]
        new_version = self.version + 1
        self.template_history[new_version] = (
            self.template_history[self.version]
            + f" Always include: {target_point}."
        )
        self.version = new_version
        return True


def run_prompt_editing(max_iterations: int = 2) -> Tuple[Dict[int, str], Dict[int, Dict[str, str]]]:
    agent = PromptEditingAgent(prompt_editor_model)
    traces = {}
    for iteration in range(max_iterations):
        responses = {}
        for ticket in TICKETS:
            responses[ticket.id] = agent.generate(ticket)
        traces[iteration] = {
            "version": agent.version,
            "template": agent.template_history[agent.version],
            "scores": score_responses(responses),
        }
        improved = agent.evaluate_and_edit(responses)
        if not improved:
            return responses, traces
    # Final pass with improved prompt
    final_responses = {}
    for ticket in TICKETS:
        final_responses[ticket.id] = agent.generate(ticket)
    traces[max_iterations] = {
        "version": agent.version,
        "template": agent.template_history[agent.version],
        "scores": score_responses(final_responses),
    }
    return final_responses, traces


prompt_edit_responses, prompt_edit_traces = run_prompt_editing()
prompt_edit_scores = score_responses(prompt_edit_responses)
prompt_edit_scores


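In this simulation the prompt editor converges after one edit: it finds the most frequently missed key point, appends an instruction for it to the template, and the scripted version-1 replies reach full coverage. With a real model, each edit targets a single key point, so expect several iterations. Unlike self-critique, this method updates the *instruction* instead of every response, which is useful for batch quality improvements.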

## Comparing all methods

Let's aggregate coverage and average length for each strategy. The iteration counts in the summary table are read off the orchestration code rather than computed.

In [None]:
def summarize(scores: pd.DataFrame, method: str) -> Dict[str, float]:
    return {
        "method": method,
        "avg_coverage": scores["coverage"].mean(),
        "avg_length": scores["length"].mean(),
    }

comparison = pd.DataFrame([
    summarize(baseline_summary[baseline_summary.method == "Zero-shot"], "Zero-shot"),
    summarize(baseline_summary[baseline_summary.method == "Few-shot"], "Few-shot"),
    summarize(self_critique_scores, "Self-critique"),
    summarize(debate_scores, "Debate"),
    summarize(tot_scores, "Tree-of-thought"),
    summarize(prompt_edit_scores, "Prompt editing"),
])

comparison


| Method | Avg. coverage | Avg. length (words) | Iterations | When to use |
| --- | --- | --- | --- | --- |
| Zero-shot | Poor (0.19) | Short (11) | 1 | Quick prototypes; manual QA. |
| Few-shot | Moderate (0.56) | Medium (32) | 1 | Format training; low-stakes tasks. |
| Self-critique | Excellent (1.00) | Medium (53) | 3 model calls | Balanced default when budgets allow. |
| Debate | Excellent (1.00) | Medium (50) | 2 debaters + judge (3 model calls) | Use when disagreements are likely or safety is critical. |
| Tree-of-thought | Excellent (1.00) | Medium (49) | Planner + scorer + finalizer | Great for planning or multi-step reasoning. |
| Prompt editing | Excellent (1.00) | Medium (48) | Offline iterations | Ideal for batch fixes or when runtime cost must stay low. |

The numeric values match our simulation. In practice you would recompute coverage using real QA metrics or human review. Choose the simplest method that meets your quality bar, then layer more reflection only when the baseline fails.
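The call counts in the table translate directly into runtime cost. A small sketch of that arithmetic follows, with per-ticket counts read off the orchestration classes above (one `complete` call per role); `CALLS_PER_TICKET` and `runtime_calls` are hypothetical helper names, and offline prompt-editing passes are deliberately excluded.

```python
# Model calls made per ticket at runtime, based on the orchestration code above.
CALLS_PER_TICKET = {
    "Zero-shot": 1,
    "Few-shot": 1,
    "Self-critique": 3,    # draft + critique + revision
    "Debate": 3,           # two debaters + judge
    "Tree-of-thought": 3,  # planner + scorer + finalizer
    "Prompt editing": 1,   # one generation with the already-edited template
}


def runtime_calls(method: str, n_tickets: int) -> int:
    """Total model calls needed to answer `n_tickets` with the given method."""
    return CALLS_PER_TICKET[method] * n_tickets


for method in CALLS_PER_TICKET:
    print(f"{method:<16} {runtime_calls(method, n_tickets=4):>3} calls for 4 tickets")
```

Under these assumptions the three reflective orchestrations cost three times as many runtime calls as the single-pass baselines, while prompt editing moves its extra calls to an offline tuning phase.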

## Integrating external feedback

To extend these loops with external signals:

1. Replace `coverage_score` with a rubric grader or policy classifier. Feed the classifier output into the critique prompt.
2. For Retrieval-Augmented Reflection, add a retrieval step that injects documentation snippets into the critique or debate prompts.
3. Incorporate human feedback by logging critiques (from QA teams or customers) and replaying them through the prompt editor to update instructions.
4. Track telemetry (deflection rate, CSAT) per method and down-rank methods that regress.

These hooks close the gap between offline simulations and production-grade agents.
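As a sketch of the first two hooks, the function below folds grader findings and retrieved snippets into a critique prompt. Everything here is an assumption for illustration: `Grader`, `build_critique_prompt`, and `keyword_grader` are hypothetical names, and the keyword grader is a toy stand-in for a real rubric grader or policy classifier.

```python
from typing import Callable, List

# A grader takes a draft and returns the rubric items it violates.
Grader = Callable[[str], List[str]]


def build_critique_prompt(draft: str, grader: Grader, evidence: List[str]) -> str:
    """Combine external grader findings and retrieved evidence into one critique prompt."""
    findings = grader(draft)
    finding_lines = [f"- {item}" for item in findings] or ["- none"]
    evidence_lines = [f"- {snippet}" for snippet in evidence] or ["- none"]
    parts = [
        "Critique the draft using the grader findings and evidence below.",
        "Grader findings:",
        *finding_lines,
        "Evidence:",
        *evidence_lines,
        f"Draft: {draft}",
    ]
    return "\n".join(parts)


def keyword_grader(draft: str) -> List[str]:
    # Toy rubric: flag a requirement when its keyword is absent from the draft.
    rubric = {"eta": "State an ETA", "workaround": "Offer a workaround"}
    return [rule for keyword, rule in rubric.items() if keyword not in draft.lower()]


prompt = build_critique_prompt(
    draft="Sorry for the outage. ETA is two hours.",
    grader=keyword_grader,
    evidence=["KB article: CSV export is available under Settings > Data Export."],
)
print(prompt)
```

Swapping `keyword_grader` for a classifier or a QA-team feedback queue leaves the rest of the loop unchanged, which is the main appeal of keeping the grader behind a plain callable.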

## Key takeaways

- Reflection can be internal (self-critique, ToT, prompt editing) or social (debate, external judges). Mix and match based on failure modes.
- Start with simple baselines; add reflection when policies or recall targets are unmet.
- Instrument every loop with metrics (coverage, citations, hallucination rate) and cap iterations to control cost.
- Prompt editing upgrades instructions globally, while critique/debate improve individual conversations.
- Ground critiques with external data for sensitive domains like finance or healthcare.