# 03. Reflection and Self-Improvement Loops

This notebook expands the OrbitServe support scenario from the previous lessons and demonstrates how to implement multiple reflection patterns with **runnable Pydantic agents**. Each agent emits its own message log so you can inspect how the reasoning loop evolved before producing a customer-ready reply.

## What you will learn

- How seminal reflection papers such as Self-Refine, Reflexion, Tree of Thoughts, and self-debate inspire practical workflows.
- How to implement zero-shot, few-shot, self-critique, debate, tree-of-thought, and automated prompt-editing strategies using Pydantic-based agents.
- How internal reflection (no external data) contrasts with loops that incorporate external checklists or evaluation signals.
- How to capture step-by-step agent logs and evaluate replies with simple coverage metrics so that improvements are measurable.

## Research foundations

Reflection has been explored along several complementary axes:

- **Self-critique and rewriting.** Self-Refine (Madaan et al., 2023) and Reflexion (Shinn et al., 2023) show that letting the same model critique and revise its answer improves factuality and coverage, even without extra context.
- **Debate and self-dialogue.** Multi-agent debate studies (Du et al., 2023) demonstrate that disagreeing agents can surface missing rationales before settling on a final response.
- **Tree-structured reasoning.** Tree of Thoughts (Yao et al., 2023) organizes thinking as branching plans that are scored before committing to a solution, while Least-to-Most prompting (Zhou et al., 2022) decomposes a problem into ordered subproblems.
- **Automated prompt editing.** *Large Language Models Are Human-Level Prompt Engineers* (Zhou et al., 2023), PromptBreeder (Fernando et al., 2023), and PACE (Yang et al., 2024) automatically rewrite prompts based on evaluation feedback, yielding improvements without touching model weights.

We will translate these ideas into deterministic Pydantic agents that you can run offline. Each method reuses the same OrbitServe customer-support tickets so that scores are comparable.

## OrbitServe scenario

OrbitServe operates a SaaS platform with strict reliability and compliance requirements. The support team receives nuanced tickets that demand factual assurances and operational next steps. Reflection is valuable because agents must cover every policy bullet without wasting the customer's time.

We will reuse the four tickets introduced earlier and attach a lightweight knowledge base so that agents can reason deterministically. Coverage metrics act as our objective signal: the higher the coverage, the more customer requirements were satisfied.

In [None]:
from __future__ import annotations

from typing import Dict, List, Sequence, Tuple

import pandas as pd
from pydantic import BaseModel, Field

In [None]:
class Ticket(BaseModel):
    id: int
    customer: str
    channel: str
    issue: str
    key_points: List[str]
    ground_truth: str


def make_tickets() -> List[Ticket]:
    return [
        Ticket(
            id=1,
            customer="HelioMart",
            channel="Email",
            issue="Analytics dashboard returns 500 errors since 08:00 UTC.",
            key_points=[
                "apologize for the disruption",
                "confirm engineers are restoring analytics",
                "eta 2 hours",
                "offer csv export workaround",
            ],
            ground_truth="Issue caused by region-specific cache miss; fix rolling out with ETA 2 hours; CSV export unaffected.",
        ),
        Ticket(
            id=2,
            customer="NimbusBank",
            channel="Chat",
            issue="Need confirmation whether the February compliance audit export is signed.",
            key_points=[
                "reference SOC 2 signed export",
                "mention signature hash",
                "outline secure download steps",
                "invite follow-up",
            ],
            ground_truth="Audit export signed with hash 9f2b…; accessible under Compliance Hub with MFA; auditors can request revalidation.",
        ),
        Ticket(
            id=3,
            customer="Wayfinder Apps",
            channel="Email",
            issue="Webhook retries stopped after yesterday's deployment.",
            key_points=[
                "acknowledge deploy regression",
                "cite retry queue draining",
                "advise requeue via CLI",
                "escalate to on-call SRE",
            ],
            ground_truth="Deployment misconfigured retry worker; queue drained but stuck. CLI command `orbitserve webhooks replay --since=24h` restores events; SRE paged.",
        ),
        Ticket(
            id=4,
            customer="Aurora Health",
            channel="Phone",
            issue="Need assurance PHI remains encrypted during support investigations.",
            key_points=[
                "affirm PHI encryption",
                "describe just-in-time access controls",
                "note audit logging retention",
                "share HIPAA knowledge base link",
            ],
            ground_truth="PHI encrypted at rest/in transit; support uses JIT elevation with dual approval; audit logs kept 7 years; KB article HIPAA-privacy-411.",
        ),
    ]


TICKETS = make_tickets()
len(TICKETS)

In [None]:
KEY_POINT_SNIPPETS: Dict[int, Dict[str, str]] = {
    1: {
        "apologize for the disruption": "I'm sorry about the disruption to your analytics dashboard this morning.",
        "confirm engineers are restoring analytics": "Our engineering team is already restoring the analytics service and watching the error budget closely.",
        "eta 2 hours": "We expect full recovery within about two hours based on the fix that is rolling out.",
        "offer csv export workaround": "While that completes you can export the numbers you need from Settings → Data Export using the CSV option.",
    },
    2: {
        "reference SOC 2 signed export": "The February SOC 2 audit export is signed and ready for your auditors.",
        "mention signature hash": "Its signature hash is 9f2b-aa14 so you can verify integrity end-to-end.",
        "outline secure download steps": "You can download it from the Compliance Hub after MFA—choose Downloads → 2025-02.",
        "invite follow-up": "If anything looks off, please let us know and we'll revalidate it immediately.",
    },
    3: {
        "acknowledge deploy regression": "Yesterday's deploy introduced a regression that paused part of our webhook pipeline.",
        "cite retry queue draining": "The retry queue drained but stopped replaying events, which matches what you're seeing.",
        "advise requeue via CLI": "Run `orbitserve webhooks replay --since=24h` to push the stuck deliveries.",
        "escalate to on-call SRE": "I've already escalated this to the on-call SRE to watch the workers while they catch up.",
    },
    4: {
        "affirm PHI encryption": "All PHI stays encrypted in transit and at rest even during investigations.",
        "describe just-in-time access controls": "Support uses just-in-time access with dual approvals, so only the assigned engineer can view the records.",
        "note audit logging retention": "Every touch is written to audit logs that we retain for seven years.",
        "share HIPAA knowledge base link": "You can share HIPAA KB article HIPAA-privacy-411 for the full breakdown.",
    },
}

TICKET_OPENERS = {
    1: "Hi HelioMart team, thanks for flagging the analytics 500 errors.",
    2: "Hello NimbusBank compliance team, happy to help with the audit export.",
    3: "Hi Wayfinder team, thanks for the clear report about the webhook pauses.",
    4: "Hi Aurora Health, I appreciate you checking on PHI handling.",
}

TICKET_CLOSERS = {
    1: "We'll update you again once dashboards are back to green.",
    2: "Thank you for double-checking your compliance workflows.",
    3: "Ping us if you need help validating retries afterward.",
    4: "We're here if your privacy office would like a deeper review.",
}

In [None]:
KEY_POINT_EVIDENCE: Dict[int, Dict[str, Tuple[str, ...]]] = {
    1: {
        "apologize for the disruption": ("sorry", "analytics"),
        "confirm engineers are restoring analytics": ("engineering team", "restoring"),
        "eta 2 hours": ("two hours",),
        "offer csv export workaround": ("csv", "export"),
    },
    2: {
        "reference SOC 2 signed export": ("soc 2", "signed"),
        "mention signature hash": ("hash", "9f2b-aa14"),
        "outline secure download steps": ("compliance hub",),
        "invite follow-up": ("let us know",),
    },
    3: {
        "acknowledge deploy regression": ("deploy", "regression"),
        "cite retry queue draining": ("retry queue",),
        "advise requeue via CLI": ("orbitserve webhooks replay",),
        "escalate to on-call SRE": ("on-call sre",),
    },
    4: {
        "affirm PHI encryption": ("phi", "encrypted"),
        "describe just-in-time access controls": ("just-in-time",),
        "note audit logging retention": ("audit logs",),
        "share HIPAA knowledge base link": ("hipaa", "privacy-411"),
    },
}

## Shared utilities

We will wrap each agent run in a `BaseModel` so that logs, coverage, and metadata are easy to inspect. Every agent calls `compose_reply` to assemble deterministic sentences from the snippets above.

In [None]:
class AgentMessage(BaseModel):
    speaker: str
    text: str


class AgentRun(BaseModel):
    ticket_id: int
    method: str
    final_response: str
    messages: List[AgentMessage]
    coverage: float
    coverage_hits: Dict[str, bool]
    length: int
    iterations: int


def compose_reply(ticket: Ticket, include_points: Sequence[str]) -> str:
    """Build a reply from the opener, the selected snippets, and the closer."""
    # Keep the ticket's own key-point order and drop duplicates.
    ordered = []
    seen = set()
    for point in ticket.key_points:
        if point in include_points and point not in seen:
            ordered.append(point)
            seen.add(point)
    sentences = [TICKET_OPENERS[ticket.id]]
    for point in ordered:
        sentences.append(KEY_POINT_SNIPPETS[ticket.id][point])
    sentences.append(TICKET_CLOSERS[ticket.id])
    return " ".join(sentences)


def evaluate_response(reply: str, ticket: Ticket) -> Tuple[float, Dict[str, bool]]:
    """Return the coverage fraction and a per-key-point hit map."""
    lowered = reply.lower()
    hits: Dict[str, bool] = {}
    for point in ticket.key_points:
        signals = KEY_POINT_EVIDENCE[ticket.id][point]
        # A key point counts only when every evidence substring appears.
        hits[point] = all(signal in lowered for signal in signals)
    coverage = sum(hits.values()) / len(hits)
    return coverage, hits


def words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def run_agent_over_dataset(agent) -> Tuple[pd.DataFrame, Dict[int, AgentRun]]:
    """Run an agent on every ticket and collect a score table plus raw runs."""
    records = []
    runs: Dict[int, AgentRun] = {}
    for ticket in TICKETS:
        run: AgentRun = agent.respond(ticket)
        runs[ticket.id] = run
        records.append(
            {
                "ticket": ticket.id,
                "customer": ticket.customer,
                "method": run.method,
                "coverage": run.coverage,
                "length": run.length,
                "iterations": run.iterations,
            }
        )
    return pd.DataFrame(records), runs
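The substring check in `evaluate_response` is strict: a reply that says "2 hours" would not match the signal "two hours". One possible extension, sketched here under the assumption that each signal may list several acceptable phrasings, counts a key point as covered when every signal group has at least one match. The names `SignalGroup` and `coverage_with_alternatives` are hypothetical and not part of the notebook, and the evidence dictionary is made up for illustration.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical type: one group lists interchangeable phrasings for a signal.
SignalGroup = Tuple[str, ...]


def coverage_with_alternatives(
    reply: str,
    evidence: Dict[str, Sequence[SignalGroup]],
) -> Tuple[float, Dict[str, bool]]:
    """Score a reply; a point is hit when each group has at least one match."""
    lowered = reply.lower()
    hits = {
        point: all(any(alt in lowered for alt in group) for group in groups)
        for point, groups in evidence.items()
    }
    return sum(hits.values()) / len(hits), hits


# Illustrative evidence for two key points (assumed values, not notebook data).
evidence = {
    "eta 2 hours": [("two hours", "2 hours")],
    "offer csv export workaround": [("csv",), ("export",)],
}
score, hits = coverage_with_alternatives("Expect recovery within 2 hours.", evidence)
print(score, hits)
```

The trade-off is that looser matching makes it easier for a reply to score well by accident, so each alternative should be as specific as the original signal.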

## Baseline prompting: zero-shot vs. few-shot

We begin with two deterministic agents:

- **Zero-shot** mimics a generic assistant that offers empathy and reassurance but misses many operational details.
- **Few-shot** represents a prompt enriched with annotated examples, so it reliably covers every key point.

In [None]:
class ZeroShotAgent(BaseModel):
    method: str = "Zero-shot"

    def respond(self, ticket: Ticket) -> AgentRun:
        messages = [AgentMessage(speaker="user", text=ticket.issue)]
        include = ticket.key_points[:2]
        reply = compose_reply(ticket, include)
        coverage, hits = evaluate_response(reply, ticket)
        messages.append(AgentMessage(speaker="assistant", text=reply))
        return AgentRun(
            ticket_id=ticket.id,
            method=self.method,
            final_response=reply,
            messages=messages,
            coverage=coverage,
            coverage_hits=hits,
            length=words(reply),
            iterations=1,
        )


class FewShotAgent(BaseModel):
    method: str = "Few-shot"

    def respond(self, ticket: Ticket) -> AgentRun:
        messages = [AgentMessage(speaker="user", text=ticket.issue)]
        reply = compose_reply(ticket, ticket.key_points)
        coverage, hits = evaluate_response(reply, ticket)
        messages.append(AgentMessage(speaker="assistant", text=reply))
        return AgentRun(
            ticket_id=ticket.id,
            method=self.method,
            final_response=reply,
            messages=messages,
            coverage=coverage,
            coverage_hits=hits,
            length=words(reply),
            iterations=1,
        )

In [None]:
zero_shot = ZeroShotAgent()
few_shot = FewShotAgent()
sample_zero = zero_shot.respond(TICKETS[0])
sample_few = few_shot.respond(TICKETS[0])
pd.DataFrame([m.model_dump() for m in sample_zero.messages])

In [None]:
pd.DataFrame([m.model_dump() for m in sample_few.messages])

In [None]:
baseline_scores, baseline_runs = run_agent_over_dataset(zero_shot)
few_shot_scores, few_shot_runs = run_agent_over_dataset(few_shot)
pd.concat([baseline_scores, few_shot_scores]).reset_index(drop=True)

## Self-critique loops (internal reflection)

Self-critique builds on the zero-shot draft. After measuring coverage, the same agent lists missing key points and appends deterministic fixes. This mirrors the inner monologue proposed by Reflexion.

In [None]:
class SelfCritiqueAgent(BaseModel):
    method: str = "Self-critique"
    draft_agent: ZeroShotAgent = Field(default_factory=ZeroShotAgent)
    max_rounds: int = 2

    def respond(self, ticket: Ticket) -> AgentRun:
        base = self.draft_agent.respond(ticket)
        messages = list(base.messages)
        current = base.final_response
        coverage = base.coverage
        hits = dict(base.coverage_hits)
        for step in range(self.max_rounds):
            missing = [kp for kp, ok in hits.items() if not ok]
            if not missing:
                break
            critique = (
                f"Critique {step + 1}: missing {', '.join(missing)}."
            )
            messages.append(AgentMessage(speaker="critic", text=critique))
            additions = " ".join(KEY_POINT_SNIPPETS[ticket.id][kp] for kp in missing)
            current = current + " " + additions
            coverage, hits = evaluate_response(current, ticket)
            messages.append(
                AgentMessage(
                    speaker="assistant",
                    text=f"Revision {step + 1}: {current}",
                )
            )
        return AgentRun(
            ticket_id=ticket.id,
            method=self.method,
            final_response=current,
            messages=messages,
            coverage=coverage,
            coverage_hits=hits,
            length=words(current),
            iterations=len([m for m in messages if m.speaker == "assistant"]),
        )

In [None]:
self_critique = SelfCritiqueAgent()
self_example = self_critique.respond(TICKETS[0])
pd.DataFrame([m.model_dump() for m in self_example.messages])

In [None]:
self_scores, self_runs = run_agent_over_dataset(self_critique)
self_scores

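## Debate-style reflection (internal multi-agent)

Two advocates emphasize different aspects of the ticket. A lightweight judge chooses the reply with the best coverage. This deterministic debate is loosely modeled on the multi-agent debate of Du et al. (2023), with a coverage-based judge standing in for the rounds of discussion through which their agents converge on an answer.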

In [None]:
DEBATE_FOCI: Dict[str, Dict[int, List[str]]] = {
    "advocate_a": {
        1: [
            "apologize for the disruption",
            "confirm engineers are restoring analytics",
            "eta 2 hours",
        ],
        2: [
            "reference SOC 2 signed export",
            "mention signature hash",
        ],
        3: [
            "acknowledge deploy regression",
            "advise requeue via CLI",
        ],
        4: [
            "affirm PHI encryption",
            "describe just-in-time access controls",
        ],
    },
    "advocate_b": {
        1: [
            "apologize for the disruption",
            "offer csv export workaround",
        ],
        2: [
            "outline secure download steps",
            "invite follow-up",
        ],
        3: [
            "cite retry queue draining",
            "escalate to on-call SRE",
        ],
        4: [
            "note audit logging retention",
            "share HIPAA knowledge base link",
        ],
    },
}


class DebateAgent(BaseModel):
    method: str = "Debate"
    advocate_focus: Dict[str, Dict[int, List[str]]] = Field(default_factory=lambda: DEBATE_FOCI)

    def respond(self, ticket: Ticket) -> AgentRun:
        messages = [AgentMessage(speaker="user", text=ticket.issue)]
        candidates = {}
        for advocate, focus in self.advocate_focus.items():
            include = focus.get(ticket.id, ticket.key_points)
            reply = compose_reply(ticket, include)
            coverage, hits = evaluate_response(reply, ticket)
            messages.append(AgentMessage(speaker=advocate, text=reply))
            candidates[advocate] = (reply, coverage, hits)
        # Highest coverage wins; ties fall back to comparing the reply text
        # so that the choice stays deterministic.
        winner, (final_reply, coverage, hits) = max(
            candidates.items(), key=lambda item: (item[1][1], item[1][0])
        )
        messages.append(
            AgentMessage(
                speaker="judge",
                text=f"Judge selects {winner} with coverage {coverage:.2f}.",
            )
        )
        return AgentRun(
            ticket_id=ticket.id,
            method=self.method,
            final_response=final_reply,
            messages=messages,
            coverage=coverage,
            coverage_hits=hits,
            length=words(final_reply),
            iterations=1,
        )

In [None]:
debate = DebateAgent()
debate_example = debate.respond(TICKETS[0])
pd.DataFrame([m.model_dump() for m in debate_example.messages])

In [None]:
debate_scores, debate_runs = run_agent_over_dataset(debate)
debate_scores
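Because each advocate argues from a subset of the key points, a judge that picks a single winner can never exceed the better advocate's coverage. One alternative, which is a different technique from the winner-takes-all judge above, is a judge that merges the advocates' points. The following is a minimal sketch, assuming plans are plain lists of key-point labels; `merge_plans` is a hypothetical helper and the labels are shortened for illustration.

```python
from typing import Dict, List


def merge_plans(canonical_order: List[str], plans: Dict[str, List[str]]) -> List[str]:
    """Union the advocates' key points, keeping the ticket's canonical order."""
    proposed = {point for plan in plans.values() for point in plan}
    return [point for point in canonical_order if point in proposed]


# Toy labels standing in for a ticket's key points.
key_points = ["apologize", "confirm fix", "eta", "workaround"]
plans = {
    "advocate_a": ["apologize", "confirm fix", "eta"],
    "advocate_b": ["apologize", "workaround"],
}
merged = merge_plans(key_points, plans)
print(merged)
```

In the notebook's terms, the merged list could be passed to `compose_reply` in place of a single advocate's focus list.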

## Tree-of-thought reflection

Here we enumerate candidate plans (thoughts) for each ticket, score the resulting replies, and pick the highest-coverage branch before responding. This mirrors Yao et al.'s Tree-of-Thoughts but keeps the search space tiny and deterministic.

In [None]:
THOUGHT_BRANCHES: Dict[int, List[List[str]]] = {
    1: [
        [
            "apologize for the disruption",
            "confirm engineers are restoring analytics",
            "eta 2 hours",
            "offer csv export workaround",
        ],
        [
            "apologize for the disruption",
            "offer csv export workaround",
        ],
    ],
    2: [
        [
            "reference SOC 2 signed export",
            "mention signature hash",
            "outline secure download steps",
            "invite follow-up",
        ],
        [
            "reference SOC 2 signed export",
            "outline secure download steps",
        ],
    ],
    3: [
        [
            "acknowledge deploy regression",
            "cite retry queue draining",
            "advise requeue via CLI",
            "escalate to on-call SRE",
        ],
        [
            "acknowledge deploy regression",
            "advise requeue via CLI",
        ],
    ],
    4: [
        [
            "affirm PHI encryption",
            "describe just-in-time access controls",
            "note audit logging retention",
            "share HIPAA knowledge base link",
        ],
        [
            "affirm PHI encryption",
            "share HIPAA knowledge base link",
        ],
    ],
}


class TreeOfThoughtAgent(BaseModel):
    method: str = "Tree-of-thought"
    branches: Dict[int, List[List[str]]] = Field(default_factory=lambda: THOUGHT_BRANCHES)

    def respond(self, ticket: Ticket) -> AgentRun:
        messages = [AgentMessage(speaker="user", text=ticket.issue)]
        best_reply = ""
        best_score = -1.0
        best_hits: Dict[str, bool] = {}
        for idx, plan in enumerate(self.branches.get(ticket.id, [ticket.key_points]), start=1):
            reply = compose_reply(ticket, plan)
            coverage, hits = evaluate_response(reply, ticket)
            messages.append(
                AgentMessage(
                    speaker="planner",
                    text=f"Thought {idx}: include {plan} → coverage {coverage:.2f}.",
                )
            )
            if coverage > best_score:
                best_reply = reply
                best_score = coverage
                best_hits = hits
        messages.append(
            AgentMessage(
                speaker="assistant",
                text=best_reply,
            )
        )
        return AgentRun(
            ticket_id=ticket.id,
            method=self.method,
            final_response=best_reply,
            messages=messages,
            coverage=best_score,
            coverage_hits=best_hits,
            length=words(best_reply),
            iterations=1,
        )

In [None]:
tot = TreeOfThoughtAgent()
tot_example = tot.respond(TICKETS[0])
pd.DataFrame([m.model_dump() for m in tot_example.messages])

In [None]:
tot_scores, tot_runs = run_agent_over_dataset(tot)
tot_scores
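The agent above scores a fixed list of complete plans. In a larger search space, plans are usually grown step by step and pruned along the way. The sketch below illustrates that idea as a small beam search, assuming a caller-supplied `score` function and toy data. `beam_search_plan`, its parameters, and the toy scorer are hypothetical, and this is a simplification rather than the Tree of Thoughts algorithm itself.

```python
from typing import Callable, List, Sequence, Tuple

Plan = Tuple[str, ...]


def beam_search_plan(
    candidates: Sequence[str],
    score: Callable[[Plan], float],
    beam_width: int = 2,
    max_depth: int = 4,
) -> Tuple[Plan, float]:
    """Grow plans one key point at a time, keeping the best `beam_width`."""
    beam: List[Plan] = [()]
    best: Tuple[Plan, float] = ((), score(()))
    for _ in range(max_depth):
        # Expand each surviving plan with every key point it does not yet use;
        # sorting by candidate order makes equivalent plans collapse in the set.
        expanded = {
            tuple(sorted(plan + (point,), key=candidates.index))
            for plan in beam
            for point in candidates
            if point not in plan
        }
        if not expanded:
            break
        # Rank by score (descending), then by plan text for determinism.
        ranked = sorted(expanded, key=lambda p: (-score(p), p))
        beam = ranked[:beam_width]
        if score(beam[0]) > best[1]:
            best = (beam[0], score(beam[0]))
    return best


# Toy scorer: reward required points, lightly penalize irrelevant ones.
REQUIRED = {"apologize", "eta", "workaround"}


def toy_score(plan: Plan) -> float:
    covered = len(set(plan) & REQUIRED) / len(REQUIRED)
    extras = sum(1 for point in plan if point not in REQUIRED)
    return covered - 0.05 * extras


plan, value = beam_search_plan(
    ["apologize", "small talk", "eta", "workaround"], toy_score
)
print(plan, value)
```

With a beam width of 2 the search scores only a handful of partial plans per level instead of every subset, which is the main reason to prefer it once the number of candidate points grows.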

## Automated prompt editing (external signal)

Prompt editing treats coverage feedback as an **external evaluator**. We start from a sparse prompt, score the reply, and then append instructions that target the missing policy bullets. This parallels automated prompt improvement systems such as PromptBreeder.

In [None]:
PROMPT_HINTS: Dict[int, Dict[str, str]] = {
    1: {
        "apologize for the disruption": "Apologize clearly for the analytics outage.",
        "confirm engineers are restoring analytics": "Confirm engineering is restoring analytics right now.",
        "eta 2 hours": "Share the two hour recovery estimate.",
        "offer csv export workaround": "Offer the CSV export workaround while they wait.",
    },
    2: {
        "reference SOC 2 signed export": "State that the February SOC 2 export is signed.",
        "mention signature hash": "Provide the signature hash 9f2b-aa14.",
        "outline secure download steps": "Outline how to download it from the Compliance Hub after MFA.",
        "invite follow-up": "Invite them to let you know if more validation is required.",
    },
    3: {
        "acknowledge deploy regression": "Acknowledge yesterday's deploy regression.",
        "cite retry queue draining": "Mention that the retry queue drained and paused.",
        "advise requeue via CLI": "Instruct them to run `orbitserve webhooks replay --since=24h`.",
        "escalate to on-call SRE": "Note that the on-call SRE is watching the workers.",
    },
    4: {
        "affirm PHI encryption": "Affirm PHI stays encrypted in transit and at rest.",
        "describe just-in-time access controls": "Describe the just-in-time access controls with dual approval.",
        "note audit logging retention": "Explain that audit logs are retained for seven years.",
        "share HIPAA knowledge base link": "Share HIPAA article HIPAA-privacy-411.",
    },
}


class PromptEditingAgent(BaseModel):
    method: str = "Prompt editing"
    initial_points: Dict[int, List[str]] = Field(
        default_factory=lambda: {
            ticket.id: ticket.key_points[:2] for ticket in TICKETS
        }
    )
    max_versions: int = 3

    def respond(self, ticket: Ticket) -> AgentRun:
        messages = [AgentMessage(speaker="user", text=ticket.issue)]
        selected = list(self.initial_points[ticket.id])
        instructions = [PROMPT_HINTS[ticket.id][kp] for kp in selected]
        coverage = 0.0
        hits: Dict[str, bool] = {kp: False for kp in ticket.key_points}
        version = 0
        final_reply = ""
        while version < self.max_versions:
            version += 1
            prompt_text = "Instructions v{}: {}".format(
                version, " | ".join(instructions)
            )
            messages.append(AgentMessage(speaker="editor", text=prompt_text))
            reply = compose_reply(ticket, selected)
            coverage, hits = evaluate_response(reply, ticket)
            messages.append(AgentMessage(speaker="assistant", text=reply))
            final_reply = reply
            missing = [kp for kp, ok in hits.items() if not ok]
            if not missing:
                break
            instructions.extend(PROMPT_HINTS[ticket.id][kp] for kp in missing)
            selected.extend(missing)
            messages.append(
                AgentMessage(
                    speaker="editor",
                    text=f"Evaluator feedback: add {missing}.",
                )
            )
        return AgentRun(
            ticket_id=ticket.id,
            method=self.method,
            final_response=final_reply,
            messages=messages,
            coverage=coverage,
            coverage_hits=hits,
            length=words(final_reply),
            iterations=len([m for m in messages if m.speaker == "assistant"]),
        )

In [None]:
prompt_edit = PromptEditingAgent()
prompt_example = prompt_edit.respond(TICKETS[0])
pd.DataFrame([m.model_dump() for m in prompt_example.messages])

In [None]:
prompt_scores, prompt_runs = run_agent_over_dataset(prompt_edit)
prompt_scores

## Comparison across methods

Now that every method emits measurable coverage, we can compare them side by side. A coverage of 1.0 means every policy bullet was addressed; lower values mean some bullets were missed.

In [None]:
all_scores = pd.concat(
    [
        baseline_scores,
        few_shot_scores,
        self_scores,
        debate_scores,
        tot_scores,
        prompt_scores,
    ]
).reset_index(drop=True)

summary = (
    all_scores.groupby("method")[["coverage", "length", "iterations"]]
    .mean()
    .sort_values(by="coverage", ascending=False)
    .round({"coverage": 2, "length": 1, "iterations": 2})
)
summary

### Observations

- Few-shot prompting already reaches perfect coverage because the examples encode every requirement.
- Self-critique and prompt editing start from the same two-point draft and close the remaining gaps automatically. They frame the signal differently (internal critique versus external evaluator feedback) but both converge to the same coverage, at the cost of a second iteration.
- Debate and tree-of-thought offer structured alternatives when you want to surface diverse rationales before choosing a plan. In this toy setup the debate judge picks a single advocate, and each advocate argues from a subset of the key points, so its coverage stays partial. The tree-of-thought search reaches full coverage because one branch lists every key point.
- Automated prompt editing is the only method here that explicitly uses an external evaluation signal (the coverage score) to rewrite its instructions, demonstrating the "reflection with external inputs" pattern.

## When to mix models for generation and critique

Reflection loops often benefit from pairing different model sizes or providers:

- Use a **fast, inexpensive generator** (e.g., a 7B open-weight model) for initial drafts. Reserve a **larger critic** such as GPT-4o or Claude Opus for critique when policy coverage or safety is critical.
- Specialized critics, such as compliance checkers or security policy validators, are useful when external checklists (audit controls, legal phrasing) are required. These map to the external-signal loops demonstrated by the prompt editor above.
- When latency matters, prefer single-pass improvements (few-shot or checklist-based prompt edits). For higher-stakes tasks, add iterative loops (self-critique, debate, tree-of-thought) even if they increase runtime.

Experimentation tip: start by logging coverage and iteration counts exactly as we do here. Once a loop consistently hits the required coverage, you can swap the deterministic snippets with actual model calls while keeping the evaluation harness unchanged.
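To make the swap described in the tip concrete, the loop below keeps the generator, critic, and reviser behind plain callables so that deterministic stubs can later be replaced by model calls. This is a minimal sketch under assumptions: `reflect`, `Generator`, `Critic`, and `Reviser` are hypothetical names, and the stub functions stand in for whatever model client you use.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str], str]            # issue -> draft
Critic = Callable[[str], List[str]]         # draft -> missing items
Reviser = Callable[[str, List[str]], str]   # draft, missing -> new draft


def reflect(
    issue: str,
    generate: Generator,
    critique: Critic,
    revise: Reviser,
    max_rounds: int = 2,
) -> Tuple[str, int]:
    """Run generate -> critique -> revise until the critic is satisfied."""
    draft = generate(issue)
    rounds = 0
    for _ in range(max_rounds):
        missing = critique(draft)
        if not missing:
            break
        draft = revise(draft, missing)
        rounds += 1
    return draft, rounds


# Deterministic stubs (assumed for illustration); swap in model-backed
# callables with the same signatures later.
REQUIRED_TERMS = ["sorry", "two hours"]


def stub_generate(issue: str) -> str:
    return "We are sorry about the outage."


def stub_critique(draft: str) -> List[str]:
    return [term for term in REQUIRED_TERMS if term not in draft.lower()]


def stub_revise(draft: str, missing: List[str]) -> str:
    return draft + " Expect recovery within " + " and ".join(missing) + "."


final, rounds = reflect(
    "Dashboard returns 500 errors.", stub_generate, stub_critique, stub_revise
)
print(rounds, final)
```

Because only the three callables change, a small generator and a larger critic can be paired without touching the loop or the coverage logging around it.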