# 07. Multi-Agent Workflows and Agentic Swarms

In the real world, **no single agent can solve every problem optimally**. As tasks grow in **uncertainty, dimensionality, and interdependence** (strategy games, simulations, robotics, real-time business systems), we naturally move from *single-agent* reasoning to **multi-agent workflows**.

These workflows mirror how humans collaborate:

* 🗳️ **Democratic committees** balance diverse perspectives.
* 🧭 **Hierarchical managers** coordinate specialists under limited resources.
* ⚖️ **Actor-Critic systems** separate exploration (actor) from judgment (critic).

Each pattern encodes a different *philosophy of coordination*: distributing intelligence across specialized roles that communicate, negotiate, and arbitrate toward a shared goal.


### ⚙️ What Are Multi-Agent Workflows?

A **multi-agent workflow** is a structured network of reasoning and action nodes (planners, evaluators, arbiters, memory modules) that interact through explicit channels rather than a single monolithic prompt.

Think of it as a **graph of decision-making** where:

* Nodes = agents (LLMs, heuristics, or functions).
* Edges = communication or dependency between them.
* Memory = shared context that persists across steps.
* Arbitration = how conflicting opinions are resolved.

This structure enables:

* **Parallel specialization** (multiple evaluators in parallel).
* **Conditional routing** (managers deciding who to consult).
* **Resource budgeting** (decide when to skip expensive reasoning).
* **Explainability & debugging** (explicit traces of who decided what).
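
To make the four ingredients concrete, here is a minimal plain-Python sketch, not tied to any library. The evaluator functions, option fields (`power`, `risk`), and scoring rules are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Callable, Dict, List

# Nodes: plain functions standing in for agents (hypothetical scoring rules).
def damage_evaluator(option: dict) -> float:
    return option["power"] / 100

def safety_evaluator(option: dict) -> float:
    return 1.0 - option["risk"]

# Edges: the evaluators that the decision step consults.
EVALUATORS: List[Callable[[dict], float]] = [damage_evaluator, safety_evaluator]

def decide(options: Dict[str, dict], memory: List[str]) -> str:
    # Arbitration: average the evaluators' scores and pick the best option.
    scores = {name: mean(ev(opt) for ev in EVALUATORS) for name, opt in options.items()}
    choice = max(scores, key=scores.get)
    # Memory: shared context that persists across steps.
    memory.append(f"chose {choice} with score {scores[choice]:.2f}")
    return choice

memory: List[str] = []
options = {
    "attack": {"power": 90, "risk": 0.6},
    "defend": {"power": 20, "risk": 0.1},
}
choice = decide(options, memory)
print(choice, memory)
```

Averaging is only one possible arbitration rule; voting or a dedicated arbiter agent would slot into the same place.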


### 🧩 Enter [Pydantic Graphs](https://ai.pydantic.dev/graph/)

Building and managing these interactions manually is painful: tracking state, type safety, branching, and parallel execution gets messy fast.

[Pydantic Graphs](https://ai.pydantic.dev/graph/) addresses this by combining:

* ✅ **Typed data flow** from Pydantic models, so every node's input and output are structured, validated, and traceable.
* 🕸️ **Graph orchestration**: agents and their dependencies are defined as composable, inspectable workflows.
* 🔁 **Parallel & conditional execution**: routing logic (manager/critic decisions) is expressed by the node each step returns, and fan-out (multiple evaluators) can be run concurrently.
* 🧾 **Transparent traces**: every step's inputs, outputs, and reasoning can be logged, visualized, and replayed.

Together, they turn the messy spaghetti of agent calls into a **declarative decision graph**, a scalable foundation for complex, memory-aware, multi-agent systems.

### 🧩 What is `poke-env`?

[`poke-env`](https://poke-env.readthedocs.io/en/stable/) is a **Python interface to the Pokémon Showdown battle simulator**, providing an environment for reinforcement learning and AI experiments.
It exposes each battle as a structured API, giving access to game state (Pokémon, moves, types, HP, etc.) and allowing agents to pick legal actions programmatically.

In our workflow, we'll use **`poke-env` as the testbed** to:

* ⚔️ **Pit different multi-agent strategies** (democratic, manager, actor-critic) against each other.
* 📊 **Compare performance** through metrics like win rate, turns survived, and move efficiency.
* 🧠 **Benchmark reasoning styles**, seeing how coordination strategies translate into competitive outcomes.

Before running experiments, we'll start a local Pokémon Showdown server instance. This spins up a self-contained battle environment where our agents can safely train, plan, and battle, which makes Pokémon a convenient arena for testing *agentic intelligence in action*.

In [1]:
from src.pokemon_showdown_setup import run_pokemon_showdown

pokemon_container = run_pokemon_showdown()

🟢 Container already running: pokemon-showdown (4a670e8e0ee1)


#### 🧪 Getting Started with `poke-env`

Before building our custom multi-agent workflows, let's first understand how the **`poke-env`** battle environment works.
It lets us simulate Pokémon battles between automated agents. Here, we'll start with two simple **`RandomPlayer`** agents that pick legal moves at random.

By running a quick **cross-evaluation**, we can see how `poke-env` orchestrates matches, tracks results, and reports win rates. This forms the foundation on which our more sophisticated, reasoning-based agents will later compete.

In [2]:
from poke_env.player.player import Player
from poke_env import RandomPlayer, cross_evaluate

from tabulate import tabulate

first_player = RandomPlayer()
second_player = RandomPlayer()

players = [first_player, second_player]

async def test_cross_evaluation(players, n_challenges=5):
    cross_evaluation = await cross_evaluate(players, n_challenges=n_challenges)

    table = [["-"] + [p.username for p in players]]
    for p_1, results in cross_evaluation.items():
        table.append([p_1] + [cross_evaluation[p_1][p_2] for p_2 in results])

    return tabulate(table)

print(await test_cross_evaluation(players))

--------------  --------------  --------------
-               RandomPlayer 1  RandomPlayer 2
RandomPlayer 1                  0.4
RandomPlayer 2  0.6
--------------  --------------  --------------


Let's see an example battle in action. 

<iframe src="https://shreshthtuli.github.io/build-your-own-super-agents/assets/random_battle.html" width="100%" height="480" style="border:none; overflow:hidden;"></iframe>

*You can also view the full battle [here](https://shreshthtuli.github.io/build-your-own-super-agents/assets/random_battle.html).*

### ⚡ Creating a "Max Damage" Baseline

To add another simple benchmark beyond the `RandomPlayer`, we'll define a **`MaxDamagePlayer`**: an agent that always selects the available move with the highest base power, terastallizing whenever it can.

This gives us a more *deterministic and aggressive* baseline that prioritizes raw damage output over safety or strategy.
By comparing our Pydantic AI agent against both **Random** and **MaxDamage** players, we can see whether reasoning and memory-aware planning lead to better decision-making than brute-force move selection.


In [3]:
class MaxDamagePlayer(Player):
    def choose_move(self, battle):
        if battle.available_moves:
            best_move = max(battle.available_moves, key=lambda move: move.base_power)

            if battle.can_tera:
                return self.create_order(best_move, terastallize=True)

            return self.create_order(best_move)
        else:
            return self.choose_random_move(battle)
        
players = [first_player, MaxDamagePlayer()]

print(await test_cross_evaluation(players))

-----------------  --------------  -----------------
-                  RandomPlayer 1  MaxDamagePlayer 1
RandomPlayer 1                     0.0
MaxDamagePlayer 1  1.0
-----------------  --------------  -----------------


### 🎮 Pokémon battle mechanics, and how we encode them for our agents

Let's now build a simple agent that plays the same game by reasoning: it uses the battle context to choose the best available action.

**Core mechanics (what the agent must reason about):**

* **Turn-based actions:** Each turn you either **use a move** or **switch**. Faster Pokémon usually act first; **priority** can override speed.
* **Types & STAB:** Moves have types (e.g., *Electric*). Effectiveness depends on the move's type vs the defender's types; using a move that matches one of the user's types grants **STAB** (same-type attack bonus).
* **Accuracy & PP:** Moves can **miss** (accuracy < 100%) and have limited **PP** (uses).
* **HP & fainting:** A Pokémon faints at 0 HP; the win condition is to **faint all opponent Pokémon**.
* **Information limits:** You only know the opponent's **revealed** Pokémon and partial info about their sets.
* **Switching & tempo:** Switching preserves a weakened Pokémon, but concedes tempo (the opponent gets a "free" hit).
* **Status, boosts & weather:** The schema below records each Pokémon's status condition and stat boosts, plus the active weather. Entry hazards and other side conditions also exist in the simulator but are omitted here for brevity; we can add them later as fields.
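
As a worked illustration of how types, STAB, and accuracy combine, the sketch below computes a rough expected-damage proxy. It is a simplification under stated assumptions: the `damage_proxy` helper is hypothetical, the type chart holds only three entries, and real damage also depends on stats, items, abilities, and random rolls.

```python
# Partial type chart for illustration only: (move type, defender type) -> multiplier.
TYPE_CHART = {
    ("ELECTRIC", "WATER"): 2.0,   # super-effective
    ("ELECTRIC", "GRASS"): 0.5,   # resisted
    ("ELECTRIC", "GROUND"): 0.0,  # immune
}

def damage_proxy(base_power, accuracy, move_type, user_types, target_types):
    """Rough expected-damage score: power x accuracy x STAB x type effectiveness."""
    stab = 1.5 if move_type in user_types else 1.0
    effectiveness = 1.0
    for defender_type in target_types:
        # Pairs missing from the partial chart are treated as neutral (1.0).
        effectiveness *= TYPE_CHART.get((move_type, defender_type), 1.0)
    return base_power * accuracy * stab * effectiveness

# A 90-power Electric move used by an Electric-type attacker:
vs_water = damage_proxy(90, 1.0, "ELECTRIC", ["ELECTRIC"], ["WATER"])
vs_grass_poison = damage_proxy(90, 1.0, "ELECTRIC", ["ELECTRIC"], ["GRASS", "POISON"])
vs_ground = damage_proxy(90, 1.0, "ELECTRIC", ["ELECTRIC"], ["GROUND"])
print(vs_water, vs_grass_poison, vs_ground)  # 270.0 67.5 0.0
```

The same move ranges from excellent to useless depending on the target, which is why base power alone is a weak selection rule.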

#### 🧱 Our context schema (how we feed the LLM the game state)

We transform poke-env's `Battle` into a **typed, LLM-friendly snapshot**:

* `TeamMon`: one entry per Pokémon (both sides) with:

  * `species`, fractional `hp`, `fainted`, `types`, and the optional `boosts`, `status`, and `must_recharge`.
* `MoveOption`: one entry per **legal move this turn** with:

  * `move_id`, `base_power`, `accuracy`, `move_type`, `pp`, `priority`.
* `SwitchOption`: one entry per **legal switch target** with:

  * `species`, `hp`, `fainted`, `types`.
* `AgentContext`: the full **decision frame** the agent sees:

  * `turn`: current turn number.
  * `weather`: active weather effects, if any.
  * `you_active` / `opp_active`: currently active Pokémon on both sides.
  * `you_team`: your full team (known).
  * `opp_known`: only **revealed** opponent Pokémon (respecting partial observability).
  * `legal_moves` / `legal_switches`: the **only actions you may take now**.
  * `past_actions`: a short **episodic memory** string list (e.g., summaries of last turns).


#### 🛠️ How the code builds this context

* `_pokemon_to_teammon(p)` converts a poke-env `Pokemon` into our `TeamMon` schema (species, HP fraction, types, boosts, status, recharge flag).
* In `build_context(battle, past_actions)`:

  * We iterate `battle.available_moves` to populate `MoveOption` (capturing **damage proxies** via base power, **reliability** via accuracy, **tempo** via priority, and **resource** via PP).
  * We iterate `battle.available_switches` to populate `SwitchOption` (capturing **survivability** options).
  * We map your full `battle.team` into `you_team` and the opponent's **revealed** team into `opp_known` (partial info).
  * We capture actives (`you_active`, `opp_active`), the `turn` counter, and the current `weather`.
  * We attach `past_actions` so the LLM can reason with **short-term memory**.
* `agent_context_to_string(ctx)` serializes the `AgentContext` to **pretty JSON**, ideal for prompting an LLM agent.

**Result:** every decision step provides a **compact and validated** view of what matters **now**, aligning game mechanics with agent reasoning (damage, risk, tempo, information, and legal constraints).


In [4]:
from __future__ import annotations
from typing import List, Optional, Dict, Any, Literal
from dataclasses import dataclass

from pydantic import BaseModel, Field
from pydantic_ai import Agent

from poke_env.battle.battle import Battle, Pokemon

class TeamMon(BaseModel):
    species: str
    hp: Optional[float] = None
    fainted: bool = False
    types: List[str] = []
    boosts: Optional[Dict[str, int]] = None
    status: Optional[str] = None
    must_recharge: Optional[bool] = None

class MoveOption(BaseModel):
    move_id: str
    base_power: Optional[int] = None
    accuracy: Optional[float] = None
    move_type: Optional[str] = None
    pp: Optional[int] = None
    priority: int = 0

class SwitchOption(BaseModel):
    species: str
    hp: Optional[float] = None
    fainted: bool = False
    types: List[str] = []

class AgentContext(BaseModel):
    turn: int
    weather: Dict[str, Any]
    # You
    you_active: Optional[str]
    you_team: List[TeamMon]
    # Opponent
    opp_active: Optional[str]
    opp_known: List[TeamMon]
    # Legals
    legal_moves: List[MoveOption]
    legal_switches: List[SwitchOption]
    # Short episodic memory (last few actions / summaries)
    past_actions: List[str] = []

def _pokemon_to_teammon(p: Pokemon) -> TeamMon:
    return TeamMon(
        species=p.species,
        hp=p.current_hp_fraction,
        fainted=p.fainted,
        boosts=p.boosts,
        status=p.status.name if p.status else None,  # Status is an enum in poke-env; store its name as a plain string
        must_recharge=p.must_recharge,
        types=[t.name for t in p.types or []],
    )

def build_context(battle: Battle, past_actions: List[str]) -> AgentContext:
    # legal moves
    legal_moves: List[MoveOption] = []
    for m in battle.available_moves:
        legal_moves.append(MoveOption(
            move_id=m.id,
            base_power=m.base_power,
            accuracy=m.accuracy,
            move_type=m.type.name,
            pp=m.current_pp,
            priority=m.priority,
        ))

    # legal switches
    legal_switches: List[SwitchOption] = []
    for p in battle.available_switches:
        legal_switches.append(SwitchOption(
            species=p.species,
            hp=p.current_hp_fraction,
            fainted=p.fainted,
            types=[t.name for t in (p.types or [])],
        ))

    # teams
    your_team = [_pokemon_to_teammon(poke) for poke in battle.team.values()]
    opp_known = [_pokemon_to_teammon(poke) for poke in battle.opponent_team.values() if poke._revealed] # revealed only

    return AgentContext(
        turn=battle.turn,
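        # Weather keys are enums in poke-env; store their names so the dict has string keys
        weather={w.name: turn for w, turn in battle.weather.items()},
        # Either active slot can be empty (e.g., before the opponent's lead is revealed)
        you_active=battle.active_pokemon.species if battle.active_pokemon else None,
        you_team=your_team,
        opp_active=battle.opponent_active_pokemon.species if battle.opponent_active_pokemon else None,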
        opp_known=opp_known,
        legal_moves=legal_moves,
        legal_switches=legal_switches,
        past_actions=past_actions, 
    )

def agent_context_to_string(ctx: AgentContext) -> str:
    return ctx.model_dump_json(indent=2)

### 🤖 A minimal "thinking player"

**Goal:** turn the JSON context we built into a **single legal action** (move or switch) using a typed LLM agent, and keep a tiny episodic memory of what we did.

**1) Structured output contract = `Decision`**

* We define a **Pydantic schema** that the LLM must fill:

  * `kind`: `"move"` or `"switch"`.
  * `move_id` / `switch_species`: only one is required depending on `kind`.
  * `rationale`: short explanation (useful for logs and later learning).
* This keeps the model honest and makes post-processing trivial.
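
The "only one is required depending on `kind`" rule is stated in the field descriptions but is not enforced by the schema itself. A minimal sketch of one way to enforce it with a Pydantic v2 `model_validator` follows; `StrictDecision` is a hypothetical variant for illustration, not the class used in the rest of this notebook.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError, model_validator


class StrictDecision(BaseModel):
    kind: Literal["move", "switch"]
    move_id: Optional[str] = None
    switch_species: Optional[str] = None
    rationale: str

    @model_validator(mode="after")
    def check_target_matches_kind(self) -> "StrictDecision":
        # Reject a decision whose payload does not match its declared kind.
        if self.kind == "move" and not self.move_id:
            raise ValueError("move_id is required when kind == 'move'")
        if self.kind == "switch" and not self.switch_species:
            raise ValueError("switch_species is required when kind == 'switch'")
        return self


ok = StrictDecision(kind="move", move_id="thunderbolt", rationale="strong hit")

try:
    StrictDecision(kind="switch", rationale="no target given")
    rejected = False
except ValidationError:
    rejected = True
```

When such a model is used as an agent's output type, a failed validation would typically be fed back to the model as a retry rather than reaching the game loop.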

**2) The LLM-powered player = `PydanticLLMPlayer`**

* Extends `poke-env`'s `Player`.
* Sets up a **Pydantic AI `Agent`** (`self.battle_agent`) with:

  * A **system prompt** encoding a simple policy: prefer **high-accuracy, super-effective** moves; **switch** if danger is high or moves are poor; **never invent illegal actions**.
  * `output_type=Decision` so the model must return a valid, typed object.

**3) Decision loop = `choose_move(...)`**

1. **Build context**
   `ctx = build_context(battle, past_actions=self._past_actions)`
   → captures the current game state + short episodic memory.
2. **Call the agent**
   `decision = self.battle_agent.run_sync(agent_context_to_string(ctx)).output`
   → the LLM reads the JSON and returns a validated `Decision`.
3. **Legality mapping**

   * If `kind == "move"`, find the **exact** `move_id` in `battle.available_moves` and `create_order(m)`.
   * If `kind == "switch"`, match `switch_species` in `battle.available_switches` and `create_order(p)`.
   * We **append a human-readable summary** to `_past_actions` for the next turn's context.
4. **Safety fallback**
   If, for any reason, the decision isn't legal (should be rare), we **choose a random legal action** so the game continues.

**4) Why this works well**

* **Typed outputs** reduce prompt-engineering brittleness (no regex parsing or guesswork).
* **Context → Decision → Action** is clean, auditable, and easy to extend (plug in evaluators/critics later).
* The **episodic memory** (`_past_actions`) gives the agent short-term continuity across turns; each entry is a one-line summary, which keeps the added context small.
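
Note that the player below appends to `_past_actions` on every turn, so the list keeps growing over a long battle. If context size becomes a concern, one option is a bounded buffer that keeps only the most recent summaries. The sketch assumes a hypothetical `EpisodicMemory` helper that is not part of the notebook's code.

```python
from collections import deque
from typing import List


class EpisodicMemory:
    """Keeps only the most recent `max_items` action summaries."""

    def __init__(self, max_items: int = 5):
        # A deque with maxlen drops the oldest entry automatically when full.
        self._items = deque(maxlen=max_items)

    def add(self, summary: str) -> None:
        self._items.append(summary)

    def as_list(self) -> List[str]:
        return list(self._items)


recent = EpisodicMemory(max_items=3)
for turn in range(1, 8):
    recent.add(f"T{turn}: MOVE thunderbolt")

print(recent.as_list())
```

Passing `recent.as_list()` as `past_actions` would keep the prompt size roughly constant regardless of battle length.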

In [5]:
from rich import print as rprint
import nest_asyncio
import logfire

from poke_env import Player, RandomPlayer

logfire.configure()
nest_asyncio.apply()

class Decision(BaseModel):
    kind: Literal["move", "switch"] = Field(description="Choose 'move' or 'switch'.")
    move_id: Optional[str] = Field(default=None, description="Required if kind == 'move'")
    switch_species: Optional[str] = Field(default=None, description="Required if kind == 'switch'")
    rationale: str

class PydanticLLMPlayer(Player):
    def __init__(self, name: str, model: str = "openrouter:openai/gpt-4o-mini", **kwargs):
        super().__init__(**kwargs)
        self.name = name
        self._past_actions: List[str] = []
        self.battle_agent = Agent(
            model=model,
            system_prompt=(
                "You are a Pokémon battle planner. "
                "Given the current battle context, choose ONE legal action. "
                "Prefer high-accuracy, super-effective moves; "
                "switch if active Pokémon risks being KO'd or has no good moves. "
                "Never invent illegal actions."
            ),
            output_type=Decision,
        )

    def choose_move(self, battle: Battle):
        # Build structured context for the agent
        ctx = build_context(battle, past_actions=self._past_actions)
        # Run agent to get decision
        decision = self.battle_agent.run_sync(agent_context_to_string(ctx)).output
        if battle.turn <= 1:
            rprint("CONTEXT:", ctx)
            rprint("DECISION:", decision)
        else:
            rprint(f"T{ctx.turn} DECISION:", decision.rationale)

        # Map Decision → poke-env action
        if decision.kind == "move":
            # find the matching legal move
            for m in battle.available_moves:
                if m.id == decision.move_id:
                    self._past_actions.append(
                        f"T{ctx.turn}: MOVE {decision.move_id} ({ctx.you_active} vs {ctx.opp_active})"
                    )
                    return self.create_order(m)
        elif decision.kind == "switch":
            # find the matching legal switch
            for p in battle.available_switches:
                if p.species == decision.switch_species:
                    self._past_actions.append(
                        f"T{ctx.turn}: SWITCH to {decision.switch_species} (from {ctx.you_active})"
                    )
                    return self.create_order(p)

        # Fallback: if agent suggested an illegal action (shouldn't happen), choose random
        self._past_actions.append(f"T{ctx.turn}: FALLBACK random")
        return self.choose_random_move(battle)


### ⚔️ Running our first Agentic battle

Now that we've built our **LLM-powered Pokémon agent**, it's time to see it in action!
Here we instantiate the `PydanticLLMPlayer` and let it **battle a `RandomPlayer`** for a single match.

When the battle runs:

1. Each turn, the **`agentic_player`** builds a structured `AgentContext` (game state + short memory).
2. The **LLM agent** reasons over that context and outputs a **typed `Decision`** (`move` or `switch`).
3. The environment executes that decision, updates the game state, and loops until one side has no Pokémon left.

This quick match serves as a **smoke test**: it verifies that our agent can read the environment, reason with context, and select legal actions correctly before we scale up to multi-agent graphs and tournaments.

In [10]:
agentic_player = PydanticLLMPlayer(name="LLM Agent")

await agentic_player.battle_against(RandomPlayer(), n_battles=1)

## 🕸️ Introducing *Pydantic Graphs*: the foundation for structured multi-agent workflows

So far, our agent acted as a **single decision-maker**: it observed context, reasoned once, and returned a move. But as environments grow in complexity (multiple objectives, conflicting strategies, limited time), we need **many specialized agents** working together.

That's where **🧩 Pydantic Graphs** come in.


### ⚙️ What are *Pydantic Graphs*?

**Pydantic Graphs** extend the idea of *typed LLM workflows*: instead of chaining prompts manually, you define a **graph of nodes**. Each node is a `BaseNode` subclass whose `run` method can call an `Agent`, a tool, or plain Python, and the node it returns determines which edge is followed next.

Shared state and node outputs are typed with **Pydantic models**, which helps ensure that
✅ every agent receives valid structured data,
✅ workflows are composable, debuggable, and inspectable,
✅ and conditional or parallel execution ("run these 3 evaluators in parallel") is straightforward to express.


### 🤝 Why multi-agent workflows?

Real decision problems rarely have one "best" heuristic; they're **multi-objective**:

* Tactical reward vs safety (damage vs survivability)
* Short-term payoff vs long-term setup
* Exploration vs exploitation

Multi-agent graphs let you **distribute cognition**:

* Each node/agent handles a sub-skill (planner, tactician, risk, scout).
* Coordination logic (e.g., a manager or arbiter) **fuses** their reasoning.
* Memory and arbitration layers can be swapped independently (for ablations).

This architecture naturally scales to **agentic swarms**: large ensembles of specialized agents that coordinate dynamically, aiming for behavior beyond what a single model handles well.


### 🔀 Static vs Dynamic Query Routing

In our earlier "Manager" agent, the routing (which specialists to call) was **static**. We hard-coded: "always call Tactician, call Risk if danger > 0.6, call Scout every 3 turns".

**Dynamic routing**, enabled by Pydantic Graphs, makes this adaptive:

* Each agent's outputs (or intermediate metadata like *uncertainty*, *cost*, or *confidence*) can dynamically decide the next edges to traverse.
* If the planner returns low-confidence moves, the graph might automatically trigger the *Risk Officer* or *Critic* path.
* If confidence is high, it can skip extra steps to save latency or tokens.

🧩 **Benefit:** Resource-aware, self-adapting workflows that scale gracefully: the system "thinks harder" only when needed.
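
The routing rule itself can be very small. The sketch below is plain Python with hypothetical step names and thresholds (`threshold=0.7`, `danger > 0.6`); in a Pydantic Graph the same branching would be expressed by a node's `run()` returning different node types.

```python
from typing import List


def next_steps(planner_confidence: float, danger: float, threshold: float = 0.7) -> List[str]:
    """Decide which specialists to consult before the final decision."""
    steps: List[str] = []
    if planner_confidence < threshold:
        # Low confidence: ask for a second opinion.
        steps.append("critic")
    if danger > 0.6:
        # High danger: bring in the risk specialist.
        steps.append("risk_officer")
    # The decision step always runs last.
    steps.append("decision")
    return steps


confident_route = next_steps(planner_confidence=0.9, danger=0.1)
cautious_route = next_steps(planner_confidence=0.4, danger=0.8)
print(confident_route)  # ['decision']
print(cautious_route)   # ['critic', 'risk_officer', 'decision']
```

With a confident plan and low danger, only one extra call is made; the expensive specialists are consulted only when the inputs justify it.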


### ✏️ Query Rewriting

Another advanced feature is **query rewriting**: incoming queries or contexts are **transformed before being passed** to downstream agents. In Pokémon terms, before the planner decides, a *context rewriter* might:

* Simplify redundant details ("ignore irrelevant side conditions"), or
* Add derived features ("this move is likely super-effective against Water").

This lets different specialists receive **domain-specific representations** of the same state, improving efficiency and interpretability.
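
A context rewriter can be an ordinary function that drops fields a specialist does not need and adds derived ones. The sketch below works on a plain dict; the `opp_types` field, the `likely_super_effective` flag, and the partial `SUPER_EFFECTIVE` table are assumptions made for illustration and are not part of the `AgentContext` schema above.

```python
from typing import Dict, Set

# Hypothetical, partial lookup: move type -> defender types it hits super-effectively.
SUPER_EFFECTIVE: Dict[str, Set[str]] = {
    "ELECTRIC": {"WATER", "FLYING"},
    "GRASS": {"WATER", "GROUND", "ROCK"},
}


def rewrite_for_tactician(context: dict) -> dict:
    """Return a slimmed-down copy of the context with one derived feature per move."""
    keep = ("turn", "you_active", "opp_active")
    slim = {key: context[key] for key in keep if key in context}
    opp_types = set(context.get("opp_types", []))
    slim["legal_moves"] = [
        {
            **move,
            # Derived feature: does this move's type beat any of the opponent's types?
            "likely_super_effective": bool(
                opp_types & SUPER_EFFECTIVE.get(move.get("move_type"), set())
            ),
        }
        for move in context.get("legal_moves", [])
    ]
    return slim


raw = {
    "turn": 3,
    "weather": {},
    "you_active": "pikachu",
    "opp_active": "squirtle",
    "opp_types": ["WATER"],
    "legal_moves": [
        {"move_id": "thunderbolt", "move_type": "ELECTRIC"},
        {"move_id": "tackle", "move_type": "NORMAL"},
    ],
}
rewritten = rewrite_for_tactician(raw)
print(rewritten)
```

The tactician then sees a shorter prompt with the type matchup already worked out, instead of having to infer it from raw fields.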

**🚀 Why it matters**

Together, **dynamic routing** and **query rewriting** turn a static, hand-crafted pipeline into a **living cognitive graph**:

* 💡 *Adaptive*: reasoning depth scales with uncertainty or stakes.
* 🧠 *Modular*: new skills or evaluators can be plugged in as new nodes.
* ⚖️ *Efficient*: token and time budgets are managed intelligently.
* 🔍 *Transparent*: every decision path and intermediate output is traceable.

By using *Pydantic Graphs*, we can move from "prompt chains" to **structured, interpretable agentic systems**, the same architectural step that turns simple agents into cooperative AI swarms.

In [6]:
from pydantic_graph import BaseNode, End, Graph, GraphRunContext

class PlanCandidate(BaseModel):
    kind: Literal["move", "switch"]
    move_id: Optional[str] = None
    switch_species: Optional[str] = None
    rationale: str

class Plan(BaseModel):
    candidates: List[PlanCandidate]

class EvalScore(BaseModel):
    score: float
    notes: Optional[str] = None

class GraphState(BaseModel):
    context: AgentContext
    plan: Optional[Plan] = None 
    tactician_scores: Optional[List[EvalScore]] = None
    risk_scores: Optional[List[EvalScore]] = None
    scout_scores: Optional[List[EvalScore]] = None
    final_decision: Optional[PlanCandidate] = None

## 🧭 The Manager-Coordinator Paradigm in Agentic Swarms

Let's now implement our first multi-agent workflow: the **managerial multi-agent coordination pattern**, built as a structured **Pydantic Graph** in which each node acts as a specialized agent and together they form a **coordinated decision-making swarm**.



### 🕸️ The Idea: Manager + Specialists = Smarter Decisions

Instead of relying on a single monolithic model, this setup distributes reasoning across multiple specialized roles, just like a management hierarchy in human organizations:

* **PlannerNode (Coordinator)** 🧠 → proposes candidate actions (moves/switches).
* **TacticianNode** ⚔️ → evaluates each candidate for *expected value* (damage, tempo).
* **RiskNode** 🛡️ → evaluates *safety and survivability*.
* **ScoutNode** 🔍 → evaluates *information gain* (learning about opponent's hidden Pokémon).
* **DecisionNode (Manager)** 🧩 → aggregates all scores and makes the final move selection.

Each node operates independently but shares a common state (`GraphState`) that persists through the workflow. This gives the system **continuity**, **explainability**, and **structured memory** across reasoning steps.
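
In this notebook the final aggregation is delegated to an LLM. As a point of comparison, a deterministic alternative is a weighted sum of the three score lists. The sketch below is an assumption-laden illustration: the `aggregate` helper and the weights `(0.6, 0.3, 0.1)` are hypothetical, and risk is treated as "higher is worse".

```python
from typing import List, Sequence, Tuple


def aggregate(
    tactician: Sequence[float],
    risk: Sequence[float],
    scout: Sequence[float],
    weights: Tuple[float, float, float] = (0.6, 0.3, 0.1),
) -> Tuple[int, List[float]]:
    """Return the index of the best candidate and the combined score per candidate."""
    w_tactician, w_risk, w_scout = weights
    totals = [
        # Risk is inverted so that safer candidates score higher.
        w_tactician * t + w_risk * (1.0 - r) + w_scout * s
        for t, r, s in zip(tactician, risk, scout)
    ]
    best = max(range(len(totals)), key=totals.__getitem__)
    return best, totals


# Two candidates: a strong but risky attack, and a safe but weaker switch.
best, totals = aggregate(tactician=[0.8, 0.4], risk=[0.7, 0.1], scout=[0.2, 0.5])
print(best, totals)
```

With these weights the attack narrowly wins; raising the risk weight would tip the choice toward the switch, which makes the trade-off explicit and easy to ablate.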



### ⚙️ How Pydantic Graphs Enable This

Pydantic Graphs make this **explicitly declarative**:

* Each node inherits from `BaseNode` and defines an **`async run()`** method that updates a shared `GraphState`.
* Nodes specify their **next node** through the return type of `run()`, e.g., `PlannerNode → TacticianNode → RiskNode → ScoutNode → DecisionNode → End`.
* The `Graph` object (`planner_graph`) defines the entire workflow and its **state type** (`GraphState`), so the data shared between nodes stays typed.
* `Graph.run()` drives **execution order** by running whichever node the previous `run()` returned, while `GraphRunContext` carries the shared **state** into every step. **Retries** in this example come from the `retries=3` setting on each `Agent`.

This means the graph acts as an **orchestration layer** over multiple LLM calls: a mini "swarm intelligence" system where reasoning flows like information through an organization chart.
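
The "each node returns the next node" mechanism can be imitated in a few lines of standard-library Python. This is a toy runner for intuition only, not the `pydantic_graph` implementation; `State`, `Done`, `Propose`, `Score`, and `run_graph` are hypothetical names.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class State:
    # Shared state that every node can read and update.
    log: List[str] = field(default_factory=list)


class Done:
    """Marks the end of the run and carries the final value."""

    def __init__(self, value: str):
        self.value = value


class Propose:
    async def run(self, state: State):
        state.log.append("propose")
        return Score()  # the returned object is the next node to run


class Score:
    async def run(self, state: State):
        state.log.append("score")
        return Done("thunderbolt")


async def run_graph(start, state: State) -> str:
    node = start
    # Keep running nodes until one of them returns the end marker.
    while not isinstance(node, Done):
        node = await node.run(state)
    return node.value


state = State()
result = asyncio.run(run_graph(Propose(), state))
print(result, state.log)
```

Inside a notebook with a running event loop, `await run_graph(Propose(), state)` would be used instead of `asyncio.run(...)`.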



### 🧩 Why the Manager-Coordinator Model Matters

1. **Decomposition of reasoning:**
   Each agent focuses on a narrow cognitive skill, which simplifies prompts, improves interpretability, and can reduce hallucinations.

2. **Parallelism and composability:**
   The evaluators do not depend on each other, so they can be executed concurrently (the implementation below chains them one after another for simplicity), and new agents (e.g., "Healer Advisor", "Weather Analyst") can be added as extra nodes with only local changes to the graph.

3. **Explainability:**
   Every step is transparent: you can inspect the planner's candidates, each specialist's scores, and the rationale behind the final decision.

4. **Dynamic scalability:**
   The manager can later evolve to **dynamic routing**, consulting only relevant specialists based on battle context or uncertainty, enabling true **adaptive swarms**.
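
Because the three evaluators only read the plan and the context, their calls could be issued concurrently from a single node with `asyncio.gather`. The sketch below uses stub coroutines in place of real agent calls; `fake_evaluator`, its delay, and the fixed `0.5` score are placeholders.

```python
import asyncio
from typing import Dict, Tuple


async def fake_evaluator(name: str, delay: float) -> Tuple[str, float]:
    # Stand-in for an LLM call such as `await some_agent.run(prompt)`.
    await asyncio.sleep(delay)
    return name, 0.5


async def evaluate_concurrently() -> Dict[str, float]:
    # All three coroutines wait at the same time, so total latency is close to
    # the slowest single call rather than the sum of all three.
    results = await asyncio.gather(
        fake_evaluator("tactician", 0.05),
        fake_evaluator("risk", 0.05),
        fake_evaluator("scout", 0.05),
    )
    return dict(results)


scores = asyncio.run(evaluate_concurrently())
print(scores)
```

Given that each sequential node in the run log further down takes tens of seconds, merging the evaluators this way is one plausible route to lower per-turn latency.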



In [14]:
@dataclass
class PlannerNode(BaseNode):
    planner_agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "You propose 2-4 legal actions for the given Pokémon battle context. "
            "Prefer super-effective, high-accuracy moves; consider switches if HP is low or risk is high. "
            "Do NOT invent illegal actions."
        ),
        output_type=Plan,
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> TacticianNode:
        state = context.state
        plans = (await self.planner_agent.run(agent_context_to_string(state.context))).output
        state.plan = plans

        return TacticianNode()
    

@dataclass
class TacticianNode(BaseNode):
    tactician_agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "You are a Pokémon battle tactician. "
            "Score each candidate 0..1 for expected value (damage + board advantage)."
        ),
        output_type=List[EvalScore],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> RiskNode:
        state = context.state
        assert state.plan is not None, "Plan must be set before TacticianNode runs."
        prompt = agent_context_to_string(state.context) + "\n\n" + state.plan.model_dump_json(indent=2)
        scores = (await self.tactician_agent.run(prompt)).output
        state.tactician_scores = scores

        return RiskNode()
    
@dataclass
class RiskNode(BaseNode):
    risk_agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "You are a Pokémon battle risk assessor. "
            "Score each candidate 0..1 for risk (chance of failure, negative outcomes)."
        ),
        output_type=List[EvalScore],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> ScoutNode:
        state = context.state
        assert state.plan is not None, "Plan must be set before RiskNode runs."
        prompt = agent_context_to_string(state.context) + "\n\n" + state.plan.model_dump_json(indent=2)
        scores = (await self.risk_agent.run(prompt)).output
        state.risk_scores = scores

        return ScoutNode()
    
@dataclass
class ScoutNode(BaseNode):
    scout_agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "You are a Pokémon battle scout. "
            "Score each candidate 0..1 for information gain (revealing opponent's unknowns)."
        ),
        output_type=List[EvalScore],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> DecisionNode:
        state = context.state
        assert state.plan is not None, "Plan must be set before ScoutNode runs."
        prompt = agent_context_to_string(state.context) + "\n\n" + state.plan.model_dump_json(indent=2)
        scores = (await self.scout_agent.run(prompt)).output
        state.scout_scores = scores

        return DecisionNode()
    
@dataclass
class DecisionNode(BaseNode):
    decision_agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "You are a Pokémon battle decision maker. "
            "Using the provided scores from tactician, risk, and scout, "
            "select the best candidate action to take."
        ),
        output_type=PlanCandidate,
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> End:
        state = context.state
        assert state.plan is not None, "Plan must be set before DecisionNode runs."
        assert state.tactician_scores is not None, "Tactician scores must be set before DecisionNode runs."
        assert state.risk_scores is not None, "Risk scores must be set before DecisionNode runs."
        assert state.scout_scores is not None, "Scout scores must be set before DecisionNode runs."

        prompt = agent_context_to_string(state.context) + "\n\n"
        prompt += "Planned Candidates:\n" + state.plan.model_dump_json(indent=2) + "\n\n"
        prompt += "Tactician Scores:\n" + str([es.model_dump_json(indent=2) for es in state.tactician_scores]) + "\n\n"
        prompt += "Risk Scores:\n" + str([es.model_dump_json(indent=2) for es in state.risk_scores]) + "\n\n"
        prompt += "Scout Scores:\n" + str([es.model_dump_json(indent=2) for es in state.scout_scores]) + "\n\n"

        decision = (await self.decision_agent.run(prompt)).output
        state.final_decision = decision

        return End(state.final_decision)
    
planner_graph = Graph(nodes=[PlannerNode, TacticianNode, RiskNode, ScoutNode, DecisionNode], state_type=GraphState)  

example_agent_context = AgentContext(
    turn=1,
    weather={},
    you_active="Pikachu",
    you_team=[TeamMon(species="Pikachu", hp=1.0, fainted=False, types=["Electric"], boosts={"atk": 1}, status=None, must_recharge=False)],
    opp_active="Bulbasaur",
    opp_known=[TeamMon(species="Bulbasaur", hp=1.0, fainted=False, types=["Grass", "Poison"], boosts={"def": -1}, status="paralyzed", must_recharge=False)],
    legal_moves=[MoveOption(move_id="Thunderbolt", base_power=90, accuracy=100, priority=0)],
    legal_switches=[SwitchOption(species="Charizard", hp=1.0, fainted=False, types=["Fire", "Flying"])],
    past_actions=[],
)

result = await planner_graph.run(PlannerNode(), state=GraphState(context=example_agent_context))
rprint(result.state.final_decision)

23:16:39.013 run graph planner_graph
23:16:39.014   run node PlannerNode
23:17:22.155   run node TacticianNode
23:17:43.325   run node RiskNode
23:18:17.161   run node ScoutNode
23:18:53.984   run node DecisionNode


Let's see it in action!

In [7]:
from rich import print as rprint
import nest_asyncio
import logfire
import asyncio

logfire.configure()
nest_asyncio.apply()

class PydanticGraphAgent(Player):
    def __init__(self, name: str, agentic_graph: Graph, first_node: BaseNode, **kwargs):
        super().__init__(**kwargs)
        self.name = name
        self._past_actions: List[str] = []
        self.graph = agentic_graph
        self.first_node = first_node

    def choose_move(self, battle: Battle):
        # Build structured context for the agent
        ctx = build_context(battle, past_actions=self._past_actions)
        # Run agent to get decision
        result = asyncio.run(self.graph.run(self.first_node, state=GraphState(context=ctx)))
        decision = result.state.final_decision
        if battle.turn <= 1:
            rprint(f"CONTEXT:", ctx)
            rprint(f"DECISION:", decision)
        else:
            rprint(f"T{ctx.turn} DECISION:", decision.rationale)

        # Map Decision → poke-env action
        if decision.kind == "move":
            # find the matching legal move
            for m in battle.available_moves:
                if m.id == decision.move_id:
                    self._past_actions.append(
                        f"T{ctx.turn}: MOVE {decision.move_id} ({ctx.you_active} vs {ctx.opp_active})"
                    )
                    return self.create_order(m)
        elif decision.kind == "switch":
            # find the matching legal switch
            for p in battle.available_switches:
                if p.species == decision.switch_species:
                    self._past_actions.append(
                        f"T{ctx.turn}: SWITCH to {decision.switch_species} (from {ctx.you_active})"
                    )
                    return self.create_order(p)

        # Fallback: if agent suggested an illegal action (shouldn't happen), choose random
        self._past_actions.append(f"T{ctx.turn}: FALLBACK random")
        return self.choose_random_move(battle)


In [None]:
coordination_graph = Graph(nodes=[PlannerNode, TacticianNode, RiskNode, ScoutNode, DecisionNode], state_type=GraphState)  

coordination_player = PydanticGraphAgent(name="LLM Agent", agentic_graph=coordination_graph, first_node=PlannerNode())

await coordination_player.battle_against(RandomPlayer(), n_battles=1)

22:31:19.383 run graph None
22:31:19.386   run node PlannerNode
22:31:47.861   run node TacticianNode
22:32:17.785   run node RiskNode
22:32:49.956   run node ScoutNode
22:33:20.686   run node DecisionNode


22:33:37.220 run graph None
22:33:37.221   run node PlannerNode
22:34:16.439   run node TacticianNode
22:34:56.803   run node RiskNode
22:36:16.135   run node ScoutNode
22:37:08.924   run node DecisionNode
22:37:30.422 run graph None
22:37:30.422   run node PlannerNode
22:37:59.223   run node TacticianNode
22:38:48.199   run node RiskNode
22:39:21.331   run node ScoutNode
22:39:50.271   run node DecisionNode
22:40:11.907 run graph None
22:40:11.907   run node PlannerNode
22:40:56.972   run node TacticianNode
22:41:35.116   run node RiskNode
22:42:19.953   run node ScoutNode
22:42:57.951   run node DecisionNode
22:43:12.405 run graph None
22:43:12.406   run node PlannerNode
22:44:02.596   run node TacticianNode
22:44:43.567   run node RiskNode
22:45:19.134   run node ScoutNode
22:45:56.038   run node DecisionNode
22:46:04.602 run graph None
22:46:04.603   run node PlannerNode
22:46:35.056   run node TacticianNode
22:47:13.838   run node RiskNode
22:47:48.703   run node ScoutNode
22:48:2

Result:

<iframe src="https://shreshthtuli.github.io/build-your-own-super-agents/assets/coordinator.html" width="100%" height="480" style="border:none; overflow:hidden;"></iframe>

*You can also view the full battle [here](https://shreshthtuli.github.io/build-your-own-super-agents/assets/coordinator.html).*

## 🗳️ Democratic multi-agent swarms

Now let's move on to the next agentic swarm model: **democratic orchestration**. Here,

1. a **Planner** proposes several **legal candidates** (moves/switches),
2. multiple **independent voters** each judge every candidate from a different lens, and
3. a **Tally** node picks the action with the **most YES votes**.

**Nodes & roles**

* **`PlannerNode`** → produces 3–4 legal, diverse candidates for the current battle context.
* **Voters** (parallelizable):

  * `AccuracyVoterNode` – prefers high-reliability actions (≥90% accuracy or a safe switch).
  * `TypeMatchupVoterNode` – rewards good type effectiveness or an improved matchup after a switch.
  * `TempoVoterNode` – prefers momentum (threaten a KO, force a switch, set up safely).
  * `PPVoterNode` – favors conserving scarce PP/resources.
  * `DiversityVoterNode` – encourages non-redundant options (coverage/status/switch variety).
* **`TallyNode`** → sums 0/1 votes per candidate and returns the candidate with the most YES votes (a plurality, not necessarily a majority; ties go to the first maximum).

Each voter returns a **list of 0/1** (YES/NO) aligned with `plan.candidates`, keeping the interface simple and debuggable.
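
The 0/1 contract can be exercised without any LLM calls. The following is a minimal sketch of the tally step, assuming ballots arrive as plain lists; `tally_votes`, `pick_winner`, and the sample ballots are illustrative names and data, not part of the notebook.

```python
from typing import Dict, List


def tally_votes(votes_by_voter: Dict[str, List[int]], n_candidates: int) -> List[int]:
    """Sum 0/1 votes per candidate. Short ballots are padded with 0, long ones truncated."""
    totals = [0] * n_candidates
    for ballot in votes_by_voter.values():
        aligned = (list(ballot) + [0] * n_candidates)[:n_candidates]
        for i, vote in enumerate(aligned):
            totals[i] += int(vote)
    return totals


def pick_winner(totals: List[int]) -> int:
    """Index of the highest total. max() returns the first maximum, so ties go to the earliest candidate."""
    return max(range(len(totals)), key=lambda i: totals[i])


# Hypothetical ballots for three candidates
ballots = {
    "accuracy": [1, 0, 1],
    "type": [1, 1, 0],
    "tempo": [0, 1, 1],
    "pp": [1, 0],  # short ballot, treated as [1, 0, 0]
    "diversity": [0, 1, 0],
}
totals = tally_votes(ballots, n_candidates=3)
winner = pick_winner(totals)
print(totals, winner)  # [3, 3, 2] 0
```

Candidates 0 and 1 tie on three votes each, and the first one wins. That makes the planner's ordering a silent tiebreaker, which is worth keeping in mind when reading the audit output.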


### 🧠 When to use Democratic swarms vs Manager–Coordinator

Use **Democratic** when:

* You want **robustness via diversity**: many simple judges smooth out any one agent’s bias.
* The task benefits from **ensemble wisdom** and **parallel scoring** of options.
* You need **transparent preference profiles** (“why did we pick this?” → look at voter tallies).
* Latency budget allows **fan-out** to several voters.

Use **Manager–Coordinator** when:

* You need **budget-aware routing** (call specialists only when danger/uncertainty is high).
* The task has a **clear decision funnel** (plan → specific specialists → decision).
* You want **conditional depth** (think harder only when needed) for tighter SLAs.
* You prefer a single **final authority** aggregating nuanced scores/metrics.

**Rule of thumb:**

* **Exploration, variety, early prototyping** → start with **Democracy**.
* **Production with SLAs, cost constraints** → move to **Manager/Coordinator** (dynamic routing, early exit).
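
To make "budget-aware routing" and "early exit" concrete, here is a small sketch of a routing rule. Everything in it is an assumption for illustration: the `danger` and `uncertainty` signals, the 0.5 thresholds, and the specialist names are hypothetical and would need to be derived from your own battle context.

```python
from typing import List


def choose_specialists(danger: float, uncertainty: float, max_calls: int) -> List[str]:
    """Return the specialists to consult, most relevant first, capped by the call budget.

    An empty list means "early exit": skip the specialists and decide from the plan alone.
    """
    wanted: List[str] = []
    if danger >= 0.5:
        wanted.append("risk")
    if uncertainty >= 0.5:
        wanted.append("scout")
    if wanted:
        # Only pay for tactical scoring when the turn is already non-trivial.
        wanted.append("tactician")
    return wanted[:max_calls]


print(choose_specialists(danger=0.1, uncertainty=0.2, max_calls=3))  # []
print(choose_specialists(danger=0.9, uncertainty=0.2, max_calls=3))  # ['risk', 'tactician']
print(choose_specialists(danger=0.9, uncertainty=0.9, max_calls=1))  # ['risk']
```

A democratic swarm has no equivalent of the empty list: every voter is consulted on every turn, which is the latency trade-off described above.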


### 🧬 Mixed-model ensembles (per-agent LLMs)

Each node can use a **different LLM** (the code below uses models from OpenAI, Anthropic, Google, xAI, and Qwen via OpenRouter) to **specialize strengths**:

* Models with **longer context** or **stronger reasoning** can power the **Planner** or **Type** voter.
* **Faster/cheaper** models can handle **Accuracy/PP** voters at scale.
* Mixing providers can reduce **correlated failure modes** and improve **ensemble reliability**.

*Benefit:* You get a **portfolio effect**: diverse models + diverse criteria → **more stable decisions** under uncertainty.
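
A back-of-the-envelope calculation shows why uncorrelated errors matter. The sketch below assumes an idealized model that the notebook does not claim: each voter makes one binary judgment and is wrong independently with the same probability `p_wrong`. Real LLM voters are correlated to some degree, so treat the result as an upper bound on the benefit.

```python
from math import comb


def majority_error(n_voters: int, p_wrong: float) -> float:
    """Probability that more than half of n independent voters are wrong."""
    need = n_voters // 2 + 1  # smallest number of wrong voters that forms a majority
    return sum(
        comb(n_voters, k) * p_wrong**k * (1 - p_wrong) ** (n_voters - k)
        for k in range(need, n_voters + 1)
    )


single = majority_error(1, 0.3)
ensemble = majority_error(5, 0.3)
print(round(single, 5), round(ensemble, 5))  # 0.3 0.16308
```

Under these assumptions, five voters that are each wrong 30% of the time produce a wrong majority about 16% of the time. If all five share the same blind spot, the ensemble error moves back toward 30%, which is the argument for mixing providers.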


**🧾 Why this pattern is nice to teach & extend**

* **Simple contract:** voters return `[0/1, …]`; the tally is trivial to audit.
* **Parallel-friendly:** voters are independent of each other, so they can run concurrently for low wall-time (the implementation below chains them sequentially for simplicity).
* **Composable:** add/remove voters without touching the rest of the graph.
* **Explainable:** log `plan.candidates` + each voter’s vector to visualize support per option.
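
Because no voter reads another voter's output, the fan-out can be expressed with plain `asyncio.gather`. The sketch below uses stub coroutines in place of LLM calls; `collect_votes`, the two stub voters, and the string-encoded candidates are hypothetical stand-ins, not the notebook's nodes.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# A voter takes the candidate list and returns one 0/1 vote per candidate.
Voter = Callable[[List[str]], Awaitable[List[int]]]


async def accuracy_voter(candidates: List[str]) -> List[int]:
    await asyncio.sleep(0.01)  # placeholder for an LLM call
    return [1] * len(candidates)


async def tempo_voter(candidates: List[str]) -> List[int]:
    await asyncio.sleep(0.01)  # placeholder for an LLM call
    return [1 if c.startswith("move:") else 0 for c in candidates]


async def collect_votes(voters: Dict[str, Voter], candidates: List[str]) -> Dict[str, List[int]]:
    """Run all voters concurrently. gather() returns results in the order the awaitables were passed."""
    names = list(voters)
    ballots = await asyncio.gather(*(voters[name](candidates) for name in names))
    return dict(zip(names, ballots))


candidates = ["move:Thunderbolt", "switch:Charizard"]
votes = asyncio.run(
    collect_votes({"accuracy": accuracy_voter, "tempo": tempo_voter}, candidates)
)
print(votes)  # {'accuracy': [1, 1], 'tempo': [1, 0]}
```

Inside a notebook that already has a running event loop, `await collect_votes(...)` would replace the `asyncio.run(...)` call. One way to use this in a graph would be a single voting node that awaits all voter agents and writes every ballot to the state before handing off to the tally.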

Next steps: try replacing 0/1 votes with **ranked ballots** (Borda/Condorcet), or add **confidence-weighted voting** to blend democratic and managerial ideas.
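
As a starting point for the ranked-ballot idea, here is a minimal Borda count sketch. It assumes each voter returns a full ranking of candidate indices (most preferred first) instead of a 0/1 vector; `borda_count` and the sample rankings are illustrative only.

```python
from typing import Dict, List, Tuple


def borda_count(rankings: Dict[str, List[int]], n_candidates: int) -> Tuple[List[int], int]:
    """Score candidates by rank position and return (scores, winning index).

    A candidate ranked at 0-based position r earns n_candidates - 1 - r points.
    Ties go to the earliest candidate, matching the first-max rule of the 0/1 tally.
    """
    scores = [0] * n_candidates
    for ranking in rankings.values():
        for position, candidate in enumerate(ranking):
            scores[candidate] += n_candidates - 1 - position
    best = max(range(n_candidates), key=lambda i: scores[i])
    return scores, best


# Hypothetical rankings over three candidates
rankings = {
    "accuracy": [0, 1, 2],
    "type": [1, 0, 2],
    "tempo": [1, 2, 0],
}
scores, best = borda_count(rankings, n_candidates=3)
print(scores, best)  # [3, 5, 1] 1
```

Compared with 0/1 votes, rankings let a voter express that an option is a close second rather than a flat NO, at the cost of a slightly harder output format for the LLM to get right.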

In [16]:
from __future__ import annotations
from typing import List, Optional, Literal, Dict
from dataclasses import dataclass

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_graph import BaseNode, End, Graph, GraphRunContext
from rich import print as rprint

class PlanCandidate(BaseModel):
    kind: Literal["move", "switch"]
    move_id: Optional[str] = None
    switch_species: Optional[str] = None
    rationale: str

class Plan(BaseModel):
    candidates: List[PlanCandidate]

class GraphState(BaseModel):
    context: AgentContext
    plan: Optional[Plan] = None
    accuracy_votes: Optional[List[int]] = None
    type_votes: Optional[List[int]] = None
    tempo_votes: Optional[List[int]] = None
    pp_votes: Optional[List[int]] = None
    diversity_votes: Optional[List[int]] = None
    final_decision: Optional[PlanCandidate] = None

@dataclass
class PlannerNode(BaseNode):
    planner_agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "You are a Pokémon move planner. From the given context, propose 3-4 LEGAL actions "
            "(moves or switches). Prefer super-effective, high-accuracy moves; include at least "
            "one safe SWITCH if current matchup looks poor. Do NOT invent illegal actions."
        ),
        output_type=Plan,
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "AccuracyVoterNode":
        state = context.state
        plan = (await self.planner_agent.run(agent_context_to_string(state.context))).output
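        if len(plan.candidates) == 1:
            # Copy instead of `candidates * 2`: list repetition would alias the same object,
            # so editing the duplicate's rationale would also overwrite the original's.
            duplicate = plan.candidates[0].model_copy(
                update={"rationale": "Fallback duplicate to enable voting."}
            )
            plan.candidates.append(duplicate)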
        state.plan = plan
        return AccuracyVoterNode()


@dataclass
class AccuracyVoterNode(BaseNode):
    agent = Agent(
        model="openrouter:openai/gpt-5-mini",
        system_prompt=(
            "Accuracy Voter: For each candidate, vote 1 if the action is high-reliability "
            "(move accuracy >= 90% or a SWITCH that avoids a likely miss/KO), else 0. "
            "Return a Python list of 0/1 of the same length as candidates, no extra text."
        ),
        output_type=List[int],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "TypeMatchupVoterNode":
        state = context.state
        assert state.plan is not None
        prompt = (
            agent_context_to_string(state.context)
            + "\n\nCANDIDATES:\n"
            + state.plan.model_dump_json(indent=2)
        )
        votes = (await self.agent.run(prompt)).output
        state.accuracy_votes = votes
        return TypeMatchupVoterNode()


@dataclass
class TypeMatchupVoterNode(BaseNode):
    agent = Agent(
        model="openrouter:anthropic/claude-sonnet-4.5",
        system_prompt=(
            "Type Matchup Voter: For each candidate, vote 1 if the MOVE is likely super-effective "
            "or at least neutral (avoid not-very-effective/immunity), or if a SWITCH improves the type matchup; "
            "otherwise 0. Return a Python list of 0/1, same length as candidates."
        ),
        output_type=List[int],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "TempoVoterNode":
        state = context.state
        assert state.plan is not None
        prompt = (
            agent_context_to_string(state.context)
            + "\n\nCANDIDATES:\n"
            + state.plan.model_dump_json(indent=2)
        )
        votes = (await self.agent.run(prompt)).output
        state.type_votes = votes
        return TempoVoterNode()


@dataclass
class TempoVoterNode(BaseNode):
    agent = Agent(
        model="openrouter:google/gemini-2.5-flash",
        system_prompt=(
            "Tempo Voter: Vote 1 for candidates that are likely to seize or keep momentum this turn "
            "(e.g., fast KO, force a switch, gain setup safely); otherwise 0. "
            "Return a Python list of 0/1 aligned with candidates."
        ),
        output_type=List[int],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "PPVoterNode":
        state = context.state
        assert state.plan is not None
        prompt = (
            agent_context_to_string(state.context)
            + "\n\nCANDIDATES:\n"
            + state.plan.model_dump_json(indent=2)
        )
        votes = (await self.agent.run(prompt)).output
        state.tempo_votes = votes
        return PPVoterNode()


@dataclass
class PPVoterNode(BaseNode):
    agent = Agent(
        model="openrouter:x-ai/grok-4-fast",
        system_prompt=(
            "PP Conservation Voter: Prefer conserving scarce PP; vote 1 when the candidate either "
            "uses a common PP move for chip damage or SWITCHES to preserve a key low-PP move; else 0. "
            "Return a Python list of 0/1 aligned with candidates."
        ),
        output_type=List[int],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "DiversityVoterNode":
        state = context.state
        assert state.plan is not None
        prompt = (
            agent_context_to_string(state.context)
            + "\n\nCANDIDATES:\n"
            + state.plan.model_dump_json(indent=2)
        )
        votes = (await self.agent.run(prompt)).output
        state.pp_votes = votes
        return DiversityVoterNode()


@dataclass
class DiversityVoterNode(BaseNode):
    agent = Agent(
        model="openrouter:qwen/qwen3-next-80b-a3b-thinking",
        system_prompt=(
            "Diversity Voter: Encourage non-redundant options. Vote 1 when the candidate adds "
            "coverage not present in other candidates this turn (e.g., different target, status vs raw damage, "
            "or SWITCH to change matchup); else 0. Return a Python list of 0/1 aligned with candidates."
        ),
        output_type=List[int],
        retries=3,
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "TallyNode":
        state = context.state
        assert state.plan is not None
        prompt = (
            agent_context_to_string(state.context)
            + "\n\nCANDIDATES:\n"
            + state.plan.model_dump_json(indent=2)
        )
        votes = (await self.agent.run(prompt)).output
        state.diversity_votes = votes
        return TallyNode()


@dataclass
class TallyNode(BaseNode):
    """Aggregate votes across voters and pick the candidate with the most YES votes."""
    async def run(self, context: GraphRunContext[GraphState]) -> End:
        s = context.state
        assert s.plan is not None

        buckets: List[List[int]] = []
        for name in ["accuracy_votes", "type_votes", "tempo_votes", "pp_votes", "diversity_votes"]:
            votes = getattr(s, name)
            if votes is not None:
                buckets.append(votes)

        n_candidates = len(s.plan.candidates)
        totals = [0] * n_candidates
        for bucket in buckets:
            if len(bucket) != n_candidates:
                # Defensive: truncate/pad to align
                bucket = (bucket + [0] * n_candidates)[:n_candidates]
            for i, v in enumerate(bucket):
                totals[i] += int(v)

        # Majority pick (max votes); deterministic tiebreak = first max
        best_idx = max(range(n_candidates), key=lambda i: totals[i])
        s.final_decision = s.plan.candidates[best_idx]

        # Optional: print a quick audit
        rprint({"totals": totals, "chosen_index": best_idx, "chosen": s.final_decision})

        return End(s.final_decision)

democracy_graph = Graph(
    nodes=[PlannerNode, AccuracyVoterNode, TypeMatchupVoterNode, TempoVoterNode, PPVoterNode, DiversityVoterNode, TallyNode],
    state_type=GraphState,
)

example_agent_context = AgentContext(
    turn=1,
    weather={},
    you_active="Pikachu",
    you_team=[TeamMon(species="Pikachu", hp=1.0, fainted=False, types=["Electric"], boosts={"atk": 1}, status=None, must_recharge=False)],
    opp_active="Bulbasaur",
    opp_known=[TeamMon(species="Bulbasaur", hp=1.0, fainted=False, types=["Grass", "Poison"], boosts={"def": -1}, status="paralyzed", must_recharge=False)],
    legal_moves=[MoveOption(move_id="Thunderbolt", base_power=90, accuracy=100, priority=0)],
    legal_switches=[SwitchOption(species="Charizard", hp=1.0, fainted=False, types=["Fire", "Flying"])],
    past_actions=[],
)

result = await democracy_graph.run(PlannerNode(), state=GraphState(context=example_agent_context))
rprint("Final Democratic Decision:", result.state.final_decision)


23:20:42.677 run graph democracy_graph
23:20:42.677   run node PlannerNode
23:21:17.224   run node AccuracyVoterNode
23:21:44.775   run node TypeMatchupVoterNode
23:21:48.946   run node TempoVoterNode
23:21:50.458   run node PPVoterNode
23:21:59.919   run node DiversityVoterNode
23:24:35.191   run node TallyNode


## 🎭 Actor–Critic Multi-Agent Workflow

Our final agentic swarm model is the **actor–critic** style workflow, again implemented using `pydantic-graph`. This model is inspired by reinforcement learning but adapted for multi-LLM reasoning. Here, we explicitly separate **proposal** and **evaluation**, allowing the agent swarm to reason iteratively about *value* and *risk* before acting.


### 🧩 Roles and flow

The graph proceeds through four main nodes:

1. **`ActorNode` – The policy generator**

   * Proposes 3–4 **legal moves or switches** based on the current Pokémon context.
   * Behaves like a policy network, outputting candidate actions with rationales.

2. **`CriticNode` – The value estimator**

   * Evaluates each candidate with a **Q-value** (expected outcome), **risk**, and **confidence** score.
   * This acts as a “value network” estimating how good each candidate really is.

3. **`ImproveNode` – The policy improver (optional)**

   * If the highest-Q candidate has a low Q-value or is too risky, the improver agent proposes up to two **refined** alternatives, which are appended to the plan.
   * The critic is then re-invoked to rescore the extended candidate list; this refinement runs at most once per turn.
   * This mimics the *actor–critic policy improvement loop* in RL.

4. **`SelectorNode` – The final decision layer**

   * Combines critic outputs into an adjusted score:
     $\text{AdjustedScore} = Q \times (1 - \lambda \cdot \text{risk}) \times (0.5 + 0.5 \times \text{confidence})$
   * Picks the candidate with the highest adjusted value and terminates the graph by returning `End`.
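
A quick worked example shows how the adjusted score can overturn the raw Q-value ranking. This is a standalone sketch: the `Score` dataclass and the two sample candidates are made up for illustration, while the formula and the default $\lambda = 0.35$ follow the `SelectorNode` code in this section.

```python
from dataclasses import dataclass


@dataclass
class Score:
    index: int
    q_value: float
    risk: float
    confidence: float


def adjusted_score(sc: Score, lambda_risk: float = 0.35) -> float:
    """Q * (1 - lambda * risk) * (0.5 + 0.5 * confidence)."""
    return sc.q_value * (1.0 - lambda_risk * sc.risk) * (0.5 + 0.5 * sc.confidence)


scores = [
    Score(index=0, q_value=1.0, risk=0.8, confidence=0.9),  # highest raw Q, but risky
    Score(index=1, q_value=0.8, risk=0.1, confidence=0.9),  # lower Q, much safer
]
adjusted = [(sc.index, adjusted_score(sc)) for sc in scores]
best_idx, best_adj = max(adjusted, key=lambda t: t[1])
print([(i, round(a, 4)) for i, a in adjusted], best_idx)  # [(0, 0.684), (1, 0.7334)] 1
```

Candidate 0 scores $1.0 \times 0.72 \times 0.95 = 0.684$ and candidate 1 scores $0.8 \times 0.965 \times 0.95 = 0.7334$, so the safer move wins. Note that confidence only scales the score between 50% and 100% of its risk-adjusted value; a zero-confidence estimate is halved, not discarded.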


### 🧠 What makes this pattern powerful

* **Iterative refinement** – Unlike the manager-coordinator (hierarchical) or democratic (ensemble) designs, the actor–critic loop feeds the critic's evaluation back into a second round of proposals. No weights are updated; the "learning" happens in-context, within a single turn.
* **Value-based reasoning** – The critic explicitly estimates the *expected outcome* of each move for this turn and the near future, which discourages purely greedy choices.
* **Adaptive depth** – The `ImproveNode` triggers refinement only when the best Q-value falls below `min_good_q` (0.75) or its risk exceeds `max_ok_risk` (0.65), giving us dynamic compute allocation.
* **Interpretability** – Q-values, risk, and confidence are visible for each decision, so you can trace *why* the agent preferred one move over another.
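
The refinement gate behind "adaptive depth" is a small predicate. The sketch below isolates it as a pure function so the thresholds can be tested on their own; `needs_refinement` is a hypothetical helper name, and the defaults mirror the `min_good_q` and `max_ok_risk` values used by `ImproveNode` in this section.

```python
def needs_refinement(
    best_q: float,
    best_risk: float,
    already_refined: bool,
    min_good_q: float = 0.75,
    max_ok_risk: float = 0.65,
) -> bool:
    """Return True when one more improve-and-rescore round is worth paying for.

    best_q and best_risk belong to the candidate with the highest Q-value.
    The already_refined flag caps the loop at a single refinement per turn.
    """
    if already_refined:
        return False
    return best_q < min_good_q or best_risk > max_ok_risk


print(needs_refinement(0.90, 0.20, already_refined=False))  # False: strong and safe
print(needs_refinement(0.60, 0.20, already_refined=False))  # True: value too low
print(needs_refinement(0.90, 0.80, already_refined=False))  # True: too risky
print(needs_refinement(0.60, 0.80, already_refined=True))   # False: budget already spent
```

When the predicate is false, the graph goes straight from `ImproveNode` to `SelectorNode` without calling the improver model, which is consistent with the near-instant `ImproveNode` steps in the run logs.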


### ⚙️ Comparing paradigms

| Workflow Type           | Nature             | Example Use                          | Pros                                                 | Trade-offs                        |
| ----------------------- | ------------------ | ------------------------------------ | ---------------------------------------------------- | --------------------------------- |
| **Manager–Coordinator** | Hierarchical       | Strategic planning under constraints | Modular, dynamic routing                             | Slight overhead for routing logic |
| **Democratic**          | Ensemble           | Collective judgment / robustness     | High diversity, fault tolerance                      | Higher latency, no feedback loop  |
| **Actor–Critic**        | Iterative feedback | Adaptive value-based control         | Refines actions via critic feedback, interpretable   | Slightly more compute per turn    |


### 🚀 Why it fits Pokémon and beyond

* Battles require balancing **expected gain vs. survivability**, just like value-based RL tasks.
* The critic captures **contextual trade-offs** (damage, tempo, risk), while the improver uses that feedback to propose better alternatives within the same turn.
* This same structure can generalize to *decision-making agents* in finance, robotics, or multi-stage planning, anywhere feedback-driven refinement is useful.


In [None]:
from __future__ import annotations
from typing import List, Optional, Literal
from dataclasses import dataclass

from pydantic import BaseModel, Field
from pydantic_ai import Agent
from pydantic_graph import BaseNode, End, Graph, GraphRunContext
from rich import print as rprint

class CriticScore(BaseModel):
    index: int = Field(description="Index into Plan.candidates[]")
    q_value: float = Field(ge=0.0, description="Estimated value; higher is better")
    risk: float = Field(ge=0.0, le=1.0, description="0 safe, 1 very risky")
    confidence: float = Field(ge=0.0, le=1.0, description="Critic confidence in this score")
    notes: Optional[str] = None

class GraphState(BaseModel):
    context: AgentContext
    plan: Optional[Plan] = None
    critic_scores: Optional[List[CriticScore]] = None
    refined: bool = False
    final_decision: Optional[PlanCandidate] = None

@dataclass
class ActorNode(BaseNode):
    actor = Agent(
        model="openrouter:google/gemini-2.5-pro",
        system_prompt=(
            "ACTOR: Propose 3-4 LEGAL actions (moves or switches) for the current Pokémon context.\n"
            "Favor super-effective, high-accuracy moves; include a safe SWITCH if matchup is bad.\n"
            "Do NOT invent illegal actions. Keep rationales concise."
        ),
        output_type=Plan,
        retries=3
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "CriticNode":
        s = context.state
        plan = (await self.actor.run(agent_context_to_string(s.context))).output
        s.plan = plan
        return CriticNode()


@dataclass
class CriticNode(BaseNode):
    critic = Agent(
        model="openrouter:anthropic/claude-sonnet-4.5",
        system_prompt=(
            "CRITIC: For each candidate, estimate a Q-value in [0, +inf) capturing expected outcome "
            "(damage, survival, tempo) for THIS TURN and near future. Also output risk in [0,1] "
            "(0=safe,1=dangerous) and confidence in [0,1]. Keep notes brief. "
            "Return a list aligned with candidates using fields: index, q_value, risk, confidence, notes."
        ),
        output_type=List[CriticScore],
        retries=3
    )

    async def run(self, context: GraphRunContext[GraphState]) -> "ImproveNode":
        s = context.state
        assert s.plan is not None
        prompt = agent_context_to_string(s.context) + "\n\nCANDIDATES:\n" + s.plan.model_dump_json(indent=2)
        scores = (await self.critic.run(prompt)).output
        # Defensive: clamp and align indices
        n = len(s.plan.candidates)
        clean = []
        for sc in scores:
            i = max(0, min(n - 1, int(sc.index)))
            clean.append(CriticScore(
                index=i,
                q_value=max(0.0, float(sc.q_value)),
                risk=min(1.0, max(0.0, float(sc.risk))),
                confidence=min(1.0, max(0.0, float(sc.confidence))),
                notes=sc.notes,
            ))
        s.critic_scores = clean
        return ImproveNode()


@dataclass
class ImproveNode(BaseNode):
    """Optional one-step policy improvement: if best Q is weak or risk is high, ask the improver to refine once."""
    improver = Agent(
        model="openrouter:openai/gpt-5",
        system_prompt=(
            "IMPROVER: Given context, current candidates, and critic feedback, produce up to 2 REFINED "
            "legal alternatives that address the critic's concerns (e.g., too risky, low value). "
            "If current best is already strong, return an empty list to keep it."
        ),
        output_type=List[PlanCandidate],
        retries=2
    )

    # thresholds for triggering refinement
    min_good_q: float = 0.75
    max_ok_risk: float = 0.65

    async def run(self, context: GraphRunContext[GraphState]) -> "CriticNode | SelectorNode":
        s = context.state
        assert s.plan is not None and s.critic_scores is not None

        # Determine if refinement is needed
        best = max(s.critic_scores, key=lambda x: x.q_value)
        need_refine = (best.q_value < self.min_good_q) or (best.risk > self.max_ok_risk)

        if not need_refine or s.refined:
            return SelectorNode()

        prompt = (
            agent_context_to_string(s.context)
            + "\n\nCANDIDATES:\n" + s.plan.model_dump_json(indent=2)
            + "\n\nCRITIC:\n" + "\n".join(f"[{c.index}] Q={c.q_value:.2f} risk={c.risk:.2f} conf={c.confidence:.2f} {c.notes or ''}"
                                          for c in s.critic_scores)
        )
        new_opts = (await self.improver.run(prompt)).output

        if new_opts:
            # merge refined options (append, keep old too)
            s.plan = Plan(candidates=(s.plan.candidates + new_opts)[:6])  # cap to avoid prompt bloat
            s.refined = True
            return CriticNode()  # re-score with critic after refinement
        else:
            return SelectorNode()


@dataclass
class SelectorNode(BaseNode):
    """Pick argmax of an adjusted score: Q * (1 - λ*risk) with confidence weighting."""
    lambda_risk: float = 0.35

    async def run(self, context: GraphRunContext[GraphState]) -> End:
        s = context.state
        assert s.plan is not None and s.critic_scores is not None

        adjusted = []
        for sc in s.critic_scores:
            adj = sc.q_value * (1.0 - self.lambda_risk * sc.risk) * (0.5 + 0.5 * sc.confidence)
            adjusted.append((sc.index, adj))

        best_idx, _ = max(adjusted, key=lambda t: t[1])
        s.final_decision = s.plan.candidates[best_idx]

        return End(s.final_decision)

actor_critic_graph = Graph(
    nodes=[ActorNode, CriticNode, ImproveNode, SelectorNode],
    state_type=GraphState,
)

example_agent_context = AgentContext(
    turn=1,
    weather={},
    you_active="Pikachu",
    you_team=[TeamMon(species="Pikachu", hp=1.0, fainted=False, types=["Electric"], boosts={"atk": 1}, status=None, must_recharge=False)],
    opp_active="Bulbasaur",
    opp_known=[TeamMon(species="Bulbasaur", hp=1.0, fainted=False, types=["Grass", "Poison"], boosts={"def": -1}, status="paralyzed", must_recharge=False)],
    legal_moves=[MoveOption(move_id="Thunderbolt", base_power=90, accuracy=100, priority=0)],
    legal_switches=[SwitchOption(species="Charizard", hp=1.0, fainted=False, types=["Fire", "Flying"])],
    past_actions=[],
)

result = await actor_critic_graph.run(ActorNode(), state=GraphState(context=example_agent_context))
rprint("Actor–Critic Decision:", result.state.final_decision)

23:33:07.115 run graph actor_critic_graph
23:33:07.117   run node ActorNode
23:33:22.450   run node CriticNode
23:33:30.335   run node ImproveNode
23:33:30.337   run node SelectorNode


In [None]:
actor_critic_graph = Graph(
    nodes=[ActorNode, CriticNode, ImproveNode, SelectorNode],
    state_type=GraphState,
)

actor_critic_agent = PydanticGraphAgent(name="LLM Agent", agentic_graph=actor_critic_graph, first_node=ActorNode())

await actor_critic_agent.battle_against(RandomPlayer(), n_battles=1)

23:33:35.277 run graph None
23:33:35.280   run node ActorNode
23:33:49.671   run node CriticNode
23:34:00.227   run node ImproveNode
23:34:00.229   run node SelectorNode


23:34:00.237 run graph None
23:34:00.238   run node ActorNode
23:34:30.396   run node CriticNode
23:34:40.744   run node ImproveNode
23:34:40.744   run node SelectorNode


23:34:40.752 run graph None
23:34:40.752   run node ActorNode
23:34:57.234   run node CriticNode
23:35:03.734   run node ImproveNode
23:35:03.734   run node SelectorNode


23:35:03.741 run graph None
23:35:03.742   run node ActorNode
23:35:16.440   run node CriticNode
23:35:28.183   run node ImproveNode
23:35:28.183   run node SelectorNode


23:35:28.199 run graph None
23:35:28.199   run node ActorNode
23:35:46.445   run node CriticNode
23:35:54.060   run node ImproveNode
23:35:54.061   run node SelectorNode


23:35:54.073 run graph None
23:35:54.074   run node ActorNode
23:36:07.962   run node CriticNode
23:36:15.468   run node ImproveNode
23:36:15.468   run node SelectorNode


23:36:15.474 run graph None
23:36:15.474   run node ActorNode
23:36:36.471   run node CriticNode
23:36:45.996   run node ImproveNode
23:36:45.996   run node SelectorNode


23:36:46.004 run graph None
23:36:46.004   run node ActorNode
23:37:05.105   run node CriticNode
23:37:14.771   run node ImproveNode
23:37:14.771   run node SelectorNode


23:37:14.779 run graph None
23:37:14.780   run node ActorNode
23:37:29.325   run node CriticNode
23:37:35.950   run node ImproveNode
23:37:35.950   run node SelectorNode


23:37:35.955 run graph None
23:37:35.956   run node ActorNode
23:37:54.459   run node CriticNode
23:38:02.806   run node ImproveNode
23:38:02.806   run node SelectorNode


23:38:02.816 run graph None
23:38:02.816   run node ActorNode
23:38:19.945   run node CriticNode
23:38:31.231   run node ImproveNode
23:38:31.231   run node SelectorNode


23:38:31.235 run graph None
23:38:31.235   run node ActorNode
23:38:49.396   run node CriticNode
23:38:57.798   run node ImproveNode
23:38:57.799   run node SelectorNode


23:38:57.804 run graph None
23:38:57.805   run node ActorNode
23:39:15.425   run node CriticNode
23:39:25.445   run node ImproveNode
23:39:25.446   run node SelectorNode


23:39:25.454 run graph None
23:39:25.454   run node ActorNode
23:39:41.633   run node CriticNode
23:39:51.240   run node ImproveNode
23:39:51.240   run node SelectorNode


23:39:51.250 run graph None
23:39:51.250   run node ActorNode
23:40:13.728   run node CriticNode
23:40:24.176   run node ImproveNode
23:40:24.177   run node SelectorNode


23:40:24.184 run graph None
23:40:24.185   run node ActorNode
23:40:43.283   run node CriticNode
23:40:51.208   run node ImproveNode
23:40:51.208   run node SelectorNode


23:40:51.217 run graph None
23:40:51.218   run node ActorNode
23:41:17.314   run node CriticNode
23:41:26.949   run node ImproveNode
23:41:26.949   run node SelectorNode


23:41:26.951 run graph None
23:41:26.951   run node ActorNode
23:41:47.572   run node CriticNode
23:41:56.741   run node ImproveNode
23:41:56.741   run node SelectorNode


23:41:56.750 run graph None
23:41:56.750   run node ActorNode
23:42:10.500   run node CriticNode
23:42:18.964   run node ImproveNode
23:42:18.967   run node SelectorNode


23:42:19.033 run graph None
23:42:19.034   run node ActorNode
23:42:40.585   run node CriticNode
23:42:50.718   run node ImproveNode
23:42:50.719   run node SelectorNode


23:42:50.731 run graph None
23:42:50.732   run node ActorNode
23:43:10.551   run node CriticNode
23:43:18.699   run node ImproveNode
23:43:18.700   run node SelectorNode


23:43:18.717 run graph None
23:43:18.719   run node ActorNode
23:43:38.243   run node CriticNode
23:43:49.050   run node ImproveNode
23:43:49.051   run node SelectorNode


23:43:49.066 run graph None
23:43:49.069   run node ActorNode
23:44:12.956   run node CriticNode
23:44:23.324   run node ImproveNode
23:44:23.325   run node SelectorNode


23:44:23.338 run graph None
23:44:23.339   run node ActorNode
23:44:38.240   run node CriticNode
23:44:47.859   run node ImproveNode
23:44:47.861   run node SelectorNode


23:44:47.886 run graph None
23:44:47.888   run node ActorNode
23:45:08.940   run node CriticNode
23:45:18.377   run node ImproveNode
23:45:18.378   run node SelectorNode


23:45:18.384 run graph None
23:45:18.386   run node ActorNode
23:45:37.245   run node CriticNode
23:45:44.577   run node ImproveNode
23:45:44.577   run node SelectorNode


23:45:44.585 run graph None
23:45:44.586   run node ActorNode
23:46:01.809   run node CriticNode
23:46:09.654   run node ImproveNode
23:46:09.655   run node SelectorNode


23:46:09.671 run graph None
23:46:09.673   run node ActorNode
23:46:26.144   run node CriticNode
23:46:35.060   run node ImproveNode
23:46:35.061   run node SelectorNode


23:46:35.074 run graph None
23:46:35.076   run node ActorNode
23:46:53.692   run node CriticNode
23:47:01.924   run node ImproveNode
23:47:01.925   run node SelectorNode


23:47:01.935 run graph None
23:47:01.935   run node ActorNode
23:47:23.878   run node CriticNode
23:47:32.369   run node ImproveNode
23:47:32.370   run node SelectorNode


23:47:32.384 run graph None
23:47:32.385   run node ActorNode
23:47:51.924   run node CriticNode
23:48:01.713   run node ImproveNode
23:48:01.713   run node SelectorNode


23:48:01.729 run graph None
23:48:01.730   run node ActorNode
23:48:24.346   run node CriticNode
23:48:35.492   run node ImproveNode
23:48:35.492   run node SelectorNode


23:48:35.532 run graph None
23:48:35.534   run node ActorNode
23:49:01.947   run node CriticNode
23:49:10.117   run node ImproveNode
23:49:10.117   run node SelectorNode


23:49:10.154 run graph None
23:49:10.155   run node ActorNode
23:49:35.530   run node CriticNode
23:49:44.780   run node ImproveNode
23:49:44.780   run node SelectorNode


23:49:44.800 run graph None
23:49:44.802   run node ActorNode
23:50:09.285   run node CriticNode
23:50:17.974   run node ImproveNode
23:50:17.974   run node SelectorNode


23:50:17.982 run graph None
23:50:17.982   run node ActorNode
23:50:35.423   run node CriticNode
23:50:42.465   run node ImproveNode
23:50:42.466   run node SelectorNode


23:50:42.473 run graph None
23:50:42.474   run node ActorNode
23:51:06.331   run node CriticNode
23:51:17.538   run node ImproveNode
23:51:17.538   run node SelectorNode


23:51:17.565 run graph None
23:51:17.565   run node ActorNode
23:51:40.014   run node CriticNode
23:51:51.301   run node ImproveNode
23:51:51.302   run node SelectorNode


23:51:51.311 run graph None
23:51:51.311   run node ActorNode
23:52:08.270   run node CriticNode
23:52:19.718   run node ImproveNode
23:52:19.718   run node SelectorNode


23:52:19.761 run graph None
23:52:19.761   run node ActorNode
23:52:38.907   run node CriticNode
23:52:48.803   run node ImproveNode
23:52:48.804   run node SelectorNode


23:52:48.810 run graph None
23:52:48.810   run node ActorNode
23:53:06.655   run node CriticNode
23:53:14.731   run node ImproveNode
23:53:14.731   run node SelectorNode


23:53:14.741 run graph None
23:53:14.741   run node ActorNode
23:53:40.216   run node CriticNode
23:53:52.998   run node ImproveNode
23:53:53.000   run node SelectorNode


23:53:53.018 run graph None
23:53:53.020   run node ActorNode
23:54:13.774   run node CriticNode
23:54:22.119   run node ImproveNode
23:54:22.120   run node SelectorNode


23:54:22.126 run graph None
23:54:22.126   run node ActorNode
23:54:41.783   run node CriticNode
23:54:50.920   run node ImproveNode
23:54:50.920   run node SelectorNode


23:54:50.926 run graph None
23:54:50.927   run node ActorNode
23:55:02.354   run node CriticNode
23:55:11.527   run node ImproveNode
23:55:11.527   run node SelectorNode


23:55:11.548 run graph None
23:55:11.551   run node ActorNode
23:55:28.942   run node CriticNode
23:55:37.917   run node ImproveNode
23:55:37.917   run node SelectorNode


23:55:37.941 run graph None
23:55:37.944   run node ActorNode
23:56:00.686   run node CriticNode
23:56:11.033   run node ImproveNode
23:56:11.033   run node SelectorNode


23:56:11.055 run graph None
23:56:11.057   run node ActorNode
23:56:25.091   run node CriticNode
23:56:33.863   run node ImproveNode
23:56:33.863   run node SelectorNode


23:56:33.876 run graph None
23:56:33.877   run node ActorNode
23:56:52.833   run node CriticNode
23:57:03.311   run node ImproveNode
23:57:03.312   run node SelectorNode


23:57:03.318 run graph None
23:57:03.318   run node ActorNode
23:57:23.176   run node CriticNode
23:57:33.817   run node ImproveNode
23:57:33.818   run node SelectorNode


23:57:33.833 run graph None
23:57:33.834   run node ActorNode
23:57:51.153   run node CriticNode


2025-11-10 23:57:55,507 - PydanticGraphAge 1 - ERROR - Unhandled exception raised while handling message:
>battle-gen9randombattle-279
|request|{"active":[{"moves":[{"move":"Drain Punch","id":"drainpunch","pp":15,"maxpp":16,"target":"normal","disabled":false},{"move":"Supercell Slam","id":"supercellslam","pp":24,"maxpp":24,"target":"normal","disabled":false},{"move":"Coil","id":"coil","pp":32,"maxpp":32,"target":"self","disabled":false},{"move":"Fire Punch","id":"firepunch","pp":23,"maxpp":24,"target":"normal","disabled":false}],"canTerastallize":"Fighting"}],"side":{"name":"PydanticGraphAge 1","id":"p1","pokemon":[{"ident":"p1: Eelektross","details":"Eelektross, L87, F","condition":"127/290","active":true,"stats":{"atk":250,"def":189,"spa":232,"spd":189,"spe":137},"moves":["drainpunch","supercellslam","coil","firepunch"],"baseAbility":"levitate","item":"leftovers","pokeball":"pokeball","ability":"levitate","commanding":false,"reviving":false,"teraType":"Fighting","terastallized":""},{

23:57:55.504   run node ImproveNode


Let's see this battle in action:

<iframe src="https://shreshthtuli.github.io/build-your-own-super-agents/assets/actor_critic.html" width="100%" height="480" style="border:none; overflow:hidden;"></iframe>

*You can also view the full battle [here](https://shreshthtuli.github.io/build-your-own-super-agents/assets/actor_critic.html).*

### 🏁 Conclusion

In this tutorial, we explored how **multi-agent workflows** enable structured, interpretable reasoning in complex environments, using Pokémon battles as our sandbox.

We built and compared three distinct coordination paradigms:

* 🧠 **Manager–Coordinator:** hierarchical and resource-aware, perfect for structured, goal-driven orchestration.
* 🗳️ **Democratic Swarms:** ensemble-style decision-making that is robust and diverse, harnessing the collective judgment of many agents.
* 🎭 **Actor–Critic:** feedback-driven refinement that is adaptive and value-based, blending planning with learned evaluation.

Each architecture has strengths depending on your **problem structure**, **latency budget**, and **decision uncertainty**. Together, they form the foundation of *Agentic AI systems*: not just single agents, but entire **cognitive ecosystems** working in concert.
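As a compact recap, the three coordination styles can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the Pydantic Graphs implementation built earlier in this tutorial: the function names (`democratic_vote`, `manager_route`, `actor_critic`), the stub agents, the keyword-based routing rule, and the integer `budget` are all assumptions made for the example, and the actor-critic function leaves out the improvement step.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Sequence

# Hypothetical alias: an agent maps a battle state (here just a string) to a move name.
Agent = Callable[[str], str]


def democratic_vote(agents: Sequence[Agent], state: str) -> str:
    """Democratic swarm: every agent proposes a move and the most common one wins."""
    votes = Counter(agent(state) for agent in agents)
    # most_common(1) returns [(move, count)]; ties go to the move seen first.
    return votes.most_common(1)[0][0]


def manager_route(
    state: str,
    specialists: dict[str, Agent],
    fallback: Agent,
    budget: int,
) -> str:
    """Manager: consult a matching specialist while budget remains, else use a cheap fallback."""
    for topic, specialist in specialists.items():
        if budget > 0 and topic in state:
            return specialist(state)
    return fallback(state)


def actor_critic(
    actor: Callable[[str], list[str]],
    critic: Callable[[str, str], float],
    state: str,
) -> str:
    """Actor-critic: the actor proposes candidates, the critic scores them, the best is selected."""
    candidates = actor(state)
    return max(candidates, key=lambda move: critic(state, move))


# Stub agents standing in for LLM calls (purely illustrative).
state = "opponent is a water type"
voters = [lambda s: "thunderbolt", lambda s: "thunderbolt", lambda s: "tackle"]
print(democratic_vote(voters, state))
print(manager_route(state, {"water": lambda s: "thunderbolt"}, lambda s: "tackle", budget=1))
print(actor_critic(lambda s: ["tackle", "thunderbolt"], lambda s, m: float(len(m)), state))
```

The differences show up in the call pattern: the vote calls every agent once, the manager calls at most one specialist, and the actor-critic pair makes one proposal call plus one scoring call per candidate. That call pattern is what drives the latency trade-off mentioned above.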

### 🌐 Beyond these: other agentic swarm paradigms

Modern agent research spans several other fascinating coordination styles that extend or hybridize these ideas:

* 🕸️ **Blackboard Systems / Shared Memory Agents**: agents read and write to a common “knowledge blackboard” (e.g., **HUB**, **LangGraph Shared Memory**, **ReAct with tool state**).
* 🪶 **Evolutionary or Genetic Agent Swarms**: agents mutate and compete (e.g., **EvoAgents**, [Yuan et al. (2024)](https://evo-agent.github.io/)) to explore strategy space efficiently.
* 🔁 **Agentic Swarms**: multiple agents monitor and fine-tune their own reasoning, using reflection or secondary evaluators (e.g., **SwarmAgentic**, [Zhang et al. (2025)](https://arxiv.org/pdf/2506.15672)).

Together, these paradigms form a growing ecosystem of **Agentic Swarms**: distributed reasoning systems where intelligence emerges from structured interaction rather than single-model dominance.
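To make the blackboard idea from the list above concrete, here is a minimal sketch in plain Python. It is not taken from any of the systems named there: the `Blackboard` class, the `scout` and `planner` agents, and the keys they write are hypothetical. A real system would also need concurrency control and persistence.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Blackboard:
    """Shared memory that every agent can read from and write to."""

    facts: dict[str, Any] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)

    def write(self, author: str, key: str, value: Any) -> None:
        self.facts[key] = value
        self.log.append(f"{author} wrote {key}")


def scout(board: Blackboard) -> None:
    # Posts an observation once (hard-coded here; a real agent would inspect the game).
    if "opponent_type" not in board.facts:
        board.write("scout", "opponent_type", "water")


def planner(board: Blackboard) -> None:
    # Acts only after the fact it depends on has been posted by another agent.
    if "opponent_type" in board.facts and "move" not in board.facts:
        move = "thunderbolt" if board.facts["opponent_type"] == "water" else "tackle"
        board.write("planner", "move", move)


def run(
    board: Blackboard,
    agents: list[Callable[[Blackboard], None]],
    max_rounds: int = 3,
) -> Blackboard:
    """Let agents take turns until a full round passes with no new writes."""
    for _ in range(max_rounds):
        writes_before = len(board.log)
        for agent in agents:
            agent(board)
        if len(board.log) == writes_before:
            break
    return board


board = run(Blackboard(), [planner, scout])
print(board.facts)
print(board.log)
```

The agents never call each other. They coordinate only through what is on the board, so the planner still produces a move even though it is listed before the scout.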

### 🔮 What’s next

So far, we’ve focused on *how agents reason and collaborate*.
In the **next tutorial**, we’ll explore *how to select the optimal model for each agent automatically*,
using **meta-evaluation**, **model profiling**, and **cost–quality trade-offs** to dynamically assign the best LLM to each sub-task.

We’ll essentially teach our system to **self-optimize its model choices**, the first step toward *autonomous orchestration* in large multi-agent setups.
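As a small preview of what a cost–quality trade-off could look like, the sketch below picks the cheapest model whose profiled quality clears a per-task threshold. Everything in it is an assumption for illustration: the `ModelProfile` fields, the placeholder model names, and the quality and cost numbers are invented, and the next tutorial may take a different approach.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    name: str
    quality: float  # hypothetical profiled score in [0, 1]
    cost: float  # hypothetical cost per call, arbitrary units


def pick_model(profiles: list[ModelProfile], min_quality: float) -> ModelProfile:
    """Return the cheapest model meeting the quality bar, else the highest-quality one."""
    eligible = [p for p in profiles if p.quality >= min_quality]
    if not eligible:
        # Nothing clears the bar: fall back to the best available model.
        return max(profiles, key=lambda p: p.quality)
    return min(eligible, key=lambda p: p.cost)


# Made-up profiles; real numbers would come from profiling runs.
PROFILES = [
    ModelProfile("small", quality=0.60, cost=1.0),
    ModelProfile("medium", quality=0.80, cost=4.0),
    ModelProfile("large", quality=0.90, cost=10.0),
]

# Hypothetical per-node quality requirements.
for node, bar in {"ActorNode": 0.85, "CriticNode": 0.75, "SelectorNode": 0.50}.items():
    print(node, "->", pick_model(PROFILES, bar).name)
```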
