# Tutorial: Planner Playground for AgentFlow

Audience:
- PhD students and AI researchers working on LLM planning.

Prerequisites:
- Python, basic RLHF/PPO concepts, and familiarity with agent tool-use loops.

Learning goals:
- Build a planner validation harness.
- Stress-test planner outputs before plugging them into the full orchestration loop.
- Understand where planner errors propagate to worker/verifier failures.


## Outline

1. Planner contract and validator (schema, tool-selection, and context-completeness checks)
2. Batch stress test on candidate planner outputs
3. Optional hook to the real AgentFlow source
4. Exercises


In [None]:
from __future__ import annotations

from dataclasses import dataclass
from typing import List, Dict, Any
import re

AVAILABLE_TOOLS = [
    "Google_Search_Tool",
    "Wikipedia_Search_Tool",
    "Python_Coder_Tool",
    "Base_Generator_Tool",
]

AVAILABLE_TOOLS


## Step 1 - Define a planner output contract

At a minimum, each planner output must provide:
- `context`
- `sub_goal`
- `tool_name`

We add a lightweight validator to catch common failures early: empty fields, unknown tool names, and math-like sub-goals routed to a non-Python tool.


In [None]:
@dataclass
class PlannerStep:
    context: str
    sub_goal: str
    tool_name: str

def validate_step(step: PlannerStep, available_tools: List[str]) -> Dict[str, Any]:
    errors: List[str] = []

    if not step.context.strip():
        errors.append("empty context")
    if not step.sub_goal.strip():
        errors.append("empty sub_goal")
    if step.tool_name not in available_tools:
        errors.append(f"invalid tool: {step.tool_name}")

    # Heuristic: if sub_goal asks for calculation, prefer Python tool.
    # Stems are anchored at word starts so that "resolve" does not match "solve".
    if re.search(r"\b(?:calculat|comput|equation|solv)", step.sub_goal.lower()):
        if step.tool_name != "Python_Coder_Tool":
            errors.append("sub_goal suggests math/code but non-python tool selected")

    return {"ok": len(errors) == 0, "errors": errors}
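
The validator above assumes a well-formed `PlannerStep` already exists, whereas raw model output usually arrives as text. The sketch below shows one way to check the contract before constructing a step. It assumes the planner emits one JSON object per step with the three keys listed above; the real AgentFlow planner may use a different format, and `parse_planner_output` and `REQUIRED_FIELDS` are hypothetical names introduced here for illustration.

```python
import json
from typing import Any, Dict, List, Tuple

# Assumption: the planner emits one JSON object per step with these three keys.
REQUIRED_FIELDS = ("context", "sub_goal", "tool_name")


def parse_planner_output(raw: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (fields, errors); fields is empty when the JSON is unusable."""
    try:
        payload: Any = json.loads(raw)
    except json.JSONDecodeError as exc:
        return {}, [f"invalid JSON: {exc.msg}"]

    if not isinstance(payload, dict):
        return {}, ["top-level JSON value is not an object"]

    errors: List[str] = []
    fields: Dict[str, str] = {}
    for name in REQUIRED_FIELDS:
        value = payload.get(name)
        if value is None:
            errors.append(f"missing field: {name}")
        elif not isinstance(value, str):
            errors.append(f"field is not a string: {name}")
        else:
            fields[name] = value
    return fields, errors


# A step that omits tool_name is reported as a schema error.
fields, errors = parse_planner_output(
    '{"context": "Equation: x^2 - 5x + 6 = 0", "sub_goal": "Compute the roots"}'
)
print(fields)
print(errors)
```

When `errors` is empty, `PlannerStep(**fields)` can be passed to `validate_step`. Keeping schema errors separate from semantic errors makes it easier to tell formatting failures apart from poor tool choices.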


## Step 2 - Stress test candidate planner outputs

In practice, you should run this on hundreds of planner outputs sampled from your model variants.


In [None]:
candidates = [
    PlannerStep(
        context="Need latest GDP growth for country X and trusted source links",
        sub_goal="Search authoritative sources for GDP growth and capture citation",
        tool_name="Google_Search_Tool",
    ),
    PlannerStep(
        context="Equation: x^2 - 5x + 6 = 0",
        sub_goal="Compute the roots of the equation",
        tool_name="Wikipedia_Search_Tool",
    ),
    PlannerStep(
        context="",
        sub_goal="Summarize findings",
        tool_name="Base_Generator_Tool",
    ),
]

report = []
for i, step in enumerate(candidates, start=1):
    result = validate_step(step, AVAILABLE_TOOLS)
    report.append({"id": i, "step": step, **result})

for row in report:
    print(f"Case {row['id']} -> ok={row['ok']}")
    if not row['ok']:
        print("  errors:", row['errors'])

report
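
Once the batch grows to hundreds of steps, reading the report row by row stops being practical. The following sketch aggregates rows shaped like `report` into an invalid-step rate and a per-category error count. The helper names `error_category` and `summarize_report` are hypothetical, and the example rows are illustrative rather than measured results.

```python
from collections import Counter
from typing import Any, Dict, List


def error_category(message: str) -> str:
    """Collapse messages such as 'invalid tool: Foo' into 'invalid tool'."""
    return message.split(":", 1)[0].strip()


def summarize_report(rows: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute the invalid-step rate and a per-category error count."""
    total = len(rows)
    invalid = sum(1 for row in rows if not row["ok"])
    taxonomy = Counter(
        error_category(msg) for row in rows for msg in row["errors"]
    )
    return {
        "total": total,
        "invalid": invalid,
        "invalid_rate": invalid / total if total else 0.0,
        "taxonomy": dict(taxonomy),
    }


# Illustrative rows in the same shape as `report` (the `step` key is omitted).
example_rows = [
    {"id": 1, "ok": True, "errors": []},
    {"id": 2, "ok": False, "errors": ["sub_goal suggests math/code but non-python tool selected"]},
    {"id": 3, "ok": False, "errors": ["empty context"]},
    {"id": 4, "ok": False, "errors": ["invalid tool: Calculator_Tool", "empty context"]},
]
summary = summarize_report(example_rows)
print(summary)
```

In the notebook, `summarize_report(report)` works on the real rows as well, since the extra `step` key is ignored.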


## Step 3 - Optional: connect to real AgentFlow planner

This cell is optional and environment-dependent.
Set the `AGENTFLOW_LOCAL_ROOT` environment variable if your AgentFlow checkout lives somewhere other than the default path used in the cell.


In [None]:
import os
import sys
from pathlib import Path

AGENTFLOW_LOCAL_ROOT = Path(os.environ.get("AGENTFLOW_LOCAL_ROOT", "/Users/admin/TuanDung/repos/AgentFlow"))
if AGENTFLOW_LOCAL_ROOT.exists():
    sys.path.insert(0, str(AGENTFLOW_LOCAL_ROOT / "agentflow"))
    print("AgentFlow path added:", AGENTFLOW_LOCAL_ROOT)
else:
    print("AgentFlow local path not found. Skip real-planner integration.")


## Exercises

1. Add 20 planner outputs from your current checkpoint and compute the invalid-action rate.
2. Add domain-specific validation rules (finance, coding, web-retrieval).
3. Compare two planner prompts and report the differences in their failure taxonomies.
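
One possible starting point for exercise 2 is a small rule registry keyed by domain. The sketch below is only an illustration: the rule functions, their keyword lists, and the `DOMAIN_RULES` mapping are all hypothetical, and only two of the three domains are covered. The tool names are the ones from `AVAILABLE_TOOLS`.

```python
import re
from typing import Callable, Dict, List, Optional

# A rule takes (sub_goal, tool_name) and returns an error message or None.
Rule = Callable[[str, str], Optional[str]]


# Hypothetical domain rules, for illustration only; tune them to your own tasks.
def finance_needs_search(sub_goal: str, tool_name: str) -> Optional[str]:
    if re.search(r"\b(?:stock|price|gdp|inflation)\b", sub_goal.lower()):
        if tool_name not in ("Google_Search_Tool", "Wikipedia_Search_Tool"):
            return "finance: live-data sub_goal without a search tool"
    return None


def coding_needs_python(sub_goal: str, tool_name: str) -> Optional[str]:
    if re.search(r"\b(?:script|function|unit test|debug)\b", sub_goal.lower()):
        if tool_name != "Python_Coder_Tool":
            return "coding: code sub_goal without Python_Coder_Tool"
    return None


DOMAIN_RULES: Dict[str, List[Rule]] = {
    "finance": [finance_needs_search],
    "coding": [coding_needs_python],
}


def run_domain_rules(domain: str, sub_goal: str, tool_name: str) -> List[str]:
    """Apply every rule registered for `domain` and collect the failures."""
    messages = (rule(sub_goal, tool_name) for rule in DOMAIN_RULES.get(domain, []))
    return [m for m in messages if m is not None]


print(run_domain_rules("finance", "Find the latest GDP growth figure", "Base_Generator_Tool"))
print(run_domain_rules("coding", "Debug the parsing function", "Python_Coder_Tool"))
```

The returned messages can be appended to the `errors` list produced by `validate_step`, so domain rules and generic checks end up in the same report.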

Common pitfall:
- Treating final accuracy as the only planner metric.

Extension:
- Convert the validation harness into CI checks that run on every planner prompt revision.
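
A CI check of this kind could look roughly like the sketch below, written as a plain `test_` function that a runner such as pytest would discover. The `MAX_INVALID_RATE` budget and the placeholder results are hypothetical; a real check would build `results` by running `validate_step` over a fixed set of planner outputs from the prompt revision under test.

```python
from typing import Any, Dict, List

# Hypothetical budget; choose a threshold that matches your own baseline.
MAX_INVALID_RATE = 0.10


def assert_invalid_rate(results: List[Dict[str, Any]], max_rate: float) -> None:
    """Raise AssertionError when too many planner steps fail validation."""
    if not results:
        raise AssertionError("no planner outputs were validated")
    invalid = sum(1 for r in results if not r["ok"])
    rate = invalid / len(results)
    assert rate <= max_rate, (
        f"invalid-step rate {rate:.2%} exceeds budget {max_rate:.2%}"
    )


def test_planner_prompt_regression() -> None:
    # Placeholder data: 19 valid steps and 1 invalid step (5% invalid).
    results = [{"ok": True, "errors": []} for _ in range(19)]
    results.append({"ok": False, "errors": ["empty context"]})
    assert_invalid_rate(results, MAX_INVALID_RATE)
```

Failing on an empty result set is a deliberate choice here, so that a broken sampling step cannot make the check pass silently.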
