A focused reliability workbench that demonstrates why a plausible assistant response is not proof that the right tool call or a real side effect occurred.
Built for a blog post about AI evals, tool contracts, orchestration, and drift.
- Orchestrator routing — a central router delegates to specialized agents
- Tool contracts — each agent has defined tools with explicit schemas
- Approval gates — destructive actions (refund_order) require human approval
- False success detection — if the response claims "refunded" but no refund_order was executed, a mismatch alert fires
- Tool description drift — editable prompts/tool descriptions show how small changes break routing
- Deterministic evals — 20 seed cases scored on route, tools, approval, side effects, and mismatch detection
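
The false success check in the list above can be sketched as a comparison between what the reply says and what the trace shows. This is a minimal illustration, not the project's implementation: the `ToolCall` and `MismatchAlert` shapes, the `lookup_order` tool name, and the regex heuristic for spotting a refund claim are all assumptions.

```typescript
// Hypothetical shapes; the project's real types live in src/lib/types.ts and may differ.
interface ToolCall {
  name: string;
  status: "executed" | "blocked" | "failed";
}

interface MismatchAlert {
  kind: "false_success";
  message: string;
}

// Assumed heuristic: a reply "claims" a refund if it uses a past-tense refund phrase.
const REFUND_CLAIM = /\b(refunded|refund (has been|was) (issued|processed))\b/i;

function detectFalseSuccess(
  responseText: string,
  toolCalls: ToolCall[]
): MismatchAlert | null {
  const claimsRefund = REFUND_CLAIM.test(responseText);
  // Only an executed refund_order call counts as a real side effect.
  const refundExecuted = toolCalls.some(
    (call) => call.name === "refund_order" && call.status === "executed"
  );
  if (claimsRefund && !refundExecuted) {
    return {
      kind: "false_success",
      message: "Response claims a refund, but refund_order was never executed.",
    };
  }
  return null;
}

// The reply sounds successful, but the trace only contains a lookup (hypothetical tool name).
const mismatch = detectFalseSuccess("Your order 4711 has been refunded.", [
  { name: "lookup_order", status: "executed" },
]);
console.log(mismatch?.kind); // "false_success"
```

The point of the design is that the check reads the trace, not the prose: a fluent reply cannot satisfy it on its own.
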
API keys are stored in browser localStorage and sent per-request to local Next.js API routes. Keys are never logged or persisted server-side.
Do not deploy this as-is. It is a local demo tool.
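
As a rough illustration of the per-request key handling described above, the sketch below reads the key from a storage object and attaches it to a request for the chat route. The storage key `demo-api-key`, the `x-api-key` header, and the JSON body shape are hypothetical; the app's actual names may differ.

```typescript
// Minimal stand-in for the browser's Storage interface so the sketch also runs outside a browser.
interface KeyStore {
  getItem(key: string): string | null;
}

interface ChatRequest {
  url: string;
  method: "POST";
  headers: { [header: string]: string };
  body: string;
}

// Hypothetical storage key and header name, chosen for illustration only.
const STORAGE_KEY = "demo-api-key";
const API_KEY_HEADER = "x-api-key";

function buildChatRequest(store: KeyStore, message: string): ChatRequest {
  const apiKey = store.getItem(STORAGE_KEY);
  if (!apiKey) {
    throw new Error("No API key configured; add one in the Settings tab.");
  }
  return {
    url: "/api/chat",
    method: "POST",
    // The key travels with each request; nothing here writes it anywhere on the server.
    headers: { "Content-Type": "application/json", [API_KEY_HEADER]: apiKey },
    body: JSON.stringify({ message }),
  };
}
```

In the browser, `window.localStorage` satisfies `KeyStore`, so the result can be passed to `fetch(request.url, request)`.
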

    pnpm install
    pnpm dev

Open http://localhost:3000 and configure your API key in the Settings tab.
Supported providers: OpenAI, Anthropic.

    src/
    ├── app/
    │   ├── api/chat/route.ts        # Chat endpoint (workflow engine)
    │   ├── api/evals/route.ts       # Eval runner endpoint
    │   ├── layout.tsx
    │   └── page.tsx                 # Main app shell with tabs
    ├── components/
    │   ├── header.tsx               # App header with badges
    │   ├── playground-tab.tsx       # Chat + live trace summary
    │   ├── trace-tab.tsx            # Full trace timeline viewer
    │   ├── evals-tab.tsx            # Eval runner + results table
    │   ├── contracts-tab.tsx        # Editable prompts & tool catalog
    │   ├── backend-state-tab.tsx    # Orders, customers, refund events
    │   ├── settings-tab.tsx         # Provider & API key config
    │   └── ui/                      # shadcn/ui components
    └── lib/
        ├── types.ts                 # All TypeScript types
        ├── seed-data.ts             # 10 demo orders, 3 customers
        ├── default-prompts.ts       # Agent instructions & tool catalog
        ├── eval-cases.ts            # 20 eval seed cases
        ├── tools.ts                 # Pure tool functions over DemoState
        ├── workflow-engine.ts       # Orchestrator → Agent → Tools → Trace
        ├── eval-runner.ts           # Deterministic eval scoring
        ├── store.ts                 # React context + localStorage persistence
        ├── hooks.ts                 # useChat, useApproval, useEvalRunner
        └── __tests__/               # Vitest tests

20 cases in src/lib/eval-cases.ts across 5 categories:
| Category | Count | Examples |
|---|---|---|
| Positive Refund | 4 | Refund USB-C cable (4711), late delivery webcam (4714) |
| Negative Refund | 4 | Expired deadline (4712), already refunded (4719), product info only |
| Lookup | 4 | Order status, delivery tracking |
| Ambiguity | 4 | Missing order number, unclear identity, vague multi-intent |
| Policy/Boundary | 4 | Return policy FAQ, password reset, contact info |
Scoring is fully deterministic: route match, required/forbidden tools, approval correctness, side effect validation, mismatch detection.
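
To make the five scoring dimensions concrete, here is a minimal sketch of a deterministic scorer. The `EvalCase`, `RunResult`, and `Score` shapes, the `"refund"` route name, and the `lookup_order` tool are assumptions for illustration, as is the rule that a case passes only when every dimension matches.

```typescript
// Hypothetical shapes; the real definitions in src/lib/types.ts may differ.
interface EvalCase {
  expectedRoute: string;
  requiredTools: string[];
  forbiddenTools: string[];
  expectApproval: boolean;
  expectRefundEvent: boolean;
  expectMismatch: boolean;
}

interface RunResult {
  route: string;
  toolsCalled: string[];
  approvalRequested: boolean;
  refundEventRecorded: boolean;
  mismatchDetected: boolean;
}

interface Score {
  route: boolean;
  tools: boolean;
  approval: boolean;
  sideEffects: boolean;
  mismatch: boolean;
  passed: boolean;
}

function scoreCase(expected: EvalCase, actual: RunResult): Score {
  const wasCalled = (tool: string): boolean => actual.toolsCalled.indexOf(tool) !== -1;

  const route = actual.route === expected.expectedRoute;
  // Every required tool must appear, and no forbidden tool may appear.
  const tools =
    expected.requiredTools.every(wasCalled) && !expected.forbiddenTools.some(wasCalled);
  const approval = actual.approvalRequested === expected.expectApproval;
  const sideEffects = actual.refundEventRecorded === expected.expectRefundEvent;
  const mismatch = actual.mismatchDetected === expected.expectMismatch;

  // Assumed pass rule: all five dimensions must match.
  const passed = route && tools && approval && sideEffects && mismatch;
  return { route, tools, approval, sideEffects, mismatch, passed };
}

const positiveRefund: EvalCase = {
  expectedRoute: "refund",
  requiredTools: ["refund_order"],
  forbiddenTools: [],
  expectApproval: true,
  expectRefundEvent: true,
  expectMismatch: false,
};

// A run that routed correctly and asked for approval, but never executed the refund.
const score = scoreCase(positiveRefund, {
  route: "refund",
  toolsCalled: ["lookup_order"],
  approvalRequested: true,
  refundEventRecorded: false,
  mismatchDetected: false,
});
console.log(score.passed, score.tools, score.sideEffects); // false false false
```

Because every check is a plain comparison against recorded facts, the same run always produces the same score, with no model in the scoring loop.
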
Use the Contracts tab to edit:
- Orchestrator instructions (routing logic)
- Agent instructions (refund, lookup, account/FAQ)
- Tool descriptions (simulate drift by changing tool semantics)
Changes apply immediately to the next run.
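
One way to picture how an edit takes effect on the next run is to resolve the tool catalog from defaults plus edited descriptions at the start of each run. The sketch below assumes a `ToolSpec` shape, a two-tool default catalog, and an override map keyed by tool name; none of these are taken from the project's code.

```typescript
interface ToolSpec {
  name: string;
  description: string;
}

// Hypothetical default catalog; the real one lives in src/lib/default-prompts.ts.
const DEFAULT_TOOLS: ToolSpec[] = [
  { name: "refund_order", description: "Refund an order that is eligible under the return policy." },
  { name: "lookup_order", description: "Look up the status of an order by its order number." },
];

type DescriptionOverrides = { [toolName: string]: string };

// Build the catalog for the next run: edited descriptions win, defaults stay untouched.
function resolveCatalog(defaults: ToolSpec[], overrides: DescriptionOverrides): ToolSpec[] {
  return defaults.map((tool) =>
    Object.prototype.hasOwnProperty.call(overrides, tool.name)
      ? { name: tool.name, description: overrides[tool.name] }
      : tool
  );
}

// Simulated drift: the description now reads like an eligibility check, not an action.
const drifted = resolveCatalog(DEFAULT_TOOLS, {
  refund_order: "Check whether an order could be refunded.",
});
console.log(drifted[0].description); // "Check whether an order could be refunded."
console.log(DEFAULT_TOOLS[0].description === drifted[0].description); // false
```

Rerunning the eval suite against a drifted catalog like this one is what surfaces routing or tool-selection regressions caused by a wording change.
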

    pnpm test

Tests cover tool functions (refund validation, FAQ search, password reset) and eval scoring logic.

Not included in v1:
- LLM-as-judge scoring (deterministic only in v1)
- Streaming responses (uses generateText, not streamText)
- Real database / persistence beyond localStorage
- Production auth, rate limiting, error recovery
- Mobile-responsive layout
- Multi-turn conversation memory across page reloads
- Parallel eval execution (sequential to avoid rate limits)