A/B test whether pre-gathered organizational context helps AI coding agents complete tasks faster, cheaper, and better.
The simulator runs the same task twice against a real repository — once as a baseline (the agent works from the task description alone) and once context-enhanced (the agent first gathers relevant context from the codebase and connected sources, then plans and implements from that briefing). It then scores both results against your acceptance criteria and reports the quality, speed, and cost difference.
It works with multiple coding-agent CLIs — Claude Code, OpenAI Codex, Cursor, and Grok — so you can compare how each one benefits from context.
Powered by Unblocked.
Each run executes two arms in parallel, each in its own isolated git worktree so they never interfere:
Baseline arm:
Plan ──▶ Review ──▶ Implement ──▶ Evaluate
The agent plans, self-reviews the plan, implements it, and the result is scored.

Context-enhanced arm:
Gather Context ──▶ Extract Patterns ──▶ Plan ──▶ Gather Plan Context ──▶ Review ──▶ Implement ──▶ Evaluate + Attribute
- Gather Context — a read-only researcher agent searches the codebase and any connected MCP sources (issues, PRs, chat, docs) to produce a focused context bundle.
- Extract Patterns — distills the bundle into the codebase's coding-style conventions.
- Plan — plans the task using the bundle as a head start.
- Gather Plan Context — a targeted second pass for risks and style notes specific to the plan.
- Review — cross-checks the plan against the gathered context.
- Implement — writes the code.
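
The baseline and context-enhanced pipelines described above can be written down as ordered phase lists, which makes it easy to see exactly what the context arm adds. This is only an illustrative sketch: the `Phase` type, the array names, and `contextOnlyPhases` are hypothetical and are not taken from the simulator's source.

```typescript
// Hypothetical phase identifiers mirroring the steps named in this README.
type Phase =
  | "gather-context"
  | "extract-patterns"
  | "plan"
  | "gather-plan-context"
  | "review"
  | "implement"
  | "evaluate";

const baselinePhases: Phase[] = ["plan", "review", "implement", "evaluate"];

const contextPhases: Phase[] = [
  "gather-context",
  "extract-patterns",
  "plan",
  "gather-plan-context",
  "review",
  "implement",
  "evaluate",
];

// Phases that exist only in the context-enhanced arm, in pipeline order.
function contextOnlyPhases(baseline: Phase[], context: Phase[]): Phase[] {
  return context.filter((phase) => !baseline.includes(phase));
}
```

Calling `contextOnlyPhases(baselinePhases, contextPhases)` yields the three gathering steps, which are the only structural difference between the arms.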
Both arms are scored 0–100 by an independent evaluator agent against your acceptance criteria, judging primarily on the actual code diff. The context arm additionally produces an attribution analysis showing which gathered context actually influenced the implementation.
Results are written as a terminal table, a JSON file, and a standalone HTML report.
- Node.js ≥ 18
- git (the target must be a git repository)
- At least one supported agent CLI installed and on your PATH:

| Agent | CLI binary | --agent value |
|---|---|---|
| Claude Code | claude | claude (default) |
| OpenAI Codex | codex | codex |
| Cursor | agent | cursor |
| Grok | grok | grok |

Each agent must be authenticated per its own CLI. The simulator shells out to whichever one you select.
The whole point of the experiment is the context the agent can reach, so the agent under test must have its MCP servers configured for the target repository before you run anything. Worktrees inherit that configuration.
Set this up once by launching the agent interactively inside the target repo and confirming its MCP servers connect:
cd /path/to/target-repo
claude # or: codex, cursor, grok (whichever agent you'll test)

Verify every expected MCP server shows as connected (e.g. /mcp in Claude Code, the startup MCP status in the others). Authenticate any that prompt. Once they all connect cleanly here, the simulator's runs will have the same access. Skipping this means the context arm silently gathers less than it should.
git clone https://github.com/unblocked/context-engine-simulator.git
cd context-engine-simulator
npm install

No build step: it runs directly via tsx.
Create a YAML fixture (see fixture.sample.yaml):
repo: /path/to/target-repo
branch: main
task: |
  Add a /health endpoint to the API server that returns the service
  name and current timestamp as JSON.
criteria: |
  1. GET /health returns 200 with a JSON body
  2. Response includes "service" and "timestamp" fields
  3. Endpoint is registered following existing router patterns
agent: claude # claude | codex | cursor | grok
model: sonnet

Run it:
npm start -- --fixture my-experiment.yaml --verbose

Everything in a fixture can also be passed as a flag (flags override fixture values):
npm start -- \
--repo /path/to/repo \
--task "Add a /health endpoint returning service name and timestamp" \
--criteria "GET /health returns 200 JSON with service and timestamp fields" \
--agent claude \
--model opus \
--verbose

| Flag | Description | Default |
|---|---|---|
| --fixture <path> | YAML fixture file | (none) |
| --repo <path> | Target git repository | (required) |
| --task <string> / --task-file <path> | Task description | (required) |
| --criteria <string> / --criteria-file <path> | Acceptance criteria for scoring | (none) |
| --agent <name> | claude, codex, cursor, or grok | claude |
| --model <model> | Model for task runs | sonnet |
| --context-model <model> | Model for context gathering | same as --model |
| --eval-model <model> | Model for evaluation | same as --model |
| --timeout <seconds> | Max seconds per task step | 3600 |
| --context-timeout <seconds> | Max seconds per context step | 600 |
| --branch <name> | Branch to base worktrees on | current HEAD |
| --disable-mcp <servers...> | MCP servers to block in both arms | (none) |
| --keep-worktrees | Keep worktrees after the run for inspection | false |
| --verbose | Stream agent activity live | false |

If --criteria is omitted, the evaluation and attribution steps are skipped.
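
The precedence rules above (flags override fixture values, documented defaults fill the gaps, and evaluation is skipped without criteria) can be sketched as a small merge function. This is a minimal illustration and not the simulator's actual code: the `RunOptions` and `ResolvedOptions` shapes, the camelCase key names, and `resolveOptions` itself are assumptions made for the example.

```typescript
// Hypothetical option shape; key names are assumed, not taken from the tool.
interface RunOptions {
  repo?: string;
  task?: string;
  criteria?: string;
  agent?: "claude" | "codex" | "cursor" | "grok";
  model?: string;
  contextModel?: string;
  evalModel?: string;
  timeout?: number;
  contextTimeout?: number;
}

interface ResolvedOptions {
  repo: string;
  task: string;
  criteria?: string;
  agent: "claude" | "codex" | "cursor" | "grok";
  model: string;
  contextModel: string;
  evalModel: string;
  timeout: number;
  contextTimeout: number;
  evaluate: boolean;
}

function resolveOptions(fixture: RunOptions, flags: RunOptions): ResolvedOptions {
  // Drop unset flags so they cannot clobber values from the fixture.
  const setFlags = Object.fromEntries(
    Object.entries(flags).filter(([, value]) => value !== undefined),
  ) as RunOptions;
  const merged: RunOptions = { ...fixture, ...setFlags };

  if (!merged.repo) throw new Error("repo is required");
  if (!merged.task) throw new Error("task is required");

  // Defaults follow the flag table: sonnet, claude, 3600 s, 600 s.
  const model = merged.model ?? "sonnet";
  return {
    repo: merged.repo,
    task: merged.task,
    criteria: merged.criteria,
    agent: merged.agent ?? "claude",
    model,
    contextModel: merged.contextModel ?? model,
    evalModel: merged.evalModel ?? model,
    timeout: merged.timeout ?? 3600,
    contextTimeout: merged.contextTimeout ?? 600,
    // Without criteria there is nothing to score or attribute against.
    evaluate: merged.criteria !== undefined && merged.criteria.trim() !== "",
  };
}
```

Resolving the context and evaluation models after the merge means a `--model` flag also changes them unless they were set explicitly, which matches the "same as --model" defaults in the table.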
Each run writes a timestamped directory under results/ containing:
- report.html: a self-contained visual report (open in a browser)
- result.json: full structured results for programmatic analysis

It also prints a comparison table to the terminal: quality score, wall-clock time, cost, and token counts for each arm and each phase.
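
For programmatic analysis, the per-arm numbers can be reduced to a single set of differences. The sketch below assumes a simplified, hypothetical `ArmMetrics` shape; the real schema of result.json is not documented here, so map its fields onto this shape before using anything like it.

```typescript
// Hypothetical per-arm summary; field names are assumptions for illustration.
interface ArmMetrics {
  score: number;   // evaluator score, 0 to 100
  seconds: number; // wall-clock time
  costUsd: number; // reported cost in US dollars
}

interface ArmDelta {
  scorePoints: number;  // positive means the context arm scored higher
  secondsSaved: number; // positive means the context arm was faster
  costSavedUsd: number; // positive means the context arm was cheaper
}

function compareArms(baseline: ArmMetrics, context: ArmMetrics): ArmDelta {
  return {
    scorePoints: context.score - baseline.score,
    secondsSaved: baseline.seconds - context.seconds,
    costSavedUsd: baseline.costUsd - context.costUsd,
  };
}
```

Signs are chosen so that a positive value always favors the context arm, which keeps the three metrics easy to read side by side.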
To measure the isolated effect of context gathering, the baseline arm must not have backdoor access to the same context sources. Use --disable-mcp (or disableMcp: in a fixture) to block specific MCP servers in both arms' worktrees; the simulator also detects and aborts a run if a blocked server is called, so contamination can't silently skew the comparison.
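
The contamination check can be pictured as a scan over the tool calls an agent made during a run. This sketch assumes tool names follow the `mcp__<server>__<tool>` pattern that Claude Code uses for MCP tools; other agents may name tools differently, and both function names are hypothetical rather than the simulator's own.

```typescript
// Returns the blocked server a tool call targets, or undefined if the call
// is not an MCP call or targets a server that is not blocked.
// Assumes the "mcp__<server>__<tool>" naming convention.
function blockedServerFor(toolName: string, disabled: string[]): string | undefined {
  const parts = toolName.split("__");
  const [prefix, server] = parts;
  if (parts.length < 3 || prefix !== "mcp" || server === undefined) {
    return undefined;
  }
  return disabled.includes(server) ? server : undefined;
}

// Throws on the first call that reaches a blocked server, so a contaminated
// run fails loudly instead of producing a misleading comparison.
function assertNoContamination(toolNames: string[], disabled: string[]): void {
  for (const name of toolNames) {
    const server = blockedServerFor(name, disabled);
    if (server !== undefined) {
      throw new Error(`Blocked MCP server "${server}" was called via ${name}; aborting run`);
    }
  }
}
```

Failing the whole run, rather than discarding the single call, matches the stated goal: a comparison that touched a blocked source is not trustworthy.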
npm start -- --fixture my-experiment.yaml --disable-mcp some-context-server

- Each arm runs in a throwaway git worktree (under your system temp dir, or agent-managed for Claude/Cursor). They're cleaned up automatically unless --keep-worktrees is set.
- Cost is computed from each agent's reported token usage where available. Subscription-billed agents (e.g. Grok) report token counts but no per-token price, so their cost shows as $0.
- The tool never commits or pushes; it only reads the target repo and works inside isolated worktrees.
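
The cost note above can be illustrated with a small calculation: tokens multiplied by a per-million price when one is known, and zero when the agent reports tokens without a price. The `TokenUsage` and `Pricing` shapes and the function name are hypothetical, and any prices passed in are the caller's own assumption, not values published by this tool.

```typescript
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// Hypothetical pricing shape: US dollars per one million tokens.
interface Pricing {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number;
}

function estimateCostUsd(usage: TokenUsage, pricing?: Pricing): number {
  // Subscription-billed agents report tokens but no price, so cost is 0.
  if (pricing === undefined) {
    return 0;
  }
  const inputCost = (usage.inputTokens / 1_000_000) * pricing.inputPerMillionUsd;
  const outputCost = (usage.outputTokens / 1_000_000) * pricing.outputPerMillionUsd;
  return inputCost + outputCost;
}
```

A $0 cost in the report therefore does not mean a run was free; compare token counts when one of the agents is subscription-billed.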
MIT © Unblocked