A qualitative benchmark suite for evaluating AI coding agents and orchestration paradigms on realistic, complex development tasks.
Current agent benchmarks test narrow, artificial tasks — "Can it solve HumanEval problems?" or "Can it fix isolated bugs?" But real development is messier and more interesting. When you're choosing an agent for actual work, you want to know:
- How does it handle architectural decisions?
- Can it design good UX without explicit requirements?
- Does it understand creative constraints?
- How well does it integrate complex systems (LLMs + games, Git + web interfaces)?
- Does orchestration actually help, or is it overhead?
These questions don't have numerical answers. They require looking at real implementations and making informed judgments based on your priorities.
This suite provides:
- Realistic use cases that test architecture, creativity, and judgment (not just correctness)
- Diverse challenges spanning RAG systems, games, algorithms, distributed systems, and creative tooling
- Both one-shot and orchestration approaches so you can compare paradigms
- Structured evaluation that captures qualitative insights consistently
The goal: help developers (including myself) make informed decisions about which agents and approaches work best for different kinds of real-world tasks.
"Isn't this just vibes-based evaluation?"
Yes, partially. But vibes matter. If the code is a mess, the tests are brittle, and you needed best-of-50 to get there, it's shit even if it scores 80% on SWE-Bench.
This repository contains a curated series of development challenges designed to compare how different AI coding agents and orchestration approaches handle real-world software engineering tasks. Unlike automated benchmarks with scoring metrics, this series focuses on qualitative evaluation — allowing developers to review code quality, architectural decisions, UX implementation, and overall agent behavior to form their own informed opinions.
Different agents think differently, structure solutions differently, and excel at different aspects of development. This series embraces that diversity and provides rich, realistic tasks where those differences become visible and meaningful.
Each use case follows a consistent structure:
```
XX-use-case-name/
├── README.md             # Use case description and evaluation criteria
├── _base/                # Starting environment (prompt + boilerplate)
│   ├── prompt.md         # Task specification for the agent
│   ├── pyproject.toml    # Initial Python setup (if applicable)
│   └── src/              # Minimal starter code
├── coding_agents/        # Results from individual coding agents
│   ├── agent-name-1/     # Implementation by agent 1
│   └── agent-name-2/     # Implementation by agent 2
└── orchestration/        # Results from orchestration paradigms
    ├── paradigm-1/       # e.g., bmad, spec-kit, openspec
    │   ├── agent-a/      # Orchestration run with agent A
    │   └── agent-b/      # Orchestration run with agent B
    └── paradigm-2/
        ├── agent-a/
        └── agent-b/
```
Note: Orchestration folders have an additional nesting level because the same orchestration paradigm (e.g., bmad) can be run with different underlying agents (e.g., Claude, Gemini, GPT-4). This allows comparing how the orchestration approach performs across different agent backends.
This is your starting point for each use case. It contains:
- `prompt.md` - The complete task specification to give to the agent
- Boilerplate code - Minimal project structure to get started
- Configuration files - Pre-configured tooling (uv, Python version, etc.)
To test an agent or orchestration approach, simply copy the _base folder contents into a new subfolder under coding_agents/ or orchestration/, then provide the prompt.md to your chosen agent.
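The copy step can be sketched as follows; `01-research-scraper` and `my-agent` are example names, and the script sets up a stand-in `_base` in a scratch directory so it can be run safely anywhere:

```shell
# Illustrative run setup in a throwaway directory.
cd "$(mktemp -d)"

# Stand-in for an existing use case (in the real repo this already exists):
mkdir -p 01-research-scraper/_base/src
echo "# task spec" > 01-research-scraper/_base/prompt.md

# The actual step: copy _base into a fresh run folder, then start your agent there.
mkdir -p 01-research-scraper/coding_agents
cp -r 01-research-scraper/_base 01-research-scraper/coding_agents/my-agent
ls 01-research-scraper/coding_agents/my-agent   # lists prompt.md and src
```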
Complex RAG application with autonomous data collection
Tests: Long-running background processes, GraphRAG implementation, multi-view UI (list/detail/dashboard), web scraping, LLM integration, real-time feedback, database design.
A research catalog database that autonomously discovers and analyzes AI research papers from arXiv and other sources. Features include filterable paper lists, detailed paper analysis views with relationship graphs, similarity search, and theory validation modes.
Educational web app with interactive visualizations
Tests: Educational UX design, algorithm visualization, client-side computation, modern SPA architecture, accessibility, pedagogical clarity.
An interactive playground teaching laypeople about early text generation methods (n-grams, genetic algorithms, and a third chosen concept) through hands-on experimentation.
ML training pipeline with research phase
Tests: Research capabilities, ML/AI knowledge, training infrastructure, real-time monitoring UI, hardware constraints awareness, test-driven development.
A complete image generation model training pipeline that includes web research to choose an appropriate architecture, then builds a polished UI for dataset management, live training monitoring, and inference.
Production-ready web game with original game design
Tests: Creative problem solving, game design knowledge, web-based research, procedural generation, cross-platform development, comprehensive testing, UI/UX polish.
Design and implement a completely novel puzzle game from scratch. The game must be original, addictive, easy to learn but hard to master, with procedurally generated levels and production-ready quality.
Realtime collaborative text editor with custom CRDT
Tests: Distributed systems knowledge, CRDT algorithms, concurrent operations, WebSocket realtime architecture, persistence & compaction, Node.js backend, test-driven development.
Implement a collaborative text editor from scratch with a custom CRDT implementation (no libraries), WebSocket realtime sync, persistence, and strict acceptance criteria testing.
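The sequence CRDT this task demands is substantially harder than this, but for intuition about what "custom CRDT, no libraries" means, here is a minimal last-writer-wins register sketch in Python (illustrative only; the task itself targets a Node.js stack):

```python
# Minimal last-writer-wins register: each replica keeps (clock, replica_id) alongside
# its value, and merge() keeps the greatest stamp. Merges are commutative,
# associative, and idempotent — the core property every CRDT must provide.
from dataclasses import dataclass

@dataclass
class LWWRegister:
    value: str = ""
    stamp: tuple = (0, "")  # (logical clock, replica id); replica id breaks ties

    def set(self, value: str, clock: int, replica_id: str) -> None:
        self.value, self.stamp = value, (clock, replica_id)

    def merge(self, other: "LWWRegister") -> None:
        if other.stamp > self.stamp:
            self.value, self.stamp = other.value, other.stamp

a, b = LWWRegister(), LWWRegister()
a.set("hello", 1, "A")
b.set("world", 2, "B")
a.merge(b); b.merge(a)   # both replicas converge regardless of merge order
print(a.value, b.value)  # world world
```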
Event-driven job queue with API, worker, and real-time dashboard
Tests: Distributed architecture, event-driven design, retry logic, WebSocket real-time updates, FastAPI backend, database-backed coordination, production patterns.
Build a job orchestration system with separate API and worker services, retry with exponential backoff, SQLite coordination, and a live monitoring dashboard—all without external queue services.
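As a sketch of the retry requirement, a capped exponential backoff schedule looks like this (the base and cap values are arbitrary choices, not from the task spec):

```python
# Illustrative retry schedule: delay doubles each attempt, capped at a maximum.
def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))

print([backoff_delay(a) for a in range(7)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
```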
Automated Python 2 to 3 migration with AST transformation
Tests: Code analysis & comprehension, AST manipulation, test-driven refactoring, tool & CLI design, language feature knowledge, modular architecture.
Build a tool that automatically migrates Python 2.7 code to Python 3.12 using AST transformations (no 2to3), preserving functionality while modernizing syntax and verifying tests still pass.
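Real Python 2 source needs a Py2-aware parser (the stdlib `ast` only parses Python 3 syntax), but the `ast.NodeTransformer` pattern at the heart of such a tool looks like this sketch, shown on a Py3-parseable snippet:

```python
import ast

class XrangeToRange(ast.NodeTransformer):
    """One tiny Py2->Py3 rewrite: calls to xrange(...) become range(...)."""
    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)  # transform nested calls first
        if isinstance(node.func, ast.Name) and node.func.id == "xrange":
            node.func = ast.Name(id="range", ctx=ast.Load())
        return node

src = "for i in xrange(3):\n    total = i * 2"
tree = XrangeToRange().visit(ast.parse(src))
print(ast.unparse(tree))  # the loop now iterates over range(3)
```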
Content management using Git as storage backend
Tests: Creative architecture patterns, Git API integration, diff visualization, version control UX design, collaborative editing workflows, full-text search, markdown processing.
Build a wiki where every page is a markdown file in Git, with full version history, visual diffs, branch-based drafts, conflict resolution UI, and search—all through a web interface.
ASCII roguelike with LLM-powered NPCs
Tests: Architecture decision-making, research capability, LLM integration, async handling, game design, procedural generation, creative problem-solving, interface design.
Design and build an ASCII roguelike where NPCs are controlled by LLMs. The first challenge: research and decide on programming language, game framework, LLM integration approach, and interface design. Then implement a POC showcasing emergent gameplay through dynamic NPC interactions.
Research, conceptualize, and build a groundbreaking AI library
Tests: Web research capability, ecosystem analysis, creative ideation, gap identification, critical evaluation, rapid prototyping, API design, developer empathy.
Research the AI/LLM Python library landscape via web search, identify gaps and opportunities, generate 5-7 novel library ideas focusing on creative/fun use cases, rate them rigorously, then implement a 100% functional MVP of your best idea that makes developers smile.
Sleek web app for managing video datasets for AI training
Tests: Full-stack web development, video processing with FFmpeg, modern UI/UX design, media file handling, async processing, performance optimization, masonry layouts.
Build a local web application featuring a masonry gallery with infinite scroll and auto-play, video upload with FPS conversion, and video splitting via scene detection or fixed frame intervals—all with a sleek, modern design.
Meta-app that transforms itself based on user descriptions
Tests: Meta-programming, code generation, runtime transformation, security sandboxing, dynamic persistence, LLM integration for code, research capability, creative system architecture.
Build an app that starts as a text box where users describe software ideas. The app uses LLM code generation to transform itself into the described application with UI, logic, and persistence—all executing safely in a sandbox. Mandatory research phase analyzing viability and security before implementation.
Comprehensive research reports via recursive workflows using specialized orchestration libraries
Tests: Learning unfamiliar libraries, recursive workflow design, multi-agent orchestration, LLM integration patterns, web search, content synthesis, complex state management.
Build a research system using one of 27 specialized orchestration libraries to generate multi-page reports. The system recursively decomposes topics into sub-tasks, gathers information from web searches, synthesizes findings, and produces polished reports matching target page counts.
Port terminal-based social deception game to modern web interface
Tests: Code comprehension, platform translation (terminal → web), full-stack development, real-time chatroom design, LLM integration, UI/UX design, game state management, WebSocket implementation.
Port Among LLMs from terminal to web. Build a browser-based chatroom where one human must survive among AI agents trying to identify and vote out the player. Implement real-time chat, DMs, voting, message manipulation (edit/delete/impersonate), and persona/scenario system with polished, intuitive UI.
Use the interactive setup tool to quickly configure a new comparison run:
```bash
python setup-run.py
```

This will guide you through:
- Selecting a use case
- Choosing between coding agent or orchestration
- Specifying your agent harness and model
- Automatically copying `_base` to the correct location
- Creating a git branch with a standardized name
- Choose a use case (e.g., `01-research-scraper`)
- Copy the `_base` folder to a new location:
  ```bash
  cp -r 01-research-scraper/_base 01-research-scraper/coding_agents/my-agent-name
  ```
- Start your coding agent in the new folder
- Provide the `prompt.md` as the task specification
- Let the agent work autonomously to complete the task
- Review the results qualitatively
Orchestration paradigms (like bmad, spec-kit, openspec) define structured workflows for how agents should work. These can be run with different underlying agents:
- Choose a use case (e.g., `01-research-scraper`)
- Choose an orchestration paradigm (e.g., `bmad`)
- Copy the `_base` folder to the orchestration paradigm's subfolder:
  ```bash
  cp -r 01-research-scraper/_base 01-research-scraper/orchestration/bmad/claude-sonnet
  ```
- Run the orchestration using your chosen agent backend
- Review the results qualitatively
This nested structure lets you compare:
- Same paradigm, different agents: How does bmad perform with Claude vs Gemini?
- Same agent, different paradigms: How does Claude perform in bmad vs spec-kit?
- Orchestration vs direct: Does bmad+Claude beat Claude alone?
After completing a run, maintain consistent documentation using the provided templates:
Copy the appropriate evaluation template into your run folder:
- Coding agents: Copy `EVALUATION_REPORT_CODING_AGENT.md` → `[run-folder]/EVALUATION.md`
- Orchestration: Copy `EVALUATION_REPORT_ORCHESTRATION.md` → `[run-folder]/EVALUATION.md`
For use-case-specific evaluation templates (like 01-research-scraper), use the templates inside the use case folder:
- `01-research-scraper/EVALUATION_CODING_AGENT.md` (has all 40+ requirements pre-filled)
- `01-research-scraper/EVALUATION_ORCHESTRATION.md`
These templates provide structured rubrics covering:
- Autonomy - How much human intervention was required?
- Code Quality - Is it clean, maintainable, and idiomatic?
- Architecture - Are design decisions sound and well-reasoned?
- Completeness - Did the agent fulfill all requirements?
- UX/Polish - Is the user experience smooth and well-thought-out?
- Understanding - Did the agent grasp the task's nuances?
Add a quick entry to the use case index for at-a-glance comparison:
- Coding agents: Update `[use-case]/INDEX_CODING_AGENTS.md`
- Orchestration: Update `[use-case]/INDEX_ORCHESTRATION.md`
Minimal effort required:
- Add screenshot
- Fill in 5 fields: Status (✅/⚠️/❌), Time, Score, Quick summary (2-3 sentences), Rating
- Check 4 boxes for core features
- Update summary table
Example:
```markdown
## Cursor - Composer-1 - Nov 23, 2025

**Status:** ⚠️ Partial
**Time:** <30 minutes
**Score:** 18/30 ([detailed report](coding_agents/cursor_composer-1/EVALUATION.md))
**Quick Summary:** Fast initial implementation but needs handholding for final 20%. Amazing speed but gets lost without precise guidance.
**Core Features:**
- [x] Feature 1 working
- [ ] Feature 2 missing
- [ ] Feature 3 partial
- [ ] Feature 4 not tested
**Rating:** ⭐⭐⭐ 3/5 - Recommended for speed-critical prototypes
```
This two-tier system provides:
- Index files for quick scanning and comparison across runs
- Detailed evaluations for deep analysis when needed
Root Templates (use-case agnostic):
- `INDEX_CODING_AGENTS_TEMPLATE.md` - Quick index template for coding agent runs
- `INDEX_ORCHESTRATION_TEMPLATE.md` - Quick index template for orchestration runs
- `EVALUATION_REPORT_CODING_AGENT.md` - Detailed evaluation for coding agents
- `EVALUATION_REPORT_ORCHESTRATION.md` - Detailed evaluation for orchestration
Use-Case Templates (pre-filled with specific requirements):
- `[use-case]/INDEX_CODING_AGENTS.md` - Active index (copy from root template, customize)
- `[use-case]/INDEX_ORCHESTRATION.md` - Active index (copy from root template, customize)
- `[use-case]/EVALUATION_CODING_AGENT.md` - Pre-filled with use-case requirements
- `[use-case]/EVALUATION_ORCHESTRATION.md` - Pre-filled with use-case requirements
Each run folder should contain:
- `EVALUATION.md` - Detailed evaluation report (copied from template, filled out)
- `screenshot.png` (or similar) - Main application screenshot for the index
- Optional: `.report/` folder - Session logs, transcripts, screenshots (if comfortable sharing)
Example run structure:
```
coding_agents/cursor_composer-1/
├── EVALUATION.md         # Detailed evaluation (required)
├── screenshot.png        # Main screenshot for index (required)
├── .report/              # Optional evaluation assets
│   ├── chat-log.md
│   ├── transcript.json
│   └── screenshots/
├── src/                  # The actual implementation
└── ...
```
If possible and if you're comfortable doing so, include a .report/ folder containing:
- Chat logs or conversation history with the agent
- Session transcripts showing the agent's reasoning process
- Any other interaction data that might be useful for analysis
This is purely optional due to potential privacy concerns and sensitive information that could be contained in such data (API keys, personal information, proprietary code snippets, etc.). Only include this if you've reviewed the data and are comfortable sharing it.
If .report/ session logs are not provided, include screenshots documenting critical points:
- Final application - Screenshots showing the completed app running (UI views, CLI output, key features)
- Notable struggles - Screenshots of errors, failed attempts, or areas where the agent struggled
- Interesting decisions - Moments where the agent made notable architectural or design choices
- Manual interventions - Context where you had to guide or correct the agent
Place screenshots in the .report/ folder within your run directory.
Example structure:
```
coding_agents/cursor_claude-sonnet/
├── .report/              # Optional session data
│   ├── chat-log.md
│   ├── transcript.json
│   └── screenshots/
│       ├── 01-finished-app.png
│       ├── 02-error-handling-struggle.png
│       └── 03-final-ui.png
├── src/
└── ...
```
- Your Workflow Match - Does this agent's approach align with how you'd want to work?
Each developer will weight these differently based on their needs, which is exactly the point.
After completing a run:
- Fill out detailed evaluation (in run folder):
  - Copy template: `EVALUATION_REPORT_CODING_AGENT.md` or use-case-specific version
  - Save as: `[run-folder]/EVALUATION.md`
- Add quick index entry (in use case root):
  - Edit: `[use-case]/INDEX_CODING_AGENTS.md`
  - Add: Screenshot, status, time, score, 2-3 sentence summary, 4 feature checkboxes, rating
- Include required files (in run folder):
  - `EVALUATION.md` (detailed evaluation)
  - `screenshot.png` (main screenshot)
  - Optional: `.report/` (session logs, screenshots, transcripts)

Templates available:
- Root templates (generic): `INDEX_*_TEMPLATE.md`, `EVALUATION_REPORT_*.md`
- Use case templates (pre-filled): `[use-case]/EVALUATION_*.md`, `[use-case]/INDEX_*.md`
Software engineering is inherently personal and context-dependent. What makes a "good" solution depends on:
- Your team's coding standards and practices
- Your domain and constraints
- Your priorities (speed vs. polish, innovation vs. reliability)
- Your workflow preferences
Automated metrics can't capture these nuances. This series gives you the raw material to make your own informed judgment about which agents best serve your specific needs.
This is a personal comparison suite, but if you'd like to suggest additional use cases or improvements:
- Each use case should be realistic and non-trivial
- Tasks should test different aspects of agent capabilities
- Prompts should be clear but not overly prescriptive
- Use cases should be completable but challenging
To add a new use case to the suite:
- Create the directory structure:
  ```bash
  mkdir -p XX-use-case-name/_base
  mkdir -p XX-use-case-name/coding_agents
  mkdir -p XX-use-case-name/orchestration
  ```
- Use the prompt template: Copy `PROMPT_TEMPLATE.md` to your use case's `_base` folder and fill it out:
  ```bash
  cp PROMPT_TEMPLATE.md XX-use-case-name/_base/prompt.md
  ```
- Follow the template structure:
  - Goal: 3-5 line pitch of what's being built
  - Context & Constraints: Stack, hardware, scope, time estimate
  - Requirements: Core (must have) and Stretch (nice to have)
  - Quality Expectations: Architecture, testing, UX, documentation standards
  - Process: Research or design steps (if applicable)
  - Deliverables: Checklist of what to submit
  - Success Criteria: 3-5 concrete indicators of success
- Create the README: Add a `README.md` in your use case directory explaining:
  - What this use case tests
  - Why it's interesting/challenging
  - Evaluation criteria (qualitative aspects to review)
- Add minimal scaffolding: Include only essential boilerplate in `_base`:
  - `pyproject.toml` or `package.json` if stack is predetermined
  - Minimal config files (if needed)
  - Do NOT include implementation code - agents start from scratch
- Update the root README: Add your use case to the list in this file with a brief description.
Prompt writing tips:
- Be specific about requirements but flexible about implementation
- Include concrete acceptance criteria when relevant
- Provide context without prescribing solutions
- Balance clarity with creative freedom
- See existing use cases for examples
See LICENSE for details.

