Standardized test prompts for evaluating AI coding assistants. Reproducible, transparent, open.
By RunAICode.ai — making AI tool reviews verifiable.
Most AI coding tool reviews are subjective. We're changing that: every benchmark in this repo is a concrete, reproducible task that any AI tool can attempt, with clear pass/fail criteria. Benchmarks fall into six categories:
- Code Generation — Write functions from descriptions
- Refactoring — Improve existing code structure
- Debugging — Find and fix bugs in provided code
- Multi-file — Changes that span multiple files
- Testing — Generate meaningful test suites
- DevOps — Infrastructure and automation tasks
Each benchmark includes:
- Task description — What the AI needs to do
- Input files — Starting code state
- Expected output — What correct completion looks like
- Evaluation criteria — How we score (correctness, code quality, speed)
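As a rough sketch of how those four components could fit together in a scoring harness (the field names, directory-style benchmark name, and exact-match pass rule below are illustrative assumptions, not this repo's actual schema):

```python
# Hypothetical sketch: representing a benchmark's four components and
# scoring a submission. Field names and the exact-match pass rule are
# assumptions for illustration, not the repo's real format.

def evaluate(benchmark: dict, submission: str) -> dict:
    """Score a submission against a benchmark's pass/fail criterion."""
    passed = benchmark["expected_output"].strip() == submission.strip()
    return {"benchmark": benchmark["name"], "passed": passed}

benchmark = {
    "name": "code-generation/fizzbuzz",               # hypothetical ID
    "task": "Write a function that ...",              # Task description
    "input_files": {"main.py": "# starter code"},     # Input files
    "expected_output": "1\n2\nFizz",                  # Expected output
}

result = evaluate(benchmark, "1\n2\nFizz")
print(result)  # {'benchmark': 'code-generation/fizzbuzz', 'passed': True}
```

A real harness would add the other evaluation axes mentioned above (code quality, speed) alongside correctness; this sketch shows only the pass/fail core.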
See our benchmark results page for the latest head-to-head comparisons.
Submit new benchmarks via pull request; include all four components listed above.
MIT License
Part of the RunAICode collection