A deterministic benchmark for evaluating LLM code generation accuracy: 22 models tested across 20 progressively harder levels of Go coding tasks, scored by compilation, test execution, and pattern matching. No LLM judgment in scoring.
Six models qualified for all 20 levels:
| Model | Provider | Score | Pct |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 109/118 | 92% |
| Claude Sonnet 4.5 | Anthropic | 106/118 | 89% |
| Claude Haiku 4.5 | Anthropic | 97/118 | 82% |
| GPT-4o | OpenAI | 78/118 | 66% |
| Qwen 2.5 72B | Ollama | 78/118 | 66% |
| Qwen 2.5 Coder 32B | Ollama | 76/118 | 64% |
Best local model: Qwen 2.5 Coder 32B at 64% overall. Full results in REPORT.md.
20 specs across three tiers:
| Tier | Levels | Points | What's Tested |
|---|---|---|---|
| Core | L1-L7 | 40 | Bug finding, implementation, refactoring, test writing, ambiguous specs, constraint satisfaction |
| Advanced | L8-L13 | 36 | Deep bugs, contradictions, state machines, security audits, large refactors, adversarial tests |
| Expert | L14-L20 | 42 | Concurrency bugs (write skew, deadlock, goroutine leak, data race, split-brain, invariant violation) + unsolvable control |
All models receive the same spec through the same streaming harness, eliminating the delivery bias that skews results when different models get different prompting contexts.
```sh
# Build
go build -buildvcs=false -o specgen .

# Run a single level (Ollama)
./specgen --model qwen2.5-coder:32b --spec specs/level1_bug_finding.md

# Run with an API provider
export ANTHROPIC_API_KEY=...
./specgen --provider anthropic --model claude-sonnet-4-5-20250929 --spec specs/level1_bug_finding.md

# Score all results
bash score.sh
```

Output lands in `output/<model>/<level>/` with `response.md`, `metadata.json`, and extracted `.go` files.
- `specgen` reads a spec file and sends it to a model via the Ollama, Anthropic, or OpenAI APIs
- The response is streamed, code blocks are extracted, and everything is saved to `output/`
- `score.sh` evaluates each response: compiles the code, runs tests (including canonical harness tests for L14-L19), and checks for bug identification via regex
- Results are deterministic and reproducible
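The extraction step can be sketched like this; `extractGoBlocks` is a hypothetical name, and the real `extract.go` may handle more fence variants:

```go
package main

import (
	"fmt"
	"regexp"
)

// extractGoBlocks pulls fenced ```go code blocks out of a markdown
// response. (?s) lets . match newlines; the non-greedy group stops
// at the first closing fence.
func extractGoBlocks(md string) []string {
	re := regexp.MustCompile("(?s)```go\\n(.*?)```")
	var blocks []string
	for _, m := range re.FindAllStringSubmatch(md, -1) {
		blocks = append(blocks, m[1])
	}
	return blocks
}

func main() {
	md := "Here is the fix:\n```go\nfunc Add(a, b int) int { return a + b }\n```\nDone."
	for _, b := range extractGoBlocks(md) {
		fmt.Print(b)
	}
}
```

Each extracted block is then written out as a `.go` file for the compile-and-test scoring passes.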
| Script | What it runs |
|---|---|
| `run_all_models.sh` | 14 Ollama models through L1-L7 |
| `run_api_models.sh` | Anthropic + OpenAI models through L1-L7 |
| `run_expert_v2.sh` | 6 qualifying models through L14-L20 |
```
main.go                                 CLI entry point
provider.go                             Provider interface
provider_{ollama,anthropic,openai}.go   Per-provider API clients
extract.go                              Code block extraction from markdown
score.sh                                Automated scorer
specs/                                  20 spec files (L1-L20)
specs/tests/                            Canonical test files for L14-L20
output/                                 Results per model per level
REPORT.md                               Full benchmark report
RUNBOOK.md                              How to run and score benchmarks
```
Zero external Go dependencies. Scoring uses only bash, go build, and go test.
- Go 1.21+
- For local models: Ollama with models pulled
- For API models: `ANTHROPIC_API_KEY` and/or `OPENAI_API_KEY` environment variables
- REPORT.md — Full results, per-level analysis, timing data
- RUNBOOK.md — Detailed instructions for running benchmarks
- specs/README.md — Level descriptions and scoring criteria
- design/project-summary.md — Architecture and design decisions