# specgen

A deterministic benchmark for evaluating LLM code generation accuracy. 22 models tested across 20 progressive levels of Go coding tasks — scored by compilation, test execution, and pattern matching. No LLM judgment in scoring.

## Results

Six models qualified for all 20 levels:

| Model | Provider | Score | Pct |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 109/118 | 92% |
| Claude Sonnet 4.5 | Anthropic | 106/118 | 89% |
| Claude Haiku 4.5 | Anthropic | 97/118 | 82% |
| GPT-4o | OpenAI | 78/118 | 66% |
| Qwen 2.5 72B | Ollama | 78/118 | 66% |
| Qwen 2.5 Coder 32B | Ollama | 76/118 | 64% |

Best local model: Qwen 2.5 Coder 32B at 64% overall. Full results in REPORT.md.

## What This Tests

20 specs across three tiers:

| Tier | Levels | Points | What's Tested |
|---|---|---|---|
| Core | L1-L7 | 40 | Bug finding, implementation, refactoring, test writing, ambiguous specs, constraint satisfaction |
| Advanced | L8-L13 | 36 | Deep bugs, contradictions, state machines, security audits, large refactors, adversarial tests |
| Expert | L14-L20 | 42 | Concurrency bugs (write skew, deadlock, goroutine leak, data race, split-brain, invariant violation) + unsolvable control |

All models receive the same spec through the same streaming harness, eliminating the delivery bias that skews results when different models get different prompting contexts.

## Quick Start

```sh
# Build
go build -buildvcs=false -o specgen .

# Run a single level (Ollama)
./specgen --model qwen2.5-coder:32b --spec specs/level1_bug_finding.md

# Run with an API provider
export ANTHROPIC_API_KEY=...
./specgen --provider anthropic --model claude-sonnet-4-5-20250929 --spec specs/level1_bug_finding.md

# Score all results
bash score.sh
```

Output lands in `output/<model>/<level>/` with `response.md`, `metadata.json`, and extracted `.go` files.

## How It Works

1. `specgen` reads a spec file and sends it to a model via the Ollama, Anthropic, or OpenAI API
2. The response is streamed, code blocks are extracted, and everything is saved to `output/`
3. `score.sh` evaluates each response: it compiles code, runs tests (including canonical harness tests for L14-L19), and checks for bug identification via regex
4. Results are deterministic and reproducible
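The code-block extraction in step 2 can be sketched roughly as follows. This is an illustrative sketch only, assuming a simple regex over markdown fences; the function name and fence handling are assumptions, not the actual contents of `extract.go`:

```go
package main

import (
	"fmt"
	"regexp"
)

// fenceRe matches fenced ```go blocks in a markdown response.
// (?s) lets . span newlines; the inner group captures the block body.
var fenceRe = regexp.MustCompile("(?s)```go\\n(.*?)```")

// extractGoBlocks returns the body of every ```go fence in md.
func extractGoBlocks(md string) []string {
	var blocks []string
	for _, m := range fenceRe.FindAllStringSubmatch(md, -1) {
		blocks = append(blocks, m[1])
	}
	return blocks
}

func main() {
	resp := "Here is the fix:\n```go\npackage main\n```\nDone."
	for _, b := range extractGoBlocks(resp) {
		fmt.Printf("%q\n", b) // prints "package main\n"
	}
}
```

A real extractor would also need to handle unlabeled fences and nested backticks, which this sketch deliberately ignores.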

## Batch Scripts

| Script | What it runs |
|---|---|
| `run_all_models.sh` | 14 Ollama models through L1-L7 |
| `run_api_models.sh` | Anthropic + OpenAI models through L1-L7 |
| `run_expert_v2.sh` | 6 qualifying models through L14-L20 |

## Project Structure

```
main.go                    CLI entry point
provider.go                Provider interface
provider_{ollama,anthropic,openai}.go
extract.go                 Code block extraction from markdown
score.sh                   Automated scorer
specs/                     20 spec files (L1-L20)
specs/tests/               Canonical test files for L14-L20
output/                    Results per model per level
REPORT.md                  Full benchmark report
RUNBOOK.md                 How to run and score benchmarks
```

Zero external Go dependencies. Scoring uses only bash, `go build`, and `go test`.

## Prerequisites

- Go 1.21+
- For local models: Ollama with models pulled
- For API models: `ANTHROPIC_API_KEY` and/or `OPENAI_API_KEY` environment variables


## License

MIT
