A deterministic benchmark for evaluating LLM code generation accuracy: 22 models tested across 20 progressively harder levels of Go coding tasks, scored by compilation, test execution, and pattern matching. No LLM judgment in scoring.
Six models qualified for all 20 levels:
| Model | Provider | Score | Pct |
|---|---|---|---|
| Claude Opus 4.6 | Anthropic | 109/118 | 92% |
| Claude Sonnet 4.5 | Anthropic | 106/118 | 89% |
| Claude Haiku 4.5 | Anthropic | 97/118 | 82% |
| GPT-4o | OpenAI | 78/118 | 66% |
| Qwen 2.5 72B | Ollama | 78/118 | 66% |
| Qwen 2.5 Coder 32B | Ollama | 76/118 | 64% |
Best local model: Qwen 2.5 Coder 32B at 64% overall. Full results in REPORT.md.
20 specs across three tiers:
| Tier | Levels | Points | What's Tested |
|---|---|---|---|
| Core | L1-L7 | 40 | Bug finding, implementation, refactoring, test writing, ambiguous specs, constraint satisfaction |
| Advanced | L8-L13 | 36 | Deep bugs, contradictions, state machines, security audits, large refactors, adversarial tests |
| Expert | L14-L20 | 42 | Concurrency bugs (write skew, deadlock, goroutine leak, data race, split-brain, invariant violation) + unsolvable control |
All models receive the same spec through the same streaming harness, eliminating the delivery bias that skews results when different models get different prompting contexts.
```sh
# Build
go build -buildvcs=false -o specgen .

# Run a single level (Ollama)
./specgen --model qwen2.5-coder:32b --spec specs/level1_bug_finding.md

# Run with an API provider
export ANTHROPIC_API_KEY=...
./specgen --provider anthropic --model claude-sonnet-4-5-20250929 --spec specs/level1_bug_finding.md

# Score all results
bash score.sh
```

Output lands in `output/<model>/<level>/` with `response.md`, `metadata.json`, and extracted `.go` files.
- `specgen` reads a spec file and sends it to a model via the Ollama, Anthropic, or OpenAI APIs
- The response is streamed, code blocks are extracted, and everything is saved to `output/`
- `score.sh` evaluates each response: compiles the code, runs tests (including canonical harness tests for L14-L19), and checks for bug identification via regex
- Results are deterministic and reproducible
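The extraction step can be sketched like this; `extractGoBlocks` is a hypothetical name, and the real `extract.go` may handle more fence variants:

```go
package main

import (
	"fmt"
	"regexp"
)

// extractGoBlocks pulls fenced ```go code blocks out of a markdown
// response. (?s) lets . match newlines; the non-greedy group stops
// at the first closing fence.
func extractGoBlocks(md string) []string {
	re := regexp.MustCompile("(?s)```go\\n(.*?)```")
	var blocks []string
	for _, m := range re.FindAllStringSubmatch(md, -1) {
		blocks = append(blocks, m[1])
	}
	return blocks
}

func main() {
	md := "Here is the fix:\n```go\nfunc Add(a, b int) int { return a + b }\n```\nDone."
	for _, b := range extractGoBlocks(md) {
		fmt.Print(b)
	}
}
```

Each extracted block is then written out as a `.go` file for the compile-and-test scoring passes.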
| Script | What it runs |
|---|---|
| `run_all_models.sh` | 14 Ollama models through L1-L7 |
| `run_api_models.sh` | Anthropic + OpenAI models through L1-L7 |
| `run_expert_v2.sh` | 6 qualifying models through L14-L20 |
```
main.go                                 CLI entry point
provider.go                             Provider interface
provider_{ollama,anthropic,openai}.go   Per-provider API clients
extract.go                              Code block extraction from markdown
score.sh                                Automated scorer
specs/                                  20 spec files (L1-L20)
specs/tests/                            Canonical test files for L14-L20
output/                                 Results per model per level
REPORT.md                               Full benchmark report
RUNBOOK.md                              How to run and score benchmarks
```
Zero external Go dependencies. Scoring uses only bash, go build, and go test.
- Go 1.21+
- For local models: Ollama with models pulled
- For API models: `ANTHROPIC_API_KEY` and/or `OPENAI_API_KEY` environment variables
- REPORT.md — Full results, per-level analysis, timing data
- RUNBOOK.md — Detailed instructions for running benchmarks
- specs/README.md — Level descriptions and scoring criteria
- design/project-summary.md — Architecture and design decisions