Reusable benchmarking scripts for comparing LLM inference harnesses and models. Written in Go with no external dependencies.
Supported providers: Ollama, oMLX, Anthropic, OpenAI
- Go 1.22+
- At least one provider running (Ollama on :11434, oMLX on :8080, or API keys set)
make build
# produces bin/bench-speed, bin/bench-quality, bin/bench-domain

Compare harness performance for the same model. Local providers only (ollama, omlx).
./bin/bench-speed -provider ollama -model gemma4:latest
./bin/bench-speed -provider omlx -model mlx-community/gemma-3-4b-4bit -runs 5

Runs three built-in prompts (short / medium / long) and reports time to first token (TTFT), tokens per second (TPS), latency, and standard deviation across runs. See METRICS.md for definitions.
Flags:
-provider     ollama|omlx (default: ollama)
-model        model name (required)
-runs         runs per prompt for variance (default: 3)
-timeout      per-request timeout (default: 10m)
-ollama-url   Ollama base URL (default: http://localhost:11434)
-omlx-url     oMLX base URL (default: http://localhost:8080)
-results      output directory (default: results)
-no-unload    skip evicting model from memory after run
Evaluate output quality with any supported provider. Runs factual Q&A, format compliance, and reasoning prompts, and produces an overall_score.
./bin/bench-quality -provider ollama -model gemma4:latest
./bin/bench-quality -provider anthropic -model claude-haiku-4-5-20251001
./bin/bench-quality -provider openai -model gpt-4o-mini -v
./bin/bench-quality -provider ollama -model gemma4:latest -consistency -consistency-runs 3

Flags:
-provider           ollama|omlx|anthropic|openai (default: ollama)
-model              model name (required)
-timeout            per-request timeout (default: 10m)
-consistency        run consistency test on factual prompts
-consistency-runs   runs per prompt for consistency test (default: 3)
-v                  print model responses and failures
-ollama-url         Ollama base URL (default: http://localhost:11434)
-omlx-url           oMLX base URL (default: http://localhost:8080)
-results            output directory (default: results)
-no-unload          skip evicting model from memory after run
Environment variables:
ANTHROPIC_API_KEY   required when -provider anthropic
OPENAI_API_KEY      required when -provider openai
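
The provider-to-key rule above can be sketched as a small pre-flight check. The function names `requiredKeyVar` and `checkAPIKey` are hypothetical and not taken from this repository; only the two environment variable names and the provider names come from this README.

```go
package main

import (
	"fmt"
	"os"
)

// requiredKeyVar maps a provider name to the environment variable it needs.
// Local providers (ollama, omlx) need none, so they map to "".
func requiredKeyVar(provider string) string {
	switch provider {
	case "anthropic":
		return "ANTHROPIC_API_KEY"
	case "openai":
		return "OPENAI_API_KEY"
	default:
		return ""
	}
}

// checkAPIKey returns an error if the provider needs a key that is unset.
// getenv is injected so the check can be exercised without touching the
// real environment.
func checkAPIKey(provider string, getenv func(string) string) error {
	name := requiredKeyVar(provider)
	if name == "" {
		return nil
	}
	if getenv(name) == "" {
		return fmt.Errorf("%s must be set when -provider %s", name, provider)
	}
	return nil
}

func main() {
	for _, p := range []string{"ollama", "omlx", "anthropic", "openai"} {
		if err := checkAPIKey(p, os.Getenv); err != nil {
			fmt.Println("not ready:", err)
			continue
		}
		fmt.Println("ready:", p)
	}
}
```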
Test a model against your own prompts and pass/fail checks. Define tasks in a JSON config file.
./bin/bench-domain -config prompts/domain.example.json -provider ollama -model gemma4:latest
./bin/bench-domain -config prompts/my-tasks.json -provider anthropic -model claude-haiku-4-5-20251001 -v

Flags:
-config       path to domain JSON config (required)
-provider     ollama|omlx|anthropic|openai (default: ollama)
-model        model name (required)
-timeout      per-request timeout (default: 10m)
-v            print model responses and check failures
-ollama-url   Ollama base URL (default: http://localhost:11434)
-omlx-url     oMLX base URL (default: http://localhost:8080)
-results      output directory (default: results)
-no-unload    skip evicting model from memory after run
Config format (see prompts/domain.example.json for a full example):
{
  "name": "My suite",
  "system": "Optional global system prompt",
  "prompts": [
    {
      "id": "my-task",
      "prompt": "Write a Go function that...",
      "max_tokens": 400,
      "checks": [
        { "type": "contains", "value": "error" },
        { "type": "max_words", "value": "300" },
        { "type": "json_valid" },
        { "type": "regex", "value": "func \\w+" }
      ]
    }
  ]
}

Available check types: contains, not_contains, max_words, min_words, json_valid, starts_with, regex
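
To make the config format more concrete, here is a minimal sketch of how such a file could be decoded and how each check type might be evaluated against a model response. The struct names, the `runCheck` function, and the exact semantics of each check (for example, counting words by whitespace, or trimming leading whitespace before starts_with) are assumptions for illustration; bench-domain may implement them differently.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Hypothetical types mirroring the JSON config shown above.
type check struct {
	Type  string `json:"type"`
	Value string `json:"value"`
}

type prompt struct {
	ID        string  `json:"id"`
	Prompt    string  `json:"prompt"`
	MaxTokens int     `json:"max_tokens"`
	Checks    []check `json:"checks"`
}

type suite struct {
	Name    string   `json:"name"`
	System  string   `json:"system"`
	Prompts []prompt `json:"prompts"`
}

// runCheck reports whether output satisfies c. The semantics are assumed.
func runCheck(c check, output string) (bool, error) {
	switch c.Type {
	case "contains":
		return strings.Contains(output, c.Value), nil
	case "not_contains":
		return !strings.Contains(output, c.Value), nil
	case "starts_with":
		return strings.HasPrefix(strings.TrimSpace(output), c.Value), nil
	case "max_words", "min_words":
		// The config stores the limit as a string, so parse it first.
		n, err := strconv.Atoi(c.Value)
		if err != nil {
			return false, fmt.Errorf("%s: bad value %q: %w", c.Type, c.Value, err)
		}
		words := len(strings.Fields(output))
		if c.Type == "max_words" {
			return words <= n, nil
		}
		return words >= n, nil
	case "json_valid":
		return json.Valid([]byte(output)), nil
	case "regex":
		re, err := regexp.Compile(c.Value)
		if err != nil {
			return false, err
		}
		return re.MatchString(output), nil
	default:
		return false, fmt.Errorf("unknown check type %q", c.Type)
	}
}

func main() {
	raw := `{"name":"My suite","prompts":[{"id":"my-task","prompt":"Write a Go function","max_tokens":400,
	"checks":[{"type":"contains","value":"error"},{"type":"regex","value":"func \\w+"}]}]}`
	var s suite
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		panic(err)
	}
	output := "func parse() error { return nil }" // stand-in for a model response
	for _, c := range s.Prompts[0].Checks {
		ok, err := runCheck(c, output)
		fmt.Println(c.Type, ok, err)
	}
}
```

Note that numeric limits such as max_words are written as JSON strings in the config, which is why the sketch converts them with strconv.Atoi.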
Results are written to results/YYYY-MM-DD/<benchmark>.jsonl as newline-delimited JSON, one object per prompt/run. The files are safe to append to across multiple runs.
# View today's quality results
cat results/$(date +%F)/bench-quality.jsonl | grep '"type":"summary"'

Each script unloads the model from memory when it finishes (Ollama only, via keep_alive: 0). This is the default; pass -no-unload to keep the model loaded between runs of the same model.
On a machine with 16 GB of RAM, only one large model fits in memory at a time. Benchmark one model at a time; the unload step ensures the next model has full memory available.
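
For reference, an unload request of this kind can be sketched as follows. Ollama's API documents that sending a generate request with keep_alive set to 0 and no prompt unloads the model; whether the scripts here use exactly this endpoint is an assumption, and `unloadBody` and `unloadModel` are hypothetical names.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// unloadRequest is the minimal body: a model name and keep_alive set to 0.
type unloadRequest struct {
	Model     string `json:"model"`
	KeepAlive int    `json:"keep_alive"`
}

// unloadBody builds the JSON payload for the unload request.
func unloadBody(model string) ([]byte, error) {
	return json.Marshal(unloadRequest{Model: model, KeepAlive: 0})
}

// unloadModel asks an Ollama server at baseURL to evict model from memory.
func unloadModel(baseURL, model string) error {
	body, err := unloadBody(model)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unload %s: unexpected status %s", model, resp.Status)
	}
	return nil
}

func main() {
	// Requires a local Ollama server; otherwise the error is printed.
	if err := unloadModel("http://localhost:11434", "gemma4:latest"); err != nil {
		fmt.Println("unload failed:", err)
		return
	}
	fmt.Println("model unloaded")
}
```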
scripts/
bench-speed/ Speed benchmarks (local harnesses)
bench-quality/ Quality benchmarks (all providers)
bench-domain/ Custom domain benchmarks (all providers)
shared/
provider/ Provider interface + Ollama, oMLX, Anthropic, OpenAI implementations
output/ JSONL result writer
prompts/
domain.example.json Example domain config
results/ Benchmark output (gitignored)
See METRICS.md for metric definitions. See CHOICES.md for design decisions.