ai-tester

Reusable benchmarking scripts for comparing LLM inference harnesses and models. Written in Go with no external dependencies.

Supported providers: Ollama, oMLX, Anthropic, OpenAI

Prerequisites

Go 1.22+
At least one provider available: Ollama running on :11434, oMLX running on :8080, or an API key set for Anthropic or OpenAI

Build

make build
# produces bin/bench-speed, bin/bench-quality, bin/bench-domain

Scripts

bench-speed

Compare harness performance for the same model. Local providers only (ollama, omlx).

./bin/bench-speed -provider ollama -model gemma4:latest
./bin/bench-speed -provider omlx   -model mlx-community/gemma-3-4b-4bit -runs 5

Runs three built-in prompts (short / medium / long) and reports TTFT, TPS, latency, and stddev across runs. See METRICS.md for definitions.
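As a rough illustration of how per-prompt figures can be aggregated across runs, the sketch below computes a mean and a sample standard deviation over a slice of readings. The helper names and the sample values are hypothetical, and the choice of the sample (n-1) form rather than the population form is an assumption; METRICS.md remains the authority on what bench-speed actually reports.

```go
package main

import (
	"fmt"
	"math"
)

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// sampleStddev returns the sample standard deviation (n-1 denominator).
// With fewer than two samples there is no spread to measure, so it returns 0.
// Assumption: the real tool may use a different estimator.
func sampleStddev(xs []float64) float64 {
	if len(xs) < 2 {
		return 0
	}
	m := mean(xs)
	ss := 0.0
	for _, x := range xs {
		d := x - m
		ss += d * d
	}
	return math.Sqrt(ss / float64(len(xs)-1))
}

func main() {
	// Hypothetical tokens-per-second readings from three runs of one prompt.
	tps := []float64{40, 42, 44}
	fmt.Printf("mean=%.2f stddev=%.2f\n", mean(tps), sampleStddev(tps))
}
```

With the default of three runs the stddev estimate is coarse; raising -runs gives a steadier figure at the cost of a longer benchmark.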

Flags:

-provider      ollama|omlx (default: ollama)
-model         model name (required)
-runs          runs per prompt for variance (default: 3)
-timeout       per-request timeout (default: 10m)
-ollama-url    Ollama base URL (default: http://localhost:11434)
-omlx-url      oMLX base URL (default: http://localhost:8080)
-results       output directory (default: results)
-no-unload     skip evicting model from memory after run

bench-quality

Evaluate output quality across any provider. Runs factual Q&A, format compliance, and reasoning prompts. Produces an overall_score.

./bin/bench-quality -provider ollama     -model gemma4:latest
./bin/bench-quality -provider anthropic  -model claude-haiku-4-5-20251001
./bin/bench-quality -provider openai     -model gpt-4o-mini -v
./bin/bench-quality -provider ollama     -model gemma4:latest -consistency -consistency-runs 3

Flags:

-provider          ollama|omlx|anthropic|openai (default: ollama)
-model             model name (required)
-timeout           per-request timeout (default: 10m)
-consistency       run consistency test on factual prompts
-consistency-runs  runs per prompt for consistency test (default: 3)
-v                 print model responses and failures
-ollama-url        Ollama base URL (default: http://localhost:11434)
-omlx-url          oMLX base URL (default: http://localhost:8080)
-results           output directory (default: results)
-no-unload         skip evicting model from memory after run

Environment variables:

ANTHROPIC_API_KEY   required when -provider anthropic
OPENAI_API_KEY      required when -provider openai

bench-domain

Test a model against your own prompts and pass/fail checks. Define tasks in a JSON config file.

./bin/bench-domain -config prompts/domain.example.json -provider ollama -model gemma4:latest
./bin/bench-domain -config prompts/my-tasks.json       -provider anthropic -model claude-haiku-4-5-20251001 -v

Flags:

-config        path to domain JSON config (required)
-provider      ollama|omlx|anthropic|openai (default: ollama)
-model         model name (required)
-timeout       per-request timeout (default: 10m)
-v             print model responses and check failures
-ollama-url    Ollama base URL (default: http://localhost:11434)
-omlx-url      oMLX base URL (default: http://localhost:8080)
-results       output directory (default: results)
-no-unload     skip evicting model from memory after run

Config format — see prompts/domain.example.json for a full example:

{
  "name": "My suite",
  "system": "Optional global system prompt",
  "prompts": [
    {
      "id": "my-task",
      "prompt": "Write a Go function that...",
      "max_tokens": 400,
      "checks": [
        { "type": "contains",  "value": "error" },
        { "type": "max_words", "value": "300"   },
        { "type": "json_valid"                  },
        { "type": "regex",     "value": "func \\w+" }
      ]
    }
  ]
}
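For readers who want to generate or validate such a file programmatically, here is a minimal sketch of Go types that mirror the JSON above. The type names, the parseSuite helper, and the rule that id and prompt must be non-empty are assumptions made for illustration; they are not taken from the bench-domain source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Check mirrors one entry of the "checks" array. Value is a string even for
// numeric checks, matching the example ("max_words" uses "300").
type Check struct {
	Type  string `json:"type"`
	Value string `json:"value,omitempty"`
}

// Prompt mirrors one entry of the "prompts" array.
type Prompt struct {
	ID        string  `json:"id"`
	Prompt    string  `json:"prompt"`
	MaxTokens int     `json:"max_tokens"`
	Checks    []Check `json:"checks"`
}

// Suite mirrors the top-level config object.
type Suite struct {
	Name    string   `json:"name"`
	System  string   `json:"system"`
	Prompts []Prompt `json:"prompts"`
}

// parseSuite decodes a config and applies a basic sanity check.
// Hypothetical helper: the required-field rule is an assumption.
func parseSuite(data []byte) (Suite, error) {
	var s Suite
	if err := json.Unmarshal(data, &s); err != nil {
		return Suite{}, fmt.Errorf("parse domain config: %w", err)
	}
	for i, p := range s.Prompts {
		if p.ID == "" || p.Prompt == "" {
			return Suite{}, fmt.Errorf("prompt %d: id and prompt must be set", i)
		}
	}
	return s, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: go run . <config.json>")
		os.Exit(2)
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	s, err := parseSuite(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d prompt(s)\n", s.Name, len(s.Prompts))
}
```

Running it against prompts/domain.example.json before a long benchmark is a cheap way to catch a malformed config early.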

Available check types: contains, not_contains, max_words, min_words, json_valid, starts_with, regex
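To make the check types concrete, the following sketch shows one plausible way each could be evaluated against a model response. It is not the bench-domain implementation: runCheck is a hypothetical helper, and details such as counting words by whitespace splitting, case-sensitive matching, and treating regex as an unanchored search are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Check mirrors one entry of the "checks" array in the config.
type Check struct {
	Type  string `json:"type"`
	Value string `json:"value,omitempty"`
}

// runCheck reports whether response satisfies c.
// Hypothetical helper; exact semantics in the real tool may differ.
func runCheck(c Check, response string) (bool, error) {
	switch c.Type {
	case "contains":
		return strings.Contains(response, c.Value), nil
	case "not_contains":
		return !strings.Contains(response, c.Value), nil
	case "starts_with":
		return strings.HasPrefix(response, c.Value), nil
	case "max_words", "min_words":
		n, err := strconv.Atoi(c.Value)
		if err != nil {
			return false, fmt.Errorf("%s: value %q is not an integer: %w", c.Type, c.Value, err)
		}
		// Assumption: a word is any whitespace-separated token.
		words := len(strings.Fields(response))
		if c.Type == "max_words" {
			return words <= n, nil
		}
		return words >= n, nil
	case "json_valid":
		return json.Valid([]byte(response)), nil
	case "regex":
		re, err := regexp.Compile(c.Value)
		if err != nil {
			return false, fmt.Errorf("regex: %w", err)
		}
		return re.MatchString(response), nil
	default:
		return false, fmt.Errorf("unknown check type %q", c.Type)
	}
}

func main() {
	// Hypothetical model response, checked against three of the example checks.
	response := "func Parse(s string) (int, error) { return 0, nil }"
	checks := []Check{
		{Type: "contains", Value: "error"},
		{Type: "max_words", Value: "300"},
		{Type: "regex", Value: `func \w+`},
	}
	for _, c := range checks {
		ok, err := runCheck(c, response)
		fmt.Println(c.Type, ok, err)
	}
}
```

Note that json_valid is strict in this sketch: a response that wraps JSON in prose or a code fence would fail it.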

Output

Results are written to results/YYYY-MM-DD/<benchmark>.jsonl as newline-delimited JSON, one object per prompt/run. The files are safe to append to across multiple runs.

# View today's quality results
grep '"type":"summary"' results/$(date +%F)/bench-quality.jsonl
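The same filter can be done in Go when the results feed into further processing. This is a sketch under assumptions: only the "type" field and its "summary" value come from the command above, the summaries helper is hypothetical, and records are decoded into generic maps because the full record schema is not described here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

// summaries decodes newline-delimited JSON and returns the records whose
// "type" field equals "summary". Hypothetical helper.
func summaries(data []byte) ([]map[string]any, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	var out []map[string]any
	for {
		var rec map[string]any
		err := dec.Decode(&rec)
		if errors.Is(err, io.EOF) {
			return out, nil // clean end of input
		}
		if err != nil {
			return nil, fmt.Errorf("decode record: %w", err)
		}
		if rec["type"] == "summary" {
			out = append(out, rec)
		}
	}
}

func main() {
	// Default to today's quality results; a path argument overrides it.
	path := filepath.Join("results", time.Now().Format("2006-01-02"), "bench-quality.jsonl")
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	recs, err := summaries(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for _, rec := range recs {
		fmt.Println(rec)
	}
}
```

Because the files are appended to across runs, a day's file may hold several summary records for the same model; the sketch returns all of them in file order.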

Memory management

By default, each script unloads the model from memory when it finishes (Ollama only, via keep_alive: 0). Pass -no-unload to keep the model loaded between consecutive runs of the same model.

On a machine with 16 GB of RAM, only one large model fits in memory at a time. Benchmark one model at a time; the unload step ensures the next model has full memory available.
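For reference, the Ollama API unloads a model when it receives a generate request that names the model, carries no prompt, and sets keep_alive to 0. The sketch below issues that request by hand, which can be useful after a run that used -no-unload. The function names are hypothetical, and whether the scripts use /api/generate or another endpoint for this step is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// unloadBody builds the JSON payload: a model name with keep_alive set to 0
// and no prompt, which asks Ollama to evict the model immediately.
func unloadBody(model string) ([]byte, error) {
	return json.Marshal(map[string]any{"model": model, "keep_alive": 0})
}

// unloadModel posts the payload to the server's /api/generate endpoint.
// Hypothetical helper, not the tool's own code.
func unloadModel(client *http.Client, baseURL, model string) error {
	body, err := unloadBody(model)
	if err != nil {
		return err
	}
	resp, err := client.Post(baseURL+"/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unload %s: unexpected status %s", model, resp.Status)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: go run . <model>")
		os.Exit(2)
	}
	client := &http.Client{Timeout: 30 * time.Second}
	// Default Ollama base URL, as listed under the -ollama-url flag.
	if err := unloadModel(client, "http://localhost:11434", os.Args[1]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("unloaded", os.Args[1])
}
```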

Project structure

scripts/
  bench-speed/     Speed benchmarks (local harnesses)
  bench-quality/   Quality benchmarks (all providers)
  bench-domain/    Custom domain benchmarks (all providers)
shared/
  provider/        Provider interface + Ollama, oMLX, Anthropic, OpenAI implementations
  output/          JSONL result writer
prompts/
  domain.example.json  Example domain config
results/           Benchmark output (gitignored)

See METRICS.md for metric definitions. See CHOICES.md for design decisions.
