shiva/ai-tester

ai-tester

Reusable benchmarking scripts for comparing LLM inference harnesses and models. Written in Go with no external dependencies.

Supported providers: Ollama, oMLX, Anthropic, OpenAI


Prerequisites

  • Go 1.22+
  • At least one provider running (Ollama on :11434, oMLX on :8080, or API keys set)

Build

make build
# produces bin/bench-speed, bin/bench-quality, bin/bench-domain

Scripts

bench-speed

Compare harness performance for the same model. Local providers only (ollama, omlx).

./bin/bench-speed -provider ollama -model gemma4:latest
./bin/bench-speed -provider omlx   -model mlx-community/gemma-3-4b-4bit -runs 5

Runs three built-in prompts (short / medium / long) and reports TTFT, TPS, latency, and stddev across runs. See METRICS.md for definitions.

Flags:

-provider      ollama|omlx (default: ollama)
-model         model name (required)
-runs          runs per prompt for variance (default: 3)
-timeout       per-request timeout (default: 10m)
-ollama-url    Ollama base URL (default: http://localhost:11434)
-omlx-url      oMLX base URL (default: http://localhost:8080)
-results       output directory (default: results)
-no-unload     skip evicting model from memory after run

bench-quality

Evaluate output quality across any provider. Runs factual Q&A, format compliance, and reasoning prompts. Produces an overall_score.

./bin/bench-quality -provider ollama     -model gemma4:latest
./bin/bench-quality -provider anthropic  -model claude-haiku-4-5-20251001
./bin/bench-quality -provider openai     -model gpt-4o-mini -v
./bin/bench-quality -provider ollama     -model gemma4:latest -consistency -consistency-runs 3

Flags:

-provider          ollama|omlx|anthropic|openai (default: ollama)
-model             model name (required)
-timeout           per-request timeout (default: 10m)
-consistency       run consistency test on factual prompts
-consistency-runs  runs per prompt for consistency test (default: 3)
-v                 print model responses and failures
-ollama-url        Ollama base URL (default: http://localhost:11434)
-omlx-url          oMLX base URL (default: http://localhost:8080)
-results           output directory (default: results)
-no-unload         skip evicting model from memory after run

Environment variables:

ANTHROPIC_API_KEY   required when -provider anthropic
OPENAI_API_KEY      required when -provider openai
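A pre-flight check along these lines can confirm the right key is present before starting a long run. This is a minimal sketch, not the tool's own code: `requiredKey` and `checkAPIKey` are hypothetical names, and only the two variable names come from the list above.

```go
package main

import (
	"fmt"
	"os"
)

// requiredKey maps a provider name to the environment variable listed above.
// Local providers (ollama, omlx) need no key and are absent from the map.
var requiredKey = map[string]string{
	"anthropic": "ANTHROPIC_API_KEY",
	"openai":    "OPENAI_API_KEY",
}

// checkAPIKey returns an error when the provider needs a key that getenv
// cannot supply. getenv is injectable so the check can be exercised
// without touching the real environment.
func checkAPIKey(provider string, getenv func(string) string) error {
	name, ok := requiredKey[provider]
	if !ok {
		return nil
	}
	if getenv(name) == "" {
		return fmt.Errorf("%s must be set when -provider %s", name, provider)
	}
	return nil
}

func main() {
	for _, p := range []string{"ollama", "anthropic", "openai"} {
		if err := checkAPIKey(p, os.Getenv); err != nil {
			fmt.Println("not ready:", err)
			continue
		}
		fmt.Println("ready:", p)
	}
}
```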

bench-domain

Test a model against your own prompts and pass/fail checks. Define tasks in a JSON config file.

./bin/bench-domain -config prompts/domain.example.json -provider ollama -model gemma4:latest
./bin/bench-domain -config prompts/my-tasks.json       -provider anthropic -model claude-haiku-4-5-20251001 -v

Flags:

-config        path to domain JSON config (required)
-provider      ollama|omlx|anthropic|openai (default: ollama)
-model         model name (required)
-timeout       per-request timeout (default: 10m)
-v             print model responses and check failures
-ollama-url    Ollama base URL (default: http://localhost:11434)
-omlx-url      oMLX base URL (default: http://localhost:8080)
-results       output directory (default: results)
-no-unload     skip evicting model from memory after run

Config format — see prompts/domain.example.json for a full example:

{
  "name": "My suite",
  "system": "Optional global system prompt",
  "prompts": [
    {
      "id": "my-task",
      "prompt": "Write a Go function that...",
      "max_tokens": 400,
      "checks": [
        { "type": "contains",  "value": "error" },
        { "type": "max_words", "value": "300"   },
        { "type": "json_valid"                  },
        { "type": "regex",     "value": "func \\w+" }
      ]
    }
  ]
}

Available check types: contains, not_contains, max_words, min_words, json_valid, starts_with, regex
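To make the check types concrete, here is one way they could be evaluated against a model response. The semantics are assumptions inferred from the names (for example, that max_words counts whitespace-separated words and that starts_with ignores leading whitespace); the `Check` struct and `runCheck` function are hypothetical and may not match the real implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// Check mirrors one entry of the "checks" array in the config.
// "value" is a string even for numeric checks, as in the example above.
type Check struct {
	Type  string `json:"type"`
	Value string `json:"value,omitempty"`
}

// runCheck reports whether output satisfies c. The behaviour of each
// case is an assumption based on the check's name.
func runCheck(c Check, output string) (bool, error) {
	switch c.Type {
	case "contains":
		return strings.Contains(output, c.Value), nil
	case "not_contains":
		return !strings.Contains(output, c.Value), nil
	case "starts_with":
		return strings.HasPrefix(strings.TrimSpace(output), c.Value), nil
	case "max_words", "min_words":
		n, err := strconv.Atoi(c.Value)
		if err != nil {
			return false, fmt.Errorf("%s: value %q is not an integer", c.Type, c.Value)
		}
		words := len(strings.Fields(output))
		if c.Type == "max_words" {
			return words <= n, nil
		}
		return words >= n, nil
	case "json_valid":
		return json.Valid([]byte(output)), nil
	case "regex":
		re, err := regexp.Compile(c.Value)
		if err != nil {
			return false, err
		}
		return re.MatchString(output), nil
	default:
		return false, fmt.Errorf("unknown check type %q", c.Type)
	}
}

func main() {
	// The same four checks as in the config example, decoded from JSON.
	raw := `[{"type":"contains","value":"error"},{"type":"max_words","value":"300"},
	         {"type":"json_valid"},{"type":"regex","value":"func \\w+"}]`
	var checks []Check
	if err := json.Unmarshal([]byte(raw), &checks); err != nil {
		fmt.Println("bad config:", err)
		return
	}
	output := "func parse(s string) error { return nil }"
	for _, c := range checks {
		ok, err := runCheck(c, output)
		fmt.Printf("%-10s pass=%v err=%v\n", c.Type, ok, err)
	}
}
```

Note that json_valid and regex on the same prompt are unlikely to pass together unless the model is asked to return code inside a JSON string, so real suites would usually pick checks that fit one output shape.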


Output

Results are written to results/YYYY-MM-DD/<benchmark>.jsonl as newline-delimited JSON, one object per prompt/run. Files are safe to append to across multiple runs.

# View today's quality results (summary records only)
grep '"type":"summary"' "results/$(date +%F)/bench-quality.jsonl"

Memory management

By default, each script unloads the model from memory when it finishes (Ollama only, via keep_alive: 0). Pass -no-unload to keep the model loaded between runs of the same model.

On a machine with 16 GB of RAM, only one large model fits in memory at a time. Benchmark one model at a time; the unload step ensures the next model has full memory available.


Project structure

scripts/
  bench-speed/     Speed benchmarks (local harnesses)
  bench-quality/   Quality benchmarks (all providers)
  bench-domain/    Custom domain benchmarks (all providers)
shared/
  provider/        Provider interface + Ollama, oMLX, Anthropic, OpenAI implementations
  output/          JSONL result writer
prompts/
  domain.example.json  Example domain config
results/           Benchmark output (gitignored)

See METRICS.md for metric definitions. See CHOICES.md for design decisions.
