Releases · vitaliikapliuk/modelharness

modelharness distils Fable 5's documented working practices into a zero-config
Claude Code plugin, and measures exactly what that buys across four Claude models.

Benchmark

17 agentic-coding tasks × 8 configurations × 3 reps = 408 runs. Hidden binary grading, no LLM judge, a reference solution per task.
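As a rough illustration of where the 408 comes from, the run matrix can be enumerated directly. This is a sketch under assumptions: the release notes do not say how the 8 configurations break down, so the split into four models × plugin off/on is a guess, and the task IDs are hypothetical placeholders.

```python
from itertools import product

# Model names are taken from the release notes.
MODELS = ["Opus 4.8", "Fable 5", "Sonnet 4.6", "Haiku 4.5"]
# ASSUMPTION: 8 configurations = 4 models x plugin off/on. Not stated in the source.
PLUGIN_ENABLED = [False, True]
# Hypothetical task identifiers; the real task names are not listed here.
TASKS = [f"task-{i:02d}" for i in range(1, 18)]  # 17 tasks
REPS = range(1, 4)  # 3 repetitions


def run_matrix():
    """Return every (task, model, plugin_enabled, rep) combination."""
    return list(product(TASKS, MODELS, PLUGIN_ENABLED, REPS))


if __name__ == "__main__":
    runs = run_matrix()
    print(len(runs))  # 17 * 8 * 3
```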

What the data supports (paired per-task, 95% CI — `bench/stats.py`)

Opus 4.8 — cost −12.0% [−17.3, −6.7] and time −16.5% [−25.3, −7.7]: statistically significant. The flagship subscription model gets a real win.
Fable 5 — significant speed-up (−11.4%), even against the model these patterns came from.
Sonnet 4.6 / Haiku 4.5 — cost and time within run-to-run noise: not a reliable saving, but never a reliable loss. Haiku additionally goes 98% → 100% pass.
Quality is exact, not sampled: 407/408 runs passed.
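For readers unfamiliar with paired per-task intervals, here is a minimal sketch of one common way to compute them. It is not the contents of `bench/stats.py`, which are not shown in these notes. It assumes a Student's t interval on per-task percent changes, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical value of Student's t for 16 degrees of freedom
# (17 paired tasks minus 1). ASSUMPTION: the real script may use another method,
# and a different task count needs a different critical value.
T_CRIT_DF16 = 2.120


def paired_pct_change_ci(baseline, treated, t_crit=T_CRIT_DF16):
    """Return (mean, low, high) of the per-task percent change, treated vs baseline.

    baseline[i] and treated[i] must describe the same task (for example the
    mean cost over its repetitions without and with the plugin).
    """
    if len(baseline) != len(treated) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 tasks")
    # Pairing: compare each task only with itself, so task difficulty cancels out.
    deltas = [100.0 * (t - b) / b for b, t in zip(baseline, treated)]
    centre = mean(deltas)
    half_width = t_crit * stdev(deltas) / sqrt(len(deltas))
    return centre, centre - half_width, centre + half_width


def is_significant(low, high):
    """An interval that excludes zero is read as a statistically significant change."""
    return low > 0 or high < 0
```

Read this way, an interval such as [−17.3, −6.7] lies entirely below zero and counts as a significant saving, while an interval that straddles zero is what "within run-to-run noise" describes.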

Install

/plugin marketplace add vitaliikapliuk/modelharness
/plugin install modelharness@modelharness

Full details are in the README, version history is in CHANGELOG.md, and grading corrections are in bench/GRADING.md.

v0.3.0 — first public release
