vitaliikapliuk/modelharness

modelharness

Make every model cheaper or better. Measured on four Claude models — none got worse.

benchmark: 408 runs · 100% pass | 4/4 models improved · none worse | license: MIT

What it actually is: a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:

  • Grounded progress: only claims backed by a tool result; "tests fail" said plainly
  • Act, don't overplan: enough information means act, not narrate options
  • 🎯 Autonomy calibration: decides minor things itself, asks only on scope or destructive actions
  • 🔍 Self-verification loops: a checkable definition of done, real checks on a cadence, a fresh-context verifier before "done"
  • 🔀 Delegation triggers: explicit rules for when to fan work out to subagents
  • 📝 Cross-session memory: writes lessons and plans to files, so the next session can pick up the work

What the 17 tasks test

🐛 Bug hunts 4 tasks · 96 runs
Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover
Features from spec 4 tasks · 96 runs
Build to a written spec: retry backoff, config merging, cursor pagination, slugify
♻️ Refactors 2 tasks · 48 runs
Restructure code with zero behavior change, verified structurally
🧠 Long-horizon builds 2 tasks · 48 runs
Multi-stage pipelines where later steps depend on earlier decisions
🧩 Spec-dense traps 3 tasks · 72 runs
18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading
🔁 Session handoffs 2 tasks · 48 runs
A fresh session must finish another session's work — memory is the only bridge

17 tasks × 3 attempts × 8 configurations = 408 runs. Grading is hidden and binary: test suites the agent never sees decide pass/fail. No LLM judge. Every task ships with a reference solution proving it solvable.
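The run counts above can be re-derived from the per-category task counts. The sketch below is only a sanity check of that arithmetic; the reading of "8 configurations" as four models, each run plain and with modelharness, is an inference from the results table, and the variable names are illustrative rather than taken from the repository.

```python
# Sanity check of the benchmark arithmetic quoted above.
ATTEMPTS = 3
CONFIGURATIONS = 8  # assumption: 4 models x (plain, + modelharness)

# Task counts per category, as listed under "What the 17 tasks test".
TASKS_PER_CATEGORY = {
    "bug hunts": 4,
    "features from spec": 4,
    "refactors": 2,
    "long-horizon builds": 2,
    "spec-dense traps": 3,
    "session handoffs": 2,
}


def runs_for(tasks: int) -> int:
    """Total runs for a category: tasks x attempts x configurations."""
    return tasks * ATTEMPTS * CONFIGURATIONS


runs = {name: runs_for(n) for name, n in TASKS_PER_CATEGORY.items()}
total_tasks = sum(TASKS_PER_CATEGORY.values())
total_runs = sum(runs.values())
runs_per_configuration = total_tasks * ATTEMPTS

print(f"tasks: {total_tasks}, runs: {total_runs}")
print(f"runs per configuration: {runs_per_configuration}")
# One failed run in a single configuration is what a 98% pass rate looks like.
print(f"pass rate with one failure: {100 * (runs_per_configuration - 1) / runs_per_configuration:.0f}%")
```

With 51 runs per configuration, a single failed run is enough to move a configuration from 100% to 98%.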

The results

[Figure: Cost per task with and without modelharness]

The numbers, exactly

Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).

| What we measured | Fable 5 (plain → + modelharness) | Opus 4.8 ⭐ biggest gain (plain → + modelharness) | Sonnet 4.6 (plain → + modelharness) | Haiku 4.5 (plain → + modelharness) |
|---|---|---|---|---|
| Tasks completed successfully | 100% → 100% | 100% → 100% | 100% → 100% | 🔴 98% → 🟢 100% |
| Average cost per task | $1.80 → 🟢 $1.73 | $0.89 → 🟢 $0.77 | $0.41 → 🟢 $0.40 | $0.24 → $0.24 |
| · bug hunts | $1.30 → 🟢 $1.26 | $0.63 → 🟢 $0.55 | $0.26 → 🟢 $0.24 | $0.16 → 🟢 $0.13 |
| · features from spec | $1.44 → 🔴 $1.49 | $0.76 → 🟢 $0.60 | $0.34 → 🟢 $0.32 | $0.18 → 🟢 $0.16 |
| · refactors | $0.91 → $0.91 | $0.51 → $0.51 | $0.21 → 🔴 $0.22 | $0.11 → $0.11 |
| · long-horizon | $1.90 → 🟢 $1.28 | $0.71 → 🟢 $0.60 | $0.35 → 🟢 $0.26 | $0.13 → 🔴 $0.14 |
| · spec-dense traps | $2.13 → 🔴 $2.21 | $1.13 → 🟢 $1.02 | $0.65 → 🔴 $0.74 | $0.40 → 🔴 $0.50 |
| · session handoffs | $3.80 → 🟢 $3.74 | $1.92 → 🟢 $1.59 | $0.79 → 🟢 $0.66 | $0.52 → 🟢 $0.48 |
| Average time per task, seconds | 130 → 🟢 118 | 114 → 🟢 96 | 123 → 🟢 118 | 104 → 🟢 96 |
| What the harness improved, on average | 🟢 3.5% cheaper, 9% faster on average, even against the model these patterns came from. Pays a little extra on spec-dense tasks as verification insurance; wins it back big on long-horizon builds (−33%). | 🟢 14% cheaper, 16% faster on average: the biggest win of all four. Cheaper or equal in every single category; nothing traded away. | 🟢 4% cheaper and 4% faster on average. Pays +14% on spec-dense tasks for the same verification insurance, repaid by −26% on long-horizon and −17% on handoffs. | 🟢 98% → 100% tasks solved. The extra spend on spec-dense tasks (+25%) is the self-checking that caught and fixed its own mistakes, and it still finished the benchmark 7.5% faster at the same average price. |
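The percentage figures in the last row are relative changes from the plain-model cost to the + modelharness cost. A minimal sketch of that calculation follows, using dollar values copied from the table; `pct_change` is an illustrative helper, not a function from the repository. Because the table rounds costs to the cent, recomputed percentages can differ from the quoted ones by about a point.

```python
def pct_change(plain: float, harness: float) -> float:
    """Relative change in percent; negative means cheaper with the harness."""
    return 100.0 * (harness - plain) / plain


# (plain, + modelharness) cost per task, copied from the table above.
examples = {
    "Fable 5, long-horizon": (1.90, 1.28),         # quoted as -33%
    "Sonnet 4.6, spec-dense traps": (0.65, 0.74),  # quoted as +14%
    "Sonnet 4.6, long-horizon": (0.35, 0.26),      # quoted as -26%
    "Haiku 4.5, spec-dense traps": (0.40, 0.50),   # quoted as +25%
}

for label, (plain, harness) in examples.items():
    print(f"{label}: {pct_change(plain, harness):+.1f}%")
```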

The bottom line. modelharness packages the same working practices Fable 5 was trained on. The practices land hardest on Opus 4.8, the flagship model available on every subscription: −14% cost / −16% time on the benchmark averages, and the per-task paired test below confirms that win as statistically significant (−12.0% cost, −16.5% time as means of per-task deltas). Even Fable 5, competing against itself, runs significantly faster. On smaller models the average hides a trade: Haiku saves up to 19% on routine bugfixes but spends more on spec-dense tasks, extra verification work that is exactly what lifted its pass rate from 98% to 100%. Cheaper where it can be, more careful where it must be, and never significantly worse on any model.

How confident are we?

Averages can hide noise, so we ran the honest test: pair each model's plain vs +modelharness runs on the same task (3 reps averaged), take the per-task percentage delta, and put a 95% confidence interval around the mean across all 17 tasks. A CI that clears zero is a real effect; one that straddles zero is within run-to-run noise. Regenerate with python3 bench/stats.py.

| Model | Cost Δ (95% CI) | Time Δ (95% CI) | Tasks cheaper |
|---|---|---|---|
| Opus 4.8 | −12.0% [−17.3, −6.7] · significant | −16.5% [−25.3, −7.7] · significant | 15 / 17 |
| Fable 5 | −3.2% [−10.6, +4.2] · within noise | −11.4% [−20.1, −2.8] · significant | 8 / 17 |
| Sonnet 4.6 | −4.0% [−11.3, +3.3] · within noise | −7.8% [−15.7, +0.0] · within noise | 10 / 17 |
| Haiku 4.5 | +0.3% [−8.7, +9.3] · within noise | −4.5% [−17.6, +8.6] · within noise | 9 / 17 |

What this means, stated plainly: the harness delivers a statistically significant cost-and-time reduction on Opus 4.8 — the model most people run on a subscription — and a significant speed-up on Fable 5. For Sonnet 4.6 and Haiku 4.5 the cost and time changes are within noise: not a reliable saving, but never a reliable loss either. Quality is not a sampled average — it is an exact binary count: 407 of 408 runs passed, and the one failure (bare Haiku 4.5 on a session-handoff task) is fixed 3/3 by the harness. So the defensible claim is narrow and true: Opus gets meaningfully cheaper and faster, every model gets a memory-driven reliability floor, and none is significantly worse.
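The paired test described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of bench/stats.py: it assumes a Student's t interval on the 17 per-task percentage deltas (the real script may use a different interval, such as a bootstrap), and the function names and the sample numbers in the usage lines are made up for the example.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 16 degrees of freedom
# (17 tasks - 1). Hard-coded so the sketch needs no SciPy; pass a different
# value if the number of tasks changes.
T_CRIT_DF16 = 2.120


def paired_delta_ci(plain, harness, t_crit=T_CRIT_DF16):
    """Mean per-task percentage delta and its confidence interval.

    `plain` and `harness` hold one value per task (cost or time, already
    averaged over the reps), in the same task order.
    """
    if len(plain) != len(harness) or len(plain) < 2:
        raise ValueError("need two equally long lists with at least 2 tasks")
    deltas = [100.0 * (h - p) / p for p, h in zip(plain, harness)]
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


def verdict(lo: float, hi: float) -> str:
    """An interval that excludes zero counts as a real effect."""
    return "significant" if (hi < 0 or lo > 0) else "within noise"


# Intervals copied from the table above.
print(verdict(-17.3, -6.7))  # Opus 4.8 cost
print(verdict(-15.7, 0.0))   # Sonnet 4.6 time: touches zero, so not significant

# Made-up per-task costs for 17 tasks, only to show the call shape.
plain_costs = [1.00] * 17
harness_costs = [0.85, 0.95] * 8 + [0.90]
mean, lo, hi = paired_delta_ci(plain_costs, harness_costs)
print(f"{mean:+.1f}% [{lo:+.1f}, {hi:+.1f}] · {verdict(lo, hi)}")
```

Pairing by task removes between-task variation (a session handoff costs far more than a refactor), so the interval reflects only how consistently the harness shifts each task.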

⚡ Install — 30 seconds, zero config

/plugin marketplace add vitaliikapliuk/modelharness
/plugin install modelharness@modelharness

Restart Claude Code — active in every session, on whatever model you run.

Why this exists

Claude Fable 5 left subscription plans on June 23, 2026 — the most capable model became API-only, and most subscribers went back to Opus, Sonnet, or Haiku. That raised a question worth measuring rather than debating: how much of a frontier model's edge is weights, and how much is working practices — the documented behaviors like grounded progress reporting, self-verification, and file-based memory that Anthropic describes in its own migration guides?

So we distilled those practices into a plugin and built a benchmark to find out. The answer surprised us in both directions: on self-contained coding tasks the practices made every model cheaper or better — including Fable 5 itself — while raw correctness at benchmark scale turned out not to separate the models at all. The harness, not the weights, was the measurable difference.

How it works

A SessionStart hook injects a behavioral core (≈910 tokens: your entire context tax, measured, not estimated) implementing six patterns drawn from Anthropic's official Fable 5 migration guide and its Opus 4.8 notes:

| Pattern | Source |
|---|---|
| Grounded progress claims | Fable 5 migration guide → "Ground progress claims on long runs" |
| Act, don't overplan | Fable 5 migration guide → "Longer turns by default" |
| Autonomy calibration | Opus 4.8 notes → "More deliberate — asks more often" |
| Self-verification loops | Fable 5 guide → "Make self-verification explicit" |
| Delegation triggers | Opus 4.8 notes → "Under-utilization of subagents" |
| Memory surface | Fable 5 guide → "Give it a memory surface" |

Plus three on-demand skills (verification-loop, memory-discipline, delegation-triggers), a fresh-context verifier agent, and three optional power-user commands (/modelharness:goal, /modelharness:verify, /modelharness:retro).

The hook only appends context — it never intercepts or blocks anything. Tested alongside superpowers.
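To make "only appends context" concrete, here is a minimal sketch of what an append-only SessionStart hook script could look like. It is not the plugin's actual implementation. The output shape (`hookSpecificOutput` with `additionalContext`) reflects our understanding of the Claude Code hooks documentation and should be checked against the current docs; the `core.md` filename and the function names are hypothetical.

```python
import json
import sys
from pathlib import Path

# Hypothetical location of the behavioral core text; the real plugin's
# file layout may differ.
CORE_PATH = Path("core.md")


def build_session_start_payload(core_text: str) -> dict:
    """Wrap text as additional context for a SessionStart hook.

    The payload only adds context; it carries no decision or blocking
    fields. The key names are an assumption based on the Claude Code
    hooks documentation.
    """
    return {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": core_text,
        }
    }


def main() -> None:
    # Fall back to empty context if the core file is missing, so the hook
    # never fails a session start.
    core_text = CORE_PATH.read_text(encoding="utf-8") if CORE_PATH.exists() else ""
    json.dump(build_session_start_payload(core_text), sys.stdout)


if __name__ == "__main__":
    main()
```

Because the script writes one JSON object to stdout and nothing else, it can sit next to other plugins' hooks without affecting what they emit.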

What this can NOT do

  • Raise raw reasoning ability or one-shot intelligence on hard problems.
  • Reproduce Fable 5's tokenizer or always-on protected thinking.
  • Separate the top models on correctness at benchmark scale: every configuration with modelharness scored 100%. Real differences in multi-hour messy sessions exist but are unmeasured here — those are Anthropic's documented claims, not our data.

Grading integrity: every failure was hand-audited; two grader fixes were made during capture (from-import delegation; __all__ dunder exemption), both in the models' favor — each one documented with its diff and rationale in bench/GRADING.md.

Reproduce it

bench/run.sh --config bare --reps 3        # any of 8 configs
python3 bench/report.py                    # category table
python3 bench/lift.py                      # per-model harness lift
python3 bench/stats.py                     # paired per-task deltas with 95% confidence intervals
python3 bench/chart.py                     # regenerate the hero chart from the CSV

Full 8-config capture measured ≈ $330 API-equivalent (per-config costs in bench/README.md). Hidden binary grading; bench/scripts/selfcheck.sh --all proves every task fails untouched and passes on its reference solution.

Contributing

The most valuable PR: a task where a bare model demonstrably fails and modelharness passes. The two-phase session-handoff format is in bench/TASK_FORMAT.md. See CONTRIBUTING.md.

License

MIT
