Make every model cheaper or better. Measured on four Claude models: none got significantly worse.
What it actually is: a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:
| Practice | What changes |
|---|---|
| ✅ Grounded progress | Only claims backed by a tool result; "tests fail" is said plainly |
| ⚡ Act, don't overplan | Enough information means act, not narrate options |
| 🎯 Autonomy calibration | Decides minor things itself, asks only on scope or destructive actions |
| 🔍 Self-verification loops | A checkable definition of done, real checks on a cadence, a fresh-context verifier before "done" |
| 🔀 Delegation triggers | Explicit rules for when to fan work out to subagents |
| 📝 Cross-session memory | Writes lessons and plans to files, so the next session can pick up the work |
The benchmark spans six task categories:

| Category | Tasks · runs | What it tests |
|---|---|---|
| 🐛 Bug hunts | 4 tasks · 96 runs | Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover |
| ✨ Features from spec | 4 tasks · 96 runs | Build to a written spec: retry backoff, config merging, cursor pagination, slugify |
| ♻️ Refactors | 2 tasks · 48 runs | Restructure code with zero behavior change, verified structurally |
| 🧠 Long-horizon builds | 2 tasks · 48 runs | Multi-stage pipelines where later steps depend on earlier decisions |
| 🧩 Spec-dense traps | 3 tasks · 72 runs | 18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading |
| 🔁 Session handoffs | 2 tasks · 48 runs | A fresh session must finish another session's work; memory is the only bridge |
17 tasks × 3 attempts × 8 configurations = 408 runs. Grading is hidden and binary: test suites the agent never sees decide pass/fail. No LLM judge. Every task ships with a reference solution proving it solvable.
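
The run counts above follow directly from the task counts per category. As a quick sanity check, the arithmetic can be sketched as below; the dictionary and helper names are illustrative, and reading the 8 configurations as 4 models × (plain, + modelharness) is an inference from the results table rather than something spelled out here.

```python
# Task counts per category, as listed in the benchmark overview.
TASKS_PER_CATEGORY = {
    "bug hunts": 4,
    "features from spec": 4,
    "refactors": 2,
    "long-horizon builds": 2,
    "spec-dense traps": 3,
    "session handoffs": 2,
}
ATTEMPTS = 3        # attempts per task per configuration
CONFIGURATIONS = 8  # assumed: 4 models x (plain, + modelharness)


def runs_per_category(tasks: int) -> int:
    """Number of benchmark runs contributed by a category with `tasks` tasks."""
    return tasks * ATTEMPTS * CONFIGURATIONS


total_tasks = sum(TASKS_PER_CATEGORY.values())
total_runs = sum(runs_per_category(n) for n in TASKS_PER_CATEGORY.values())

for name, n in TASKS_PER_CATEGORY.items():
    print(f"{name}: {n} tasks, {runs_per_category(n)} runs")
print(f"total: {total_tasks} tasks, {total_runs} runs")
```

The per-category figures sum to the 17 tasks and 408 runs quoted above.
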
Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).
| What we measured | Fable 5, plain | Fable 5 + modelharness | Opus 4.8 ⭐ biggest gain, plain | Opus 4.8 + modelharness | Sonnet 4.6, plain | Sonnet 4.6 + modelharness | Haiku 4.5, plain | Haiku 4.5 + modelharness |
|---|---|---|---|---|---|---|---|---|
| Tasks completed successfully | 100% | 100% | 100% | 100% | 100% | 100% | 🔴 98% | 🟢 100% |
| Average cost per task | $1.80 | 🟢 $1.73 | $0.89 | 🟢 $0.77 | $0.41 | 🟢 $0.40 | $0.24 | $0.24 |
| · bug hunts | $1.30 | 🟢 $1.26 | $0.63 | 🟢 $0.55 | $0.26 | 🟢 $0.24 | $0.16 | 🟢 $0.13 |
| · features from spec | $1.44 | 🔴 $1.49 | $0.76 | 🟢 $0.60 | $0.34 | 🟢 $0.32 | $0.18 | 🟢 $0.16 |
| · refactors | $0.91 | $0.91 | $0.51 | $0.51 | $0.21 | 🔴 $0.22 | $0.11 | $0.11 |
| · long-horizon | $1.90 | 🟢 $1.28 | $0.71 | 🟢 $0.60 | $0.35 | 🟢 $0.26 | $0.13 | 🔴 $0.14 |
| · spec-dense traps | $2.13 | 🔴 $2.21 | $1.13 | 🟢 $1.02 | $0.65 | 🔴 $0.74 | $0.40 | 🔴 $0.50 |
| · session handoffs | $3.80 | 🟢 $3.74 | $1.92 | 🟢 $1.59 | $0.79 | 🟢 $0.66 | $0.52 | 🟢 $0.48 |
| Average time per task, seconds | 130 | 🟢 118 | 114 | 🟢 96 | 123 | 🟢 118 | 104 | 🟢 96 |
| What the harness improved, on average | | 🟢 3.5% cheaper, 9% faster, even against the model these patterns came from. Pays a little extra on spec-dense tasks as verification insurance and wins it back on long-horizon builds (−33%). | | 🟢 14% cheaper, 16% faster, the biggest win of all four. Cheaper or equal in every category; nothing traded away. | | 🟢 4% cheaper and 4% faster. Pays +14% on spec-dense tasks for the same verification insurance, repaid by −26% on long-horizon and −17% on handoffs. | | 🟢 98% → 100% tasks solved. The extra spend on spec-dense tasks (+25%) is the self-checking that caught and fixed its own mistakes, and it still finished the benchmark 7.5% faster at the same average price. |
The bottom line. modelharness packages the same working practices Fable 5 was trained on. The practices land hardest on Opus 4.8 — the flagship model available on every subscription — at −14% cost / −16% time, and that win is statistically significant (see below). Even Fable 5, competing against itself, runs significantly faster. On smaller models the average hides a trade: Haiku saves up to 19% on routine bugfixes but spends more on spec-dense tasks — extra verification work that is exactly what lifted its pass rate from 98% to 100%. Cheaper where it can be, more careful where it must be — and never significantly worse on any model.
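
The category percentages quoted in the summary are relative changes between the plain and + modelharness columns. A minimal sketch of that arithmetic, using dollar figures copied from the results table (the function name is illustrative, and the table values are rounded to cents, so some headline percentages will only be approximately reproducible from them):

```python
def pct_change(plain: float, harness: float) -> float:
    """Relative change of the harness figure versus the plain figure, in percent."""
    return (harness - plain) / plain * 100.0


# (plain, + modelharness) average cost per task in USD, from the results table.
EXAMPLES = {
    "Fable 5, long-horizon": (1.90, 1.28),
    "Sonnet 4.6, spec-dense traps": (0.65, 0.74),
    "Sonnet 4.6, long-horizon": (0.35, 0.26),
    "Haiku 4.5, spec-dense traps": (0.40, 0.50),
}

for label, (plain, harness) in EXAMPLES.items():
    # Negative means cheaper with the harness, positive means more expensive.
    print(f"{label}: {pct_change(plain, harness):+.0f}%")
```

Rounded to whole percentages, these four pairs give −33%, +14%, −26%, and +25%, matching the figures cited in the summary row.
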
Averages can hide noise, so we ran the honest test: pair each model's plain vs +modelharness runs on the same task (3 reps averaged), take the per-task percentage delta, and put a 95% confidence interval around the mean across all 17 tasks. A CI that clears zero is a real effect; one that straddles zero is within run-to-run noise. Regenerate with `python3 bench/stats.py`.
| Model | Cost Δ (95% CI) | Time Δ (95% CI) | Tasks cheaper |
|---|---|---|---|
| Opus 4.8 | −12.0% [−17.3, −6.7] · significant | −16.5% [−25.3, −7.7] · significant | 15 / 17 |
| Fable 5 | −3.2% [−10.6, +4.2] · within noise | −11.4% [−20.1, −2.8] · significant | 8 / 17 |
| Sonnet 4.6 | −4.0% [−11.3, +3.3] · within noise | −7.8% [−15.7, +0.0] · within noise | 10 / 17 |
| Haiku 4.5 | +0.3% [−8.7, +9.3] · within noise | −4.5% [−17.6, +8.6] · within noise | 9 / 17 |
What this means, stated plainly: the harness delivers a statistically significant cost-and-time reduction on Opus 4.8 — the model most people run on a subscription — and a significant speed-up on Fable 5. For Sonnet 4.6 and Haiku 4.5 the cost and time changes are within noise: not a reliable saving, but never a reliable loss either. Quality is not a sampled average — it is an exact binary count: 407 of 408 runs passed, and the one failure (bare Haiku 4.5 on a session-handoff task) is fixed 3/3 by the harness. So the defensible claim is narrow and true: Opus gets meaningfully cheaper and faster, every model gets a memory-driven reliability floor, and none is significantly worse.
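
For readers who want to see the shape of the paired test described above, here is a minimal sketch. It is not the contents of `bench/stats.py`: the function names are hypothetical, the interval is a Student's t interval (an assumption, since the method for building the CI is not stated), and the 17 deltas are made-up illustration values, not benchmark data.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 16 degrees of freedom (17 tasks).
T_CRIT_DF16 = 2.120


def per_task_delta_pct(plain_runs, harness_runs):
    """Average the repetitions of one task, then return the percentage change."""
    plain = statistics.mean(plain_runs)
    harness = statistics.mean(harness_runs)
    return (harness - plain) / plain * 100.0


def mean_ci(deltas, t_crit):
    """Mean of the per-task deltas with a t-based confidence interval."""
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


def verdict(lo, hi):
    """An interval that excludes zero is 'significant'; otherwise 'within noise'."""
    return "significant" if lo > 0 or hi < 0 else "within noise"


# Hypothetical per-task cost deltas in percent for 17 tasks (illustration only).
deltas = [-20, -15, -10, -18, -5, -12, 2, -9, -14, -7, -16, -11, 3, -13, -8, -6, -19]
mean, lo, hi = mean_ci(deltas, T_CRIT_DF16)
print(f"{mean:+.1f}% [{lo:+.1f}, {hi:+.1f}] · {verdict(lo, hi)}")
```

Pairing on the task removes between-task variation (a session handoff costs far more than a refactor on every model), so the interval reflects only how consistently the harness shifts each task.
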
/plugin marketplace add vitaliikapliuk/modelharness
/plugin install modelharness@modelharness
Restart Claude Code — active in every session, on whatever model you run.
Claude Fable 5 left subscription plans on June 23, 2026 — the most capable model became API-only, and most subscribers went back to Opus, Sonnet, or Haiku. That raised a question worth measuring rather than debating: how much of a frontier model's edge is weights, and how much is working practices — the documented behaviors like grounded progress reporting, self-verification, and file-based memory that Anthropic describes in its own migration guides?
So we distilled those practices into a plugin and built a benchmark to find out. The answer surprised us in both directions: on self-contained coding tasks the practices made every model cheaper or better — including Fable 5 itself — while raw correctness at benchmark scale turned out not to separate the models at all. The harness, not the weights, was the measurable difference.
A SessionStart hook injects a behavioral core (≈910 tokens: your entire context tax, measured rather than estimated) implementing six patterns drawn from Anthropic's official Fable 5 migration guide and Opus 4.8 notes:
| Pattern | Source |
|---|---|
| Grounded progress claims | Fable 5 migration guide → "Ground progress claims on long runs" |
| Act, don't overplan | Fable 5 migration guide → "Longer turns by default" |
| Autonomy calibration | Opus 4.8 notes → "More deliberate — asks more often" |
| Self-verification loops | Fable 5 migration guide → "Make self-verification explicit" |
| Delegation triggers | Opus 4.8 notes → "Under-utilization of subagents" |
| Memory surface | Fable 5 migration guide → "Give it a memory surface" |
Plus three on-demand skills (verification-loop, memory-discipline, delegation-triggers), a fresh-context verifier agent, and three optional power-user commands (/modelharness:goal, /modelharness:verify, /modelharness:retro).
The hook only appends context — it never intercepts or blocks anything. Tested alongside superpowers.
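
To make "only appends context" concrete, here is a rough sketch of what an append-only SessionStart hook script can look like. This is not the plugin's actual hook: the function name and placeholder text are hypothetical, and the JSON shape (`hookSpecificOutput` with `additionalContext`) is assumed from the Claude Code hooks documentation, so check it against the current docs before relying on it.

```python
import json


def session_start_output(core_text: str) -> str:
    """Build the JSON a SessionStart hook prints to add text to the session context.

    The payload shape is an assumption based on the Claude Code hooks docs.
    It carries context only: no decision, no block, no tool interception.
    """
    payload = {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": core_text,
        }
    }
    return json.dumps(payload)


if __name__ == "__main__":
    # Placeholder standing in for the behavioral core text.
    print(session_start_output("<behavioral core goes here>"))
```

Because the script emits context and nothing else, it has no way to veto a tool call or rewrite a prompt, which is the property that lets it coexist with other plugins.
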
modelharness does not:
- Raise raw reasoning ability or one-shot intelligence on hard problems.
- Reproduce Fable 5's tokenizer or always-on protected thinking.
- Separate the top models on correctness at benchmark scale: every configuration with modelharness scored 100%. Real differences in multi-hour messy sessions exist but are unmeasured here; those are Anthropic's documented claims, not our data.
Grading integrity: every failure was hand-audited. Two grader fixes were made during capture (from-import delegation; `__all__` dunder exemption), both in the models' favor, each documented with its diff and rationale in `bench/GRADING.md`.
bench/run.sh --config bare --reps 3 # any of 8 configs
python3 bench/report.py # category table
python3 bench/lift.py # per-model harness lift
python3 bench/stats.py # paired per-task deltas with 95% confidence intervals
python3 bench/chart.py # regenerate the hero chart from the CSV
Full 8-config capture measured ≈ $330 API-equivalent (per-config costs in `bench/README.md`). Grading is hidden and binary; `bench/scripts/selfcheck.sh --all` proves every task fails untouched and passes on its reference solution.
The most valuable PR: a task where a bare model demonstrably fails and modelharness passes. The two-phase session-handoff format is in bench/TASK_FORMAT.md. See CONTRIBUTING.md.
MIT