modelharness

Make every model cheaper or better. Measured on four Claude models — none got worse.

What it actually is: a zero-config Claude Code plugin. On every session start it injects a ≈910-token behavioral core — six working practices distilled from how Fable 5 was trained to operate — plus three on-demand skills and a fresh-context verifier agent. No commands to learn; the model just starts working differently:


✅ Grounded progress: only claims backed by a tool result; "tests fail" said plainly
⚡ Act, don't overplan: enough information means act, not narrate options
🎯 Autonomy calibration: decides minor things itself, asks only on scope or destructive actions
🔍 Self-verification loops: a checkable definition of done, real checks on a cadence, a fresh-context verifier before "done"
🔀 Delegation triggers: explicit rules for when to fan work out to subagents
📝 Cross-session memory: writes lessons and plans to files, so the next session can pick up the work

What the 17 tasks test


🐛 Bug hunts (4 tasks · 96 runs). Find and fix planted defects: TTL cache, CSV quoting, rate limiter, date rollover
✨ Features from spec (4 tasks · 96 runs). Build to a written spec: retry backoff, config merging, cursor pagination, slugify
♻️ Refactors (2 tasks · 48 runs). Restructure code with zero behavior change, verified structurally
🧠 Long-horizon builds (2 tasks · 48 runs). Multi-stage pipelines where later steps depend on earlier decisions
🧩 Spec-dense traps (3 tasks · 72 runs). 18+ interacting rules (discount engine, mini-interpreter) that punish shallow reading
🔁 Session handoffs (2 tasks · 48 runs). A fresh session must finish another session's work; memory is the only bridge

17 tasks × 3 attempts × 8 configurations = 408 runs. Grading is hidden and binary: test suites the agent never sees decide pass/fail. No LLM judge. Every task ships with a reference solution proving it solvable.
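The run counts above follow directly from the benchmark design. A minimal sketch of that arithmetic, using the per-category task counts listed earlier (the function and variable names are illustrative, not taken from the repository):

```python
# Task counts per category, as listed in "What the 17 tasks test".
TASKS_PER_CATEGORY = {
    "bug hunts": 4,
    "features from spec": 4,
    "refactors": 2,
    "long-horizon builds": 2,
    "spec-dense traps": 3,
    "session handoffs": 2,
}
ATTEMPTS = 3
CONFIGURATIONS = 8  # four models, each run plain and with modelharness


def runs_per_category(tasks_per_category, attempts, configurations):
    """Number of graded runs each category contributes to the benchmark."""
    return {
        name: count * attempts * configurations
        for name, count in tasks_per_category.items()
    }


runs = runs_per_category(TASKS_PER_CATEGORY, ATTEMPTS, CONFIGURATIONS)
total_tasks = sum(TASKS_PER_CATEGORY.values())
total_runs = sum(runs.values())
print(total_tasks, total_runs)  # 17 408
```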

The results

The numbers, exactly

Same 17 tasks, 3 runs per configuration. Higher pass rate and lower cost/time are better. 🟢 = better with modelharness, 🔴 = worse (explained in the last row).

What we measured	Fable 5 (plain model)	Fable 5 (+ modelharness)	Opus 4.8 ⭐ biggest gain (plain model)	Opus 4.8 (+ modelharness)	Sonnet 4.6 (plain model)	Sonnet 4.6 (+ modelharness)	Haiku 4.5 (plain model)	Haiku 4.5 (+ modelharness)
Tasks completed successfully	100%	100%	100%	100%	100%	100%	🔴 98%	🟢 100%
Average cost per task	$1.80	🟢 $1.73	$0.89	🟢 $0.77	$0.41	🟢 $0.40	$0.24	$0.24
· bug hunts	$1.30	🟢 $1.26	$0.63	🟢 $0.55	$0.26	🟢 $0.24	$0.16	🟢 $0.13
· features from spec	$1.44	🔴 $1.49	$0.76	🟢 $0.60	$0.34	🟢 $0.32	$0.18	🟢 $0.16
· refactors	$0.91	$0.91	$0.51	$0.51	$0.21	🔴 $0.22	$0.11	$0.11
· long-horizon	$1.90	🟢 $1.28	$0.71	🟢 $0.60	$0.35	🟢 $0.26	$0.13	🔴 $0.14
· spec-dense traps	$2.13	🔴 $2.21	$1.13	🟢 $1.02	$0.65	🔴 $0.74	$0.40	🔴 $0.50
· session handoffs	$3.80	🟢 $3.74	$1.92	🟢 $1.59	$0.79	🟢 $0.66	$0.52	🟢 $0.48
Average time per task, seconds	130	🟢 118	114	🟢 96	123	🟢 118	104	🟢 96
What the harness improved, on average	🟢 3.5% cheaper, 9% faster on average, even against the model these patterns came from. Pays a little extra on spec-dense tasks as verification insurance; wins it back big on long-horizon builds (−33%).		🟢 14% cheaper, 16% faster on average, the biggest win of all four. Cheaper or equal in every single category; nothing traded away.		🟢 4% cheaper and 4% faster on average. Pays +14% on spec-dense tasks for the same verification insurance, repaid by −26% on long-horizon and −17% on handoffs.		🟢 98% → 100% tasks solved. The extra spend on spec-dense tasks (+25%) is the self-checking that caught and fixed its own mistakes, and it still finished the benchmark 7.5% faster at the same average price.
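The per-category percentages quoted in the last row can be checked against the dollar figures in the table. A small sketch of that arithmetic (the helper name `pct_change` is illustrative); because the table values are rounded, the headline averages may have been computed from unrounded data and need not reproduce exactly:

```python
def pct_change(plain, harness):
    """Relative change of the harness figure versus the plain-model figure, in percent."""
    return (harness - plain) / plain * 100.0


# Average cost per task in USD, copied from the table above.
fable_long_horizon = pct_change(1.90, 1.28)
sonnet_spec_dense = pct_change(0.65, 0.74)
haiku_spec_dense = pct_change(0.40, 0.50)

print(round(fable_long_horizon))  # -33
print(round(sonnet_spec_dense))   # 14
print(round(haiku_spec_dense))    # 25
```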

The bottom line. modelharness packages the same working practices Fable 5 was trained on. The practices land hardest on Opus 4.8 — the flagship model available on every subscription — at −14% cost / −16% time, and that win is statistically significant (see below). Even Fable 5, competing against itself, runs significantly faster. On smaller models the average hides a trade: Haiku saves up to 19% on routine bugfixes but spends more on spec-dense tasks — extra verification work that is exactly what lifted its pass rate from 98% to 100%. Cheaper where it can be, more careful where it must be — and never significantly worse on any model.

How confident are we?

Averages can hide noise, so we ran a paired test: for each model, pair the plain and +modelharness runs on the same task (3 reps averaged), take the per-task percentage delta, and put a 95% confidence interval around the mean across all 17 tasks. A CI that excludes zero indicates a statistically significant effect; one that straddles zero is within run-to-run noise. Regenerate with python3 bench/stats.py.

Model	Cost Δ (95% CI)	Time Δ (95% CI)	Tasks cheaper
Opus 4.8	−12.0% [−17.3, −6.7] · significant	−16.5% [−25.3, −7.7] · significant	15 / 17
Fable 5	−3.2% [−10.6, +4.2] · within noise	−11.4% [−20.1, −2.8] · significant	8 / 17
Sonnet 4.6	−4.0% [−11.3, +3.3] · within noise	−7.8% [−15.7, +0.0] · within noise	10 / 17
Haiku 4.5	+0.3% [−8.7, +9.3] · within noise	−4.5% [−17.6, +8.6] · within noise	9 / 17

What this means, stated plainly: the harness delivers a statistically significant cost-and-time reduction on Opus 4.8 — the model most people run on a subscription — and a significant speed-up on Fable 5. For Sonnet 4.6 and Haiku 4.5 the cost and time changes are within noise: not a reliable saving, but never a reliable loss either. Quality is not a sampled average — it is an exact binary count: 407 of 408 runs passed, and the one failure (bare Haiku 4.5 on a session-handoff task) is fixed 3/3 by the harness. So the defensible claim is narrow and true: Opus gets meaningfully cheaper and faster, every model gets a memory-driven reliability floor, and none is significantly worse.
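The paired comparison behind the table can be sketched as follows. This is a minimal illustration, not the contents of bench/stats.py: it assumes a Student's t interval (the source does not say which interval is used), and the function name and input layout are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 16 degrees of freedom,
# which matches 17 tasks. Assumption: a t-interval; pass a different t_crit
# if the number of tasks changes.
T_CRIT_DF16 = 2.120


def paired_delta_ci(plain, harness, t_crit=T_CRIT_DF16):
    """Mean per-task percentage delta and its confidence interval.

    plain and harness map task name -> list of per-rep measurements
    (cost or seconds) for the same model without and with the harness.
    """
    deltas = []
    for task in sorted(plain):
        base = statistics.mean(plain[task])       # average the reps first
        treated = statistics.mean(harness[task])
        deltas.append((treated - base) / base * 100.0)

    mean = statistics.mean(deltas)
    half_width = t_crit * statistics.stdev(deltas) / math.sqrt(len(deltas))
    low, high = mean - half_width, mean + half_width
    significant = high < 0 or low > 0             # interval excludes zero
    return mean, low, high, significant
```

Pairing on the task removes between-task variation (a session handoff costs far more than a refactor), so the interval reflects only how consistently the harness shifts each task.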

⚡ Install — 30 seconds, zero config

/plugin marketplace add vitaliikapliuk/modelharness
/plugin install modelharness@modelharness

Restart Claude Code — active in every session, on whatever model you run.

Why this exists

Claude Fable 5 left subscription plans on June 23, 2026 — the most capable model became API-only, and most subscribers went back to Opus, Sonnet, or Haiku. That raised a question worth measuring rather than debating: how much of a frontier model's edge is weights, and how much is working practices — the documented behaviors like grounded progress reporting, self-verification, and file-based memory that Anthropic describes in its own migration guides?

So we distilled those practices into a plugin and built a benchmark to find out. The answer surprised us in both directions: on self-contained coding tasks the practices made every model cheaper or better — including Fable 5 itself — while raw correctness at benchmark scale turned out not to separate the models at all. The harness, not the weights, was the measurable difference.

How it works

A SessionStart hook injects a behavioral core (≈910 tokens; that is your entire context tax, measured rather than estimated) implementing six patterns drawn from Anthropic's Fable 5 migration guide and Opus 4.8 notes:

Pattern	Source
Grounded progress claims	Fable 5 migration guide → "Ground progress claims on long runs"
Act, don't overplan	Fable 5 migration guide → "Longer turns by default"
Autonomy calibration	Opus 4.8 notes → "More deliberate — asks more often"
Self-verification loops	Fable 5 migration guide → "Make self-verification explicit"
Delegation triggers	Opus 4.8 notes → "Under-utilization of subagents"
Memory surface	Fable 5 migration guide → "Give it a memory surface"

Plus three on-demand skills (verification-loop, memory-discipline, delegation-triggers), a fresh-context verifier agent, and three optional power-user commands (/modelharness:goal, /modelharness:verify, /modelharness:retro).

The hook only appends context — it never intercepts or blocks anything. Tested alongside superpowers.
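For readers unfamiliar with context-only hooks, here is a minimal sketch of what such a SessionStart hook script could look like. It assumes the JSON output shape described in the Claude Code hooks documentation (`hookSpecificOutput` with `additionalContext`); the function name and the placeholder core text are hypothetical, and the plugin's actual hook may be implemented differently.

```python
import json


def build_session_start_output(core_text):
    """Build the JSON a SessionStart hook prints to append context.

    The payload carries only additional context; it contains no field
    that blocks or rewrites anything in the session.
    """
    return {
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": core_text,
        }
    }


if __name__ == "__main__":
    # Hypothetical stand-in for the behavioral core shipped with the plugin.
    core = "Ground every progress claim in a tool result."
    print(json.dumps(build_session_start_output(core)))
```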

What this can NOT do

Raise raw reasoning ability or one-shot intelligence on hard problems.
Reproduce Fable 5's tokenizer or always-on protected thinking.
Separate the top models on correctness at benchmark scale: every configuration with modelharness scored 100%. Real differences in multi-hour messy sessions exist but are unmeasured here — those are Anthropic's documented claims, not our data.

Grading integrity: every failure was hand-audited; two grader fixes were made during capture (from-import delegation; __all__ dunder exemption), both in the models' favor — each one documented with its diff and rationale in bench/GRADING.md.

Reproduce it

bench/run.sh --config bare --reps 3        # any of 8 configs
python3 bench/report.py                    # category table
python3 bench/lift.py                      # per-model harness lift
python3 bench/stats.py                     # paired per-task deltas with 95% confidence intervals
python3 bench/chart.py                     # regenerate the hero chart from the CSV

Full 8-config capture measured ≈ $330 API-equivalent (per-config costs in bench/README.md). Hidden binary grading; bench/scripts/selfcheck.sh --all proves every task fails untouched and passes on its reference solution.
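The self-check property stated above (every task fails untouched and passes on its reference solution) can be expressed in a few lines. This is a sketch of the idea only, not the logic of bench/scripts/selfcheck.sh; the function names, the `run_hidden_tests` callable, and the directory layout are all hypothetical.

```python
def selfcheck_task(run_hidden_tests, untouched_dir, reference_dir):
    """True only if the hidden suite fails on the untouched starter code
    and passes on the reference solution."""
    fails_untouched = not run_hidden_tests(untouched_dir)
    passes_reference = bool(run_hidden_tests(reference_dir))
    return fails_untouched and passes_reference


def selfcheck_all(tasks, run_hidden_tests):
    """tasks maps task name -> (untouched_dir, reference_dir).

    Returns the names of tasks that violate the property, sorted by name.
    """
    return [
        name
        for name, (untouched, reference) in sorted(tasks.items())
        if not selfcheck_task(run_hidden_tests, untouched, reference)
    ]
```

A task that already passes untouched measures nothing, and one that fails on its reference is unsolvable, so both cases are reported as violations.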

Contributing

The most valuable PR: a task where a bare model demonstrably fails and modelharness passes. The two-phase session-handoff format is in bench/TASK_FORMAT.md. See CONTRIBUTING.md.

License

MIT
