Outer-loop measurement for LLM-powered scoring policies. Two-level autoresearch pattern from arXiv 2605.30003.
Shipped 2026-05-29 by Workloft. Full write-up:
- Ship article: https://workloft.ai/ships/walt-weight-loop-2026-05-29.html
- Research Note: https://workloft.ai/labs/notes/two-level-loop-2026-05-29.html
You have a policy generator (a prompt, a scorer, a routing rule). Its outputs get filed somewhere downstream (tickets, todos, transactions). Each downstream item has an outcome (shipped, succeeded, killed).
This script answers: "are my policy's outputs predicting the outcomes I care about, on a per-category basis?"
For each axis the policy scores against, it reports:
- n_picks: how many policy outputs landed in that axis
- mean_score: the policy's average confidence for that axis
- moved / killed / open: downstream outcomes
- conversion = moved / n_picks
- axis_health = conversion / (mean_score / 10); 1.0 = perfectly calibrated
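To make the arithmetic concrete, here is a minimal sketch of the per-axis computation. It is not the shipped script: the function name `axis_report`, the `(axis, score, outcome)` tuple shape, and the `scale` parameter are assumptions made for illustration, and it assumes scores are on a 0-10 scale, as the division by 10 suggests.

```python
from collections import defaultdict


def axis_report(picks, scale=10.0):
    """Compute per-axis stats from (axis, score, outcome) tuples.

    Hypothetical input shape: outcome must be "moved", "killed", or "open".
    """
    buckets = defaultdict(list)
    for axis, score, outcome in picks:
        buckets[axis].append((score, outcome))

    report = {}
    for axis, rows in buckets.items():
        n_picks = len(rows)
        mean_score = sum(score for score, _ in rows) / n_picks

        counts = {"moved": 0, "killed": 0, "open": 0}
        for _, outcome in rows:
            counts[outcome] += 1  # raises KeyError on an unknown label

        conversion = counts["moved"] / n_picks
        expected = mean_score / scale  # the rate the policy's confidence implies
        # Health is undefined when the policy's mean score is zero.
        axis_health = conversion / expected if expected > 0 else None

        report[axis] = {
            "n_picks": n_picks,
            "mean_score": mean_score,
            **counts,
            "conversion": conversion,
            "axis_health": axis_health,
        }
    return report
```

Under these assumptions, an axis with four picks scored 8, 6, 7, 7, of which two moved, has mean_score 7.0 and conversion 0.5, so axis_health is 0.5 / 0.7, roughly 0.71. A value below 1.0 means the policy is more confident on that axis than its outcomes justify; a value above 1.0 means it is underconfident.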
Outer loop measures. Inner loop tunes. The framework says outer goes first. Run this before you re-prompt anything.
The Workloft version reads:
- /home/workloft/walt/data/hf-papers/hf-YYYY-MM-DD.top.json: Walt's daily HF paper picks
- gary_todos Supabase table: Gary's todo outcomes (status, stage, title)
Swap those two reads for your own policy outputs and your own outcome store and the rest of the script applies unchanged.
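One way to picture the swap is as two small reader functions plus a join. The sketch below is illustrative only: the function names, the `id` / `axis` / `score` / `status` field names, and the choice to treat a pick with no recorded outcome as "open" are all assumptions, not the Workloft schema.

```python
import json


def load_picks(raw_json):
    """Hypothetical pick reader: a JSON list of {"id", "axis", "score"} objects."""
    return json.loads(raw_json)


def load_outcomes(rows):
    """Hypothetical outcome reader: maps item id -> outcome label."""
    return {row["id"]: row["status"] for row in rows}


def join_picks_to_outcomes(picks, outcomes, default="open"):
    """Pair each pick with its downstream outcome.

    Assumption: a pick with no recorded outcome counts as still open.
    """
    return [
        (pick["axis"], pick["score"], outcomes.get(pick["id"], default))
        for pick in picks
    ]
```

Replacing the bodies of the two readers with your own file, database, or API reads leaves the join untouched, and the joined records are what the per-axis metrics are computed from.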
    python3 -m weight_loop --days 30 [--json]

Reports land in ./reports/walt-axis-health-<timestamp>.txt (or .json with --json).
- Not RLHF. No model is retrained.
- Not the inner loop. No prompts are tuned.
- Not a substitute for human axis choice. The axes are still chosen by you.
It is the layer between "policy exists" and "policy is the right one." Most production stacks skip it.
MIT. See LICENSE.