Auto-generate evaluation rubrics from agent audit-log trajectories.
Pattern lifted from PhoneWorld (arXiv:2605.29486): real trajectories yield both controllable environments AND auto-generated verifiers. This module applies that to agent action logs.
Shipped 2026-05-29 by Workloft. Full write-up:
- Ship article: https://workloft.ai/ships/auto-rubrics-2026-05-29.html
- Research Note: https://workloft.ai/labs/notes/trajectories-write-tests-2026-05-29.html
Given an audit log of agent actions (each row carrying agent, action, tool, arguments, response, success), this script:
- Pulls the last N hours of rows.
- Clusters by (agent, action) with a minimum sample count.
- For each cluster, samples representative trajectories evenly across the time window.
- Asks an LLM to derive a Vera-shaped criteria string per cluster ("A PASS means... KILL on..." structure, specific and falsifiable).
- Persists the rubric to disk as JSON.
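The clustering and sampling steps above can be sketched as follows. This is a minimal illustration, not the shipped implementation: the function names, the `created_at` timestamp field, and the default of five samples per cluster are assumptions, and "evenly across the time window" is approximated here by picking rows at evenly spaced ranks after sorting by timestamp.

```python
from collections import defaultdict
from typing import Any

Row = dict[str, Any]
ClusterKey = tuple[str, str]


def cluster_rows(rows: list[Row], min_samples: int = 3) -> dict[ClusterKey, list[Row]]:
    """Group audit rows by (agent, action) and drop clusters that are too small."""
    clusters: dict[ClusterKey, list[Row]] = defaultdict(list)
    for row in rows:
        clusters[(row["agent"], row["action"])].append(row)
    return {key: members for key, members in clusters.items() if len(members) >= min_samples}


def sample_evenly(rows: list[Row], k: int = 5) -> list[Row]:
    """Pick up to k rows spread across the window (by timestamp rank).

    Assumes each row has a sortable `created_at` value (hypothetical field name).
    """
    ordered = sorted(rows, key=lambda r: r["created_at"])
    if len(ordered) <= k:
        return ordered
    if k == 1:
        return [ordered[0]]
    # Evenly spaced indices from the first row to the last one.
    step = (len(ordered) - 1) / (k - 1)
    return [ordered[round(i * step)] for i in range(k)]
```

Sampling by rank keeps bursts of activity from crowding out quiet periods less well than bucketing by wall-clock time would; if your logs are bursty, bucketing by hour may be the better choice.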
Future calls to your evaluator (in Workloft's case, vera.poll.evaluate) can
load the rubric for a given (agent, action) cluster and pass it as the
criteria argument, so no one has to hand-write one.
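Loading a stored rubric might look like the sketch below. The file layout follows the ./rubrics/{agent}__{action}.json naming this README describes, but the JSON schema is an assumption: the code expects a top-level `criteria` key, and the helper names `rubric_path` and `load_criteria` are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def rubric_path(agent: str, action: str, root: Path = Path("rubrics")) -> Path:
    """Build the on-disk location for one (agent, action) cluster."""
    return root / f"{agent}__{action}.json"


def load_criteria(agent: str, action: str, root: Path = Path("rubrics")) -> Optional[str]:
    """Return the stored criteria string, or None if no rubric exists yet.

    Assumes the rubric JSON carries a top-level "criteria" key (not confirmed
    by the source).
    """
    path = rubric_path(agent, action, root)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8")).get("criteria")
```

A caller would then hand the returned string to its evaluator as the criteria argument, falling back to a hand-written rubric when the result is None.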
The Workloft version of this module imports:
- ruby: Workloft's model router (categories, tiers, providers)
- workloft_audit_log: Supabase table (PostgREST endpoint)
Both are documented in the Workloft Labs Notes corpus. For an
independent reimplementation, swap ruby.chat() for any LLM client and
the audit log fetch for your event store. The clustering + sampling +
prompt template are the load-bearing parts.
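For a reimplementation, one way to decouple the prompt from the model router is to inject the LLM call as a plain function. The sketch below is illustrative only: the template wording is a hypothetical stand-in for the real prompt (which is not reproduced here), and `build_prompt`, `derive_rubric`, and the output keys are assumed names. Any callable that maps a prompt string to a completion string can take the place of ruby.chat().

```python
import json
from typing import Any, Callable

# Hypothetical template; the shipped prompt is not shown in this README.
PROMPT_TEMPLATE = """You are writing an evaluation rubric for the agent action below.
Agent: {agent}
Action: {action}

Sampled trajectories (JSON, one per line):
{samples}

Write one criteria string of the form "A PASS means ... KILL on ...".
Be specific and falsifiable; rely only on behaviour visible in the samples."""


def build_prompt(agent: str, action: str, samples: list[dict[str, Any]]) -> str:
    """Render the sampled rows into the prompt, one JSON object per line."""
    lines = "\n".join(json.dumps(s, sort_keys=True, default=str) for s in samples)
    return PROMPT_TEMPLATE.format(agent=agent, action=action, samples=lines)


def derive_rubric(
    agent: str,
    action: str,
    samples: list[dict[str, Any]],
    chat: Callable[[str], str],
) -> dict[str, Any]:
    """Ask the injected LLM client for a criteria string and wrap it as a rubric."""
    criteria = chat(build_prompt(agent, action, samples)).strip()
    return {"agent": agent, "action": action, "criteria": criteria, "n_samples": len(samples)}
```

Passing `chat` in as an argument also makes the step easy to test with a stub that returns a fixed string.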
python3 -m rubric_gen \
--lookback-hours 24 \
--min-samples 3 \
--max-clusters 20 \
[--dry-run]

Outputs land in ./rubrics/{agent}__{action}.json.
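A reimplementation could wire up the same flags with argparse, roughly as below. Several things here are assumptions rather than documented behaviour: the defaults simply mirror the example invocation, `--dry-run` is taken to mean "compute the output path but write nothing", and `parse_args` and `write_rubric` are hypothetical helper names.

```python
import argparse
import json
from pathlib import Path
from typing import Any, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse the four flags shown in the example command."""
    parser = argparse.ArgumentParser(prog="rubric_gen")
    # Defaults copy the example invocation; the real defaults are not documented.
    parser.add_argument("--lookback-hours", type=int, default=24)
    parser.add_argument("--min-samples", type=int, default=3)
    parser.add_argument("--max-clusters", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    return parser.parse_args(argv)


def write_rubric(rubric: dict[str, Any], out_dir: Path = Path("rubrics"), dry_run: bool = False) -> Path:
    """Write one rubric to {out_dir}/{agent}__{action}.json unless dry_run is set."""
    path = out_dir / f"{rubric['agent']}__{rubric['action']}.json"
    if not dry_run:  # Assumed semantics: a dry run skips all disk writes.
        out_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(rubric, indent=2), encoding="utf-8")
    return path
```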
Cost is about $0.001 per rubric with DeepSeek v4 Flash via OpenRouter (the
reason_med/balanced tier in our model catalogue). A nine-cluster steady
state comes to ≈ $0.30/month. Adjust --max-clusters to cap per-run spend.
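The monthly figure can be sanity-checked with a short calculation. The run schedule is not stated here, so the one-run-per-day default below is an assumption; under it, nine clusters come to about $0.27 over 30 days, which is in line with the ≈ $0.30 quoted. `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(
    clusters: int,
    cost_per_rubric: float = 0.001,  # USD per rubric, the figure quoted above
    runs_per_day: float = 1.0,       # assumed schedule, not stated in the source
    days: int = 30,
) -> float:
    """Estimate monthly spend in USD when every cluster is regenerated each run."""
    return clusters * cost_per_rubric * runs_per_day * days


steady_state = monthly_cost(9)   # nine clusters, one run per day
worst_case = monthly_cost(20)    # every run hits a --max-clusters cap of 20
```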
MIT. See LICENSE.