auto-rubrics

Auto-generate evaluation rubrics from agent audit-log trajectories.

Pattern lifted from PhoneWorld (arXiv:2605.29486): real trajectories yield both controllable environments AND auto-generated verifiers. This module applies that to agent action logs.

Shipped 2026-05-29 by Workloft. Full write-up:

Ship article: https://workloft.ai/ships/auto-rubrics-2026-05-29.html
Research Note: https://workloft.ai/labs/notes/trajectories-write-tests-2026-05-29.html

What it does

Given an audit log of agent actions (each row carrying agent, action, tool, arguments, response, success), this script:

1. Pulls the last N hours of rows.
2. Clusters by (agent, action) with a minimum sample count.
3. For each cluster, samples representative trajectories evenly across the time window.
4. Asks an LLM to derive a Vera-shaped criteria string per cluster ("A PASS means... KILL on..." structure, specific and falsifiable).
5. Persists the rubric to disk as JSON.
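
The filtering, clustering, and sampling steps can be sketched as below. This is an illustrative sketch, not the code in rubric_gen.py: the function names, the `ts` timestamp field, and the default of 5 samples per cluster are assumptions. Sampling here picks evenly spaced ranks in time order, which only approximates "evenly across the time window" when rows arrive in bursts.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def within_lookback(rows, hours, now=None):
    """Keep rows whose timestamp falls inside the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [r for r in rows if r["ts"] >= cutoff]  # "ts" is an assumed field name


def cluster_rows(rows, min_samples=3):
    """Group rows by (agent, action) and drop clusters below min_samples."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[(row["agent"], row["action"])].append(row)
    return {key: group for key, group in clusters.items() if len(group) >= min_samples}


def sample_evenly(group, k=5):
    """Pick up to k rows at evenly spaced ranks in timestamp order.

    When the cluster has more than k rows and k >= 2, the oldest and the
    newest row are always included.
    """
    ordered = sorted(group, key=lambda r: r["ts"])
    n = len(ordered)
    if n <= k:
        return ordered
    if k == 1:
        return [ordered[n // 2]]
    # Integer arithmetic avoids rounding surprises and guarantees distinct indices.
    return [ordered[(i * (n - 1)) // (k - 1)] for i in range(k)]
```

Sorting before sampling keeps the result deterministic for a given set of rows, so re-running over the same window yields the same samples and therefore the same prompt.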

Future calls to your evaluator (in Workloft's case, vera.poll.evaluate) can load the rubric for a given (agent, action) cluster and pass it as the criteria argument, so nobody has to hand-write one.
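
A minimal sketch of the loading side, assuming the rubric JSON stores its text under a `criteria` key. That key, along with the `rubric_path` and `load_criteria` helper names, is hypothetical; only the `{agent}__{action}.json` file naming comes from this README.

```python
import json
from pathlib import Path


def rubric_path(agent, action, root="rubrics"):
    """Build the on-disk location for one (agent, action) cluster."""
    return Path(root) / f"{agent}__{action}.json"


def load_criteria(agent, action, root="rubrics"):
    """Return the stored criteria string, or None if no rubric exists yet."""
    path = rubric_path(agent, action, root)
    if not path.exists():
        return None
    rubric = json.loads(path.read_text(encoding="utf-8"))
    return rubric.get("criteria")  # assumed key name
```

Returning None for a missing rubric lets the caller fall back to a hand-written criteria string for clusters that have not yet reached the minimum sample count.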

Workloft-specific dependencies

The Workloft version of this module imports:

ruby — Workloft's model router (categories, tiers, providers)
workloft_audit_log Supabase table (PostgREST endpoint)

Both are documented in the Workloft Labs Notes corpus. For an independent reimplementation, swap ruby.chat() for any LLM client and replace the audit-log fetch with a query against your own event store. The clustering, sampling, and prompt template are the load-bearing parts.
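
One way to keep the LLM client swappable is to pass it in as a plain callable that takes a prompt string and returns the model's text. The sketch below assumes that shape; the prompt wording, the function names, and the returned dictionary layout are illustrative and are not the template used in rubric_gen.py.

```python
import json
from typing import Callable, Mapping, Sequence


def build_prompt(agent: str, action: str, samples: Sequence[Mapping]) -> str:
    """Render sampled trajectories into a single rubric-derivation prompt."""
    # default=str keeps non-JSON values such as datetimes from raising.
    examples = "\n".join(json.dumps(s, sort_keys=True, default=str) for s in samples)
    return (
        f"Below are {len(samples)} logged trajectories for agent '{agent}' "
        f"performing action '{action}'.\n\n{examples}\n\n"
        "Write one evaluation criteria string of the form "
        "'A PASS means ... KILL on ...'. Be specific and falsifiable."
    )


def derive_rubric(agent: str, action: str, samples: Sequence[Mapping],
                  chat: Callable[[str], str]) -> dict:
    """Ask any prompt-in, text-out client for a criteria string."""
    criteria = chat(build_prompt(agent, action, samples)).strip()
    return {"agent": agent, "action": action,
            "criteria": criteria, "n_samples": len(samples)}
```

Because `chat` is just a function, a stub such as `lambda prompt: "A PASS means ..."` is enough to exercise the pipeline without network access or spend.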

Run it

python3 -m rubric_gen \
  --lookback-hours 24 \
  --min-samples 3 \
  --max-clusters 20 \
  [--dry-run]

Outputs land in ./rubrics/{agent}__{action}.json.

Cost

~$0.001 per rubric at DeepSeek v4 Flash via OpenRouter (the reason_med/balanced tier in our model catalogue). Nine-cluster steady state ≈ $0.30/month. Adjust --max-clusters to cap per-run spend.
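
The monthly figure is consistent with roughly one run per day, though the run cadence is not stated here and is an assumption in the sketch below. The `monthly_cost` helper is hypothetical and only multiplies out the quoted per-rubric price.

```python
def monthly_cost(clusters: int, runs_per_month: float,
                 cost_per_rubric: float = 0.001) -> float:
    """Estimated spend in dollars, assuming one rubric per cluster per run."""
    return clusters * runs_per_month * cost_per_rubric


# Nine clusters, one run per day (assumed cadence).
print(f"${monthly_cost(9, 30):.2f}")  # prints $0.27

# Upper bound for a single run at --max-clusters 20.
print(f"${monthly_cost(20, 1):.2f}")  # prints $0.02
```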

Licence

MIT. See LICENSE.
