
workloftai/auto-rubrics

auto-rubrics

Auto-generate evaluation rubrics from agent audit-log trajectories.

Pattern lifted from PhoneWorld (arXiv:2605.29486): real trajectories yield both controllable environments AND auto-generated verifiers. This module applies that to agent action logs.

Shipped 2026-05-29 by Workloft. Full write-up:

What it does

Given an audit log of agent actions (each row carrying agent, action, tool, arguments, response, success), this script:
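A minimal sketch of what one audit-log row and the lookback filter might look like. The field names follow the list above; the `created_at` timestamp, the `AuditRow` class, and the `within_lookback` helper are assumptions made for illustration, not the module's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


@dataclass
class AuditRow:
    """One agent action. Field names mirror the README; types are assumed."""
    agent: str
    action: str
    tool: str
    arguments: dict[str, Any]
    response: str
    success: bool
    created_at: datetime  # assumed column, needed to apply a lookback window


def within_lookback(rows: list[AuditRow], hours: int,
                    now: Optional[datetime] = None) -> list[AuditRow]:
    """Keep only rows whose timestamp falls in the last `hours` hours."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=hours)
    return [r for r in rows if r.created_at >= cutoff]
```

In a real deployment the time filter would more likely be pushed down into the event-store query than applied in memory.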

  1. Pulls the last N hours of rows.
  2. Clusters by (agent, action) with a minimum sample count.
  3. For each cluster, samples representative trajectories evenly across the time window.
  4. Asks an LLM to derive a Vera-shaped criteria string per cluster ("A PASS means... KILL on..." structure, specific and falsifiable).
  5. Persists the rubric to disk as JSON.
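Steps 2 and 3 can be sketched as follows. This is an illustration under assumptions, not the shipped code: rows are plain dicts, the timestamp key is a hypothetical `created_at`, and "evenly across the time window" is approximated by picking evenly spaced positions in time order rather than by bucketing on wall-clock time.

```python
from collections import defaultdict


def cluster_rows(rows, min_samples=3):
    """Group rows by (agent, action) and drop clusters below min_samples."""
    clusters = defaultdict(list)
    for row in rows:
        clusters[(row["agent"], row["action"])].append(row)
    return {key: group for key, group in clusters.items()
            if len(group) >= min_samples}


def sample_evenly(rows, k, ts_key="created_at"):
    """Pick up to k rows at evenly spaced positions in time order."""
    ordered = sorted(rows, key=lambda r: r[ts_key])
    n = len(ordered)
    if k <= 0:
        return []
    if n <= k:
        return ordered
    if k == 1:
        return [ordered[n // 2]]
    # Integer arithmetic keeps the first and last rows and never repeats
    # an index, because the spacing (n - 1) / (k - 1) exceeds 1 here.
    return [ordered[(i * (n - 1)) // (k - 1)] for i in range(k)]
```

Spacing by rank keeps the earliest and latest rows in every sample. If activity is bursty, bucketing by timestamp instead would spread samples more faithfully across the window.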

Future calls to your evaluator (in Workloft's case, vera.poll.evaluate) can load the rubric for a given (agent, action) cluster and pass it as the criteria argument, so nobody has to hand-write one.
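A hedged sketch of the loading side. The `load_criteria` helper and the `"criteria"` JSON key are assumptions, since the README does not specify the rubric file's schema; the file name follows the `{agent}__{action}.json` pattern the README gives for outputs.

```python
import json
from pathlib import Path
from typing import Optional


def load_criteria(agent: str, action: str,
                  rubric_dir: str = "rubrics") -> Optional[str]:
    """Return the stored criteria string, or None if no rubric exists."""
    path = Path(rubric_dir) / f"{agent}__{action}.json"
    if not path.exists():
        return None
    data = json.loads(path.read_text(encoding="utf-8"))
    return data.get("criteria")  # assumed key name
```

A caller would then fall back to a hand-written or default criteria string when `load_criteria` returns `None`, and otherwise pass the result to its evaluator.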

Workloft-specific dependencies

The Workloft version of this module imports:

  • ruby — Workloft's model router (categories, tiers, providers)
  • workloft_audit_log Supabase table (PostgREST endpoint)

Both are documented in the Workloft Labs Notes corpus. For an independent reimplementation, swap ruby.chat() for any LLM client and replace the audit log fetch with a query against your own event store. The clustering, sampling, and prompt template are the load-bearing parts.
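One way to keep the LLM client swappable is to pass it in as a plain callable. The sketch below assumes a `chat(prompt) -> str` interface; the prompt wording, `PROMPT_TEMPLATE`, `build_prompt`, and `derive_criteria` are all hypothetical stand-ins, not Workloft's actual template or the real `ruby.chat()` signature.

```python
import json
from typing import Callable, Sequence

# Hypothetical template; the real prompt is not shown in this README.
PROMPT_TEMPLATE = (
    "You are writing an evaluation rubric for agent '{agent}' performing "
    "action '{action}'.\n"
    "Below are {n} sampled trajectories as JSON.\n"
    "Write one criteria string of the form 'A PASS means ... KILL on ...'. "
    "Be specific and falsifiable.\n\n{samples}"
)


def build_prompt(agent: str, action: str, samples: Sequence[dict]) -> str:
    """Render sampled trajectories into a single prompt string."""
    return PROMPT_TEMPLATE.format(
        agent=agent,
        action=action,
        n=len(samples),
        samples=json.dumps(list(samples), indent=2, default=str),
    )


def derive_criteria(agent: str, action: str, samples: Sequence[dict],
                    chat: Callable[[str], str]) -> str:
    """Ask whichever LLM client is injected as `chat` for a criteria string."""
    return chat(build_prompt(agent, action, samples)).strip()
```

Because `chat` is just a function, a stub can stand in for the model during tests, and any provider SDK can be wrapped in a few lines.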

Run it

python3 -m rubric_gen \
  --lookback-hours 24 \
  --min-samples 3 \
  --max-clusters 20 \
  [--dry-run]

Outputs land in ./rubrics/{agent}__{action}.json.
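For a reimplementation, the command line above could be parsed roughly as follows. The flag names come from the README; the defaults simply mirror the example values, and the behaviour of `--dry-run` (skip writing files) and the JSON layout are assumptions.

```python
import argparse
import json
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Flags from the README; defaults are assumed, not confirmed."""
    parser = argparse.ArgumentParser(prog="rubric_gen")
    parser.add_argument("--lookback-hours", type=int, default=24)
    parser.add_argument("--min-samples", type=int, default=3)
    parser.add_argument("--max-clusters", type=int, default=20)
    parser.add_argument("--dry-run", action="store_true")
    return parser


def write_rubric(agent: str, action: str, criteria: str,
                 out_dir: str = "rubrics", dry_run: bool = False) -> Path:
    """Write {agent}__{action}.json unless dry_run is set; return the path."""
    path = Path(out_dir) / f"{agent}__{action}.json"
    if dry_run:
        return path  # assumed semantics: compute everything, write nothing
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"agent": agent, "action": action, "criteria": criteria}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path
```

Agent or action names containing path separators would need sanitising before being used in a file name; that is left out here for brevity.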

Cost

~$0.001 per rubric with DeepSeek v4 Flash via OpenRouter (the reason_med/balanced tier in our model catalogue). A nine-cluster steady state comes to roughly $0.30/month, which is consistent with about one run per day (9 × $0.001 × 30 ≈ $0.27). Adjust --max-clusters to cap per-run spend.
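The monthly figure can be sanity-checked with a few lines of arithmetic. The per-rubric cost comes from the README; the one-run-per-day cadence and the 30-day month are assumptions, and `monthly_cost` is a hypothetical helper.

```python
def monthly_cost(clusters: int, cost_per_rubric: float = 0.001,
                 runs_per_day: float = 1.0, days: int = 30) -> float:
    """Estimated monthly spend: one rubric per cluster per run."""
    return clusters * cost_per_rubric * runs_per_day * days


# Nine clusters at one run per day lands near the quoted $0.30/month.
estimate = round(monthly_cost(9), 2)
```

Under the same assumptions, the worst case for a single run is `--max-clusters` times the per-rubric cost, which is why that flag works as a spend cap.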

Licence

MIT. See LICENSE.
