trajectory-compiler

Turn the audit log of a running multi-agent stack into structured long-context QA pairs you can use for evaluation or supervision.

We run eight agents at Workloft. Every action they take lands in an append-only audit log. After a few months of operations, that log contains thousands of multi-turn trajectories: each one is a small piece of implicit long-context reasoning that nobody is currently learning from.

This tool changes that.

The approach is inspired by the Agent Context Compilation method described in arXiv:2605.21850. The paper trains long-context recall by converting agent trajectories into QA pairs. We do the same thing, but start from a production audit log rather than a synthetic benchmark run.

What it does

Given a Postgres audit log table (we use Supabase, but any compatible schema works), trajectory-compiler:

  1. Extracts trajectories. Groups rows by session_id, orders by time, keeps sessions above a configurable turn threshold.
  2. Compiles QA pairs. Sends each trajectory to a small model (Gemini 2.5 Flash, by default) and asks for 1–5 questions whose answers require combining at least two non-adjacent turns. Returns structured JSON.
  3. Stores them. SQLite by default (./data/qa_pairs.sqlite); optional mirror into a Supabase table for shared eval infrastructure.
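
The extraction step can be sketched in a few lines of Python. This is an illustration, not the code in cli.py: it assumes each audit-log row is a dict with a `session_id` key and a timestamp key called `created_at` (a hypothetical column name), and the function name `extract_trajectories` is ours for the example.

```python
from collections import defaultdict
from typing import Iterable


def extract_trajectories(rows: Iterable[dict], min_turns: int = 10) -> dict[str, list[dict]]:
    """Group audit-log rows by session and keep the sessions that are long enough.

    Assumes every row has `session_id` and `created_at` keys;
    `created_at` is a hypothetical name for the timestamp column.
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row)

    trajectories: dict[str, list[dict]] = {}
    for session_id, turns in sessions.items():
        if len(turns) < min_turns:
            continue  # fewer turns than the threshold: drop the session
        # Order the surviving session's turns by time.
        trajectories[session_id] = sorted(turns, key=lambda r: r["created_at"])
    return trajectories


rows = [
    {"session_id": "a", "created_at": 3, "action": "reply"},
    {"session_id": "a", "created_at": 1, "action": "plan"},
    {"session_id": "b", "created_at": 2, "action": "noop"},
    {"session_id": "a", "created_at": 2, "action": "search"},
]
kept = extract_trajectories(rows, min_turns=3)
print([turn["action"] for turn in kept["a"]])  # ['plan', 'search', 'reply']
```

Session "b" has a single turn, so it falls below the threshold and is dropped.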

A typical run compiles a few dozen trajectories for well under a dollar.

Quick start

git clone https://github.com/workloftai/trajectory-compiler
cd trajectory-compiler
pip install pyyaml          # only dependency beyond stdlib

# Point the extractor at your audit log
export NEXT_PUBLIC_SUPABASE_URL=https://...
export SUPABASE_SERVICE_ROLE_KEY=...

# See what trajectories qualify
python3 cli.py extract --min-turns 10 --limit 20

# Compile the top 25 trajectories with the cheap tier
python3 cli.py compile --min-turns 10 --limit 25 --tier cheap

# Inspect what you got
python3 cli.py stats
python3 cli.py sample --limit 5 --qa-type composition

The default storage is a local SQLite file under data/. If you want to share the dataset across agents, run the migration in migrations/001_qa_pairs.sql against your Supabase project and pass --supabase-mirror.

Why bother

Three reasons we built this for Workloft, in order of how much we care:

  1. A real eval set, made from real work. Every model upgrade we consider (Opus 4.7 → 4.8, Sonnet 4.6 → 4.7, our local Qwen) gets the same question: does it still get the same answers from our own historical trajectories? Public benchmarks do not know how our agents work. Our audit log does.
  2. Cheap supervision for our local model. The pairs are usable as instruction-tuning data for the local Qwen we run on the VPS for sovereign workloads. We do not need to invent training data; our agents are already generating it.
  3. A way to grade memory recall. Our Hindsight memory layer is supposed to bring back useful context across sessions. The QA pairs let us check that directly: ask the question, see whether the retrieval surfaced the supporting turns.
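
One way to turn the memory-recall check in point 3 into a number is to measure how many of a pair's supporting turns the retrieval step surfaced. The sketch below is hypothetical: the Hindsight retrieval interface is not described here, so `retrieved_turns` stands in for whatever set of turn indices your memory layer returns, and `supporting_turn_recall` is a name invented for the example.

```python
def supporting_turn_recall(supporting_turns: list[int], retrieved_turns: set[int]) -> float:
    """Fraction of a QA pair's supporting turns that retrieval brought back.

    `supporting_turns` comes from the QA pair row; `retrieved_turns` is an
    assumed representation of what the memory layer surfaced.
    """
    needed = set(supporting_turns)
    if not needed:
        return 0.0  # nothing to recall; treat as a miss rather than divide by zero
    return len(needed & retrieved_turns) / len(needed)


# Two of the three supporting turns were retrieved.
score = supporting_turn_recall([2, 9, 14], {9, 14, 20})
print(round(score, 3))  # 0.667
```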

Schema

QA pair row (SQLite):

| field            | type       | meaning                                     |
|------------------|------------|---------------------------------------------|
| id               | uuid       | primary key                                 |
| session_id       | text       | source session in the audit log             |
| agent            | text       | which agent ran the session                 |
| turn_count       | int        | total turns in the source trajectory        |
| question         | text       | the distilled long-context question         |
| answer           | text       | grounded answer                             |
| supporting_turns | json array | which turns the answer pulls from (≥2)      |
| qa_type          | text       | factoid / reasoning / composition / temporal |
| source_model     | text       | model used by the compiler                  |
| cost_usd         | numeric    | per-pair compute cost                       |
| raw_extraction   | json       | the LLM's raw output for that pair          |
| created_at       | timestamp  | row created                                 |
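
For orientation, here is a minimal sketch of how the schema above could map onto SQLite using Python's standard sqlite3 module. The column names come from the schema; the table name `qa_pairs`, the SQLite column types, and the `insert_pair` helper are assumptions made for the example, and the real definition may differ.

```python
import json
import sqlite3
import uuid
from datetime import datetime, timezone

# Column names follow the schema; the SQLite types and the table name
# `qa_pairs` are assumptions for this sketch.
DDL = """
CREATE TABLE IF NOT EXISTS qa_pairs (
    id               TEXT PRIMARY KEY,  -- uuid stored as text
    session_id       TEXT NOT NULL,
    agent            TEXT,
    turn_count       INTEGER,
    question         TEXT NOT NULL,
    answer           TEXT NOT NULL,
    supporting_turns TEXT NOT NULL,     -- JSON array, serialised
    qa_type          TEXT,
    source_model     TEXT,
    cost_usd         REAL,
    raw_extraction   TEXT,              -- JSON, serialised
    created_at       TEXT               -- ISO 8601 timestamp
)
"""


def insert_pair(conn: sqlite3.Connection, pair: dict) -> str:
    """Insert one QA pair and return its generated id (hypothetical helper)."""
    pair_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO qa_pairs VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            pair_id,
            pair["session_id"],
            pair.get("agent"),
            pair.get("turn_count"),
            pair["question"],
            pair["answer"],
            json.dumps(pair["supporting_turns"]),
            pair.get("qa_type"),
            pair.get("source_model"),
            pair.get("cost_usd"),
            json.dumps(pair.get("raw_extraction")),
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    return pair_id


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute(DDL)
pair_id = insert_pair(conn, {
    "session_id": "example-session",
    "question": "example question",
    "answer": "example answer",
    "supporting_turns": [2, 9],
    "qa_type": "composition",
})
stored = conn.execute(
    "SELECT supporting_turns FROM qa_pairs WHERE id = ?", (pair_id,)
).fetchone()
print(json.loads(stored[0]))  # [2, 9]
```

SQLite has no native JSON or UUID column type, so this sketch stores both as text and decodes them on read.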

What we filter out

  • Trajectories with fewer than the threshold of turns.
  • Pairs whose supporting_turns lists fewer than two distinct references (we only want long-context examples, not single-turn lookups).
  • Empty or non-JSON LLM responses.
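
The last two filters can be sketched as a single parsing function. This is an illustration under assumptions, not the shipped code: it supposes the compiler model returns a JSON array of objects and that turn references are integers, and `parse_pairs` is a name invented for the example.

```python
import json


def parse_pairs(raw_response: str) -> list[dict]:
    """Parse an LLM response and keep only pairs grounded in two or more turns.

    Assumes the response is a JSON array of objects, each with a
    `supporting_turns` list of integer turn indices.
    """
    if not raw_response or not raw_response.strip():
        return []  # empty response
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # non-JSON response
    if not isinstance(payload, list):
        return []

    kept = []
    for pair in payload:
        if not isinstance(pair, dict):
            continue
        turns = pair.get("supporting_turns")
        if not isinstance(turns, list):
            continue
        # Count distinct references, so [3, 3] does not pass as two turns.
        distinct = {t for t in turns if isinstance(t, int)}
        if len(distinct) >= 2:
            kept.append(pair)
    return kept


response = """[
  {"question": "q1", "answer": "a1", "supporting_turns": [3, 11]},
  {"question": "q2", "answer": "a2", "supporting_turns": [5, 5]}
]"""
print([p["question"] for p in parse_pairs(response)])  # ['q1']
```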

What we do not do

  • We do not redact the trajectory text before sending it to the LLM compiler. If your audit log contains PII or commercially sensitive data, run this only against a model and region you trust. Pair this with redaction at the audit-log boundary, not here.
  • We do not deduplicate pairs across runs. Re-running over the same session produces fresh rows. Use the session_id column to filter.
  • We do not score pair quality. The next iteration will add a verifier pass.
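
Since pairs are not deduplicated across runs, one option is to use session_id to skip sessions that already have rows before compiling again. The sketch below assumes the pairs live in a SQLite table named `qa_pairs` (an assumption) and uses hypothetical helper names; the demo table has only the two columns the example needs.

```python
import sqlite3


def compiled_session_ids(conn: sqlite3.Connection) -> set[str]:
    """Return every session_id that already has at least one QA pair."""
    rows = conn.execute("SELECT DISTINCT session_id FROM qa_pairs").fetchall()
    return {row[0] for row in rows}


def skip_compiled(candidate_ids: list[str], conn: sqlite3.Connection) -> list[str]:
    """Keep only the candidate sessions that have not been compiled yet."""
    done = compiled_session_ids(conn)
    return [session_id for session_id in candidate_ids if session_id not in done]


# Demo against an in-memory table reduced to the columns used here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE qa_pairs (session_id TEXT, question TEXT)")
conn.executemany(
    "INSERT INTO qa_pairs VALUES (?, ?)",
    [("s1", "q1"), ("s1", "q2"), ("s2", "q3")],
)
todo = skip_compiled(["s1", "s2", "s3"], conn)
print(todo)  # ['s3']
```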

Licence

MIT. Steal it. Fork it. Ignore it.

Built by Alfred Churchill at Workloft, London.
