Turn the audit log of a running multi-agent stack into structured long-context QA pairs you can use for evaluation or supervision.
We run eight agents at Workloft. Every action they take lands in an append-only audit log. After a few months of operations, that log contains thousands of multi-turn trajectories: each one is a small piece of implicit long-context reasoning that nobody is currently learning from.
This tool changes that.
Inspired by the Agent Context Compilation method described in arXiv:2605.21850. The paper trains long-context recall by converting agent trajectories into QA pairs. We do the same thing, but starting from a production audit log rather than a synthetic benchmark run.
Given a Postgres audit log table (we use Supabase, but any compatible schema works), trajectory-compiler:
- Extracts trajectories. Groups rows by `session_id`, orders by time, keeps sessions above a configurable turn threshold.
- Compiles QA pairs. Sends each trajectory to a small model (Gemini 2.5 Flash, by default) and asks for 1–5 questions whose answers require combining at least two non-adjacent turns. Returns structured JSON.
- Stores them. SQLite by default (`./data/qa_pairs.sqlite`); optional mirror into a Supabase table for shared eval infrastructure.
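
The extraction step can be sketched roughly as follows. This is an illustration, not the tool's actual code: the helper name `group_trajectories` and the `created_at` timestamp field are assumptions about the audit-log shape, and the sketch treats the threshold as inclusive (sessions with at least `min_turns` rows are kept).

```python
from collections import defaultdict
from typing import Iterable


def group_trajectories(rows: Iterable[dict], min_turns: int = 10) -> dict[str, list[dict]]:
    """Group audit-log rows by session, order each session by time,
    and keep only sessions with at least `min_turns` rows.

    Assumes each row is a dict with `session_id` and `created_at` keys
    (hypothetical field names for this sketch).
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row)

    kept: dict[str, list[dict]] = {}
    for session_id, turns in sessions.items():
        if len(turns) >= min_turns:
            # Sort within the session so turn order reflects time order.
            kept[session_id] = sorted(turns, key=lambda r: r["created_at"])
    return kept
```

In practice the grouping and ordering could just as well be pushed into the SQL query; doing it in Python keeps the sketch independent of the database client.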
A typical run compiles a few dozen trajectories for well under a dollar.
```bash
git clone https://github.com/workloftai/trajectory-compiler
cd trajectory-compiler
pip install pyyaml # only dependency beyond stdlib

# Point the extractor at your audit log
export NEXT_PUBLIC_SUPABASE_URL=https://...
export SUPABASE_SERVICE_ROLE_KEY=...

# See what trajectories qualify
python3 cli.py extract --min-turns 10 --limit 20

# Compile the top 25 trajectories with the cheap tier
python3 cli.py compile --min-turns 10 --limit 25 --tier cheap

# Inspect what you got
python3 cli.py stats
python3 cli.py sample --limit 5 --qa-type composition
```

The default storage is a local SQLite file under `data/`. If you want to share the dataset across agents, run the migration in `migrations/001_qa_pairs.sql` against your Supabase project and pass `--supabase-mirror`.
Three reasons we built this for Workloft, in order of how much we care:
- A real eval set, made from real work. Every model upgrade we consider (Opus 4.7 → 4.8, Sonnet 4.6 → 4.7, our local Qwen) gets the same question: does it still get the same answers from our own historical trajectories? Public benchmarks do not know how our agents work. Our audit log does.
- Cheap supervision for our local model. The pairs are usable as instruction-tuning data for the local Qwen we run on the VPS for sovereign workloads. We do not need to invent training data; our agents are already generating it.
- A way to grade memory recall. Our Hindsight memory layer is supposed to bring back useful context across sessions. The QA pairs let us check that directly: ask the question, see whether the retrieval surfaced the supporting turns.
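
The memory-recall check in the last point could be scored along these lines. This is a minimal sketch under stated assumptions: turns are identified by integer indices, and `retrieved_turns` stands in for whatever the retrieval layer returns for a question. The function name is hypothetical and nothing here reflects the Hindsight API.

```python
def supporting_turn_recall(supporting_turns: list[int], retrieved_turns: list[int]) -> float:
    """Fraction of a QA pair's supporting turns that retrieval surfaced."""
    needed = set(supporting_turns)
    if not needed:
        return 0.0
    return len(needed & set(retrieved_turns)) / len(needed)


# A pair grounded in turns 3 and 17, where retrieval surfaced turns 3, 5 and 9:
print(supporting_turn_recall([3, 17], [3, 5, 9]))  # 0.5
```

Averaging this score over all pairs for an agent gives one rough number to compare before and after a memory-layer change.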
QA pair row (SQLite):
| field | type | meaning |
|---|---|---|
| id | uuid | primary key |
| session_id | text | source session in the audit log |
| agent | text | which agent ran the session |
| turn_count | int | total turns in the source trajectory |
| question | text | the distilled long-context question |
| answer | text | grounded answer |
| supporting_turns | json array | which turns the answer pulls from (≥2) |
| qa_type | text | factoid / reasoning / composition / temporal |
| source_model | text | model used by the compiler |
| cost_usd | numeric | per-pair compute cost |
| raw_extraction | json | the LLM's raw output for that pair |
| created_at | timestamp | row created |
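
For orientation, the row layout above maps onto SQLite roughly like this. The table name `qa_pairs` and the column constraints are assumptions made for this sketch; the repository's own schema is authoritative. SQLite has no native uuid, JSON, or timestamp types, so those fields are stored as text here.

```python
import sqlite3

# Hypothetical DDL mirroring the documented fields; not copied from the repo.
SCHEMA = """
CREATE TABLE IF NOT EXISTS qa_pairs (
    id               TEXT PRIMARY KEY,   -- uuid stored as text
    session_id       TEXT NOT NULL,
    agent            TEXT,
    turn_count       INTEGER,
    question         TEXT NOT NULL,
    answer           TEXT NOT NULL,
    supporting_turns TEXT NOT NULL,      -- JSON array, e.g. "[3, 17]"
    qa_type          TEXT,               -- factoid / reasoning / composition / temporal
    source_model     TEXT,
    cost_usd         NUMERIC,
    raw_extraction   TEXT,               -- JSON blob
    created_at       TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a QA-pair store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn
```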

What gets filtered out:
- Trajectories with fewer turns than the configured threshold.
- Pairs where `supporting_turns` has fewer than two distinct references (we only want long-context examples, not single-turn lookups).
- Empty or non-JSON LLM responses.
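
The response-level filters can be sketched as a single parsing function. This assumes the model returns a JSON array of objects, each with a `supporting_turns` list; the helper name `parse_pairs` and that response shape are assumptions, not the tool's confirmed format.

```python
import json


def parse_pairs(raw_response: str) -> list[dict]:
    """Return only the pairs that pass the response-level filters."""
    if not raw_response or not raw_response.strip():
        return []  # empty response
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return []  # non-JSON response
    if not isinstance(payload, list):
        return []  # unexpected shape for this sketch

    kept = []
    for pair in payload:
        if not isinstance(pair, dict):
            continue
        turns = pair.get("supporting_turns")
        if not isinstance(turns, list):
            continue
        try:
            distinct = set(turns)
        except TypeError:
            continue  # unhashable entries, e.g. nested lists
        if len(distinct) >= 2:  # need at least two distinct references
            kept.append(pair)
    return kept
```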

Known limitations:
- We do not redact the trajectory text before sending it to the LLM compiler. If your audit log contains PII or commercially sensitive data, run this only against a model and region you trust. Redact at the audit-log boundary, not here.
- We do not deduplicate pairs across runs. Re-running over the same session produces fresh rows. Use the `session_id` column to filter.
- We do not score pair quality. The next iteration will add a verifier pass.
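
One way to work around the missing deduplication is to skip sessions that already have pairs before compiling again. The sketch below assumes a SQLite table named `qa_pairs` with a `session_id` column; both helper names are hypothetical and not part of the CLI.

```python
import sqlite3


def already_compiled(conn: sqlite3.Connection) -> set[str]:
    """Session ids that already have at least one stored QA pair."""
    rows = conn.execute("SELECT DISTINCT session_id FROM qa_pairs")
    return {row[0] for row in rows}


def pending_sessions(conn: sqlite3.Connection, candidate_ids: list[str]) -> list[str]:
    """Candidate sessions with no stored pairs yet, in their original order."""
    done = already_compiled(conn)
    return [sid for sid in candidate_ids if sid not in done]
```

Filtering before the compile step also avoids paying for the model call twice, which filtering at read time would not.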
MIT. Steal it. Fork it. Ignore it.
Built by Alfred Churchill at Workloft, London.