trajectory-compiler

Turn the audit log of a running multi-agent stack into structured long-context QA pairs you can use for evaluation or supervision.

We run eight agents at Workloft. Every action they take lands in an append-only audit log. After a few months of operations, that log contains thousands of multi-turn trajectories: each one is a small piece of implicit long-context reasoning that nobody is currently learning from.

This tool changes that.

Inspired by the Agent Context Compilation method described in arXiv:2605.21850. The paper trains long-context recall by converting agent trajectories into QA pairs. We do the same thing, but starting from a production audit log rather than a synthetic benchmark run.

What it does

Given a Postgres audit log table (we use Supabase, but any compatible schema works), trajectory-compiler:

Extracts trajectories. Groups rows by session_id, orders by time, keeps sessions above a configurable turn threshold.
Compiles QA pairs. Sends each trajectory to a small model (Gemini 2.5 Flash, by default) and asks for 1–5 questions whose answers require combining at least two non-adjacent turns. Returns structured JSON.
Stores them. SQLite by default (./data/qa_pairs.sqlite); optional mirror into a Supabase table for shared eval infrastructure.
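
The extraction step can be pictured with a minimal sketch like the one below. It is not the code in extract.py. The function name `extract_trajectories` and the `created_at` column on the audit log are assumptions made for illustration, and the threshold is treated as an inclusive minimum, which is one reading of --min-turns.

```python
from collections import defaultdict
from typing import Iterable


def extract_trajectories(rows: Iterable[dict], min_turns: int = 10) -> dict[str, list[dict]]:
    """Group audit rows by session_id, order each group by time, drop short sessions.

    Assumes every row has a "session_id" key and a sortable "created_at" key
    (hypothetical column name; adapt it to your audit log schema).
    """
    sessions: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        sessions[row["session_id"]].append(row)

    trajectories: dict[str, list[dict]] = {}
    for session_id, turns in sessions.items():
        # Keep only sessions that reach the turn threshold.
        if len(turns) < min_turns:
            continue
        trajectories[session_id] = sorted(turns, key=lambda r: r["created_at"])
    return trajectories
```

Sorting inside each session rather than across the whole table keeps the sketch independent of how the rows were fetched.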

A typical run compiles a few dozen trajectories for well under a dollar.

Quick start

git clone https://github.com/workloftai/trajectory-compiler
cd trajectory-compiler
pip install pyyaml          # only dependency beyond stdlib

# Point the extractor at your audit log
export NEXT_PUBLIC_SUPABASE_URL=https://...
export SUPABASE_SERVICE_ROLE_KEY=...

# See what trajectories qualify
python3 cli.py extract --min-turns 10 --limit 20

# Compile the top 25 trajectories with the cheap tier
python3 cli.py compile --min-turns 10 --limit 25 --tier cheap

# Inspect what you got
python3 cli.py stats
python3 cli.py sample --limit 5 --qa-type composition

The default storage is a local SQLite file under data/. If you want to share the dataset across agents, run the migration in migrations/001_qa_pairs.sql against your Supabase project and pass --supabase-mirror.

Why bother

Three reasons we built this for Workloft, in order of how much we care:

A real eval set, made from real work. Every model upgrade we consider (Opus 4.7 → 4.8, Sonnet 4.6 → 4.7, our local Qwen) faces the same test: does it still produce the same answers from our own historical trajectories? Public benchmarks do not know how our agents work. Our audit log does.
Cheap supervision for our local model. The pairs are usable as instruction-tuning data for the local Qwen we run on the VPS for sovereign workloads. We do not need to invent training data; our agents are already generating it.
A way to grade memory recall. Our Hindsight memory layer is supposed to bring back useful context across sessions. The QA pairs let us check that directly: ask the question, see whether the retrieval surfaced the supporting turns.
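
As a rough illustration of the memory-recall check, the sketch below scores one QA pair by the fraction of its supporting turns that a retrieval step brought back. The function name `supporting_turn_recall` is hypothetical, and it assumes the memory layer can report which turn indices it retrieved. How Hindsight exposes that is not covered here.

```python
from typing import Iterable


def supporting_turn_recall(supporting_turns: Iterable[int], retrieved_turns: Iterable[int]) -> float:
    """Fraction of a pair's supporting turns that appear in the retrieved set."""
    needed = set(supporting_turns)
    if not needed:
        raise ValueError("supporting_turns must not be empty")
    found = needed & set(retrieved_turns)
    return len(found) / len(needed)
```

A score of 1.0 means every supporting turn was surfaced. Anything lower means the retrieval missed context the answer depends on.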

Schema

QA pair row (SQLite):

field	type	meaning
id	uuid	primary key
session_id	text	source session in the audit log
agent	text	which agent ran the session
turn_count	int	total turns in the source trajectory
question	text	the distilled long-context question
answer	text	grounded answer
supporting_turns	json array	which turns the answer pulls from (≥2)
qa_type	text	factoid / reasoning / composition / temporal
source_model	text	model used by the compiler
cost_usd	numeric	per-pair compute cost
raw_extraction	json	the LLM's raw output for that pair
created_at	timestamp	row created
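
For orientation, here is a minimal sketch of how a table with these fields could be created and written with the standard library. It is an assumption-laden illustration, not the DDL in store.py or migrations/001_qa_pairs.sql: the table name `qa_pairs`, the helper names `open_store` and `insert_pair`, and the choice to store JSON fields as text are all hypothetical.

```python
import json
import sqlite3
import uuid

# Column names follow the schema table above; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS qa_pairs (
    id TEXT PRIMARY KEY,
    session_id TEXT NOT NULL,
    agent TEXT,
    turn_count INTEGER,
    question TEXT NOT NULL,
    answer TEXT NOT NULL,
    supporting_turns TEXT NOT NULL,
    qa_type TEXT CHECK (qa_type IN ('factoid', 'reasoning', 'composition', 'temporal')),
    source_model TEXT,
    cost_usd NUMERIC,
    raw_extraction TEXT,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the SQLite store and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def insert_pair(conn: sqlite3.Connection, pair: dict) -> str:
    """Insert one QA pair and return its generated uuid."""
    pair_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO qa_pairs (id, session_id, agent, turn_count, question, answer,"
        " supporting_turns, qa_type, source_model, cost_usd, raw_extraction)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            pair_id,
            pair["session_id"],
            pair.get("agent"),
            pair.get("turn_count"),
            pair["question"],
            pair["answer"],
            json.dumps(pair["supporting_turns"]),  # JSON array serialised as text
            pair.get("qa_type"),
            pair.get("source_model"),
            pair.get("cost_usd"),
            json.dumps(pair.get("raw_extraction")),
        ),
    )
    conn.commit()
    return pair_id
```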

What we filter out

Trajectories with fewer turns than the configured threshold (--min-turns).
Pairs the LLM produced where supporting_turns has fewer than two distinct references (we only want long-context examples, not single-turn lookups).
Empty or non-JSON LLM responses.
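
The last two filters could look roughly like the sketch below. It assumes the model returns a JSON array of objects, each with a `supporting_turns` list of integer turn indices. That response shape and the function name `keep_valid_pairs` are assumptions, not the format compile.py actually uses.

```python
import json


def keep_valid_pairs(raw_response: str) -> list[dict]:
    """Drop empty or non-JSON responses and pairs with fewer than two distinct supporting turns."""
    if not raw_response or not raw_response.strip():
        return []
    try:
        parsed = json.loads(raw_response)
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []

    kept = []
    for pair in parsed:
        if not isinstance(pair, dict):
            continue
        turns = pair.get("supporting_turns")
        if not isinstance(turns, list):
            continue
        # Count distinct integer turn references; repeats of one turn do not qualify.
        distinct = {t for t in turns if isinstance(t, int)}
        if len(distinct) < 2:
            continue
        kept.append(pair)
    return kept
```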

What we do not do

We do not redact the trajectory text before sending it to the LLM compiler. If your audit log contains PII or commercially sensitive material, run this only against a model and region you trust. Do the redaction at the audit-log boundary, before the data reaches this tool.
We do not deduplicate pairs across runs. Re-running over the same session produces fresh rows. Filter on the session_id column if you need to drop duplicates downstream.
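
One way to work around the lack of deduplication is to skip sessions that already have rows in the store before compiling. The sketch below assumes a SQLite table named `qa_pairs` with a `session_id` column. The table name and both helper functions are hypothetical and not part of the tool.

```python
import sqlite3
from typing import Iterable


def sessions_already_compiled(conn: sqlite3.Connection) -> set[str]:
    """Return every session_id that already has at least one stored QA pair."""
    rows = conn.execute("SELECT DISTINCT session_id FROM qa_pairs").fetchall()
    return {row[0] for row in rows}


def pending_sessions(conn: sqlite3.Connection, candidate_ids: Iterable[str]) -> list[str]:
    """Keep only candidate sessions with no stored pairs, preserving input order."""
    done = sessions_already_compiled(conn)
    return [session_id for session_id in candidate_ids if session_id not in done]
```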
We do not score pair quality. The next iteration will add a verifier pass.

Licence

MIT. Steal it. Fork it. Ignore it.

Built by Alfred Churchill at Workloft, London.
