A small benchmark and replay harness that argues agent workloads should be optimized at the serving layer before the model. Runs on one rented RTX 3090.
What's in here:
- A FastAPI trace proxy and replay harness.
- Six runtime policies: naive, exact-prefix-cache, prefix-aware-routing, prompt-compaction, max-token-prediction, and a learned policy model.
- 720 synthetic public agent traces, 40 SWE-bench Verified prompts, 120 OpenHands trajectory shapes.
- 13 JSON reports from real GPU runs and training jobs.
- A metadata MLP router that hits 87.5% held-out accuracy without touching private content.
- An interactive paper in web/ (Svelte 5 + Vite) that reads the same reports the Python code emits.
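
To make the prefix-oriented policies in the list above more concrete, here is a minimal sketch of how much of a prompt batch sits inside a prefix shared by every prompt. It is an illustration only, not the code in src/agent_trace_runtime/: the function names and the toy token IDs are hypothetical, and a real serving stack tracks prefixes at the KV-cache block level rather than per token.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def shared_fraction(prompts: list[list[int]]) -> float:
    """Fraction of all prompt tokens that lie in the prefix common to every prompt."""
    if not prompts:
        return 0.0
    # The prefix common to all prompts is the shortest pairwise match with the first one.
    common = len(prompts[0])
    for p in prompts[1:]:
        common = min(common, shared_prefix_len(prompts[0], p))
    total = sum(len(p) for p in prompts)
    return (common * len(prompts)) / total if total else 0.0


# Toy token IDs (hypothetical): a two-token system prompt followed by different tasks.
toy_prompts = [[1, 2, 3, 4], [1, 2, 3, 5, 6], [1, 2, 9]]
print(shared_fraction(toy_prompts))
```

The higher this fraction, the more prefill work a prefix cache can in principle skip after the first request.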
Public SWE-bench Verified prompts on Qwen3-4B-AWQ, one RTX 3090:
| Stack | Naive avg | Shared-prefix avg | Cut |
|---|---|---|---|
| vLLM | 644 ms | 198 ms | 69.33% |
| SGLang | 976 ms | 133 ms | 86.38% |
Same weights, same prompts. Difference is runtime policy. Full table and limitations in RESULTS.md and WRITEUP.md.
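
The Cut column is the relative latency reduction from naive to shared-prefix. A small sketch of that arithmetic, using the rounded averages from the table (the helper name `percent_cut` is hypothetical):

```python
def percent_cut(naive_ms: float, shared_ms: float) -> float:
    """Relative latency reduction, in percent, going from naive to shared-prefix."""
    if naive_ms <= 0:
        raise ValueError("naive_ms must be positive")
    return 100.0 * (naive_ms - shared_ms) / naive_ms


# Rounded averages copied from the table above.
rows = {"vLLM": (644, 198), "SGLang": (976, 133)}
for stack, (naive, shared) in rows.items():
    print(f"{stack}: {percent_cut(naive, shared):.2f}% cut")
```

With the rounded inputs this prints 69.25% and 86.37%, slightly off the 69.33% and 86.38% in the table. The likely reason, stated here as an assumption, is that the reported cuts were computed from unrounded latencies.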
src/agent_trace_runtime/ Python package: proxy, replay, policies, router, schema, FastAPI app
templates/, static/ Server-rendered HTMX dashboard
web/ Svelte 5 + Vite static site
script/ Data prep, benchmark drivers, training entry points
data/ Synthetic + public agent traces (NDJSON / JSONL)
reports/ JSON evidence from every run
tests/ pytest tests for replay, privacy, web app
dashboard/ Static zero-runtime dashboard (fallback)
goal.md What was being built
WRITEUP.md Paper-style narrative
RESULTS.md Per-run measurements, including negative results
BENCHMARKS.md, RUNBOOK.md How to reproduce
DATA.md, PRIVACY.md Data sources and privacy stance
Tested on Python 3.12, macOS and Linux.
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
PYTHONPATH=src python3 -m pytest tests
PYTHONPATH=src python3 -m uvicorn agent_trace_runtime.web_app:app --host 127.0.0.1 --port 8081
# open http://127.0.0.1:8081

For the GPU runs you need a CUDA box. RUNBOOK.md has the exact serving commands.
cd web
npm install
npm run dev # local
npm run build # static dist/, deploys anywhere

Sections: Overview (hero token race), Results (latency race + sweep), Trace Explorer (720 traces), Router Lab (live policy selection), Agent Shapes (OpenHands), Paper (text), Data Appendix.
LoRA adapters and the metadata MLP report their scores in reports/policy_*.json and reports/metadata_policy_classifier_report.json. Adapter weights are not committed; regenerate them from the public SFT data via the training scripts in script/.
Everything in data/ and reports/ is public or synthetic:
- data/synthetic_traces.ndjson - 720 generated app/change traces.
- data/swe_bench_verified_tasks.ndjson - 40 tasks sampled from SWE-bench Verified.
- data/policy_sft_public.jsonl - 760 public-only SFT rows.
- reports/openhands_trace_shapes.json - shape stats over 120 public OpenHands trajectories.
No private content anywhere. See PRIVACY.md.
- 1x NVIDIA RTX 3090 (24 GB VRAM, ~805 GB/s)
- Xeon Gold 6138
- vLLM 0.21.0, SGLang 0.5.9
- Qwen3-4B-AWQ primary, Qwen3-8B-AWQ smoke
- One Vast.ai instance, single-GPU, single-process
MIT.