synthpolis/agent-trace-runtime-paper

Agent Trace Runtime

A small benchmark and replay harness that argues agent workloads should be optimized at the serving layer before the model. Runs on one rented RTX 3090.

What's in here:

  • A FastAPI trace proxy and replay harness.
  • Six runtime policies: naive, exact-prefix-cache, prefix-aware-routing, prompt-compaction, max-token-prediction, and a learned policy model.
  • 720 synthetic agent traces, 40 public SWE-bench Verified prompts, and shape statistics for 120 public OpenHands trajectories.
  • 13 JSON reports from real GPU runs and training jobs.
  • A metadata MLP router that hits 87.5% held-out accuracy without touching private content.
  • An interactive paper in web/ (Svelte 5 + Vite) that reads the same reports the Python code emits.
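Why prefix-oriented policies matter for agent traffic can be illustrated with a toy measurement. The sketch below is not the repository's implementation: the function names, the whitespace "tokenizer", and the three-turn trace are all hypothetical, and the cache model (compare each prompt only with the one before it) is the simplest one possible.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of leading tokens that two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_fraction(prompts: Sequence[Sequence[str]]) -> float:
    """Fraction of all prompt tokens that repeat the previous prompt's prefix.

    Assumption: a single-entry cache holding only the previous prompt.
    """
    total = sum(len(p) for p in prompts)
    if total == 0:
        return 0.0
    reused = sum(
        shared_prefix_len(prev, cur) for prev, cur in zip(prompts, prompts[1:])
    )
    return reused / total


# Hypothetical agent trace: the system prompt and tool list repeat every turn.
system = "you are a coding agent . tools : read write run".split()
turns = [
    system + "task : fix the failing test".split(),
    system + "task : fix the failing test".split() + "obs : 2 failures".split(),
    system + "task : add a regression test".split(),
]

if __name__ == "__main__":
    print(f"reusable prefix fraction: {reusable_fraction(turns):.2f}")
```

The higher this fraction is for a workload, the more prefill work a prefix cache or prefix-aware router can in principle skip.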

Headline numbers

Public SWE-bench Verified prompts on Qwen3-4B-AWQ, one RTX 3090:

Stack     Naive avg   Shared-prefix avg   Cut
vLLM      644 ms      198 ms              69.33%
SGLang    976 ms      133 ms              86.38%

Within each stack the weights and prompts are identical; only the runtime policy changes. The full table and limitations are in RESULTS.md and WRITEUP.md.
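As a sanity check, the Cut column can be recomputed from the two averages, assuming it means the relative reduction from the naive average to the shared-prefix average (the table does not define it). The helper name below is hypothetical. Recomputing from the rounded millisecond values lands within about 0.1 percentage point of the reported figures; the small gap is presumably because the reported percentages were derived from unrounded measurements, which is also an assumption.

```python
def latency_cut_pct(naive_ms: float, optimized_ms: float) -> float:
    """Relative reduction from naive_ms to optimized_ms, in percent."""
    return 100.0 * (naive_ms - optimized_ms) / naive_ms


# (naive avg ms, shared-prefix avg ms, reported cut %) copied from the table.
rows = {
    "vLLM": (644, 198, 69.33),
    "SGLang": (976, 133, 86.38),
}

if __name__ == "__main__":
    for stack, (naive, shared, reported) in rows.items():
        recomputed = latency_cut_pct(naive, shared)
        print(f"{stack}: recomputed {recomputed:.2f}% vs reported {reported:.2f}%")
```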

Layout

src/agent_trace_runtime/   Python package: proxy, replay, policies, router, schema, FastAPI app
templates/, static/        Server-rendered HTMX dashboard
web/                       Svelte 5 + Vite static site
script/                    Data prep, benchmark drivers, training entry points
data/                      Synthetic + public agent traces (NDJSON / JSONL)
reports/                   JSON evidence from every run
tests/                     pytest tests for replay, privacy, web app
dashboard/                 Static zero-runtime dashboard (fallback)
goal.md                    What was being built
WRITEUP.md                 Paper-style narrative
RESULTS.md                 Per-run measurements, including negative results
BENCHMARKS.md, RUNBOOK.md  How to reproduce
DATA.md, PRIVACY.md        Data sources and privacy stance

Reproduce

Tested with Python 3.12 on macOS and Linux.

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

PYTHONPATH=src python3 -m pytest tests

PYTHONPATH=src python3 -m uvicorn agent_trace_runtime.web_app:app --host 127.0.0.1 --port 8081
# open http://127.0.0.1:8081

For the GPU runs you need a CUDA box. RUNBOOK.md has the exact serving commands.

Interactive paper

cd web
npm install
npm run dev      # local
npm run build    # static dist/, deploys anywhere

Sections: Overview (hero token race), Results (latency race + sweep), Trace Explorer (720 traces), Router Lab (live policy selection), Agent Shapes (OpenHands), Paper (text), Data Appendix.

Models

LoRA adapters and the metadata MLP report their scores in reports/policy_*.json and reports/metadata_policy_classifier_report.json. Adapter weights are not committed; regenerate them from the public SFT data via the training scripts in script/.
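To get oriented in those report files without assuming their schema, a small helper can list the top-level keys of each one. This is a sketch: `summarize_reports` is a hypothetical name and not part of the package, and it assumes only that the files are valid JSON.

```python
import json
from pathlib import Path


def summarize_reports(
    report_dir: str = "reports", pattern: str = "policy_*.json"
) -> dict[str, list[str]]:
    """Map each matching report file name to its sorted top-level keys."""
    summary: dict[str, list[str]] = {}
    for path in sorted(Path(report_dir).glob(pattern)):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        if isinstance(data, dict):
            summary[path.name] = sorted(data)
        else:
            # Non-object reports (e.g. a top-level list) are flagged by type.
            summary[path.name] = [f"<{type(data).__name__}>"]
    return summary


if __name__ == "__main__":
    for name, keys in summarize_reports().items():
        print(name, keys)
```

Run it from the repository root; pass `pattern="metadata_policy_classifier_report.json"` to inspect the MLP report instead.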

Data

Everything in data/ and reports/ is public or synthetic:

  • data/synthetic_traces.ndjson - 720 generated app/change traces.
  • data/swe_bench_verified_tasks.ndjson - 40 tasks sampled from SWE-bench Verified.
  • data/policy_sft_public.jsonl - 760 public-only SFT rows.
  • reports/openhands_trace_shapes.json - shape stats over 120 public OpenHands trajectories.

No private content anywhere. See PRIVACY.md.
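The NDJSON and JSONL files hold one JSON object per line, so they can be streamed with the standard library alone. The reader below is a minimal sketch: `read_ndjson` and `count_rows` are hypothetical helpers, not functions from the package, and they make no assumption about the fields inside each row.

```python
import json
from pathlib import Path
from typing import Iterator


def read_ndjson(path: str) -> Iterator[dict]:
    """Yield one parsed JSON value per non-empty line of the file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc


def count_rows(path: str) -> int:
    """Count the records in an NDJSON/JSONL file without loading it all."""
    return sum(1 for _ in read_ndjson(path))
```

If the files match the descriptions above, `count_rows("data/synthetic_traces.ndjson")` should give 720 and `count_rows("data/policy_sft_public.jsonl")` should give 760.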

Hardware

  • 1x NVIDIA RTX 3090 (24 GB VRAM, ~805 GB/s)
  • Xeon Gold 6138
  • vLLM 0.21.0, SGLang 0.5.9
  • Qwen3-4B-AWQ (primary), Qwen3-8B-AWQ (smoke tests)
  • One Vast.ai instance, single-GPU, single-process

License

MIT.

About

Agent inference is not chat inference. Trace proxy + replay harness + paper for runtime policy optimization on a single RTX 3090.
