A deterministic replay system for AI agent interactions. Record, replay, and debug agent behaviors without consuming LLM tokens. Think VCR for HTTP, but for AI agent workflows.
Developing AI agents requires rapid iteration. Each debug cycle burns tokens and costs money. Agent Replay decouples agent debugging from live LLM calls by recording complete interaction traces that can be replayed deterministically.
| Capability | Description |
|---|---|
| Record & Replay | Capture every LLM call, tool invocation, and routing decision in a trace. |
| Token-Free Debugging | Replay traces with stubbed responses — zero API calls, zero cost. |
| Partial Replay | Replay up to step N, then go live. Debug "the agent was fine for 6 turns then went sideways." |
| Diff Mode | Run live LLM calls against a recorded trace and compare outputs to detect behavioral drift. |
| Deterministic Testing | Seeded PRNG, frozen clock, and snapshot environment for reproducible tests. |
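
The Deterministic Testing row can be illustrated with a minimal sketch of a seeded PRNG and a frozen clock. This is not Agent Replay's implementation: `mulberry32` is a well-known small PRNG used here as a stand-in, and `FrozenClock` is a hypothetical name chosen for this example.

```typescript
// Hypothetical helpers for illustration; not part of the Agent Replay API.

// mulberry32: a small 32-bit PRNG. The same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Scale the unsigned 32-bit result into [0, 1).
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// A clock that only moves when the test advances it explicitly.
class FrozenClock {
  private current: number;

  constructor(startMs: number) {
    this.current = startMs;
  }

  now(): number {
    return this.current;
  }

  advance(ms: number): void {
    this.current += ms;
  }
}

const random = mulberry32(42);
const clock = new FrozenClock(1_700_000_000_000);
console.log(random(), clock.now());
```

Injecting both into the agent under test, instead of calling `Math.random()` and `Date.now()` directly, is what makes two runs of the same trace comparable.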
Agent Replay is in pre-alpha (v0.1.0). Core data models, the monorepo structure, and CI/CD are in place. Package APIs are under active development and subject to change.
See DEV_PLAN.md for the roadmap and ARCHITECTURE.md for the technical design.
- Node.js >= 18.0.0
- pnpm >= 8.0.0
git clone https://github.com/reaatech/agent-replay.git
cd agent-replay
pnpm install
pnpm build
pnpm test

| Command | Description |
|---|---|
| pnpm build | Build all packages |
| pnpm test | Run test suite |
| pnpm test:watch | Run tests in watch mode |
| pnpm test:coverage | Run tests with coverage report |
| pnpm lint | Lint all packages |
| pnpm format | Format code with Prettier |
| pnpm type-check | Type-check all packages |
| pnpm clean | Clean build artifacts |
| Package | Path | Description | Status |
|---|---|---|---|
| @reaatech/agent-replay-shared | packages/shared | Shared types, interfaces, and utilities | In progress |
| @reaatech/agent-replay-core | packages/core | Recording, replay, and diff engines | In progress |
| @reaatech/agent-replay-interceptors | packages/interceptors | LLM provider interceptors (OpenAI, Anthropic) | In progress |
| @reaatech/agent-replay | packages/agent-replay | Convenience package that re-exports core + interceptors | In progress |
| @reaatech/agent-replay-cli | packages/cli | Command-line interface | Planned |
| @reaatech/agent-replay-integrations | packages/integrations | Framework integrations (LangChain, LangGraph) | Planned |
| @reaatech/agent-replay-web-ui | packages/web-ui | Web-based trace viewer | Planned |
Agent Replay is built around a trace-based data model with hierarchical spans and events, inspired by distributed tracing systems like OpenTelemetry.
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Recording │ │ Trace Storage │ │ Replay │
│ Engine │ ──► │ (.artrace.json) │ ──► │ Engine │
│ │ │ │ │ │
│ • Interceptors │ │ • Local filesystem │ │ • Stubbed replay │
│ • Span builder │ │ • Future: SQLite │ │ • Partial replay │
│ • State capture │ │ • Future: S3 │ │ • Diff mode │
│ • Checkpoints │ │ │ │ • Debugger │
└─────────────────────┘ └─────────────────────┘ └─────────────────────┘
- Traces are stored as line-delimited JSON (.artrace.json) for streaming and incremental processing.
- Spans represent discrete operations (LLM calls, tool invocations, agent steps).
- Checkpoints capture agent state for partial replay and resumption.
- Diff engine performs semantic and structural comparisons between runs.
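
As a rough illustration of the structural half of that comparison, the sketch below walks two JSON-like values and reports the paths where they differ. It is an assumption-laden example, not the project's diff engine: `structuralDiff` and its `$`-rooted path notation are invented here, and semantic comparison is out of scope.

```typescript
// Hypothetical structural diff; the real diff engine may work differently.
function structuralDiff(recorded: unknown, live: unknown, path: string = '$'): string[] {
  if (Object.is(recorded, live)) return [];

  const bothObjects =
    typeof recorded === 'object' && recorded !== null &&
    typeof live === 'object' && live !== null;

  // Primitives that differ, or an object compared with a non-object.
  if (!bothObjects) return [path];
  // An array compared with a plain object counts as a difference at this path.
  if (Array.isArray(recorded) !== Array.isArray(live)) return [path];

  const a = recorded as Record<string, unknown>;
  const b = live as Record<string, unknown>;
  // Union of keys from both sides, de-duplicated.
  const keys = Object.keys(a)
    .concat(Object.keys(b))
    .filter((key, index, all) => all.indexOf(key) === index);

  const diffs: string[] = [];
  for (const key of keys) {
    diffs.push(...structuralDiff(a[key], b[key], `${path}.${key}`));
  }
  return diffs;
}

const recordedOutput = { content: 'Hello!', usage: { tokens: 12 } };
const liveOutput = { content: 'Hello!', usage: { tokens: 15 } };
console.log(structuralDiff(recordedOutput, liveOutput)); // [ '$.usage.tokens' ]
```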
Read the full architecture in ARCHITECTURE.md.
Traces use the .artrace.json format — line-delimited JSON with a header, span lines, and a footer index. Each span contains typed events (request, response, error, state_snapshot, checkpoint, annotation) with provider-agnostic payloads.
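
To make the line-delimited layout easier to picture, here is a small sketch that parses a trace with one header line, one span line, and one footer line. The event type names come from the description above; everything else (the `kind` discriminator, `spanId`, `spanCount`, and `parseTraceLines`) is assumed for illustration and may not match the actual schema in ARCHITECTURE.md.

```typescript
// Assumed line shapes for illustration only; the real schema may differ.
type EventType =
  | 'request' | 'response' | 'error'
  | 'state_snapshot' | 'checkpoint' | 'annotation';

interface TraceEvent {
  timestamp: number;
  type: EventType;
  name: string;
  data: unknown;
}

interface HeaderLine { kind: 'header'; name: string; }
interface SpanLine { kind: 'span'; spanId: string; events: TraceEvent[]; }
interface FooterLine { kind: 'footer'; spanCount: number; }
type TraceLine = HeaderLine | SpanLine | FooterLine;

// Line-delimited JSON: one JSON document per non-empty line.
function parseTraceLines(text: string): TraceLine[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as TraceLine);
}

const sampleTrace = [
  JSON.stringify({ kind: 'header', name: 'my-agent-run' }),
  JSON.stringify({
    kind: 'span',
    spanId: 's1',
    events: [{ timestamp: 0, type: 'response', name: 'llm-response', data: { content: 'Hello!' } }],
  }),
  JSON.stringify({ kind: 'footer', spanCount: 1 }),
].join('\n');

const traceLines = parseTraceLines(sampleTrace);
const spanLines = traceLines.filter((l): l is SpanLine => l.kind === 'span');
console.log(spanLines.length); // 1
```

Because each line is an independent JSON document, a reader can process spans one at a time without loading the whole file.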
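// Assumed import path (the convenience package re-exports core); APIs are pre-alpha and may change.
import { RecordingEngine } from '@reaatech/agent-replay';

const engine = new RecordingEngine();
// Start a named recording session.
const session = engine.startRecording({ name: 'my-agent-run' });
// Open a span for a single LLM call.
const spanId = engine.startSpan('llm-call', 'llm_call');
// Attach a response event to that span.
engine.captureEvent(
  {
    timestamp: Date.now(),
    type: 'response',
    name: 'llm-response',
    attributes: {},
    data: { content: 'Hello!' },
  },
  { spanId }
);
// Close the span before stopping the recording.
engine.endSpan(spanId);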
const trace = engine.stopRecording(session);

| Mode | Behavior |
|---|---|
| stubbed | Returns recorded responses with zero LLM calls. |
| live | Executes real LLM calls alongside the trace. |
| partial | Replays to a checkpoint, then switches to live. |
| diff | Runs live and compares outputs against the recorded trace. |
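
The four modes in the table can be sketched as a single per-step decision. The code below is a simplified illustration under stated assumptions: `replayStep`, `RecordedStep`, and `StepResult` are hypothetical names, the live call is modeled as a synchronous function for brevity (a real engine would be async), and drift is reduced to a string inequality.

```typescript
// Hypothetical types and function; not the actual replay engine API.
type ReplayMode = 'stubbed' | 'live' | 'partial' | 'diff';

interface RecordedStep {
  prompt: string;
  response: string;
}

interface StepResult {
  response: string;
  source: 'recorded' | 'live';
  drift?: boolean;
}

function replayStep(
  mode: ReplayMode,
  index: number,
  recorded: RecordedStep,
  callLive: (prompt: string) => string,
  checkpointIndex: number = 0,
): StepResult {
  switch (mode) {
    case 'stubbed':
      // Never touches the LLM.
      return { response: recorded.response, source: 'recorded' };
    case 'live':
      return { response: callLive(recorded.prompt), source: 'live' };
    case 'partial':
      // Steps before the checkpoint come from the trace; later steps go live.
      return index < checkpointIndex
        ? { response: recorded.response, source: 'recorded' }
        : { response: callLive(recorded.prompt), source: 'live' };
    case 'diff': {
      const liveResponse = callLive(recorded.prompt);
      return { response: liveResponse, source: 'live', drift: liveResponse !== recorded.response };
    }
  }
}

const step: RecordedStep = { prompt: 'Say hi', response: 'Hello!' };
const fakeLive = (_prompt: string): string => 'Hi there!';
console.log(replayStep('partial', 5, step, fakeLive, 6).source); // recorded
console.log(replayStep('partial', 6, step, fakeLive, 6).source); // live
```

With `checkpointIndex` set to 6, the first six steps (indices 0 through 5) are served from the trace, which matches the "fine for 6 turns then went sideways" debugging scenario.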
Partial replay requires snapshotting agent state at checkpoints. The system supports:
- Default: structuredClone with fallback to JSON serialization.
- Custom snapshotters: Register custom serializers for non-serializable objects (class instances, circular references).
- Framework adapters: LangChain, LangGraph, and AutoGen adapters for framework-specific state.
Functions, closures, and external resources (DB connections, file handles) are not captured — code changes require full re-recording.
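
The default strategy described above can be sketched in a few lines. This is an illustrative helper, not the project's snapshotter: `snapshotState` and its optional `custom` parameter are hypothetical names, and the sketch assumes a runtime where the global `structuredClone` is available (Node.js 17 or later, which the stated Node.js >= 18 requirement covers).

```typescript
// Hypothetical helper that mirrors the described default; not the actual implementation.
function snapshotState<T>(state: T, custom?: (value: T) => T): T {
  // A registered custom snapshotter takes precedence.
  if (custom) return custom(state);
  try {
    return structuredClone(state);
  } catch {
    // structuredClone throws on values such as functions.
    // The JSON round-trip silently drops them instead.
    return JSON.parse(JSON.stringify(state)) as T;
  }
}

const agentState = { messages: ['hi'], memory: { turn: 1 } };
const checkpoint = snapshotState(agentState);

// Mutating live state afterwards does not affect the checkpoint.
agentState.memory.turn = 2;
console.log(checkpoint.memory.turn); // 1
```

The fallback path also shows why functions and closures do not survive a snapshot: the JSON round-trip keeps data and discards behavior.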
| Phase | Focus | Timeline |
|---|---|---|
| 1 | Core recording/replay engine + trace format | In progress |
| 2 | Partial replay, diff mode, CLI debugger | Upcoming |
| 3 | Framework integrations, test suite | Upcoming |
| 4 | Performance, enterprise features, web UI | Upcoming |
See DEV_PLAN.md for the detailed plan with individual task breakdowns.
Contributions are welcome. Please read CONTRIBUTING.md for development workflow, coding standards, and pull request guidelines.
- TypeScript strict mode enabled
- Minimum 90% test coverage
- Conventional commits
- Prettier + ESLint formatting
- Architecture — System design and component overview
- Development Plan — Roadmap and milestone tracking
- Contributing — How to contribute
- Agent Guide — Agent development reference
MIT © Reaatech and contributors