
Provider Plugin Contract

Ivan Seredkin edited this page May 23, 2026 · 2 revisions

How budi supports a new AI coding agent. The current set is Claude Code, Codex CLI, Copilot CLI, Copilot Chat (VS Code-family), and Cursor; everything below is what a contributor needs to know to add a sixth.

This page describes the shape of the extension surface. The live-path rationale (why a Provider trait at all, why JSONL tailing is the only live path) is pinned in JSONL Tailing as Live Ingestion (ADR-0089). The attribution contract every row must uphold is in SOUL.md.

The Provider trait

Every supported agent implements Provider, defined in crates/budi-core/src/provider.rs. It owns four things and nothing else:

| Method | Returns | Purpose |
| --- | --- | --- |
| discover_files() | Vec<DiscoveredFile> | One-shot enumeration of all transcript files this agent has ever written. Used by budi db import for historical backfill. |
| parse_file(path, content, offset) | Vec<ParsedMessage> (+ new offset) | Incremental parse of one transcript file starting at the stored byte offset. Returns whatever new messages it found. The shared pipeline turns those into canonical rows. |
| watch_roots() | Vec<PathBuf> | The directories the daemon's filesystem tailer subscribes to via notify. New files matching this provider's transcript shape are picked up automatically. |
| sync_direct(...) (optional) | (provider-specific) | Only for agents with a real Usage API. Currently used by Cursor to pull per-request cost/token truth-up from the dashboard API (Cursor Usage API Contract). |

Everything else — cost calculation, repo / branch / ticket attribution, tool-outcome inference, deduplication, cloud sync — runs in the shared pipeline. Providers are parsers, not full ingestion paths. That is the load-bearing rule: a new agent should be a one-file change under crates/budi-core/src/providers/ plus its registration.
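As a rough illustration of the surface in the table above, the sketch below declares a simplified trait and a toy provider. Everything in it is an assumption made for illustration: the struct fields, the exact method signatures, the ToyProvider type, and the whitespace-separated line format (real transcripts are JSONL; a toy format keeps the example dependency-free). It also assumes that only newline-terminated lines are consumed, so a half-written trailing line is retried on the next call. The real definitions live in crates/budi-core/src/provider.rs and may differ.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-ins for the real budi-core types.
#[derive(Debug, Clone, PartialEq)]
pub struct DiscoveredFile {
    pub path: PathBuf,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ParsedMessage {
    pub session_id: String,
    pub model: String,
    pub input_tokens: u64,
    pub output_tokens: u64,
}

/// Assumed shape of the trait; sync_direct is omitted because it is optional.
pub trait Provider {
    fn discover_files(&self) -> Vec<DiscoveredFile>;
    /// Parses `content` starting at byte `offset` and returns the new
    /// messages plus the offset the caller should store for next time.
    fn parse_file(&self, path: &Path, content: &str, offset: usize) -> (Vec<ParsedMessage>, usize);
    fn watch_roots(&self) -> Vec<PathBuf>;
}

/// Toy provider whose "transcript" lines look like: `<session> <model> <in> <out>`.
pub struct ToyProvider {
    pub root: PathBuf,
}

impl Provider for ToyProvider {
    fn discover_files(&self) -> Vec<DiscoveredFile> {
        // A real provider would walk its watch roots here.
        Vec::new()
    }

    fn parse_file(&self, _path: &Path, content: &str, offset: usize) -> (Vec<ParsedMessage>, usize) {
        let mut messages = Vec::new();
        let mut pos = offset.min(content.len());
        // Consume only complete (newline-terminated) lines; a partial
        // trailing line stays unconsumed until the file grows.
        while let Some(rel) = content.get(pos..).and_then(|rest| rest.find('\n')) {
            let line = &content[pos..pos + rel];
            pos += rel + 1;
            let fields: Vec<&str> = line.split_whitespace().collect();
            if let [session, model, input, output] = fields.as_slice() {
                if let (Ok(i), Ok(o)) = (input.parse::<u64>(), output.parse::<u64>()) {
                    messages.push(ParsedMessage {
                        session_id: session.to_string(),
                        model: model.to_string(),
                        input_tokens: i,
                        output_tokens: o,
                    });
                }
            }
        }
        (messages, pos)
    }

    fn watch_roots(&self) -> Vec<PathBuf> {
        vec![self.root.clone()]
    }
}
```

Calling parse_file a second time with the offset returned by the first call yields only the messages appended in between, which is the property the tailer relies on.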

The shared pipeline

After a Provider hands back ParsedMessages, the daemon runs them through an ordered enricher chain. Order matters: each enricher may rely on fields set by the ones before it.

  1. IdentityEnricher — stamps the row with session_id, provider, the daemon install ID, and a deterministic message identity.
  2. GitEnricher — resolves repo_id and git_branch from the message's cwd (or from the per-line gitBranch the transcript carries natively); extracts ticket_id from the branch name with companion ticket_source / ticket_prefix tags.
  3. ToolEnricher — extracts tool-call outcomes from tool-result blocks; emits tool_outcome (success / error / denied / retry) with tool_outcome_source and tool_outcome_confidence siblings.
  4. FileEnricher — extracts per-file attribution from file-aware tool arguments (Read / Write / Edit / MultiEdit / Grep / …); enforces the repo-root privacy boundary (no outside-of-repo paths, no file contents).
  5. CostEnricher — looks up the price for (model, provider) via the single pricing::lookup call and writes cost_cents + pricing_source. See Model Pricing – Embedded Baseline and Runtime Refresh for the manifest contract.
  6. TagEnricher — finalizes the row for SQLite write.

A provider whose transcript already carries some of these fields still goes through the chain; enrichers no-op on rows whose fields are already populated. The chain is also where the cross-message tool-outcome correlation lives, so the live tailer and budi db import produce identical rows from identical inputs.
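The two properties described above (order dependence, and no-ops on already-populated fields) can be sketched with a toy chain. The Row struct, the Enricher trait, the two step types, and the one-entry price table are all hypothetical and far smaller than the real pipeline; the sketch only shows why a cost step placed before an identity step cannot price a row.

```rust
/// Hypothetical, heavily reduced row; the real canonical row has many more fields.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct Row {
    pub model: String,
    pub provider: Option<String>,
    pub cost_cents: Option<u64>,
}

/// Assumed enricher interface for this sketch only.
pub trait Enricher {
    fn name(&self) -> &'static str;
    fn enrich(&self, row: &mut Row);
}

pub struct IdentityStep {
    pub provider: &'static str,
}

impl Enricher for IdentityStep {
    fn name(&self) -> &'static str {
        "identity"
    }
    fn enrich(&self, row: &mut Row) {
        // No-op when the transcript already carried the field.
        if row.provider.is_none() {
            row.provider = Some(self.provider.to_string());
        }
    }
}

pub struct CostStep;

impl Enricher for CostStep {
    fn name(&self) -> &'static str {
        "cost"
    }
    fn enrich(&self, row: &mut Row) {
        if row.cost_cents.is_some() {
            return;
        }
        // Toy price table keyed on (model, provider); the value is invented.
        row.cost_cents = match (row.model.as_str(), row.provider.as_deref()) {
            ("toy-model", Some("toy_agent")) => Some(3),
            _ => None,
        };
    }
}

/// Runs every enricher in order and reports the order they ran in.
pub fn run_chain(chain: &[Box<dyn Enricher>], row: &mut Row) -> Vec<&'static str> {
    let mut ran = Vec::with_capacity(chain.len());
    for step in chain {
        step.enrich(row);
        ran.push(step.name());
    }
    ran
}
```

With the identity step first, the cost step sees a provider and can price the row; with the order reversed, the lookup key is incomplete and cost_cents stays unset.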

The current set

| Provider id | Watch root(s) | Source file | Notes |
| --- | --- | --- | --- |
| claude_code | ~/.claude/projects/<project-hash>/ | providers/claude_code.rs | JSONL, one record per turn. Carries sessionId, model, token counts, cwd, gitBranch natively. |
| codex | ~/.codex/sessions/<session-id>/ | providers/codex.rs | Same JSONL shape as claude_code with slightly different field names, which the provider normalizes. |
| copilot_cli | ~/.copilot/session-state/ | providers/copilot.rs | Standalone Copilot CLI; unrelated to the VS Code extension. |
| copilot_chat | VS Code-family workspaceStorage/ + globalStorage/ across Code, Insiders, Exploration, VSCodium, Cursor, and remote-server installs | providers/copilot_chat.rs | Five envelope shapes, five token-key dispatches, plus the v3 output-only fallback for May-2026+ builds. See Copilot Chat Data Contract (ADR-0092). |
| cursor | state.vscdb (cursorDiskKV bubbles) + ~/.cursor/projects/*/agent-transcripts/ | providers/cursor.rs | Bubbles are primary as of 2026-04-23; the Usage API is a supplementary overage signal. See Cursor Usage API Contract (ADR-0090). |

Each provider has a reference fixture under the crate's tests/ directory. Fixtures are real (scrubbed) transcripts from the maintainer's own machine, not hand-written examples, so regressions surface against shapes we actually see in production.

Adding a new agent

For a hypothetical new agent (call it gemini):

  1. Discover the transcript path and shape. Confirm the agent actually writes parseable JSONL (or that its on-disk shape is at least machine-readable). If it persists conversation state only in memory or in an opaque binary blob, the agent is out of scope for the tailer path until it ships a transcript option. The Cursor case (where the on-disk state is state.vscdb) is the existing precedent for non-JSONL shapes.
  2. Add crates/budi-core/src/providers/gemini.rs implementing Provider. Map the agent's fields into ParsedMessage. Uphold the attribution contract: RFC3339 UTC timestamps, canonical session_id, normalized git_branch (no refs/heads/, no detached HEAD).
  3. Add a fixture under the crate's tests/ tree with a small real transcript (scrubbed). Write a parser test that runs parse_file against the fixture and asserts row counts and key fields.
  4. Register the provider in the daemon's startup wiring.
  5. If the agent's cost model is unusual (per-request flat fee, billing-API truth-up, etc.), extend the relevant downstream surface (CostEnricher for pricing, a sync_direct impl for billing-API reconciliation) rather than baking it into the parser. The Copilot Chat GitHub Billing API truth-up in sync/copilot_chat_billing.rs is the existing precedent for that shape.
  6. If the agent ships an undocumented upstream API (Cursor, Copilot Chat), pin the contract as a wiki ADR before merging the parser. The ADR is the place the next maintainer reaches for when the upstream shape shifts; the parser is updated in lockstep.
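Step 2's branch rule (no refs/heads/, no detached HEAD) might look like the helper below. This is a minimal sketch, not budi's code: the function name normalize_git_branch is hypothetical, and it assumes a detached HEAD shows up in a transcript as an empty string, the literal HEAD, or git's "(HEAD detached at ...)" label, all of which are mapped to None. A real agent may encode the detached state differently, so check its fixture first.

```rust
/// Hypothetical helper: returns the bare branch name, or None when the
/// raw value does not name a branch (empty value or detached HEAD).
pub fn normalize_git_branch(raw: &str) -> Option<String> {
    let trimmed = raw.trim();
    // Drop the fully qualified ref prefix if the agent recorded one.
    let name = trimmed.strip_prefix("refs/heads/").unwrap_or(trimmed);
    // Assumption: these are the spellings a detached HEAD can take.
    if name.is_empty() || name == "HEAD" || name.starts_with("(HEAD detached") {
        return None;
    }
    Some(name.to_string())
}
```

Returning an Option keeps "no branch" distinct from any real branch name, so downstream ticket extraction never sees a placeholder string.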

A new provider PR should not need to touch the cloud sync path, the schema, the auth code, or the dashboard UI. If it does, the design has drifted and we should fix the architecture before merging.

What parse_file must not do

  • No network calls. Providers run on the engineer's laptop; the only IO they do is reading transcript bytes from disk. Network truth-up belongs in sync_direct, which is a scheduled pull — not part of the live hot path.
  • No reading of prompt or response content. Token counts, model id, timestamps, cwd, gitBranch, tool-call arguments needed for file attribution — yes. Prompt text, response text, embedded code blocks, tool-result body — no. The privacy boundary is enforced in the parser, not at the upload boundary.
  • No mutation of the transcript file. Even fixing what looks like a malformed line gets us into territory where we are silently editing the engineer's data on disk.
  • No outside-of-repo file paths. file_attribution::attribute_files rewrites absolute paths relative to the message's cwd / resolved repo root and drops anything that cannot be proven to sit inside the repo root. This runs after the provider, so providers should hand back raw candidate paths from tool arguments rather than pre-resolving them.
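The repo-root check in the last bullet can be approximated as follows. This is a sketch under stated assumptions and not the implementation of file_attribution::attribute_files: the names lexical_join and repo_relative are hypothetical, paths are Unix-style, and resolution is purely lexical (no filesystem access, so symlinks are not followed).

```rust
use std::path::{Component, Path, PathBuf};

/// Resolves `candidate` against `cwd` lexically, folding `.` and `..`
/// without touching the filesystem.
fn lexical_join(cwd: &Path, candidate: &Path) -> PathBuf {
    let mut out = if candidate.is_absolute() {
        PathBuf::new()
    } else {
        cwd.to_path_buf()
    };
    for comp in candidate.components() {
        match comp {
            Component::CurDir => {}
            Component::ParentDir => {
                out.pop();
            }
            other => out.push(other.as_os_str()),
        }
    }
    out
}

/// Returns the repo-relative path, or None when the candidate cannot be
/// shown to sit inside `repo_root`.
pub fn repo_relative(repo_root: &Path, cwd: &Path, candidate: &str) -> Option<PathBuf> {
    let resolved = lexical_join(cwd, Path::new(candidate));
    resolved.strip_prefix(repo_root).ok().map(Path::to_path_buf)
}
```

Because the provider hands back raw candidates, a check of this kind sees the path exactly as the tool call recorded it and can reject `..` escapes and absolute paths in one place.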

Reference implementations

Out of scope here
