Regression testing and reasoning observability for AI agents — catch the run where your agent silently dropped a step, and fail the PR that caused it.
When an agent quietly changes what it does between runs — a dropped tool call, a skipped step, a new loop — DProvenanceKit turns each execution into a queryable, diffable trace so you can see what changed and why — not just what happened. It works with LangChain/LangGraph, the OpenAI Agents SDK, LlamaIndex, CrewAI, plain Python — or any OpenTelemetry-instrumented stack, via built-in OTLP trace ingestion — and the core has zero third-party dependencies.
Run → Record → Query → Diff → Detect regressions → Gate in CI
Two runs of the same agent. The candidate dropped its verify step and looped search — the gate caught it and failed CI. (regenerate)
Guides: Regression testing for AI agents · A CI gate for LLM agents · DProvenanceKit vs LangSmith · OpenAI Agents SDK integration
It's not just the library — it ships the surfaces that make reasoning regressions actionable:
- Gate in CI: a server-less dprovenancekit gate CLI, plus a drop-in GitHub Action (on the GitHub Marketplace) and GitLab CI template that fail a PR/MR when a run structurally diverges from its golden baseline (a dropped, added, or reordered step) and comment the diff.
- Out-of-the-box anomaly rules: Tool Drop and Looping detection with a JSON rule registry, runnable locally or on every PR.
- A hosted visualizer — a web dashboard (single-run span tree, JSON payload inspector, side-by-side structural diff, shareable HTML reports) backed by a regression-gate API and multi-tenant control plane. Available as a separate commercial service — see dprovenance.dev.
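To make the two bundled rule names concrete, here is a minimal stdlib-only sketch of what a tool-drop check and a looping check could look like over plain lists of step names. The function names, the `max_repeats` threshold, and the list-of-strings input are illustrative assumptions; they are not the library's rule registry format or its detection logic.

```python
def find_tool_drops(golden_steps, candidate_steps):
    """Return steps that the golden run executed but the candidate never did."""
    candidate = set(candidate_steps)
    drops = []
    for step in golden_steps:
        if step not in candidate and step not in drops:
            drops.append(step)
    return drops


def find_loops(steps, max_repeats=2):
    """Return steps repeated back-to-back more than max_repeats times."""
    loops = {}
    previous = None
    run_length = 0
    for step in steps:
        run_length = run_length + 1 if step == previous else 1
        previous = step
        if run_length > max_repeats:
            loops[step] = run_length  # keep the longest streak seen so far
    return loops


golden = ["plan", "search", "verify", "answer"]
candidate = ["plan", "search", "search", "search", "answer"]

print(find_tool_drops(golden, candidate))  # ['verify']
print(find_loops(candidate))               # {'search': 3}
```

The candidate here mirrors the demo caption above: it dropped its verify step and looped search.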
See it all in one runnable script: python examples/end_to_end_demo.py.
From PyPI (released builds):
pip install dprovenancekit
pip install "dprovenancekit[langchain]" # + LangChain adapter
pip install "dprovenancekit[openai-agents]"   # + OpenAI Agents adapter

From a checkout (development):

pip install -e ".[dev]"

Requires Python 3.9+; the core has zero third-party dependencies. Releasing is documented in RELEASING.md.
Record your execution, explain what happened, and diff it against a previous run to detect drift—all with a single import.
from dprovenancekit import trace
# 1. Record an execution
with trace("Agent Workflow"):
    with trace("Retrieve Documents"):
        # your retrieval code here
        pass
    with trace("Verify Claims"):
        # your verification code here
        pass
# 2. Save the trace
trace.save("golden_run.sqlite")
# 3. Print a structural explanation
trace.explain()
# --- Execution Trace (b4f8d2…) ---
# ▶ Started Agent Workflow
# ▶ Started Retrieve Documents
# ✔ Finished Retrieve Documents
# ▶ Started Verify Claims
# ✔ Finished Verify Claims
# ✔ Finished Agent Workflow
# 4. Catch regressions when the logic changes: rerun the (now buggy) workflow,
# then diff the current run against the saved golden baseline
trace.diff("golden_run.sqlite")
# --- Trace Diff (Golden vs Current) ---
# ❌ Missing step: Verify Claims

It's that simple to get started. Under the hood, this powers a full suite of anomaly detection, CI gating, and visual trace analysis.
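The idea behind a structural diff can be sketched with the standard library alone. The snippet below compares two ordered lists of step names using `difflib.SequenceMatcher`; `diff_steps` and its message format are hypothetical, and the library's own diff (which is span-aware) is not claimed to work this way.

```python
from difflib import SequenceMatcher


def diff_steps(golden, current):
    """Describe how `current` differs from `golden` as an ordered step list."""
    changes = []
    matcher = SequenceMatcher(a=golden, b=current, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            changes.extend(f"Missing step: {step}" for step in golden[i1:i2])
        if tag in ("insert", "replace"):
            changes.extend(f"Added step: {step}" for step in current[j1:j2])
    return changes


golden = ["Agent Workflow", "Retrieve Documents", "Verify Claims"]
current = ["Agent Workflow", "Retrieve Documents"]

for line in diff_steps(golden, current):
    print(line)  # Missing step: Verify Claims
```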
The library ships the same validation corpus as the Swift version. The headless CLI runs it through the real benchmark runner:
dprovenancekit evaluate # precision/recall/F1 over the standard + adversarial corpora
dprovenancekit diagnose # causal ranking of failure modes
dprovenancekit stability   # determinism boundary: isolated vs perturbed F1 variance

Both corpora score Precision 1.000 / Recall 1.000 / F1 1.000: 8 standard scenarios (reordering, semantic evolution, noise injection, branch collapse, …) and 5 adversarial robustness traps (dependency inversion, partial truncation, semantic substitution, …), matching the Swift implementation case-for-case. These are the project's own bundled validation vectors, held at parity with the Swift reference implementation: an internal regression/parity check, not an external benchmark or third-party evaluation.
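For readers unfamiliar with the three scores, the standard definitions are easy to state in code. This is a generic sketch: how the benchmark runner counts true positives, false positives, and false negatives per scenario is not described here, so the counts below are made-up inputs.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision / recall / F1 from raw counts (0.0 when undefined)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A perfect score: every expected divergence found, nothing spurious reported.
print(precision_recall_f1(8, 0, 0))  # (1.0, 1.0, 1.0)
```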
| Component | Module |
|---|---|
| Event model, priority tiers, drop accounting | event, priority, drop_stats |
| Recording API + ambient context | kit, context |
| Global trace facade (record / save / explain / diff) | facade |
| Stores (in-memory, WAL SQLite, raw read) | store, sqlite_store, raw_store |
| Priority-aware write buffer | write_buffer |
| Query DSL + two backends (AST eval + SQL compiler) | query |
| Live querying + anomaly detection + rule library | live_engine, anomaly, rules |
| Structural diff + span-aware snapshot diff | diff, snapshot_diff |
| Deterministic replay | replay |
| Semantic alignment engine + evidence + verification | alignment_*, verification |
| Benchmark harness, failure diagnoser, corpus | benchmark, corpus |
| Conformance testing | conformance/ (repo directory, not a package module) |
| Regression gate + fingerprinting test helpers | testing, pytest_plugin |
| Visualizer (HTML rendering) | visualizer |
| Local trace viewer server | ui_server |
| Pure view models for a trace viewer | viewmodel |
| Framework-agnostic instrumentation (decorators) | instrument |
| Framework adapters | integrations.langchain, integrations.openai_agents, integrations.llama_index, integrations.crewai, integrations.google_genai, integrations.fastapi, integrations.jupyter, integrations.mcp |
| Shareable HTML regression report | report |
| Headless CLI (gate, anomalies, runs, evaluate) | cli |
The SwiftUI DProvenanceUI target is intentionally not ported (it is Apple-platform UI); its
pure value-model layer (SpanViewModel, flattening) is ported in viewmodel.
DProvenanceKit began as a Swift library for
Apple-platform and on-device AI. This Python implementation brings the same reasoning-layer
observability to Python codebases — agent frameworks, LLM workflows, tool-using models — with
zero third-party dependencies (it uses only the standard library: sqlite3, contextvars,
threading, json, hashlib, uuid, urllib).
It is a faithful port, not a loose reimplementation: it keeps the same architecture and guarantees — synchronous non-blocking recording, priority-aware backpressure, one query language over two backends held at parity, structural diffing, formally-modeled semantic alignment, and by-tier drop accounting so load-shedding is never silent. The original Swift package is unchanged; the two are held equivalent by the conformance suite below.
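"Priority-aware backpressure" and "by-tier drop accounting" can be illustrated with a small bounded buffer. Everything below is an assumption made for illustration: the tier names other than CRITICAL, the eviction policy (evict the oldest event of the lowest buffered tier), and the class itself. The real write buffer lives in `write_buffer` and may behave differently.

```python
from collections import Counter
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical tiers; only CRITICAL is named in this README.
    LOW = 0
    NORMAL = 1
    CRITICAL = 2


class BoundedBuffer:
    """Fixed-capacity buffer that sheds low-priority events first and counts every drop."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []          # (priority, name) pairs in arrival order
        self.dropped = Counter()  # drops per tier, so shedding is never silent

    def record(self, name, priority):
        if len(self.events) < self.capacity:
            self.events.append((priority, name))
            return True
        # Buffer is full: locate the oldest event in the lowest buffered tier.
        victim = min(range(len(self.events)), key=lambda i: self.events[i][0])
        if self.events[victim][0] < priority:
            evicted_priority, _ = self.events.pop(victim)
            self.dropped[evicted_priority.name] += 1
            self.events.append((priority, name))
            return True
        # Nothing less important to evict: drop the incoming event instead.
        self.dropped[priority.name] += 1
        return False


buffer = BoundedBuffer(capacity=2)
buffer.record("token", Priority.LOW)
buffer.record("step", Priority.NORMAL)
buffer.record("guardrail", Priority.CRITICAL)  # evicts "token"
buffer.record("token2", Priority.LOW)          # dropped on arrival
print(dict(buffer.dropped))                    # {'LOW': 2}
```

Because `record` never blocks and never raises on overflow, the caller keeps running while the counter keeps an honest tally of what was shed.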
Keeping the Swift and Python SDKs behaviorally equivalent is enforced, not hoped for.
conformance/ holds Trace Specification v1 — a language-neutral contract plus
frozen golden vectors that pin the run fingerprint, the alignment profile hash, canonical payload
encoding, query semantics, and alignment verdicts.
python -m pytest tests/test_conformance.py # the Python SDK's claim of conformance
python conformance/generate_vectors.py       # intentionally re-freeze the contract

The committed conformance/vectors/*.json are the contract: any SDK (Swift today, Rust or
TypeScript later) proves equivalence by reproducing the same files. See
conformance/TRACE_SPEC_v1.md.
Framework adapters live in dprovenancekit.integrations and are the only parts of the package with
third-party dependencies — the core stays pure standard library, and nothing imports an adapter
unless you do.
pip install "dprovenancekit[langchain]"

from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.langchain import DProvenanceTracer, LangChainTraceEvent
store = SQLiteTraceStore(LangChainTraceEvent, "traces.sqlite")
tracer = DProvenanceTracer(store)
with tracer.trace(context_id="customer-42") as cb:
    answer = chain.invoke(question, config={"callbacks": [cb]})
# The run is now recorded; query it, diff it against a known-good run, or
# compare run fingerprints to detect when the agent took a different path.

DProvenanceCallbackHandler translates LangChain's
callback stream into a trace: each on_llm_start / on_tool_start / on_retriever_start /
on_chain_start (and its completion) becomes a typed event in execution order, LangChain's
run_id/parent_run_id become the trace's span tree, the active model/tool/retriever becomes
the engine, and (by default) lifecycle provenance edges are emitted (DERIVED_FROM
start→completion, INFORMED parent→child). Because events flow through the same recording path as
hand-written ones, the whole toolkit applies: a run's fingerprint is the structural identity of
the agent's execution path, so two runs that diverge (a tool called in a different order, a
retrieval step skipped) produce different fingerprints — a cheap regression signal. Options:
capture_payloads (prompt/completion/IO previews), link_lifecycle (edges), record_chains
(LCEL/LangGraph chain noise).
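A fingerprint of this kind can be pictured as a hash over the ordered sequence of steps. The sketch below hashes `(event_type, engine)` pairs with SHA-256; the choice of fields, the JSON encoding, and the function name are assumptions for illustration. The actual fingerprint is pinned by Trace Specification v1, not by this snippet.

```python
import hashlib
import json


def run_fingerprint(steps):
    """Hash an ordered list of (event_type, engine) pairs into a hex digest."""
    canonical = json.dumps(
        [list(step) for step in steps],
        separators=(",", ":"),  # no whitespace, so the encoding is stable
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


golden = [("retriever.start", "docs"), ("tool.start", "verify"), ("llm.start", "gpt-4o")]
same_path = list(golden)
skipped_verify = [("retriever.start", "docs"), ("llm.start", "gpt-4o")]

print(run_fingerprint(golden) == run_fingerprint(same_path))       # True
print(run_fingerprint(golden) == run_fingerprint(skipped_verify))  # False
```

Comparing two digests is a constant-time check, which is what makes it useful as a first-pass signal before running a full diff.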
Officially listed — this adapter appears in the OpenAI Agents SDK's own docs, in the external tracing processors list (merged upstream in openai/openai-agents-python#3726).
pip install "dprovenancekit[openai-agents]"

from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.openai_agents import register, OpenAIAgentsTraceEvent
store = SQLiteTraceStore(OpenAIAgentsTraceEvent, "traces.sqlite")
register(store) # registers a global tracing processor
# ... run your agents normally; each run is recorded ...

DProvenanceTracingProcessor implements the SDK's
TracingProcessor: each agent run becomes a trace-run (context_id = the trace name), and every
span start/end becomes a typed event — agent.start, generation.end, function.start,
guardrail.error, … — in execution order. The span's span_id/parent_id become the span
tree, the active agent/tool/model becomes the engine, errors and triggered guardrails are
recorded at CRITICAL, and lifecycle provenance edges are emitted (same
DERIVED_FROM/INFORMED model). One registered processor captures every run; the same
fingerprint/diff/align tooling then applies.
pip install "dprovenancekit[llama-index]"

from llama_index.core import Settings
from dprovenancekit import DProvenanceKit, SQLiteTraceStore
from dprovenancekit.integrations.llama_index import (
    DProvenanceLlamaIndexCallbackHandler,
    LlamaIndexTraceEvent,
)
kit = DProvenanceKit(LlamaIndexTraceEvent)
store = SQLiteTraceStore(LlamaIndexTraceEvent, "traces.sqlite")
# The handler records into an active run
with kit.run(context_id="qa-session", store=store) as run:
    handler = DProvenanceLlamaIndexCallbackHandler(run)
    Settings.callback_manager.add_handler(handler)
    # Execute queries normally; they are recorded to the trace store.
    # `index` is assumed to be a LlamaIndex index you have already built.
    response = index.as_query_engine().query("What did the author do growing up?")
# Flush buffered events so the recorded run is durable before the script exits
store.flush()

pip install "dprovenancekit[crewai]"

Modern CrewAI (0.85+, which dropped LangChain) emits its own lifecycle events on a global
event bus. DProvenanceKit hooks that bus with a BaseEventListener — constructing the
listener registers it, and every crew.kickoff() after that is recorded as a run. The
engine is the agent's role, the tool's name, or the model:
from crewai import Agent, Crew, Task
from dprovenancekit import SQLiteTraceStore
from dprovenancekit.integrations.crewai import (
    CrewAITraceEvent,
    DProvenanceKitEventListener,
)
store = SQLiteTraceStore(CrewAITraceEvent, "traces.sqlite")
DProvenanceKitEventListener(store) # registers on crewai's event bus
researcher = Agent(
    role="Researcher",
    goal="Find accurate information",
    backstory="A meticulous researcher",
)
task = Task(
    description="Summarize the latest developments in AI agent testing",
    expected_output="A short summary",
    agent=researcher,
)
crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()   # recorded: crew / task / agent / tool / llm events
store.flush()    # make the run durable before the process exits

Each kickoff becomes one diffable run: crew.start/.end, task.*, agent.*,
tool.*, and llm.* events in a span tree, with DERIVED_FROM / INFORMED edges.
Events are ordered by CrewAI's own emission_sequence (its handler bus is multithreaded,
so arrival order isn't emission order), which keeps the run — and its fingerprint —
stable. Two runs of the same crew therefore share a fingerprint, so the
fingerprint / diff / gate tooling flags a dropped task, a skipped tool, or a looping
agent.
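Why ordering by a sequence number stabilizes the run can be shown in a few lines. The dictionaries below are stand-ins for CrewAI events (the real events are objects, and only the `emission_sequence` name comes from the text above); `order_by_emission` is a hypothetical helper.

```python
def order_by_emission(events):
    """Restore emission order regardless of the order handlers delivered events in."""
    return sorted(events, key=lambda event: event["emission_sequence"])


# The same three events, delivered in two different orders by worker threads.
arrival_a = [
    {"type": "tool.start", "emission_sequence": 2},
    {"type": "crew.start", "emission_sequence": 1},
    {"type": "tool.end", "emission_sequence": 3},
]
arrival_b = [arrival_a[2], arrival_a[0], arrival_a[1]]

ordered_a = [event["type"] for event in order_by_emission(arrival_a)]
ordered_b = [event["type"] for event in order_by_emission(arrival_b)]
print(ordered_a)               # ['crew.start', 'tool.start', 'tool.end']
print(ordered_a == ordered_b)  # True
```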
If your agent is already instrumented for OpenTelemetry — the official GenAI semantic
conventions (gen_ai.*), OpenInference (everything Arize Phoenix instruments: LangChain,
LlamaIndex, CrewAI, the OpenAI Agents SDK, DSPy, …), OpenLLMetry/Traceloop, or the Vercel
AI SDK — you don't need to re-instrument anything. Export the traces as OTLP JSON (the
OTel Collector's file exporter, or any OTLP/HTTP JSON export) and ingest them:
dprovenancekit ingest golden.json candidate.json --db traces.sqlite
dprovenancekit gate --db traces.sqlite --golden <run-id> --candidate <run-id>

Each OTel trace becomes a run; spans are normalized to vendor-neutral steps
(llm_call, tool_call, agent_invocation, …) with the model/tool/agent name as the
engine, so runs recorded by different instrumentation dialects diff cleanly against
each other. Run ids derive deterministically from OTel trace ids, so re-ingesting a file
never duplicates runs. Zero new dependencies — the parser is stdlib-only, like the rest
of the core. Python API: dprovenancekit.otel_ingest.ingest_otlp.
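One common way to get idempotent ingestion is to derive the run id from the trace id with a name-based UUID. The sketch below uses `uuid.uuid5` under a made-up namespace; the namespace, the helper names, and the in-memory `runs` dict are all assumptions, and the library's actual derivation is not documented here.

```python
import uuid

# Hypothetical namespace, used only for this illustration.
RUN_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/otel-runs")


def run_id_for_trace(trace_id_hex):
    """Map an OTel trace id (hex string) to a stable run id."""
    return str(uuid.uuid5(RUN_NAMESPACE, trace_id_hex.lower()))


def ingest(runs, trace_id_hex, spans):
    """Store a trace as a run; re-ingesting the same trace leaves `runs` unchanged."""
    run_id = run_id_for_trace(trace_id_hex)
    runs.setdefault(run_id, spans)
    return run_id


runs = {}
first = ingest(runs, "4bf92f3577b34da6a3ce929d0e0e4736", ["llm_call", "tool_call"])
second = ingest(runs, "4BF92F3577B34DA6A3CE929D0E0E4736", ["llm_call", "tool_call"])
print(first == second, len(runs))  # True 1
```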
The live LangChain and OpenAI Agents integrations emit framework-native event names by
default (toolStarted, function.start, …). Pass canonical=True and they instead
record the same vendor-neutral vocabulary OTel ingestion produces
(tool_call.*, llm_call.*, agent_invocation.*), keeping the original name in a
native_type attribute:
register(store, canonical=True) # OpenAI Agents SDK
DProvenanceTracer(store).trace("case-1", canonical=True)   # LangChain

Now runs recorded from OpenAI Agents, LangChain, and OTel ingestion speak one step-type
vocabulary, so the bundled agent.json ruleset
(dprovenancekit anomalies --rules agent) fires on all of them, and a run from one
framework is comparable to the same agent recorded under another. (Canonical mode
rewrites event types, not engine/component names — so cross-framework runs become
comparable, not byte-identical: an LLM step's engine is gpt-4o under OpenAI Agents but
ChatOpenAI under LangChain.) It's opt-in, so existing golden baselines — keyed on the
native names — are unaffected.
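The rewrite that canonical mode performs can be pictured as a lookup that swaps the event type and keeps the original under `native_type`. The mapping table below is a guess made for illustration (the README names the two vocabularies but not the exact pairs), and `canonicalize` is a hypothetical helper, not the adapters' code.

```python
# Hypothetical native -> canonical pairs; the real table may differ.
CANONICAL_TYPES = {
    "toolStarted": "tool_call.start",        # LangChain-style native name
    "function.start": "tool_call.start",     # OpenAI Agents-style native name
    "generation.end": "llm_call.end",
    "agent.start": "agent_invocation.start",
}


def canonicalize(event):
    """Return a copy of `event` with a vendor-neutral type and the native name preserved."""
    native = event["type"]
    canonical = CANONICAL_TYPES.get(native)
    if canonical is None:
        return dict(event)  # unknown types pass through unchanged
    attributes = {**event.get("attributes", {}), "native_type": native}
    return {**event, "type": canonical, "attributes": attributes}


langchain_event = {"type": "toolStarted", "engine": "search"}
agents_event = {"type": "function.start", "engine": "search"}

print(canonicalize(langchain_event)["type"] == canonicalize(agents_event)["type"])  # True
print(canonicalize(agents_event)["attributes"]["native_type"])                      # function.start
```

Note that `engine` is left untouched, which matches the caveat above: types converge, engine names do not.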
dprovenancekit.testing turns "did my agent regress?" into one assertion you can drop into any test
or CI step. Give it a golden run (known-good) and a candidate run (what your current code
produced); it aligns them and fails with a readable diagnostic if the candidate diverged.
from dprovenancekit.testing import assert_no_regression
assert_no_regression(golden=golden_run, candidate=candidate_run)

Strict by default: any removed, added, or changed (ambiguous) step fails, and a removed or
reordered CRITICAL step is additionally a HIGH-severity regression (reordering a critical step can
invert a dependency). Loosen with max_regression_level (gate only on severity) or
allow_divergent_steps (tolerate benign per-step changes), or pass a custom evaluator to define
what "equivalent" means (e.g. ignore volatile fields like token counts).
RegressionGate(...).check(...) returns a RegressionReport (no raise) for richer assertions.
Detecting reordered steps requires a span-aware profile (AlignmentProfile.developer_debug_v1);
the default linear profile treats a pure reorder as still-matching. Complements
AlignmentSnapshotValidator (an exact output-hash snapshot): the gate works on two runs and reasons
about regression severity.
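To show what "ignore volatile fields" can mean in practice, here is a standalone equivalence check over two payload dictionaries. The field names and the function shape are assumptions; the signature that the `evaluator` parameter actually expects is not shown in this README, so treat this as the comparison logic only.

```python
# Hypothetical field names that change between runs without changing behavior.
VOLATILE_FIELDS = {"token_count", "latency_ms", "timestamp"}


def strip_volatile(payload):
    """Drop fields that vary run to run and carry no structural meaning."""
    return {key: value for key, value in payload.items() if key not in VOLATILE_FIELDS}


def payloads_equivalent(golden_payload, candidate_payload):
    """Treat two step payloads as equal when they differ only in volatile fields."""
    return strip_volatile(golden_payload) == strip_volatile(candidate_payload)


golden_step = {"tool": "search", "query": "refund policy", "token_count": 412}
candidate_step = {"tool": "search", "query": "refund policy", "token_count": 398}
changed_step = {"tool": "search", "query": "shipping policy", "token_count": 412}

print(payloads_equivalent(golden_step, candidate_step))  # True
print(payloads_equivalent(golden_step, changed_step))    # False
```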
The bundled pytest plugin turns the gate into snapshot testing for reasoning traces — no run ids, no store plumbing:
def test_research_agent(golden_trace):
    with golden_trace("research-agent"):
        run_my_agent()  # anything using @traced / record_event / an adapter

pytest --dprov-update-golden   # record (or intentionally update) the baseline, then commit it
pytest                         # every later run gates against tests/goldens/research-agent.sqlite

The baseline is a SQLite file you commit next to your tests; the fixture records the block as a
candidate run and fails the test when its execution diverges from the baseline. Configure the directory with the
dprov_golden_dir ini option, pass gate options per test
(golden_trace("name", max_regression_level="high")), and use the context manager's .run to
wire a framework adapter inside the block.
dprovenancekit gate and the GitHub Action can select runs by context id
instead of run id — --golden-context / --candidate-context pick the newest run recorded with
that context, so CI scripts never extract run UUIDs:
dprovenancekit gate --db traces.sqlite --golden-context golden --candidate-context candidate

Copy-paste CI setups live in examples/ci/: a cloud-sync
github-workflow.yml, a zero-dependency
artifact-baseline pair (record-baseline.yml +
agent-regression-gate.yml) with an
anomaly ruleset, and a
GitLab template — see the
examples/ci README for the baseline-management and
db-path routing notes.
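Selecting "the newest run recorded with that context" amounts to a filtered, ordered lookup. The sketch below demonstrates it against an in-memory SQLite table whose schema (`runs`, `run_id`, `context_id`, `started_at`) is invented for this example and should not be read as the layout of a real traces.sqlite file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, context_id TEXT, started_at REAL)")
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [
        ("run-a", "golden", 100.0),
        ("run-b", "candidate", 150.0),
        ("run-c", "candidate", 200.0),
    ],
)


def newest_run(connection, context_id):
    """Return the run id most recently recorded under `context_id`, or None."""
    row = connection.execute(
        "SELECT run_id FROM runs WHERE context_id = ? ORDER BY started_at DESC LIMIT 1",
        (context_id,),
    ).fetchone()
    return row[0] if row else None


print(newest_run(conn, "golden"))     # run-a
print(newest_run(conn, "candidate"))  # run-c
```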
examples/regression_testing.py is the end-to-end story in ~150
readable lines: record a golden run of a fact-checking agent (retrieve → verify → decide), then
catch a later run that skips its verification step — via both the fast fingerprint check and the
detailed alignment verdict (which flags the dropped claimVerified step as a HIGH regression).
python examples/regression_testing.py

It self-asserts its verdicts, so it doubles as an executable test of the headline use case.
Not using a framework? Instrument a hand-written agent loop directly — no event type to define, zero
dependencies (ships in core as dprovenancekit.instrument):
from dprovenancekit import InMemoryTraceStore, traced, traced_run, record_event
@traced
def search(query): ...
@traced
def answer(question, sources): ...
store = InMemoryTraceStore()
question = "How do I reset my password?"  # example input
with traced_run(store, context_id="ticket-42"):
    sources = search(question)
    record_event("plan.chosen", {"strategy": "rag"})
    reply = answer(question, sources)

@traced records a "<name>.start" / ".end" / ".error" event pair per call in its own span
(the function name is the engine), nests calls in the span tree, and emits the same
DERIVED_FROM / INFORMED provenance edges as the framework adapters. record_event(...) drops an
ad-hoc event (a decision, a chosen branch). Plain functions, async def, generators, and async
generators are all supported (for a generator, start/end bracket the full iteration).
Instrumentation never changes behavior — capture is failure-proof and exceptions pass through
unchanged. Outside a traced_run the decorators are transparent, so instrumented code is safe to
call untraced. The trace it produces is identical in shape to the adapter-produced ones, so
fingerprint / diff / align / the regression gate all apply.
python -m pytest

380+ tests. A default run (core plus whichever adapters you have installed) is ~380; CI's full matrix is higher. Coverage spans Swift-parity tests ported from the original suite, cross-language conformance checks against the frozen Trace Specification v1 vectors, per-adapter integration tests (LangChain, OpenAI Agents SDK, LlamaIndex, CrewAI), instrumentation-layer tests, regression-gate tests, facade and visualizer tests, ecosystem integration tests (FastAPI, Jupyter, MCP), and the regression-testing example run as a self-asserting test. (Real-framework tests run only when the integrations are installed, otherwise skipped.)
Distributed under the Apache License 2.0. See LICENSE.
