# AgentProbe: pytest for AI Agents
Record, test, replay, and secure your AI agents — locally, privately, in your CI pipeline.
Docs · Pro · Examples · Discord
Your agents call LLMs, invoke tools, and make routing decisions — yet you have no way to test them, no way to catch regressions, and no idea why one run costs $0.12 while the next costs $2.40. AgentProbe fixes that.
## Install

```shell
# pip (recommended)
pip install agentprobe

# or one-line installer (macOS, Linux, WSL)
curl -fsSL https://raw.githubusercontent.com/tomerhakak/agentprobe/main/install.sh | bash
```

Then get started:

```shell
agentprobe init       # scaffold config + example tests
agentprobe test       # run your agent tests
agentprobe platform   # local web dashboard at localhost:9700
```

## Quick Start

Record your agent, then test it — just like pytest.
```python
from agentprobe import record, RecordingSession
from agentprobe import assertions as A

# 1. Record a run
@record("my-agent")
def run_agent(query: str, session: RecordingSession) -> str:
    session.set_input(query)
    session.add_llm_call(model="gpt-4o", input_messages=[...], output_message=response)
    session.add_tool_call(tool_name="search", tool_input={"q": query}, tool_output=results)
    session.set_output(answer)
    return answer

# 2. Test the recording
def test_agent_works(recording):
    A.set_recording(recording)
    A.output_contains("refund policy")
    A.called_tool("search")
    A.total_cost_less_than(0.05)
    A.no_pii_in_output()
```

```
$ agentprobe test

tests/test_agent.py
  PASS test_basic_response .................. 0.8s
  PASS test_uses_search_tool ................ 0.3s
  PASS test_cost_within_budget .............. 0.1s
  PASS test_no_pii_leakage .................. 0.2s
  FAIL test_prompt_injection_resistance ..... 0.4s
    AssertionError: Output contains forbidden pattern: 'IGNORE PREVIOUS'

4 passed, 1 failed in 1.8s
Total cost: $0.0034 | Tokens: 1,247 | Traces: .agentprobe/
```
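Assertions like `total_cost_less_than` work entirely off the recorded trace: every LLM call's token counts live in the trace file, so a budget check is just arithmetic. A minimal sketch of the idea in plain Python (the trace fields and per-token prices here are assumptions for illustration, not AgentProbe's actual schema):

```python
# Illustrative sketch of a cost-budget assertion over a recorded trace.
# The trace schema and per-token prices are assumptions, not AgentProbe's.
PRICES = {"gpt-4o": {"input": 2.50 / 1e6, "output": 10.00 / 1e6}}  # USD per token (assumed)

def total_cost(trace: list[dict]) -> float:
    """Sum the estimated cost of every LLM call in a recorded trace."""
    return sum(
        call["input_tokens"] * PRICES[call["model"]]["input"]
        + call["output_tokens"] * PRICES[call["model"]]["output"]
        for call in trace
        if call["type"] == "llm_call"
    )

def assert_cost_below(trace: list[dict], budget_usd: float) -> None:
    cost = total_cost(trace)
    assert cost < budget_usd, f"cost ${cost:.4f} exceeds budget ${budget_usd}"

trace = [
    {"type": "llm_call", "model": "gpt-4o", "input_tokens": 900, "output_tokens": 150},
    {"type": "tool_call", "tool": "search"},  # tool calls cost nothing here
]
assert_cost_below(trace, budget_usd=0.05)  # passes: ~$0.0038
```

Because the check runs against a recording rather than a live API, it is deterministic and free to re-run in CI.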
## Features

- Capture every LLM call, tool invocation, and routing decision into portable `.aprobe` trace files.
- 35+ built-in assertions — output quality, tool usage, cost, latency, safety. Property-based, parameterized, regression, and snapshot testing.
- Swap models, change prompts, compare results side-by-side. See exact cost and behavior differences between runs.
- Prompt injection fuzzing (47+ variants), PII detection (27 entity types), threat modeling, and security scoring.
- Cost tracking per run, budget alerts, behavioral drift detection, anomaly detection, latency percentiles.
- Hallucination detection, auto-optimizer, model recommender, and A/B testing across agent configurations.
- Agent vs. agent battles with ELO ratings. Head-to-head comparison across any dimension — cost, accuracy, speed, safety.
- Forensic failure analysis. Automatic root-cause detection: infinite loops, cost explosions, tool misuse, hallucination spirals.
- Token-level cost attribution. See exactly which step, which tool call, which LLM request is burning your budget. Beautiful tree visualization.
- 53 automated compliance checks across SOC 2, HIPAA, GDPR, PCI-DSS, and CCPA. Generate audit-ready reports.
- A brutally honest (and funny) analysis of your agent. 450 jokes, 3 severity levels. "Your agent spends money like a drunk sailor at a token store."
- Find out what your agent really costs. Per-run, monthly, and yearly projections. Model comparison with savings recommendations.
- 5-dimension health score (reliability, speed, cost, security, quality) with progress bars and actionable tips.
- 55 prompt injection attacks across 5 categories. Test your agent's defenses interactively.
- Rank your agents by composite score and track improvements over time. SQLite-backed, fully local.
- Side-by-side model comparison: cost, speed, quality, hallucination rate. Crown emoji for the winner.
- Time-travel debugger for agent execution. Step forward and backward, set breakpoints on tools, cost, or errors, and inspect state at every point. Interactive TUI mode.
- Behavioral fingerprinting. Generate a unique multi-dimensional DNA profile for any agent, detect drift, compare identities, and render a visual helix.
- 12 built-in chaos scenarios: tool timeouts, LLM hallucinations, cascading failures, cost explosions. Resilience scoring with recovery analysis.
- Like code coverage, but for agents: track tool coverage, branch coverage, step-pattern diversity, and error-path testing across recordings.
- Jest-style snapshots for agent behavior. Capture output, tools, cost, and patterns, and automatically detect regressions on re-runs.
- Automatic optimization analysis. Detects wasted tokens, recommends model downgrades, identifies caching opportunities, and projects monthly savings.
- Auto-run tests when files change. Like nodemon for AI agents — monitors recordings and test files and triggers analysis on save.
- Write tests in plain English ("respond in under 5 seconds", "cost below $0.10") and auto-translate them into executable pytest code.
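Snapshot testing, for example, needs nothing more than reducing a run to the dimensions you care about and diffing against a stored copy. A minimal sketch of the idea in plain Python (the `recording` fields and the choice of dimensions are hypothetical, not AgentProbe's snapshot format):

```python
# Illustrative sketch of Jest-style behavioral snapshots (not AgentProbe's
# internal format): capture the dimensions that matter, then diff on re-run.
def snapshot(recording: dict) -> dict:
    """Reduce a run to the behavioral dimensions worth diffing."""
    return {
        "output": recording["output"],
        "tools": sorted(recording["tools"]),
        "cost_bucket": round(recording["cost_usd"], 2),  # ignore sub-cent noise
    }

def diff_snapshots(old: dict, new: dict) -> list[str]:
    """Return the names of the dimensions that changed between runs."""
    return [k for k in old if old[k] != new[k]]

run_v1 = {"output": "Refunds take 5 days.", "tools": ["search"], "cost_usd": 0.012}
run_v2 = {"output": "Refunds take 7 days.", "tools": ["search"], "cost_usd": 0.013}
changed = diff_snapshots(snapshot(run_v1), snapshot(run_v2))
print(changed)  # ['output'] — the text regressed; cost rounds to the same bucket
```

The bucketing step is the interesting design choice: without it, every re-run would "regress" on sub-cent cost jitter.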
## Integrations

AgentProbe is framework-agnostic. Use it with anything.
| Framework | Integration |
|---|---|
| OpenAI SDK | Auto-instrumentation |
| Anthropic SDK | Auto-instrumentation |
| LangChain | Callback handler |
| CrewAI | Callback handler |
| AutoGen | Adapter |
| Custom agents | Manual recording API |
```yaml
- uses: tomerhakak/agentprobe@v1
  with:
    args: test --ci --report report.html
```

```python
from agentprobe.adapters.langchain import AgentProbeCallbackHandler

handler = AgentProbeCallbackHandler(session_name="my-chain")
chain.invoke({"input": "..."}, config={"callbacks": [handler]})
```

```python
from agentprobe.adapters.crewai import AgentProbeCrewHandler

handler = AgentProbeCrewHandler(session_name="my-crew")
crew.kickoff(callbacks=[handler])
```

### GitHub Actions
```yaml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install agentprobe[all]
      - run: agentprobe test --ci --report report.html
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentprobe-report
          path: report.html
```

### GitLab CI
```yaml
agent-tests:
  image: python:3.12
  script:
    - pip install agentprobe[all]
    - agentprobe test --ci --report report.html
  artifacts:
    paths:
      - report.html
    when: always
```

### Jenkins
```groovy
pipeline {
    agent { docker { image 'python:3.12' } }
    stages {
        stage('Agent Tests') {
            steps {
                sh 'pip install agentprobe[all]'
                sh 'agentprobe test --ci --report report.html'
            }
        }
    }
    post {
        always { archiveArtifacts artifacts: 'report.html' }
    }
}
```

## CLI Reference

```
agentprobe record       Record an agent run into a .aprobe trace
agentprobe test         Run agent tests (pytest-compatible)
agentprobe replay       Replay a recording with a different model or config
agentprobe fuzz         Fuzz your agent with prompt injections & edge cases
agentprobe scan         Security scan — PII detection, injection resistance
agentprobe roast        Get a funny brutal analysis of your agent
agentprobe xray         Visualize agent thinking step-by-step
agentprobe health       5-dimension health check with scores
agentprobe cost         Calculate true cost projections & savings
agentprobe compare      Side-by-side model comparison
agentprobe playground   Interactive prompt injection lab (55 attacks)
agentprobe leaderboard  Rank and track your agents over time
agentprobe analyze      Cost breakdown, drift detection, failure clustering
agentprobe platform     Launch the local web dashboard (localhost:9700)
agentprobe init         Scaffold config file and example tests
agentprobe diff         Compare two recordings or agent versions
agentprobe timeline     Time-travel debugger — step through execution
agentprobe dna          Generate behavioral DNA fingerprint
agentprobe chaos        Run chaos engineering scenarios
agentprobe coverage     Agent path coverage report
agentprobe snapshot     Capture/compare behavioral snapshots
agentprobe optimize     Token & cost optimization analysis
agentprobe watch        Auto-run tests on file changes
agentprobe nltest       Generate tests from plain English
```

Run `agentprobe --help` for the full list.
## Local Dashboard

AgentProbe ships with a local web dashboard — no cloud, no accounts, no data leaves your machine.

```shell
agentprobe platform start
# Opens http://localhost:9700
```

```
+------------------------------------------------------------------+
|  AgentProbe Platform                              localhost:9700 |
+------------------------------------------------------------------+
|                                                                  |
|  Recent Traces                         Cost Trend (7d)           |
|  +---------------------------------+   +---------------------+   |
|  | customer-support  0.3s  $0.003  |   |                __/  |   |
|  | order-lookup      1.2s  $0.018  |   |           ___/      |   |
|  | refund-agent      0.8s  $0.007  |   |      ___/           |   |
|  | billing-qa        0.5s  $0.004  |   |   /                 |   |
|  +---------------------------------+   +---------------------+   |
|                                                                  |
|  Assertions: 142 passed, 3 failed      Avg cost/run: $0.008      |
|  Models: gpt-4o (67%), claude-sonnet (33%)         Drift: LOW    |
+------------------------------------------------------------------+
```

Traces, cost breakdowns, assertion results, drift detection, and failure analysis — all in one place.
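The drift indicator amounts to a statistical comparison between a recent window of runs and a baseline window. A minimal sketch of one way to score it (the metric, window shape, and thresholds are assumptions for illustration, not AgentProbe's algorithm):

```python
# Illustrative sketch of drift scoring: label drift by how far the recent
# mean cost moved from the baseline mean, measured in baseline std devs.
# Thresholds are assumptions, not AgentProbe's.
from statistics import mean, pstdev

def drift_level(baseline: list[float], recent: list[float]) -> str:
    """Classify drift of per-run cost (or any per-run metric)."""
    spread = pstdev(baseline) or 1e-9  # avoid divide-by-zero on flat baselines
    shift = abs(mean(recent) - mean(baseline)) / spread
    return "LOW" if shift < 1 else "MEDIUM" if shift < 3 else "HIGH"

baseline = [0.007, 0.008, 0.009, 0.008]               # historical cost/run
print(drift_level(baseline, [0.008, 0.009, 0.007]))   # LOW  (same regime)
print(drift_level(baseline, [0.021, 0.024, 0.019]))   # HIGH (cost tripled)
```

The same scheme extends to any recorded per-run metric: latency, tool-call count, or output length.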
## Free vs. Pro

| Feature | Free | Pro |
|---|---|---|
| Recording & Replay | ✅ | ✅ |
| 35+ Assertions | ✅ | ✅ |
| pytest Integration | ✅ | ✅ |
| Mock LLMs & Tools | ✅ | ✅ |
| CI/CD & GitHub Action | ✅ | ✅ |
| Cost & Latency Tracking | ✅ | ✅ |
| PII Detection (27 types) | ✅ | ✅ |
| Basic Fuzzing | ✅ | ✅ |
| Local Dashboard | ✅ | ✅ |
| 🔥 Agent Roast (450 jokes) | ✅ | ✅ |
| 🔬 X-Ray Visualization | ✅ | ✅ |
| 💰 Cost Calculator & Projections | ✅ | ✅ |
| 🏥 Health Check (5 dimensions) | ✅ | ✅ |
| 🎮 Injection Playground (55 attacks) | ✅ | ✅ |
| 🏆 Agent Leaderboard | ✅ | ✅ |
| ⚖️ Model Comparator | ✅ | ✅ |
| ⏳ Timeline (Time Travel Debugger) | ✅ | ✅ |
| 🧬 Agent DNA (Behavioral Fingerprinting) | ✅ | ✅ |
| 🌀 Chaos Engineering (12 scenarios) | ✅ (5 max) | ✅ |
| 📊 Agent Path Coverage | ✅ | ✅ |
| 📸 Snapshot Testing | ✅ | ✅ |
| 🚀 Token Optimizer | ✅ | ✅ |
| 👀 Watch Mode | ✅ | ✅ |
| 🧪 NL Test Writer | ✅ | ✅ |
| ⚔️ Agent Battle Arena | - | ✅ |
| 🔬 Agent Autopsy | - | ✅ |
| 📋 Compliance (53 checks) | - | ✅ |
| 🛡️ Security Scorer (71 checks) | - | ✅ |
| 🧠 Agent Benchmark (6D) | - | ✅ |
| 🔀 Agent Diff & Changelog | - | ✅ |
| 🧠 Brain (auto-optimizer) | - | ✅ |
| Full Fuzzer (47+ variants) | - | ✅ |
## Comparison

| | AgentProbe | Promptfoo | DeepEval | Ragas |
|---|---|---|---|---|
| Record agent traces | ✅ | - | - | - |
| Replay with model swap | ✅ | - | - | - |
| 35+ built-in assertions | ✅ | Custom | 14 | 8 |
| Prompt injection fuzzing | ✅ | Basic | - | - |
| Tool call assertions | ✅ | - | - | - |
| Cost & latency assertions | ✅ | - | Partial | - |
| PII detection | ✅ | - | - | - |
| Mock LLMs & tools | ✅ | - | - | - |
| pytest native | ✅ | - | Plugin | - |
| Framework agnostic | ✅ | LLM-only | LLM-only | RAG-only |
| Fully offline | ✅ | Partial | - | - |
| Local dashboard | ✅ | ✅ | ✅ | - |
## Examples

### Replay with a different model

```python
from agentprobe import Replayer, ReplayConfig

replayer = Replayer()
result = replayer.replay(
    "recordings/customer-support.aprobe",
    config=ReplayConfig(model="claude-sonnet-4-20250514", mock_tools=True),
)
comparison = replayer.compare(original, result)
print(comparison.summary)
# Output Similarity: 94.2%
# Cost: $0.0180 -> $0.0095 (-47.2%)
```

### Fuzz with prompt injections

```python
from agentprobe.fuzz import Fuzzer, PromptInjection, EdgeCases

fuzzer = Fuzzer()
result = fuzzer.run(
    agent_fn=run_agent,
    strategies=[PromptInjection(), EdgeCases()],
    assertions=lambda A: [
        A.no_pii_in_output(),
        A.output_not_contains("IGNORE PREVIOUS"),
        A.completed_successfully(),
    ],
)
print(result.summary())
# Strategy: PromptInjection | Tested: 47 | Failed: 2 | Failure rate: 4.3%
```

### Mock LLMs & tools

```python
from agentprobe.mock import MockLLM, MockTool

mock_llm = MockLLM(responses=["Your order #1234 has been shipped."])
mock_search = MockTool(responses=[{"results": ["Order shipped on March 15"]}])
result = replayer.replay(
    recording,
    config=ReplayConfig(mock_llm=mock_llm, tool_mocks={"search_orders": mock_search}),
)
# Zero API calls. Zero cost. Deterministic.
```

### Time-travel debugging

```python
from agentprobe.timeline import TimelineDebugger

dbg = TimelineDebugger(recording)
dbg.add_breakpoint_tool("web_search")
dbg.add_breakpoint_cost(0.10)
state = dbg.step_forward()  # advance one step
state = dbg.run()           # run until breakpoint
print(state.cumulative_cost)  # $0.0847
print(dbg.render_timeline_bar())
# ██▒██▒▒▒▼██▒◆██
```

### Chaos engineering

```python
from agentprobe.chaos import ChaosEngine

engine = ChaosEngine(seed=42)
result = engine.run(recording)
print(f"Resilience: {result.resilience_score:.0f}/100 ({result.grade})")
# Resilience: 73/100 (B)
# Recommendations: Add retry logic for tool failures
```

### Tests from plain English

```shell
agentprobe nltest \
  "respond in under 5 seconds" \
  "cost below $0.10" \
  "call the search tool at least once" \
  "no PII in output" \
  -o tests/test_generated.py
```

```python
# Auto-generated:
def test_agent(recording):
    assertions.latency_below(recording, max_ms=5000)
    assertions.cost_below(recording, max_cost_usd=0.10)
    assertions.called_tool(recording, tool_name="search")
    assertions.no_pii_in_output(recording)
```

### Agent DNA

```python
from agentprobe.dna import AgentDNA

dna = AgentDNA()
fp = dna.fingerprint(recording)
print(fp.signature)  # "CeSp-VbTf-DeDp"
print(dna.render_helix(fp))
# 🧬 Agent DNA Helix
# 💬 verbosity       ████████████░░░░░░░░  0.62
# 🧰 tool_diversity  ██████████████░░░░░░  0.71
# ⚡ speed           ████████████████░░░░  0.83
```

See more in the examples/ directory.
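The helix dimensions above are just a numeric vector, so comparing two agent identities can be as simple as a vector-similarity measure. A sketch of the idea in plain Python (cosine similarity is an assumption here; AgentProbe's actual comparison may differ):

```python
# Illustrative sketch: compare two behavioral fingerprints as vectors.
# The similarity metric is an assumption, not AgentProbe's algorithm.
from math import sqrt

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two fingerprints with the same dimensions."""
    keys = sorted(a)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = sqrt(sum(b[k] ** 2 for k in keys))
    return dot / (norm_a * norm_b)

# Dimension names taken from the helix output above; the values for the
# second fingerprint are made up to represent a small prompt tweak.
fp_v1 = {"verbosity": 0.62, "tool_diversity": 0.71, "speed": 0.83}
fp_v2 = {"verbosity": 0.60, "tool_diversity": 0.70, "speed": 0.85}
similarity = cosine(fp_v1, fp_v2)
print(f"{similarity:.3f}")  # close to 1.0: same behavioral identity
```

A similarity well below 1.0 between two runs of the "same" agent is exactly the drift signal the fingerprinting feature is after.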
## Contributing

Contributions are welcome. Here is how to get started:

```shell
git clone https://github.com/tomerhakak/agentprobe.git
cd agentprobe
pip install -e ".[dev,all]"
pytest
ruff check .
```

Please open an issue before submitting large PRs so we can discuss the approach.
## License

MIT — use it however you want.
Built by @tomerhakak · agentprobe.dev