# AgentProbe: pytest for AI Agents
Record, test, replay, and secure your AI agents — locally, privately, in your CI pipeline.
Docs · Pro · Examples · Discord
Your agents call LLMs, invoke tools, and make routing decisions — yet you have no way to test them, no way to catch regressions, and no idea why one run costs $0.12 while the next costs $2.40. AgentProbe fixes that.
## Install

```shell
# pip (recommended)
pip install agentprobe

# or one-line installer (macOS, Linux, WSL)
curl -fsSL https://raw.githubusercontent.com/tomerhakak/agentprobe/main/install.sh | bash
```

Then get started:

```shell
agentprobe init       # scaffold config + example tests
agentprobe test       # run your agent tests
agentprobe platform   # local web dashboard at localhost:9700
```

## Quick Start

Record your agent, then test it — just like pytest.
```python
from agentprobe import record, RecordingSession
from agentprobe import assertions as A

# 1. Record a run
@record("my-agent")
def run_agent(query: str, session: RecordingSession) -> str:
    session.set_input(query)
    session.add_llm_call(model="gpt-4o", input_messages=[...], output_message=response)
    session.add_tool_call(tool_name="search", tool_input={"q": query}, tool_output=results)
    session.set_output(answer)
    return answer

# 2. Test the recording
def test_agent_works(recording):
    A.set_recording(recording)
    A.output_contains("refund policy")
    A.called_tool("search")
    A.total_cost_less_than(0.05)
    A.no_pii_in_output()
```

```
$ agentprobe test

tests/test_agent.py
  PASS test_basic_response .................. 0.8s
  PASS test_uses_search_tool ................ 0.3s
  PASS test_cost_within_budget .............. 0.1s
  PASS test_no_pii_leakage .................. 0.2s
  FAIL test_prompt_injection_resistance ..... 0.4s
    AssertionError: Output contains forbidden pattern: 'IGNORE PREVIOUS'

4 passed, 1 failed in 1.8s
Total cost: $0.0034 | Tokens: 1,247 | Traces: .agentprobe/
```
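Assertions like `total_cost_less_than` work entirely off the recorded trace: every LLM call's token counts live in the trace file, so a budget check is just arithmetic. A minimal sketch of the idea in plain Python (the trace fields and per-token prices here are assumptions for illustration, not AgentProbe's actual schema):

```python
# Illustrative sketch of a cost-budget assertion over a recorded trace.
# The trace schema and per-token prices are assumptions, not AgentProbe's.
PRICES = {"gpt-4o": {"input": 2.50 / 1e6, "output": 10.00 / 1e6}}  # USD per token (assumed)

def total_cost(trace: list[dict]) -> float:
    """Sum the estimated cost of every LLM call in a recorded trace."""
    return sum(
        call["input_tokens"] * PRICES[call["model"]]["input"]
        + call["output_tokens"] * PRICES[call["model"]]["output"]
        for call in trace
        if call["type"] == "llm_call"
    )

def assert_cost_below(trace: list[dict], budget_usd: float) -> None:
    cost = total_cost(trace)
    assert cost < budget_usd, f"cost ${cost:.4f} exceeds budget ${budget_usd}"

trace = [
    {"type": "llm_call", "model": "gpt-4o", "input_tokens": 900, "output_tokens": 150},
    {"type": "tool_call", "tool": "search"},  # tool calls cost nothing here
]
assert_cost_below(trace, budget_usd=0.05)  # passes: ~$0.0038
```

Because the check runs against a recording rather than a live API, it is deterministic and free to re-run in CI.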
## Features

- Capture every LLM call, tool invocation, and routing decision into portable `.aprobe` trace files.
- 35+ built-in assertions — output quality, tool usage, cost, latency, safety. Property-based, parameterized, regression, and snapshot testing.
- Swap models, change prompts, compare results side-by-side. See exact cost and behavior differences between runs.
- Prompt injection fuzzing (47+ variants), PII detection (27 entity types), threat modeling, and security scoring.
- Cost tracking per run, budget alerts, behavioral drift detection, anomaly detection, latency percentiles.
- Hallucination detection, auto-optimizer, model recommender, and A/B testing across agent configurations.
- Agent vs. agent battles with ELO ratings. Head-to-head comparison across any dimension — cost, accuracy, speed, safety.
- Forensic failure analysis. Automatic root-cause detection: infinite loops, cost explosions, tool misuse, hallucination spirals.
- Token-level cost attribution. See exactly which step, which tool call, which LLM request is burning your budget. Beautiful tree visualization.
- 53 automated compliance checks across SOC 2, HIPAA, GDPR, PCI-DSS, and CCPA. Generate audit-ready reports.
- A brutally honest (and funny) analysis of your agent. 450 jokes, 3 severity levels. "Your agent spends money like a drunk sailor at a token store."
- Find out what your agent really costs. Per-run, monthly, and yearly projections. Model comparison with savings recommendations.
- 5-dimension health score (reliability, speed, cost, security, quality) with progress bars and actionable tips.
- 55 prompt injection attacks across 5 categories. Test your agent's defenses interactively.
- Rank your agents by composite score and track improvements over time. SQLite-backed, fully local.
- Side-by-side model comparison: cost, speed, quality, hallucination rate. Crown emoji for the winner.
- Time-travel debugger for agent execution. Step forward and backward, set breakpoints on tools, cost, or errors, and inspect state at every point. Interactive TUI mode.
- Behavioral fingerprinting. Generate a unique multi-dimensional DNA profile for any agent, detect drift, compare identities, and render a visual helix.
- 12 built-in chaos scenarios: tool timeouts, LLM hallucinations, cascading failures, cost explosions. Resilience scoring with recovery analysis.
- Like code coverage, but for agents: track tool coverage, branch coverage, step-pattern diversity, and error-path testing across recordings.
- Jest-style snapshots for agent behavior. Capture output, tools, cost, and patterns, and automatically detect regressions on re-runs.
- Automatic optimization analysis. Detects wasted tokens, recommends model downgrades, identifies caching opportunities, and projects monthly savings.
- Auto-run tests when files change. Like nodemon for AI agents — monitors recordings and test files and triggers analysis on save.
- Write tests in plain English ("respond in under 5 seconds", "cost below $0.10") and auto-translate them into executable pytest code.
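Snapshot testing, for example, needs nothing more than reducing a run to the dimensions you care about and diffing against a stored copy. A minimal sketch of the idea in plain Python (the `recording` fields and the choice of dimensions are hypothetical, not AgentProbe's snapshot format):

```python
# Illustrative sketch of Jest-style behavioral snapshots (not AgentProbe's
# internal format): capture the dimensions that matter, then diff on re-run.
def snapshot(recording: dict) -> dict:
    """Reduce a run to the behavioral dimensions worth diffing."""
    return {
        "output": recording["output"],
        "tools": sorted(recording["tools"]),
        "cost_bucket": round(recording["cost_usd"], 2),  # ignore sub-cent noise
    }

def diff_snapshots(old: dict, new: dict) -> list[str]:
    """Return the names of the dimensions that changed between runs."""
    return [k for k in old if old[k] != new[k]]

run_v1 = {"output": "Refunds take 5 days.", "tools": ["search"], "cost_usd": 0.012}
run_v2 = {"output": "Refunds take 7 days.", "tools": ["search"], "cost_usd": 0.013}
changed = diff_snapshots(snapshot(run_v1), snapshot(run_v2))
print(changed)  # ['output'] — the text regressed; cost rounds to the same bucket
```

The bucketing step is the interesting design choice: without it, every re-run would "regress" on sub-cent cost jitter.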
## Integrations

AgentProbe is framework-agnostic. Use it with anything.
| Framework | Integration |
|---|---|
| OpenAI SDK | Auto-instrumentation |
| Anthropic SDK | Auto-instrumentation |
| LangChain | Callback handler |
| CrewAI | Callback handler |
| AutoGen | Adapter |
| Custom agents | Manual recording API |
```yaml
- uses: tomerhakak/agentprobe@v1
  with:
    args: test --ci --report report.html
```

```python
from agentprobe.adapters.langchain import AgentProbeCallbackHandler

handler = AgentProbeCallbackHandler(session_name="my-chain")
chain.invoke({"input": "..."}, config={"callbacks": [handler]})
```

```python
from agentprobe.adapters.crewai import AgentProbeCrewHandler

handler = AgentProbeCrewHandler(session_name="my-crew")
crew.kickoff(callbacks=[handler])
```

### GitHub Actions
```yaml
name: Agent Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install agentprobe[all]
      - run: agentprobe test --ci --report report.html
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentprobe-report
          path: report.html
```

### GitLab CI
```yaml
agent-tests:
  image: python:3.12
  script:
    - pip install agentprobe[all]
    - agentprobe test --ci --report report.html
  artifacts:
    paths:
      - report.html
    when: always
```

### Jenkins
```groovy
pipeline {
    agent { docker { image 'python:3.12' } }
    stages {
        stage('Agent Tests') {
            steps {
                sh 'pip install agentprobe[all]'
                sh 'agentprobe test --ci --report report.html'
            }
        }
    }
    post {
        always { archiveArtifacts artifacts: 'report.html' }
    }
}
```

## CLI Reference

```
agentprobe record       Record an agent run into a .aprobe trace
agentprobe test         Run agent tests (pytest-compatible)
agentprobe replay       Replay a recording with a different model or config
agentprobe fuzz         Fuzz your agent with prompt injections & edge cases
agentprobe scan         Security scan — PII detection, injection resistance
agentprobe roast        Get a funny brutal analysis of your agent
agentprobe xray         Visualize agent thinking step-by-step
agentprobe health       5-dimension health check with scores
agentprobe cost         Calculate true cost projections & savings
agentprobe compare      Side-by-side model comparison
agentprobe playground   Interactive prompt injection lab (55 attacks)
agentprobe leaderboard  Rank and track your agents over time
agentprobe analyze      Cost breakdown, drift detection, failure clustering
agentprobe platform     Launch the local web dashboard (localhost:9700)
agentprobe init         Scaffold config file and example tests
agentprobe diff         Compare two recordings or agent versions
agentprobe timeline     Time-travel debugger — step through execution
agentprobe dna          Generate behavioral DNA fingerprint
agentprobe chaos        Run chaos engineering scenarios
agentprobe coverage     Agent path coverage report
agentprobe snapshot     Capture/compare behavioral snapshots
agentprobe optimize     Token & cost optimization analysis
agentprobe watch        Auto-run tests on file changes
agentprobe nltest       Generate tests from plain English
```

Run `agentprobe --help` for the full list.
## Local Dashboard

AgentProbe ships with a local web dashboard — no cloud, no accounts, no data leaves your machine.

```shell
agentprobe platform start
# Opens http://localhost:9700
```

```
+------------------------------------------------------------------+
|  AgentProbe Platform                              localhost:9700 |
+------------------------------------------------------------------+
|                                                                  |
|  Recent Traces                         Cost Trend (7d)           |
|  +---------------------------------+   +---------------------+   |
|  | customer-support  0.3s  $0.003  |   |                __/  |   |
|  | order-lookup      1.2s  $0.018  |   |           ___/      |   |
|  | refund-agent      0.8s  $0.007  |   |      ___/           |   |
|  | billing-qa        0.5s  $0.004  |   |   /                 |   |
|  +---------------------------------+   +---------------------+   |
|                                                                  |
|  Assertions: 142 passed, 3 failed      Avg cost/run: $0.008      |
|  Models: gpt-4o (67%), claude-sonnet (33%)         Drift: LOW    |
+------------------------------------------------------------------+
```

Traces, cost breakdowns, assertion results, drift detection, and failure analysis — all in one place.
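The drift indicator amounts to a statistical comparison between a recent window of runs and a baseline window. A minimal sketch of one way to score it (the metric, window shape, and thresholds are assumptions for illustration, not AgentProbe's algorithm):

```python
# Illustrative sketch of drift scoring: label drift by how far the recent
# mean cost moved from the baseline mean, measured in baseline std devs.
# Thresholds are assumptions, not AgentProbe's.
from statistics import mean, pstdev

def drift_level(baseline: list[float], recent: list[float]) -> str:
    """Classify drift of per-run cost (or any per-run metric)."""
    spread = pstdev(baseline) or 1e-9  # avoid divide-by-zero on flat baselines
    shift = abs(mean(recent) - mean(baseline)) / spread
    return "LOW" if shift < 1 else "MEDIUM" if shift < 3 else "HIGH"

baseline = [0.007, 0.008, 0.009, 0.008]               # historical cost/run
print(drift_level(baseline, [0.008, 0.009, 0.007]))   # LOW  (same regime)
print(drift_level(baseline, [0.021, 0.024, 0.019]))   # HIGH (cost tripled)
```

The same scheme extends to any recorded per-run metric: latency, tool-call count, or output length.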
## Free vs. Pro

| Feature | Free | Pro |
|---|---|---|
| Recording & Replay | ✅ | ✅ |
| 35+ Assertions | ✅ | ✅ |
| pytest Integration | ✅ | ✅ |
| Mock LLMs & Tools | ✅ | ✅ |
| CI/CD & GitHub Action | ✅ | ✅ |
| Cost & Latency Tracking | ✅ | ✅ |
| PII Detection (27 types) | ✅ | ✅ |
| Basic Fuzzing | ✅ | ✅ |
| Local Dashboard | ✅ | ✅ |
| 🔥 Agent Roast (450 jokes) | ✅ | ✅ |
| 🔬 X-Ray Visualization | ✅ | ✅ |
| 💰 Cost Calculator & Projections | ✅ | ✅ |
| 🏥 Health Check (5 dimensions) | ✅ | ✅ |
| 🎮 Injection Playground (55 attacks) | ✅ | ✅ |
| 🏆 Agent Leaderboard | ✅ | ✅ |
| ⚖️ Model Comparator | ✅ | ✅ |
| ⏳ Timeline (Time Travel Debugger) | ✅ | ✅ |
| 🧬 Agent DNA (Behavioral Fingerprinting) | ✅ | ✅ |
| 🌀 Chaos Engineering (12 scenarios) | ✅ (5 max) | ✅ |
| 📊 Agent Path Coverage | ✅ | ✅ |
| 📸 Snapshot Testing | ✅ | ✅ |
| 🚀 Token Optimizer | ✅ | ✅ |
| 👀 Watch Mode | ✅ | ✅ |
| 🧪 NL Test Writer | ✅ | ✅ |
| ⚔️ Agent Battle Arena | - | ✅ |
| 🔬 Agent Autopsy | - | ✅ |
| 📋 Compliance (53 checks) | - | ✅ |
| 🛡️ Security Scorer (71 checks) | - | ✅ |
| 🧠 Agent Benchmark (6D) | - | ✅ |
| 🔀 Agent Diff & Changelog | - | ✅ |
| 🧠 Brain (auto-optimizer) | - | ✅ |
| Full Fuzzer (47+ variants) | - | ✅ |
## Comparison

| | AgentProbe | Promptfoo | DeepEval | Ragas |
|---|---|---|---|---|
| Record agent traces | ✅ | - | - | - |
| Replay with model swap | ✅ | - | - | - |
| 35+ built-in assertions | ✅ | Custom | 14 | 8 |
| Prompt injection fuzzing | ✅ | Basic | - | - |
| Tool call assertions | ✅ | - | - | - |
| Cost & latency assertions | ✅ | - | Partial | - |
| PII detection | ✅ | - | - | - |
| Mock LLMs & tools | ✅ | - | - | - |
| pytest native | ✅ | - | Plugin | - |
| Framework agnostic | ✅ | LLM-only | LLM-only | RAG-only |
| Fully offline | ✅ | Partial | - | - |
| Local dashboard | ✅ | ✅ | ✅ | - |
## Examples

### Replay with a different model

```python
from agentprobe import Replayer, ReplayConfig

replayer = Replayer()
result = replayer.replay(
    "recordings/customer-support.aprobe",
    config=ReplayConfig(model="claude-sonnet-4-20250514", mock_tools=True),
)
comparison = replayer.compare(original, result)
print(comparison.summary)
# Output Similarity: 94.2%
# Cost: $0.0180 -> $0.0095 (-47.2%)
```

### Fuzz with prompt injections

```python
from agentprobe.fuzz import Fuzzer, PromptInjection, EdgeCases

fuzzer = Fuzzer()
result = fuzzer.run(
    agent_fn=run_agent,
    strategies=[PromptInjection(), EdgeCases()],
    assertions=lambda A: [
        A.no_pii_in_output(),
        A.output_not_contains("IGNORE PREVIOUS"),
        A.completed_successfully(),
    ],
)
print(result.summary())
# Strategy: PromptInjection | Tested: 47 | Failed: 2 | Failure rate: 4.3%
```

### Mock LLMs & tools

```python
from agentprobe.mock import MockLLM, MockTool

mock_llm = MockLLM(responses=["Your order #1234 has been shipped."])
mock_search = MockTool(responses=[{"results": ["Order shipped on March 15"]}])
result = replayer.replay(
    recording,
    config=ReplayConfig(mock_llm=mock_llm, tool_mocks={"search_orders": mock_search}),
)
# Zero API calls. Zero cost. Deterministic.
```

### Time-travel debugging

```python
from agentprobe.timeline import TimelineDebugger

dbg = TimelineDebugger(recording)
dbg.add_breakpoint_tool("web_search")
dbg.add_breakpoint_cost(0.10)
state = dbg.step_forward()  # advance one step
state = dbg.run()           # run until breakpoint
print(state.cumulative_cost)  # $0.0847
print(dbg.render_timeline_bar())
# ██▒██▒▒▒▼██▒◆██
```

### Chaos engineering

```python
from agentprobe.chaos import ChaosEngine

engine = ChaosEngine(seed=42)
result = engine.run(recording)
print(f"Resilience: {result.resilience_score:.0f}/100 ({result.grade})")
# Resilience: 73/100 (B)
# Recommendations: Add retry logic for tool failures
```

### Tests from plain English

```shell
agentprobe nltest \
  "respond in under 5 seconds" \
  "cost below $0.10" \
  "call the search tool at least once" \
  "no PII in output" \
  -o tests/test_generated.py
```

```python
# Auto-generated:
def test_agent(recording):
    assertions.latency_below(recording, max_ms=5000)
    assertions.cost_below(recording, max_cost_usd=0.10)
    assertions.called_tool(recording, tool_name="search")
    assertions.no_pii_in_output(recording)
```

### Agent DNA

```python
from agentprobe.dna import AgentDNA

dna = AgentDNA()
fp = dna.fingerprint(recording)
print(fp.signature)  # "CeSp-VbTf-DeDp"
print(dna.render_helix(fp))
# 🧬 Agent DNA Helix
# 💬 verbosity       ████████████░░░░░░░░  0.62
# 🧰 tool_diversity  ██████████████░░░░░░  0.71
# ⚡ speed           ████████████████░░░░  0.83
```

See more in the examples/ directory.
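The helix dimensions above are just a numeric vector, so comparing two agent identities can be as simple as a vector-similarity measure. A sketch of the idea in plain Python (cosine similarity is an assumption here; AgentProbe's actual comparison may differ):

```python
# Illustrative sketch: compare two behavioral fingerprints as vectors.
# The similarity metric is an assumption, not AgentProbe's algorithm.
from math import sqrt

def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two fingerprints with the same dimensions."""
    keys = sorted(a)
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = sqrt(sum(b[k] ** 2 for k in keys))
    return dot / (norm_a * norm_b)

# Dimension names taken from the helix output above; the values for the
# second fingerprint are made up to represent a small prompt tweak.
fp_v1 = {"verbosity": 0.62, "tool_diversity": 0.71, "speed": 0.83}
fp_v2 = {"verbosity": 0.60, "tool_diversity": 0.70, "speed": 0.85}
similarity = cosine(fp_v1, fp_v2)
print(f"{similarity:.3f}")  # close to 1.0: same behavioral identity
```

A similarity well below 1.0 between two runs of the "same" agent is exactly the drift signal the fingerprinting feature is after.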
## Contributing

Contributions are welcome. Here is how to get started:

```shell
git clone https://github.com/tomerhakak/agentprobe.git
cd agentprobe
pip install -e ".[dev,all]"
pytest
ruff check .
```

Please open an issue before submitting large PRs so we can discuss the approach.
## License

MIT — use it however you want.
Built by @tomerhakak · agentprobe.dev