Skip to content

tomerhakak/agentprobe

Repository files navigation

🧪 AgentProbe

pytest for AI Agents

Record, test, replay, and secure your AI agents — locally, privately, in your CI pipeline.

License: MIT PyPI version Python 3.10+ GitHub Actions GitHub Stars

Docs · Pro · Examples · Discord


Your agents call LLMs, invoke tools, and make routing decisions — yet you have no way to test them, no way to catch regressions, and no idea why one run costs $0.12 while the next costs $2.40. AgentProbe fixes that.


Quick Install

# pip (recommended)
pip install agentprobe

# or one-line installer (macOS, Linux, WSL)
curl -fsSL https://raw.githubusercontent.com/tomerhakak/agentprobe/main/install.sh | bash

Then get started:

agentprobe init        # scaffold config + example tests
agentprobe test        # run your agent tests
agentprobe platform    # local web dashboard at localhost:9700

30-Second Demo

Record your agent, then test it — just like pytest.

from agentprobe import record, RecordingSession
from agentprobe import assertions as A

# 1. Record a run
@record("my-agent")
def run_agent(query: str, session: RecordingSession) -> str:
    session.set_input(query)
    session.add_llm_call(model="gpt-4o", input_messages=[...], output_message=response)
    session.add_tool_call(tool_name="search", tool_input={"q": query}, tool_output=results)
    session.set_output(answer)
    return answer

# 2. Test the recording
def test_agent_works(recording):
    A.set_recording(recording)
    A.output_contains("refund policy")
    A.called_tool("search")
    A.total_cost_less_than(0.05)
    A.no_pii_in_output()
$ agentprobe test

 tests/test_agent.py
  PASS  test_basic_response .................. 0.8s
  PASS  test_uses_search_tool ................ 0.3s
  PASS  test_cost_within_budget .............. 0.1s
  PASS  test_no_pii_leakage .................. 0.2s
  FAIL  test_prompt_injection_resistance ..... 0.4s
        AssertionError: Output contains forbidden pattern: 'IGNORE PREVIOUS'

 4 passed, 1 failed in 1.8s
 Total cost: $0.0034 | Tokens: 1,247 | Traces: .agentprobe/

Features

🔴 Recording

Capture every LLM call, tool invocation, and routing decision into portable .aprobe trace files. Framework-agnostic — works with any agent.

✅ Testing

35+ built-in assertions — output quality, tool usage, cost, latency, safety. Property-based, parameterized, regression, and snapshot testing.

⏪ Replay

Swap models, change prompts, compare results side-by-side. See exact cost and behavior differences between gpt-4o and claude-sonnet.

🛡️ Security

Prompt injection fuzzing (47+ variants), PII detection (27 entity types), threat modeling, and security scoring.

📊 Monitoring

Cost tracking per run, budget alerts, behavioral drift detection, anomaly detection, latency percentiles.

🧠 Intelligence

Hallucination detection, auto-optimizer, model recommender, and A/B testing across agent configurations.

⚔️ Arena Pro

Agent vs. agent battles with ELO ratings. Head-to-head comparison across any dimension — cost, accuracy, speed, safety.

🔬 Autopsy Pro

Forensic failure analysis. Automatic root-cause detection: infinite loops, cost explosions, tool misuse, hallucination spirals.

🔍 X-Ray

Token-level cost attribution. See exactly which step, which tool call, which LLM request is burning your budget. Beautiful tree visualization.

📋 Compliance Pro

53 automated checks across SOC2, HIPAA, GDPR, PCI-DSS, and CCPA. Generate audit-ready reports.

🔥 Agent Roast

Get a brutally honest (and funny) analysis of your agent. 450 jokes, 3 severity levels. "Your agent spends money like a drunk sailor at a token store."

💰 Cost Calculator

Find out what your agent REALLY costs. Per-run, monthly, yearly projections. Model comparison with savings recommendations.

🏥 Health Check

5-dimension health score (reliability, speed, cost, security, quality) with progress bars and actionable tips.

🎮 Injection Playground

55 prompt injection attacks across 5 categories. Test your agent's defenses interactively.

🏆 Leaderboard

Rank your agents by composite score. Track improvements over time. SQLite-backed, fully local.

⚖️ Model Comparator

Side-by-side model comparison. Cost, speed, quality, hallucination rate. Crown emoji for the winner.

⏳ Timeline NEW

Time-travel debugger for agent execution. Step forward/backward, set breakpoints on tools/cost/errors, inspect state at every point. Interactive TUI mode.

🧬 Agent DNA NEW

Behavioral fingerprinting. Generate a unique multi-dimensional DNA profile for any agent. Detect drift, compare identities, visual helix rendering.

🌀 Chaos Engineering NEW

12 built-in chaos scenarios: tool timeouts, LLM hallucinations, cascading failures, cost explosions. Resilience scoring with recovery analysis.

📊 Agent Coverage NEW

Like code coverage, but for agents. Track tool coverage, branch coverage, step pattern diversity, and error path testing across recordings.

📸 Snapshot Testing NEW

Jest-style snapshots for agent behavior. Capture output, tools, cost, and patterns — automatically detect regressions on re-runs.

🚀 Token Optimizer NEW

Automatic optimization analysis. Detects wasted tokens, recommends model downgrades, identifies caching opportunities, projects monthly savings.

👀 Watch Mode NEW

Auto-run tests when files change. Like nodemon for AI agents — monitors recordings and test files, triggers analysis on save.

🧪 NL Test Writer NEW

Write tests in plain English: "respond in under 5 seconds", "cost below $0.10". Auto-translates to executable pytest code.


Integrations

AgentProbe is framework-agnostic. Use it with anything.

Framework Integration
OpenAI SDKAuto-instrumentation
Anthropic SDKAuto-instrumentation
LangChainCallback handler
CrewAICallback handler
AutoGenAdapter
Custom agentsManual recording API

GitHub Action

- uses: tomerhakak/agentprobe@v1
  with:
    args: test --ci --report report.html

LangChain

from agentprobe.adapters.langchain import AgentProbeCallbackHandler

handler = AgentProbeCallbackHandler(session_name="my-chain")
chain.invoke({"input": "..."}, config={"callbacks": [handler]})

CrewAI

from agentprobe.adapters.crewai import AgentProbeCrewHandler

handler = AgentProbeCrewHandler(session_name="my-crew")
crew.kickoff(callbacks=[handler])

CI/CD

GitHub Actions
name: Agent Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install agentprobe[all]
      - run: agentprobe test --ci --report report.html
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: agentprobe-report
          path: report.html
GitLab CI
agent-tests:
  image: python:3.12
  script:
    - pip install agentprobe[all]
    - agentprobe test --ci --report report.html
  artifacts:
    paths:
      - report.html
    when: always
Jenkins
pipeline {
    agent { docker { image 'python:3.12' } }
    stages {
        stage('Agent Tests') {
            steps {
                sh 'pip install agentprobe[all]'
                sh 'agentprobe test --ci --report report.html'
            }
        }
    }
    post {
        always { archiveArtifacts artifacts: 'report.html' }
    }
}

CLI Reference

agentprobe record     Record an agent run into a .aprobe trace
agentprobe test       Run agent tests (pytest-compatible)
agentprobe replay     Replay a recording with a different model or config
agentprobe fuzz       Fuzz your agent with prompt injections & edge cases
agentprobe scan       Security scan — PII detection, injection resistance
agentprobe roast      Get a funny brutal analysis of your agent
agentprobe xray       Visualize agent thinking step-by-step
agentprobe health     5-dimension health check with scores
agentprobe cost       Calculate true cost projections & savings
agentprobe compare    Side-by-side model comparison
agentprobe playground Interactive prompt injection lab (55 attacks)
agentprobe leaderboard Rank and track your agents over time
agentprobe analyze    Cost breakdown, drift detection, failure clustering
agentprobe platform   Launch the local web dashboard (localhost:9700)
agentprobe init       Scaffold config file and example tests
agentprobe diff       Compare two recordings or agent versions
agentprobe timeline   Time-travel debugger — step through execution
agentprobe dna        Generate behavioral DNA fingerprint
agentprobe chaos      Run chaos engineering scenarios
agentprobe coverage   Agent path coverage report
agentprobe snapshot   Capture/compare behavioral snapshots
agentprobe optimize   Token & cost optimization analysis
agentprobe watch      Auto-run tests on file changes
agentprobe nltest     Generate tests from plain English

Run agentprobe --help for the full list.


Platform

AgentProbe ships with a local web dashboard — no cloud, no accounts, no data leaves your machine.

agentprobe platform start
# Opens http://localhost:9700
+------------------------------------------------------------------+
|  AgentProbe Platform                           localhost:9700     |
+------------------------------------------------------------------+
|                                                                   |
|  Recent Traces                          Cost Trend (7d)           |
|  +---------------------------------+   +---------------------+   |
|  | customer-support  0.3s  $0.003  |   |          __/        |   |
|  | order-lookup      1.2s  $0.018  |   |      ___/           |   |
|  | refund-agent      0.8s  $0.007  |   |  ___/               |   |
|  | billing-qa        0.5s  $0.004  |   | /                   |   |
|  +---------------------------------+   +---------------------+   |
|                                                                   |
|  Assertions: 142 passed, 3 failed     Avg cost/run: $0.008       |
|  Models: gpt-4o (67%), claude-sonnet (33%)   Drift: LOW          |
+------------------------------------------------------------------+

Traces, cost breakdowns, assertion results, drift detection, and failure analysis — all in one place.


Free vs. Pro

Feature Free Pro
Recording & Replay
35+ Assertions
pytest Integration
Mock LLMs & Tools
CI/CD & GitHub Action
Cost & Latency Tracking
PII Detection (27 types)
Basic Fuzzing
Local Dashboard
🔥 Agent Roast (450 jokes)
🔬 X-Ray Visualization
💰 Cost Calculator & Projections
🏥 Health Check (5 dimensions)
🎮 Injection Playground (55 attacks)
🏆 Agent Leaderboard
⚖️ Model Comparator
⏳ Timeline (Time Travel Debugger)
🧬 Agent DNA (Behavioral Fingerprinting)
🌀 Chaos Engineering (12 scenarios)✅ (5 max)
📊 Agent Path Coverage
📸 Snapshot Testing
🚀 Token Optimizer
👀 Watch Mode
🧪 NL Test Writer
⚔️ Agent Battle Arena-
🔬 Agent Autopsy-
📋 Compliance (53 checks)-
🛡️ Security Scorer (71 checks)-
🧠 Agent Benchmark (6D)-
🔀 Agent Diff & Changelog-
🧠 Brain (auto-optimizer)-
Full Fuzzer (47+ variants)-

Learn more about Pro →


Why AgentProbe?

AgentProbe Promptfoo DeepEval Ragas
Record agent traces - - -
Replay with model swap - - -
35+ built-in assertions Custom 14 8
Prompt injection fuzzing Basic - -
Tool call assertions - - -
Cost & latency assertions - Partial -
PII detection - - -
Mock LLMs & tools - - -
pytest native - Plugin -
Framework agnostic LLM-only LLM-only RAG-only
Fully offline Partial - -
Local dashboard -

Examples

Replay with a different model

from agentprobe import Replayer, ReplayConfig

replayer = Replayer()
result = replayer.replay(
    "recordings/customer-support.aprobe",
    config=ReplayConfig(model="claude-sonnet-4-20250514", mock_tools=True),
)
comparison = replayer.compare(original, result)
print(comparison.summary)
# Output Similarity: 94.2%
# Cost: $0.0180 -> $0.0095 (-47.2%)

Fuzz for prompt injections

from agentprobe.fuzz import Fuzzer, PromptInjection, EdgeCases

fuzzer = Fuzzer()
result = fuzzer.run(
    agent_fn=run_agent,
    strategies=[PromptInjection(), EdgeCases()],
    assertions=lambda A: [
        A.no_pii_in_output(),
        A.output_not_contains("IGNORE PREVIOUS"),
        A.completed_successfully(),
    ],
)
print(result.summary())
# Strategy: PromptInjection | Tested: 47 | Failed: 2 | Failure rate: 4.3%

Mock for fast, free, deterministic tests

from agentprobe.mock import MockLLM, MockTool

mock_llm = MockLLM(responses=["Your order #1234 has been shipped."])
mock_search = MockTool(responses=[{"results": ["Order shipped on March 15"]}])

result = replayer.replay(
    recording,
    config=ReplayConfig(mock_llm=mock_llm, tool_mocks={"search_orders": mock_search}),
)
# Zero API calls. Zero cost. Deterministic.

Time-travel through execution

from agentprobe.timeline import TimelineDebugger

dbg = TimelineDebugger(recording)
dbg.add_breakpoint_tool("web_search")
dbg.add_breakpoint_cost(0.10)

state = dbg.step_forward()    # advance one step
state = dbg.run()             # run until breakpoint
print(state.cumulative_cost)  # $0.0847
print(dbg.render_timeline_bar())
# ██▒██▒▒▒▼██▒◆██

Chaos-test your agent

from agentprobe.chaos import ChaosEngine

engine = ChaosEngine(seed=42)
result = engine.run(recording)
print(f"Resilience: {result.resilience_score:.0f}/100 ({result.grade})")
# Resilience: 73/100 (B)
# Recommendations: Add retry logic for tool failures

Write tests in English

agentprobe nltest \
  "respond in under 5 seconds" \
  "cost below $0.10" \
  "call the search tool at least once" \
  "no PII in output" \
  -o tests/test_generated.py
# Auto-generated:
def test_agent(recording):
    assertions.latency_below(recording, max_ms=5000)
    assertions.cost_below(recording, max_cost_usd=0.10)
    assertions.called_tool(recording, tool_name="search")
    assertions.no_pii_in_output(recording)

Agent DNA fingerprinting

from agentprobe.dna import AgentDNA

dna = AgentDNA()
fp = dna.fingerprint(recording)
print(fp.signature)  # "CeSp-VbTf-DeDp"
print(dna.render_helix(fp))
# 🧬 Agent DNA Helix
#  💬 verbosity        ████████████░░░░░░░░ 0.62
#  🧰 tool_diversity   ██████████████░░░░░░ 0.71
#  ⚡ speed            ████████████████░░░░ 0.83

See more in the examples/ directory.


Contributing

Contributions are welcome. Here is how to get started:

git clone https://github.com/tomerhakak/agentprobe.git
cd agentprobe
pip install -e ".[dev,all]"
pytest
ruff check .

Please open an issue before submitting large PRs so we can discuss the approach.


License

MIT — use it however you want.


Built by @tomerhakak · agentprobe.dev