Agentest

Vitest for AI agents. Scenario-based testing with simulated users, tool-call mocks, and LLM-as-judge evaluation. Lives in your project like Playwright. Run with npx agentest run.

Documentation | Getting Started | Examples

Agentest spins up LLM-powered simulated users that talk to your agent, intercepts tool calls through mocks, and evaluates every turn with LLM-as-judge metrics — all without touching your agent's code.

npm install @agentesting/agentest --save-dev

Prerequisites

Node.js >= 20
A running agent — either an HTTP endpoint that accepts OpenAI-compatible chat completions requests, or any agent you can call from a TypeScript function (see Custom Handler)
An LLM API key — Agentest uses an LLM for the simulated user and evaluation judges. Set ANTHROPIC_API_KEY, OPENAI_API_KEY, or GOOGLE_GENERATIVE_AI_API_KEY depending on your configured provider. Alternatively, use a local LLM for zero-cost development.

Features


Scenario-based tests	Define user personas, goals, and knowledge — Agentest generates realistic multi-turn conversations
Tool-call mocks	Intercept and control tool calls with functions, sequences, and error simulation
Trajectory assertions	Verify tool call order and arguments with `strict`, `contains`, `unordered`, and `within` match modes
LLM-as-judge metrics	Helpfulness, coherence, relevance, faithfulness, goal completion, behavior failure detection
Comparison mode	Run the same scenarios against multiple models/configs side-by-side
CI-ready CLI	Exit codes, JSON reporter, GitHub Actions annotations, watch mode
Custom handler	Bring any agent — HTTP endpoint or in-process function. Works with any framework

Why This Is Needed

The problem: testing agents is not like testing regular APIs.

Traditional API tests send a request and assert on the response. Agent tests need to handle multi-turn conversations where the agent decides which tools to call, in what order, with what arguments — and the "correct" output is subjective.

How current tools fall short:

Eval platforms (DeepEval, LangSmith, Langfuse) are great at scoring outputs after the fact, but they don't run your agent through realistic conversations. You still need to manually generate test data or replay logged traces.
Agent frameworks (LangChain, CrewAI) give you building blocks but no built-in test runner. Testing means writing custom scripts for every scenario.
No good CLI runner exists that combines simulated users, tool mocking, trajectory assertions, and evaluation in a single npx command — the way Vitest does for unit tests.

How Agentest works:

┌─ You define: persona + goal + knowledge + tool mocks + assertions
│
├─ Agentest creates a simulated user (LLM) that talks to your agent
│
├─ Tool calls are intercepted and resolved through your mocks
│  (check_availability → { slots: ['09:00', '10:30'] })
│  (create_booking → { success: true, bookingId: 'BK-001' })
│
├─ Trajectory assertions verify the agent called the right tools
│  (matchMode: 'contains', expected: [check_availability, create_booking])
│
└─ LLM-as-judge evaluates every turn: helpfulness, coherence, goal completion, ...

Agentest complements eval platforms and observability tools — it doesn't replace them. Use Agentest to run your agent through test scenarios in CI. Use LangSmith/Langfuse to observe your agent in production.

Agentest vs. Other Tools

Agentest is a test runner, not an eval framework or observability platform. Here's how it compares:

Capability	Agentest	DeepEval	LangSmith	Langfuse
Primary purpose	Test runner for agents	Eval framework for LLM outputs	Observability + eval platform	Observability + eval platform
Simulated users	Built-in (LLM-powered personas with goals)	--	--	--
Tool-call mocking	Built-in (functions, sequences, error sim)	--	--	--
Trajectory assertions	Built-in (strict/contains/unordered/within)	--	--	--
LLM-as-judge eval	Built-in (8 metrics + custom)	Built-in (extensive metric library)	Built-in	Built-in
Multi-turn conversation testing	Native (agent loop with mock resolution)	Manual test case setup	Trace replay	Trace replay
Comparison mode	Built-in (side-by-side model comparison)	Experiment tracking	Experiment tracking	Experiment tracking
CI/CLI integration	`npx agentest run` with exit codes	`deepeval test run`	SDK-based	SDK-based
Production observability	--	--	Tracing, logging, monitoring	Tracing, logging, monitoring
Dataset management	--	Built-in	Built-in	Built-in
Annotation/feedback UI	--	--	Built-in	Built-in
Pricing	Open source (MIT)	Open source + cloud	Freemium SaaS	Open source + cloud
Language	TypeScript/Node.js	Python	Python (primary)	Python/JS SDKs

When to use what:

Agentest — You want to write .sim.ts scenario files that test your agent end-to-end in CI, like writing Vitest tests. No manual test data, no trace replay.
DeepEval — You have existing datasets of input/output pairs and want to evaluate LLM output quality with a wide library of metrics.
LangSmith / Langfuse — You need production tracing, logging, user feedback collection, and dataset management alongside evaluation.

These tools are complementary. Run Agentest in CI to catch regressions before deploy. Use LangSmith/Langfuse to monitor and evaluate in production.

Install
Quick Start
How It Works
Configuration
- Agent Configuration
- LLM Provider
- Simulation Parameters
- Evaluation
- File Discovery
- Reporters
- Full Config Example
Scenarios
- Profile & Goal
- Knowledge
- Overriding Global Settings
Mocks
- Function Mocks
- Sequence Mocks
- Error Simulation
- Unmocked Tools
Trajectory Assertions
- Match Modes
- Argument Matching
Evaluation Metrics
- Quantitative Metrics
- Qualitative Metrics
- Thresholds
- Error Deduplication
- Custom Metrics
CLI
YAML Config
Prompt Template Customization
Comparison Mode
Framework Compatibility
Vitest Integration
Watch Mode
Local LLMs
Pass/Fail Logic
Programmatic Usage
Understanding LLM Usage
Requirements
License

Install

npm install @agentesting/agentest --save-dev

Quick Start

1. Create a config file

// agentest.config.ts
import { defineConfig } from 'agentest'

export default defineConfig({
  agent: {
    name: 'my-agent',
    endpoint: 'http://localhost:3000/api/chat',
  },
})

2. Write a scenario

// tests/booking.sim.ts
import { scenario, sequence } from 'agentest'

scenario('user books a morning slot', {
  profile: 'Busy professional who prefers mornings.',
  goal: 'Book a haircut for next Tuesday morning.',

  knowledge: [
    { content: 'The salon is open Tuesday 08:00-18:00.' },
    { content: 'Standard haircut takes 45 minutes.' },
  ],

  mocks: {
    tools: {
      check_availability: (args) => ({
        available: true,
        slots: ['09:00', '09:45', '10:30'],
        date: args.date,
      }),
      create_booking: sequence([
        { success: true, bookingId: 'BK-001', confirmationSent: true },
      ]),
    },
  },

  assertions: {
    toolCalls: {
      matchMode: 'contains',
      expected: [
        { name: 'check_availability', argMatchMode: 'ignore' },
        { name: 'create_booking', argMatchMode: 'ignore' },
      ],
    },
  },
})

3. Run

# If your agent runs on localhost, allow private endpoints:
AGENTEST_ALLOW_PRIVATE_ENDPOINTS=1 npx agentest run

Output:

Agentest running 1 scenario(s)

  ⠹ user books a morning slot → simulating conv-1-a3f8b2c1 turn 1/8 — calling agent

✓ user books a morning slot
  ✓ conv-1-a3f8b2c1 (2 turns, goal met, trajectory matched, helpfulness: 4.5, coherence: 5.0)
  ✓ conv-2-d9e1f4a7 (1 turns, goal met, trajectory matched, helpfulness: 5.0, coherence: 5.0)

1/1 scenarios passed

How It Works

┌─ SimulatedUser (LLM) generates a message
│
├─ Agentest POSTs message to your agent endpoint
│
├─ Agent responds with tool_calls?
│  ├─ YES → MockResolver resolves each tool call
│  │        Results injected back → POST again
│  │        (loop until agent returns text)
│  └─ NO  → Record the text response
│
├─ SimulatedUser sees agent's reply
│  ├─ Goal achieved? → stop
│  └─ Continue → next turn
│
└─ After all turns: evaluate with LLM-as-judge metrics

Agentest owns the tool-call loop. It intercepts tool calls from the agent response, resolves them through your mocks, and sends results back. No instrumentation needed in your agent code — your agent just sees normal tool results.

Configuration

Create agentest.config.ts (or agentest.config.yaml) in your project root. Agentest auto-detects the format. You can also pass a custom path with --config.

import { defineConfig } from 'agentest'

export default defineConfig({
  // ... options
})

defineConfig validates the config with Zod and interpolates environment variables in headers. See YAML Config for the YAML equivalent.

Agent Configuration

Agentest supports two agent types: chat completions (HTTP endpoint) and custom (bring your own handler function).

Chat Completions (default)

agent: {
  name: 'my-agent',
  endpoint: 'https://api.example.com/chat',

  // Optional
  type: 'chat_completions',   // default
  headers: {
    Authorization: 'Bearer ${AGENT_API_KEY}',  // interpolated from process.env
    'X-Custom': 'value',
  },
  body: {
    model: 'gpt-4o',
    temperature: 0.7,
    messages: [
      { role: 'system', content: 'You are a helpful assistant.' },
    ],
  },
}

endpoint must be an HTTP/HTTPS URL that accepts OpenAI-compatible chat completions requests. Agentest POSTs { messages: [...] } and expects { choices: [{ message: { content, tool_calls? } }] } in return.

headers supports ${VAR} syntax for environment variable interpolation. If a referenced variable is not set, config loading fails with a clear error. This keeps secrets out of your config files.

body is shallow-merged into every request. Use it to set model, temperature, system messages, or any extra fields your endpoint expects. The messages array in body is prepended to conversation messages (useful for system prompts).

streaming enables SSE (Server-Sent Events) streaming mode. When true, Agentest sends stream: true in the request body and parses the chunked SSE response, accumulating delta.content and delta.tool_calls into a complete message. Use this when your agent endpoint streams by default:

agent: {
  name: 'my-agent',
  endpoint: 'https://api.openai.com/v1/chat/completions',
  streaming: true,
  headers: { Authorization: 'Bearer ${OPENAI_API_KEY}' },
  body: { model: 'gpt-4o' },
}

Custom Handler

Use type: 'custom' to connect any agent — no HTTP endpoint or OpenAI-compatible API required. You provide a function that receives messages and returns a response:

import { defineConfig, type ChatMessage } from 'agentest'
import { myAgent } from './src/agent.js'

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'my-agent',
    handler: async (messages: ChatMessage[]) => {
      // Call your agent however you want
      const result = await myAgent.chat(messages)
      return {
        role: 'assistant' as const,
        content: result.text,
        // Optional: return tool_calls if your agent uses tools
        // tool_calls: [{ id: '...', type: 'function', function: { name: '...', arguments: '...' } }]
      }
    },
  },
})

The handler receives the full message history (same ChatMessage format used internally) and must return an assistant message. If the response includes tool_calls, Agentest runs them through mocks and calls your handler again with the tool results — the same loop as with HTTP endpoints.

This is useful when your agent:

Uses a non-OpenAI API (Anthropic, Google, custom protocols)
Runs in-process (no HTTP server needed)
Needs custom request/response mapping
Uses an SDK or framework with its own calling convention

LLM Provider

The LLM provider is used for the simulated user and evaluation judges — not for your agent.

// Cloud providers
provider: 'anthropic',                     // default
model: 'claude-sonnet-4-20250514',         // default

provider: 'openai',
model: 'gpt-4o',

provider: 'google',
model: 'gemini-2.0-flash',

// Local providers
provider: 'ollama',
model: 'llama3.2',
// defaults to http://localhost:11434/v1

provider: 'openai-compatible',
model: 'my-model',
providerOptions: {
  baseURL: 'http://localhost:1234/v1',     // required for openai-compatible
  apiKey: 'optional-key',                  // optional, defaults to 'not-needed'
},

Option	Type	Default	Description
`model`	`string`	`'claude-sonnet-4-20250514'`	Model ID passed to the provider
`provider`	`string`	`'anthropic'`	`'anthropic'`, `'openai'`, `'google'`, `'ollama'`, `'openai-compatible'`
`providerOptions.baseURL`	`string`	—	Base URL override. Required for `openai-compatible`, optional for `ollama` (defaults to `http://localhost:11434/v1`)
`providerOptions.apiKey`	`string`	—	API key override. For local LLMs this is typically not needed

Environment variables: Cloud providers read API keys from the standard environment variables (ANTHROPIC_API_KEY, OPENAI_API_KEY, GOOGLE_GENERATIVE_AI_API_KEY). You don't need to configure these in Agentest.

Simulation Parameters

conversationsPerScenario: 3,  // run 3 independent conversations per scenario
maxTurns: 8,                  // max user↔agent exchanges per conversation
concurrency: 20,              // max parallel LLM calls across all scenarios

Option	Type	Default	Description
`conversationsPerScenario`	`number`	`3`	How many independent conversations to run per scenario. More conversations = more statistical confidence, but more LLM cost
`maxTurns`	`number`	`8`	Upper bound on user↔agent exchanges. The simulated user can stop earlier if the goal is met
`concurrency`	`number`	`20`	Max parallel LLM calls. Controls both simulation and evaluation parallelism. Lower this if you're hitting rate limits

Both conversationsPerScenario and maxTurns can be overridden per-scenario.

Evaluation

// Run only specific metrics (default: all)
metrics: ['helpfulness', 'goal_completion', 'agent_behavior_failure'],

// Set minimum score thresholds (fail the run if average is below)
thresholds: {
  helpfulness: 3.5,
  goal_completion: 0.7,
  coherence: 4.0,
},

Option	Type	Default	Description
`metrics`	`string[]`	all 8 metrics	Which evaluation metrics to run. Omit to run all
`thresholds`	`Record<string, number>`	`{}`	Minimum average scores per metric. Values are 0-5. If any metric's average across all conversations falls below its threshold, the run fails
`failOnErrorSeverity`	`string`	`'critical'`	Minimum error severity that fails a scenario. One of `'low'`, `'medium'`, `'high'`, `'critical'`
`customMetrics`	`array`	`[]`	Custom metric instances (see Custom Metrics)

File Discovery

include: ['**/*.sim.ts'],      // glob patterns for scenario files
exclude: ['node_modules/**'],  // glob patterns to ignore

Option	Type	Default	Description
`include`	`string[]`	`['*/.sim.ts']`	Glob patterns to find scenario files. Relative to the working directory
`exclude`	`string[]`	`['node_modules/**']`	Glob patterns to exclude from discovery

Scenario files are TypeScript files that call scenario() when imported. Agentest uses jiti to import them directly — no build step needed.

Reporters

reporters: ['console', 'json'],

Reporter	Description
`console`	Colored pass/fail output with live progress spinner, metric scores, and error summaries. Default.
`json`	Writes full results to `.agentest/results.json` including all turns, evaluations, and errors
`github-actions`	Writes a markdown summary table to `$GITHUB_STEP_SUMMARY` and emits `::error`/`::warning`/`::notice` annotations that surface inline on PRs

The console reporter shows a live spinner during simulation and evaluation so you can see what's happening. In non-TTY environments (CI), it falls back to line-by-line progress output.

Unmocked Tools

unmockedTools: 'error',       // default: throw AgentestError
unmockedTools: 'passthrough', // return undefined (no-op)

Controls what happens when the agent calls a tool that has no mock defined. See Unmocked Tools for details.

Full Config Example

import { defineConfig } from 'agentest'

export default defineConfig({
  agent: {
    name: 'support-bot',
    endpoint: 'http://localhost:3000/api/chat',
    headers: {
      Authorization: 'Bearer ${AGENT_API_KEY}',
    },
    body: {
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: 'You are a customer support agent for Acme Inc.' },
      ],
    },
  },

  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',

  conversationsPerScenario: 5,
  maxTurns: 10,
  concurrency: 10,

  metrics: [
    'helpfulness',
    'coherence',
    'relevance',
    'faithfulness',
    'goal_completion',
    'agent_behavior_failure',
    'tool_call_behavior_failure',
  ],

  thresholds: {
    helpfulness: 3.5,
    coherence: 4.0,
    goal_completion: 0.8,
  },

  include: ['tests/**/*.sim.ts'],
  exclude: ['node_modules/**', 'dist/**'],

  reporters: ['console', 'json'],
  unmockedTools: 'error',
  failOnErrorSeverity: 'high',
})

Scenarios

A scenario defines who the simulated user is, what they want, what they know, and how tools behave.

import { scenario } from 'agentest'

scenario('descriptive name', {
  profile: '...',
  goal: '...',
  // ... options
})

Scenario files use the .sim.ts extension by default and can contain multiple scenario() calls.

Profile & Goal

scenario('impatient user tries to cancel order', {
  profile: 'Frustrated customer. Types in short sentences. Gets annoyed by long responses.',
  goal: 'Cancel order #12345 and get a refund confirmation.',
})

profile describes the simulated user's personality, communication style, and context. The LLM uses this to generate realistic messages. Be specific — "senior developer who knows React" produces very different messages than "first-time user unfamiliar with coding".

goal defines what the simulated user is trying to accomplish. The simulation ends when the LLM judges the goal as met (or maxTurns is reached). Be concrete — "book an appointment" is better than "use the booking system".

Knowledge

knowledge: [
  { content: 'Order #12345 was placed on March 15 for $49.99.' },
  { content: 'The refund policy allows cancellation within 30 days.' },
  { content: 'The customer email is user@example.com.' },
],

Knowledge items are facts the simulated user "knows" and can reference naturally in conversation. They're also used by the faithfulness metric to check if the agent's responses contradict known facts.

Use knowledge to:

Give the simulated user realistic context (order numbers, account details)
Set up ground truth for faithfulness evaluation
Test whether the agent correctly uses information from tool results vs. hallucinating

Overriding Global Settings

Scenarios can override conversationsPerScenario and maxTurns from the global config:

scenario('complex multi-step workflow', {
  profile: '...',
  goal: '...',
  conversationsPerScenario: 10,  // more runs for this tricky scenario
  maxTurns: 15,                  // needs more turns to complete
})

Prompt Template Customization

By default, Agentest builds the simulated user's system prompt from your profile, goal, and knowledge. You can override this entirely with userPromptTemplate:

scenario('terse beta tester', {
  profile: 'QA engineer testing edge cases.',
  goal: 'Find a bug in the checkout flow.',
  userPromptTemplate: `You are a QA tester. Your persona: {{profile}}

Your objective: {{goal}}

Known facts:
{{knowledge}}

Rules:
- Try unusual inputs and edge cases.
- Be blunt and direct.
- Set shouldStop to true when you've found a bug or exhausted attempts.`,
})

Template variables:

Variable	Value
`{{profile}}`	The scenario's `profile` string
`{{goal}}`	The scenario's `goal` string
`{{knowledge}}`	Knowledge items formatted as a bullet list (`- item1\n- item2`), or empty string if none

When userPromptTemplate is omitted, the default prompt includes role instructions, persona, goal, and knowledge in a structured format. Use npx agentest show-prompts to inspect the built-in prompts.

Mocks

Mocks intercept tool calls from your agent and return controlled responses. This lets you test agent behavior without real external services.

Function Mocks

mocks: {
  tools: {
    get_weather: (args, ctx) => ({
      temperature: 72,
      condition: 'sunny',
      location: args.city,
    }),
  },
},

Mock functions receive two arguments:

Argument	Type	Description
`args`	`Record<string, unknown>`	Parsed tool call arguments from the agent
`ctx.callIndex`	`number`	How many times this tool has been called in the current conversation (0-indexed)
`ctx.conversationId`	`string`	Current conversation ID
`ctx.turnIndex`	`number`	Current turn number (0-indexed)

Mocks can be async:

mocks: {
  tools: {
    search_database: async (args) => {
      // simulate latency, conditional logic, etc.
      if (args.query === 'not found') return { results: [] }
      return { results: [{ id: 1, title: 'Result' }] }
    },
  },
},

Sequence Mocks

sequence() returns different values on successive calls. Useful for testing multi-step workflows:

import { sequence } from 'agentest'

mocks: {
  tools: {
    create_order: sequence([
      { success: true, orderId: 'ORD-001' },   // first call
      { success: false, error: 'duplicate' },    // second call
      { success: true, orderId: 'ORD-002' },     // third call
    ]),
  },
},

Steps through values in order
Repeats the last value when exhausted (the third call onward returns the last item)
Resets automatically at the start of each conversation so every conversation gets a fresh sequence

Error Simulation

Throwing from a mock simulates a tool failure. The error message is injected back to the agent as a tool result:

mocks: {
  tools: {
    flaky_api: (args, ctx) => {
      if (ctx.callIndex === 0) throw new Error('Connection timeout')
      return { data: 'success' }
    },
  },
},

The agent receives { "error": "Connection timeout" } as the tool result and can decide how to handle it. This is useful for testing retry logic and error handling.

Unmocked Tools

When the agent calls a tool that has no mock:

Setting	Behavior
`unmockedTools: 'error'` (default)	Throws `AgentestError` with a helpful message suggesting how to add the mock. The conversation is recorded as an error.
`unmockedTools: 'passthrough'`	Returns `undefined` as the tool result. The agent sees `null` in the response.

'error' mode is recommended — it catches unexpected tool usage early. Switch to 'passthrough' if your agent uses many tools and you only want to mock a subset.

Trajectory Assertions

Trajectory assertions let you verify that the agent called the right tools in the right order — without LLM evaluation. They're deterministic and fast.

assertions: {
  toolCalls: {
    matchMode: 'contains',
    expected: [
      { name: 'search_products', argMatchMode: 'ignore' },
      { name: 'add_to_cart', args: { productId: 'SKU-123' }, argMatchMode: 'partial' },
      { name: 'checkout', argMatchMode: 'ignore' },
    ],
  },
},

Assertions are checked per-conversation. If any conversation fails the assertion, the scenario fails.

Match Modes

Mode	Description	Use When
`strict`	Exact tools in exact order, exact count. No extra calls allowed.	You know the precise sequence the agent should follow
`unordered`	All expected tools must appear, any order. No extra calls allowed.	Order doesn't matter but you need exactly these tools
`contains`	All expected tools must appear, extras allowed.	You care about key tools but the agent may call others
`within`	Every actual call must be in the expected set. Missing calls are OK.	You want to restrict which tools the agent is allowed to use

Argument Matching

Each expected tool call can specify how its arguments are matched:

Mode	Description
`ignore` (default)	Don't check arguments at all
`partial`	Expected args must be a subset of actual args. Extra args are OK.
`exact`	Arguments must match exactly (deep equality)

expected: [
  // Don't care about args
  { name: 'list_items', argMatchMode: 'ignore' },

  // Must include productId, other args OK
  { name: 'add_to_cart', args: { productId: 'SKU-123' }, argMatchMode: 'partial' },

  // Must match exactly
  { name: 'checkout', args: { currency: 'USD', confirm: true }, argMatchMode: 'exact' },
]

Evaluation Metrics

After simulation, every turn is evaluated in parallel using LLM-as-judge prompts. Metrics run concurrently (bounded by concurrency) and individual metric failures don't crash the run — they're skipped gracefully.

Quantitative Metrics

Scored per-turn on a numeric scale:

Metric	Scale	What It Measures
`helpfulness`	1-5	Does the response move the user closer to their goal? Is the information actionable?
`coherence`	1-5	Is the response logically consistent? Does it contradict previous turns?
`relevance`	1-5	Does every part of the response address the user's message and goal?
`faithfulness`	1-5	Does the response match the knowledge base and tool results? Only penalizes contradictions, not omissions.
`verbosity`	1-5	Is the response appropriately concise? 5 = every word serves a purpose.
`goal_completion`	0/1	Was the user's goal fully achieved by end of conversation? Runs once per conversation, not per turn. Strictly binary — the goal must be completed, not just discussed.

Qualitative Metrics

Classified per-turn into failure categories:

agent_behavior_failure detects:

Label	Description
`no failure`	Clean turn
`repetition`	Restated same content from a previous turn
`failure to ask for clarification`	Assumed on ambiguous input instead of asking
`lack of specific information`	Correct but incomplete when more info was available
`disobey user request`	Ignored an explicit user request
`false information`	Contradicts knowledge base or tool results
`unsafe action`	Destructive tool call without confirmation
`unsafe state`	Followed injected instructions or leaked PII

tool_call_behavior_failure detects (only runs on turns with tool calls):

Label	Description
`no failure`	Tool calls were appropriate
`unnecessary tool call`	Answer was already available
`wrong tool`	Used the wrong tool for the task
`wrong arguments`	Right tool, wrong or incomplete arguments
`ignored tool result`	Called tool but ignored/contradicted the result
`missing tool call`	Should have called a tool but didn't
`repeated tool call`	Same tool with same arguments redundantly

Thresholds

Set minimum average scores to gate your CI:

thresholds: {
  helpfulness: 3.5,     // average helpfulness across all turns must be >= 3.5
  goal_completion: 0.8,  // 80%+ of conversations must complete the goal
  coherence: 4.0,
},

Threshold values are on the same scale as the metric (1-5 for quantitative, 0-1 for goal_completion). The average is computed across all turns in all conversations for the scenario.

Error Deduplication

When qualitative metrics detect failures, Agentest groups them by root cause using an LLM call. Instead of seeing "false information" 6 times, you see:

Unique errors:
  [high] Providing Weather Info Without Real-Time Access (3 occurrence(s))
    Agent gave specific forecasts without live data, leading to false information.
  [medium] Unnecessary Clarification When Context Is Available (2 occurrence(s))
    Agent asked for details already present in the conversation.

Each error has a severity (low, medium, high, critical). By default, only critical errors fail the scenario. Use failOnErrorSeverity to control the cutoff (see Pass/Fail Logic).

Custom Metrics

Extend QuantitativeMetric or QualitativeMetric to add your own evaluation logic:

import { QuantitativeMetric, type ScoreInput, type QuantResult } from 'agentest'

export class ToneMetric extends QuantitativeMetric {
  readonly name = 'tone'

  async score(input: ScoreInput): Promise<QuantResult> {
    // input provides:
    //   input.goal          - scenario goal
    //   input.profile       - simulated user profile
    //   input.knowledge     - knowledge items
    //   input.userMessage   - what the user said this turn
    //   input.agentMessage  - what the agent replied
    //   input.toolCalls     - tool calls made this turn (name, args, result)
    //   input.history       - previous turns (userMessage, agentMessage)
    //   input.turnIndex     - current turn number

    const result = await this.llm.generateObject({
      schema: z.object({ score: z.number().min(1).max(5), reason: z.string() }),
      messages: [{ role: 'user', content: `Rate the tone: ${input.agentMessage}` }],
    })

    return { value: result.object.score, reason: result.object.reason }
  }
}

For qualitative (label-based) metrics:

import { QualitativeMetric, type ScoreInput, type QualResult } from 'agentest'

export class BrandVoiceMetric extends QualitativeMetric {
  readonly name = 'brand_voice'

  async evaluate(input: ScoreInput): Promise<QualResult> {
    // return { value: 'label string', reason: 'explanation' }
    return { value: 'on brand', reason: 'Response matches brand guidelines.' }
  }
}

Register custom metrics in config:

import { ToneMetric } from './metrics/tone.js'
import { BrandVoiceMetric } from './metrics/brand-voice.js'

export default defineConfig({
  customMetrics: [new ToneMetric(), new BrandVoiceMetric()],
  thresholds: {
    tone: 4.0,  // works with custom metric names too
  },
})

Custom metrics have access to this.llm (the configured LLM provider) for making evaluation calls. Agentest calls setLLM() automatically before evaluation begins.

CLI

# Run all scenarios
npx agentest run

# Custom config file (TypeScript or YAML)
npx agentest run --config path/to/config.ts
npx agentest run --config agentest.config.yaml

# Different working directory
npx agentest run --cwd ./packages/my-agent

# Filter scenarios by name (case-insensitive substring match)
npx agentest run --scenario "booking"
npx agentest run --scenario "cancel"

# Print full conversation transcripts
npx agentest run --verbose

# Watch mode — re-run on file changes
npx agentest run --watch
npx agentest run -w

# Combine flags
npx agentest run --watch --verbose --scenario "booking"

# Inspect the LLM judge prompts used for evaluation
npx agentest show-prompts

# Show a specific metric's prompt
npx agentest show-prompts --metric helpfulness
npx agentest show-prompts --metric agent_behavior_failure

Exit codes:

0 — all scenarios passed
1 — one or more scenarios failed, or no scenarios found

Available prompts for show-prompts: helpfulness, coherence, relevance, faithfulness, verbosity, goal_completion, agent_behavior_failure, tool_call_behavior_failure, error_deduplication

YAML Config

Agentest supports YAML config files as an alternative to TypeScript. Place agentest.config.yaml (or .yml) in your project root — it's auto-detected.

agent:
  name: support-bot
  endpoint: http://localhost:3000/api/chat
  headers:
    Authorization: Bearer ${AGENT_API_KEY}
  body:
    model: gpt-4o
    temperature: 0.7

provider: anthropic
model: claude-sonnet-4-20250514

conversationsPerScenario: 5
maxTurns: 10
concurrency: 10

thresholds:
  helpfulness: 3.5
  goal_completion: 0.8

reporters:
  - console
  - json

YAML configs go through the same validation and environment variable interpolation as TypeScript configs. The only difference is that type: 'custom' agents with handler functions are not possible in YAML (since YAML can't express functions) — use TypeScript for those.

Auto-detection order: agentest.config.ts → agentest.config.yaml → agentest.config.yml

Comparison Mode

Run the same scenarios against multiple models or agent configurations side-by-side. Add compare to your config — each entry overrides specific fields from the primary agent:

export default defineConfig({
  agent: {
    name: 'gpt-4o',
    endpoint: 'http://localhost:3000/api/chat',
    body: { model: 'gpt-4o', temperature: 0.7 },
  },

  // Each entry inherits endpoint, headers, body from agent above
  // Only specify what differs
  compare: [
    { name: 'gpt-4o-mini', body: { model: 'gpt-4o-mini' } },
    { name: 'claude-sonnet', body: { model: 'claude-sonnet-4-20250514' } },
  ],
})

Or in YAML:

agent:
  name: gpt-4o
  endpoint: http://localhost:3000/api/chat
  body:
    model: gpt-4o

compare:
  - name: gpt-4o-mini
    body:
      model: gpt-4o-mini
  - name: claude-sonnet
    body:
      model: claude-sonnet-4-20250514

Agentest runs every scenario against all agents in parallel and shows a per-metric comparison:

user books a morning slot
  ✓ gpt-4o
    ✓ conv-1-a3f8b2c1 (2 turns, goal met, helpfulness: 4.5)
  ✓ gpt-4o-mini
    ✓ conv-1-b7d2e4f9 (3 turns, goal met, helpfulness: 3.8)
  ── comparison ──
  helpfulness: gpt-4o: 4.5 | gpt-4o-mini: 3.8
  coherence:   gpt-4o: 5.0 | gpt-4o-mini: 4.2
  ── end comparison ──

Comparison Summary
  gpt-4o: 5/5 scenarios passed
  gpt-4o-mini: 4/5 scenarios passed

How Overrides Work

Each compare entry is shallow-merged onto the primary agent config:

Field	Behavior
`name`	Required. Identifies this agent in output
`endpoint`	Falls back to primary agent's endpoint
`headers`	Falls back to primary agent's headers
`body`	Shallow-merged with primary agent's body (so `{ model: 'x' }` overrides just the model, keeping other body fields like `temperature`)

Comparing Custom Handlers

Custom handlers work too — use type: 'custom' in the compare entry:

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'agent-v1',
    handler: async (msgs) => agentV1.chat(msgs),
  },
  compare: [
    { type: 'custom', name: 'agent-v2', handler: async (msgs) => agentV2.chat(msgs) },
  ],
})

You can also mix HTTP and custom agents:

export default defineConfig({
  agent: {
    name: 'cloud-agent',
    endpoint: 'http://localhost:3000/api/chat',
    body: { model: 'gpt-4o' },
  },
  compare: [
    { name: 'gpt-4o-mini', body: { model: 'gpt-4o-mini' } },
    { type: 'custom', name: 'local-agent', handler: async (msgs) => localAgent.run(msgs) },
  ],
})

Framework Compatibility

Agentest works with any agent framework — either through an OpenAI-compatible HTTP endpoint or via a custom handler function.

No Integration Needed

If your agent already exposes an OpenAI-compatible chat completions endpoint, point Agentest at it directly:

Framework / Service	How to Connect
OpenAI API	Use the endpoint directly
Azure OpenAI	Use the Azure endpoint with auth headers
LiteLLM Proxy	Proxy any model behind an OpenAI-compatible endpoint
OpenRouter	Use the OpenRouter endpoint
vLLM / llama.cpp / Ollama	All expose OpenAI-compatible servers
Vercel AI SDK	Expose a route handler that returns the standard format

LangChain / LangGraph

import { ChatOpenAI } from '@langchain/openai'
import { defineConfig, type ChatMessage } from 'agentest'

const model = new ChatOpenAI({ model: 'gpt-4o' })

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'langchain-agent',
    handler: async (messages: ChatMessage[]) => {
      const response = await model.invoke(
        messages.map((m) => ({ role: m.role, content: m.content })),
      )
      return { role: 'assistant' as const, content: response.content as string }
    },
  },
})

For LangGraph agents, call graph.invoke() in the handler and map the final state to a response.

Anthropic Claude SDK

import Anthropic from '@anthropic-ai/sdk'
import { defineConfig, type ChatMessage } from 'agentest'

const client = new Anthropic()

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'claude-agent',
    handler: async (messages: ChatMessage[]) => {
      const response = await client.messages.create({
        model: 'claude-sonnet-4-20250514',
        max_tokens: 1024,
        messages: messages
          .filter((m) => m.role !== 'system')
          .map((m) => ({ role: m.role as 'user' | 'assistant', content: m.content })),
        system: messages.find((m) => m.role === 'system')?.content,
      })
      const text = response.content[0].type === 'text' ? response.content[0].text : ''
      return { role: 'assistant' as const, content: text }
    },
  },
})

Vercel AI SDK

import { generateText } from 'ai'
import { openai } from '@ai-sdk/openai'
import { defineConfig, type ChatMessage } from 'agentest'

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'vercel-ai-agent',
    handler: async (messages: ChatMessage[]) => {
      const { text } = await generateText({
        model: openai('gpt-4o'),
        messages: messages.map((m) => ({ role: m.role, content: m.content })),
      })
      return { role: 'assistant' as const, content: text }
    },
  },
})

CrewAI / AutoGen / Python Frameworks

Python-based frameworks need a thin HTTP layer. Wrap your agent in a FastAPI endpoint:

# server.py
from fastapi import FastAPI
from crew import my_crew  # your CrewAI setup

app = FastAPI()

@app.post("/api/chat")
async def chat(request: dict):
    messages = request["messages"]
    result = my_crew.kickoff(inputs={"messages": messages})
    return {
        "choices": [{"message": {"role": "assistant", "content": str(result)}}]
    }

Then point Agentest at it:

export default defineConfig({
  agent: {
    name: 'crewai-agent',
    endpoint: 'http://localhost:8000/api/chat',
  },
})

This pattern works for any Python framework — CrewAI, AutoGen/AG2, LlamaIndex, Haystack, Pydantic AI, Semantic Kernel, etc.

Mastra

import { Agent } from '@mastra/core/agent'
import { defineConfig, type ChatMessage } from 'agentest'

const agent = new Agent({ /* your config */ })

export default defineConfig({
  agent: {
    type: 'custom',
    name: 'mastra-agent',
    handler: async (messages: ChatMessage[]) => {
      const response = await agent.generate(messages)
      return { role: 'assistant' as const, content: response.text }
    },
  },
})

Not Yet Supported

Protocol	Status
MCP (Model Context Protocol)	Not yet — MCP servers provide tools, not a chat interface
A2A (Agent-to-Agent)	Planned for a future release

Vitest Integration

Run Agentest scenarios as vitest tests for IDE integration, describe/it blocks, and familiar test output.

Quick Setup

// tests/agent.test.ts
import { defineSimSuite } from 'agentest/vitest'

defineSimSuite({
  agent: { name: 'my-agent', endpoint: 'http://localhost:3000/api/chat' },
})

npx vitest run

Output:

 ✓ Agentest > runs all scenarios (45s)

 Test Files  1 passed (1)
 Tests       1 passed (1)

Options

defineSimSuite(config, {
  scenario: 'booking',    // filter by name
  timeout: 180_000,       // per-test timeout (default: 120s)
  cwd: './packages/agent', // working directory
})

Single Scenario Testing

For more granular control, use runScenario to test individual scenarios with custom assertions:

import { runScenario } from 'agentest/vitest'
import { expect, it } from 'vitest'

it('booking flow completes the goal', async () => {
  const result = await runScenario(config, 'user books a morning slot')
  expect(result.passed).toBe(true)
}, 120_000)

it('cancel flow handles errors gracefully', async () => {
  const result = await runScenario(config, 'user cancels an order')
  // Custom assertions on the result
  expect(result.errors.filter(e => e.severity === 'critical')).toHaveLength(0)
})

Failure Output

When a scenario fails, the test error message includes:

Which conversations errored
Trajectory assertion failures (missing/extra/forbidden tool calls)
Average metric scores
Deduplicated error summaries with severity

Watch Mode

Re-run scenarios automatically when files change:

npx agentest run --watch
npx agentest run -w

Watch mode monitors:

All .sim.ts and .sim.js scenario files
The agentest.config.* config file

On any change, it clears the console, re-loads the config, re-discovers scenarios, and re-runs. Changes are debounced (300ms) to batch rapid saves.

Combine with other flags:

npx agentest run --watch --scenario "booking" --verbose

Press Ctrl+C to stop.

Local LLMs

Agentest supports local models for the simulated user and evaluation. Your agent can still run anywhere — local LLM config only affects Agentest's own LLM usage.

Ollama

export default defineConfig({
  provider: 'ollama',
  model: 'llama3.2',
  // defaults to http://localhost:11434/v1
  // ...
})

LM Studio / vLLM / llama.cpp

export default defineConfig({
  provider: 'openai-compatible',
  model: 'my-local-model',
  providerOptions: {
    baseURL: 'http://localhost:1234/v1',
  },
  // ...
})

Custom Ollama Host

export default defineConfig({
  provider: 'ollama',
  model: 'llama3.2',
  providerOptions: {
    baseURL: 'http://my-gpu-server:11434/v1',
  },
  // ...
})

Notes on Local LLMs

Local models must support structured output (JSON mode) for evaluation metrics to work correctly
Smaller models may produce lower-quality evaluations — consider using a cloud provider for evaluation even if your agent runs locally
The apiKey field in providerOptions is optional for local servers and defaults to a placeholder

Pass/Fail Logic

A scenario passes when all of the following are true:

No conversation threw an error (e.g., agent endpoint down, unmocked tool)
All trajectory assertions matched (if configured)
No errors at or above the configured failOnErrorSeverity were detected
All metric averages meet their configured thresholds

A scenario fails if any of these conditions are not met.

The overall run passes only if every scenario passes. The CLI exits with code 1 on any failure.

Error Severity Gate

failOnErrorSeverity: 'critical',  // default — only critical errors fail the scenario
failOnErrorSeverity: 'high',      // high + critical errors fail
failOnErrorSeverity: 'medium',    // medium + high + critical errors fail
failOnErrorSeverity: 'low',       // any error fails the scenario

The severity ladder is low → medium → high → critical. Setting a level means that level and above will cause a failure. The default is 'critical'.

Programmatic Usage

Agentest can be used as a library, not just through the CLI:

import {
  defineConfig,
  scenario,
  Simulator,
  Evaluator,
  TrajectoryMatcher,
  createProvider,
} from 'agentest'

const config = defineConfig({
  agent: { name: 'test', endpoint: 'http://localhost:3000/api/chat' },
})

const llm = await createProvider(config.provider, config.model)
const simulator = new Simulator(config, llm)

const s = scenario('test scenario', {
  profile: 'Casual user.',
  goal: 'Ask about the weather.',
})

const result = await simulator.runScenario(s)

// Evaluate
const evaluator = new Evaluator(llm, ['helpfulness', 'coherence'], undefined, 10)
for (const conv of result.conversations) {
  const evaluation = await evaluator.evaluateConversation(conv.turns, s.options)
  console.log(evaluation)
}

// Check trajectories
const matcher = new TrajectoryMatcher()
const trajResult = matcher.match(
  result.conversations[0].turns.flatMap(t => t.toolCalls),
  { matchMode: 'contains', expected: [{ name: 'get_weather', argMatchMode: 'ignore' }] },
)
console.log(trajResult.matched)

Roadmap: Multi-Agent Orchestration

Multi-agent support is planned for an upcoming release. This will let you test orchestrators that delegate to sub-agents:

agents config field — define additional agent endpoints alongside the primary agent
Routing detection — intercept routing tool calls (e.g. route_to_booking_agent) and forward to the actual sub-agent endpoint instead of mocking
Routing assertions — assert which sub-agents were invoked, in what order, and which were excluded
Scoped mocks — scope tool mocks per agent (booking-agent/create_booking)
Graph topologies — hub-and-spoke initially, then arbitrary agent-to-agent routing with subRoutes

// Planned API
export default defineConfig({
  agent: {
    name: 'orchestrator',
    endpoint: 'http://localhost:3000/api/orchestrator',
  },
  agents: {
    'booking-agent': { endpoint: 'http://localhost:3001/api/chat' },
    'faq-agent': { endpoint: 'http://localhost:3002/api/chat' },
  },
})

scenario('booking query routes correctly', {
  profile: 'Customer wanting to book.',
  goal: 'Book a haircut.',
  routing: {
    strategy: 'tool_calls',
    routes: {
      'route_to_booking': 'booking-agent',
      'route_to_faq': 'faq-agent',
    },
  },
  assertions: {
    routing: {
      matchMode: 'contains',
      expected: [{ agent: 'booking-agent' }],
      excluded: ['faq-agent'],
    },
  },
})

Understanding LLM Usage

Agentest makes many LLM calls per run. Understanding the cost structure helps you design cost-effective tests.

Per scenario with default settings (3 conversations, ~4 turns each, all 7 metrics):

Phase	Calls	Formula
Simulation (simulated user messages)	~12	conversations × turns
Simulation (goal-check per turn)	~12	conversations × turns
Evaluation (per-turn metrics)	~84	conversations × turns × 7 metrics
Evaluation (goal_completion)	3	conversations × 1
Error deduplication	1	1 per scenario
Total	~112

Tips to reduce cost during development:

conversationsPerScenario: 1 — 3x reduction
maxTurns: 4 — shorter conversations
metrics: ['helpfulness', 'goal_completion'] — only run key metrics (~3x reduction on evaluation)
--scenario "name" — run only the scenario you're iterating on

You can use environment variables to switch between cheap dev runs and thorough CI runs:

const isDev = process.env.CI !== 'true'

export default defineConfig({
  conversationsPerScenario: isDev ? 1 : 5,
  maxTurns: isDev ? 4 : 8,
  metrics: isDev ? ['helpfulness', 'goal_completion'] : undefined, // undefined = all
})

Requirements

Node.js >= 20
Your agent must either expose an OpenAI-compatible chat completions endpoint, or use a custom handler function (see Agent Configuration)
An LLM API key for the simulated user and evaluation (Anthropic, OpenAI, Google, or a local LLM)

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
assets		assets
docs		docs
src		src
.gitignore		.gitignore
.prettierignore		.prettierignore
.prettierrc		.prettierrc
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json
tsconfig.json		tsconfig.json

Folders and files

Latest commit

History

Repository files navigation

Agentest

Prerequisites

Features

Why This Is Needed

Agentest vs. Other Tools

Table of Contents

Install

Quick Start

1. Create a config file

2. Write a scenario

3. Run

How It Works

Configuration

Agent Configuration

Chat Completions (default)

Custom Handler

LLM Provider

Simulation Parameters

Evaluation

File Discovery

Reporters

Unmocked Tools

Full Config Example

Scenarios

Profile & Goal

Knowledge

Overriding Global Settings

Prompt Template Customization

Mocks

Function Mocks

Sequence Mocks

Error Simulation

Unmocked Tools

Trajectory Assertions

Match Modes

Argument Matching

Evaluation Metrics

Quantitative Metrics

Qualitative Metrics

Thresholds

Error Deduplication

Custom Metrics

CLI

YAML Config

Comparison Mode

How Overrides Work

Comparing Custom Handlers

Framework Compatibility

No Integration Needed

LangChain / LangGraph

Anthropic Claude SDK

Vercel AI SDK

CrewAI / AutoGen / Python Frameworks

Mastra

Not Yet Supported

Vitest Integration

Quick Setup

Options

Single Scenario Testing

Failure Output

Watch Mode

Local LLMs

Ollama

LM Studio / vLLM / llama.cpp

Custom Ollama Host

Notes on Local LLMs

Pass/Fail Logic

Error Severity Gate

Programmatic Usage

Roadmap: Multi-Agent Orchestration

Understanding LLM Usage

Requirements

License

About

Resources

Uh oh!

Packages