Errantry

Agent-CLI usability testing — drive an LLM through your CLI using only --help, error messages, and discovery commands, then assert state. Tests whether your CLI is legible to an agent, not just whether commands work.

Part of the Signals & Sorcery ecosystem.

Most CLI tests answer "does this command work." Errantry answers "can an agent discover how to make it work, given only the surface your CLI exposes?" — the question that matters when the user of your CLI is an LLM, not a human. Closest neighbors are Terminal-Bench and Inspect AI; both test agent capability on tasks. Errantry inverts the frame: the CLI is the unit under test, the agent is the probe.

Try it in 30 seconds (no LLM key required)

git clone git@github.com:shiehn/Errantry.git
cd Errantry
npm install
npx tsc -b packages/core packages/cli packages/electron-bridge packages/playwright
bash scripts/mock-todo-test.sh

You'll see a markdown report with green assertions, a turn count, and a trace of the bash commands the (mock) agent ran against the bundled todo CLI. For a real-LLM run, drop --mock, set OPENAI_API_KEY, and run the same command.

Why

If your CLI is organized well enough, and your error messages are contextual enough, an agent should be able to recover from wrong turns and still complete the original ask.

That hypothesis is testable. Define the user's ask in plain English, hand the agent the CLI binary on PATH, let it use --help / errors / dry-runs, and assert the resulting state. If the assertion fails, your help text or error message is the weak link — and the trace tells you which one.

Surfaces

The agent's reach into the system under test is mediated by a surface:

| `surface:` | What the agent sees | Use when |
| --- | --- | --- |
| `cli` | A single `bash` tool. Each call spawns a fresh subprocess. | The thing under test is a CLI binary on `PATH`. |
| `chat` | A single `chat({message})` tool that POSTs to `/errantry/chat` on your bridge. The bridge forwards to your in-app chat assistant, which drives the underlying tools and replies. | The thing under test is the chat-driven UX of your app: does a natural-language ask deliver the same outcome as the bare CLI? |

A chat scenario passes when the assistant turns one plain-English message into the right state change. Because both surfaces share the assertion vocabulary (dbQuery, toolCalled, budget), you can write a CLI scenario and a chat scenario against the same goal and compare their friction scores.

Packages

| Package | Role |
| --- | --- |
| `@errantry/core` | Agent loop, scenario format (YAML), assertion matchers. OpenAI + Anthropic + Mock providers. Test-runner-agnostic. |
| `@errantry/cli` | `errantry run scenario.yaml` standalone runner with `--mock` for tokenless dry-runs. |
| `@errantry/electron-bridge` | Drop-in HTTP bridge for Electron-TS apps under test. Exposes `/errantry/{health,smoke,db/query,fixture,reset,app-config,chat}` from your main process. Read-only SQL guard. |
| `@errantry/playwright` | First-class Playwright extension with `errantry` and `app` fixtures, custom `expect` matchers (`toolCalled`, `budgetRespected`, `toHaveRow`). |
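
The "read-only SQL guard" on the bridge is not specified further here. As a rough illustration of the idea only, a guard of this kind could look like the sketch below. The function name `isReadOnlySql` and its rules are assumptions made for this example, not the bridge's actual implementation.

```typescript
// Hypothetical guard, NOT taken from @errantry/electron-bridge.
// It accepts a single SELECT statement and rejects everything else.
function isReadOnlySql(sql: string): boolean {
  // Drop block comments and line comments so they cannot hide a keyword.
  const stripped = sql
    .replace(/\/\*[\s\S]*?\*\//g, " ")
    .replace(/--[^\n]*/g, " ")
    .trim();

  // Allow one optional trailing semicolon, then refuse any other semicolon.
  // Deliberately conservative: a semicolon inside a string literal is
  // rejected too.
  const body = stripped.replace(/;\s*$/, "");
  if (body.includes(";")) return false;

  // Only plain SELECT is accepted. WITH is refused because a CTE can
  // precede INSERT, UPDATE, or DELETE in some SQL dialects.
  return /^select\b/i.test(body);
}
```

A guard like this errs on the side of rejecting valid queries, which suits an assertion endpoint: a refused query fails the test loudly, while an accepted write would silently corrupt the state being asserted on.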

Scenario format

name: happy-path-create-scene
surface: cli

setup:
  fixture: blank-project

agent:
  provider: openai
  model: gpt-4o-mini
  max_turns: 8

goal: |
  Create a new scene called "Verse" in the currently-bound project.

assertions:
  - dbQuery:
      sql: "SELECT name FROM scenes WHERE name = ?"
      args: ["Verse"]
      toHaveAtLeast: 1
  - toolCalled: { contains: scene }
  - helpInvoked: null
  - budget: { turns: 6, errors: 2 }

API keys are read from env (OPENAI_API_KEY, ANTHROPIC_API_KEY) — never from scenario YAML.
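
To make the env-only rule concrete, here is a minimal sketch of how a provider name could be mapped to its key. Only the two variable names come from this README; `resolveApiKey`, the `Provider` type, and the error wording are hypothetical.

```typescript
type Provider = "openai" | "anthropic" | "mock";

// Variable names as listed above; the lookup table itself is an assumption.
const KEY_VARIABLES: Record<Exclude<Provider, "mock">, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
};

// Takes the environment as a parameter so it can be exercised without
// touching the real process environment.
function resolveApiKey(
  provider: Provider,
  env: Record<string, string | undefined>,
): string | undefined {
  if (provider === "mock") return undefined; // mock runs need no key

  const variable = KEY_VARIABLES[provider];
  const key = env[variable];
  if (key === undefined || key.trim() === "") {
    throw new Error(`${variable} is not set; export it before running a ${provider} scenario.`);
  }
  return key;
}
```

In Node you would pass `process.env` as the second argument. Failing fast with the variable name in the message keeps a missing key from surfacing later as an opaque provider error.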

Programmatic API

For users dropping scenarios into existing Jest / Vitest suites:

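import { runScenarioFile } from '@errantry/core';

// `workdir` is assumed to be defined by the surrounding test; it is the
// directory the scenario runs in.
const result = await runScenarioFile('scenarios/todo/happy-path-add.yaml', {
  cwd: workdir,
});
// All assertions passed, and the run stayed under the friction threshold.
expect(result.passed).toBe(true);
expect(result.metrics.frictionScore).toBeLessThan(0.3);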

Playwright extension (first-class for Electron-TS apps)

import { test, expect } from '@errantry/playwright';

test('agent creates a scene from --help discovery alone', async ({ errantry, app }) => {
  await app.bindFixture('blank-project');
  const result = await errantry.run({
    surface: 'cli',
    goal: 'Create a new scene called "Verse".',
    maxTurns: 12,
  });
  await expect(result).toolCalled({ contains: 'scene' });
  await expect(app.db).toHaveRow("SELECT name FROM scenes WHERE name = 'Verse'");
  await expect(result).budgetRespected({ turns: 6, errors: 2 });
});

Built-in matchers (Tier 1)

| Matcher | Asserts on |
| --- | --- |
| `dbQuery(sql, args).toHaveRows` / `toHaveAtLeast` / `toMatch` | DB state via the bridge |
| `file(path).toExist` / `.toContain` / `.toMatchRegex` | Filesystem (paths resolve against scenario `cwd`) |
| `audioDuration(path, expectedSeconds, toleranceSeconds)` | Audio length via ffprobe |
| `toolCalled({ contains, matches, tool })` | Trace |
| `helpInvoked` | Agent ran `--help`, `-h`, or `tool_search` |
| `errorRecovered` | Agent hit an error and recovered with a successful retry |
| `budget({ turns, errors, frictionScore })` | Hard budgets, not soft metrics |

Tier 2 (structural, for generative output) and Tier 3 (LLM-judged, for subjective goals) land in later phases.
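
The trace-based matchers are simple predicates over what the agent did. The sketch below shows one way `helpInvoked` and `errorRecovered` could be evaluated. The `TraceEntry` shape and the `todo` subcommands in the example trace are invented for illustration, and the real matchers may apply stricter rules.

```typescript
// Hypothetical trace entry; Errantry's real trace format may differ.
interface TraceEntry {
  tool: string;     // e.g. "bash" or "tool_search"
  command: string;  // the command line or query the agent sent
  exitCode: number; // 0 means the call succeeded
}

// True if the agent ran `--help`, `-h`, or the `tool_search` tool.
function helpInvoked(trace: TraceEntry[]): boolean {
  return trace.some((entry) => {
    if (entry.tool === "tool_search") return true;
    // Match whole tokens so that "-hello" does not count as "-h".
    const tokens = entry.command.split(/\s+/);
    return tokens.includes("--help") || tokens.includes("-h");
  });
}

// True if some failing call is followed by a later successful call.
// Assumption: "recovered" is judged by ordering alone, without checking
// that the later call retries the same operation.
function errorRecovered(trace: TraceEntry[]): boolean {
  const firstFailure = trace.findIndex((entry) => entry.exitCode !== 0);
  if (firstFailure === -1) return false;
  return trace.slice(firstFailure + 1).some((entry) => entry.exitCode === 0);
}

// Invented example: a wrong guess, a help lookup, then a successful call.
const exampleTrace: TraceEntry[] = [
  { tool: "bash", command: "todo create milk", exitCode: 2 },
  { tool: "bash", command: "todo --help", exitCode: 0 },
  { tool: "bash", command: "todo add milk", exitCode: 0 },
];
```

On `exampleTrace`, both predicates return `true`: the agent consulted help, and a failure was followed by a success.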

Metrics

Every run reports turns, toolCalls, helpInvocations, errorsEncountered, errorsRecovered, and a frictionScore ((errors − recovered) / completedSubgoals). Use them to A/B test help-text and error-message changes — "we rewrote the scene_create error and friction dropped from 0.4 to 0.1" is a concrete claim Errantry can substantiate.
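
The friction formula and a hard-budget check can be sketched in a few lines. The formula is the one stated above; the handling of zero completed subgoals, the inclusive budget limits, and every name in the block (`RunMetrics`, `frictionScore`, `budgetViolations`) are assumptions for illustration.

```typescript
interface RunMetrics {
  turns: number;
  errorsEncountered: number;
  errorsRecovered: number;
  completedSubgoals: number; // assumed to be available alongside the other metrics
}

// (errors - recovered) / completedSubgoals, as described above.
function frictionScore(m: RunMetrics): number {
  const unrecovered = m.errorsEncountered - m.errorsRecovered;
  if (m.completedSubgoals === 0) {
    // Assumption: with nothing completed, any unrecovered error is maximal friction.
    return unrecovered > 0 ? Number.POSITIVE_INFINITY : 0;
  }
  return unrecovered / m.completedSubgoals;
}

interface Budget {
  turns?: number;
  errors?: number;
  frictionScore?: number;
}

// Returns one message per exceeded limit; an empty array means the budget held.
// Assumption: limits are inclusive, and `errors` counts all errors encountered.
function budgetViolations(m: RunMetrics, budget: Budget): string[] {
  const violations: string[] = [];
  if (budget.turns !== undefined && m.turns > budget.turns) {
    violations.push(`turns ${m.turns} > ${budget.turns}`);
  }
  if (budget.errors !== undefined && m.errorsEncountered > budget.errors) {
    violations.push(`errors ${m.errorsEncountered} > ${budget.errors}`);
  }
  if (budget.frictionScore !== undefined && frictionScore(m) > budget.frictionScore) {
    violations.push(`frictionScore ${frictionScore(m)} > ${budget.frictionScore}`);
  }
  return violations;
}

// Invented numbers for a before/after comparison of an error-message rewrite.
const before: RunMetrics = { turns: 7, errorsEncountered: 5, errorsRecovered: 1, completedSubgoals: 10 };
const after: RunMetrics = { turns: 4, errorsEncountered: 1, errorsRecovered: 0, completedSubgoals: 10 };
```

With these invented numbers, `frictionScore(before)` is 0.4 and `frictionScore(after)` is 0.1, and only `before` violates a budget of `{ turns: 6, errors: 2 }`.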

Adopting in an Electron-TS app

One guarded call in your main process:

// src/main/index.ts
import { installErrantryBridge } from '@errantry/electron-bridge';

if (process.env.ERRANTRY_TEST === '1') {
  installErrantryBridge({
    db: getSharedDatabase(),
    onFixtureMount: async (name) => mountFixtureProject(name),
    onReset: async () => resetEphemeralState(),
  });
}

Then write scenarios as Playwright tests or YAML. The bridge serves the assertion endpoints that runScenario calls during setup and verification.

Status

| Phase | What | State |
| --- | --- | --- |
| 1 | Playwright + CLI surface, Tier 1 matchers, sas-assistant adoption | shipping |
| 2 | MCP surface, Tier 2 (structural) matchers, `errorRecovered` correlated to remediation, Jest adapter | next |
| 3 | Validate `electron-bridge` against a second app; trace-only mode for pure CLIs | planned |
| 4 | Cross-run diffing, optional Tier 3 (LLM-judge) matcher, HTML reports | planned |

Development

npm install
npx tsc -b packages/core packages/cli packages/electron-bridge packages/playwright
npm test           # 40 tests across 4 packages
npm run typecheck

License

MIT.
