TOML-based Claude skill test runner (nextest-style)
skill-bench is a Rust-based test runner that executes test cases defined in TOML files. Designed for testing Claude Code skills.
Distributable as a standalone binary - users don't need a Rust environment.
- TOML test definitions: Write tests in TOML without writing Rust code
- nextest-style interface: Parallel execution, filtering, failure history management
- Fast parallel execution: Multi-threaded execution via rayon
- Rich assertions: 18 verification types (skills, MCP, tools, files, databases)
- Embedded assets: Harness plugin embedded in binary for standalone distribution
# Build from source
cargo build --release
cargo install --path .
# Or download the binary and add to PATH
cp target/release/skill-bench ~/.local/bin/skill-bench list# Run all tests (uses default pattern: cases)
skill-bench run
# Run tests from specific directory (automatically finds all .toml files)
skill-bench run "cases"
skill-bench run "cases/concept-interviewing"
# Or use explicit glob pattern
skill-bench run "cases/**/*.toml"
# Filter by test name (regex)
skill-bench run --filter "functional-.*"
# Filter by skill name
skill-bench run --skill investigation-recording
# Rerun only failed tests
skill-bench run --rerun-failed
# Specify parallel threads
skill-bench run --threads 4
# Specify plugin directory (path to directory containing .claude-plugin/)
skill-bench run --plugin-dir ./patent-kit
# Persist Claude session logs to directory
skill-bench run --log logs # Save logs to logs/ directory
skill-bench run -l . # Short form: save to current directory# Display timeline of test execution from log file
skill-bench timeline /path/to/log-file.log
# Show detailed output
skill-bench timeline /path/to/log-file.log --verboseThe pattern argument accepts either a directory or glob pattern:
cases- All TOML files recursively undercases/(directory mode)cases/concept-interviewing- All TOML files in a specific subdirectorycases/**/*.toml- Explicit glob pattern for all TOML files recursivelycases/*/*.toml- TOML files only in immediate subdirectoriescases/functional-*.toml- TOML files matching a specific pattern
When a directory is specified, all .toml files are automatically found recursively.
# List all tests (uses default pattern: cases)
skill-bench list
# List tests from specific directory
skill-bench list "cases/concept-interviewing"skill-bench/
├── Cargo.toml # Rust project config
├── src/ # Source code
│ ├── main.rs # Entry point
│ ├── cli/ # CLI argument definitions
│ ├── models/ # Data models
│ ├── runtime/ # Test discovery & execution
│ ├── assertions/ # Assertion library
│ ├── output/ # Result reporting
│ ├── state/ # Failure history management
│ └── assets/ # Embedded assets
│ └── harness-plugin/
├── cases/ # TOML test cases
│ ├── claim-analyzing/
│ ├── concept-interviewing/
│ └── ...
└── target/ # Build output
name = "test-name"
description = "Test description"
timeout = 120 # seconds
test_prompt = """
Test prompt here...
"""
[[setup]]
name = "setup_name"
type = "file"
path = "test.txt"
content = "File content"
[[checks]]
name = "check_name"
command = "skill-invoked"
skill = "skill-name"
[answers]
"question_key" = "answer_value"Assertions use structured TOML format:
skill-loaded- Skill was loadedskill-invoked- Skill was invokedskill-not-invoked- Skill was NOT invoked
mcp-loaded- MCP server was loadedmcp-tool-invoked- MCP tool was invokedmcp-success- MCP tool succeeded
tool-use- Tool was usedparam- Parameter value verification
file-content- Verify file contentfile-contains- File contains stringworkspace-file- File existsworkspace-dir- Directory exists
output-contains- Output contains stringlog-contains- Log contains patterntext-contains- Text content search
db-query- SQL query result verification- Numeric comparisons:
">0",">=5","=10","<3","<=2"
- Numeric comparisons:
Use deny = true on any assertion for negative verification:
[[checks]]
name = "should-not-contain-error"
command = "file-contains"
file = "output.txt"
contains = "error"
deny = trueFailed test history is saved to .skill-bench/test-history.json. Use --rerun-failed to re-run only tests that failed in the previous run.
MIT