Release mcptest 1.0.0 · soapbucket/mcptest

[1.0.0] - 2026-06-09

Added

First-class in-code test API in every SDK (no YAML required)
Surface agent run duration and cost in the report (WOR-1202)
Chained derived metrics + suite-level metric rollup (WOR-1198, WOR-1199)
OWASP gap probes SEC-036 unbounded-list, SEC-037 system-prompt-leakage (WOR-1200)
Color the default mcptest run console summary (WOR-1201)
Colorized output via --color (security + pretty report) (WOR-1201)
--models provider/model matrix sweep (WOR-1193)
Vulnerability report (HTML/Markdown) + OWASP LLM Top 10 coverage (WOR-1194)
Comparison-matrix reporter (test x model grid) (WOR-1190)
Named model-graded assertions (WOR-1192)
cel: deterministic assertion predicate (WOR-1191)
Session-ledger module + mcptest ledger diff/emit (WOR-1181/1184)
Canonical session-ledger schema + agent run-record fields (WOR-1180/1182/1183)
Trajectory assertions on the run envelope (WOR-1175)
mcptest mock --preset evil adversarial server (WOR-1176)
DESC-013 flag enum-worthy free-string params (WOR-1177)
Matrix parameterization across models and prompts (WOR-1019)
Render the eval envelope through all nine reporters (WOR-1008)
Conditional when: criteria and per-criterion threshold (WOR-1021)
Calibration anchors per criterion (WOR-1016)
Evidence-required judging (WOR-1015)
Multi-model judge panel with aggregate and tie-break (WOR-1017)
Per-eval judge model, jury, and metered cost (WOR-1007)
Grade a live tool-using agent run as the candidate (WOR-1006)
Configurable score scales and aggregation modes (WOR-1014)
mcptest eval --explain dry-run (WOR-1018)
Per-criterion required and guard gating (WOR-1013)
Built-in rubric presets (WOR-1012)
Reusable named rubrics via rubrics: map and ref (WOR-1011)
Grade evals against rubrics with a real provider (WOR-1020/1009/1010)
Accept structured rubrics in the top-level evals: type (WOR-1005)
WOR-959 second-tier docs, examples, reliability reporting; fix F1 vacuous-perfect regression
WOR-964 distractor-tool scenario pack + fix WOR-962 module doc
WOR-962 name-free discovery + orchestration diagnostics sub-scores
WOR-969 add equal-function-set F1 example scenario
Integrate WOR-961 narrative-trace + WOR-963 trust-boundary checks
WOR-960 objective multi-server tool-selection F1 via equal-function sets (MSC-Bench)
python-sdk and typescript-sdk example dirs round out WOR-951
SDK examples ship a working stdio mock; add MockServerSpec schema
Add WebBotAuth, AwsSigV4Auth, tls.mtls to v1; audit examples
Install paths, Can-I-trust-mcptest README section, CHANGELOG, install.sh
SBOM extraction + cosign keyless + SLSA L3 provenance on every release
mcptest sbom subcommand with build-time embedded CycloneDX BOM
argument_correctness + plan_quality agentic presets (WOR-874)
mcp_task_completion + mcp_use metrics on the eval: block (WOR-862, WOR-863)
Wire the rubric metric into agent runs via the eval: block (WOR-872)
Rubric scoring core - weighted multi-criterion judged metric (WOR-872)
Refresh an expired cached login token at connect (WOR-868)
Secret-hygiene audit + header-free cassette guarantee (WOR-869)
Suite-level auth default with per-server override (WOR-867)
auth.oauth client_credentials at connect for URL servers (WOR-866)
client_credentials token exchange primitive (WOR-866)
Use the cached OAuth token from mcptest login on URL servers (WOR-302)
Recording transport that captures live exchanges (WOR-482)
cassette: server source replays recordings with no network (WOR-482)
Replay transport that serves a recorded cassette (WOR-482)
Wire --wait-for-ready into the run and doctor paths (WOR-370)
Serve resources + prompts so all MCP primitives are testable
Evaluate llm-judge live through the installed provider (WOR-282)
NDJSON + TAP reporters, GitHub Actions annotations, every format from run
Make agent examples runnable end to end
Implement snapshot evaluation + fix contains schema (recovery)
Within-session stability targets (WOR-855)
Spec-version cross-version gate (WOR-857)
Calibration corrected-rate confidence interval (WOR-858)
Tool-description lint targets on tool_quality (WOR-856)
Jury bias signals as assertion targets (WOR-854)
grade_delta check block (WOR-852)
tool_quality check block (WOR-852)
Calibration check block (WOR-852)
Jury analytics as assertion targets (WOR-852)
Carry per-juror token/cache usage on the jury report (WOR-844)
Web-bot-auth directory subcommand publishes the public key (WOR-840 item 4)
Opt-in live auth probes behind --probe (WOR-840 item 3)
SigV4 named-profile resolution from ~/.aws (WOR-840 item 2)
Wire C2 adaptive seed campaign into live redteam (WOR-841)
RSA-PSS signing for Web Bot Auth (WOR-840 item 1)
Live security redteam command driving the C1 corpus (WOR-829)
Un-hide the Layer C1 red-team lane for live use (WOR-829)
Add per-assertion transform (WOR-837)
Add token, tool-call, and time run-wide caps (WOR-830)
Wire context hooks into the runner + schema, docs, example (WOR-836)
Context-aware lifecycle hook ABI in mcptest-core (WOR-836)
Doctor auth pre-flight + run auth-flag matrix (WOR-832)
Expose local cert + key pre-flight helpers (WOR-832)
Add registry usage-stats aggregator (WOR-838)
Run request/response transforms around the tool call (WOR-835)
Transform config + runner + defaultTest merge (WOR-835)
Web Bot Auth Ed25519 message signatures (WOR-497)
AWS SigV4 request signing (WOR-474)
Functional mTLS client identity (WOR-473)
Scaffold transport-level client-auth config wiring (WOR-473, WOR-474, WOR-497)
Suite-composition keys in v1.json (WOR-790)
Lifecycle hooks beforeAll/afterAll/beforeEach/afterEach (WOR-790)
Weighted combined-score model, assert-sets, defaultTest, derivedMetrics (WOR-790)
Wire CLI lanes and opt-in advisory judge behind --model (WOR-731)
Wire deterministic toxic_flow/namespace/integrity lanes into a unified run (WOR-731)
similar via embedding cosine (WOR-825)
is-sql via sqlparser (WOR-825)
is-xml via quick-xml (WOR-825)
Wire conformance invariants into the CLI (WOR-756)
Spec-derived conformance invariants and composition safety (WOR-756)
--resume skips completed record cells (WOR-792)
Run journal for matrix resume (WOR-792)
Wire named matchers + not negation into parser and executor (WOR-788)
Add deterministic matcher impls + MatcherKind variants (WOR-788)
Sampling + roots round-trip corpus rules (WOR-791)
Trajectory match modes + golden-path scoring (WOR-780)
Add OAuth protocol conformance corpus (WOR-781)
Add --fault flag to mcptest mock + recovery docs (WOR-754)
Add unresponsive-server fault injection (WOR-754)
Add deterministic recovery scoring for faulty servers (WOR-754)
Add mcptest generate suite from declared tool schemas (WOR-786)
Synthesize a starter test suite from tool schemas (WOR-786)
Code-mode test harness (WOR-751)
Transport, auth, and local probe analyzers (WOR-737)
Layer B advisory LLM-judge detection (WOR-738)
Layer C2 adaptive attacker (WOR-740)
Layer C1 red-team exploitability oracle (WOR-739)
Notification-stream and server-request corpus assertions (WOR-753)
Scorecard security-posture signals (WOR-719)
External-scanner supplement primitive (WOR-741)
Check structuredContent against declared outputSchema (WOR-749)
Add calibrated confidence band and escalation flag (WOR-722)
Optional per-juror reliability weight, equal by default
Add DESC-009 and DESC-010 tool description checks (WOR-746)
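
The tool-selection F1 entry above (WOR-960) scores an agent against equal-function sets: groups of tools, possibly on different servers, where any member satisfies the same need. The sketch below is one plausible reading of that idea, not mcptest's implementation. The function name `tool_selection_f1`, the tool names, and the choice to score an empty case as 0.0 rather than a perfect 1.0 are all assumptions made for illustration.

```python
def tool_selection_f1(selected, expected_sets):
    """Hypothetical F1 over equal-function sets (not mcptest's actual code).

    selected: iterable of tool names the agent chose.
    expected_sets: list of sets; any tool in a set satisfies that requirement.
    """
    selected = set(selected)
    acceptable = set().union(*expected_sets) if expected_sets else set()

    # Precision: share of selected tools that belong to some equal-function set.
    correct = sum(1 for tool in selected if tool in acceptable)
    precision = correct / len(selected) if selected else 0.0

    # Recall: share of requirements satisfied by at least one selected tool.
    satisfied = sum(1 for group in expected_sets if selected & group)
    recall = satisfied / len(expected_sets) if expected_sets else 0.0

    # Assumption: an empty or all-wrong case scores 0.0, never a vacuous 1.0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two requirements; the agent satisfies the first with an equivalent tool
# and picks one tool that matches no requirement.
score = tool_selection_f1(
    ["github.search", "fs.read"],
    [{"github.search", "gitlab.search"}, {"fs.write"}],
)
print(score)
```

Because either search tool counts for the first requirement, the score does not depend on which server's tool the agent happened to name.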

CI

Bump actions to node24 majors (drop Node.js 20 deprecation warning)
Test the python and node SDKs against the prebuilt binary
Bump actions/cache v4 -> v5 (Node 24) to clear Node.js 20 deprecation (WOR-1209)
Remove publish-schema.yml; the schema is served from mcptest.sh (WOR-1209)
Remove superseded docs.yml (mdBook -> GitHub Pages) (WOR-1209)
Fix remaining gate failures surfaced on macOS/Windows (WOR-1209)
Auto-regenerate llms-full.txt when source docs change
Green the gate after re-enabling Actions (WOR-1209)
Add workflow_dispatch so the build gate can be re-run on demand (WOR-1209)
Add build gate (cross-compile all targets + docker) and fix Dockerfile MSRV (WOR-1209)
Bump actions/checkout from 4 to 6 (#1)

Chore

Remove stale doc-sync prompt and superseded design spec
De-stale multi-server/fixtures/restart-policy docs + notion e2e example
Rebrand to Soap Bucket LLC and curl-install via download.mcptest.sh

Documentation

Add a Native test-framework SDKs page (YAML vs in-code)
Add a dedicated, in-depth CEL matcher example
Link the logo to the website and add a Website nav line
Add a capability-diverse Examples section
Scrub internal Linear/process leak from the public contributor doc
Refresh llms.txt (live /docs URLs, new pages and scenarios)
Add scenarios 8-16, hosted test-server walkthroughs
Point cost control at mcptest eval, not run (WOR-1204)
Catalog the 15 missing example directories (WOR-1203)
Documentation + examples audit pass (accuracy, walkthrough, every-feature coverage)
Highlight the model-comparison matrix, matchers, and security report
Lead the README with the mcptest logo
Document the WOR-1189 SOTA features and an example
Add rubric-features.yml covering the configurable rubric surface
Publish six user-facing pages in SUMMARY
Add CI, crates.io, license, rust, stars, and docs badges
Resolve WOR-999 audit follow-ups (WOR-1000/1001/1002/1003)
Audit pass fixes (reporter count, residue, brand, SUMMARY index)
Wire the new hosted test-server scenarios + a cross-server example
Narrative-vs-trace divergence assertion guide
Document tool-selection F1 via equal-function sets
WOR-967 track MSC-Bench, MCP-Atlas, MCP Pitfall Lab, AttestMCP in research grounding
Fix dangling empty-paren residue in anthropic smoke test doc comment
Point public-facing pages at GitHub issues instead of Linear
Conformance CLI spec (run + refresh subcommands)
Document the agent eval: metrics + example (WOR-871)
Replace em-dashes in references with colons (gate fix)
Capture/escape-hatch for custom auth + ordering decision (WOR-870)
Auth-in-tests design + usage (WOR-865 epic)
De-stale the module comment now that login is wired end to end
Reconcile to shipped reality (cassette server source, multi-server, fixtures)
Rewrite mcptest-mock.md to match the real manifest server
Document resources and prompts test blocks
List the resources/prompts/ping methods the mock now serves
Add real-world suites for filesystem and fetch MCP servers
Lead with a concrete end-to-end walkthrough
State which calibration: fields are required
Verbosity + accuracy pass on the metric guides
Drop stale "no calibration: block" intro (WOR-852)
Design for computed-metric test conditions
Sync docs to the shipped schema + add doc-sync audit prompt
Show the YAML you actually author
Auth-fixtures for mTLS, SigV4, and Web Bot Auth (WOR-840 item 5)
Document run-wide caps in schema and yaml reference (WOR-830)
Document polyglot command hooks (WOR-834)
Unpublish the maintainer-only release-process and research pages
Document the transform step and add a runnable example (WOR-835)
Document the now-live auth schemes (WOR-473, WOR-474, WOR-497)
Broaden positioning to the full test surface
Fix broken links in index.md (WOR-831)
Canonicalize brand URL to soapbucket.com (WOR-831)
Remove ADRs from the public repo and clean references (WOR-831)
Replace fabricated flags with real equivalents (WOR-828)
Correct reporter syntax across all CI-provider snippets (WOR-828)
Remove internal positioning ADR + scorecards GTM sections (WOR-828)
Fix exit codes, command names, reporter syntax, and broken links in troubleshooting (WOR-828)
Fix reporter syntax and fabricated matchers in the CI guide (WOR-828)
Fix matcher, auth, cassette, and exit-code drift in faq and cassette examples (WOR-828)
Fix README sample output and soften uniqueness claim (WOR-828)
Fix matcher, compliance, and exit-code drift in concepts and what-is (WOR-828)
Fix exit codes, init scaffold, and run output drift in getting-started (WOR-828)
Remove internal sales material from public docs (WOR-828)
Document suite-composition primitives (WOR-790)
Document the full lane surface and the advisory boundary (WOR-731)
Document conformance invariants and composition mode (WOR-756)
Doc-hide deferred OfficialBridge surface (WOR-769)
Doc-hide typed-but-unconsumed runner models (WOR-769)
Doc-hide unwired non-surface lanes (WOR-769)
Doc-hide test-only jury dispatch and bias helpers (WOR-769)
Doc-hide deferred SigV4/mTLS/WebBotAuth schemes (WOR-769)
Document WOR-788 matchers in yaml-reference and lib doc (WOR-788)
Note sampling + roots corpus coverage (WOR-791)
Document the five match modes + golden path (WOR-780)
Add within-session stability to the TOC (WOR-759)
Fix multiple-reporter phantom in concepts, getting-started, llms-full (WOR-794)
Enforce missing_docs lint (WOR-774)
Enforce missing_docs and document module surface (WOR-774)
Fix phantom matcher names in agent examples (WOR-770)
Fix phantom matcher names in agents snippet (WOR-770)
Fix matcher/reporter/baseline drift (WOR-770)
Mark pentest-gate Server checks shipped against the SEC engine (WOR-718)
Document confidence band and escalation flag (WOR-722)
Continuous-eval scope (ADR 0043) + scorecard reliability dimensions
Index ADR 0042 in the ADR README
Add ADR 0042 jury reliability weighting and independent scoring
Document the namespace security family (WOR-736)
Document the annotation lint rules DESC-011 and DESC-012 (WOR-748)
Cite Anthropic advanced-tool-use and Cloudflare code-mode (WOR-746 follow-up)
Document DESC-009 and DESC-010 description checks (WOR-746)
Record open-core boundary for the security testing framework
Multi-layer security testing (ADR 0041) + OWASP MCP Top 10 cross-walk
Security testing framework (ADR 0040 + 32-check catalog)
Add May 2026 literature pass (judge robustness, agent reliability, MCP security)
Land MCP security research track (ADR 0039 + pentest/scorecard spec)
Tighten README, replace the v1.0 ships-dump with a concise At a glance
Remove placeholder Performance section from README
Complete the CLI reference (exec, mock, generate, login, prompt, cache)
Document pipe, tools-call chaining, scorers, mcp-server; fix stale paths
Move internal planning/marketing docs to /Users/rick/projects/soapbucket/docs/mcptest
Point brand link at soapbucket.com

Fixed

Drop npm provenance (private repo) and pin cosign exactly
Make crate publishing idempotent and use the sparse index
Publish crates with CARGO_REGISTRY_TOKEN
Make all six workspace crates publishable to crates.io
Vendor SEP corpus in-crate; make SLSA attestation non-fatal
Publish crates in dependency order (core before config)
Use syft tag selector for the SPDX SBOM step
Make the jvm and dotnet SDK examples run against the real binary + CI
Make the go and rust SDK examples run against the real binary + CI
Make the python and node SDK examples run against the real binary
Make discovery + agent fixtures cross-platform robust
De-flake the connect-branch test on Windows; surface all CI failures
Single-quote command-array paths in YAML so Windows paths parse (WOR-1209)
Single-quote the cassette path in YAML so Windows paths parse (WOR-1209)
Gate the bracket-order test's helpers to unix too (WOR-1209)
Gate the sh-based bracket-order test to unix (Windows) (WOR-1209)
De-flake the per-second tick test on loaded CI (WOR-1209)
Gate the test std::io::Write import to unix (Windows -D warnings) (WOR-1209)
Clear remaining cross-platform CI failures (scorer pipe, timing flakes, enum lint) (WOR-1209)
Tolerate broken-pipe stdin write; quiet Windows large_enum_variant (WOR-1209)
Send tool results as user turns so the agent loop works
Close six config foot-guns (WOR-1169..1174)
E2E examples + docs audit fixes (WOR-1114)
Use args, not arguments, in hosted-cross-server
Export env-file and --env/--var values into the process env
Correct the below-floor F1 scenario to actually drop below 50
Resolve WOR-959 integration compile + clippy errors
Correct eval/mod.rs 3-way merge (restore all modules + correct f1 exports)
WOR-966 add doctor decision that flags -32002 missing-resource code and names -32602
Resolve code-review 5-30 findings (WOR-954..958)
Exclude vendored corpus from em-dash lint; fix rustdoc intra-link
Proper RFC 9728 + RFC 8414 OAuth discovery (WOR-301)
Fail fast with a clear auth error on a 4xx instead of timing out
Apply server bearer_token_env on the run/connect path
Stabilize WOR-839 matrix-assert tests under parallel execution
Interpolate ${var} in args and command on the live run path
Clippy lints + RSA-PSS-aware webbot doctor test (WOR-839/WOR-840)
Extract gate_matrix_asserts to run_live_matrix for the size guard (WOR-839)
Fill AgentConfig run-wide caps in redteam (WOR-829 + WOR-830 reconcile)
De-flake stdio transport integration tests (WOR-833)
Resolve intra-doc link now that integrity is no longer doc-hidden (WOR-731)
Low-severity review cleanup roll-up (WOR-778)
Emit schema-valid ToolTest YAML (WOR-823)
Demote private HANG_CAP intra-doc link in mock (WOR-754)
Demote private ECE_BINS intra-doc link in calibration (WOR-760)
Make coverage, update-snapshots, and cache flags honest (WOR-766)
Wire --server-* overrides through to the live run target (WOR-766)
Repair masked gate failures (clippy derivable_impls + 767 contains fallout) (WOR-765)
Repair broken intra-doc links failing cargo doc -D warnings (WOR-765)
Route exact/contains/regex/schema through matchers crate (WOR-767)
Wire the dead cost/consensus early-termination into jury dispatch (WOR-776)
Make pairwise position-swap actually swap candidate order (WOR-776)
Group tools by server and dedup colliding idents (WOR-772)
Make Mistral honor its advertised tool_use (WOR-768)
Unify map_reqwest_error across providers (WOR-768)
Regenerate compliance stats.md for the WOR-750 corpus additions
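
The `${var}` interpolation fix above concerns substituting variables into a server's command and args before launch. As a rough illustration of that kind of substitution, and not the code mcptest ships, the sketch below expands `${NAME}` placeholders from a dictionary. The function name `interpolate`, the accepted variable-name pattern, and the decision to raise on an unknown variable are assumptions.

```python
import re

# Assumption: variable names look like shell identifiers.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate(value, variables):
    """Replace each ${NAME} in value with variables[NAME] (hypothetical helper)."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Assumption: fail loudly instead of leaving the placeholder in place.
            raise KeyError(f"undefined variable: {name}")
        return variables[name]

    return _VAR.sub(substitute, value)


# Expand placeholders in every element of a command-line argument list.
args = ["--root=${ROOT}/data", "--port", "${PORT}"]
print([interpolate(arg, {"ROOT": "/tmp", "PORT": "8080"}) for arg in args])
```

Applying the same function to both the command and each argument keeps the two paths consistent, which is the behavior the changelog entry describes for the live run path.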

Reverted

Drop redundant run-path schema warn (WOR-1170)

Security

Bump vitest to ^4.1.0 to clear the critical Dependabot advisory
Add cross-server namespace checks (WOR-736)
Add the mcptest security CLI subcommand (WOR-734)
Deterministic check engine + tool-surface checks (WOR-732, WOR-735)
Red-team example scenario corpus (WOR-720)
Observable-evidence oracle regression guard (WOR-721)
