mcptest 1.0.0

github-actions released this 09 Jun 03:58
· 41 commits to main since this release

[1.0.0] - 2026-06-09

Added

  • First-class in-code test API in every SDK (no YAML required)
  • Surface agent run duration and cost in the report (WOR-1202)
  • Chained derived metrics + suite-level metric rollup (WOR-1198, WOR-1199)
  • OWASP gap probes SEC-036 unbounded-list, SEC-037 system-prompt-leakage (WOR-1200)
  • Color the default mcptest run console summary (WOR-1201)
  • Colorized output via --color (security + pretty report) (WOR-1201)
  • --models provider/model matrix sweep (WOR-1193)
  • Vulnerability report (HTML/Markdown) + OWASP LLM Top 10 coverage (WOR-1194)
  • Comparison-matrix reporter (test x model grid) (WOR-1190)
  • Named model-graded assertions (WOR-1192)
  • cel: deterministic assertion predicate (WOR-1191)
  • Session-ledger module + mcptest ledger diff/emit (WOR-1181/1184)
  • Canonical session-ledger schema + agent run-record fields (WOR-1180/1182/1183)
  • Trajectory assertions on the run envelope (WOR-1175)
  • mcptest mock --preset evil adversarial server (WOR-1176)
  • DESC-013 flag enum-worthy free-string params (WOR-1177)
  • Matrix parameterization across models and prompts (WOR-1019)
  • Render the eval envelope through all nine reporters (WOR-1008)
  • Conditional when: criteria and per-criterion threshold (WOR-1021)
  • Calibration anchors per criterion (WOR-1016)
  • Evidence-required judging (WOR-1015)
  • Multi-model judge panel with aggregate and tie-break (WOR-1017)
  • Per-eval judge model, jury, and metered cost (WOR-1007)
  • Grade a live tool-using agent run as the candidate (WOR-1006)
  • Configurable score scales and aggregation modes (WOR-1014)
  • mcptest eval --explain dry-run (WOR-1018)
  • Per-criterion required and guard gating (WOR-1013)
  • Built-in rubric presets (WOR-1012)
  • Reusable named rubrics via rubrics: map and ref (WOR-1011)
  • Grade evals against rubrics with a real provider (WOR-1020/1009/1010)
  • Accept structured rubrics in the top-level evals: type (WOR-1005)
  • WOR-959 second-tier docs, examples, reliability reporting; fix F1 vacuous-perfect regression
  • WOR-964 distractor-tool scenario pack + fix WOR-962 module doc
  • WOR-962 name-free discovery + orchestration diagnostics sub-scores
  • WOR-969 add equal-function-set F1 example scenario
  • Integrate WOR-961 narrative-trace + WOR-963 trust-boundary checks
  • WOR-960 objective multi-server tool-selection F1 via equal-function sets (MSC-Bench)
  • python-sdk and typescript-sdk example dirs round out WOR-951
  • SDK examples ship a working stdio mock; add MockServerSpec schema
  • Add WebBotAuth, AwsSigV4Auth, tls.mtls to v1; audit examples
  • Install paths, Can-I-trust-mcptest README section, CHANGELOG, install.sh
  • SBOM extraction + cosign keyless + SLSA L3 provenance on every release
  • mcptest sbom subcommand with build-time embedded CycloneDX BOM
  • argument_correctness + plan_quality agentic presets (WOR-874)
  • mcp_task_completion + mcp_use metrics on the eval: block (WOR-862, WOR-863)
  • Wire the rubric metric into agent runs via the eval: block (WOR-872)
  • Rubric scoring core - weighted multi-criterion judged metric (WOR-872)
  • Refresh an expired cached login token at connect (WOR-868)
  • Secret-hygiene audit + header-free cassette guarantee (WOR-869)
  • Suite-level auth default with per-server override (WOR-867)
  • auth.oauth client_credentials at connect for URL servers (WOR-866)
  • client_credentials token exchange primitive (WOR-866)
  • Use the cached OAuth token from mcptest login on URL servers (WOR-302)
  • Recording transport that captures live exchanges (WOR-482)
  • cassette: server source replays recordings with no network (WOR-482)
  • Replay transport that serves a recorded cassette (WOR-482)
  • Wire --wait-for-ready into the run and doctor paths (WOR-370)
  • Serve resources + prompts so all MCP primitives are testable
  • Evaluate llm-judge live through the installed provider (WOR-282)
  • NDJSON + TAP reporters, GitHub Actions annotations, every format from run
  • Make agent examples runnable end to end
  • Implement snapshot evaluation + fix contains schema (recovery)
  • Within-session stability targets (WOR-855)
  • Spec-version cross-version gate (WOR-857)
  • Calibration corrected-rate confidence interval (WOR-858)
  • Tool-description lint targets on tool_quality (WOR-856)
  • Jury bias signals as assertion targets (WOR-854)
  • grade_delta check block (WOR-852)
  • tool_quality check block (WOR-852)
  • Calibration check block (WOR-852)
  • Jury analytics as assertion targets (WOR-852)
  • Carry per-juror token/cache usage on the jury report (WOR-844)
  • Web-bot-auth directory subcommand publishes the public key (WOR-840 item 4)
  • Opt-in live auth probes behind --probe (WOR-840 item 3)
  • SigV4 named-profile resolution from ~/.aws (WOR-840 item 2)
  • Wire C2 adaptive seed campaign into live redteam (WOR-841)
  • RSA-PSS signing for Web Bot Auth (WOR-840 item 1)
  • Live security redteam command driving the C1 corpus (WOR-829)
  • Un-hide the Layer C1 red-team lane for live use (WOR-829)
  • Add per-assertion transform (WOR-837)
  • Add token, tool-call, and time run-wide caps (WOR-830)
  • Wire context hooks into the runner + schema, docs, example (WOR-836)
  • Context-aware lifecycle hook ABI in mcptest-core (WOR-836)
  • Doctor auth pre-flight + run auth-flag matrix (WOR-832)
  • Expose local cert + key pre-flight helpers (WOR-832)
  • Add registry usage-stats aggregator (WOR-838)
  • Run request/response transforms around the tool call (WOR-835)
  • Transform config + runner + defaultTest merge (WOR-835)
  • Web Bot Auth Ed25519 message signatures (WOR-497)
  • AWS SigV4 request signing (WOR-474)
  • Functional mTLS client identity (WOR-473)
  • Scaffold transport-level client-auth config wiring (WOR-473, WOR-474, WOR-497)
  • Suite-composition keys in v1.json (WOR-790)
  • Lifecycle hooks beforeAll/afterAll/beforeEach/afterEach (WOR-790)
  • Weighted combined-score model, assert-sets, defaultTest, derivedMetrics (WOR-790)
  • Wire CLI lanes and opt-in advisory judge behind --model (WOR-731)
  • Wire deterministic toxic_flow/namespace/integrity lanes into a unified run (WOR-731)
  • Similar via embedding cosine (WOR-825)
  • is-sql via sqlparser (WOR-825)
  • is-xml via quick-xml (WOR-825)
  • Wire conformance invariants into the CLI (WOR-756)
  • Spec-derived conformance invariants and composition safety (WOR-756)
  • --resume skips completed record cells (WOR-792)
  • Run journal for matrix resume (WOR-792)
  • Wire named matchers + not negation into parser and executor (WOR-788)
  • Add deterministic matcher impls + MatcherKind variants (WOR-788)
  • Sampling + roots round-trip corpus rules (WOR-791)
  • Trajectory match modes + golden-path scoring (WOR-780)
  • Add OAuth protocol conformance corpus (WOR-781)
  • Add --fault flag to mcptest mock + recovery docs (WOR-754)
  • Add unresponsive-server fault injection (WOR-754)
  • Add deterministic recovery scoring for faulty servers (WOR-754)
  • Add mcptest generate suite from declared tool schemas (WOR-786)
  • Synthesize a starter test suite from tool schemas (WOR-786)
  • Code-mode test harness (WOR-751)
  • Transport, auth, and local probe analyzers (WOR-737)
  • Layer B advisory LLM-judge detection (WOR-738)
  • Layer C2 adaptive attacker (WOR-740)
  • Layer C1 red-team exploitability oracle (WOR-739)
  • Notification-stream and server-request corpus assertions (WOR-753)
  • Scorecard security-posture signals (WOR-719)
  • External-scanner supplement primitive (WOR-741)
  • Check structuredContent against declared outputSchema (WOR-749)
  • Add calibrated confidence band and escalation flag (WOR-722)
  • Optional per-juror reliability weight, equal by default
  • Add DESC-009 and DESC-010 tool description checks (WOR-746)
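
The "Similar via embedding cosine (WOR-825)" entry above names a standard technique: two texts count as similar when the cosine of the angle between their embedding vectors clears a threshold. The following is a minimal sketch of that idea only, not mcptest's implementation. The function names, the 0.8 threshold, and the zero-vector handling are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector is treated as similar to nothing.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_similar(a: Sequence[float], b: Sequence[float], threshold: float = 0.8) -> bool:
    """Hypothetical predicate: pass when similarity meets the threshold."""
    return cosine_similarity(a, b) >= threshold
```

Because cosine similarity ignores vector length, `[1, 2, 3]` and `[2, 4, 6]` count as similar, while orthogonal vectors such as `[1, 0]` and `[0, 1]` score zero.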

CI

  • Bump actions to node24 majors (drop Node.js 20 deprecation warning)
  • Test the python and node SDKs against the prebuilt binary
  • Bump actions/cache v4 -> v5 (Node 24) to clear Node.js 20 deprecation (WOR-1209)
  • Remove publish-schema.yml; the schema is served from mcptest.sh (WOR-1209)
  • Remove superseded docs.yml (mdBook -> GitHub Pages) (WOR-1209)
  • Fix remaining gate failures surfaced on macOS/Windows (WOR-1209)
  • Auto-regenerate llms-full.txt when source docs change
  • Green the gate after re-enabling Actions (WOR-1209)
  • Add workflow_dispatch so the build gate can be re-run on demand (WOR-1209)
  • Add build gate (cross-compile all targets + docker) and fix Dockerfile MSRV (WOR-1209)
  • Bump actions/checkout from 4 to 6 (#1)

Chore

  • Remove stale doc-sync prompt and superseded design spec
  • De-stale multi-server/fixtures/restart-policy docs + notion e2e example
  • Rebrand to Soap Bucket LLC and curl-install via download.mcptest.sh

Documentation

  • Add a Native test-framework SDKs page (YAML vs in-code)
  • Add a dedicated, in-depth CEL matcher example
  • Link the logo to the website and add a Website nav line
  • Add a capability-diverse Examples section
  • Scrub internal Linear/process leak from the public contributor doc
  • Refresh llms.txt (live /docs URLs, new pages and scenarios)
  • Add scenarios 8-16, hosted test-server walkthroughs
  • Point cost control at mcptest eval, not run (WOR-1204)
  • Catalog the 15 missing example directories (WOR-1203)
  • Documentation + examples audit pass (accuracy, walkthrough, every-feature coverage)
  • Highlight the model-comparison matrix, matchers, and security report
  • Lead the README with the mcptest logo
  • Document the WOR-1189 SOTA features and an example
  • Add rubric-features.yml covering the configurable rubric surface
  • Publish six user-facing pages in SUMMARY
  • Add CI, crates.io, license, rust, stars, and docs badges
  • Resolve WOR-999 audit follow-ups (WOR-1000/1001/1002/1003)
  • Audit pass fixes (reporter count, residue, brand, SUMMARY index)
  • Wire the new hosted test-server scenarios + a cross-server example
  • Narrative-vs-trace divergence assertion guide
  • Document tool-selection F1 via equal-function sets
  • WOR-967 track MSC-Bench, MCP-Atlas, MCP Pitfall Lab, AttestMCP in research grounding
  • Fix dangling empty-paren residue in anthropic smoke test doc comment
  • Point public-facing pages at GitHub issues instead of Linear
  • Conformance CLI spec (run + refresh subcommands)
  • Document the agent eval: metrics + example (WOR-871)
  • Replace em-dashes in references with colons (gate fix)
  • Capture/escape-hatch for custom auth + ordering decision (WOR-870)
  • Auth-in-tests design + usage (WOR-865 epic)
  • De-stale the module comment now that login is wired end to end
  • Reconcile to shipped reality (cassette server source, multi-server, fixtures)
  • Rewrite mcptest-mock.md to match the real manifest server
  • Document resources and prompts test blocks
  • List the resources/prompts/ping methods the mock now serves
  • Add real-world suites for filesystem and fetch MCP servers
  • Lead with a concrete end-to-end walkthrough
  • State which calibration: fields are required
  • Verbosity + accuracy pass on the metric guides
  • Drop stale "no calibration: block" intro (WOR-852)
  • Design for computed-metric test conditions
  • Sync docs to the shipped schema + add doc-sync audit prompt
  • Show the YAML you actually author
  • Auth-fixtures for mTLS, SigV4, and Web Bot Auth (WOR-840 item 5)
  • Document run-wide caps in schema and yaml reference (WOR-830)
  • Document polyglot command hooks (WOR-834)
  • Unpublish the maintainer-only release-process and research pages
  • Document the transform step and add a runnable example (WOR-835)
  • Document the now-live auth schemes (WOR-473, WOR-474, WOR-497)
  • Broaden positioning to the full test surface
  • Fix broken links in index.md (WOR-831)
  • Canonicalize brand URL to soapbucket.com (WOR-831)
  • Remove ADRs from the public repo and clean references (WOR-831)
  • Replace fabricated flags with real equivalents (WOR-828)
  • Correct reporter syntax across all CI-provider snippets (WOR-828)
  • Remove internal positioning ADR + scorecards GTM sections (WOR-828)
  • Fix exit codes, command names, reporter syntax, and broken links in troubleshooting (WOR-828)
  • Fix reporter syntax and fabricated matchers in the CI guide (WOR-828)
  • Fix matcher, auth, cassette, and exit-code drift in faq and cassette examples (WOR-828)
  • Fix README sample output and soften uniqueness claim (WOR-828)
  • Fix matcher, compliance, and exit-code drift in concepts and what-is (WOR-828)
  • Fix exit codes, init scaffold, and run output drift in getting-started (WOR-828)
  • Remove internal sales material from public docs (WOR-828)
  • Document suite-composition primitives (WOR-790)
  • Document the full lane surface and the advisory boundary (WOR-731)
  • Document conformance invariants and composition mode (WOR-756)
  • Doc-hide deferred OfficialBridge surface (WOR-769)
  • Doc-hide typed-but-unconsumed runner models (WOR-769)
  • Doc-hide unwired non-surface lanes (WOR-769)
  • Doc-hide test-only jury dispatch and bias helpers (WOR-769)
  • Doc-hide deferred SigV4/mTLS/WebBotAuth schemes (WOR-769)
  • Document WOR-788 matchers in yaml-reference and lib doc (WOR-788)
  • Note sampling + roots corpus coverage (WOR-791)
  • Document the five match modes + golden path (WOR-780)
  • Add within-session stability to the TOC (WOR-759)
  • Fix multiple-reporter phantom in concepts, getting-started, llms-full (WOR-794)
  • Enforce missing_docs lint (WOR-774)
  • Enforce missing_docs and document module surface (WOR-774)
  • Fix phantom matcher names in agent examples (WOR-770)
  • Fix phantom matcher names in agents snippet (WOR-770)
  • Fix matcher/reporter/baseline drift (WOR-770)
  • Mark pentest-gate Server checks shipped against the SEC engine (WOR-718)
  • Document confidence band and escalation flag (WOR-722)
  • Continuous-eval scope (ADR 0043) + scorecard reliability dimensions
  • Index ADR 0042 in the ADR README
  • Add ADR 0042 jury reliability weighting and independent scoring
  • Document the namespace security family (WOR-736)
  • Document the annotation lint rules DESC-011 and DESC-012 (WOR-748)
  • Cite Anthropic advanced-tool-use and Cloudflare code-mode (WOR-746 follow-up)
  • Document DESC-009 and DESC-010 description checks (WOR-746)
  • Record open-core boundary for the security testing framework
  • Multi-layer security testing (ADR 0041) + OWASP MCP Top 10 cross-walk
  • Security testing framework (ADR 0040 + 32-check catalog)
  • Add May 2026 literature pass (judge robustness, agent reliability, MCP security)
  • Land MCP security research track (ADR 0039 + pentest/scorecard spec)
  • Tighten README, replace the v1.0 ships-dump with a concise At a glance
  • Remove placeholder Performance section from README
  • Complete the CLI reference (exec, mock, generate, login, prompt, cache)
  • Document pipe, tools-call chaining, scorers, mcp-server; fix stale paths
  • Move internal planning/marketing docs to /Users/rick/projects/soapbucket/docs/mcptest
  • Point brand link at soapbucket.com

Fixed

  • Drop npm provenance (private repo) and pin cosign exactly
  • Make crate publishing idempotent and use the sparse index
  • Publish crates with CARGO_REGISTRY_TOKEN
  • Make all six workspace crates publishable to crates.io
  • Vendor SEP corpus in-crate; make SLSA attestation non-fatal
  • Publish crates in dependency order (core before config)
  • Use syft tag selector for the SPDX SBOM step
  • Make the jvm and dotnet SDK examples run against the real binary + CI
  • Make the go and rust SDK examples run against the real binary + CI
  • Make the python and node SDK examples run against the real binary
  • Make discovery + agent fixtures cross-platform robust
  • De-flake the connect-branch test on Windows; surface all CI failures
  • Single-quote command-array paths in YAML so Windows paths parse (WOR-1209)
  • Single-quote the cassette path in YAML so Windows paths parse (WOR-1209)
  • Gate the bracket-order test's helpers to unix too (WOR-1209)
  • Gate the sh-based bracket-order test to unix (Windows) (WOR-1209)
  • De-flake the per-second tick test on loaded CI (WOR-1209)
  • Gate the test std::io::Write import to unix (Windows -D warnings) (WOR-1209)
  • Clear remaining cross-platform CI failures (scorer pipe, timing flakes, enum lint) (WOR-1209)
  • Tolerate broken-pipe stdin write; quiet Windows large_enum_variant (WOR-1209)
  • Send tool results as user turns so the agent loop works
  • Close six config foot-guns (WOR-1169..1174)
  • E2e examples + docs audit fixes (WOR-1114)
  • Use args, not arguments, in hosted-cross-server
  • Export env-file and --env/--var values into the process env
  • Correct the below-floor F1 scenario to actually drop below 50
  • Resolve WOR-959 integration compile + clippy errors
  • Correct eval/mod.rs 3-way merge (restore all modules + correct f1 exports)
  • WOR-966 add doctor decision that flags -32002 missing-resource code and names -32602
  • Resolve code-review 5-30 findings (WOR-954..958)
  • Exclude vendored corpus from em-dash lint; fix rustdoc intra-link
  • Proper RFC 9728 + RFC 8414 OAuth discovery (WOR-301)
  • Fail fast with a clear auth error on a 4xx instead of timing out
  • Apply server bearer_token_env on the run/connect path
  • Stabilize WOR-839 matrix-assert tests under parallel execution
  • Interpolate ${var} in args and command on the live run path
  • Clippy lints + RSA-PSS-aware webbot doctor test (WOR-839/WOR-840)
  • Extract gate_matrix_asserts to run_live_matrix for the size guard (WOR-839)
  • Fill AgentConfig run-wide caps in redteam (WOR-829 + WOR-830 reconcile)
  • De-flake stdio transport integration tests (WOR-833)
  • Resolve intra-doc link now that integrity is no longer doc-hidden (WOR-731)
  • Low-severity review cleanup roll-up (WOR-778)
  • Emit schema-valid ToolTest YAML (WOR-823)
  • Demote private HANG_CAP intra-doc link in mock (WOR-754)
  • Demote private ECE_BINS intra-doc link in calibration (WOR-760)
  • Make coverage, update-snapshots, and cache flags honest (WOR-766)
  • Wire --server-* overrides through to the live run target (WOR-766)
  • Repair masked gate failures (clippy derivable_impls + 767 contains fallout) (WOR-765)
  • Repair broken intra-doc links failing cargo doc -D warnings (WOR-765)
  • Route exact/contains/regex/schema through matchers crate (WOR-767)
  • Wire the dead cost/consensus early-termination into jury dispatch (WOR-776)
  • Make pairwise position-swap actually swap candidate order (WOR-776)
  • Group tools by server and dedup colliding idents (WOR-772)
  • Make Mistral honor its advertised tool_use (WOR-768)
  • Unify map_reqwest_error across providers (WOR-768)
  • Regenerate compliance stats.md for the WOR-750 corpus additions

Reverted

  • Drop redundant run-path schema warn (WOR-1170)

Security

  • Bump vitest to ^4.1.0 to clear the critical Dependabot advisory
  • Add cross-server namespace checks (WOR-736)
  • Add the mcptest security CLI subcommand (WOR-734)
  • Deterministic check engine + tool-surface checks (WOR-732, WOR-735)
  • Red-team example scenario corpus (WOR-720)
  • Observable-evidence oracle regression guard (WOR-721)