feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2) by Dumbris · Pull Request #552 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-05-31T14:07:56Z

Spec 065 / A2 — D1 retrieval datasets (MCP-740)

Generates and commits the frozen D1 retrieval-evaluation artifacts under specs/065-evaluation-foundation/datasets/, produced by A1's mcp-eval CLI (merged cb37f84, mcp-eval PR #9). Gate 2 design accepted before any code.

Artifacts (committed)

| File | Contents |
|---|---|
| `corpus_v1.json` | Frozen 45-tool snapshot (`GET /api/v1/tools`) over 7 no-auth reference servers: filesystem, git, memory, sqlite, fetch, time, sequential-thinking. Immutable (CN-002); refresh = `corpus_v2` (FR-012). |
| `corpus_v1.source.json` | Secret-free, reproducible mcpproxy source config used to freeze the corpus. |
| `retrieval_golden_v1.json` | 47 graded queries (relevance 0\|1\|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). |
| `baseline_v1.json` | Reference Recall@{1,3,5,10}/MRR/nDCG@10/MAP plus a Recall@5 tolerance of 0.05; this is the CI regression anchor (FR-009). `security` section reserved for D2 (CN-004). |
| `README.md` | Documented, repeatable regeneration procedure (FR-012). |
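
For readers unfamiliar with the metrics stored in `baseline_v1.json`, here is a minimal Python sketch of Recall@k, reciprocal rank, and nDCG@k for a single graded query. It is not the mcp-eval implementation: treating relevance >= 1 as "relevant", using the raw grade as the nDCG gain, and the example ranking itself are all assumptions made for illustration.

```python
import math

def recall_at_k(ranked, grades, k):
    """Fraction of relevant tools (grade >= 1, an assumption) found in the top k."""
    relevant = {tool for tool, grade in grades.items() if grade >= 1}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def reciprocal_rank(ranked, grades):
    """1 / rank of the first relevant tool, or 0.0 if none is returned."""
    for rank, tool in enumerate(ranked, start=1):
        if grades.get(tool, 0) >= 1:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, grades, k):
    """nDCG with linear gain (the grade itself) and a log2 position discount."""
    dcg = sum(grades.get(tool, 0) / math.log2(i + 2)
              for i, tool in enumerate(ranked[:k]))
    ideal = sorted(grades.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(i + 2) for i, grade in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Hypothetical query where a cross-server near-duplicate outranks the right tool.
grades = {"filesystem:search_files": 2, "memory:search_nodes": 0}
ranked = ["memory:search_nodes", "filesystem:search_files", "fetch:fetch"]

print(recall_at_k(ranked, grades, 1))   # 0.0
print(recall_at_k(ranked, grades, 3))   # 1.0
print(reciprocal_rank(ranked, grades))  # 0.5
print(round(ndcg_at_k(ranked, grades, 10), 3))
```

MRR and the dataset-level Recall@k would then be the mean of these per-query values over all 47 queries, assuming plain macro-averaging.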

Invariants / conformance

INV-1 (every golden tool_id ∈ corpus): mcp-eval datasets validate → OK: 47 queries valid against corpus_v1.
Schema conformance: contracts/retrieval-dataset.schema.json ✓.
Hard-negatives flagged in notes with the tempting near-dup labeled relevance 0 (e.g. filesystem:search_files ↔ memory:search_nodes; sqlite:read_query ↔ filesystem:read_text_file; fetch:fetch ↔ filesystem:read_text_file; time:get_current_time ↔ time:convert_time).
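
As an illustration of what INV-1 checks, the sketch below lists every golden `tool_id` that is absent from the corpus. The JSON field names (`tools`, `queries`, `judgments`, `id`) are hypothetical and not taken from `retrieval-dataset.schema.json`; the real check is whatever `mcp-eval datasets validate` implements.

```python
def missing_tool_ids(golden: dict, corpus: dict) -> list[str]:
    """Return 'query_id: tool_id' for every golden judgment not in the corpus."""
    known = {tool["tool_id"] for tool in corpus["tools"]}
    missing = []
    for query in golden["queries"]:
        for judgment in query["judgments"]:
            if judgment["tool_id"] not in known:
                missing.append(f'{query["id"]}: {judgment["tool_id"]}')
    return missing

# Tiny hypothetical datasets; the field layout is an assumption.
corpus = {"tools": [{"tool_id": "filesystem:search_files"},
                    {"tool_id": "memory:search_nodes"}]}
golden_ok = {"queries": [{"id": "q1", "judgments": [
    {"tool_id": "filesystem:search_files", "relevance": 2},
    {"tool_id": "memory:search_nodes", "relevance": 0},  # hard negative
]}]}
golden_bad = {"queries": [{"id": "q2", "judgments": [
    {"tool_id": "git:not_in_corpus", "relevance": 1},    # made-up tool id
]}]}

print(missing_tool_ids(golden_ok, corpus))   # []
print(missing_tool_ids(golden_bad, corpus))  # ['q2: git:not_in_corpus']
```

Note that hard negatives (relevance 0) are checked too, since a near-duplicate that no longer exists in the corpus would silently stop being a hard negative.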

End-to-end verification

Recall@1/3/5/10: 0.418 0.560 0.681 0.791
MRR=0.568  nDCG@10=0.609  MAP=0.544
Gate (recall_at_5): PASS (value=0.681 threshold=0.631)

Score reports are not committed (CN-003).
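
The gate line above follows from simple arithmetic: the threshold is the baseline Recall@5 minus the tolerance (0.681 - 0.05 = 0.631). A minimal sketch of that check, with a hypothetical function name and rounding added only to keep the printed threshold tidy:

```python
def recall_gate(value: float, baseline: float, tolerance: float) -> tuple[bool, float]:
    """Pass when the measured value is no more than `tolerance` below baseline."""
    threshold = round(baseline - tolerance, 3)
    return value >= threshold, threshold

# Numbers from the verification run in this PR.
print(recall_gate(0.681, baseline=0.681, tolerance=0.05))  # (True, 0.631)

# A run that loses more than 0.05 Recall@5 would fail the gate.
print(recall_gate(0.600, baseline=0.681, tolerance=0.05))  # (False, 0.631)
```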

Reference notes

Baseline retrieval metrics are top-level because RetrievalScorer reads baseline["recall_at"]["5"] directly (quickstart §4: --baseline datasets/baseline_v1.json). The D2 security section nests under security.
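
To make the layout concrete, here is a hedged sketch of how such a baseline might look and be read. Only `recall_at` (keyed by string k) and the `security` section come from the notes above; the other key names (`mrr`, `ndcg_at_10`, `map`, `tolerance`) are assumptions, and the values are the ones reported in this PR.

```python
import json

# Hypothetical baseline_v1.json content; key names other than
# "recall_at" and "security" are assumptions for illustration.
BASELINE_JSON = """
{
  "recall_at": {"1": 0.418, "3": 0.560, "5": 0.681, "10": 0.791},
  "mrr": 0.568,
  "ndcg_at_10": 0.609,
  "map": 0.544,
  "tolerance": {"recall_at_5": 0.05},
  "security": {}
}
"""

baseline = json.loads(BASELINE_JSON)

# Retrieval metrics sit at the top level, so the scorer can index them directly.
recall_at_5 = baseline["recall_at"]["5"]
print(recall_at_5)  # 0.681

# D2 metrics would live under their own section, leaving the top level untouched.
print(baseline["security"])  # {}
```

Keeping the D1 metrics top-level means adding D2 content later does not change any path the retrieval scorer already reads.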
Cross-repo generation in ~/repos/mcp-eval authorized by the accepted Gate-1 Option 1 on MCP-735; committed artifacts land in mcpproxy-go (my lane).

Per Gate 3 (FR-005): opening this PR, not merging. Human merges.

Related #MCP-740

…e (A2)

Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's mcp-eval datasets/retrieval CLI (cb37f84):

- corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012).
- corpus_v1.source.json: secret-free, reproducible mcpproxy source config.
- retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes schema + INV-1 validation.
- baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05, the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads them directly); empty security section reserved for D2 (CN-004).
- README.md: documented, repeatable regeneration procedure (FR-012).

Verified end-to-end against a live BM25 index: validate OK, gate PASS (Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003).

Related #MCP-740

Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages · 2026-05-31T14:09:23Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`50db38a`
Status:	✅ Deploy successful!
Preview URL:	https://7bf02efc.mcpproxy-docs.pages.dev
Branch Preview URL:	https://065-a2-retrieval-datasets.mcpproxy-docs.pages.dev

codecov-commenter · 2026-05-31T14:12:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions · 2026-05-31T14:13:41Z

📦 Build Artifacts

Workflow Run: View Run
Branch: 065-a2-retrieval-datasets

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

1. Go to the workflow run page linked above
2. Scroll to the bottom "Artifacts" section
3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26714833115 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

… README (#554)

* feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2)

* fix(065): rename D1 dataset files for clarity + merge D1/D2 README

  Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config corpus_v1.json', which fails.

  Renamed:

  - corpus_v1.source.json -> snapshot-servers.config.json (the servable config)
  - corpus_v1.json -> corpus_v1.tools.json (the scored snapshot)

  Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2 dataset READMEs into one with an explicit servable-vs-dataset cheat sheet.

Co-authored-by: Paperclip <noreply@paperclip.ing>

Dumbris · 2026-05-31T15:17:38Z

Superseded by the renamed branch PR (corpus file rename + merged D1/D2 README). Closing in favor of that.

* docs(065): Evaluation Foundation spec (D1+D2) — measure security & discovery

  First implementation epic of the H2-2026 roadmap. Move from asserting to measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen corpus, the prerequisite + GEPA fitness function) and D2 security regression corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet-security metric). Both extend the existing mcp-eval harness; gated in CI. D3/D4/D5/D6 are follow-on specs that compose on these.

* docs(065): plan + research + data-model + contracts + quickstart for Evaluation Foundation

  Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST /api/v1/index/search) and D2 (security regression corpus, per-detector precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart. Constitution PASS. D3/D4/D5/D6 remain follow-on specs.

* feat(065): add cmd/scan-eval D2 detector bridge (B1) (#550)

  Bridge the Spec 065 / D2 security corpus to mcpproxy's production sensitive-data detector and emit per-entry, per-detector verdict JSON for the Python SecurityScorer (B3). Offline, deterministic test tooling only: no runtime or REST surface (Security-by-Default, R-03).

  - cmd/scan-eval: reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, emits scan-verdict.schema.json.
  - Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point for the deferred Docker bundled-scanner pass).
  - Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure.
  - contracts/scan-verdict.schema.json: the verdict output contract B3 consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR.
  - Test-first: TP (embedded AWS key), TN, missing/empty corpus, and deterministic-output coverage; committed minimal corpus fixture.

  The fixture demonstrates honest measurement (INV-3): the detector is a true positive on a credential-exfil description, a false negative on pure prompt-injection text, and a visible false positive on a benign doc referencing ~/.aws/credentials; i.e. it measures real coverage rather than trivially passing.

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): security_corpus_v1.json (D2 security regression dataset) (#551)

  Add the D2 labeled security corpus the detection scorer measures against, plus a co-located validator test enforcing the contract and cross-entity invariants (INV-3, INV-4 / SC-004).

  - 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull), 15 clean benign, 8 attack-resembling hard negatives (2 per attack category).
  - Every entry carries label + category + provenance.{source,license} (FR-007).
  - Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the unconfirmed-license mcp-injection-experiments are referenced externally only, never vendored (CN-005, R-07, R-A). The validator fails the build on any non-redistributable license.
  - corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri jsonschema/v6) and asserts attack coverage + >=1 hard negative per category.

  Related #739

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README (#554)

  * feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2)
  * fix(065): rename D1 dataset files for clarity + merge D1/D2 README

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) (#561)

  * ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742)

    Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both Spec-065 evaluations as a regression gate over the frozen datasets:

    - security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer defaults of 0.80 would always fail). Sourced in one place pending the MCP-815 baseline `security.gate` block so gate and baseline never drift.
    - retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs (npx/uvx fetch flake), blocking on the nightly schedule.

    Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned public ref.

    Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043), actionlint clean.

    Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle

    The retrieval-d1 job failed: `mcpproxy serve` exited immediately with "data_dir: directory does not exist" (serve refuses to create a missing data_dir), and the server was backgrounded in a separate step from the readiness poll (a process backgrounded in one step is reaped when that step's shell exits).

    Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE step (shared shell) with a trap that stops the server however the step ends; also fail fast if the server process dies during startup. D2 gate unaffected.

    Verified: mcpproxy boots and serves /api/v1/status locally with the created data_dir; actionlint clean.

    Related #555 datasets; MCP-742.

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * ci(065): fix D1 readiness probe — parse data.results from index/search envelope

    The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7 servers (45 tools), but the probe parsed the index/search response at the top level while results are nested under the `{"success":true,"data":{"results":[…]}}` envelope, so it read 0 every attempt and timed out.

    Fix: parse `data.results`. Verified locally: the index returns 5 results for q=file within ~6s of boot. actionlint clean.

    Related #555 datasets; MCP-742.

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * ci(065): D1 readiness waits for full tool catalog before scoring

    The retrieval scorer was running against a partially-indexed instance: the readiness probe passed at the first indexed tool (>=1 search result), so scoring started before all 7 reference servers connected -> Recall@5 measured 0.387 vs baseline threshold 0.631 (false regression).

    Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools across the 7 servers) and add a short settle for the index build, then score.

    Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 = 0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631); the baseline is exactly reproducible. actionlint clean.

    Related #555 datasets; MCP-742.

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): trigger eval gate on D1 retrieval system-under-test paths (#563)

  The eval.yml pull_request.paths filter only matched the D2 security surface (internal/security, cmd/scan-eval) and the harness/datasets, so changes to the D1 retrieval system it gates (BM25 index, MCP tool-discovery routing, the REST search envelope, server/CLI boot) never triggered the workflow. Spec 065 requires CI to catch discovery regressions when search/index/tool-discovery behavior changes.

  Add internal/index/**, internal/server/**, internal/httpapi/**, and cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2 security and D1 retrieval stay green.

  Related MCP-833

Co-authored-by: Paperclip <noreply@paperclip.ing>
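
The two readiness fixes described in the eval.yml commits above can be sketched in Python as follows. This is an illustration only, not the workflow's actual probe: the function names are hypothetical, the probe is passed in as a callable so the example runs without a server, and only the `{"success":true,"data":{"results":[…]}}` envelope and the ~45-tool target come from the commit messages.

```python
import json
import time

def search_result_count(body: str) -> int:
    """Count hits in an index/search response, reading the nested data.results."""
    payload = json.loads(body)
    if not payload.get("success"):
        return 0
    return len((payload.get("data") or {}).get("results") or [])

def wait_until(probe, minimum, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll `probe()` until it reports at least `minimum`, or give up."""
    for _ in range(attempts):
        if probe() >= minimum:
            return True
        sleep(delay)
    return False

# Reading the envelope correctly versus looking for results at the top level.
print(search_result_count('{"success": true, "data": {"results": [{}, {}, {}]}}'))  # 3
print(search_result_count('{"success": true, "results": [{}]}'))                   # 0

# Simulated catalog growth while the reference servers connect: 0 -> 10 -> 45 tools.
counts = iter([0, 10, 45])
print(wait_until(lambda: next(counts), minimum=45, sleep=lambda _: None))  # True
```

Waiting for the full catalog rather than the first hit matters because scoring a partially indexed instance understates recall, as the 0.387 versus 0.631 false regression in the last commit shows.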

Dumbris mentioned this pull request May 31, 2026

feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README #554

Merged

Dumbris closed this May 31, 2026

Dumbris wants to merge 1 commit into 065-evaluation-foundation from 065-a2-retrieval-datasets
