feat(065): cmd/scan-eval D2 detector bridge (B1) by Dumbris · Pull Request #550 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-05-31T13:29:00Z

Summary

Spec 065 / B1 (MCP-738). Bridges the D2 security corpus to mcpproxy's production sensitive-data detector and emits per-entry, per-detector verdict JSON — the contract the Python SecurityScorer (B3) consumes. Offline, deterministic test tooling only: no new runtime or REST surface (Security-by-Default, R-03).

Gate-2 approved design: plan doc · confirmation 01547a84 accepted.

What's here

cmd/scan-eval/ — reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, writes scan-verdict.schema.json output.
- Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point — the Docker bundled-scanner pass is deferred per Gate 2).
- Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure.
specs/065-evaluation-foundation/contracts/scan-verdict.schema.json — the verdict output contract B3 derives per-detector TP/FP/TN/FN → P/R/F1/FPR from.
cmd/scan-eval/eval_test.go + testdata/security_corpus_min.json — TDD fixtures.

Scope decision (Gate 2)

R-03 names "Detector.Scan + the bundled scanner registry". Recon showed the bundled registry is Docker-based (per-entry container runs, some needing secrets) — too heavy/non-deterministic for the per-PR gate. This PR ships the deterministic in-process detector + the JSON contract (unblocks B3 immediately); the Docker scanners are a documented --scanners extension point deferred to a follow-up.

Honest-measurement evidence (INV-3)

On the committed 4-entry fixture the sensitive-data detector scores TP=1 / FN=1 / TN=1 / FP=1:

malicious + embedded AWS key → flagged (TP)
malicious + pure prompt-injection text → not flagged (FN — the documented limitation this eval exists to measure)
benign weather tool → not flagged (TN)
benign doc mentioning ~/.aws/credentials → flagged (visible FP)

It measures real coverage rather than trivially passing.

Verification

go test ./cmd/scan-eval/... -race — 7/7 pass
go build ./cmd/scan-eval — ok
golangci-lint run --config .github/.golangci.yml ./cmd/scan-eval/... — 0 issues

Related: Spec 065 Evaluation Foundation. Unblocks B3 (Python SecurityScorer, separate repo). Do not self-merge (FR-005 / Gate 3).

Co-Authored-By: Paperclip noreply@paperclip.ing

Bridge the Spec 065 / D2 security corpus to mcpproxy's production sensitive-data detector and emit per-entry, per-detector verdict JSON for the Python SecurityScorer (B3). Offline, deterministic test tooling only — no runtime or REST surface (Security-by-Default, R-03). - cmd/scan-eval: reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, emits scan-verdict.schema.json. - Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point for the deferred Docker bundled-scanner pass). - Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure. - contracts/scan-verdict.schema.json: the verdict output contract B3 consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR. - Test-first: TP (embedded AWS key), TN, missing/empty corpus, and deterministic-output coverage; committed minimal corpus fixture. The fixture demonstrates honest measurement (INV-3): the detector is a true positive on a credential-exfil description, a false negative on pure prompt-injection text, and a visible false positive on a benign doc referencing ~/.aws/credentials — i.e. it measures real coverage rather than trivially passing. Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages · 2026-05-31T13:29:57Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`927408e`
Status:	✅ Deploy successful!
Preview URL:	https://0ebaa3b4.mcpproxy-docs.pages.dev
Branch Preview URL:	https://738-scan-eval-detector-bridg.mcpproxy-docs.pages.dev

View logs

codecov-commenter · 2026-05-31T13:32:16Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 70.78652% with 26 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
cmd/scan-eval/main.go	56.52%	13 Missing and 7 partials ⚠️
cmd/scan-eval/eval.go	86.04%	5 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

github-actions · 2026-05-31T13:34:51Z

📦 Build Artifacts

Workflow Run: View Run
Branch: 738-scan-eval-detector-bridge

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26713941333 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

* docs(065): Evaluation Foundation spec (D1+D2) — measure security & discovery First implementation epic of the H2-2026 roadmap. Move from asserting to measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen corpus, the prerequisite + GEPA fitness function) and D2 security regression corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet- security metric). Both extend the existing mcp-eval harness; gated in CI. D3/D4/D5/D6 are follow-on specs that compose on these. * docs(065): plan + research + data-model + contracts + quickstart for Evaluation Foundation Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST /api/v1/index/search) and D2 (security regression corpus, per-detector precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart. Constitution PASS. D3/D4/D5/D6 remain follow-on specs. * feat(065): add cmd/scan-eval D2 detector bridge (B1) (#550) Bridge the Spec 065 / D2 security corpus to mcpproxy's production sensitive-data detector and emit per-entry, per-detector verdict JSON for the Python SecurityScorer (B3). Offline, deterministic test tooling only — no runtime or REST surface (Security-by-Default, R-03). - cmd/scan-eval: reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, emits scan-verdict.schema.json. - Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point for the deferred Docker bundled-scanner pass). - Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure. - contracts/scan-verdict.schema.json: the verdict output contract B3 consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR. - Test-first: TP (embedded AWS key), TN, missing/empty corpus, and deterministic-output coverage; committed minimal corpus fixture. The fixture demonstrates honest measurement (INV-3): the detector is a true positive on a credential-exfil description, a false negative on pure prompt-injection text, and a visible false positive on a benign doc referencing ~/.aws/credentials — i.e. it measures real coverage rather than trivially passing. Co-authored-by: Paperclip <noreply@paperclip.ing> * feat(065): security_corpus_v1.json (D2 security regression dataset) (#551) Add the D2 labeled security corpus the detection scorer measures against, plus a co-located validator test enforcing the contract and cross-entity invariants (INV-3, INV-4 / SC-004). - 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull), 15 clean benign, 8 attack-resembling hard negatives (2 per attack category). - Every entry carries label + category + provenance.{source,license} (FR-007). - Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the unconfirmed-license mcp-injection-experiments are referenced externally only, never vendored (CN-005, R-07, R-A). The validator fails the build on any non-redistributable license. - corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri jsonschema/v6) and asserts attack coverage + >=1 hard negative per category. Related #739 Co-authored-by: Paperclip <noreply@paperclip.ing> * feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README (#554) * feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2) Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's mcp-eval datasets/retrieval CLI (cb37f84): - corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012). - corpus_v1.source.json: secret-free, reproducible mcpproxy source config. - retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes schema + INV-1 validation. - baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05, the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads them directly); empty security section reserved for D2 (CN-004). - README.md: documented, repeatable regeneration procedure (FR-012). Verified end-to-end against a live BM25 index: validate OK, gate PASS (Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003). Related #MCP-740 Co-Authored-By: Paperclip <noreply@paperclip.ing> * fix(065): rename D1 dataset files for clarity + merge D1/D2 README Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config corpus_v1.json' which fails. Renamed: - corpus_v1.source.json -> snapshot-servers.config.json (the servable config) - corpus_v1.json -> corpus_v1.tools.json (the scored snapshot) Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2 dataset READMEs into one with an explicit servable-vs-dataset cheat sheet. --------- Co-authored-by: Paperclip <noreply@paperclip.ing> * ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) (#561) * ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both Spec-065 evaluations as a regression gate over the frozen datasets: - security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer defaults of 0.80 would always fail). Sourced in one place pending the MCP-815 baseline `security.gate` block so gate and baseline never drift. - retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs (npx/uvx fetch flake), blocking on the nightly schedule. Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043), actionlint clean. Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted). Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle The retrieval-d1 job failed: `mcpproxy serve` exited immediately with "data_dir: directory does not exist" (serve refuses to create a missing data_dir), and the server was backgrounded in a separate step from the readiness poll (a process backgrounded in one step is reaped when that step's shell exits). Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE step (shared shell) with a trap that stops the server however the step ends; also fail fast if the server process dies during startup. D2 gate unaffected. Verified: mcpproxy boots and serves /api/v1/status locally with the created data_dir; actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): fix D1 readiness probe — parse data.results from index/search envelope The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7 servers (45 tools), but the probe parsed the index/search response at the top level while results are nested under the `{"success":true,"data":{"results":[…]}}` envelope, so it read 0 every attempt and timed out. Fix: parse `data.results`. Verified locally — index returns 5 results for q=file within ~6s of boot. actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): D1 readiness waits for full tool catalog before scoring The retrieval scorer was running against a partially-indexed instance: the readiness probe passed at the first indexed tool (>=1 search result), so scoring started before all 7 reference servers connected -> Recall@5 measured 0.387 vs baseline threshold 0.631 (false regression). Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools across the 7 servers) and add a short settle for the index build, then score. Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 = 0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline is exactly reproducible. actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> --------- Co-authored-by: Paperclip <noreply@paperclip.ing> * ci(065): trigger eval gate on D1 retrieval system-under-test paths (#563) The eval.yml pull_request.paths filter only matched the D2 security surface (internal/security, cmd/scan-eval) and the harness/datasets, so changes to the D1 retrieval system it gates — BM25 index, MCP tool- discovery routing, the REST search envelope, server/CLI boot — never triggered the workflow. Spec 065 requires CI to catch discovery regressions when search/index/tool-discovery behavior changes. Add internal/index/**, internal/server/**, internal/httpapi/**, and cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2 security and D1 retrieval stay green. Related MCP-833 --------- Co-authored-by: Paperclip <noreply@paperclip.ing>

Dumbris merged commit 9d51472 into 065-evaluation-foundation May 31, 2026
25 checks passed

Dumbris deleted the 738-scan-eval-detector-bridge branch May 31, 2026 14:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(065): cmd/scan-eval D2 detector bridge (B1)#550

feat(065): cmd/scan-eval D2 detector bridge (B1)#550
Dumbris merged 1 commit into
065-evaluation-foundationfrom
738-scan-eval-detector-bridge

Dumbris commented May 31, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented May 31, 2026

Uh oh!

codecov-commenter commented May 31, 2026

Uh oh!

github-actions Bot commented May 31, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Dumbris commented May 31, 2026

Summary

What's here

Scope decision (Gate 2)

Honest-measurement evidence (INV-3)

Verification

Uh oh!

cloudflare-workers-and-pages Bot commented May 31, 2026

Deploying mcpproxy-docs with Cloudflare Pages

Uh oh!

codecov-commenter commented May 31, 2026

Codecov Report

Uh oh!

github-actions Bot commented May 31, 2026

📦 Build Artifacts

Available Artifacts

How to Download

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants