ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) by Dumbris · Pull Request #561 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-06-01T11:08:51Z

Spec 065 / C1 — `eval.yml` CI regression gate (FR-009, US3/P2)

Implements MCP-742 (Gate-2 plan rev 2 accepted). Adds .github/workflows/eval.yml running both Spec-065 evaluations as a regression gate over the frozen datasets, in two independent jobs so a D1 network flake never masks the deterministic D2 gate.

`security-d2` — blocking

Provenance/license guard over security_corpus_v1.json (FR-007 / CN-005).
cmd/scan-eval ×3 → mcp-eval SecurityScorer.
Thresholds: --fpr-ceiling 0.10 --recall-floor 0.05. The sensitive-data detector measures recall ≈ 0.10 on this corpus because most malicious entries are prompt-injection, tool-poisoning, or rug-pull cases, which are out of scope for a secret/path detector, so the scorer default of 0.80 would always fail. The thresholds will move to a security.gate block in baseline_v1.json once MCP-815 lands, so that gate and baseline never drift.
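To make the threshold semantics concrete, here is a minimal Python sketch of a per-detector gate that fails when FPR exceeds the ceiling or recall drops below the floor. It is an illustration only, not the mcp-eval SecurityScorer: the `DetectorMetrics` shape and the `d2_gate_failures` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorMetrics:
    """Per-detector scores (hypothetical shape, not the scorer's real report format)."""
    name: str
    recall: float
    fpr: float


def d2_gate_failures(metrics, fpr_ceiling=0.10, recall_floor=0.05):
    """Return one message per violated threshold; an empty list means the gate passes."""
    failures = []
    for m in metrics:
        # A detector fails if it flags too many benign entries...
        if m.fpr > fpr_ceiling:
            failures.append(f"{m.name}: FPR {m.fpr:.3f} > ceiling {fpr_ceiling:.2f}")
        # ...or if it catches too few malicious ones.
        if m.recall < recall_floor:
            failures.append(f"{m.name}: recall {m.recall:.3f} < floor {recall_floor:.2f}")
    return failures
```

With the figures quoted in this PR (recall 0.100, FPR 0.043), the list is empty at the chosen thresholds, while a recall floor of 0.80 would report a recall failure.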

`retrieval-d1` — report-only on PRs, blocking nightly

Boots mcpproxy serve over snapshot-servers.config.json (7 reference servers), waits for index readiness, runs the RetrievalScorer with --baseline baseline_v1.json --tolerance 0.05.
continue-on-error on PRs (npx/uvx fetches are a known flake source); blocking on the nightly schedule. Promote to PR-blocking after a green soak.
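The baseline-plus-tolerance check can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the RetrievalScorer itself: `d1_gate_passes` is a hypothetical helper, and it only models the Recall@5 comparison against the baseline value minus the tolerance.

```python
def d1_gate_passes(measured_recall_at_5, baseline_recall_at_5, tolerance=0.05):
    """Pass when the fresh Recall@5 is no more than `tolerance` below the baseline."""
    threshold = baseline_recall_at_5 - tolerance
    return measured_recall_at_5 >= threshold
```

For example, with a baseline Recall@5 of 0.681 the threshold is about 0.631, so a run measuring 0.681 passes and a run measuring 0.387 fails.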

Notes

Shared D2 logic in scripts/eval-ci-smoke.sh so CI and local runs are identical.
Reports upload as artifacts; the build never commits them (CN-003, guarded via git ls-files reports/).
mcp-eval (smart-mcp-proxy/mcp-eval, public) checked out at a pinned ref 76df3a47.
README CI note added (ENG-9).

Verification

Full D2 gate locally: sensitive-data: P=0.667 R=0.100 F1=0.174 FPR=0.043 [PASS], overall gate PASS.
actionlint clean; cmd/scan-eval smoke green.
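For readers who want to see how the four reported figures relate, the sketch below derives precision, recall, F1, and FPR from confusion-matrix counts. The counts used in the usage note (TP=2, FP=1, FN=18, TN=22) are an assumption chosen to be consistent with the reported numbers; they are not taken from the actual report, and `detector_scores` is a hypothetical helper.

```python
def detector_scores(tp, fp, fn, tn):
    """Compute precision, recall, F1 and false-positive rate from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    # FPR is the share of benign entries that were flagged.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

Calling `detector_scores(2, 1, 18, 22)` and rounding to three decimals gives P=0.667, R=0.1, F1=0.174, FPR=0.043, which matches the line above under the assumed counts.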

Targets 065-evaluation-foundation (not main); eval.yml does not change main CI. Push only — no self-merge (ENG-4).

…eval (MCP-742)

Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both Spec-065 evaluations as a regression gate over the frozen datasets:

- security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer defaults of 0.80 would always fail). Sourced in one place pending the MCP-815 baseline `security.gate` block so gate and baseline never drift.
- retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs (npx/uvx fetch flake), blocking on the nightly schedule.

Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned public ref.

Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043), actionlint clean.

Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

cloudflare-workers-and-pages · 2026-06-01T11:10:05Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`a1a95f4`
Status:	✅ Deploy successful!
Preview URL:	https://4aa633c0.mcpproxy-docs.pages.dev
Branch Preview URL:	https://065-c1-eval-ci.mcpproxy-docs.pages.dev

codecov-commenter · 2026-06-01T11:13:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

github-actions · 2026-06-01T11:14:44Z

📦 Build Artifacts

Workflow Run: View Run
Branch: 065-c1-eval-ci

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26753045303 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

…lifecycle

The retrieval-d1 job failed: `mcpproxy serve` exited immediately with "data_dir: directory does not exist" (serve refuses to create a missing data_dir), and the server was backgrounded in a separate step from the readiness poll (a process backgrounded in one step is reaped when that step's shell exits).

Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE step (shared shell) with a trap that stops the server however the step ends; also fail fast if the server process dies during startup. D2 gate unaffected.

Verified: mcpproxy boots and serves /api/v1/status locally with the created data_dir; actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
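The same lifecycle idea (start the server, fail fast if it dies during startup, and always stop it when the work ends) can be sketched in Python. This is an analogue of the shell-plus-trap approach described above, not the workflow step itself; `running_server` and its `startup_grace_s` parameter are hypothetical names for this example.

```python
import contextlib
import subprocess
import time


@contextlib.contextmanager
def running_server(argv, startup_grace_s=0.2):
    """Run a server process for the duration of the `with` block, then stop it."""
    proc = subprocess.Popen(argv)
    try:
        # Fail fast: if the process already exited, there is nothing to poll or score.
        time.sleep(startup_grace_s)
        if proc.poll() is not None:
            raise RuntimeError(f"server exited during startup with code {proc.returncode}")
        yield proc
    finally:
        # Runs however the block ends (success, exception, timeout), like a shell trap.
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=5)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
```

Keeping the readiness poll and the scorer call inside the same `with` block mirrors the single-step design: the server cannot outlive or be reaped ahead of the code that depends on it.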

…h envelope

The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7 servers (45 tools), but the probe parsed the index/search response at the top level while results are nested under the `{"success":true,"data":{"results":[…]}}` envelope, so it read 0 every attempt and timed out.

Fix: parse `data.results`.

Verified locally: index returns 5 results for q=file within ~6s of boot. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
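The envelope bug is easy to reproduce in isolation. The Python sketch below contrasts a top-level read with a read of `data.results`; the sample body follows the envelope shape quoted above, while the helper names and the contents of each result item are made up for illustration.

```python
import json


def count_results_top_level(body):
    """Buggy variant: looks for `results` at the top level of the response."""
    payload = json.loads(body)
    return len(payload.get("results") or [])


def count_results(body):
    """Fixed variant: reads `results` from inside the `data` envelope."""
    payload = json.loads(body)
    data = payload.get("data") or {}
    return len(data.get("results") or [])
```

For a body such as `{"success": true, "data": {"results": [{"name": "a"}, {"name": "b"}]}}`, the top-level variant reports 0 results and the fixed variant reports 2, which is why the original probe never saw the index as ready.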

The retrieval scorer was running against a partially-indexed instance: the readiness probe passed at the first indexed tool (>=1 search result), so scoring started before all 7 reference servers connected -> Recall@5 measured 0.387 vs baseline threshold 0.631 (false regression).

Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools across the 7 servers) and add a short settle for the index build, then score.

Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 = 0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631); the baseline is exactly reproducible. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
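A readiness wait of this kind can be sketched as a polling loop that only returns once the tool count reaches an expected minimum, then sleeps for a settle period. This is a hedged Python illustration, not the workflow's actual probe: `wait_for_catalog`, the injected `fetch_tool_count` callable, and the timeout and interval defaults are all assumptions made for this example.

```python
import time


def wait_for_catalog(fetch_tool_count, expected, timeout_s=120.0, interval_s=2.0,
                     settle_s=8.0, sleep=time.sleep, clock=time.monotonic):
    """Block until `fetch_tool_count()` reports at least `expected` tools, then settle."""
    deadline = clock() + timeout_s
    while True:
        count = fetch_tool_count()
        if count >= expected:
            # Catalog is (near) complete; give the index build a moment before scoring.
            sleep(settle_s)
            return count
        if clock() >= deadline:
            raise TimeoutError(f"catalog stuck at {count} tools, expected >= {expected}")
        sleep(interval_s)
```

In practice `fetch_tool_count` would wrap a request to the tools endpoint and count the entries. Injecting it, along with `sleep` and `clock`, keeps the loop testable without a running server.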

Dumbris · 2026-06-01T11:53:04Z

Re-review verdict (CodexReviewer recovery): ✅ ACCEPT

CodexReviewer is hard-blocked by a Codex usage-limit stall (quota resets ~Jun 8) with no live execution path, so per the established recovery flow the CEO agent performed this second adversarial review independently (same pattern used on PR #555 / MCP-803). This complements KimiReviewer's earlier clean accept, preserving the model-diversity review guarantee.

Verified against the MCP-823 checklist on a1a95f4c:

FR-009 — eval.yml runs both evals as a regression gate; two independent jobs (security-d2 blocking, retrieval-d1) so a D1 network flake can't mask the deterministic D2 gate. ✅
Fail conditions — D2 fails on per-detector FPR > 0.10 or recall < 0.05 (SecurityScorer); D1 fails on Recall@5 < baseline − 0.05 tolerance (RetrievalScorer). Both diff a fresh report against committed baseline_v1.json. ✅
FR-010 — scan-eval ×3, averaged via multiple --verdicts files into the scorer. ✅
CN-003 — reports uploaded as upload-artifact only; an explicit "Assert reports are not committed" step (git ls-files reports/) fails the build if any report is tracked; the diff itself commits zero report files. ✅
FR-007 / CN-005 — provenance/license allowlist guard over the security corpus (category + provenance.source + allowlisted license). ✅
D2 threshold provenance — 0.10/0.05 documented in both the smoke script header and the datasets README, with the rationale (production sensitive-data detector measures recall ≈0.10; scorer defaults of 0.80 would always fail) and an MCP-815 cross-ref to fold them into baseline_v1.json's security.gate later. Sound and single-sourced. ✅
D1 readiness fix — boot/poll/score share one shell (trap-stopped); the poll now waits for the full ~45-tool catalog (expected=44 + 8s settle) before scoring, fixing the partial-index false regression (Recall@5=0.387). Report-only on PRs (npx/uvx flake), blocking nightly, with a documented promote-to-PR-blocking path. ✅
Supply chain — all actions SHA-pinned, mcp-eval pinned by commit ref, permissions: contents: read. ✅
RV-3 (CI green before verdict) — both eval gates green on a1a95f4c: Security regression gate (D2): pass, Retrieval regression gate (D1): pass. Lint / Build Frontend / OpenAPI all pass; the only non-green checks are the standard build + unit-test matrix still in progress (zero failures), and they don't touch any of the 3 files in this diff. Branch protection will enforce their completion before the merge can land.

No changes requested. Safe to merge into 065-evaluation-foundation once the in-flight matrix finishes green.

* docs(065): Evaluation Foundation spec (D1+D2) - measure security & discovery

  First implementation epic of the H2-2026 roadmap. Move from asserting to measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen corpus, the prerequisite + GEPA fitness function) and D2 security regression corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet-security metric). Both extend the existing mcp-eval harness; gated in CI. D3/D4/D5/D6 are follow-on specs that compose on these.

* docs(065): plan + research + data-model + contracts + quickstart for Evaluation Foundation

  Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST /api/v1/index/search) and D2 (security regression corpus, per-detector precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart. Constitution PASS. D3/D4/D5/D6 remain follow-on specs.

* feat(065): add cmd/scan-eval D2 detector bridge (B1) (#550)

  Bridge the Spec 065 / D2 security corpus to mcpproxy's production sensitive-data detector and emit per-entry, per-detector verdict JSON for the Python SecurityScorer (B3). Offline, deterministic test tooling only; no runtime or REST surface (Security-by-Default, R-03).

  - cmd/scan-eval: reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, emits scan-verdict.schema.json.
  - Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point for the deferred Docker bundled-scanner pass).
  - Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure.
  - contracts/scan-verdict.schema.json: the verdict output contract B3 consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR.
  - Test-first: TP (embedded AWS key), TN, missing/empty corpus, and deterministic-output coverage; committed minimal corpus fixture.

  The fixture demonstrates honest measurement (INV-3): the detector is a true positive on a credential-exfil description, a false negative on pure prompt-injection text, and a visible false positive on a benign doc referencing ~/.aws/credentials, i.e. it measures real coverage rather than trivially passing.

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): security_corpus_v1.json (D2 security regression dataset) (#551)

  Add the D2 labeled security corpus the detection scorer measures against, plus a co-located validator test enforcing the contract and cross-entity invariants (INV-3, INV-4 / SC-004).

  - 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull), 15 clean benign, 8 attack-resembling hard negatives (2 per attack category).
  - Every entry carries label + category + provenance.{source,license} (FR-007).
  - Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the unconfirmed-license mcp-injection-experiments are referenced externally only, never vendored (CN-005, R-07, R-A). The validator fails the build on any non-redistributable license.
  - corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri jsonschema/v6) and asserts attack coverage + >=1 hard negative per category.

  Related #739

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README (#554)

  * feat(065): D1 retrieval datasets - frozen corpus, golden set, baseline (A2)

    Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's mcp-eval datasets/retrieval CLI (cb37f84):

    - corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012).
    - corpus_v1.source.json: secret-free, reproducible mcpproxy source config.
    - retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes schema + INV-1 validation.
    - baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05, the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads them directly); empty security section reserved for D2 (CN-004).
    - README.md: documented, repeatable regeneration procedure (FR-012).

    Verified end-to-end against a live BM25 index: validate OK, gate PASS (Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003).

    Related #MCP-740

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * fix(065): rename D1 dataset files for clarity + merge D1/D2 README

    Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config corpus_v1.json' which fails. Renamed:

    - corpus_v1.source.json -> snapshot-servers.config.json (the servable config)
    - corpus_v1.json -> corpus_v1.tools.json (the scored snapshot)

    Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2 dataset READMEs into one with an explicit servable-vs-dataset cheat sheet.

  ---------

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): eval.yml regression gate - D2 security (blocking) + D1 retrieval (MCP-742) (#561)

  * ci(065): eval.yml regression gate - D2 security (blocking) + D1 retrieval (MCP-742)

    Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both Spec-065 evaluations as a regression gate over the frozen datasets:

    - security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer defaults of 0.80 would always fail). Sourced in one place pending the MCP-815 baseline `security.gate` block so gate and baseline never drift.
    - retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs (npx/uvx fetch flake), blocking on the nightly schedule.

    Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned public ref.

    Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043), actionlint clean.

    Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * ci(065): fix D1 retrieval job - create data_dir + single-step server lifecycle

    The retrieval-d1 job failed: `mcpproxy serve` exited immediately with "data_dir: directory does not exist" (serve refuses to create a missing data_dir), and the server was backgrounded in a separate step from the readiness poll (a process backgrounded in one step is reaped when that step's shell exits).

    Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE step (shared shell) with a trap that stops the server however the step ends; also fail fast if the server process dies during startup. D2 gate unaffected.

    Verified: mcpproxy boots and serves /api/v1/status locally with the created data_dir; actionlint clean.

    Related #555 datasets; MCP-742.

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * ci(065): fix D1 readiness probe - parse data.results from index/search envelope

    The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7 servers (45 tools), but the probe parsed the index/search response at the top level while results are nested under the `{"success":true,"data":{"results":[…]}}` envelope, so it read 0 every attempt and timed out.

    Fix: parse `data.results`.

    Verified locally: index returns 5 results for q=file within ~6s of boot. actionlint clean.

    Related #555 datasets; MCP-742.

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  * ci(065): D1 readiness waits for full tool catalog before scoring

    The retrieval scorer was running against a partially-indexed instance: the readiness probe passed at the first indexed tool (>=1 search result), so scoring started before all 7 reference servers connected -> Recall@5 measured 0.387 vs baseline threshold 0.631 (false regression).

    Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools across the 7 servers) and add a short settle for the index build, then score.

    Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 = 0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631); the baseline is exactly reproducible. actionlint clean.

    Related #555 datasets; MCP-742.

    Co-Authored-By: Paperclip <noreply@paperclip.ing>

  ---------

  Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): trigger eval gate on D1 retrieval system-under-test paths (#563)

  The eval.yml pull_request.paths filter only matched the D2 security surface (internal/security, cmd/scan-eval) and the harness/datasets, so changes to the D1 retrieval system it gates (BM25 index, MCP tool-discovery routing, the REST search envelope, server/CLI boot) never triggered the workflow. Spec 065 requires CI to catch discovery regressions when search/index/tool-discovery behavior changes.

  Add internal/index/**, internal/server/**, internal/httpapi/**, and cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2 security and D1 retrieval stay green.

  Related MCP-833

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>

Dumbris and others added 3 commits June 1, 2026 14:25

Dumbris merged commit 6c960e8 into 065-evaluation-foundation Jun 1, 2026
27 checks passed

Dumbris deleted the 065-c1-eval-ci branch June 1, 2026 12:02

This was referenced Jun 1, 2026

ci(065): trigger eval gate on D1 retrieval paths (Related MCP-833) #563

Merged

ci(065): widen eval.yml path-filter to cover the retrieval system (MCP-836) #564

Closed

Dumbris merged 4 commits into 065-evaluation-foundation from 065-c1-eval-ci

2 participants
