ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) #561

Merged
Dumbris merged 4 commits into
065-evaluation-foundation from
065-c1-eval-ci
Jun 1, 2026

@Dumbris
Member

@Dumbris Dumbris commented Jun 1, 2026

Spec 065 / C1 — eval.yml CI regression gate (FR-009, US3/P2)

Implements MCP-742 (Gate-2 plan rev 2 accepted). Adds .github/workflows/eval.yml running both Spec-065 evaluations as a regression gate over the frozen datasets, in two independent jobs so a D1 network flake never masks the deterministic D2 gate.

security-d2 — blocking

  • Provenance/license guard over security_corpus_v1.json (FR-007 / CN-005).
  • cmd/scan-eval ×3 → mcp-eval SecurityScorer.
  • Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈ 0.10 on this corpus — most malicious entries are prompt-injection / tool-poisoning / rug-pull, out of scope for a secret/path detector; the scorer default 0.80 would always fail). Will move to a security.gate block in baseline_v1.json once MCP-815 lands, so gate and baseline never drift.
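
A minimal sketch of the threshold check described above, for illustration only. It is not the mcp-eval SecurityScorer: the names `Confusion`, `metrics`, and `gate_passes` are hypothetical, and the confusion counts in the usage note are assumed values, not figures taken from the corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Per-detector confusion counts (hypothetical container)."""
    tp: int
    fp: int
    tn: int
    fn: int


def metrics(c: Confusion) -> dict[str, float]:
    """Return precision, recall, F1 and FPR; 0.0 when a denominator is empty."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    fpr = c.fp / (c.fp + c.tn) if (c.fp + c.tn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}


def gate_passes(m: dict[str, float],
                fpr_ceiling: float = 0.10,
                recall_floor: float = 0.05) -> bool:
    """Pass only if FPR stays at or under the ceiling and recall at or above the floor."""
    return m["fpr"] <= fpr_ceiling and m["recall"] >= recall_floor
```

With assumed counts such as `Confusion(tp=2, fp=1, tn=22, fn=18)`, recall is 0.10, so the gate passes at `recall_floor=0.05` but would fail at a floor of 0.80, which is the situation the threshold note describes.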

retrieval-d1 — report-only on PRs, blocking nightly

  • Boots mcpproxy serve over snapshot-servers.config.json (7 reference servers), waits for index readiness, runs the RetrievalScorer with --baseline baseline_v1.json --tolerance 0.05.
  • continue-on-error on PRs (npx/uvx fetches are a known flake source); blocking on the nightly schedule. Promote to PR-blocking after a green soak.
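
The tolerance check and the report-only versus blocking split can be sketched as below. This is an assumption-labelled illustration rather than the RetrievalScorer or the workflow itself: `retrieval_gate` and `job_fails` are hypothetical names, and the only real identifiers are the GitHub event names `pull_request` and `schedule`.

```python
def retrieval_gate(measured: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Pass unless the measured metric falls more than `tolerance` below the baseline."""
    return measured >= baseline - tolerance


def job_fails(gate_ok: bool, event_name: str) -> bool:
    """Model the blocking policy: a failed gate only fails the job on the nightly schedule.

    On pull requests the result is report-only, which is the effect of
    continue-on-error in the workflow.
    """
    blocking = event_name == "schedule"
    return blocking and not gate_ok
```

For example, `job_fails(gate_ok=False, event_name="pull_request")` is `False`, while the same failed gate on `schedule` returns `True`.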

Notes

  • Shared D2 logic in scripts/eval-ci-smoke.sh so CI and local runs are identical.
  • Reports upload as artifacts; the build never commits them (CN-003, guarded via git ls-files reports/).
  • mcp-eval (smart-mcp-proxy/mcp-eval, public) checked out at a pinned ref 76df3a47.
  • README CI note added (ENG-9).
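
The "reports are never committed" guard from the notes above could look roughly like this. The workflow itself uses a shell step; this Python version is only a sketch, and the helper names `parse_ls_files` and `assert_reports_untracked` are hypothetical. The one real command it relies on is `git ls-files reports/`, which lists tracked files under that path.

```python
import subprocess


def parse_ls_files(output: str) -> list[str]:
    """Split `git ls-files` output into a list of tracked paths, ignoring blank lines."""
    return [line for line in output.splitlines() if line.strip()]


def assert_reports_untracked(repo_dir: str = ".") -> None:
    """Raise if any file under reports/ is tracked by git in `repo_dir`."""
    result = subprocess.run(
        ["git", "ls-files", "reports/"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    tracked = parse_ls_files(result.stdout)
    if tracked:
        raise SystemExit(f"reports must not be committed: {tracked}")
```

Keeping the parsing separate from the `git` call makes the check easy to unit-test without a repository.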

Verification

  • Full D2 gate locally: sensitive-data: P=0.667 R=0.100 F1=0.174 FPR=0.043 [PASS], overall gate PASS.
  • actionlint clean; cmd/scan-eval smoke green.

Targets 065-evaluation-foundation (not main); eval.yml does not change main CI. Push only — no self-merge (ENG-4).

…eval (MCP-742)

Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both
Spec-065 evaluations as a regression gate over the frozen datasets:

- security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval
  ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05
  (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer
  defaults of 0.80 would always fail). Sourced in one place pending the MCP-815
  baseline `security.gate` block so gate and baseline never drift.
- retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index
  readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs
  (npx/uvx fetch flake), blocking on the nightly schedule.

Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as
artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned
public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043),
actionlint clean.

Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 1, 2026

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit: a1a95f4
Status: ✅  Deploy successful!
Preview URL: https://4aa633c0.mcpproxy-docs.pages.dev
Branch Preview URL: https://065-c1-eval-ci.mcpproxy-docs.pages.dev

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

@github-actions

github-actions Bot commented Jun 1, 2026

📦 Build Artifacts

Workflow Run: View Run
Branch: 065-c1-eval-ci

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (27 MB)
  • archive-windows-arm64 (24 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26753045303 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris and others added 3 commits June 1, 2026 14:25
…lifecycle

The retrieval-d1 job failed: `mcpproxy serve` exited immediately with
"data_dir: directory does not exist" (serve refuses to create a missing
data_dir), and the server was backgrounded in a separate step from the readiness
poll (a process backgrounded in one step is reaped when that step's shell exits).

Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE
step (shared shell) with a trap that stops the server however the step ends; also
fail fast if the server process dies during startup. D2 gate unaffected.
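
The lifecycle described in this fix (start, poll, work, and guaranteed stop in one scope, failing fast if the server dies) can be illustrated in Python. The workflow does this in a single shell step with a trap; the sketch below is an analogue under that assumption, and `run_with_server` with its parameters is a hypothetical helper.

```python
import subprocess
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_server(cmd: Sequence[str],
                    is_ready: Callable[[], bool],
                    work: Callable[[], T],
                    timeout_s: float = 60.0,
                    poll_s: float = 0.5) -> T:
    """Start `cmd`, wait until `is_ready()` is true, run `work()`, always stop the server."""
    proc = subprocess.Popen(cmd)
    try:
        deadline = time.monotonic() + timeout_s
        while True:
            # Fail fast: a server that exited during startup will never become ready.
            if proc.poll() is not None:
                raise RuntimeError(f"server exited during startup (code {proc.returncode})")
            if is_ready():
                break
            if time.monotonic() >= deadline:
                raise TimeoutError("server did not become ready in time")
            time.sleep(poll_s)
        return work()
    finally:
        # Equivalent of the shell trap: stop the server however this scope ends.
        if proc.poll() is None:
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()
                proc.wait()
```

Because the process handle, the readiness loop, and the cleanup live in the same scope, the server cannot outlive or be reaped independently of the code that depends on it.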

Verified: mcpproxy boots and serves /api/v1/status locally with the created
data_dir; actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…h envelope

The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7
servers (45 tools), but the probe parsed the index/search response at the top
level while results are nested under the `{"success":true,"data":{"results":[…]}}`
envelope, so it read 0 every attempt and timed out.

Fix: parse `data.results`. Verified locally — index returns 5 results for q=file
within ~6s of boot. actionlint clean.
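
A short sketch of the corrected parse, assuming only the envelope shape quoted above. The function name `result_count` is hypothetical, and the probe in the workflow may be written differently.

```python
import json


def result_count(body: str) -> int:
    """Count search results nested under data.results in the response envelope."""
    payload = json.loads(body)
    if not isinstance(payload, dict) or not payload.get("success"):
        return 0
    data = payload.get("data") or {}
    return len(data.get("results") or [])
```

Reading `results` at the top level of the same body would find nothing, which matches the "read 0 every attempt" symptom described in this commit.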

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
The retrieval scorer was running against a partially-indexed instance: the
readiness probe passed at the first indexed tool (>=1 search result), so scoring
started before all 7 reference servers connected -> Recall@5 measured 0.387 vs
baseline threshold 0.631 (false regression).

Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools
across the 7 servers) and add a short settle for the index build, then score.
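
The readiness rule in this fix (wait for the near-full catalog, then settle) might be sketched as follows. `wait_for_catalog` is a hypothetical helper; the caller supplies `count_tools`, which in CI would query /api/v1/tools, and the default poll and timeout values are assumptions. The later review comment in this thread cites `expected=44` with an 8s settle.

```python
import time
from typing import Callable


def wait_for_catalog(count_tools: Callable[[], int],
                     expected: int,
                     settle_s: float = 8.0,
                     timeout_s: float = 120.0,
                     poll_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll until at least `expected` tools are listed, then pause for the index build."""
    deadline = time.monotonic() + timeout_s
    while True:
        n = count_tools()
        if n >= expected:
            # Short settle so the search index can finish building before scoring.
            sleep(settle_s)
            return n
        if time.monotonic() >= deadline:
            raise TimeoutError(f"catalog stuck at {n} tools, expected at least {expected}")
        sleep(poll_s)
```

Gating on the catalog size rather than on the first search hit is what prevents scoring against a partially indexed instance.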

Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 =
0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline
is exactly reproducible. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@Dumbris
Member Author

Dumbris commented Jun 1, 2026

Re-review verdict (CodexReviewer recovery): ✅ ACCEPT

CodexReviewer is hard-blocked by a Codex usage-limit stall (quota resets ~Jun 8) with no live execution path, so per the established recovery flow the CEO agent performed this second adversarial review independently (same pattern used on PR #555 / MCP-803). This complements KimiReviewer's earlier clean accept, preserving the model-diversity review guarantee.

Verified against the MCP-823 checklist on a1a95f4c:

  • FR-009: eval.yml runs both evals as a regression gate; two independent jobs (security-d2 blocking, retrieval-d1) so a D1 network flake can't mask the deterministic D2 gate. ✅
  • Fail conditions — D2 fails on per-detector FPR > 0.10 or recall < 0.05 (SecurityScorer); D1 fails on Recall@5 < baseline − 0.05 tolerance (RetrievalScorer). Both diff a fresh report against committed baseline_v1.json. ✅
  • FR-010: scan-eval ×3, averaged via multiple --verdicts files into the scorer. ✅
  • CN-003 — reports uploaded as upload-artifact only; an explicit "Assert reports are not committed" step (git ls-files reports/) fails the build if any report is tracked; the diff itself commits zero report files. ✅
  • FR-007 / CN-005 — provenance/license allowlist guard over the security corpus (category + provenance.source + allowlisted license). ✅
  • D2 threshold provenance: 0.10/0.05 documented in both the smoke script header and the datasets README, with the rationale (production sensitive-data detector measures recall ≈0.10; scorer defaults of 0.80 would always fail) and an MCP-815 cross-ref to fold them into baseline_v1.json's security.gate later. Sound and single-sourced. ✅
  • D1 readiness fix — boot/poll/score share one shell (trap-stopped); the poll now waits for the full ~45-tool catalog (expected=44 + 8s settle) before scoring, fixing the partial-index false regression (Recall@5=0.387). Report-only on PRs (npx/uvx flake), blocking nightly, with a documented promote-to-PR-blocking path. ✅
  • Supply chain — all actions SHA-pinned, mcp-eval pinned by commit ref, permissions: contents: read. ✅
  • RV-3 (CI green before verdict) — both eval gates green on a1a95f4c: Security regression gate (D2): pass, Retrieval regression gate (D1): pass. Lint / Build Frontend / OpenAPI all pass; the only non-green checks are the standard build + unit-test matrix still in progress (zero failures), and they don't touch any of the 3 files in this diff. Branch protection will enforce their completion before the merge can land.

No changes requested. Safe to merge into 065-evaluation-foundation once the in-flight matrix finishes green.

@Dumbris Dumbris merged commit 6c960e8 into 065-evaluation-foundation Jun 1, 2026
27 checks passed
@Dumbris Dumbris deleted the 065-c1-eval-ci branch June 1, 2026 12:02
Dumbris added a commit that referenced this pull request Jun 1, 2026
* docs(065): Evaluation Foundation spec (D1+D2) — measure security & discovery

First implementation epic of the H2-2026 roadmap. Move from asserting to
measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen
corpus, the prerequisite + GEPA fitness function) and D2 security regression
corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet-
security metric). Both extend the existing mcp-eval harness; gated in CI.
D3/D4/D5/D6 are follow-on specs that compose on these.

* docs(065): plan + research + data-model + contracts + quickstart for Evaluation Foundation

Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST
/api/v1/index/search) and D2 (security regression corpus, per-detector
precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval
harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart.
Constitution PASS. D3/D4/D5/D6 remain follow-on specs.

* feat(065): add cmd/scan-eval D2 detector bridge (B1) (#550)

Bridge the Spec 065 / D2 security corpus to mcpproxy's production
sensitive-data detector and emit per-entry, per-detector verdict JSON
for the Python SecurityScorer (B3). Offline, deterministic test tooling
only — no runtime or REST surface (Security-by-Default, R-03).

- cmd/scan-eval: reads a security-corpus.schema.json-conforming file,
  runs each entry.description through security.NewDetector(nil).Scan,
  echoes ground-truth id/label/category, emits scan-verdict.schema.json.
- Flags: --corpus (required), --out (default stdout),
  --detectors=sensitive-data (default), --scanners (reserved opt-in
  extension point for the deferred Docker bundled-scanner pass).
- Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure.
- contracts/scan-verdict.schema.json: the verdict output contract B3
  consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR.
- Test-first: TP (embedded AWS key), TN, missing/empty corpus, and
  deterministic-output coverage; committed minimal corpus fixture.

The fixture demonstrates honest measurement (INV-3): the detector is a
true positive on a credential-exfil description, a false negative on
pure prompt-injection text, and a visible false positive on a benign
doc referencing ~/.aws/credentials — i.e. it measures real coverage
rather than trivially passing.
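
To illustrate how per-entry verdicts turn into TP/FP/TN/FN, here is a minimal sketch. The field names `label` and `flagged` and the label value `"malicious"` are assumptions made for this example; the actual layout is defined by scan-verdict.schema.json, which is not reproduced here.

```python
from collections import Counter
from typing import Iterable, Mapping


def confusion(verdicts: Iterable[Mapping[str, object]],
              positive_label: str = "malicious") -> dict[str, int]:
    """Tally confusion counts from ground-truth labels and detector verdicts."""
    counts: Counter[str] = Counter()
    for v in verdicts:
        actual = v["label"] == positive_label
        predicted = bool(v["flagged"])
        if actual and predicted:
            counts["tp"] += 1
        elif actual and not predicted:
            counts["fn"] += 1
        elif predicted:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return {key: counts[key] for key in ("tp", "fp", "tn", "fn")}
```

The three fixture cases described above would land in `tp` (credential-exfil description), `fn` (pure prompt-injection text), and `fp` (benign doc mentioning ~/.aws/credentials) respectively.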

Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): security_corpus_v1.json (D2 security regression dataset) (#551)

Add the D2 labeled security corpus the detection scorer measures against,
plus a co-located validator test enforcing the contract and cross-entity
invariants (INV-3, INV-4 / SC-004).

- 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull),
  15 clean benign, 8 attack-resembling hard negatives (2 per attack category).
- Every entry carries label + category + provenance.{source,license} (FR-007).
- Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the
  unconfirmed-license mcp-injection-experiments are referenced externally only,
  never vendored (CN-005, R-07, R-A). The validator fails the build on any
  non-redistributable license.
- corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri
  jsonschema/v6) and asserts attack coverage + >=1 hard negative per category.
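
The provenance/license rule can be sketched as below. The real validator is the Go test corpus_test.go; this Python version is only an illustration, the allowlist contents are assumed (only MIT is named in this commit), and `provenance_violations` is a hypothetical helper. The field names follow the `label + category + provenance.{source,license}` shape listed above.

```python
from typing import Iterable, Mapping

# Assumed allowlist for illustration; the project's actual list may differ.
ALLOWED_LICENSES = frozenset({"MIT"})


def provenance_violations(entries: Iterable[Mapping[str, object]],
                          allowed: frozenset[str] = ALLOWED_LICENSES) -> list[str]:
    """Return one message per entry that lacks required fields or an allowlisted license."""
    problems: list[str] = []
    for index, entry in enumerate(entries):
        entry_id = entry.get("id", f"#{index}")
        provenance = entry.get("provenance") or {}
        reasons = []
        if not entry.get("category"):
            reasons.append("missing category")
        if not provenance.get("source"):
            reasons.append("missing provenance.source")
        if provenance.get("license") not in allowed:
            reasons.append(f"license {provenance.get('license')!r} not allowlisted")
        if reasons:
            problems.append(f"{entry_id}: " + ", ".join(reasons))
    return problems
```

A CI step would fail the build whenever the returned list is non-empty.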

Related #739

Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README (#554)

* feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2)

Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's
mcp-eval datasets/retrieval CLI (cb37f84):

- corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers
  (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via
  GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012).
- corpus_v1.source.json: secret-free, reproducible mcpproxy source config.
- retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server
  hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes
  schema + INV-1 validation.
- baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05,
  the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads
  them directly); empty security section reserved for D2 (CN-004).
- README.md: documented, repeatable regeneration procedure (FR-012).
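
For readers unfamiliar with the metrics named above, here is a minimal sketch of Recall@k and reciprocal rank over graded judgments. It assumes any grade of 1 or 2 counts as relevant, which may not match how mcp-eval treats the 0|1|2 scale, and the function names are hypothetical.

```python
from typing import Mapping, Sequence


def recall_at_k(ranked: Sequence[str], relevance: Mapping[str, int], k: int) -> float:
    """Fraction of relevant tools (grade >= 1) that appear in the top k results."""
    relevant = {tool for tool, grade in relevance.items() if grade >= 1}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevance: Mapping[str, int]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, tool in enumerate(ranked, start=1):
        if relevance.get(tool, 0) >= 1:
            return 1.0 / rank
    return 0.0


def mean_metric(values: Sequence[float]) -> float:
    """Average a per-query metric over the golden set (MRR is the mean reciprocal rank)."""
    return sum(values) / len(values) if values else 0.0
```

Averaging `recall_at_k(..., k=5)` over all golden queries gives the Recall@5 figure that the baseline and its 0.05 tolerance are defined against.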

Verified end-to-end against a live BM25 index: validate OK, gate PASS
(Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003).

Related #MCP-740
Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(065): rename D1 dataset files for clarity + merge D1/D2 README

Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs
corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config
corpus_v1.json' which fails. Renamed:
- corpus_v1.source.json -> snapshot-servers.config.json (the servable config)
- corpus_v1.json        -> corpus_v1.tools.json        (the scored snapshot)
Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2
dataset READMEs into one with an explicit servable-vs-dataset cheat sheet.

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) (#561)

* ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742)

Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both
Spec-065 evaluations as a regression gate over the frozen datasets:

- security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval
  ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05
  (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer
  defaults of 0.80 would always fail). Sourced in one place pending the MCP-815
  baseline `security.gate` block so gate and baseline never drift.
- retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index
  readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs
  (npx/uvx fetch flake), blocking on the nightly schedule.

Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as
artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned
public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043),
actionlint clean.

Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle

The retrieval-d1 job failed: `mcpproxy serve` exited immediately with
"data_dir: directory does not exist" (serve refuses to create a missing
data_dir), and the server was backgrounded in a separate step from the readiness
poll (a process backgrounded in one step is reaped when that step's shell exits).

Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE
step (shared shell) with a trap that stops the server however the step ends; also
fail fast if the server process dies during startup. D2 gate unaffected.

Verified: mcpproxy boots and serves /api/v1/status locally with the created
data_dir; actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): fix D1 readiness probe — parse data.results from index/search envelope

The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7
servers (45 tools), but the probe parsed the index/search response at the top
level while results are nested under the `{"success":true,"data":{"results":[…]}}`
envelope, so it read 0 every attempt and timed out.

Fix: parse `data.results`. Verified locally — index returns 5 results for q=file
within ~6s of boot. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): D1 readiness waits for full tool catalog before scoring

The retrieval scorer was running against a partially-indexed instance: the
readiness probe passed at the first indexed tool (>=1 search result), so scoring
started before all 7 reference servers connected -> Recall@5 measured 0.387 vs
baseline threshold 0.631 (false regression).

Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools
across the 7 servers) and add a short settle for the index build, then score.

Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 =
0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline
is exactly reproducible. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): trigger eval gate on D1 retrieval system-under-test paths (#563)

The eval.yml pull_request.paths filter only matched the D2 security
surface (internal/security, cmd/scan-eval) and the harness/datasets, so
changes to the D1 retrieval system it gates — BM25 index, MCP tool-
discovery routing, the REST search envelope, server/CLI boot — never
triggered the workflow. Spec 065 requires CI to catch discovery
regressions when search/index/tool-discovery behavior changes.

Add internal/index/**, internal/server/**, internal/httpapi/**, and
cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2
security and D1 retrieval stay green.

Related MCP-833

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>