feat(065): Evaluation Foundation (D1+D2) — integration to main by Dumbris · Pull Request #562 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-06-01T12:09:47Z

Spec 065 — Evaluation Foundation (D1 + D2)

Integration PR landing the complete Spec 065 surface on main. All work was
built test-first across 6 child issues (MCP-737/738/739/740/741/742) under
spec-064 glass-cockpit gates, each reviewed and merged into the
065-evaluation-foundation integration branch. This PR brings that branch up to
date with main (clean merge, zero conflicts) and proposes it for merge.

Diff vs main is exactly 23 new files (+3389, 0 deletions) — no main work is
reverted.

What ships

D2 — Security regression gate (blocking)
- cmd/scan-eval/ — Go detector bridge that runs internal/security over a
  frozen corpus and emits scan verdicts (eval.go, main.go, eval_test.go).
- specs/065-evaluation-foundation/datasets/security_corpus_v1.json — D2 corpus.
D1 — Retrieval/discovery scoring (report-only on PRs, blocking nightly)
- Frozen retrieval datasets: corpus_v1.tools.json, retrieval_golden_v1.json,
  baseline_v1.json, snapshot-servers.config.json, datasets/README.md,
  datasets/corpus_test.go.
CI — .github/workflows/eval.yml + scripts/eval-ci-smoke.sh
- Two independent jobs: security-d2 (HARD gate, deterministic, Go+Python only)
  and retrieval-d1 (report-only on PRs, blocking on the nightly soak).
- Python scorers live in the public smart-mcp-proxy/mcp-eval repo, pinned at
  MCP_EVAL_REF=76df3a47 (SecurityScorer/B3 merge) for reproducibility.
Spec docs — specs/065-evaluation-foundation/ (spec, plan, research,
data-model, contracts, quickstart, checklist).

Companion repo

The Python scorers (RetrievalScorer + SecurityScorer) shipped separately in
smart-mcp-proxy/mcp-eval (PRs #9, #10 — both merged). This PR pins to that ref.

Local verification

go build ./... — OK (main + 065 compile together)
go test ./cmd/scan-eval/... — ok
go test ./specs/065-evaluation-foundation/datasets/... — ok
eval.yml last ran green on the C1 head (a1a95f4c, run at 11:49).

Gate 3

Per spec-064 doctrine I open this PR and do not merge — the human merges on
GitHub once required checks are green. The blocking gate here is the D2 security
job; D1 is report-only on PRs.

Related: MCP-735, MCP-737, MCP-738, MCP-739, MCP-740, MCP-741, MCP-742

…scovery First implementation epic of the H2-2026 roadmap. Move from asserting to measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen corpus, the prerequisite + GEPA fitness function) and D2 security regression corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet- security metric). Both extend the existing mcp-eval harness; gated in CI. D3/D4/D5/D6 are follow-on specs that compose on these.

…Evaluation Foundation Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST /api/v1/index/search) and D2 (security regression corpus, per-detector precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart. Constitution PASS. D3/D4/D5/D6 remain follow-on specs.

Bridge the Spec 065 / D2 security corpus to mcpproxy's production sensitive-data detector and emit per-entry, per-detector verdict JSON for the Python SecurityScorer (B3). Offline, deterministic test tooling only — no runtime or REST surface (Security-by-Default, R-03). - cmd/scan-eval: reads a security-corpus.schema.json-conforming file, runs each entry.description through security.NewDetector(nil).Scan, echoes ground-truth id/label/category, emits scan-verdict.schema.json. - Flags: --corpus (required), --out (default stdout), --detectors=sensitive-data (default), --scanners (reserved opt-in extension point for the deferred Docker bundled-scanner pass). - Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure. - contracts/scan-verdict.schema.json: the verdict output contract B3 consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR. - Test-first: TP (embedded AWS key), TN, missing/empty corpus, and deterministic-output coverage; committed minimal corpus fixture. The fixture demonstrates honest measurement (INV-3): the detector is a true positive on a credential-exfil description, a false negative on pure prompt-injection text, and a visible false positive on a benign doc referencing ~/.aws/credentials — i.e. it measures real coverage rather than trivially passing. Co-authored-by: Paperclip <noreply@paperclip.ing>

…551) Add the D2 labeled security corpus the detection scorer measures against, plus a co-located validator test enforcing the contract and cross-entity invariants (INV-3, INV-4 / SC-004). - 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull), 15 clean benign, 8 attack-resembling hard negatives (2 per attack category). - Every entry carries label + category + provenance.{source,license} (FR-007). - Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the unconfirmed-license mcp-injection-experiments are referenced externally only, never vendored (CN-005, R-07, R-A). The validator fails the build on any non-redistributable license. - corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri jsonschema/v6) and asserts attack coverage + >=1 hard negative per category. Related #739 Co-authored-by: Paperclip <noreply@paperclip.ing>

… README (#554) * feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2) Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's mcp-eval datasets/retrieval CLI (cb37f84): - corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012). - corpus_v1.source.json: secret-free, reproducible mcpproxy source config. - retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes schema + INV-1 validation. - baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05, the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads them directly); empty security section reserved for D2 (CN-004). - README.md: documented, repeatable regeneration procedure (FR-012). Verified end-to-end against a live BM25 index: validate OK, gate PASS (Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003). Related #MCP-740 Co-Authored-By: Paperclip <noreply@paperclip.ing> * fix(065): rename D1 dataset files for clarity + merge D1/D2 README Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config corpus_v1.json' which fails. Renamed: - corpus_v1.source.json -> snapshot-servers.config.json (the servable config) - corpus_v1.json -> corpus_v1.tools.json (the scored snapshot) Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2 dataset READMEs into one with an explicit servable-vs-dataset cheat sheet. --------- Co-authored-by: Paperclip <noreply@paperclip.ing>

…eval (MCP-742) (#561) * ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both Spec-065 evaluations as a regression gate over the frozen datasets: - security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05 (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer defaults of 0.80 would always fail). Sourced in one place pending the MCP-815 baseline `security.gate` block so gate and baseline never drift. - retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs (npx/uvx fetch flake), blocking on the nightly schedule. Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043), actionlint clean. Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted). Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle The retrieval-d1 job failed: `mcpproxy serve` exited immediately with "data_dir: directory does not exist" (serve refuses to create a missing data_dir), and the server was backgrounded in a separate step from the readiness poll (a process backgrounded in one step is reaped when that step's shell exits). Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE step (shared shell) with a trap that stops the server however the step ends; also fail fast if the server process dies during startup. D2 gate unaffected. Verified: mcpproxy boots and serves /api/v1/status locally with the created data_dir; actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): fix D1 readiness probe — parse data.results from index/search envelope The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7 servers (45 tools), but the probe parsed the index/search response at the top level while results are nested under the `{"success":true,"data":{"results":[…]}}` envelope, so it read 0 every attempt and timed out. Fix: parse `data.results`. Verified locally — index returns 5 results for q=file within ~6s of boot. actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> * ci(065): D1 readiness waits for full tool catalog before scoring The retrieval scorer was running against a partially-indexed instance: the readiness probe passed at the first indexed tool (>=1 search result), so scoring started before all 7 reference servers connected -> Recall@5 measured 0.387 vs baseline threshold 0.631 (false regression). Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools across the 7 servers) and add a short settle for the index build, then score. Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 = 0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline is exactly reproducible. actionlint clean. Related #555 datasets; MCP-742. Co-Authored-By: Paperclip <noreply@paperclip.ing> --------- Co-authored-by: Paperclip <noreply@paperclip.ing>

Bring spec 065 (Evaluation Foundation D1+D2) integration branch up to date with main before opening the integration PR. No conflicts; diff vs main is exactly the 23 new spec-065 files. Co-Authored-By: Paperclip <noreply@paperclip.ing>

codecov-commenter · 2026-06-01T12:13:33Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 70.78652% with 26 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
cmd/scan-eval/main.go	56.52%	13 Missing and 7 partials ⚠️
cmd/scan-eval/eval.go	86.04%	5 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

github-actions · 2026-06-01T12:17:24Z

📦 Build Artifacts

Workflow Run: View Run
Branch: 065-evaluation-foundation

Available Artifacts

archive-darwin-amd64 (28 MB)
archive-darwin-arm64 (25 MB)
archive-linux-amd64 (16 MB)
archive-linux-arm64 (14 MB)
archive-windows-amd64 (27 MB)
archive-windows-arm64 (24 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (21 MB)
installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26763895630 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris · 2026-06-01T12:22:22Z

Critic (Codex) review — Dumbris's PR #562
Verdict: request_changes
Strengths: The PR is scoped to Spec 065 D1/D2 evaluation foundation files, includes the spec/design docs and datasets, and the dedicated Eval jobs for Security regression gate (D2) and Retrieval regression gate (D1) are passing at head 0f11ec4f0909bb5d07c023309b8a62855085140e.
Findings:

Required checks are not all green yet: Integration Tests is still IN_PROGRESS on the E2E Tests workflow, job https://github.com/smart-mcp-proxy/mcpproxy-go/actions/runs/26754023284/job/78850722217. Per the pre-merge gate rule, any pending required check is an automatic request_changes until it completes successfully.
Provenance check: ok

Dumbris · 2026-06-01T12:22:30Z

Critic (Codex) review — Dumbris's PR #562
Verdict: request_changes
Head: 0f11ec4f0909bb5d07c023309b8a62855085140e

Strengths: The PR keeps reports out of git, pins third-party GitHub Actions and the companion mcp-eval ref, and adds focused Go tests for the scan-eval bridge plus dataset schema/invariant checks.

Findings:

cmd/scan-eval does not satisfy the D2 scanner-evaluation scope. The spec requires the security corpus to cover tool poisoning, prompt injection, shadowing, and rug-pull, and requires scoring each security detector/scanner (specs/065-evaluation-foundation/spec.md:83, specs/065-evaluation-foundation/spec.md:84). The repo already has bundled scanners intended for those categories (internal/security/scanner/registry_bundled.go:22, internal/security/scanner/registry_bundled.go:61, internal/security/scanner/registry_bundled.go:79, internal/security/scanner/registry_bundled.go:95), but this PR hard-codes a single sensitive-data detector (cmd/scan-eval/eval.go:11, cmd/scan-eval/eval.go:107) and explicitly treats --scanners as ignored/not implemented (cmd/scan-eval/main.go:42, cmd/scan-eval/main.go:56). I ran the bridge over the committed corpus and it flagged only 2/20 malicious entries, with 0 prompt-injection, 0 shadowing, and 0 rug-pull detections. That makes the D2 gate mostly measure secret/path leakage, not the attack categories the spec says the security evaluation is meant to measure.
The security gate can pass while missing nearly all malicious categories. The script documents that the scorer default recall floor is 0.80 but overrides it to 0.05 because the production sensitive-data detector only reaches about 0.10 recall on this corpus (scripts/eval-ci-smoke.sh:25, scripts/eval-ci-smoke.sh:40). The committed baseline has an empty security section (specs/065-evaluation-foundation/datasets/baseline_v1.json:21, specs/065-evaluation-foundation/datasets/baseline_v1.json:25). Combined with finding 1, this allows the PR to satisfy the CI job while failing the spec's purpose of measuring whether the security scanners actually catch the malicious corpus categories (specs/065-evaluation-foundation/spec.md:40, specs/065-evaluation-foundation/spec.md:48).
D1 does not implement the spec's configurable averaging requirement. The spec requires run-to-run variance to be averaged over configurable N runs and reported as mean/tolerance (specs/065-evaluation-foundation/spec.md:66, specs/065-evaluation-foundation/spec.md:89), but the CI retrieval job hard-codes --runs 1 with no workflow/env knob (.github/workflows/eval.yml:181, .github/workflows/eval.yml:186). D2 has a RUNS env default, but D1 does not, so the regression gate cannot meet FR-010/US3 Enhance configuration management and signal handling in mcpproxy #2 as written.

Checks: gh pr checks 562 --watch=false is currently green for all non-skipped checks.

Provenance check: ok

) The eval.yml pull_request.paths filter only matched the D2 security surface (internal/security, cmd/scan-eval) and the harness/datasets, so changes to the D1 retrieval system it gates — BM25 index, MCP tool- discovery routing, the REST search envelope, server/CLI boot — never triggered the workflow. Spec 065 requires CI to catch discovery regressions when search/index/tool-discovery behavior changes. Add internal/index/**, internal/server/**, internal/httpapi/**, and cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2 security and D1 retrieval stay green. Related MCP-833

cloudflare-workers-and-pages · 2026-06-01T15:14:56Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`f14bcc1`
Status:	✅ Deploy successful!
Preview URL:	https://ca4e0b02.mcpproxy-docs.pages.dev
Branch Preview URL:	https://065-evaluation-foundation.mcpproxy-docs.pages.dev

View logs

Dumbris and others added 7 commits May 31, 2026 14:43

Merge branch 'main' into 065-evaluation-foundation

0f11ec4

Bring spec 065 (Evaluation Foundation D1+D2) integration branch up to date with main before opening the integration PR. No conflicts; diff vs main is exactly the 23 new spec-065 files. Co-Authored-By: Paperclip <noreply@paperclip.ing>

Dumbris mentioned this pull request Jun 1, 2026

ci(065): trigger eval gate on D1 retrieval paths (Related MCP-833) #563

Merged

Dumbris merged commit c2a0117 into main Jun 1, 2026
47 of 48 checks passed

Dumbris deleted the 065-evaluation-foundation branch June 1, 2026 19:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(065): Evaluation Foundation (D1+D2) — integration to main#562

feat(065): Evaluation Foundation (D1+D2) — integration to main#562
Dumbris merged 8 commits into
mainfrom
065-evaluation-foundation

Dumbris commented Jun 1, 2026

Uh oh!

codecov-commenter commented Jun 1, 2026

Uh oh!

github-actions Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Dumbris commented Jun 1, 2026

Spec 065 — Evaluation Foundation (D1 + D2)

What ships

Companion repo

Local verification

Gate 3

Uh oh!

codecov-commenter commented Jun 1, 2026

Codecov Report

Uh oh!

github-actions Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📦 Build Artifacts

Available Artifacts

How to Download

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

Dumbris commented Jun 1, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying mcpproxy-docs with Cloudflare Pages

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 1, 2026 •

edited

Loading

cloudflare-workers-and-pages Bot commented Jun 1, 2026 •

edited

Loading