Skip to content

feat(070): registry easy upstream-add — keystone core op + CLI MVP (US2)#555

Merged
Dumbris merged 23 commits into
mainfrom
070-registry-easy-upstream-add
Jun 1, 2026
Merged

feat(070): registry easy upstream-add — keystone core op + CLI MVP (US2)#555
Dumbris merged 23 commits into
mainfrom
070-registry-easy-upstream-add

Conversation

@Dumbris
Copy link
Copy Markdown
Member

@Dumbris Dumbris commented May 31, 2026

Spec 070 — Registry: make discovery actual & easy to add to upstream

Two phases of spec 070, opened for review under the Glass Cockpit gates (agents open PRs, the board merges — do not self-merge).

Phase 2 — Keystone core op (foundational, already on branch)

server.AddServerFromRegistry — the single backend op every surface funnels through: resolve a registry reference (registryId+serverId+overrides) → re-derive a validated config.ServerConfig server-side → persist quarantined. The client never sends a config blob, so a compromised/buggy client cannot smuggle command/args or disable quarantine (CN-001 / CN-002 / decision D1). Stable cross-surface error codes: registry_not_found, server_not_found, no_install_info, missing_required_input, duplicate_name.

Phase 3 — CLI MVP (US2) 🎯 — this change

Closes the genuine net-new gap: discover→add on the CLI.

mcpproxy registry list                       # available registries
mcpproxy registry search <q> -r <registry>   # find a server (ID column feeds add)
mcpproxy registry add <registry> <serverId>  # add it (quarantined)
mcpproxy upstream approve <name>             # approve once trusted
  • cmd/mcpproxy/registry_cmd.goregistry list|search|add Cobra group. list/search are daemon-first with an in-process fallback (work with or without a running daemon); add requires the daemon (keystone op is server-side). Legacy search-servers retained unchanged as a back-compat alias.
  • internal/cliclientListRegistries, SearchRegistry, AddFromRegistry + client-side RegistryAddError (carries the stable code + missing-input names).
  • internal/httpapiPOST /api/v1/registries/{id}/servers/{serverId}/add → new AddServerFromRegistryRef controller method; code→HTTP status mapping (404 / 400). OpenAPI regenerated.
  • internal/serverAddServerFromRegistryRef adapter projecting failures onto contracts.RegistryAddError.
  • internal/contractsAddFromRegistryRequest / AddFromRegistryData / AddedServerSummary / RegistryAddError.

Tests / verification

  • CLI e2e against a running daemon + in-process mock registry: list → search → add → assert quarantined in upstream list, plus the missing_required_input actionable-error path (names the --env key).
  • cliclient httptest units; server adapter units; keystone units (unchanged, still green).
  • go build personal + server editions; golangci-lint 0 issues; go test ./internal/server -short green.

Decisions worth a reviewer's eye

  1. REST add route pulled forward from T014. The CLI is a thin REST client; the Phase-3 e2e ("add via running daemon") cannot pass without a server endpoint. The route is squarely in the backend lane; the frontend wiring (T015/T016, US1) remains.
  2. Discovered config-merge gotcha (out of scope, flagged for US4/T018): a custom registry in config that omits protocol inherits custom/pulse from the default registry list via viper's index-based slice merge, so its servers get parsed with the wrong parser. Test configs set an explicit protocol to avoid it. T018 (merge defaults∪config by ID) should fix the root cause.

Related #746

Dumbris and others added 6 commits May 31, 2026 16:57
… spec

Close the search-to-add loop + reach surface parity across Web UI, MCP, CLI
(all share the unified AddUpstreamServer path, quarantine-by-default). Real
gaps from research: CLI has no registry/search/add commands; Web UI Repositories
searches but can't one-click-add; registry list hardcoded/rebuild-only; key-less
registries error. Adds unified add-from-registry-result, CLI parity, Web UI
loop, MCP convenience, config-driven registry list + cache refresh, and a
cross-surface consistency regression. Roadmap PILLAR A. Paperclip goal da399902.
…easy upstream-add

Phase 0/1/2 speckit design artifacts for spec 070. Research found the
search->add loop is already partially closed (Web UI Add button, CLI
search/list) but via three divergent normalizations; keystone is one
backend AddServerFromRegistry core op all surfaces call, guarded by a
cross-surface consistency regression (CN-004/FR-010).

No implementation yet — submitted for the per-spec design gate (Gate 2).

Related #746
Single backend op that turns a registry reference (registryId + serverId +
overrides) into a validated, quarantined upstream server. Every surface
(REST/MCP/CLI) will funnel through this one normalization (CN-001); identical
input yields an identical persisted config (CN-004). The client sends a
reference, never a config blob — the server re-derives it (security D1), so a
buggy/compromised client cannot smuggle command/args or disable quarantine.

- registries.ServerEntry gains RequiredInputs ([]RequiredInput)
- registries.FindServerByID + sentinel errors (registry/server not found)
- registries.DetectRequiredInputs: explicit + ${VAR}/$VAR heuristic scan
- server.AddServerFromRegistry orchestrator + pure buildServerConfigFromEntry
- quarantine-by-default (CN-002, never client-overridable); refusals on
  no_install_info / missing_required_input / duplicate_name
- RegistryAddErrorCode maps errors to stable cross-surface codes

Phase 1+2 of MCP-746. Tests: go test ./internal/server/ -run TestAddFromRegistry
-race (green); registries input/lookup unit tests; build + linter clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Repoint the registry "Add to MCP" button at the backend keystone instead
of parsing install_cmd client-side (CN-001 / issue #483 class of bug):

- T015: new api.addServerFromRegistry(registryId, serverId, {name,env,enabled})
  POSTs to /api/v1/registries/{id}/servers/{serverId}/add and surfaces the
  structured cross-surface error (code + missing_inputs). Removes the old
  addServerFromRepository install_cmd.split path.
- T016: required-input prompt dialog in Repositories.vue — opens on a
  missing_required_input response, renders one field per missing input
  (rich decl from search required_inputs[] when present, else bare names),
  blocks Add until all are filled, resubmits as env. data-test hooks added
  across the search → add flow.
- T017: e2e/playwright/registry-add.spec.ts (search→Add→quarantined +
  prompt path); pending the live REST route to run green.
- types: RequiredInput + RepositoryServer.required_inputs[].

Verified: 5 vitest unit tests green, vue-tsc clean, vite build green.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Phase 3 of spec 070 — close the discover→add loop on the CLI, the genuine
net-new gap. After this a user can:
  mcpproxy registry list → registry search <q> -r <reg> → registry add <reg> <id>
and the server appears quarantined in `mcpproxy upstream list`.

- cmd/mcpproxy/registry_cmd.go: `registry list|search|add` Cobra group.
  list/search are daemon-first with an in-process fallback; add requires the
  daemon because the keystone op (config derivation + quarantine) is
  server-side (CN-001 / decision D1). Legacy `search-servers` retained.
- internal/cliclient: ListRegistries, SearchRegistry, AddFromRegistry +
  client-side RegistryAddError carrying the stable code + missing inputs.
- internal/httpapi: POST /api/v1/registries/{id}/servers/{serverId}/add wired
  to the keystone via a new AddServerFromRegistryRef controller method;
  code→HTTP status mapping (404 not-found, 400 derivation/dup/missing-input).
- internal/server: AddServerFromRegistryRef adapter projecting failures onto
  the cross-surface contracts.RegistryAddError (names missing --env keys).
- internal/contracts: AddFromRegistryRequest/Data/AddedServerSummary +
  RegistryAddError.

Tests: cliclient httptest units; server adapter units; CLI e2e against a
running daemon + mock registry (list→search→add→quarantined) and the
missing-required-input actionable-error path.

Note: the REST add route was pulled forward from T014 because the CLI is a
thin REST client and the e2e ("add via running daemon") cannot pass without a
server endpoint; the frontend wiring (T015/T016) remains for US1.

Related #746

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented May 31, 2026

Deploying mcpproxy-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: d24c577
Status: ✅  Deploy successful!
Preview URL: https://47dd4952.mcpproxy-docs.pages.dev
Branch Preview URL: https://070-registry-easy-upstream-a.mcpproxy-docs.pages.dev

View logs

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 31, 2026

Dumbris and others added 4 commits May 31, 2026 20:12
…est registry

T017 follow-up after running the e2e green against a live server:
- nav to /ui/repositories (history-mode base /ui/), not a hash route
- default the no-input happy path to docker-mcp-catalog (reliably addable)
- target the required-input server (default fleur/stripe) independently of
  the no-input registry, so both tests run from one invocation

Verified: both tests pass — search→Add→quarantined and
search→Add→prompt→quarantined — against a fresh throwaway core.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Adds operation=add_from_registry to the upstream_servers MCP tool so agents
can add an upstream from a registry reference (registry + id) without
hand-constructing command/args/url. Dispatches to the keystone
AddServerFromRegistryRef op: the server re-derives the runnable config from
the registry entry and persists it quarantined (CN-001 / CN-002 / security
decision D1).

New params: registry, id (env_json/name/enabled reuse the existing fields).
Success returns the slim AddedServerSummary; failure returns a structured
error (isError=true) carrying the same stable Code as REST/CLI plus
missing_inputs for missing_required_input (FR-003), so the same failure reads
identically across every surface.

T012/T013 of spec 070 Phase 4. Tests: internal/server/mcp_add_from_registry_test.go
(happy path + missing_required_input structured error).
The prior commit added only the upstream_servers enum/param surface; the
dispatch case and handler were missing and the test referenced undefined
helpers, so the package did not compile. This wires it up end-to-end:

- handleAddServerFromRegistry dispatches to the keystone
  mainServer.AddServerFromRegistryRef; success -> AddedServerSummary,
  failure -> structured error (code + missing_inputs, isError=true)
  matching the REST/CLI surfaces (CN-001).
- add_from_registry added to the dispatch switch and the agent-token
  block list (write-op parity).
- Tests use a real runtime-backed mainServer plus an httptest registry
  (SetRegistriesFromConfig): happy path asserts a quarantined stdio
  config, missing-input path asserts the structured missing_required_input
  error with the offending input names.

Verified: go test ./internal/server (incl. the two new tests), go build
./..., golangci-lint run ./internal/server/... all pass.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 31, 2026

📦 Build Artifacts

Workflow Run: View Run
Branch: 070-registry-easy-upstream-add

Available Artifacts

  • archive-darwin-amd64 (28 MB)
  • archive-darwin-arm64 (25 MB)
  • archive-linux-amd64 (16 MB)
  • archive-linux-arm64 (14 MB)
  • archive-windows-amd64 (27 MB)
  • archive-windows-arm64 (24 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (21 MB)
  • installer-dmg-darwin-arm64 (19 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26742088591 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris added 6 commits May 31, 2026 20:36
…-006)

SetRegistriesFromConfig previously replaced the registry list wholesale, so
a user who added one custom registry in their config dropped all built-in
defaults. Change it to merge defaults union config keyed by ID: defaults
first, custom appended, colliding IDs override in place. Extract the five
built-in registries into config.DefaultRegistries() as the single source of
truth shared by DefaultConfig() and the merge. Add RequiresKey to both
RegistryEntry types (plumbing for FR-008).

Related #765
A registry marked RequiresKey is skipped via ErrRegistryKeyMissing when no
API key is configured (env MCPPROXY_REGISTRY_<ID>_API_KEY), instead of
running a doomed fetch. SearchServers short-circuits before the network call;
FindServerByID inherits the behaviour. Calling surfaces map the sentinel to
an 'unavailable' marker so the overall search still succeeds.

Related #765
…ger (US4 FR-007)

Primitives for registry cache freshness: Invalidate drops a key, Refresh
aliases it (lazy re-fetch), InvalidatePrefix drops all keys under a stable
prefix (one refresh clears every tag/query variant of a registry), and Peek
returns a record without evicting it even when expired so callers can serve
stale data while flagging its age. All stats-accounted.

Related #765
…bsent (US4 FR-007/FR-008)

Wire the cache primitives into the registry search path: SearchRegistryServers
now caches each (registry,tag,query,limit) result and serves it while flagging
age/staleness via contracts.RegistryCacheInfo. Add RefreshRegistryCache +
POST /api/v1/registries/{id}/refresh to drop a registry's cached lists on
demand. A key-requiring registry with no key configured surfaces as an
'unavailable' marker (HTTP 200) on both the REST search endpoint and the MCP
search_servers tool instead of failing the surface (FR-008). Regenerate the
OpenAPI spec for the new endpoint and response fields.

Related #765
AddServer mutated the live immutable config snapshot returned by
runtime.Config() in place (currentConfig.Servers = append(...)) before
publishing via UpdateConfig. Background goroutines that range cfg.Servers
concurrently — LoadConfiguredServers during startup and the 2s-delayed
DiscoverAndIndexTools — raced that write, which the -race detector flagged
on TestHandleUpstreamServers_AddFromRegistry_HappyPath in CI.

Clone the config and its server slice, append to the clone, then publish
the new snapshot atomically. Behaviour is unchanged; readers holding the
prior snapshot are now isolated, as the immutable-snapshot contract intends.

Related #555
DiscoverTools snapshotted client.Config directly while only holding the
Manager's m.mu, but the reconcile add path (AddServerConfig) releases m.mu
before calling managed.Client.SetConfig, which guards mc.Config with the
client's own mc.mu. The two accesses therefore used different mutexes,
producing a data race the -race detector flagged intermittently in CI
(PR #555, run 26719908592).

Fix the read side: read config via the mutex-guarded GetConfig() accessor in
DiscoverTools and in the API-facing status readers (GetStats,
GetTotalToolCount, ListServers) that share the identical hazard — they read
client.Config after releasing m.mu and run concurrently with reconcile.
mc.mu is a leaf lock (only Get/SetConfig take it, never while holding m.mu),
so this is deadlock-safe.

Add TestDiscoverTools_ConfigRace: drives AddServerConfig (SetConfig) against
the readers concurrently; trips -race before the fix, green after.

Related: MCP-770
Dumbris added a commit that referenced this pull request Jun 1, 2026
The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats,
GetTotalToolCount, ListServers) through GetConfig(), but the underlying
hazard is broader: managed.Client.Config is a pointer field that SetConfig
swaps off m.mu (reconcile add path), while it is read concurrently from many
sites that do NOT hold the client's mc.mu — including Client.Connect's
unlocked phase (client.go:197) and detached state-change callback goroutines
(onStateChange, spawned by StateManager.SetError). The CI -race detector
flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after
the first fix was cherry-picked.

A mutex-guarded accessor can't cover this: most internal readers run while
mc.mu is already held, so routing them through an RLock GetConfig() would
deadlock (sync.RWMutex is not reentrant). Instead make the field itself
lock-free and data-race-free:

- Replace the exported `Config *config.ServerConfig` field with an
  unexported `cfg atomic.Pointer[config.ServerConfig]`.
- GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are
  lock-free and safe whether or not mc.mu is held.
- Route every reader through GetConfig() across the managed client, the
  upstream Manager, and the supervisor actor pool.

Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against
Client.Connect concurrently; trips -race on the old field, green after.

Verified: go test -race ./internal/upstream/... ./internal/runtime/... and
the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath /
_AdminAllowed_WriteOps pass x5 under -race; build + linter clean.

Related: MCP-770
Dumbris added 3 commits June 1, 2026 07:05
The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats,
GetTotalToolCount, ListServers) through GetConfig(), but the underlying
hazard is broader: managed.Client.Config is a pointer field that SetConfig
swaps off m.mu (reconcile add path), while it is read concurrently from many
sites that do NOT hold the client's mc.mu — including Client.Connect's
unlocked phase (client.go:197) and detached state-change callback goroutines
(onStateChange, spawned by StateManager.SetError). The CI -race detector
flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after
the first fix was cherry-picked.

A mutex-guarded accessor can't cover this: most internal readers run while
mc.mu is already held, so routing them through an RLock GetConfig() would
deadlock (sync.RWMutex is not reentrant). Instead make the field itself
lock-free and data-race-free:

- Replace the exported `Config *config.ServerConfig` field with an
  unexported `cfg atomic.Pointer[config.ServerConfig]`.
- GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are
  lock-free and safe whether or not mc.mu is held.
- Route every reader through GetConfig() across the managed client, the
  upstream Manager, and the supervisor actor pool.

Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against
Client.Connect concurrently; trips -race on the old field, green after.

Verified: go test -race ./internal/upstream/... ./internal/runtime/... and
the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath /
_AdminAllowed_WriteOps pass x5 under -race; build + linter clean.

Related: MCP-770
newAddFromRegistryTestServer opened a real Server (and its BBolt config.db
under t.TempDir()) but never closed it. On Windows an open file cannot be
unlinked, so t.TempDir's RemoveAll failed the AddFromRegistry tests with
"config.db ... being used by another process". Unix unlinks open files, so
ubuntu/macOS passed and hid it; the failure only surfaced once the MCP-770
race fix stopped aborting the test binary before cleanup ran.

Register t.Cleanup(mainSrv.Shutdown) after t.TempDir() so LIFO cleanup closes
the runtime/storage (releasing the handle) before the temp dir is removed.
Shutdown is nil-httpServer-safe and idempotent.

Related: MCP-770, spec 070
StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and
StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the
monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the
process-monitoring equivalents — with no synchronization. When a connect and a
shutdown overlap on the same client, the -race detector flags WG.Add (Start)
vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier
MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's
shutdown path actually run concurrently with reconcile.

Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods
mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor
goroutines never acquire it; it is never held across c.mu), so Stop's bounded
WG.Wait under the lock cannot deadlock.

Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips
-race on the unsynchronized fields, green with monitoringMu.

Related: MCP-770, spec 070
Dumbris added a commit that referenced this pull request Jun 1, 2026
StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and
StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the
monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the
process-monitoring equivalents — with no synchronization. When a connect and a
shutdown overlap on the same client, the -race detector flags WG.Add (Start)
vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier
MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's
shutdown path actually run concurrently with reconcile.

Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods
mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor
goroutines never acquire it; it is never held across c.mu), so Stop's bounded
WG.Wait under the lock cannot deadlock.

Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips
-race on the unsynchronized fields, green with monitoringMu.

Related: MCP-770, spec 070
The monitoringMu serialization (prev commit) was incomplete: StopStderr/
StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms
timeout. When a real child keeps its stderr open, monitorStderr blocks in
scanner.Scan(), the wait times out, Stop returns and releases monitoringMu —
but the detached WG.Wait goroutine lives on and races the next connect's
WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring
WG.Add vs the lingering StopStderrMonitoring WG.Wait).

Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the
monitor goroutine closes, and pass the context + done as locals so an
abandoned (timed-out) monitor never reads the shared ctx/state a later Start
overwrites. Stop now waits on done directly under monitoringMu (no detached
waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx
as a parameter.

Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking
pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact
escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race.

Related: MCP-770, spec 070
Dumbris added a commit that referenced this pull request Jun 1, 2026
The monitoringMu serialization (prev commit) was incomplete: StopStderr/
StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms
timeout. When a real child keeps its stderr open, monitorStderr blocks in
scanner.Scan(), the wait times out, Stop returns and releases monitoringMu —
but the detached WG.Wait goroutine lives on and races the next connect's
WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring
WG.Add vs the lingering StopStderrMonitoring WG.Wait).

Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the
monitor goroutine closes, and pass the context + done as locals so an
abandoned (timed-out) monitor never reads the shared ctx/state a later Start
overwrites. Stop now waits on done directly under monitoringMu (no detached
waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx
as a parameter.

Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking
pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact
escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race.

Related: MCP-770, spec 070
Kimi's dual-AI review of #555 returned REQUEST_CHANGES on the missing docs per
ENG-9. Add docs for the new registry surfaces:

- docs/features/registry-add.md — cohesive feature guide covering all three
  surfaces (CLI `registry list/search/add`, REST add/refresh/browse endpoints,
  MCP `upstream_servers` add_from_registry op), the CN-001 re-derivation
  security model, quarantine-by-default, and the cross-surface
  missing_required_input error contract.
- docs/api/rest-api.md — new Registries section listing the four endpoints
  (list, search, add, refresh) with request/response shapes, linking the
  feature guide.

Docs only; no code change. Clears Kimi's REQUEST_CHANGES for the Gate-3 merge.

Related: MCP-788, MCP-785, MCP-761, spec 070
Dumbris added a commit that referenced this pull request Jun 1, 2026
DiscoverTools snapshotted client.Config directly while only holding the
Manager's m.mu, but the reconcile add path (AddServerConfig) releases m.mu
before calling managed.Client.SetConfig, which guards mc.Config with the
client's own mc.mu. The two accesses therefore used different mutexes,
producing a data race the -race detector flagged intermittently in CI
(PR #555, run 26719908592).

Fix the read side: read config via the mutex-guarded GetConfig() accessor in
DiscoverTools and in the API-facing status readers (GetStats,
GetTotalToolCount, ListServers) that share the identical hazard — they read
client.Config after releasing m.mu and run concurrently with reconcile.
mc.mu is a leaf lock (only Get/SetConfig take it, never while holding m.mu),
so this is deadlock-safe.

Add TestDiscoverTools_ConfigRace: drives AddServerConfig (SetConfig) against
the readers concurrently; trips -race before the fix, green after.

Related: MCP-770
Dumbris added a commit that referenced this pull request Jun 1, 2026
The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats,
GetTotalToolCount, ListServers) through GetConfig(), but the underlying
hazard is broader: managed.Client.Config is a pointer field that SetConfig
swaps off m.mu (reconcile add path), while it is read concurrently from many
sites that do NOT hold the client's mc.mu — including Client.Connect's
unlocked phase (client.go:197) and detached state-change callback goroutines
(onStateChange, spawned by StateManager.SetError). The CI -race detector
flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after
the first fix was cherry-picked.

A mutex-guarded accessor can't cover this: most internal readers run while
mc.mu is already held, so routing them through an RLock GetConfig() would
deadlock (sync.RWMutex is not reentrant). Instead make the field itself
lock-free and data-race-free:

- Replace the exported `Config *config.ServerConfig` field with an
  unexported `cfg atomic.Pointer[config.ServerConfig]`.
- GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are
  lock-free and safe whether or not mc.mu is held.
- Route every reader through GetConfig() across the managed client, the
  upstream Manager, and the supervisor actor pool.

Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against
Client.Connect concurrently; trips -race on the old field, green after.

Verified: go test -race ./internal/upstream/... ./internal/runtime/... and
the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath /
_AdminAllowed_WriteOps pass x5 under -race; build + linter clean.

Related: MCP-770
Dumbris added a commit that referenced this pull request Jun 1, 2026
StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and
StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the
monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the
process-monitoring equivalents — with no synchronization. When a connect and a
shutdown overlap on the same client, the -race detector flags WG.Add (Start)
vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier
MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's
shutdown path actually run concurrently with reconcile.

Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods
mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor
goroutines never acquire it; it is never held across c.mu), so Stop's bounded
WG.Wait under the lock cannot deadlock.

Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips
-race on the unsynchronized fields, green with monitoringMu.

Related: MCP-770, spec 070
Dumbris added a commit that referenced this pull request Jun 1, 2026
The monitoringMu serialization (prev commit) was incomplete: StopStderr/
StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms
timeout. When a real child keeps its stderr open, monitorStderr blocks in
scanner.Scan(), the wait times out, Stop returns and releases monitoringMu —
but the detached WG.Wait goroutine lives on and races the next connect's
WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring
WG.Add vs the lingering StopStderrMonitoring WG.Wait).

Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the
monitor goroutine closes, and pass the context + done as locals so an
abandoned (timed-out) monitor never reads the shared ctx/state a later Start
overwrites. Stop now waits on done directly under monitoringMu (no detached
waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx
as a parameter.

Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking
pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact
escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race.

Related: MCP-770, spec 070
Dumbris added a commit that referenced this pull request Jun 1, 2026
…etConfig (MCP-770) (#556)

* fix(upstream): route concurrent Config reads through GetConfig (MCP-770)

DiscoverTools snapshotted client.Config directly while only holding the
Manager's m.mu, but the reconcile add path (AddServerConfig) releases m.mu
before calling managed.Client.SetConfig, which guards mc.Config with the
client's own mc.mu. The two accesses therefore used different mutexes,
producing a data race the -race detector flagged intermittently in CI
(PR #555, run 26719908592).

Fix the read side: read config via the mutex-guarded GetConfig() accessor in
DiscoverTools and in the API-facing status readers (GetStats,
GetTotalToolCount, ListServers) that share the identical hazard — they read
client.Config after releasing m.mu and run concurrently with reconcile.
mc.mu is a leaf lock (only Get/SetConfig take it, never while holding m.mu),
so this is deadlock-safe.

Add TestDiscoverTools_ConfigRace: drives AddServerConfig (SetConfig) against
the readers concurrently; trips -race before the fix, green after.

Related: MCP-770

* fix(upstream): make managed.Client config access atomic (MCP-770)

The first MCP-770 fix routed Manager-side reads (DiscoverTools, GetStats,
GetTotalToolCount, ListServers) through GetConfig(), but the underlying
hazard is broader: managed.Client.Config is a pointer field that SetConfig
swaps off m.mu (reconcile add path), while it is read concurrently from many
sites that do NOT hold the client's mc.mu — including Client.Connect's
unlocked phase (client.go:197) and detached state-change callback goroutines
(onStateChange, spawned by StateManager.SetError). The CI -race detector
flagged the Connect-vs-SetConfig variant on PR #555's macOS unit job after
the first fix was cherry-picked.

A mutex-guarded accessor can't cover this: most internal readers run while
mc.mu is already held, so routing them through an RLock GetConfig() would
deadlock (sync.RWMutex is not reentrant). Instead make the field itself
lock-free and data-race-free:

- Replace the exported `Config *config.ServerConfig` field with an
  unexported `cfg atomic.Pointer[config.ServerConfig]`.
- GetConfig() returns cfg.Load(); SetConfig() does cfg.Store(). Both are
  lock-free and safe whether or not mc.mu is held.
- Route every reader through GetConfig() across the managed client, the
  upstream Manager, and the supervisor actor pool.

Add TestConnect_ConfigRace: drives AddServerConfig (SetConfig) against
Client.Connect concurrently; trips -race on the old field, green after.

Verified: go test -race ./internal/upstream/... ./internal/runtime/... and
the original CI victims TestHandleUpstreamServers_AddFromRegistry_HappyPath /
_AdminAllowed_WriteOps pass x5 under -race; build + linter clean.

Related: MCP-770

* fix(core): serialize stderr/process monitoring lifecycle (MCP-770)

StartStderrMonitoring (from connectStdio during a reconcile-driven Connect) and
StopStderrMonitoring (from Disconnect during Manager.ShutdownAll) accessed the
monitoring lifecycle fields — stderrMonitoringCtx/Cancel/WaitGroup and the
process-monitoring equivalents — with no synchronization. When a connect and a
shutdown overlap on the same client, the -race detector flags WG.Add (Start)
vs WG.Wait (Stop). CI caught this on PR #555's ubuntu unit job once the earlier
MCP-770 fixes (atomic Config) + the Windows temp-dir cleanup made the test's
shutdown path actually run concurrently with reconcile.

Add a dedicated monitoringMu that makes the four Start*/Stop*Monitoring methods
mutually exclusive, so Add never overlaps Wait. The mutex is a leaf (monitor
goroutines never acquire it; it is never held across c.mu), so Stop's bounded
WG.Wait under the lock cannot deadlock.

Add TestStderrMonitoring_StartStopRace: hammers Start/Stop concurrently; trips
-race on the unsynchronized fields, green with monitoringMu.

Related: MCP-770, spec 070

* fix(core): per-cycle done channel for monitoring stop (MCP-770)

The monitoringMu serialization (prev commit) was incomplete: StopStderr/
StopProcessMonitoring ran WG.Wait() in a DETACHED goroutine guarded by a 500ms
timeout. When a real child keeps its stderr open, monitorStderr blocks in
scanner.Scan(), the wait times out, Stop returns and releases monitoringMu —
but the detached WG.Wait goroutine lives on and races the next connect's
WG.Add. CI caught this on #555 (DiscoverTools→Connect StartStderrMonitoring
WG.Add vs the lingering StopStderrMonitoring WG.Wait).

Replace the reused sync.WaitGroup with a per-cycle `done chan struct{}` the
monitor goroutine closes, and pass the context + done as locals so an
abandoned (timed-out) monitor never reads the shared ctx/state a later Start
overwrites. Stop now waits on done directly under monitoringMu (no detached
waiter) and nils the per-cycle fields. monitorStderr/monitorProcess take ctx
as a parameter.

Strengthen TestStderrMonitoring with an AbandonedMonitorNoRace case: a blocking
pipe keeps the monitor alive so Stop hits the timeout/abandon path — the exact
escape the WaitGroup version raced on. AddFromRegistry passes 12x under -race.

Related: MCP-770, spec 070
Dumbris added a commit that referenced this pull request Jun 1, 2026
* docs(064): Glass Cockpit — transparent & steerable agent cockpit spec

Spec, plan, and design artifacts for making the existing Paperclip
"MCPProxy" cockpit (spec 045) transparent and steerable: invert the
default from "proceed" to "checkpoint at every design-decision
boundary" via three human gates (plan-of-attack, per-spec design,
pre-merge) mapped to Paperclip native primitives (executionPolicy
approval stages, request_confirmation/suggest_tasks interactions,
issue_tree_holds), with reasoning visible before each gate and a
single "waiting on you" view.

Phased rollout: A) config + agent-instruction only (the dry-run
target), B) a Paperclip plugin for the fused transparency UI, C) a
fork only if A/B fall short. SynapBus is log/wiki only, never on the
critical path. Includes rewritten gate-aware agent instructions,
consumed-API + executionPolicy + agent-instruction contracts, data
model, research, and operator quickstart.

Supersedes spec 062's fresh-dev-instance approach; extends spec 045.

* docs(064): amend gate model — human-merge → dual-AI auto-merge + human veto

Session 2 amendment to FR-005 + US3: replace the mandatory human-merge gate
(throughput bottleneck) with draft-PR + dual-AI-review consensus auto-merge.
Two reviewers on different model families (Gemini 2.5-pro Critic + Codex),
never the implementer; tests-green + both-accept → auto-merge; human is an
optional 3rd reviewer with veto (request-changes/hold freezes auto-merge).
Prerequisite flagged: a bot GitHub identity (agents currently = author's gh,
and GitHub forbids self-approval) — interim fallback is 2-AI-review-as-
required-check with the human merging. codex-local Paperclip adapter exists.

* docs(064): dual-AI-review auto-merge — reviewer doctrine + engineer draft-PR + setup

Adds the Session-2 gate-model deliverables: reviewer/REVIEWER.md (shared
RV-1..RV-6 dual-review doctrine), codex-reviewer/AGENTS.md (2nd reviewer on
codex-local), amends engineer ENG-4 to open DRAFT PRs + request 2 AI reviewers
(no self-merge), and auto-merge-setup.md (GitHub branch-protection config +
the bot-identity prerequisite + interim human-merge fallback + open items).

* docs(064): reviewers on subscription auth only; Gemini quota-exhausted, Codex gpt-5.5 ready

Per user directive: both AI reviewers use paid SUBSCRIPTION logins, not API keys.
- Gemini Critic: subscription/OAuth, pin gemini-2.5-pro (3.5/3 UNVERIFIED — quota
  exhausted on every probe today; switch if confirmed). TWO blockers: quota +
  empty-prompt adapter bug → cannot accept yet.
- Codex reviewer: ChatGPT subscription, gpt-5.5 (codex-cli 0.46.0) — READY now.
- Live two-reviewer set today = Codex + human (FR-005f) until Gemini recovers.

* docs(064): Critic GEMINI.md — subscription-only, quota-blocked; Kimi+Codex as live pair

Gemini settings pin gemini-3.1-pro-preview (subscription/OAuth) but quota is
exhausted (no reset hint) + empty-prompt adapter bug → Critic can't accept yet.
Live 2-AI reviewer pair = Codex gpt-5.5 + Kimi-K2.5 (opencode_local, Gcore key
present); Gemini rejoins as 3rd reviewer when quota returns.

* docs(064): stand up Codex+Kimi reviewer agents; codex config fix

Live dual-AI reviewer pair created in the running Paperclip cockpit and
verified responding (2026-05-31):
- CodexReviewer  — codex_local / gpt-5-codex   (5b94562c-…)
- KimiReviewer   — opencode_local / Kimi-K2.5  (fdaa1d4c-…)
Both carry managed instruction bundles (shared doctrine + RV-1..RV-6 +
role notes), report to CEO, idle, heartbeat off (woken by review-stage).

Docs:
- add canonical kimi-reviewer/AGENTS.md (the design lacked it)
- correct codex-reviewer/AGENTS.md model facts (gpt-5.5 -> gpt-5-codex)
- auto-merge-setup.md: live pair is Codex+Kimi; Gemini Critic becomes the
  3rd reviewer when its subscription quota recovers

codex config fix (~/.codex/config.toml, not in repo): model_reasoning_effort
xhigh->high and model gpt-5.5->gpt-5-codex. On codex-cli 0.46.0 + ChatGPT
subscription auth, gpt-5.5 needs a newer CLI and gpt-5.4/5.3-codex/5.2 are
auth-restricted; gpt-5-codex/gpt-5 are the working models. Backup at
~/.codex/config.toml.bak.pre-reviewer-fix.*

* docs(064): engineers drive PRs green + bundle docs/ updates (ENG-8/9)

engineer/AGENTS.md:
- ENG-8: drive every required check to green before review — run local
  verification before push, watch `gh pr checks --watch`, push fixes until
  all green; never leave/hand off a red PR, never --no-verify or weaken a
  check. Green CI is the engineer's job, not the reviewer's.
- ENG-9: when a change touches CLI/REST/MCP API/config/defaults/security or
  anything under docs/, the SAME PR must update docs/ (+ CLAUDE.md/
  oas/swagger.yaml/README where mirrored). Docs-only changes exempt from TDD.
- ENG-5 reworked to dual-AI merge-readiness (Codex+Kimi accept + all CI green).

reviewer/REVIEWER.md RV-3: red/pending check = automatic request_changes;
missing docs when the change warrants them = request_changes.

Applied to the live Paperclip brains: 3 engineers (Backend/Frontend/MacOS)
re-flattened from canonical; Codex+Kimi reviewer brains refreshed.

* docs(064): record applied CI-context branch protection on main

Phase-1 gate live on main (no bot identity needed): required_status_checks
strict=false with 8 always-run, non-path-conditional contexts (Lint, Unit
Tests ubuntu, Build ubuntu/macos/windows, Build Frontend, Validate PR title,
Verify OpenAPI Artifacts). Existing 1-review + enforce_admins=false kept.
Verified: green PR #553 satisfies all 8 (blocked only by review); in-flight
PR #555 blocked on pending required checks. Documents the deliberately-
excluded checks and the Go-version-pinned context-name fragility.

* docs(064): ENG-3 — branch from origin/main, never from a feature branch

Forking a work branch from another feature branch drags its unmerged
commits into the PR (root cause of spec-064 docs leaking into the
MCP-770 race fix #556). ENG-3 now mandates fetch + branch from
origin/main explicitly, in both the engineer bundle and the contract.
…defaults in ListRegistries

Address the two Codex Gate-3 review findings on PR #555 (spec 070).

Finding 1 (security): the add_from_registry MCP op persisted a new upstream
without the AllowServerAdd check that the ordinary `add` op enforces, so a user
who set allow_server_add=false could still add a server by registry reference.
Gate add_from_registry behind the same check.

Finding 2 (FR-006 consistency): Runtime.ListRegistries returned a hard-coded
legacy Smithery entry for an empty config, or only cfg.Registries when set —
diverging from the merged defaults+custom source that search/add use. Route it
through registries.SetRegistriesFromConfig so the daemon/API list always shows
built-in defaults merged with custom entries.

Tests-first for both (ENG-1).

Related #555
@Dumbris
Copy link
Copy Markdown
Member Author

Dumbris commented Jun 1, 2026

Re-review verdict: APPROVED ✅ (MCP-803)

Both Gate-3 findings from MCP-800 are correctly addressed in commit d24c577c.

Finding 1 — SECURITY: add_from_registry bypassed allow_server_add

FIXED. internal/server/mcp.go:2450add_from_registry is folded into the same case operationAdd, "add_from_registry": branch gating on AllowServerAdd. When allow_server_add=false, both plain add and add_from_registry return "Adding servers is not allowed" before any persistence.

Test: TestHandleUpstreamServers_AddFromRegistry_BlockedWhenAddDisallowed passes with the correct assertion.

Finding 2 — FR-006: ListRegistries hard-coded legacy entry

FIXED. internal/runtime/runtime.go:1455-1456ListRegistries now calls registries.SetRegistriesFromConfig(cfg) + registries.ListRegistries(), the same merged catalog path as SearchServers. No more hard-coded Smithery bypass.

Test: TestListRegistries_MergesDefaultsWithCustom covers empty-config and custom-config cases.

CI status

  • All required checks green on PR-triggered run 26742088571 (the authoritative run).
  • Push-triggered run 26742086535 had two non-regression failures:
    • Ubuntu: non-deterministic bbolt shutdown race in test helper — passed on re-run.
    • Windows: GitHub Actions runner cancellation — pre-existing infra flake, not caused by this PR.
  • Lint, CodeQL, Build (all platforms), E2E, OAuth E2E: all pass.

Safe to merge.

@Dumbris Dumbris merged commit 354580f into main Jun 1, 2026
46 of 48 checks passed
@Dumbris Dumbris deleted the 070-registry-easy-upstream-add branch June 1, 2026 08:33
Dumbris added a commit that referenced this pull request Jun 1, 2026
…N-004/SC-004)

Guards that all three add surfaces (Web UI, CLI, MCP) route through the
single AddServerFromRegistry keystone op with quarantine-by-default
preserved. The production code landed via PR #555; this is the Phase-7
regression that was cut after #555's branch point.

Related MCP-746
Dumbris added a commit that referenced this pull request Jun 1, 2026
The US1/US2 gaps recorded 2026-05-31 were partly stale: Web UI already
had an 'Add to MCP' path and the CLI already listed/searched registries.
Real work was de-duplicating every surface onto the single
AddServerFromRegistry keystone op (PR #555 / 354580f).

Related MCP-746
Dumbris added a commit that referenced this pull request Jun 1, 2026
…lifecycle

The retrieval-d1 job failed: `mcpproxy serve` exited immediately with
"data_dir: directory does not exist" (serve refuses to create a missing
data_dir), and the server was backgrounded in a separate step from the readiness
poll (a process backgrounded in one step is reaped when that step's shell exits).

Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE
step (shared shell) with a trap that stops the server however the step ends; also
fail fast if the server process dies during startup. D2 gate unaffected.

Verified: mcpproxy boots and serves /api/v1/status locally with the created
data_dir; actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Dumbris added a commit that referenced this pull request Jun 1, 2026
…n + O3 amendment (Related MCP-746) (#559)

* test(070): cross-surface registry-add consistency regression (T021, CN-004/SC-004)

Guards that all three add surfaces (Web UI, CLI, MCP) route through the
single AddServerFromRegistry keystone op with quarantine-by-default
preserved. The production code landed via PR #555; this is the Phase-7
regression that was cut after #555's branch point.

Related MCP-746

* docs(070): O3 spec amendment — record stale US1/US2 gap premise

The US1/US2 gaps recorded 2026-05-31 were partly stale: Web UI already
had an 'Add to MCP' path and the CLI already listed/searched registries.
Real work was de-duplicating every surface onto the single
AddServerFromRegistry keystone op (PR #555 / 354580f).

Related MCP-746

* fix(upstream): resolve connectStdio↔monitorStderr data race on c.stderr

TestCrossSurfaceConsistency_RegistryAdd is the first unit test to drive the
real supervisor connect loop under -race and it surfaced a genuine production
data race: connectStdio reassigns the shared c.stderr field on every
(re)connect (connection_stdio.go:217) while the monitorStderr goroutine read
that same field to build its scanner (monitoring.go:170). A reconnect's write
raced a lingering monitor's read.

Capture the stderr reader as a local under monitoringMu in
StartStderrMonitoring and pass it to monitorStderr as an argument, completing
the existing "abandoned goroutine touches only its locals" design — the
monitor no longer reads the shared field, so the write can never race it.

Add TestStderrMonitoring_ReassignDuringMonitorNoRace, which reproduces the
exact write@217/read@170 pair under -race (red before this change, green
after).

Related #559

* fix(configsvc): guard Subscribe init-send against concurrent Close (send-on-closed)

The registry-add unit tests added in this PR drive a full Runtime + configsvc
under -race and surfaced a second production data race (distinct from the
connectStdio↔monitorStderr fix): Subscribe spawns a goroutine that sends the
initial snapshot on the subscriber channel (service.go:206) holding no lock,
while Close (service.go:261) and Unsubscribe close that same channel under
subMu. The send racing the close both data-races and panics with
"send on closed channel".

Deliver the initial snapshot under subMu(R) after confirming the channel is
still a live subscriber (isSubscribedLocked), so the close paths — which run
under the subMu write lock — are mutually exclusive with the send. The send is
non-blocking (the cap-10 buffer is empty at subscribe time) so the lock is
never held across a blocking send. Preserve the ctx-canceled early-out.

Add TestService_SubscribeCloseRace, which reproduces the panic/race under
-race (red before this change, green after).

Related #559
Dumbris added a commit that referenced this pull request Jun 1, 2026
…h envelope

The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7
servers (45 tools), but the probe parsed the index/search response at the top
level while results are nested under the `{"success":true,"data":{"results":[…]}}`
envelope, so it read 0 every attempt and timed out.

Fix: parse `data.results`. Verified locally — index returns 5 results for q=file
within ~6s of boot. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Dumbris added a commit that referenced this pull request Jun 1, 2026
The retrieval scorer was running against a partially-indexed instance: the
readiness probe passed at the first indexed tool (>=1 search result), so scoring
started before all 7 reference servers connected -> Recall@5 measured 0.387 vs
baseline threshold 0.631 (false regression).

Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools
across the 7 servers) and add a short settle for the index build, then score.

Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 =
0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline
is exactly reproducible. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
Dumbris added a commit that referenced this pull request Jun 1, 2026
…eval (MCP-742) (#561)

* ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742)

Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both
Spec-065 evaluations as a regression gate over the frozen datasets:

- security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval
  ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05
  (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer
  defaults of 0.80 would always fail). Sourced in one place pending the MCP-815
  baseline `security.gate` block so gate and baseline never drift.
- retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index
  readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs
  (npx/uvx fetch flake), blocking on the nightly schedule.

Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as
artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned
public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043),
actionlint clean.

Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle

The retrieval-d1 job failed: `mcpproxy serve` exited immediately with
"data_dir: directory does not exist" (serve refuses to create a missing
data_dir), and the server was backgrounded in a separate step from the readiness
poll (a process backgrounded in one step is reaped when that step's shell exits).

Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE
step (shared shell) with a trap that stops the server however the step ends; also
fail fast if the server process dies during startup. D2 gate unaffected.

Verified: mcpproxy boots and serves /api/v1/status locally with the created
data_dir; actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): fix D1 readiness probe — parse data.results from index/search envelope

The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7
servers (45 tools), but the probe parsed the index/search response at the top
level while results are nested under the `{"success":true,"data":{"results":[…]}}`
envelope, so it read 0 every attempt and timed out.

Fix: parse `data.results`. Verified locally — index returns 5 results for q=file
within ~6s of boot. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): D1 readiness waits for full tool catalog before scoring

The retrieval scorer was running against a partially-indexed instance: the
readiness probe passed at the first indexed tool (>=1 search result), so scoring
started before all 7 reference servers connected -> Recall@5 measured 0.387 vs
baseline threshold 0.631 (false regression).

Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools
across the 7 servers) and add a short settle for the index build, then score.

Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 =
0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline
is exactly reproducible. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
Dumbris added a commit that referenced this pull request Jun 1, 2026
* docs(065): Evaluation Foundation spec (D1+D2) — measure security & discovery

First implementation epic of the H2-2026 roadmap. Move from asserting to
measuring: D1 tool-retrieval golden set (Recall@k/MRR/nDCG over a frozen
corpus, the prerequisite + GEPA fitness function) and D2 security regression
corpus (per-detector precision/recall/F1 + false-positive-rate, the quiet-
security metric). Both extend the existing mcp-eval harness; gated in CI.
D3/D4/D5/D6 are follow-on specs that compose on these.

* docs(065): plan + research + data-model + contracts + quickstart for Evaluation Foundation

Phase 0/1 design for D1 (tool-retrieval golden set, Recall@k/MRR/nDCG via REST
/api/v1/index/search) and D2 (security regression corpus, per-detector
precision/recall/F1/FPR via a cmd/scan-eval JSON bridge). Extends the mcp-eval
harness; datasets frozen + versioned; 3 JSON-schema contracts; quickstart.
Constitution PASS. D3/D4/D5/D6 remain follow-on specs.

* feat(065): add cmd/scan-eval D2 detector bridge (B1) (#550)

Bridge the Spec 065 / D2 security corpus to mcpproxy's production
sensitive-data detector and emit per-entry, per-detector verdict JSON
for the Python SecurityScorer (B3). Offline, deterministic test tooling
only — no runtime or REST surface (Security-by-Default, R-03).

- cmd/scan-eval: reads a security-corpus.schema.json-conforming file,
  runs each entry.description through security.NewDetector(nil).Scan,
  echoes ground-truth id/label/category, emits scan-verdict.schema.json.
- Flags: --corpus (required), --out (default stdout),
  --detectors=sensitive-data (default), --scanners (reserved opt-in
  extension point for the deferred Docker bundled-scanner pass).
- Exit codes: 0 ok, 4 bad/missing corpus or flags, 1 write failure.
- contracts/scan-verdict.schema.json: the verdict output contract B3
  consumes to derive per-detector TP/FP/TN/FN -> P/R/F1/FPR.
- Test-first: TP (embedded AWS key), TN, missing/empty corpus, and
  deterministic-output coverage; committed minimal corpus fixture.

The fixture demonstrates honest measurement (INV-3): the detector is a
true positive on a credential-exfil description, a false negative on
pure prompt-injection text, and a visible false positive on a benign
doc referencing ~/.aws/credentials — i.e. it measures real coverage
rather than trivially passing.

Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): security_corpus_v1.json (D2 security regression dataset) (#551)

Add the D2 labeled security corpus the detection scorer measures against,
plus a co-located validator test enforcing the contract and cross-entity
invariants (INV-3, INV-4 / SC-004).

- 43 entries: 20 malicious (tool_poisoning/prompt_injection/shadowing/rug_pull),
  15 clean benign, 8 attack-resembling hard negatives (2 per attack category).
- Every entry carries label + category + provenance.{source,license} (FR-007).
- Sources limited to self-authored + DVMCP (MIT); MCPTox/MCP-AttackBench and the
  unconfirmed-license mcp-injection-experiments are referenced externally only,
  never vendored (CN-005, R-07, R-A). The validator fails the build on any
  non-redistributable license.
- corpus_test.go validates against security-corpus.schema.json (santhosh-tekuri
  jsonschema/v6) and asserts attack coverage + >=1 hard negative per category.

Related #739

Co-authored-by: Paperclip <noreply@paperclip.ing>

* feat(065): D1 retrieval datasets (renamed for clarity) + merged D1/D2 README (#554)

* feat(065): D1 retrieval datasets — frozen corpus, golden set, baseline (A2)

Generate and commit the Spec 065 D1 retrieval evaluation artifacts via A1's
mcp-eval datasets/retrieval CLI (cb37f84):

- corpus_v1.json: frozen 45-tool snapshot over 7 no-auth reference MCP servers
  (filesystem, git, memory, sqlite, fetch, time, sequential-thinking), via
  GET /api/v1/tools. Immutable (CN-002); refresh = corpus_v2 (FR-012).
- corpus_v1.source.json: secret-free, reproducible mcpproxy source config.
- retrieval_golden_v1.json: 47 graded queries (relevance 0|1|2), 11 cross-server
  hard-negatives (FR-001), R-C compliant (queries never name the tool). Passes
  schema + INV-1 validation.
- baseline_v1.json: reference Recall@k/MRR/nDCG@10/MAP + Recall@5 tolerance 0.05,
  the CI regression anchor (FR-009). Retrieval metrics top-level (scorer reads
  them directly); empty security section reserved for D2 (CN-004).
- README.md: documented, repeatable regeneration procedure (FR-012).

Verified end-to-end against a live BM25 index: validate OK, gate PASS
(Recall@5=0.681 >= baseline-0.05). Score reports stay local (CN-003).

Related #MCP-740
Co-Authored-By: Paperclip <noreply@paperclip.ing>

* fix(065): rename D1 dataset files for clarity + merge D1/D2 README

Addresses pre-merge review on #552: corpus_v1.source.json (valid config) vs
corpus_v1.json (scored snapshot, NOT a config) invited 'serve --config
corpus_v1.json' which fails. Renamed:
- corpus_v1.source.json -> snapshot-servers.config.json (the servable config)
- corpus_v1.json        -> corpus_v1.tools.json        (the scored snapshot)
Updated baseline_v1.json source_config + corpus note refs; merged the D1 and D2
dataset READMEs into one with an explicit servable-vs-dataset cheat sheet.

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742) (#561)

* ci(065): eval.yml regression gate — D2 security (blocking) + D1 retrieval (MCP-742)

Spec 065 / C1 (FR-009, US3/P2). Add `.github/workflows/eval.yml` running both
Spec-065 evaluations as a regression gate over the frozen datasets:

- security-d2 (blocking): provenance/license guard (FR-007/CN-005) → cmd/scan-eval
  ×3 → mcp-eval SecurityScorer. Thresholds --fpr-ceiling 0.10 --recall-floor 0.05
  (the sensitive-data detector measures recall ≈0.10 on this corpus; scorer
  defaults of 0.80 would always fail). Sourced in one place pending the MCP-815
  baseline `security.gate` block so gate and baseline never drift.
- retrieval-d1: boots mcpproxy over snapshot-servers.config.json, waits for index
  readiness, runs the RetrievalScorer with baseline+tolerance. Report-only on PRs
  (npx/uvx fetch flake), blocking on the nightly schedule.

Shared D2 logic in scripts/eval-ci-smoke.sh (CI == local). Reports upload as
artifacts, never committed (CN-003, guarded). mcp-eval checked out at a pinned
public ref. Verified locally: full D2 gate PASS (P=0.667 R=0.100 FPR=0.043),
actionlint clean.

Related #555 datasets; implements MCP-742 (Gate-2 plan rev 2 accepted).

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): fix D1 retrieval job — create data_dir + single-step server lifecycle

The retrieval-d1 job failed: `mcpproxy serve` exited immediately with
"data_dir: directory does not exist" (serve refuses to create a missing
data_dir), and the server was backgrounded in a separate step from the readiness
poll (a process backgrounded in one step is reaped when that step's shell exits).

Fix: mkdir -p the data_dir, and boot + readiness-poll + run the scorer in ONE
step (shared shell) with a trap that stops the server however the step ends; also
fail fast if the server process dies during startup. D2 gate unaffected.

Verified: mcpproxy boots and serves /api/v1/status locally with the created
data_dir; actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): fix D1 readiness probe — parse data.results from index/search envelope

The retrieval-d1 readiness poll never passed: mcpproxy booted and indexed all 7
servers (45 tools), but the probe parsed the index/search response at the top
level while results are nested under the `{"success":true,"data":{"results":[…]}}`
envelope, so it read 0 every attempt and timed out.

Fix: parse `data.results`. Verified locally — index returns 5 results for q=file
within ~6s of boot. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

* ci(065): D1 readiness waits for full tool catalog before scoring

The retrieval scorer was running against a partially-indexed instance: the
readiness probe passed at the first indexed tool (>=1 search result), so scoring
started before all 7 reference servers connected -> Recall@5 measured 0.387 vs
baseline threshold 0.631 (false regression).

Fix: poll /api/v1/tools until the catalog reaches the near-full count (~45 tools
across the 7 servers) and add a short settle for the index build, then score.

Verified locally end-to-end on a fully-indexed instance: Recall@1/3/5/10 =
0.418/0.560/0.681/0.791, Gate(recall_at_5) PASS (0.681 vs 0.631) — the baseline
is exactly reproducible. actionlint clean.

Related #555 datasets; MCP-742.

Co-Authored-By: Paperclip <noreply@paperclip.ing>

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>

* ci(065): trigger eval gate on D1 retrieval system-under-test paths (#563)

The eval.yml pull_request.paths filter only matched the D2 security
surface (internal/security, cmd/scan-eval) and the harness/datasets, so
changes to the D1 retrieval system it gates — BM25 index, MCP tool-
discovery routing, the REST search envelope, server/CLI boot — never
triggered the workflow. Spec 065 requires CI to catch discovery
regressions when search/index/tool-discovery behavior changes.

Add internal/index/**, internal/server/**, internal/httpapi/**, and
cmd/mcpproxy/** to the trigger paths. Job logic is unchanged, so D2
security and D1 retrieval stay green.

Related MCP-833

---------

Co-authored-by: Paperclip <noreply@paperclip.ing>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants