feat(diagnostics): stable error-code catalog with REST + CLI surfacing#400
Merged
Conversation
📦 Build ArtifactsWorkflow Run: View Run Available Artifacts
How to DownloadOption 1: GitHub Web UI (easiest)
Option 2: GitHub CLI gh run download 24901706290 --repo smart-mcp-proxy/mcpproxy-go
|
|
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Adds the speckit specify/clarify/plan/tasks artifacts for spec 044 (diagnostics & error taxonomy). No production code changes. Related to design doc docs/superpowers/specs/2026-04-24-diagnostics-error-taxonomy-design.md. - specs/044-diagnostics-taxonomy/spec.md — user stories, functional reqs, success criteria - specs/044-diagnostics-taxonomy/plan.md — implementation plan - specs/044-diagnostics-taxonomy/research.md — phase 0 decisions - specs/044-diagnostics-taxonomy/data-model.md — catalog/error/fix types - specs/044-diagnostics-taxonomy/contracts/diagnostics-openapi.yaml — REST contract - specs/044-diagnostics-taxonomy/contracts/catalog-schema.json — JSON schema - specs/044-diagnostics-taxonomy/quickstart.md — developer onboarding - specs/044-diagnostics-taxonomy/tasks.md — dependency-ordered task list - specs/044-diagnostics-taxonomy/checklists/requirements.md — spec QA checklist - CLAUDE.md — active-technologies entry
Adds the new internal/diagnostics package with: - types.go — Code, Severity, FixStep, CatalogEntry, DiagnosticError, FixRequest/Result - codes.go — 25 stable codes across 8 domains (STDIO, OAUTH, HTTP, DOCKER, CONFIG, QUARANTINE, NETWORK, UNKNOWN) - registry.go — in-memory registry seeded in init() - catalog.go — Get/All/Has lookup API with stable ordering - classifier.go — Classify() with typed-error inspection (errors.Is/As) and DiagnoseHTTPStatus() helper - fixers.go — fixer registry keyed by FixStep.FixerKey - catalog_test.go — completeness tests (FR-003) - classifier_test.go — golden samples for STDIO/HTTP/NETWORK classifiers All 25 codes pass completeness tests with race detection. Full module still builds. Related to spec 044.
…point
Spec 044 Phase C — catalog plumbing + surfacing.
## Changes
- internal/diagnostics/builtin_fixers.go — safe default fixers for
stdio_show_last_logs, config_migrate_deprecated, server_disable_scanner,
oauth_reauth (dry-run preview; execute-mode blocked until higher layer
wires real implementation)
- internal/runtime/stateview/stateview.go — adds Diagnostic *DiagnosticError
to ServerStatus with deep-clone support
- internal/runtime/supervisor/supervisor.go — classifies ConnectionInfo.LastError
into a stable code at the two existing LastError assignment sites; clears
Diagnostic on successful reconnect
- internal/runtime/runtime.go — serializes the Diagnostic onto the server
map so it flows through /api/v1/servers and the new per-server endpoint
- internal/httpapi/diagnostics_per_server.go — new GET /api/v1/servers/{id}/diagnostics
returning server health + structured diagnostic when failing
- internal/httpapi/diagnostics_fix.go — new POST /api/v1/diagnostics/fix
with in-memory 1/s per-(server,code) rate limiter (429 + Retry-After),
destructive dry-run default, 409 when destructive+execute is sent
without explicit mode, 15s invocation timeout
- internal/httpapi/server.go — registers both routes inside /api/v1
- docs/errors/*.md — 29 stub pages (one per registered code) via
scripts/gen-errors-docs.sh
- scripts/check-errors-docs-links.sh — CI-friendly bidirectional link check
- scripts/gen-errors-docs.sh — regenerates missing docs stubs
- scripts/test-diagnostics-e2e.sh — smoke runner
## Testing
- go test -race ./internal/diagnostics/... — PASS (25 codes + fixer tests)
- go test -race ./internal/runtime/stateview/... ./internal/runtime/supervisor/... — PASS
- go test ./internal/httpapi/... — PASS
- scripts/check-errors-docs-links.sh — OK (29 codes, all linked)
- scripts/test-diagnostics-e2e.sh — OK
Related to spec 044.
Adds a new Cobra subcommand that prints the full diagnostics catalog in pretty or JSON format. Primarily useful for AI agents consuming the stable taxonomy and for auto-generating release notes / docs indexes. ## Changes - cmd/mcpproxy/doctor_listcodes_cmd.go — new subcommand registered under 'mcpproxy doctor list-codes' with -o pretty|json ## Testing - go build ./cmd/mcpproxy — OK - /tmp/mcpproxy doctor list-codes — prints 29 codes - /tmp/mcpproxy doctor list-codes -o json | jq '.[] | select(.severity == "error")' — works Related to spec 044.
Wires the web UI surface for spec 044 (diagnostics & error taxonomy): - New ErrorPanel.vue component renders the structured diagnostic payload on the ServerDetail page whenever a warn/error severity diagnostic is attached. Severity badge, stable code, user_message, cause, and ordered fix_steps (link/command/button) are all shown. Destructive button fixes require a second "Execute" click and a browser confirm(); non-destructive ones run straight as dry_run. - Toasts report preview/failure text from POST /diagnostics/fix. - Typed API surface: Server gains error_code + diagnostic fields; contracts.Server grows matching Go fields; management.ListServers, the legacy converter, and server.GetAllServers all populate them from the stateview snapshot so every REST shape is consistent. - Per-server handler now JSON-normalizes the typed diagnostics.Code / diagnostics.Severity values so clients get plain strings. - Classifier gains a string-matching fallback for stdio spawn failures wrapped by intermediate layers (ENOENT/EACCES/handshake timeout) — the runtime currently string-wraps these, so typed-error matching alone was returning MCPX_UNKNOWN_UNCLASSIFIED for common failures. - classifyAndAttach now falls back to UnknownUnclassified instead of silently nil-ing the diagnostic when classification returns "". - Vitest spec covers code rendering, severity -> colour, dry-run fix dispatch, and null-diagnostic no-op. Verified end-to-end via claude-in-chrome: broken stdio server surfaces ErrorPanel with code MCPX_STDIO_SPAWN_ENOENT, copyable PATH probe command, docs link, and Preview (dry-run) button that POSTs to /api/v1/diagnostics/fix and fires a success toast. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces spec 044 diagnostics in the native macOS tray so users can see classified server failures from the menu bar without opening the web UI. - ServerStatus model gains `error_code` and a typed `DiagnosticPayload` (code, severity, cause, user_message, fix_steps, docs_url) mirroring the REST schema. Decodable from the additive /api/v1/servers payload. - AppState.worstDiagnosticSeverity aggregates the highest severity across enabled servers; AppState.serversWithDiagnostic lists the attention-worthy ones (warn/error only). - TrayIcon overlays a small coloured dot on the menu bar symbol: red for any `error` severity, orange for `warn`. No badge when every enabled server is clean — the plain monochrome icon is preserved for the majority case. - TrayMenu gains an "⚠ Fix issues (N)" group above the existing "Needs Attention" section. Each entry shows the server name, stable code (e.g. MCPX_STDIO_SPAWN_ENOENT), and a severity icon; clicking opens the server detail page where the web UI ErrorPanel renders the full fix_steps list. Build verified via swiftc; end-to-end tray visual verification with mcpproxy-ui-test was deferred — the dev bundle's auto-started core clashes with the prod mcpproxy daemon's BBolt lock on this workstation. The data contract is covered by the web UI ErrorPanel tests and the management-service round-trip tests from the previous commit, and the Swift code paths are straight conditional rendering keyed off the same fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the CLI surface for running registered diagnostics fixers and
scoping the regular doctor health check to a single server:
- New subcommand `mcpproxy doctor fix <CODE> --server <name>` resolves
the fixer_key from the diagnostics catalog automatically (picks the
first Button-type step; --fixer-key overrides when a code defines
several) and POSTs to /api/v1/diagnostics/fix via the existing
socket-backed cliclient. Defaults to dry_run; --execute applies
mutating changes.
- New `-o json` output for scripting / CI use; pretty output shows
outcome, preview, and whether the run was a dry_run.
- `mcpproxy doctor --server <name>` filters the standard health check
payload to a single server by trimming per-server arrays
(upstream_errors, oauth_required, oauth_issues, runtime_warnings,
missing_secrets.used_by, quarantine stats) and recomputing
total_issues so the pretty-print summary stays accurate.
- cliclient.InvokeDiagnosticFix wraps the fix endpoint and unwraps the
standard {success, data, request_id} envelope.
- doctor fix respects the root --data-dir flag when locating the
daemon socket so tests/staging can target non-default daemon
installs (the legacy doctor command previously ignored it).
- Unit tests cover catalog fixer_key resolution (first-button +
exact-match + missing), diagnostic-by-server filtering (including
missing_secrets.used_by), and quarantine stats filtering.
- docs/cli-management-commands.md documents both the new subcommand
and the --server flag.
Verified end-to-end against a local daemon with a deliberately-broken
stdio server: dry-run shows the stock fixer preview with outcome
"success"; -o json emits the structured envelope; --server trims the
main doctor output to just the targeted server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…osticError
Completes Phase D of spec 044 by giving producers in the OAUTH, DOCKER,
CONFIG, and QUARANTINE domains a way to attach a stable diagnostics
code to their terminal errors without a deep refactor, plus matching
free-text classifier fallbacks so legacy / vendor errors still land on
the right code.
- New `WrapError(code, err)` helper returns an error that satisfies
`interface{ Code() diagnostics.Code }`. Classify() inspects this
via errors.As as its top-priority fast path, so a call-site that
opts in bypasses every string-match fallback.
- Convenience shortcuts (WrapOAuthRefreshExpired, WrapDockerDaemonDown,
WrapConfigMissingSecret, WrapQuarantineToolChanged, etc.) live
alongside WrapError so callers don't have to import individual
code constants. Mirror wrappers are re-exported from internal/oauth
(WrapRefreshExpired / WrapRefresh403 / WrapDiscoveryFailed) for the
OAuth sentinel-error pattern already used there.
- Classifier gains string-match fallbacks for the four domains that
previously fell through to MCPX_UNKNOWN_UNCLASSIFIED:
OAUTH: refresh expired / 403 / discovery failed / callback timeout
/ redirect_uri mismatch
DOCKER: daemon down / snap-apparmor / no permission / image pull failed
CONFIG: deprecated field / parse error / missing secret
QUARANTINE: pending approval / tool changed (rug pull)
- runtime.RefreshOAuthToken now attributes terminal refresh outcomes
using the new wrappers, matching the "expired" / "403" /
"invalid_grant" / "no refresh token" patterns emitted by the
upstream manager + mcp-go. This is the first real producer-side
wrapping; more sites can opt in incrementally.
- 13 new golden-sample classifier tests cover every domain + the
typed-wrap fast path (including the nil-passthrough contract).
Kept the existing string-fallback matchers deliberately broad so that
error rewording in upstream/mcp-go doesn't regress classification; as
more producers opt into WrapError explicitly the fallbacks stop
running in practice.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regenerated oas/swagger.yaml + oas/docs.go to reflect the new contracts.Diagnostic + contracts.DiagnosticFixStep + Server.diagnostic + Server.error_code fields introduced in the earlier spec 044 commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…S, tray SKIPPED) Phase 1 (CLI) PASS: all 29 diagnostic codes registered, doctor --server reports correctly, doctor fix dry-run returns preview, --execute is rate-limited at 429 as the documented safety guard, classifier maps zsh:1:no-such-file errors to MCPX_STDIO_SPAWN_ENOENT. Phase 2 (Web UI) PASS: ErrorPanel renders full diagnostic payload (severity band, error code, user message, cause snippet, three fix-step rows with Copy + docs link + Preview dry-run button). Preview click fires POST /api/v1/diagnostics/fix -> 200. Phase 3 (macOS tray) SKIPPED: production tray running, cycling it mid- session was too risky. Needs post-merge visual confirmation. Three non-blocking findings captured: 1. doctor CLI has no HTTP fallback when enable_socket: false 2. Web UI Preview-button success is silent (no toast/inline render) 3. Fix endpoint body uses "mode" string, not boolean "dry_run" Production state untouched throughout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second-opinion critique from Gemini CLI (v0.38.1, model gemini-3.1-pro-preview) run in --yolo mode with read access to all 4 related repos (mcpproxy-go-diagnostics-taxonomy, mcpproxy-go-retention-telemetry, mcpproxy-telemetry, mcpproxy-dash). Findings: 4 P1, 4 P2, 2 P3. Notable P1s to triage: 1. ErrorPanel.vue hides Execute button on non-destructive steps — users can only click "Preview (dry-run)", making non-destructive fixes unreachable. 2. env_kind decision tree evaluates HasCIEnv before HasCloudIDEEnv, so GitHub Codespaces / Gitpod (which set CI=true) are misclassified as CI bots. 3. Activation funnel query reads server_configured from LATEST heartbeat — users who configured then deleted a server fall out of the lifetime funnel. 4. autostart.go caches a missing sidecar file as nil for a full hour, poisoning the critical first heartbeat on slow-tray startup races. Content is verbatim Gemini output; no human edits except a header block documenting invocation + model. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emini P1) Non-destructive fix steps of type=button previously rendered only a "Preview (dry-run)" button; there was no way to actually execute them from the UI. Restructure so: - Non-destructive: single primary Execute button, mode=execute. - Destructive: Preview (dry-run) + Execute (gated by window.confirm()). Updates vitest cases to cover both paths including the confirm() reject and accept branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d code) The 409 guard at diagnostics_fix.go was unreachable: mode was defaulted to dry_run before the check, so `mode == ModeExecute && body.Mode != ModeExecute` could never fire. Destructive fixers silently downgraded to dry_run when the client forgot the mode field, masking implementation errors. Move the check ahead of the default: - destructive + body.Mode == "" -> 409 (force explicit intent) - non-destructive + body.Mode == "" -> defaults to execute - explicit dry_run/execute -> proceed as before Adds diagnostics_fix_test.go with 4 subtests covering each path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
108aee3 to
6ea9ac7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stable
MCPX_*error-code catalog for mcpproxy with end-to-end surfacing: REST API, Vue web UI, macOS tray, CLI. 29 codes across 8 domains (STDIO, OAUTH, HTTP, DOCKER, CONFIG, QUARANTINE, NETWORK, UNKNOWN). Every code has a user-facing message, ≥1 fix step, and a docs page — completeness enforced in CI.Design:
docs/superpowers/specs/2026-04-24-diagnostics-error-taxonomy-design.mdSpec:
specs/044-diagnostics-taxonomy/Scope delivered vs design
DiagnosticErroron stateview + REST endpoints + fix endpoint rate-limitErrorPanel.vue4d92c82)6aa8305)doctor list-codes+doctor fix <CODE>+doctor --server4f6872c+7b81e03)diagnosticsobjectWhat ships
Backend (
internal/diagnostics/)MCPX_*codes; completeness test enforcesuser_message, ≥1fix_step,docs_url, severityWrapError(code, err)) + free-text fallbacks for every domainruntime.RefreshOAuthTokenwraps OAuth failuresREST
GET /api/v1/servers/{id}/diagnostics— per-server structured error + fix stepsPOST /api/v1/diagnostics/fix— 1/s per-(server,code) rate limit, dry-run default, 409 on destructive+execute without explicit mode/api/v1/serversresponse viacontracts.Server.diagnosticfieldWeb UI (
frontend/src/components/diagnostics/ErrorPanel.vue)link→ anchor,command→ copyable inline code,button→POST /diagnostics/fix(Preview =dry_run=true; Execute = click-to-confirm)ServerDetail.vuewheneverserver.diagnostic?.codeis setmacOS tray
error, orange if onlywarn)Sendablewarnings onUpdateService.swift)CLI
mcpproxy doctor list-codes— pretty +-o jsonoutput; 29 codesmcpproxy doctor fix <CODE> --server <name>— auto-resolvesfixer_keyfrom catalog;--executerequired for destructive; dry-run defaultmcpproxy doctor --server <name>— scope health check to one serverDocs
docs/errors/<CODE>.mdstubsscripts/check-errors-docs-links.sh— bidirectional link checker (CI-safe)Ground rules observed
--dry-run; 409 without--executeTwo bugs caught mid-implementation
contracts.Servertyped struct was missingdiagnostic/error_code, so/api/v1/serverssilently dropped them despiteruntime.GetAllServerspopulating them. Fixed inmanagement.ListServers+contracts.ConvertGenericServersToTyped+ a third map-based path ininternal/server/server.go:GetAllServers.diagnostics.Code/diagnostics.Severity(named strings) serialized correctly but failedstringtype assertions in the per-server handler. Fixed with a JSON round-trip normalization.Verification
go build ./...— PASSgo build -tags server ./cmd/mcpproxy— PASSgo test -race ./internal/diagnostics/...— PASSgo test -race ./internal/runtime/stateview/— PASSgo test -race ./internal/runtime/supervisor/— PASSgo test -race ./internal/httpapi/— PASSscripts/check-errors-docs-links.sh— OK (29 codes linked)scripts/test-diagnostics-e2e.sh— PASSKnown non-blockers
go test -race ./internal/server/times out at 300s — pre-existing, reproduced on base commit4f6872c, not caused by this PR.mcpproxy-ui-test) was skipped — the dev tray's embedded core collides with the running production tray on the BBolt lock. Data contracts covered by backend tests + Vue vitest.Follow-ups (separate PRs)
diagnosticssub-object (counts + fix outcomes) once PR feat(telemetry): schema v3 extension — env_kind, activation funnel, autostart (spec 044) #401 merges🤖 Generated with Claude Code