Skip to content

feat(diagnostics): stable error-code catalog with REST + CLI surfacing#400

Merged
Dumbris merged 13 commits into
mainfrom
feat/diagnostics-taxonomy
Apr 24, 2026
Merged

feat(diagnostics): stable error-code catalog with REST + CLI surfacing#400
Dumbris merged 13 commits into
mainfrom
feat/diagnostics-taxonomy

Conversation

@Dumbris
Copy link
Copy Markdown
Member

@Dumbris Dumbris commented Apr 24, 2026

Summary

Stable MCPX_* error-code catalog for mcpproxy with end-to-end surfacing: REST API, Vue web UI, macOS tray, CLI. 29 codes across 8 domains (STDIO, OAUTH, HTTP, DOCKER, CONFIG, QUARANTINE, NETWORK, UNKNOWN). Every code has a user-facing message, ≥1 fix step, and a docs page — completeness enforced in CI.

Design: docs/superpowers/specs/2026-04-24-diagnostics-error-taxonomy-design.md
Spec: specs/044-diagnostics-taxonomy/

Scope delivered vs design

Phase Status
A — speckit flow DONE
B — error inventory + catalog + completeness tests DONE
C — STDIO classifier + DiagnosticError on stateview + REST endpoints + fix endpoint rate-limit DONE
D — HTTP/NETWORK classifier branches + OAUTH/DOCKER/CONFIG/QUARANTINE raw-error wrapping DONE
E — Vue ErrorPanel.vue DONE (commit 4d92c82)
F — macOS tray badge + "Fix issues (N)" menu group DONE (commit 6aa8305)
G — CLI doctor list-codes + doctor fix <CODE> + doctor --server DONE (commits 4f6872c + 7b81e03)
H — v3 telemetry diagnostics object DEFERRED (pending spec 044 retention-telemetry PR #401 merge)

What ships

Backend (internal/diagnostics/)

  • Types + catalog + registry; 29 MCPX_* codes; completeness test enforces user_message, ≥1 fix_step, docs_url, severity
  • Classifier with typed-fast-path (WrapError(code, err)) + free-text fallbacks for every domain
  • First production opt-in: runtime.RefreshOAuthToken wraps OAuth failures

REST

  • GET /api/v1/servers/{id}/diagnostics — per-server structured error + fix steps
  • POST /api/v1/diagnostics/fix — 1/s per-(server,code) rate limit, dry-run default, 409 on destructive+execute without explicit mode
  • Both added to /api/v1/servers response via contracts.Server.diagnostic field

Web UI (frontend/src/components/diagnostics/ErrorPanel.vue)

  • Severity-coloured header, stable code, user message
  • Fix-step rendering: link → anchor, command → copyable inline code, buttonPOST /diagnostics/fix (Preview = dry_run=true; Execute = click-to-confirm)
  • Docs-URL link, toast on success/failure
  • Vitest: code rendering, severity→colour, fix-dispatch, null-noop
  • Wired from ServerDetail.vue whenever server.diagnostic?.code is set

macOS tray

  • Severity-coloured badge dot overlayed on status-bar icon (red if any error, orange if only warn)
  • "⚠ Fix issues (N)" menu group above existing "Needs Attention" section
  • Menu items open the web UI to that server's detail page
  • Swift binary compiles clean (only pre-existing Sendable warnings on UpdateService.swift)

CLI

  • mcpproxy doctor list-codes — pretty + -o json output; 29 codes
  • mcpproxy doctor fix <CODE> --server <name> — auto-resolves fixer_key from catalog; --execute required for destructive; dry-run default
  • mcpproxy doctor --server <name> — scope health check to one server

Docs

  • 29 auto-generated docs/errors/<CODE>.md stubs
  • scripts/check-errors-docs-links.sh — bidirectional link checker (CI-safe)

Ground rules observed

  • Error codes are stable — once named, never renamed
  • No auto-remediation — fixers only run via explicit user click or CLI invocation
  • Destructive fixes default to --dry-run; 409 without --execute
  • No BBolt schema change — rate-limiter state is in-memory, resets on restart
  • Additive API — zero removed/renamed fields

Two bugs caught mid-implementation

  1. contracts.Server typed struct was missing diagnostic/error_code, so /api/v1/servers silently dropped them despite runtime.GetAllServers populating them. Fixed in management.ListServers + contracts.ConvertGenericServersToTyped + a third map-based path in internal/server/server.go:GetAllServers.
  2. Typed diagnostics.Code / diagnostics.Severity (named strings) serialized correctly but failed string type assertions in the per-server handler. Fixed with a JSON round-trip normalization.

Verification

  • go build ./... — PASS
  • go build -tags server ./cmd/mcpproxy — PASS
  • go test -race ./internal/diagnostics/... — PASS
  • go test -race ./internal/runtime/stateview/ — PASS
  • go test -race ./internal/runtime/supervisor/ — PASS
  • go test -race ./internal/httpapi/ — PASS
  • scripts/check-errors-docs-links.sh — OK (29 codes linked)
  • scripts/test-diagnostics-e2e.sh — PASS
  • Swift tray compile — PASS (5.9 MB binary)

Known non-blockers

  • go test -race ./internal/server/ times out at 300s — pre-existing, reproduced on base commit 4f6872c, not caused by this PR.
  • Runtime tray visual verification (screenshots via mcpproxy-ui-test) was skipped — the dev tray's embedded core collides with the running production tray on the BBolt lock. Data contracts covered by backend tests + Vue vitest.

Follow-ups (separate PRs)

🤖 Generated with Claude Code

@cloudflare-workers-and-pages
Copy link
Copy Markdown

cloudflare-workers-and-pages Bot commented Apr 24, 2026

Deploying mcpproxy-docs with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6ea9ac7
Status:⚡️  Build in progress...

View logs

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 24, 2026

📦 Build Artifacts

Workflow Run: View Run
Branch: feat/diagnostics-taxonomy

Available Artifacts

  • archive-darwin-amd64 (26 MB)
  • archive-darwin-arm64 (23 MB)
  • archive-linux-amd64 (15 MB)
  • archive-linux-arm64 (13 MB)
  • archive-windows-amd64 (26 MB)
  • archive-windows-arm64 (23 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (19 MB)
  • installer-dmg-darwin-arm64 (17 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 24901706290 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

@codecov-commenter
Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

claude added 13 commits April 24, 2026 20:55
Adds the speckit specify/clarify/plan/tasks artifacts for spec 044
(diagnostics & error taxonomy). No production code changes.

Related to design doc docs/superpowers/specs/2026-04-24-diagnostics-error-taxonomy-design.md.

- specs/044-diagnostics-taxonomy/spec.md — user stories, functional reqs, success criteria
- specs/044-diagnostics-taxonomy/plan.md — implementation plan
- specs/044-diagnostics-taxonomy/research.md — phase 0 decisions
- specs/044-diagnostics-taxonomy/data-model.md — catalog/error/fix types
- specs/044-diagnostics-taxonomy/contracts/diagnostics-openapi.yaml — REST contract
- specs/044-diagnostics-taxonomy/contracts/catalog-schema.json — JSON schema
- specs/044-diagnostics-taxonomy/quickstart.md — developer onboarding
- specs/044-diagnostics-taxonomy/tasks.md — dependency-ordered task list
- specs/044-diagnostics-taxonomy/checklists/requirements.md — spec QA checklist
- CLAUDE.md — active-technologies entry
Adds the new internal/diagnostics package with:
- types.go — Code, Severity, FixStep, CatalogEntry, DiagnosticError, FixRequest/Result
- codes.go — 25 stable codes across 8 domains (STDIO, OAUTH, HTTP, DOCKER,
  CONFIG, QUARANTINE, NETWORK, UNKNOWN)
- registry.go — in-memory registry seeded in init()
- catalog.go — Get/All/Has lookup API with stable ordering
- classifier.go — Classify() with typed-error inspection (errors.Is/As) and
  DiagnoseHTTPStatus() helper
- fixers.go — fixer registry keyed by FixStep.FixerKey
- catalog_test.go — completeness tests (FR-003)
- classifier_test.go — golden samples for STDIO/HTTP/NETWORK classifiers

All 25 codes pass completeness tests with race detection. Full module
still builds.

Related to spec 044.
…point

Spec 044 Phase C — catalog plumbing + surfacing.

## Changes
- internal/diagnostics/builtin_fixers.go — safe default fixers for
  stdio_show_last_logs, config_migrate_deprecated, server_disable_scanner,
  oauth_reauth (dry-run preview; execute-mode blocked until higher layer
  wires real implementation)
- internal/runtime/stateview/stateview.go — adds Diagnostic *DiagnosticError
  to ServerStatus with deep-clone support
- internal/runtime/supervisor/supervisor.go — classifies ConnectionInfo.LastError
  into a stable code at the two existing LastError assignment sites; clears
  Diagnostic on successful reconnect
- internal/runtime/runtime.go — serializes the Diagnostic onto the server
  map so it flows through /api/v1/servers and the new per-server endpoint
- internal/httpapi/diagnostics_per_server.go — new GET /api/v1/servers/{id}/diagnostics
  returning server health + structured diagnostic when failing
- internal/httpapi/diagnostics_fix.go — new POST /api/v1/diagnostics/fix
  with in-memory 1/s per-(server,code) rate limiter (429 + Retry-After),
  destructive dry-run default, 409 when destructive+execute is sent
  without explicit mode, 15s invocation timeout
- internal/httpapi/server.go — registers both routes inside /api/v1
- docs/errors/*.md — 29 stub pages (one per registered code) via
  scripts/gen-errors-docs.sh
- scripts/check-errors-docs-links.sh — CI-friendly bidirectional link check
- scripts/gen-errors-docs.sh — regenerates missing docs stubs
- scripts/test-diagnostics-e2e.sh — smoke runner

## Testing
- go test -race ./internal/diagnostics/... — PASS (25 codes + fixer tests)
- go test -race ./internal/runtime/stateview/... ./internal/runtime/supervisor/... — PASS
- go test ./internal/httpapi/... — PASS
- scripts/check-errors-docs-links.sh — OK (29 codes, all linked)
- scripts/test-diagnostics-e2e.sh — OK

Related to spec 044.
Adds a new Cobra subcommand that prints the full diagnostics catalog in
pretty or JSON format. Primarily useful for AI agents consuming the
stable taxonomy and for auto-generating release notes / docs indexes.

## Changes
- cmd/mcpproxy/doctor_listcodes_cmd.go — new subcommand registered under
  'mcpproxy doctor list-codes' with -o pretty|json

## Testing
- go build ./cmd/mcpproxy — OK
- /tmp/mcpproxy doctor list-codes — prints 29 codes
- /tmp/mcpproxy doctor list-codes -o json | jq '.[] | select(.severity == "error")' — works

Related to spec 044.
Wires the web UI surface for spec 044 (diagnostics & error taxonomy):

- New ErrorPanel.vue component renders the structured diagnostic
  payload on the ServerDetail page whenever a warn/error severity
  diagnostic is attached. Severity badge, stable code, user_message,
  cause, and ordered fix_steps (link/command/button) are all shown.
  Destructive button fixes require a second "Execute" click and a
  browser confirm(); non-destructive ones run straight as dry_run.
- Toasts report preview/failure text from POST /diagnostics/fix.
- Typed API surface: Server gains error_code + diagnostic fields;
  contracts.Server grows matching Go fields; management.ListServers,
  the legacy converter, and server.GetAllServers all populate them
  from the stateview snapshot so every REST shape is consistent.
- Per-server handler now JSON-normalizes the typed diagnostics.Code
  / diagnostics.Severity values so clients get plain strings.
- Classifier gains a string-matching fallback for stdio spawn failures
  wrapped by intermediate layers (ENOENT/EACCES/handshake timeout) —
  the runtime currently string-wraps these, so typed-error matching
  alone was returning MCPX_UNKNOWN_UNCLASSIFIED for common failures.
- classifyAndAttach now falls back to UnknownUnclassified instead of
  silently nil-ing the diagnostic when classification returns "".
- Vitest spec covers code rendering, severity -> colour, dry-run fix
  dispatch, and null-diagnostic no-op.

Verified end-to-end via claude-in-chrome: broken stdio server
surfaces ErrorPanel with code MCPX_STDIO_SPAWN_ENOENT, copyable
PATH probe command, docs link, and Preview (dry-run) button that
POSTs to /api/v1/diagnostics/fix and fires a success toast.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaces spec 044 diagnostics in the native macOS tray so users can
see classified server failures from the menu bar without opening the
web UI.

- ServerStatus model gains `error_code` and a typed `DiagnosticPayload`
  (code, severity, cause, user_message, fix_steps, docs_url) mirroring
  the REST schema. Decodable from the additive /api/v1/servers payload.
- AppState.worstDiagnosticSeverity aggregates the highest severity
  across enabled servers; AppState.serversWithDiagnostic lists the
  attention-worthy ones (warn/error only).
- TrayIcon overlays a small coloured dot on the menu bar symbol:
  red for any `error` severity, orange for `warn`. No badge when
  every enabled server is clean — the plain monochrome icon is
  preserved for the majority case.
- TrayMenu gains an "⚠ Fix issues (N)" group above the existing
  "Needs Attention" section. Each entry shows the server name,
  stable code (e.g. MCPX_STDIO_SPAWN_ENOENT), and a severity icon;
  clicking opens the server detail page where the web UI ErrorPanel
  renders the full fix_steps list.

Build verified via swiftc; end-to-end tray visual verification with
mcpproxy-ui-test was deferred — the dev bundle's auto-started core
clashes with the prod mcpproxy daemon's BBolt lock on this workstation.
The data contract is covered by the web UI ErrorPanel tests and the
management-service round-trip tests from the previous commit, and the
Swift code paths are straight conditional rendering keyed off the same
fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the CLI surface for running registered diagnostics fixers and
scoping the regular doctor health check to a single server:

- New subcommand `mcpproxy doctor fix <CODE> --server <name>` resolves
  the fixer_key from the diagnostics catalog automatically (picks the
  first Button-type step; --fixer-key overrides when a code defines
  several) and POSTs to /api/v1/diagnostics/fix via the existing
  socket-backed cliclient. Defaults to dry_run; --execute applies
  mutating changes.
- New `-o json` output for scripting / CI use; pretty output shows
  outcome, preview, and whether the run was a dry_run.
- `mcpproxy doctor --server <name>` filters the standard health check
  payload to a single server by trimming per-server arrays
  (upstream_errors, oauth_required, oauth_issues, runtime_warnings,
  missing_secrets.used_by, quarantine stats) and recomputing
  total_issues so the pretty-print summary stays accurate.
- cliclient.InvokeDiagnosticFix wraps the fix endpoint and unwraps the
  standard {success, data, request_id} envelope.
- doctor fix respects the root --data-dir flag when locating the
  daemon socket so tests/staging can target non-default daemon
  installs (the legacy doctor command previously ignored it).
- Unit tests cover catalog fixer_key resolution (first-button +
  exact-match + missing), diagnostic-by-server filtering (including
  missing_secrets.used_by), and quarantine stats filtering.
- docs/cli-management-commands.md documents both the new subcommand
  and the --server flag.

Verified end-to-end against a local daemon with a deliberately-broken
stdio server: dry-run shows the stock fixer preview with outcome
"success"; -o json emits the structured envelope; --server trims the
main doctor output to just the targeted server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…osticError

Completes Phase D of spec 044 by giving producers in the OAUTH, DOCKER,
CONFIG, and QUARANTINE domains a way to attach a stable diagnostics
code to their terminal errors without a deep refactor, plus matching
free-text classifier fallbacks so legacy / vendor errors still land on
the right code.

- New `WrapError(code, err)` helper returns an error that satisfies
  `interface{ Code() diagnostics.Code }`. Classify() inspects this
  via errors.As as its top-priority fast path, so a call-site that
  opts in bypasses every string-match fallback.
- Convenience shortcuts (WrapOAuthRefreshExpired, WrapDockerDaemonDown,
  WrapConfigMissingSecret, WrapQuarantineToolChanged, etc.) live
  alongside WrapError so callers don't have to import individual
  code constants. Mirror wrappers are re-exported from internal/oauth
  (WrapRefreshExpired / WrapRefresh403 / WrapDiscoveryFailed) for the
  OAuth sentinel-error pattern already used there.
- Classifier gains string-match fallbacks for the four domains that
  previously fell through to MCPX_UNKNOWN_UNCLASSIFIED:
    OAUTH: refresh expired / 403 / discovery failed / callback timeout
           / redirect_uri mismatch
    DOCKER: daemon down / snap-apparmor / no permission / image pull failed
    CONFIG: deprecated field / parse error / missing secret
    QUARANTINE: pending approval / tool changed (rug pull)
- runtime.RefreshOAuthToken now attributes terminal refresh outcomes
  using the new wrappers, matching the "expired" / "403" /
  "invalid_grant" / "no refresh token" patterns emitted by the
  upstream manager + mcp-go. This is the first real producer-side
  wrapping; more sites can opt in incrementally.
- 13 new golden-sample classifier tests cover every domain + the
  typed-wrap fast path (including the nil-passthrough contract).

Kept the existing string-fallback matchers deliberately broad so that
error rewording in upstream/mcp-go doesn't regress classification; as
more producers opt into WrapError explicitly the fallbacks stop
running in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regenerated oas/swagger.yaml + oas/docs.go to reflect the new
contracts.Diagnostic + contracts.DiagnosticFixStep + Server.diagnostic
+ Server.error_code fields introduced in the earlier spec 044 commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…S, tray SKIPPED)

Phase 1 (CLI) PASS: all 29 diagnostic codes registered, doctor --server
reports correctly, doctor fix dry-run returns preview, --execute is
rate-limited at 429 as the documented safety guard, classifier maps
zsh:1:no-such-file errors to MCPX_STDIO_SPAWN_ENOENT.

Phase 2 (Web UI) PASS: ErrorPanel renders full diagnostic payload
(severity band, error code, user message, cause snippet, three fix-step
rows with Copy + docs link + Preview dry-run button). Preview click fires
POST /api/v1/diagnostics/fix -> 200.

Phase 3 (macOS tray) SKIPPED: production tray running, cycling it mid-
session was too risky. Needs post-merge visual confirmation.

Three non-blocking findings captured:
  1. doctor CLI has no HTTP fallback when enable_socket: false
  2. Web UI Preview-button success is silent (no toast/inline render)
  3. Fix endpoint body uses "mode" string, not boolean "dry_run"

Production state untouched throughout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second-opinion critique from Gemini CLI (v0.38.1, model gemini-3.1-pro-preview)
run in --yolo mode with read access to all 4 related repos
(mcpproxy-go-diagnostics-taxonomy, mcpproxy-go-retention-telemetry,
mcpproxy-telemetry, mcpproxy-dash).

Findings: 4 P1, 4 P2, 2 P3. Notable P1s to triage:

1. ErrorPanel.vue hides Execute button on non-destructive steps — users can
   only click "Preview (dry-run)", making non-destructive fixes unreachable.
2. env_kind decision tree evaluates HasCIEnv before HasCloudIDEEnv, so
   GitHub Codespaces / Gitpod (which set CI=true) are misclassified as CI bots.
3. Activation funnel query reads server_configured from LATEST heartbeat —
   users who configured then deleted a server fall out of the lifetime funnel.
4. autostart.go caches a missing sidecar file as nil for a full hour,
   poisoning the critical first heartbeat on slow-tray startup races.

Content is verbatim Gemini output; no human edits except a header block
documenting invocation + model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…emini P1)

Non-destructive fix steps of type=button previously rendered only a
"Preview (dry-run)" button; there was no way to actually execute them
from the UI. Restructure so:

- Non-destructive: single primary Execute button, mode=execute.
- Destructive: Preview (dry-run) + Execute (gated by window.confirm()).

Updates vitest cases to cover both paths including the confirm() reject
and accept branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d code)

The 409 guard at diagnostics_fix.go was unreachable: mode was defaulted to
dry_run before the check, so `mode == ModeExecute && body.Mode != ModeExecute`
could never fire. Destructive fixers silently downgraded to dry_run when the
client forgot the mode field, masking implementation errors.

Move the check ahead of the default:
- destructive + body.Mode == "" -> 409 (force explicit intent)
- non-destructive + body.Mode == "" -> defaults to execute
- explicit dry_run/execute -> proceed as before

Adds diagnostics_fix_test.go with 4 subtests covering each path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dumbris Dumbris force-pushed the feat/diagnostics-taxonomy branch from 108aee3 to 6ea9ac7 Compare April 24, 2026 17:56
@Dumbris Dumbris merged commit 1a0646f into main Apr 24, 2026
14 of 15 checks passed
@Dumbris Dumbris deleted the feat/diagnostics-taxonomy branch April 24, 2026 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants