Observability

sarmakska edited this page May 31, 2026 · 2 revisions

Observability comes in two layers: structured logs, which are always on, and span export, which you opt into.

Structured logs

The server logs through structlog as JSON, written to stderr so the stdio transport keeps stdout reserved for JSON-RPC. Every tool call emits a tool_call event with the tool name and duration; failures emit tool_call_failed with the error:

{"timestamp": "2026-05-31T12:34:56", "level": "info", "event": "tool_call", "tool": "list_files", "duration_ms": 1.42}

Pipe stderr to your log aggregator.
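If your aggregator does not parse JSON natively, a small filter can turn the log stream into per-tool counters. The following is a minimal sketch, assuming one JSON object per line with the event, tool, and duration_ms fields shown above. The function name summarize is hypothetical, and it treats tool_call and tool_call_failed as separate events, which is an assumption about how the server emits them.

```python
import json
from collections import defaultdict


def summarize(lines):
    """Aggregate tool_call / tool_call_failed events from JSON log lines."""
    stats = defaultdict(lambda: {"calls": 0, "failures": 0, "total_ms": 0.0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(record, dict):
            continue
        event = record.get("event")
        tool = record.get("tool")
        if tool is None or event not in ("tool_call", "tool_call_failed"):
            continue
        entry = stats[tool]
        if event == "tool_call":
            entry["calls"] += 1
            entry["total_ms"] += float(record.get("duration_ms", 0.0))
        else:
            # Assumption: failures arrive as their own event, counted separately.
            entry["failures"] += 1
    return dict(stats)
```

Passing sys.stdin to summarize after redirecting the server's stderr into the script gives a running picture of call counts and total time per tool.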

OpenTelemetry span export

Span export is opt-in. Set an OTLP collector endpoint and every tool call is exported as a span:

MCP_OTEL_ENDPOINT=https://otel.your-domain.com:4317
MCP_SERVICE_NAME=mcp-server-toolkit

Each span is named tool.<name> and carries mcp.tool.name, mcp.tool.argument_count, mcp.tool.duration_ms, and mcp.tool.error when the handler raises. The span is created in Registry.call, so it wraps validation and execution and records exceptions. When MCP_OTEL_ENDPOINT is unset, tracing is a no-op and only the structured logs flow.
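To make the attribute lifecycle concrete, here is a minimal sketch of when each attribute would be set around a handler call. It is a stand-in that collects attributes in a plain dict; it does not use the OpenTelemetry SDK and is not the toolkit's Registry.call. The helper name tool_span is hypothetical, and storing str(exc) in mcp.tool.error is an assumption about the attribute's value.

```python
import time
from contextlib import contextmanager


@contextmanager
def tool_span(tool_name, arguments):
    """Stand-in for a span: collects the attributes described above in a dict."""
    span = {
        "name": f"tool.{tool_name}",
        "attributes": {
            "mcp.tool.name": tool_name,
            "mcp.tool.argument_count": len(arguments),
        },
    }
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        # Set only when the handler raises; the exception still propagates.
        # Assumption: the attribute holds the exception message.
        span["attributes"]["mcp.tool.error"] = str(exc)
        raise
    finally:
        # Duration is recorded on both the success and the failure path.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        span["attributes"]["mcp.tool.duration_ms"] = elapsed_ms
```

Wrapping both validation and execution inside the with block mirrors the description above: a validation error ends up on the span just like a handler error does.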

Health endpoint

GET /health is always open and returns:

{"ok": true, "tools_registered": 6, "uptime_seconds": 12345.6}

Use it as a readiness or liveness probe; the container image already wires it into a Docker HEALTHCHECK.
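For environments without a built-in HTTP probe, a small script can do the same check. This is a minimal sketch using only the standard library; the function names is_healthy and probe, the base URL you pass in, and the 2-second timeout are illustrative assumptions, not part of the toolkit.

```python
import json
import urllib.request


def is_healthy(body):
    """Return True when a /health response body reports ok: true."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        return False  # not JSON, or not a str/bytes body
    return isinstance(payload, dict) and payload.get("ok") is True


def probe(base_url, timeout=2.0):
    """Fetch <base_url>/health and report whether the server is healthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200 and is_healthy(resp.read())
    except OSError:
        # Covers refused connections, timeouts, and HTTP error statuses,
        # since urllib's URLError and HTTPError both derive from OSError.
        return False
```

A probe wrapper would typically exit non-zero when probe returns False, so the orchestrator restarts or stops routing to the instance.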

What to alert on

| Signal | Threshold | Why |
| --- | --- | --- |
| 5xx rate | > 1% over 5 min | Internal errors |
| Tool latency P99 (mcp.tool.duration_ms) | > 5s | Downstream slowness |
| 401 rate | sustained spike | Possible credential attack |
| 429 rate | sustained | Rate limit too tight or abuse |
| Memory growth | sustained over 1h | Likely leak |
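The two numeric thresholds can be evaluated with a few lines of code if your monitoring stack does not provide them directly. The sketch below assumes you already have the HTTP status codes and tool durations for one 5-minute window; the function names p99 and check_window are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one your metrics backend uses.

```python
def p99(durations_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(durations_ms)
    n = len(ordered)
    rank = (99 * n + 99) // 100  # ceil(0.99 * n) in integer arithmetic, 1-based
    return ordered[rank - 1]


def check_window(status_codes, durations_ms):
    """Return the alerts that fire for one window of observations."""
    alerts = []
    if status_codes:
        errors = sum(1 for code in status_codes if 500 <= code <= 599)
        if errors / len(status_codes) > 0.01:  # 5xx rate above 1%
            alerts.append("5xx rate")
    if durations_ms and p99(durations_ms) > 5000.0:  # P99 above 5 s
        alerts.append("tool latency p99")
    return alerts
```

The 401, 429, and memory rows describe sustained trends rather than fixed numbers, so they are better expressed as rate-of-change rules in the alerting system itself.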

What to ignore

  • Occasional single tool timeouts; the client retries.
  • 401s from idle fuzzing of a public URL.
  • Cold-start latency on the first request after a deploy.