Observability

sarmakska edited this page May 3, 2026 · 2 revisions

Three layers, all on by default.

Structured logs

Every tool call logs as JSON via structlog:

{"timestamp": "2026-05-03T12:34:56", "level": "info", "event": "tool_call", "tool": "search_docs", "duration_ms": 142, "user": "anon"}

Pipe stdout to your log aggregator (BetterStack, Axiom, Datadog, Loki).
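Before wiring up an aggregator, it can help to sanity-check the log stream locally. The sketch below is one possible way to do that: it parses JSON lines shaped like the sample above and summarises `tool_call` durations per tool. Only the field names (`event`, `tool`, `duration_ms`) come from the sample line; the function name `summarise_tool_calls` and the second and third sample lines are hypothetical.

```python
import json
from collections import defaultdict


def summarise_tool_calls(lines):
    """Summarise tool_call events as {tool: {"calls", "max_ms", "mean_ms"}}."""
    durations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON output mixed into stdout
        if not isinstance(record, dict) or record.get("event") != "tool_call":
            continue
        duration = record.get("duration_ms")
        if isinstance(duration, (int, float)):
            durations[record.get("tool", "unknown")].append(duration)
    return {
        tool: {"calls": len(v), "max_ms": max(v), "mean_ms": sum(v) / len(v)}
        for tool, v in durations.items()
    }


# First line is the sample from this page; the other two are made up for illustration.
sample = [
    '{"timestamp": "2026-05-03T12:34:56", "level": "info", "event": "tool_call", '
    '"tool": "search_docs", "duration_ms": 142, "user": "anon"}',
    '{"timestamp": "2026-05-03T12:35:10", "level": "info", "event": "tool_call", '
    '"tool": "search_docs", "duration_ms": 58, "user": "anon"}',
    "plain text line that is not JSON",
]
summary = summarise_tool_calls(sample)
print(json.dumps(summary, indent=2))
```

In practice the same function could be fed `sys.stdin` when the server's stdout is piped into it.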

OpenTelemetry traces

Set MCP_OTEL_ENDPOINT to your collector:

MCP_OTEL_ENDPOINT=https://otel.your-domain.com:4317

Each tool call gets a span. Spans include the tool name, duration, success/failure, and any custom attributes you add via tracer.start_span().

Health endpoint

GET /health returns:

{"ok": true, "tools_registered": 7, "uptime_seconds": 12345}

Use it as a readiness probe in Kubernetes or as an uptime check from your monitoring service.
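For an external uptime check, a minimal probe might look like the following sketch. It assumes the response body matches the example payload above. The helper names `is_healthy` and `check_health`, the `min_tools` parameter, and the base URL in the usage comment are hypothetical and not part of the server.

```python
import json
import urllib.request


def is_healthy(payload, min_tools=1):
    """True if the payload reports ok and at least min_tools registered tools."""
    if not isinstance(payload, dict):
        return False
    return payload.get("ok") is True and payload.get("tools_registered", 0) >= min_tools


def check_health(base_url, timeout=2.0):
    """Fetch <base_url>/health; connection and parse errors count as unhealthy."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return is_healthy(json.loads(resp.read().decode("utf-8")))
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; bad JSON raises ValueError.
        return False


# Usage (hypothetical address): check_health("http://localhost:8000")
print(is_healthy({"ok": True, "tools_registered": 7, "uptime_seconds": 12345}))
```

Requiring at least one registered tool is a design choice for this sketch: a server that answers but has no tools loaded is treated as not ready.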

What to alert on

| Signal | Threshold | Why |
| --- | --- | --- |
| 5xx rate | > 1% over 5 min | Internal errors |
| Tool latency | P99 > 5s | Downstream slowness |
| Auth failure rate | > 10/min | Possible attack |
| Memory growth | sustained over 1h | Likely leak |
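Most teams will express these thresholds in their monitoring tool's own rule language. As a rough illustration of the arithmetic behind the first three rows, here is a sketch that evaluates them over one window of samples. All function names are hypothetical, collecting the 5-minute window is left to the caller, P99 uses the nearest-rank method (an assumption, since other percentile definitions exist), and memory growth is omitted because it needs a longer trend than a single window.

```python
def error_rate(status_codes):
    """Fraction of responses with a 5xx status; 0.0 for an empty window."""
    if not status_codes:
        return 0.0
    return sum(1 for s in status_codes if 500 <= s <= 599) / len(status_codes)


def p99(latencies_s):
    """Nearest-rank 99th percentile in seconds; None for an empty window."""
    if not latencies_s:
        return None
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) using integer math
    return ordered[rank - 1]


def alerts(status_codes, latencies_s, auth_failures_per_min):
    """Return the names of the signals whose thresholds are exceeded."""
    fired = []
    if error_rate(status_codes) > 0.01:  # > 1% of responses are 5xx
        fired.append("5xx rate")
    tail = p99(latencies_s)
    if tail is not None and tail > 5.0:  # P99 > 5s
        fired.append("tool latency")
    if auth_failures_per_min > 10:  # > 10/min
        fired.append("auth failure rate")
    return fired


# 2% errors trips the 5xx rule; one slow call out of 100 does not move P99.
print(alerts([200] * 98 + [500] * 2, [0.1] * 99 + [6.0], 0))
```

The last line also shows why P99 fits the "What to ignore" guidance: a single slow call in a window of 100 does not trip the latency alert.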

What to ignore

  • Single tool calls timing out occasionally. The retry on the client side handles it.
  • 401s when nobody is in the office and someone is fuzzing your URL.
  • Cold-start latency on the first request after a deploy.
