feat(apm): wire OTel SDK + SigNoz exporter #245

Merged
steventohme merged 3 commits into main from steven/router-apm-signoz
May 25, 2026

@steventohme
Collaborator

Summary

The router emits OTel decision spans today via a custom OTLP/HTTP emitter (internal/observability/otel), but it doesn't use the OTel SDK — no TracerProvider, no MeterProvider, no gin instrumentation. That means HTTP request traces and Go runtime metrics never reach SigNoz at https://apm.app.workweave.ai, so the router doesn't appear in the same dashboards as the rest of Weave.

This PR adds internal/observability/apm, modeled exactly on backend/internal/app/telemetry/otel.go, so the router publishes the standard service-level observability that the SigNoz APM view expects.

What's new

  • internal/observability/apm/apm.go — OTel SDK wiring:
    • sdktrace.TracerProvider + sdkmetric.MeterProvider
    • OTLP/gRPC exporters (otlptracegrpc, otlpmetricgrpc)
    • W3C TraceContext + Baggage propagators
    • Resource attributes: service.name=router, service.version, deployment.environment (matches backend pattern)
  • apm.Middleware() — wraps otelgin.Middleware so every HTTP request is a span with method/route/status. /health and /validate excluded.
  • apm.Init() / apm.Shutdown() — idempotent boot + graceful flush in cmd/router/main.go.
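The /health and /validate exclusion can be pictured as a small path predicate. The following is a minimal sketch and not the PR's actual code: `tracePath`, `shouldTrace`, `skippedPaths`, and the `/v1/route` path are hypothetical names chosen for illustration. otelgin exposes a `WithFilter` option that takes predicates of the `func(*http.Request) bool` shape, but whether apm.Middleware() uses that option is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// skippedPaths lists the endpoints the PR description excludes from tracing.
var skippedPaths = map[string]struct{}{
	"/health":   {},
	"/validate": {},
}

// tracePath reports whether a request path should produce a span.
func tracePath(path string) bool {
	_, skip := skippedPaths[path]
	return !skip
}

// shouldTrace adapts tracePath to the func(*http.Request) bool shape
// that request filters usually take.
func shouldTrace(r *http.Request) bool {
	return tracePath(r.URL.Path)
}

func main() {
	// "/v1/route" is a made-up application path used only for the demo.
	for _, p := range []string{"/health", "/validate", "/v1/route"} {
		r, err := http.NewRequest(http.MethodGet, "http://router.local"+p, nil)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-10s traced=%v\n", p, shouldTrace(r))
	}
}
```

Keeping the predicate on the raw path string makes it easy to unit test without constructing requests.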

Config

  • WV_APM_OTLP_ENDPOINT — host:port of the SigNoz OTLP/gRPC collector. Unset = no-op (no behavior change for existing deploys).
  • WV_APM_OTLP_INSECURE — defaults to true (internal collector); set false if pointing at apm.app.workweave.ai directly during local testing.
  • ROUTER_DEPLOYMENT_ENV, ROUTER_VERSION — surface in resource attributes; fall back to ENV / unknown.
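As a rough illustration of how these variables might be read, here is a minimal sketch using only the standard library. The `apmConfig` type and the `loadAPMConfig` and `firstNonEmpty` helpers are hypothetical, not the PR's code. Two assumptions are baked in: the fallback chain is read as ROUTER_DEPLOYMENT_ENV, then ENV, then "unknown" (with ROUTER_VERSION falling back to "unknown"), and the insecure flag defaults to false, which is the default a later commit in this PR settles on.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// apmConfig is a hypothetical container for the settings described above.
type apmConfig struct {
	Endpoint string // host:port of the OTLP/gRPC collector; empty disables APM
	Insecure bool   // plaintext gRPC when true
	Env      string // deployment.environment resource attribute
	Version  string // service.version resource attribute
}

// Enabled reports whether APM should be wired at all.
func (c apmConfig) Enabled() bool { return c.Endpoint != "" }

// firstNonEmpty returns the first non-empty value among keys, else fallback.
func firstNonEmpty(lookup func(string) (string, bool), fallback string, keys ...string) string {
	for _, k := range keys {
		if v, ok := lookup(k); ok && v != "" {
			return v
		}
	}
	return fallback
}

// loadAPMConfig reads the variables through lookup so tests can inject a map.
func loadAPMConfig(lookup func(string) (string, bool)) apmConfig {
	cfg := apmConfig{
		Endpoint: firstNonEmpty(lookup, "", "WV_APM_OTLP_ENDPOINT"),
		Env:      firstNonEmpty(lookup, "unknown", "ROUTER_DEPLOYMENT_ENV", "ENV"),
		Version:  firstNonEmpty(lookup, "unknown", "ROUTER_VERSION"),
	}
	// Assumption: an unset or unparsable value leaves TLS on.
	if raw, ok := lookup("WV_APM_OTLP_INSECURE"); ok {
		if b, err := strconv.ParseBool(raw); err == nil {
			cfg.Insecure = b
		}
	}
	return cfg
}

func main() {
	cfg := loadAPMConfig(os.LookupEnv)
	if !cfg.Enabled() {
		fmt.Println("APM disabled: WV_APM_OTLP_ENDPOINT is unset")
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling os.LookupEnv directly, keeps the no-op path (endpoint unset) testable without touching the process environment.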

Coexistence with existing custom emitter

The existing custom OTLP/HTTP emitter on OTEL_EXPORTER_OTLP_ENDPOINT is untouched. It keeps publishing the per-decision spans (router.decision, router.cache_hit, etc.) that aren't worth duplicating through the SDK. The new SDK layer publishes:

  • HTTP server spans (one per request) via otelgin
  • Go runtime metrics (goroutines, heap, GC) via the SDK's auto-instrumented meter

They're independent pipelines pointed at potentially different endpoints, so neither blocks the other.

Test plan

  • go build ./... clean
  • go test ./... clean across all packages
  • Set WV_APM_OTLP_ENDPOINT in router Cloud Run staging; confirm router service appears in apm.app.workweave.ai with HTTP spans + runtime metrics
  • Confirm existing OTEL_EXPORTER_OTLP_ENDPOINT deployments still emit decision spans unchanged

🤖 Generated with Claude Code

…app.workweave.ai

Adds internal/observability/apm mirroring backend/internal/app/telemetry: SDK
tracer + meter providers, OTLP/gRPC exporters, otelgin middleware, resource
attributes matching the rest of Weave (service.name=router, deployment.env,
service.version).

Existing custom HTTP emitter (internal/observability/otel) keeps emitting the
per-decision spans on OTEL_EXPORTER_OTLP_ENDPOINT; this is a separate pipeline
driven by WV_APM_OTLP_ENDPOINT so existing deployments don't change behavior
until the env var is set.
@steventohme steventohme changed the title from "feat(apm): wire OTel SDK + SigNoz so the router shows up in apm.app.workweave.ai" to "feat(apm): wire OTel SDK + SigNoz exporter" on May 25, 2026
Comment thread cmd/router/main.go Outdated
Comment thread internal/observability/apm/apm.go
Comment thread internal/observability/apm/apm.go Outdated
- Move apm.Shutdown() into graceful path with explicit 1.5s budget so SDK
  flush actually runs before Cloud Run SIGKILL (was deferred — never ran
  because srv.Shutdown + emitter.Shutdown consumed the full 10s window).
  Trimmed srv.Shutdown from 8s→6s and emitter.Shutdown from 2s→1.5s.
- Register otelruntime instrumentation after MeterProvider is set so
  goroutine / heap / GC / cgo metrics actually publish to SigNoz.
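The budget arithmetic in the first bullet (6s + 1.5s + 1.5s = 9s, inside the 10s window the commit mentions) can be sketched as a sequence of steps that each get their own deadline. This is an illustrative sketch only: `step`, `routerPlan`, `totalBudget`, `runShutdown`, and `blockUntilDone` are hypothetical names, and the real shutdown calls are replaced by plain functions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// step is one shutdown action with its own time budget.
type step struct {
	name   string
	budget time.Duration
	run    func(ctx context.Context) error
}

// routerPlan mirrors the budgets named in the commit message.
func routerPlan(srv, emitter, apm func(context.Context) error) []step {
	return []step{
		{"srv.Shutdown", 6 * time.Second, srv},
		{"emitter.Shutdown", 1500 * time.Millisecond, emitter},
		{"apm.Shutdown", 1500 * time.Millisecond, apm},
	}
}

// totalBudget sums the per-step budgets.
func totalBudget(steps []step) time.Duration {
	var sum time.Duration
	for _, s := range steps {
		sum += s.budget
	}
	return sum
}

// runShutdown runs the steps in order, each under its own deadline, and
// returns the names of steps that returned an error (including timeouts).
// A budget only bounds a step that honors its context.
func runShutdown(steps []step) []string {
	var failed []string
	for _, s := range steps {
		if s.run == nil {
			continue // nothing to do for this step
		}
		ctx, cancel := context.WithTimeout(context.Background(), s.budget)
		err := s.run(ctx)
		cancel()
		if err != nil {
			failed = append(failed, s.name)
		}
	}
	return failed
}

// blockUntilDone simulates a step that overruns: it returns only when its
// deadline fires.
func blockUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	noop := func(context.Context) error { return nil }
	fmt.Println("total budget:", totalBudget(routerPlan(noop, noop, noop)))

	// Even when an earlier step exhausts its budget, the later flush still runs.
	demo := []step{
		{"slow", 10 * time.Millisecond, blockUntilDone},
		{"apm.Shutdown", 10 * time.Millisecond, noop},
	}
	fmt.Println("failed:", runShutdown(demo))
}
```

Giving each step a separate context, instead of one shared deadline, is what keeps an overrunning early step from starving the final flush.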

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6a44873.

Comment thread cmd/router/main.go
- TLS by default for APM OTLP transport (WV_APM_OTLP_INSECURE now defaults
  to false). Operators opt into plaintext gRPC explicitly when the
  collector is on a trusted internal network. Closes the production-path
  TLS bypass that allowed on-path interception of span attributes.
- Flush APM in the serverErr branch too, so a ListenAndServe failure
  doesn't drop the SDK traces + metrics describing the failure itself.
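The second bullet amounts to calling the flush on both exit paths of the main wait loop. Below is a minimal sketch of that shape, assuming a `select` over a server-error channel and a stop signal. `waitForExit` and `errSimulated` are hypothetical, and the flush callback stands in for apm.Shutdown().

```go
package main

import (
	"errors"
	"fmt"
)

// errSimulated stands in for whatever ListenAndServe would return.
var errSimulated = errors.New("simulated ListenAndServe failure")

// waitForExit blocks until the server fails or a stop signal arrives and
// calls flush on both paths, so telemetry about a startup failure is not lost.
func waitForExit(serverErr <-chan error, stop <-chan struct{}, flush func()) string {
	select {
	case err := <-serverErr:
		flush() // the branch the follow-up commit adds a flush to
		return "server error: " + err.Error()
	case <-stop:
		flush() // the graceful path
		return "signal"
	}
}

func main() {
	serverErr := make(chan error, 1)
	stop := make(chan struct{})
	serverErr <- errSimulated

	reason := waitForExit(serverErr, stop, func() { fmt.Println("flushing APM") })
	fmt.Println("exit reason:", reason)
}
```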
@steventohme steventohme merged commit 515b302 into main on May 25, 2026
7 checks passed
@steventohme steventohme deleted the steven/router-apm-signoz branch on May 25, 2026 21:08