feat(apm): wire OTel SDK + SigNoz exporter #245
Merged
…app.workweave.ai

Adds internal/observability/apm, mirroring backend/internal/app/telemetry: SDK tracer and meter providers, OTLP/gRPC exporters, otelgin middleware, and resource attributes matching the rest of Weave (service.name=router, deployment.env, service.version). The existing custom HTTP emitter (internal/observability/otel) keeps emitting the per-decision spans on OTEL_EXPORTER_OTLP_ENDPOINT. The APM pipeline is separate and driven by WV_APM_OTLP_ENDPOINT, so existing deployments don't change behavior until that env var is set.
- Move apm.Shutdown() into the graceful path with an explicit 1.5s budget so the SDK flush actually runs before the Cloud Run SIGKILL. It was previously deferred and never ran, because srv.Shutdown + emitter.Shutdown consumed the full 10s window. Trimmed srv.Shutdown from 8s→6s and emitter.Shutdown from 2s→1.5s.
- Register otelruntime instrumentation after the MeterProvider is set so goroutine / heap / GC / cgo metrics actually publish to SigNoz.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 6a44873.
- TLS by default for APM OTLP transport (WV_APM_OTLP_INSECURE now defaults to false). Operators opt into plaintext gRPC explicitly when the collector is on a trusted internal network. Closes the production-path TLS bypass that allowed on-path interception of span attributes.
- Flush APM in the serverErr branch too, so a ListenAndServe failure doesn't drop the SDK traces + metrics describing the failure itself.
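One way to read the two environment variables with these defaults is sketched below. It is an assumption-laden illustration, not the PR's parser: `apmConfig` and `loadAPMConfig` are hypothetical names, the endpoint value in `main` is made up, and treating an unparseable WV_APM_OTLP_INSECURE as "keep TLS on" is a choice made here for safety.

```go
package main

import (
	"fmt"
	"strconv"
)

// apmConfig holds the settings derived from the environment.
// (Hypothetical type, used only for this sketch.)
type apmConfig struct {
	Endpoint string // host:port of the OTLP/gRPC collector
	Enabled  bool   // false when WV_APM_OTLP_ENDPOINT is unset (no-op)
	Insecure bool   // true only on explicit opt-in to plaintext gRPC
}

// loadAPMConfig reads configuration through getenv so it can be tested
// without touching the process environment.
func loadAPMConfig(getenv func(string) string) apmConfig {
	cfg := apmConfig{Endpoint: getenv("WV_APM_OTLP_ENDPOINT")}
	cfg.Enabled = cfg.Endpoint != ""

	// An unset or unparseable value leaves Insecure at false, so TLS stays
	// on unless the operator explicitly asks for plaintext.
	if v, err := strconv.ParseBool(getenv("WV_APM_OTLP_INSECURE")); err == nil {
		cfg.Insecure = v
	}
	return cfg
}

func main() {
	// Made-up endpoint for illustration only.
	env := map[string]string{"WV_APM_OTLP_ENDPOINT": "collector.internal:4317"}
	cfg := loadAPMConfig(func(k string) string { return env[k] })
	fmt.Printf("%+v\n", cfg)
}
```

In a real binary the lookup function would be `os.Getenv`; injecting it keeps the defaulting logic easy to check in isolation.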

Summary
The router emits OTel decision spans today via a custom OTLP/HTTP emitter (internal/observability/otel), but it doesn't use the OTel SDK: no TracerProvider, no MeterProvider, no gin instrumentation. That means HTTP request traces and Go runtime metrics never reach SigNoz at https://apm.app.workweave.ai, so the router doesn't appear in the same dashboards as the rest of Weave.

This PR adds internal/observability/apm, modeled exactly on backend/internal/app/telemetry/otel.go, so the router publishes the standard service-level observability that the SigNoz APM view expects.

What's new
- internal/observability/apm/apm.go: OTel SDK wiring:
  - sdktrace.TracerProvider + sdkmetric.MeterProvider
  - OTLP/gRPC exporters (otlptracegrpc, otlpmetricgrpc)
  - Resource attributes: service.name=router, service.version, deployment.environment (matches backend pattern)
- apm.Middleware(): wraps otelgin.Middleware so every HTTP request is a span with method/route/status. /health and /validate are excluded.
- apm.Init() / apm.Shutdown(): idempotent boot + graceful flush in cmd/router/main.go.

Config
- WV_APM_OTLP_ENDPOINT: host:port of the SigNoz OTLP/gRPC collector. Unset = no-op (no behavior change for existing deploys).
- WV_APM_OTLP_INSECURE: originally defaulted to true (internal collector), with false set when pointing at apm.app.workweave.ai directly during local testing. A follow-up commit on this PR changes the default to false, so TLS is on unless plaintext is requested explicitly.
- ROUTER_DEPLOYMENT_ENV, ROUTER_VERSION: surface in resource attributes; fall back to ENV / unknown.

Coexistence with existing custom emitter
The existing custom OTLP/HTTP emitter on OTEL_EXPORTER_OTLP_ENDPOINT is untouched. It keeps publishing the per-decision spans (router.decision, router.cache_hit, etc.) that aren't worth duplicating through the SDK. The new SDK layer publishes:

- HTTP request spans via otelgin
- Go runtime metrics

They're independent pipelines pointed at potentially different endpoints, so neither blocks the other.
Test plan
- go build ./... clean
- go test ./... clean across all packages
- Set WV_APM_OTLP_ENDPOINT in router Cloud Run staging; confirm router service appears in apm.app.workweave.ai with HTTP spans + runtime metrics
- Confirm OTEL_EXPORTER_OTLP_ENDPOINT deployments still emit decision spans unchanged

🤖 Generated with Claude Code