feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing #8305
Conversation
Dependency Review: ✅ No vulnerabilities or OpenSSF Scorecard issues found. Scanned files: none.
/try

Okay, starting a try! I'll update this comment once it's running...
THIS TIME IT WILL WORK 🤞 !
Force-pushed f9d8c60 to 0806255
I think it might! I mean, I had to
feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing
- Monolithic mode with local filesystem storage
- 14-day retention (336h block_retention)
- Metrics generator enabled for trace-to-metrics features
- 5m block duration for faster query availability in dev
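As a sketch, a minimal dev `tempo.yaml` matching these settings could look like the following. The listen port matches the PR; the storage paths, receiver protocol, and remote-write URL are illustrative assumptions, not taken from this PR:

```yaml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:            # assumed: collector forwards traces over OTLP gRPC

ingester:
  max_block_duration: 5m # cut blocks quickly so traces are queryable sooner in dev

compactor:
  compaction:
    block_retention: 336h # 14-day retention

storage:
  trace:
    backend: local        # monolithic mode, ephemeral local filesystem
    local:
      path: /var/tempo/blocks

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write # assumed Prometheus endpoint
```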
- Replace jaeger service with tempo service
- Expose Tempo on port 3200 (HTTP API/UI)
- Mount tempo.yaml config from dev/config/tempo/
- Update otelcol dependency from jaeger to tempo
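In compose terms, the swap might look roughly like this (image tag and exact mount paths are assumptions; the PR later pins a downstream image instead of upstream):

```yaml
services:
  tempo:
    image: grafana/tempo:latest        # assumed tag; PR builds a downstream image
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./dev/config/tempo/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"                    # Tempo HTTP API/UI

  otelcol:
    depends_on:
      - tempo                          # was: jaeger
```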
- Rename exporter from otlp to otlp/tempo
- Update endpoint from jaeger:4317 to tempo:4317
- Update traces pipeline to use otlp/tempo exporter
- Update logs pipeline to use otlp/tempo exporter
Tempo is a tracing backend and doesn't handle logs. Logs should be routed to Loki for proper storage and to enable trace-to-logs navigation in Grafana.
- Add otlphttp/loki exporter
- Update logs pipeline to use otlphttp/loki
- Traces pipeline unchanged (still routes to otlp/tempo)
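The corrected collector wiring could be sketched as below. The exporter names match the commits; the receiver block, TLS setting, and Loki's OTLP ingest path are assumptions:

```yaml
receivers:
  otlp:
    protocols:
      grpc:                            # assumed: apps send OTLP over gRPC

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true                   # assumed fine for local dev
  otlphttp/loki:
    endpoint: http://loki:3100/otlp    # assumed Loki OTLP ingest endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]          # traces -> Tempo
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]       # logs -> Loki, not Tempo
```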
- Change datasource type from jaeger to tempo
- Update URL to tempo:3200
- Add jsonData for enhanced integration:
  - tracesToLogs: enable jump from traces to Loki logs
  - tracesToMetrics: enable jump from traces to Prometheus metrics
  - serviceMap: enable service graph from traces
  - nodeGraph: enable node graph visualization
Add explicit uid fields to Tempo, Loki, and Prometheus datasources to ensure reliable cross-datasource references and follow Grafana best practices for datasource provisioning.
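A provisioning sketch combining the type change, uid, and jsonData (the uid values and the exact jsonData key shapes are assumptions; Grafana's schema varies by version):

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo                     # explicit uid for cross-datasource references
    access: proxy
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki        # assumed uid of the Loki datasource
      tracesToMetrics:
        datasourceUid: prometheus  # assumed uid of the Prometheus datasource
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
```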
- Create dashboards.yml with Tempo Examples provider
- Configure path to /etc/grafana/provisioning/dashboards/tempo
- Dashboards will appear in 'Tempo' folder in Grafana
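A minimal dashboards.yml for this provider might look like (orgId and disableDeletion are assumed defaults):

```yaml
apiVersion: 1
providers:
  - name: 'Tempo Examples'
    orgId: 1
    folder: 'Tempo'                # dashboards appear in this Grafana folder
    type: file
    disableDeletion: false
    options:
      path: /etc/grafana/provisioning/dashboards/tempo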
Dashboard shows trace-derived RED metrics:
- Spans per service (request rate)
- Error rate by service
- P95 latency by service
Demonstrates trace-to-metrics capability using Tempo's metrics generator and Prometheus integration.
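The panel queries could be sketched in PromQL like this; the metric and label names (`traces_spanmetrics_*`, `service`, `status_code`) follow Tempo's span-metrics conventions but depend on the Tempo version and are assumptions here:

```promql
# Request rate: spans per second, per service
sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# Error rate: fraction of error-status spans, per service
sum by (service) (rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m]))
  / sum by (service) (rate(traces_spanmetrics_calls_total[5m]))

# P95 latency per service, from the generated latency histogram
histogram_quantile(0.95, sum by (le, service) (rate(traces_spanmetrics_latency_bucket[5m])))
```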
Panel 1 should declare Prometheus datasource consistently with panels 2 and 3, since it queries trace-derived metrics from Prometheus, not Tempo directly.
Shows service dependency graph using Tempo's service graph metrics. Nodes are clickable to filter traces for specific services. Demonstrates service graph feature and cross-panel navigation.
Add processor configuration to enable service graph and span metrics generation. This allows Tempo to generate traces_service_graph_request_total metrics required for the Service Dependencies dashboard. Without this configuration, the service graph dashboard will be empty.
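Enabling these processors is typically done in tempo.yaml's overrides block; a hedged sketch (the exact key layout differs between Tempo versions):

```yaml
overrides:
  defaults:
    metrics_generator:
      # service-graphs feeds traces_service_graph_request_total;
      # span-metrics feeds the RED-metric panels
      processors: [service-graphs, span-metrics]
```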
- Update telemetry group: jaeger -> tempo
- Add tempo to compose_services with link to port 3200
- Remove jaeger from compose_services
Developers can access the Tempo UI via Tilt at localhost:3200.
- Update tracing service from Jaeger to Tempo
- Update port from 16686 to 3200
- Note Grafana integration for trace visualization
- Update tracing documentation from Jaeger to Tempo
- Document Grafana as primary interface for traces
- Add Tempo evaluation context
- Update port references (16686 -> 3200)
Signed-off-by: Fletcher Nichol <fletcher@systeminit.com>
This change allows our team to control the versioning of the upstream container image as we do with other services. Signed-off-by: Fletcher Nichol <fletcher@systeminit.com>
Force-pushed 0806255 to eaf922e
feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing
This change migrates the local development environment from Jaeger to
Grafana Tempo for distributed tracing. Prior to this change, we used
Jaeger as our trace backend with limited integration capabilities.
With this change, we gain enhanced observability through Tempo's
metrics generation, improved Grafana integration, and modern trace
query capabilities via TraceQL.
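For a flavor of TraceQL, a query like the following filters spans by service and latency (the service name is illustrative):

```traceql
{ resource.service.name = "web" && duration > 500ms }
```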
The migration architecture leverages our existing OTLP infrastructure,
requiring zero service code changes. The OpenTelemetry Collector
continues to receive traces via OTLP from application services and now
forwards them to Tempo instead of Jaeger. This clean separation
demonstrates the value of the collector pattern for backend
portability.
Tempo runs in monolithic mode for local development simplicity (single
container vs Jaeger's multi-service architecture). Storage uses
ephemeral local filesystem within the container, matching our pattern
with Loki. Block retention is configured for 14 days (336h) to preserve
debugging data across development sessions, with a 5-minute block
duration for faster query availability during active development.
The Grafana datasource configuration enables cross-observability
features that are the primary evaluation goal:
- tracesToLogs: click a span to jump to corresponding Loki logs
- tracesToMetrics: click a service to view Prometheus metrics
- serviceMap: visualize service dependencies derived from traces
- nodeGraph: interactive service topology exploration

Tempo's metrics generator produces RED metrics (Rate, Errors, Duration)
from trace spans, eliminating the need for separate instrumentation for
these foundational metrics. Two example dashboards demonstrate this
capability: a Trace Overview showing per-service request rates, error
rates, and P95 latency, and a Service Dependencies graph showing the
service topology derived from trace relationships.
This migration positions us to evaluate Tempo for future production
use. The local development environment provides a realistic testbed for
validating Grafana integration depth, TraceQL query expressiveness, and
operational characteristics before making production deployment
decisions.
Configuration changes:
The OTEL Collector required a logs pipeline correction: logs now route
to Loki via the otlphttp/loki exporter instead of the Tempo exporter,
since Tempo is a tracing backend and doesn't handle logs. This was
discovered during testing and corrected to maintain proper observability
data flow.
Building a downstream container image for Tempo (component/tempo/)
allows our team to control upstream version progression as we do with
other services, preventing unexpected breaking changes during
development.