This repository was archived by the owner on Feb 6, 2026. It is now read-only.

feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing #8305

Merged
fnichol merged 16 commits into main from fnichol/tempo-tracing on Jan 21, 2026

Conversation

fnichol (Contributor) commented Jan 19, 2026

This change migrates the local development environment from Jaeger to
Grafana Tempo for distributed tracing. Prior to this change, we used
Jaeger as our trace backend with limited integration capabilities.
With this change, we gain enhanced observability through Tempo's
metrics generation, improved Grafana integration, and modern trace
query capabilities via TraceQL.

The migration architecture leverages our existing OTLP infrastructure,
requiring zero service code changes. The OpenTelemetry Collector
continues to receive traces via OTLP from application services and now
forwards them to Tempo instead of Jaeger. This clean separation
demonstrates the value of the collector pattern for backend
portability.

Tempo runs in monolithic mode for local development simplicity (single
container vs Jaeger's multi-service architecture). Storage uses
ephemeral local filesystem within the container, matching our pattern
with Loki. Block retention is configured for 14 days (336h) to preserve
debugging data across development sessions, with a 5-minute block
duration for faster query availability during active development.
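
A minimal sketch of what such a Tempo configuration might look like (key names follow Tempo's monolithic-mode config schema; the listen ports and storage path here are assumptions, not necessarily the repository's actual dev/config/tempo/tempo.yaml):

```yaml
# Hypothetical tempo.yaml sketch for monolithic-mode local dev.
server:
  http_listen_port: 3200          # HTTP API / query endpoint

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317  # traces arrive here from the collector

ingester:
  max_block_duration: 5m          # cut blocks quickly so traces are queryable sooner

compactor:
  compaction:
    block_retention: 336h         # 14 days of debugging data

storage:
  trace:
    backend: local                # ephemeral filesystem, matching the Loki pattern
    local:
      path: /var/tempo/blocks
```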

The Grafana datasource configuration enables the cross-signal navigation
features that are the primary goal of this evaluation:

  • tracesToLogs: click a span to jump to corresponding Loki logs
  • tracesToMetrics: click a service to view Prometheus metrics
  • serviceMap: visualize service dependencies derived from traces
  • nodeGraph: interactive service topology exploration
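
As a rough illustration, the provisioned Tempo datasource enabling these features might look like the following (jsonData field names follow Grafana's Tempo datasource provisioning options; the uid values for Loki and Prometheus are assumptions):

```yaml
# Hypothetical Grafana datasource provisioning sketch.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki       # span -> corresponding Loki logs
        filterByTraceID: true
      tracesToMetrics:
        datasourceUid: prometheus # span -> Prometheus metrics
      serviceMap:
        datasourceUid: prometheus # service graph built from generator metrics
      nodeGraph:
        enabled: true             # interactive topology view
```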

Tempo's metrics generator produces RED metrics (Rate, Errors, Duration)
from trace spans, eliminating the need for separate instrumentation for
these foundational metrics. Two example dashboards demonstrate this
capability: a Trace Overview showing per-service request rates, error
rates, and P95 latency, and a Service Dependencies graph showing the
service topology derived from trace relationships.
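
For reference, panels over these trace-derived metrics would be built from queries along these lines (metric and label names follow Tempo's span-metrics defaults and may vary by version):

```promql
# Request rate per service
sum(rate(traces_spanmetrics_calls_total[5m])) by (service)

# Error rate per service
sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR"}[5m])) by (service)

# P95 latency per service
histogram_quantile(0.95,
  sum(rate(traces_spanmetrics_latency_bucket[5m])) by (le, service))
```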

This migration positions us to evaluate Tempo for future production
use. The local development environment provides a realistic testbed for
validating Grafana integration depth, TraceQL query expressiveness, and
operational characteristics before making production deployment
decisions.

Configuration changes:

  • Add Tempo service to docker-compose (port 3200)
  • Update OTEL Collector exporters (jaeger:4317 -> tempo:4317)
  • Configure Grafana Tempo datasource with enhanced integration
  • Add dashboard provisioning for Tempo examples
  • Update Tiltfile telemetry group (jaeger -> tempo)
  • Fix OTEL Collector logs routing to Loki (Tempo is traces-only)
  • Build downstream Tempo container image for version control
  • Update documentation references (README, DEV_DOCS)
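
A sketch of the compose service the first bullet describes (the image name is hypothetical; the port and config mount follow the list above):

```yaml
# Hypothetical docker-compose fragment for the Tempo service.
services:
  tempo:
    image: systeminit/tempo:stable   # hypothetical downstream image name
    command: ["-config.file=/etc/tempo/tempo.yaml"]
    volumes:
      - ./dev/config/tempo/tempo.yaml:/etc/tempo/tempo.yaml:ro
    ports:
      - "3200:3200"                  # Tempo HTTP API / Grafana datasource URL
```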

The OTEL Collector required a logs pipeline correction: logs now route
to Loki via otlphttp/loki exporter instead of the Tempo exporter,
since Tempo is a tracing backend and doesn't handle logs. This was
discovered during testing and corrected to maintain proper observability
data flow.
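
The corrected collector wiring might look roughly like this (the exporter names otlp/tempo and otlphttp/loki come from this PR; the Loki endpoint path is an assumption based on Loki's native OTLP ingestion):

```yaml
# Hypothetical OpenTelemetry Collector fragment after the logs-pipeline fix.
exporters:
  otlp/tempo:
    endpoint: tempo:4317            # traces -> Tempo over OTLP gRPC
    tls:
      insecure: true                # local dev only
  otlphttp/loki:
    endpoint: http://loki:3100/otlp # logs -> Loki's OTLP HTTP ingest

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/tempo]
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]    # previously (incorrectly) pointed at Tempo
```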

Building a downstream container image for Tempo (component/tempo/)
allows our team to control upstream version progression as we do with
other services, preventing unexpected breaking changes during
development.

@github-actions github-actions bot added the A-otelcol Area: OpenTelemetry Collector development image label Jan 19, 2026
github-actions bot commented Jan 19, 2026

Dependency Review

✅ No vulnerabilities or OpenSSF Scorecard issues found.

Scanned Files

None

fnichol (Contributor, Author) commented Jan 19, 2026

/try

github-actions bot commented Jan 19, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

fnichol (Contributor, Author) commented Jan 20, 2026

/try

github-actions bot commented Jan 20, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

fnichol (Contributor, Author) commented Jan 20, 2026

/try

github-actions bot commented Jan 20, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

fnichol (Contributor, Author) commented Jan 20, 2026

/try

github-actions bot commented Jan 20, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

fnichol (Contributor, Author) commented Jan 20, 2026

/try

github-actions bot commented Jan 20, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

nickgerace (Contributor) commented Jan 20, 2026

THIS TIME IT WILL WORK 🤞 !

fnichol force-pushed the fnichol/tempo-tracing branch from f9d8c60 to 0806255 on January 20, 2026 23:24
fnichol (Contributor, Author) commented Jan 20, 2026

/try

github-actions bot commented Jan 20, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

fnichol (Contributor, Author) commented Jan 21, 2026

THIS TIME IT WILL WORK 🤞 !

I think it might! I mean, I had to retry the unit tests (which passed the second time), but I'm hopeful!

nickgerace previously approved these changes Jan 21, 2026
fnichol (Contributor, Author) commented Jan 21, 2026

/try

github-actions bot commented Jan 21, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 21, 2026
@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 21, 2026
@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2026
feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 21, 2026
@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2026
feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 21, 2026
nickgerace (Contributor) commented
/try

github-actions bot commented Jan 21, 2026

Okay, starting a try! I'll update this comment once it's running...
🚀 Try running here! 🚀

Commits

- Monolithic mode with local filesystem storage
- 14-day retention (336h block_retention)
- Metrics generator enabled for trace-to-metrics features
- 5m block duration for faster query availability in dev

- Replace jaeger service with tempo service
- Expose Tempo on port 3200 (HTTP API/UI)
- Mount tempo.yaml config from dev/config/tempo/
- Update otelcol dependency from jaeger to tempo

- Rename exporter from otlp to otlp/tempo
- Update endpoint from jaeger:4317 to tempo:4317
- Update traces pipeline to use otlp/tempo exporter
- Update logs pipeline to use otlp/tempo exporter

Tempo is a tracing backend and doesn't handle logs. Logs should be
routed to Loki for proper storage and to enable trace-to-logs
navigation in Grafana.

- Add otlphttp/loki exporter
- Update logs pipeline to use otlphttp/loki
- Traces pipeline unchanged (still routes to otlp/tempo)

- Change datasource type from jaeger to tempo
- Update URL to tempo:3200
- Add jsonData for enhanced integration:
  - tracesToLogs: enable jump from traces to Loki logs
  - tracesToMetrics: enable jump from traces to Prometheus metrics
  - serviceMap: enable service graph from traces
  - nodeGraph: enable node graph visualization

Add explicit uid fields to the Tempo, Loki, and Prometheus datasources
to ensure reliable cross-datasource references and follow Grafana
best practices for datasource provisioning.

- Create dashboards.yml with Tempo Examples provider
- Configure path to /etc/grafana/provisioning/dashboards/tempo
- Dashboards will appear in the 'Tempo' folder in Grafana

Dashboard shows trace-derived RED metrics:
- Spans per service (request rate)
- Error rate by service
- P95 latency by service

Demonstrates the trace-to-metrics capability using Tempo's
metrics generator and Prometheus integration.

Panel 1 should declare the Prometheus datasource consistently with
panels 2 and 3, since it queries trace-derived metrics from
Prometheus, not Tempo directly.

Shows the service dependency graph using Tempo's service graph
metrics. Nodes are clickable to filter traces for specific
services.

Demonstrates the service graph feature and cross-panel navigation.

Add processor configuration to enable service graph and span metrics
generation. This allows Tempo to generate the traces_service_graph_request_total
metrics required for the Service Dependencies dashboard.

Without this configuration, the service graph dashboard will be empty.

- Update telemetry group: jaeger -> tempo
- Add tempo to compose_services with link to port 3200
- Remove jaeger from compose_services

Developers can access the Tempo UI via Tilt at localhost:3200

- Update tracing service from Jaeger to Tempo
- Update port from 16686 to 3200
- Note Grafana integration for trace visualization

- Update tracing documentation from Jaeger to Tempo
- Document Grafana as primary interface for traces
- Add Tempo evaluation context
- Update port references (16686 -> 3200)

Signed-off-by: Fletcher Nichol <fletcher@systeminit.com>

This change allows our team to control the versioning of the upstream
container image as we do with other services.

Signed-off-by: Fletcher Nichol <fletcher@systeminit.com>
@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
github-merge-queue bot pushed a commit that referenced this pull request Jan 21, 2026
feat(tempo): replace Jaeger with Grafana Tempo for distributed tracing
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 21, 2026
@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jan 21, 2026
@fnichol fnichol added this pull request to the merge queue Jan 21, 2026
Merged via the queue into main with commit 7ed9656 Jan 21, 2026
11 checks passed
@fnichol fnichol deleted the fnichol/tempo-tracing branch January 21, 2026 18:44

Labels

A-otelcol Area: OpenTelemetry Collector development image


2 participants