Skip to content

feat(observability): OTel instrumentation + HTTP idle-pool options#28

Open
bdchatham wants to merge 13 commits intomainfrom
feat/scale-idle-pool-with-workers
Open

feat(observability): OTel instrumentation + HTTP idle-pool options#28
bdchatham wants to merge 13 commits intomainfrom
feat/scale-idle-pool-with-workers

Conversation

@bdchatham
Copy link
Copy Markdown

@bdchatham bdchatham commented Apr 21, 2026

Summary

Implements the observability design specified in sei-protocol/platform#128 (docs/designs/sei-load-observability.md), consumed by autobake and the forthcoming benchmarks namespace on harbor.

Two bundled pieces of scope, captured in one PR because they share the same code path and dependency set:

  1. HTTP client idle-pool tuning (the original scope of this PR) — parameterize newHttpClient via a functional-options pattern with raised defaults (MaxIdleConns: 500, MaxIdleConnsPerHost: 50) so high-concurrency callers can tune without churning the constructor.
  2. Full OTel instrumentation — new observability package, run-scope Resource, full tracing with exemplar support, W3C context propagation, optional OTLP push exporter, and new application metrics.

OTel scope — what's in

New observability package (observability/{setup.go,setup_test.go,README.md}):

  • Setup(ctx, Config) installs MeterProvider + TracerProvider with a Resource built from SEILOAD_RUN_ID, SEILOAD_CHAIN_ID, SEILOAD_COMMIT_ID (+ derived commit_id_short), SEILOAD_WORKLOAD, SEILOAD_SERVICE_VERSION, and SEILOAD_INSTANCE_ID (falls back to os.Hostname() so cluster-of-seiload deployments disambiguate by pod name without explicit wiring).
  • Prometheus reader (always) + OTLP gRPC exporter (gated on OTEL_EXPORTER_OTLP_ENDPOINT).
  • Composite TraceContext + Baggage propagator installed globally.

Package-level meters work before Setup. sender/metrics.go and stats/metrics.go acquire their meter via var meter = otel.Meter(...) at package-init time, before observability.Setup installs the real MeterProvider. This is safe because OTel Go's global is a delegating provider — meters and instruments acquired against it forward to the real provider once SetMeterProvider runs. Documented inline so the next reader doesn't have to re-derive it.

Transport instrumentation (sender/worker.go):

  • newHttpClient wraps the base transport with otelhttp.NewTransport (W3C traceparent injection + auto http.client.* metrics).

Tracing (full scope — one-shot):

  • seiload.run top-level span in main.go wrapping the entire invocation.
  • sender.send_tx, sender.check_receipt, sender.dial_endpoint child spans in sender/worker.go.
  • All histogram emissions happen inside sampled span contexts, so trace-based exemplar filter (OTel default) attaches trace IDs automatically.

New metrics:

Metric Type Purpose
seiload.txs.accepted Counter Accepted submissions — alert denominator
seiload.txs.rejected Counter + reason Rejection taxonomy
seiload.http.errors Counter + status_code HTTP failure breakdown
seiload.tps.achieved Observable gauge Live TPS (producer pump wired in follow-up)
seiload.run.tps.final Gauge Final TPS at run end — 1 series per run via Resource
seiload.run.duration.seconds Gauge Run wall-clock duration
seiload.run.txs.accepted.total Gauge Run's accepted-txs total

Run-summary gauges are 1-series-per-run-per-metric after the Resource join — the correct cardinality shape for cross-run benchmark dashboards.

Review rounds addressed

Tri-specialist pre-merge sanity check (platform-engineer + kubernetes-specialist + OpenTelemetry-expert):

  1. promhttp.Handler() silently drops exemplars. Default opts leave EnableOpenMetrics off → handler never negotiates OpenMetrics regardless of scraper Accept header → SDK-computed exemplars get stripped. Fixed with promhttp.HandlerFor(..., HandlerOpts{EnableOpenMetrics: true}). Highest-impact finding — without it, the entire exemplar path was inert.
  2. Shutdown order lost final OTLP batches. Provider shutdown is the only path that flushes the PeriodicReader / BatchSpanProcessor buffers; explicit exporter shutdowns must come AFTER providers (or be omitted entirely). Reordered.
  3. Cardinality on send_latency/receipt_latency. Dropped worker_id from labels — per-worker fan-out multiplied series by ~50 at realistic autobake scale, risking the 50k guardrail in platform PR #128's SeiLoadCardinalityHigh alert. Worker-level attribution lives on span attributes (accessible via exemplar pivot).
  4. defer cancel() inside a for loop in Worker.watchTransactions accumulated unbounded deferred state. Explicit cancel at end-of-iteration.

@amir-deris review feedback:

  • OTEL_SERVICE_VERSIONSEILOAD_SERVICE_VERSION (the env var is sei-load's own version, distinct from seiload.commit_id which names what's under test; OTEL_SERVICE_VERSION isn't an OTel spec env var anyway — the spec path is OTEL_RESOURCE_ATTRIBUTES=service.version=…).
  • receipt_latency description distinguished from send_latency.
  • OTel delegation pattern documented inline in sender/metrics.go so the pattern is self-explaining.
  • ethclient.DialContext(dialCtx, ...) instead of ethclient.Dial so the dial honors the span context.
  • go.mod bumped to 1.25.1 to match CI toolchain.
  • /metrics server moved to http.Server{} + deferred Shutdown(ctx) so the scrape endpoint stays available through the OTel shutdown window.
  • HttpClientOption / WithMaxIdleConns / WithMaxIdleConnsPerHost kept exported intentionally as the opt-in tuning API for a follow-up that plumbs settings.HttpMaxIdleConns[PerHost] through.

Test plan

  • go build ./... clean
  • go test ./... passes across all packages
  • golangci-lint run ./... clean
  • observability/setup_test.go covers Resource population from env, hostname fallback for service.instance.id, propagator install, OTLP gating, scheme stripping, short-commit derivation
  • Existing sender tests updated for the split between newHttpTransport (inner, inspectable) and newHttpClient (outer, otelhttp-wrapped)

Post-merge validation on harbor (via autobake's first run after both PRs land):

  • /metrics scrape response includes # {trace_id=...} exemplar lines (OpenMetrics content-type negotiation)
  • target_info{seiload_run_id="...",seiload_commit_id_short="..."} series present in harbor Prometheus
  • seiload_run_duration_seconds emitted at run end (AutobakeRunFailed alert clears)

Known follow-ups (tracked separately, NOT in this PR)

  • Wire the RecordTPSSample producer from stats.Collector so seiload.tps.achieved emits datapoints (currently registered + inert).
  • Platform-side: set SEILOAD_INSTANCE_ID via downward API metadata.name in autobake Job template (works today via hostname fallback, but explicit downward API is cleaner for cluster-of-seiload).
  • Tempo deployment on harbor — platform#127. Until it lands, spans are created + propagated + dropped at export.
  • Consider otlpmetricgrpc.WithEndpointURL vs the stripScheme helper; bump otelhttp to latest for fresher HTTP semconv.
  • Thread HttpClientOption through from settings so the exported tuning API has a caller.

🤖 Generated with Claude Code

The default MaxIdleConnsPerHost of 10 caps the keep-alive pool far below
the per-worker goroutine count set by settings.Workers, forcing TCP
re-dial on most completions and exhausting ephemeral source ports under
load. Bumping the defaults alone still leaves the pool static for larger
runs, so parameterize the client via an options pattern:

- newHttpClient(opts ...HttpClientOption) with defaults 500 / 50
- WithMaxIdleConns(n) overrides the global pool
- WithMaxIdleConnsPerHost(n) overrides the per-host pool

The sender Worker shares one client across settings.Workers goroutines
targeting a single endpoint, so it sizes both to the worker count.
Unit tests cover defaults, each option in isolation, and composition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread sender/worker.go Outdated
Per review: pulling w.workers into the HTTP client at the call site is
implicit tuning. Use the raised defaults (500 global / 50 per-host) for
now; future follow-up can expose MaxIdleConns[PerHost] as explicit
settings.* fields if callers need to tune beyond defaults.

Options API is retained — WithMaxIdleConns / WithMaxIdleConnsPerHost
remain available for any consumer that wants to opt in from a config
path later.
@bdchatham
Copy link
Copy Markdown
Author

Addressed in bb74d4d — dropped the Worker.Run call-site override so we rely on the 500/50 defaults. Kept the WithMaxIdleConns/WithMaxIdleConnsPerHost options available for explicit opt-in later, and the tests still validate the options API so a follow-up that plumbs settings.HttpMaxIdleConns[PerHost] through to newHttpClient has test coverage ready.

bdchatham and others added 2 commits April 21, 2026 10:24
Implements the design in sei-protocol/platform#128 (docs/designs/
sei-load-observability.md). Wires OTel run-scope via Resource (not
per-sample labels), full tracing with exemplar support, W3C
propagation for future cross-process stitching with seid, and a
dual-exporter topology (Prometheus scrape + optional OTLP push).

New package: observability
- Setup(ctx, Config) installs MeterProvider + TracerProvider, Resource
  from SEILOAD_* env (RunID, ChainID, CommitID, CommitIDShort,
  Workload, InstanceID), Prometheus reader, optional OTLP gRPC
  exporter gated on OTEL_EXPORTER_OTLP_ENDPOINT, and the composite
  W3C TraceContext+Baggage propagator.
- RunScopeFromEnv reads SEILOAD_* env vars.
- service.instance.id defaults to hostname when SEILOAD_INSTANCE_ID
  unset — lets cluster-of-seiload deployments disambiguate instances
  sharing a RunID/ChainID via a stable per-pod attribute.
- Meter(name) / Tracer(name) are thin wrappers around otel.Meter /
  otel.Tracer so callers can use sync.OnceValue(...) at package scope
  for lazy acquisition. Fixes the existing NoOp-capture bug where
  package-init var meter = otel.Meter(...) grabs the NoOp provider
  before main runs Setup, silently dropping emissions.

Migrations:
- sender/metrics.go and stats/metrics.go now build their instrument
  bundle via sync.OnceValue(...) — no more capture-at-init.
- sender/worker.go wraps newHttpClient's Transport with
  otelhttp.NewTransport so outbound requests inject W3C traceparent
  and auto-emit http.client.* metrics. newHttpTransport split out so
  tests can inspect the unwrapped transport directly.
- sender.sendTransaction, sender.waitForReceipt,
  sender.watchTransactions (Dial) now emit spans (sender.send_tx,
  sender.check_receipt, sender.dial_endpoint). Metrics recorded inside
  these spans automatically get trace-id exemplars under OTel's
  trace_based exemplar filter (default since SDK v1.28).
- main.go wraps runLoadTest in a seiload.run root span; per-request
  spans inherit from it so a full run is a single trace.

New metrics (registered; producers wired where applicable):
- seiload.txs.accepted (counter) — per successful submission
- seiload.txs.rejected (counter, reason label) — per failed submission
- seiload.http.errors (counter, status_code label) — non-200 responses
- seiload.tps.achieved (observable gauge) — registered; sample pump
  is a follow-up
- seiload.run.tps.final (gauge) — emitted once at run end via
  Collector.EmitRunSummary
- seiload.run.duration.seconds (gauge) — emitted once at run end
- seiload.run.txs.accepted.total (gauge) — emitted once at run end

Run-summary metrics are Gauges intentionally: single emission at
run end produces exactly one series per metric per run after the
Resource join, which is the right shape for cross-run comparison
dashboards in the benchmarks namespace.

Tests:
- observability/setup_test.go covers Resource population from env,
  hostname fallback for service.instance.id, propagator install,
  endpoint scheme stripping, and short-commit derivation.
- sender/worker_test.go split into TestNewHttpTransport_* (inspects
  the bare transport factory) and TestNewHttpClient_Smoke (verifies
  the otelhttp wrap is present without depending on its internals).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lity

Addresses four findings from the tri-specialist review (Kubernetes,
Platform, OpenTelemetry) on this PR:

1. Enable OpenMetrics on the /metrics handler. promhttp.Handler()'s
   default opts leave EnableOpenMetrics off, which silently strips
   exemplars from responses even when the OTel SDK computed them and
   the Prometheus server is configured for exemplar-storage. Switch
   to promhttp.HandlerFor(..., HandlerOpts{EnableOpenMetrics: true}).
   Pairs with platform's enableFeatures: [exemplar-storage] for the
   end-to-end exemplar path to actually deliver trace IDs.

2. Fix shutdown ordering. Provider shutdown is the only path that
   flushes the OTLP PeriodicReader and BatchSpanProcessor batches;
   shutting the exporters down first made that flush fail silently
   and lost the final batch. New order: MeterProvider and
   TracerProvider shut down first, cascading their flushes into the
   still-live exporters. The Prometheus pull-reader is immune but we
   keep the uniform ordering invariant.

3. Drop worker_id from send_latency / receipt_latency histogram
   labels. Per-worker fan-out multiplied series by ~50 at realistic
   autobake scale, bumping against the 50k cardinality guardrail
   before adding any new dimensions. Worker-level investigation is
   preserved via the span attributes (worker_id flows through the
   trace, accessible from any exemplar pivot).

4. Fix defer cancel() inside the receipt loop. Each transaction's
   context deferred its cancel until watchTransactions returned,
   accumulating unbounded deferred state across a long run. Cancel
   explicitly at end-of-iteration.

All tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham changed the title refactor(sender): configurable HTTP idle-conn pool (defaults 500/50) feat(observability): OTel instrumentation + HTTP idle-pool options Apr 21, 2026
bdchatham and others added 3 commits April 21, 2026 13:39
The sync.OnceValue bundles and observability.Meter/Tracer indirections
were defensive scaffolding — added on the assumption that package-level
var meter = otel.Meter(...) would capture the NoOp provider forever.
Not true for OTel Go v1.19+: the internal/global package uses a
delegation mechanism where meters and instruments created before
SetMeterProvider are transparently rebound when the real provider is
installed (internal/global/meter.go setDelegate).

Current SDK is v1.37, well past that fix. Revert to the standard OTel
Go idiom: package-level var meter + instruments as package vars,
acquired with otel.Meter / otel.Tracer directly. Matches every
upstream OTel Go tutorial and example.

- Delete observability/meter.go (wrappers only).
- sender/metrics.go: var meter = otel.Meter(...); instruments as
  package vars; drop the metricsBundle + senderMetrics OnceValue.
- sender/worker.go: var tracer = otel.Tracer(...); drop senderTracer
  OnceValue.
- stats/metrics.go: same simplification.
- stats/block_collector.go: rename local gasUsed -> gas to avoid
  shadowing the package-level metric.
- stats/run_summary.go: direct metric refs.

observability.Setup remains the single-responsibility bootstrap — it
still owns Resource construction, propagator install, exporter
configuration, and shutdown orchestration. That's where the value
actually is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the current stable HTTP semconv attribute names
(http.request.method, server.address, etc.) on the auto-generated
http.client.* metrics and client spans emitted by otelhttp.NewTransport.
Our bespoke seiload_* metrics + span attributes don't depend on these
names so no query/dashboard impact; cosmetic alignment with current
OTel HTTP semconv.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review feedback on comment density. Principle applied: keep only
the non-obvious WHY, delete anything that restates the code. Multi-line
rationale moves to observability/README.md where it belongs (invariants
on run-scope cardinality, shutdown order, exemplar requirements,
cluster-of-seiload extensibility).

Also drops the stale Setup docstring paragraph that still referenced
the long-deleted observability.Meter/Tracer accessors.

No behavior change. Net ~80 lines removed from code comments, 33 lines
added to README.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTel SDK deps (v1.43+) declare go 1.25.0 in their own go.mod, which
bumped this module's go directive via `go mod tidy`. The pinned CI Go
1.24.5 and golangci-lint v1.64.8 (built on Go 1.24) can't load configs
targeting Go 1.25, so the lint step fails at startup.

- .github/workflows/build-and-test.yml: go-version matrix 1.24.5 -> 1.25.1;
  golangci-lint-action @V3 -> @v6 with v2.1.6 binary (built on Go 1.25).
- Dockerfile: golang:1.24.5-bullseye -> golang:1.25.1-bullseye to match.

No source changes — pure toolchain bump.
Three in my changes + two pre-existing on main unmasked by the newer
default ruleset.

Mine:
- Move observable-gauge registration (worker_queue_length, tps_achieved)
  from package vars into init(). The instrument handles are never read
  by app code — the callbacks run via OTel on each scrape — so holding
  them in named vars reads as dead code to the unused analyzer. init()
  with discarded returns makes the side-effect-registration intent
  explicit.

Pre-existing (caught now because we moved from golangci-lint v1.64 to
v2.11 whose default enables errcheck + staticcheck ST1005):
- stats/logger.go: guard `reportFile.Close()` with `_ =` via deferred
  closure (errcheck).
- sender/ramper.go: lowercase the error-string first letter to match
  Go idiom (ST1005).
Comment thread observability/setup.go Outdated
Comment thread sender/metrics.go Outdated
Comment thread sender/metrics.go
Comment thread sender/worker.go
Comment thread sender/worker.go Outdated
Comment thread .github/workflows/build-and-test.yml
Comment thread main.go
- Rename OTEL_SERVICE_VERSION → SEILOAD_SERVICE_VERSION (matches the
  SEILOAD_* run-scope namespace; OTEL_SERVICE_VERSION was never an OTel
  spec env var anyway, spec uses OTEL_RESOURCE_ATTRIBUTES).
- Distinguish receipt_latency description from send_latency.
- Document OTel delegation pattern at package-level meter acquisition so
  the next reader doesn't re-derive why this works before Setup runs.
- ethclient.DialContext(dialCtx, ...) so the dial honors the span ctx.
- go.mod 1.25.0 → 1.25.1 to match CI toolchain.
- Graceful metrics-server shutdown via http.Server + Shutdown, with a
  5s timeout ordered after obsShutdown so the final Prometheus scrape
  path stays up during provider flush.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@bdchatham
Copy link
Copy Markdown
Author

@amir-deris thanks for the careful pass. All addressed in 95fe37c:

# Comment Resolution
1 OTEL_SERVICE_VERSION naming Renamed to SEILOAD_SERVICE_VERSION throughout (observability/setup.go:38, test, PR body). You're right that OTEL_SERVICE_VERSION isn't a spec env var — OTel's path is OTEL_RESOURCE_ATTRIBUTES=service.version=…. SEILOAD_* namespace is the right home.
2 receipt_latency description Updated to "Latency from transaction submission to receipt confirmation in seconds".
3 Stale sync.OnceValue claim in PR body PR body rewritten. The actual pattern is package-level var meter = otel.Meter(...) relying on OTel's delegating global provider. Added an inline comment at sender/metrics.go:12 documenting the mechanism so the next reader doesn't have to re-derive it. stats/metrics.go points to that same comment.
4 HttpClientOption / WithMaxIdleConns[PerHost] exported with no external caller Leaving exported intentionally as the opt-in tuning API for a follow-up that plumbs settings.HttpMaxIdleConns[PerHost] through to newHttpClient. Called out in the PR body's follow-up list. Tests validate the mechanism so the wiring PR lands with coverage ready.
5 ethclient.DialDialContext Fixed in sender/worker.go:156ethclient.DialContext(dialCtx, ...) so the dial honors the span context.
6 go.mod 1.25.0 vs CI 1.25.1 go.mod bumped to 1.25.1 to match.
7 Graceful metrics server shutdown Switched to http.Server{} with a deferred Shutdown(ctx) using a 5s timeout. Also set ReadHeaderTimeout: 5s while there. Ordered so metricsServer.Shutdown runs before obsShutdown returns (first deferred = last executed), keeping the scrape endpoint available during the OTel provider flush.

Lint + tests clean locally.

@bdchatham bdchatham enabled auto-merge (squash) April 22, 2026 18:21
@bdchatham bdchatham disabled auto-merge April 22, 2026 18:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants