Skip to content

feat(mempool): observability for the TxMempool rewrite invariants#3519

Merged
bdchatham merged 12 commits into
mainfrom
chore/mempool-observability-metrics
May 29, 2026
Merged

feat(mempool): observability for the TxMempool rewrite invariants#3519
bdchatham merged 12 commits into
mainfrom
chore/mempool-observability-metrics

Conversation

@bdchatham

@bdchatham bdchatham commented May 28, 2026

Copy link
Copy Markdown
Contributor

Adds two metrics that expose the TxMempool rewrite's compaction behavior without changing logic: tendermint_mempool_compact_total{trigger} and tendermint_mempool_compact_duration_seconds. The trigger label distinguishes the three call sites of compact() (insert_overflow, update, reap); rate-of-insert_overflow is the capacity-pressure signal. Emitted via OpenTelemetry (otel.Meter("tendermint_mempool")) following the sei-db convention.

Companion to platform sei-protocol/platform#743 (recording rules, dashboard panels, alerts). The reactor-side gossip-bytes counter and any pending→ready promotion signal are separate followups — the latter is already directly tested by mempool_pending_promotion_test.js in the release-test suite, so a metric for it isn't yet pulling weight.

Metric rationale

These metrics let us answer four questions about the rewrite that we couldn't ask cleanly before.

What each metric is for, when stress-testing

compact_total{trigger="insert_overflow"} is the capacity-pressure signal. When this counter ticks, the mempool was momentarily over hardLimit (= 2× softLimit). A non-zero rate means ingress is outpacing block consumption — the only operational question is whether admission control is supposed to be catching it. If both check_tx_met_drop_utilisation_threshold and insert_overflow are rising, pressure escaped the gate.

compact_total{trigger="update"} is the block heartbeat. It fires once per Update, so its rate is a block-rate proxy. If it flatlines while seiload is still pushing, the consensus loop is stuck — different signal from block_height_delta because it's measured at the mempool, not the indexer.

compact_total{trigger="reap"} is this node's proposer share. Validators that propose blocks Reap; followers don't. Useful for telling "is this node doing block-proposal work?" without consulting Tendermint internals.

compact_duration_seconds is where we'll catch the rewrite's real failure mode. compact() is O(m log m) over the full mempool. As load grows, this latency grows. The number we'll watch is rate(compact_total[5m]) × avg(compact_duration_seconds) — wall-time-fraction spent inside compact. Once that approaches ~50%, compact is now blocking block production. The histogram lets us see this coming long before p99 actually hits block-interval (~200ms).

The composite questions only these together can answer

  1. Does the rewrite's amortization actually hold under load? rate(insert_overflow) / rate(inserted_txs) should stay sub-linear as throughput rises. If the ratio goes to 1, the 2× hardLimit amortization broke and we're back to per-insert prune.
  2. Is compact CPU-bound or block-bound? Compare per-call duration to per-block compact rate. compact_duration_p99 rising independently of insert_overflow rate means individual compacts got expensive (likely GC pressure on 100k+ entries); both rising means we're pruning more often AND each prune is slower — the failure mode.
  3. Is admission control doing its job or just delaying the problem? rate(check_tx_met_drop_utilisation_threshold) / rate(insert_overflow) — if it's high, the gate is absorbing the spike. If it's near zero while overflow is high, the gate's threshold is mistuned.

What we're still blind to

These metrics don't tell us why a compact was slow (sort vs. map-rebuild vs. GC) — that needs pprof or per-phase timing inside compact(). Per-peer gossip bandwidth is the next PR. The pending→ready promotion behavior is directly tested by mempool_pending_promotion_test.js; a continuous prod signal for it can be added later if operational need surfaces.

Test plan

  • go build ./... clean
  • go test ./internal/mempool/ -count=1 passes
  • Validate emissions in a nightly chaos run
  • Wire up platform-side recording rules + dashboard panels per do not unregister if out of rent #743

References: design doc sei-protocol/platform#743, TxMempool rewrite #3476.

🤖 Generated with Claude Code

Adds four mempool metrics that expose the load-bearing invariants of the
TxMempool rewrite (#3476) without changing any behavior:

- tendermint_mempool_compact_total{reason} — counts compact() invocations
  labeled by the triggering call site: insert_overflow (Insert above
  hardLimit), update (per-block recompute), or reap. Rate of reason=
  insert_overflow is the actual capacity-pressure signal; reason=update
  is per-block and benign.

- tendermint_mempool_compact_duration_seconds — histogram of compact()
  wall-clock. compact() is O(m log m) over the full mempool with an
  index rebuild, so buckets extend to 10s to accommodate 100k-entry
  mempools under GC pressure.

- tendermint_mempool_promotion_total — counts pending→ready transitions
  inside the inline EVM promotion loop in txStore.insert. Cosmos txs
  are auto-ready and not counted.

- tendermint_mempool_utilisation — gauge mirroring the same ratio the
  CheckTx drop gate evaluates (txmp.utilisation()). Exposed directly so
  recording rules don't redrive total/(Size+PendingSize) in PromQL,
  which invites drift from the gate's own definition.

Design captured in platform PR #743 (observability appendix to the
mempool rewrite test coverage 1-pager). This PR is the upstream
implementation of section B of that appendix; the reactor-side
gossip-bytes counter is a separate followup since it requires threading
*Metrics through NewReactor (no metrics field today).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedMay 29, 2026, 3:01 PM

Platform-engineer:
- Utilisation gauge was stale after Flush() and ReapTxs(remove=true). Both
  paths change Size implicitly. Set Size / PendingSize / TotalTxsSizeBytes
  / Utilisation at end of Flush (to 0) and end of ReapTxs (to current
  store state, when remove=true). Same staleness existed for the older
  gauges — fix together.

Observability-platform-engineer:
- Rename compact_total label reason → trigger (Prometheus convention:
  "reason" connotes failure cause; "trigger" reads as caller dimension).
  Cheap pre-scrape, expensive after.
- Add tx_type label to promotion_total. Locks EVM scope into a queryable
  dimension instead of metric-name lore; preserves the door for
  tx_type="cosmos" counts without renaming.
- Extend compact_duration_seconds buckets to 20s and 30s. The metric's
  own docstring sets expectation of 5-10s under GC pressure — without
  higher buckets, p99 saturates at +Inf the moment GC bites.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Contributor Author

Coral cross-review applied (commit c92945f)

Both platform-engineer and observability-platform-engineer reviewed the initial commit. All must-fix findings landed in this revision:

Platform-engineer findings (Utilisation gauge staleness):

  • Flush() cleared the store but left gauges stale → set Size/PendingSize/TotalTxsSizeBytes/Utilisation to 0 after Clear().
  • ReapTxs(remove=true) evicts via compact("reap") but never re-emits the gauges → set them at end of ReapTxs when remove=true. (Same bug class existed for the pre-existing Size gauge — fixed together.)

Observability findings:

  • compact_total{reason}compact_total{trigger} — Prometheus convention bias; reason reads as failure cause, trigger reads as caller dimension. Free pre-scrape; expensive after.
  • promotion_totalpromotion_total{tx_type="evm"} — locks the EVM scope into a queryable dimension instead of metric-name lore; preserves the door for tx_type="cosmos" later without renaming.
  • compact_duration_seconds buckets extended to …, 5, 10, 20, 30 — the docstring set expectation of 5-10s under GC pressure, but the upper bucket of 10s meant p99 would saturate at +Inf exactly when needed most.

Non-blocking findings (deferred):

  • One-line panic-defer comment (cosmetic)
  • Hoisting .With() outside the defer (micro-opt; compact is per-block, not nanosecond-hot)

Bug found in the consumer-side platform PR #743: several PromQL queries reference Go-case names (tendermint_mempool_PendingSize, tendermint_mempool_Size) but the actual scrape names are snake_case (pending_size, size). Filing a fix-up to platform#743 separately.

Ready for human review when convenient.

Comment thread sei-tendermint/internal/mempool/tx.go Outdated
@codecov

codecov Bot commented May 28, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.22%. Comparing base (8567422) to head (bd7f503).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3519      +/-   ##
==========================================
- Coverage   59.03%   58.22%   -0.82%     
==========================================
  Files        2199     2129      -70     
  Lines      182207   174096    -8111     
==========================================
- Hits       107569   101363    -6206     
+ Misses      64975    63729    -1246     
+ Partials     9663     9004     -659     
Flag Coverage Δ
sei-chain-pr 73.54% <100.00%> (?)
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-tendermint/internal/mempool/mempool.go 81.86% <100.00%> (+0.74%) ⬆️
sei-tendermint/internal/mempool/metrics.go 100.00% <ø> (ø)
sei-tendermint/internal/mempool/tx.go 92.85% <100.00%> (+0.22%) ⬆️

... and 70 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Keep compact() pure — no string parameter, no metrics emission inside.
The three call sites (Insert overflow, Update recompute, Reap clear)
each time the call and emit compact_total{trigger} + compact_duration_seconds
directly. Matches the original compact() signature and reads as
"instrumentation at the call site" instead of "function knows about metrics".

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Comment thread sei-tendermint/internal/mempool/tx.go Outdated
Comment thread sei-tendermint/internal/mempool/metrics.go Outdated
bdchatham and others added 2 commits May 28, 2026 16:48
- Named constants for label keys and values (labelTrigger, labelTxType,
  triggerInsertOverflow/Update/Reap, txTypeEVM). Cleaner call sites and
  one canonical place to grep for label values.

- Drop the Utilisation gauge. The same value is trivially derivable as
  (tendermint_mempool_size + tendermint_mempool_pending_size) /
  (cfg.Size + cfg.PendingSize) via PromQL; not worth a dedicated metric
  + 4 emit sites.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Other Metrics fields use 1-2 line descriptors. Match that style.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham marked this pull request as ready for review May 29, 2026 00:12
@cursor

cursor Bot commented May 29, 2026

Copy link
Copy Markdown

PR Summary

Low Risk
Observability-only changes around existing compact() and gauge updates; no mempool admission, ordering, or consensus logic is modified.

Overview
Adds OpenTelemetry instrumentation around the TxMempool rewrite’s compact() path without changing compaction logic. New signals are tendermint_mempool_compact_total (with trigger = insert_overflow, update, or reap) and tendermint_mempool_compact_duration_seconds, recorded at the three existing call sites in txStore (Insert overflow, Update, Reap with removal).

TxMempool now keeps existing go-kit gauges in sync when the store is cleared or txs are reaped: Flush zeros size gauges, and ReapTxs refreshes ready/pending/byte gauges when remove is true (matching post-CheckTx / Update behavior).

Reviewed by Cursor Bugbot for commit bd7f503. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread sei-tendermint/internal/mempool/tx.go Outdated
Comment thread sei-tendermint/internal/mempool/metrics.go Outdated
bdchatham and others added 3 commits May 28, 2026 22:41
Cursor bugbot caught: compact() resets account.nextNonce=firstNonce then
calls insert() for every retained tx. The inline promotion loop in
insert() would then re-promote every already-ready EVM tx and re-count
the metric, inflating PromotionTotal by the full ready-EVM population
on every block (compact runs at least once per block via the update
trigger).

Add a countPromotion flag to insert() — true for the public Insert path,
false for compact's internal re-insertion. PromotionTotal now reflects
only new pending→ready transitions from user-driven inserts, restoring
the rate(promotion_total) ≈ rate(inserted_txs) invariant claimed in the
PR description.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Per Amir's review note: go-kit/kit/metrics is unmaintained (last commit
on the upstream go-kit repo is years old) and OTel is the strategic
direction. sei-db has been on OTel for some time
(sei-db/db_engine/pebbledb/mvcc/metrics.go, etc).

Pull the 3 new mempool metrics out of the go-kit Metrics struct and emit
them via otel.Meter("tendermint_mempool"). Attribute sets are package-
level vars so no allocation per emit. Emit sites use context.Background()
matching the sei-db convention (the txStore API doesn't carry ctx).

The other 22 mempool metrics remain on go-kit — they'll move in a
follow-up that migrates the whole package consistently.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@bdchatham

bdchatham commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

Both reviewer comments addressed:

@cursor[bot] — fixed the PromotionTotal double-counting in compact() (commit `d5d4a1a`). The metric was incrementing for every re-inserted EVM tx during compact's account.nextNonce = firstNonce reset, so the rate would have been dominated by mempool size × block rate instead of new external promotions. insert() now takes a countPromotion bool flag that the public Insert() sets to true and compact()'s re-insertion path sets to false. The PR-description invariant rate(promotion_total) ≈ rate(inserted_txs) now holds.

@amir-deris — converted the 3 new metrics to OTel native (commit `105bfd2`). Following the sei-db convention (sei-db/db_engine/pebbledb/mvcc/metrics.go):

var (
    mempoolMeter = otel.Meter("tendermint_mempool")
    otelMetrics = struct {
        compactTotal           metric.Int64Counter
        compactDurationSeconds metric.Float64Histogram
        promotionTotal         metric.Int64Counter
    }{ /* ... */ }
    triggerInsertOverflowAttr = metric.WithAttributes(attribute.String("trigger", "insert_overflow"))
    /* ... */
)

// Emit site:
otelMetrics.compactTotal.Add(context.Background(), 1, triggerInsertOverflowAttr)
otelMetrics.compactDurationSeconds.Record(context.Background(), time.Since(start).Seconds())

Attribute sets are package-level vars so emission is allocation-free on the hot path. The existing 22 go-kit metrics in this package remain untouched — they should move to OTel in a follow-up that handles the whole package consistently rather than mixing per-metric.

bdchatham and others added 2 commits May 28, 2026 22:51
It's an indirect indicator (the rewrite's invariant 3 is directly tested
by mempool_pending_promotion_test.js in the release-test suite). The
end-to-end RPC test is the ground truth; a separate metric for the same
thing isn't pulling weight yet, and dropping it removes the
countPromotion flag plumbing from insert() entirely.

Compaction metrics (compact_total{trigger}, compact_duration_seconds)
stay — they expose the rewrite's actual new failure mode (O(m log m)
prune + rebuild) which we don't have a corresponding test for.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
}
start := time.Now()
s.compact(inner, true)
otelMetrics.compactTotal.Add(context.Background(), 1, triggerUpdateAttr)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a follow-up to properly pass context in

observability-platform-engineer caught: the OTel→Prometheus exporter in
sei-chain registers with namespace="sei-chain" (utils/metrics/metrics_util.go:20).
The meter name "tendermint_mempool" becomes the otel_scope_name label, NOT
a series prefix. With the short metric names "compact_total" etc, the
exported series would be sei_chain_compact_total — platform #743's
recording rules expecting tendermint_mempool_compact_total would silently
return no data.

Prefix the metric names so the exported series is
sei_chain_tendermint_mempool_compact_total — verbose but matches the
existing tendermint_mempool_* discoverability (size, pending_size, etc).
Same precedent as app_tx_count_total → sei_chain_app_tx_count_total.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Contributor Author

Final Coral review round — one must-fix landed

Both experts re-reviewed at `04eee9659`. Platform-engineer returned "ready"; observability caught a real blocker.

Naming mismatch — fixed in `a36521af`. The OTel→Prom exporter in sei-chain registers with `namespace="sei-chain"` (`utils/metrics/metrics_util.go`), so the meter name ("tendermint_mempool") becomes the `otel_scope_name` label, not a series prefix. With the short names `compact_total` / `compact_duration_seconds`, the exported series would have been `sei_chain_compact_total` — the platform PR #743's recording rules expecting `tendermint_mempool_compact_total` would have silently returned no data.

Fix: prefixed the metric names with `tendermint_mempool_` so the exported series is `sei_chain_tendermint_mempool_compact_total`. Matches the existing precedent (`sei_chain_app_tx_count_total` from `meter="app"`, metric="tx_count"`) and sits alongside the existing go-kit `tendermint_mempool_*` metrics in queries.

Platform appendix in sei-protocol/platform#743 updated in parallel (`f39756b`) to reference `sei_chain_tendermint_mempool_*` in PromQL.

Other findings — all green:

  • Histogram bucket spec preserved (14 boundaries + `+Inf` + `_count` + `_sum` = 17 series/instance)
  • Cardinality recount: ~5,000 series net-new
  • Mixed go-kit + OTel coexists cleanly on the same `/metrics` endpoint
  • Attribute-set pre-baking is safe (no aliasing)
  • OTel hot-path emit cost ~50-200ns/call, comparable to go-kit
  • Bugbot's pre-fix finding was a real bug we should have caught in round 1

Ready for human review.

Comment thread sei-tendermint/internal/mempool/metrics.go Outdated
Comment thread sei-tendermint/internal/mempool/metrics.go Outdated
Comment thread sei-tendermint/internal/mempool/tx.go
- Replace local must[V any] helper with libs/utils.OrPanic1
- Switch compact_duration_seconds buckets to stdprometheus.
  ExponentialBucketsRange(0.001, 30, 14) instead of a hand-tuned list
- Keep timing at the 3 call sites (rather than moving inside compact()
  to dedupe) so compact() stays pure and the per-trigger metric
  emission remains explicit at the caller

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@bdchatham

Copy link
Copy Markdown
Contributor Author

@pompon0 — addressed in `bd7f50392`:

  1. `utils.OrPanic1` — swapped the local `must[V any]` helper for the shared one.
  2. Exponential buckets — switched to `stdprometheus.ExponentialBucketsRange(0.001, 30, 14)` (the same helper used in `internal/consensus/metrics.gen.go`). 14 boundaries spanning 1ms to 30s with a ~2.32× factor per bucket.
  3. Timing inside `compact()` — keeping it at the call sites. `compact()` stays pure (no OTel emission inside) and the per-`trigger` counter increment is already at the caller, so the timing sits naturally next to it. The duplication is 2 lines × 3 sites and visually clear; moving timing inside would either require a label parameter (which the prior review pass cut) or split the histogram/counter emission across two locations.

@bdchatham bdchatham enabled auto-merge May 29, 2026 15:01
metric.WithDescription("Number of compact() invocations, labeled by call site (insert_overflow, update, reap)."),
)),
compactDurationSeconds: utils.OrPanic1(mempoolMeter.Float64Histogram(
"tendermint_mempool_compact_duration_seconds",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably doesn't need _seconds in the name. That will be added by default.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit bd7f503. Configure here.

start := time.Now()
s.compact(inner, false)
otelMetrics.compactTotal.Add(context.Background(), 1, triggerInsertOverflowAttr)
otelMetrics.compactDurationSeconds.Record(context.Background(), time.Since(start).Seconds())

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duration histogram missing trigger attribute at all call sites

Medium Severity

The compactDurationSeconds.Record() calls omit the trigger attribute (triggerInsertOverflowAttr, triggerUpdateAttr, triggerReapAttr) that is passed to compactTotal.Add(). All three call sites record duration into an unlabeled histogram, making it impossible to break down compact latency by trigger. This defeats the stated justification for keeping timing outside compact() ("The dimension of the metric depends on where this is called").

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit bd7f503. Configure here.

@bdchatham bdchatham added this pull request to the merge queue May 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026
@bdchatham bdchatham added this pull request to the merge queue May 29, 2026
Merged via the queue into main with commit cd0db11 May 29, 2026
55 of 56 checks passed
@bdchatham bdchatham deleted the chore/mempool-observability-metrics branch May 29, 2026 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants