
feat(metrics): migrate sei-cosmos to OpenTelemetry (PLT-353) #3467

Merged
amir-deris merged 22 commits into main from amir/plt-352-migrate-sei-cosmos-to-otel
May 26, 2026

@amir-deris amir-deris commented May 19, 2026

Adds OTel instrumentation to sei-cosmos following the same pattern as PLT-329, PLT-330, PLT-336, PLT-339, and PLT-343.

New instruments

baseapp (meter seicosmos_baseapp)

  • mid_block_duration — histogram, seconds
  • end_block_duration — histogram, seconds
  • deliver_tx_duration — histogram, seconds
  • tx — counter, total delivered transactions
  • tx_result — counter, result label (successful/failed)
  • tx_gas_used — gauge
  • tx_gas_wanted — gauge
  • commit_duration — histogram, seconds
  • abci_query_duration — histogram, seconds, path label
  • process_proposal_duration — histogram, seconds
  • finalize_block_duration — histogram, seconds
  • get_tx_priority_hint_duration — histogram, seconds
  • run_tx_duration — histogram, seconds, mode label (replaces MeasureThroughputSinceWithLabels for TxCount)
  • run_msgs_duration — histogram, seconds (replaces MeasureThroughputSinceWithLabels for MessageCount)
  • run_msg_latency — histogram, seconds, type label (replaces both sei.cosmos.run.msg.latency and cosmos.run.msg.latency)

storev2/rootmulti (meter seicosmos_storev2_rootmulti)

  • sc_commit_latency — histogram, seconds
  • ss_version — gauge
  • historical_abci_query — counter, success + proof labels
  • iavl_total_key_bytes — gauge, store_name label
  • iavl_total_value_bytes — gauge, store_name label
  • iavl_total_num_keys — gauge, store_name label
  • state_sync_keys_exported — counter

tasks (meter seicosmos_tasks)

  • scheduler_retries — counter
  • scheduler_incarnations — counter

store/types (meter seicosmos_store_types)

  • gas_exceeded — counter, error + descriptor labels
  • bounded_cache — gauge, type label

x/upgrade (meter seicosmos_x_upgrade)

  • begin_blocker_duration — histogram, seconds
  • plan_height — gauge, name label

x/upgrade/keeper (meter seicosmos_x_upgrade_keeper)

  • plan_height — gauge, name label

x/auth/vesting (meter seicosmos_x_auth_vesting)

  • new_account — counter
  • account_amount — gauge, denom label

x/bank/keeper (meter seicosmos_x_bank_keeper)

  • send_amount — gauge, denom label

x/distribution/keeper (meter seicosmos_x_distribution_keeper)

  • withdraw_reward_amount — gauge, denom label
  • withdraw_commission_amount — gauge, denom label

x/staking/keeper (meter seicosmos_x_staking_keeper)

  • delegate — counter
  • delegate_amount — gauge, denom label
  • redelegate — counter
  • redelegate_amount — gauge, denom label
  • undelegate — counter
  • undelegate_amount — gauge, denom label

x/gov/keeper (meter seicosmos_x_gov_keeper)

  • proposal — counter
  • vote — counter, proposal_id label
  • deposit — counter, proposal_id label

Notes

  • All packages use dual-emit with TODO(PLT-353) comments pending dashboard verification.


cursor Bot commented May 19, 2026

PR Summary

Low Risk
Changes are primarily additive instrumentation and a mechanical Query signature update with context; dual-emit preserves existing telemetry behavior until PLT-353 cleanup.

Overview
This PR migrates sei-cosmos observability to OpenTelemetry while keeping existing Armon/go-metrics telemetry in parallel (dual-emit with TODO(PLT-353) until dashboards are verified).

BaseApp gains histograms/counters/gauges for ABCI paths (mid_block, end_block, deliver_tx, commit, process_proposal, finalize_block, get_tx_priority_hint), tx execution (run_tx, run_msgs, per-message run_msg_latency, tx count/result/gas), and abci_query_duration with cardinality-safe route labels via new abciQueryMetricRoute (registered gRPC paths, bounded app/ / store/ / custom/ buckets, not raw client paths).
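
The route bucketing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's abciQueryMetricRoute: the function name routeLabel, the registered-path set passed in, and the exact bucket strings (app/other, store/other, custom/other, unknown) are assumptions chosen for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// routeLabel maps a raw, client-supplied ABCI query path to a label drawn
// from a small fixed set. Hypothetical helper: the name and the bucket
// strings are assumptions, not the PR's implementation.
func routeLabel(path string, registered map[string]bool) string {
	// Registered gRPC paths form a known, finite set, so they are safe as-is.
	if registered[path] {
		return path
	}
	// Otherwise keep only the first path segment and discard the rest.
	prefix, _, _ := strings.Cut(strings.TrimPrefix(path, "/"), "/")
	switch prefix {
	case "app", "store", "custom":
		return prefix + "/other"
	default:
		return "unknown"
	}
}

func main() {
	// Example registry with a single gRPC query path.
	registered := map[string]bool{"/cosmos.bank.v1beta1.Query/Balance": true}
	for _, p := range []string{
		"/cosmos.bank.v1beta1.Query/Balance",
		"/store/bank/key",
		"custom/gov/proposal/42",
		"/anything/a-client-made-up",
	} {
		fmt.Printf("%-40s -> %s\n", p, routeLabel(p, registered))
	}
}
```

Whatever the client sends, the label set stays bounded by the registered paths plus four fixed buckets, which is the property the summary calls cardinality-safe.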

Store layer: types.Queryable.Query now takes context.Context; call sites are updated (handleQueryStore, rootmulti, storev2, evmrpc proofs). storev2/rootmulti and store/types add OTel for SC commit latency, SS version, historical ABCI queries, IAVL export stats, gas exceeded, and bounded-cache evictions. OCC scheduler (tasks) records retries and max incarnations.

Module keepers (bank, staking, gov, distribution, vesting, upgrade) emit OTel counters/gauges for high-value txs and upgrade plans; token amounts use telemetry.DenomClass (usei / ibc / factory / other) instead of raw denoms where applicable.

Reviewed by Cursor Bugbot for commit 84551ab.


github-actions Bot commented May 19, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build      Format     Lint       Breaking   Updated (UTC)
✅ passed  ✅ passed  ✅ passed  ✅ passed  May 26, 2026, 4:13 PM

@amir-deris amir-deris changed the title from "Added otel to sei-cosmos package" to "feat(metrics): migrate sei-cosmos to OpenTelemetry (PLT-353)" May 19, 2026
@amir-deris amir-deris requested review from bdchatham and masih May 19, 2026 23:48
Comment thread sei-cosmos/baseapp/abci.go Outdated
Comment thread sei-cosmos/x/gov/keeper/msg_server.go Outdated
Comment thread sei-cosmos/x/upgrade/abci.go

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 83.46457% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.06%. Comparing base (0d24524) to head (84551ab).
⚠️ Report is 1 commits behind head on main.

Files with missing lines                      Patch %   Lines
sei-cosmos/x/gov/keeper/msg_server.go         70.58%    10 Missing ⚠️
sei-cosmos/baseapp/abci.go                    87.50%    6 Missing ⚠️
sei-cosmos/baseapp/abci_query_metrics.go      92.10%    2 Missing and 1 partial ⚠️
sei-cosmos/baseapp/metrics.go                 50.00%    1 Missing and 1 partial ⚠️
sei-cosmos/store/types/metrics.go             50.00%    1 Missing and 1 partial ⚠️
sei-cosmos/storev2/rootmulti/metrics.go       50.00%    1 Missing and 1 partial ⚠️
sei-cosmos/tasks/metrics.go                   50.00%    1 Missing and 1 partial ⚠️
sei-cosmos/x/auth/vesting/metrics.go          50.00%    1 Missing and 1 partial ⚠️
sei-cosmos/x/bank/keeper/metrics.go           50.00%    1 Missing and 1 partial ⚠️
sei-cosmos/x/distribution/keeper/metrics.go   50.00%    1 Missing and 1 partial ⚠️
... and 5 more
Additional details and impacted files


@@            Coverage Diff             @@
##             main    #3467      +/-   ##
==========================================
+ Coverage   59.04%   59.06%   +0.02%     
==========================================
  Files        2187     2200      +13     
  Lines      181488   181679     +191     
==========================================
+ Hits       107163   107317     +154     
- Misses      64703    64729      +26     
- Partials     9622     9633      +11     
Flag Coverage Δ
sei-chain-pr 72.20% <83.46%> (?)
sei-db 70.41% <ø> (ø)


Files with missing lines Coverage Δ
evmrpc/state.go 59.05% <100.00%> (ø)
sei-cosmos/baseapp/baseapp.go 75.78% <100.00%> (+0.48%) ⬆️
sei-cosmos/store/rootmulti/dbadapter.go 44.00% <100.00%> (ø)
sei-cosmos/store/rootmulti/store.go 59.31% <100.00%> (+0.45%) ⬆️
sei-cosmos/store/types/cache.go 72.22% <100.00%> (+0.52%) ⬆️
sei-cosmos/store/types/gas.go 91.66% <100.00%> (+0.21%) ⬆️
sei-cosmos/store/types/store.go 73.68% <ø> (ø)
sei-cosmos/storev2/commitment/store.go 53.12% <100.00%> (ø)
sei-cosmos/storev2/rootmulti/store.go 65.08% <100.00%> (+0.99%) ⬆️
sei-cosmos/storev2/state/store.go 26.31% <100.00%> (ø)
... and 22 more

... and 1 file with indirect coverage changes


Comment thread sei-cosmos/x/upgrade/keeper/metrics.go Outdated
planHeight metric.Int64Gauge
}{
planHeight: must(meter.Int64Gauge(
"plan_height",
Contributor


Instrument-name collision with sei-cosmos/x/upgrade/metrics.go.

Both files declare an Int64Gauge named plan_height with identical name + info attribute schema. The OTel-Prometheus exporter does not propagate the meter scope (seicosmos_x_upgrade vs seicosmos_x_upgrade_keeper) into the Prometheus series name, so both collapse to sei_chain_plan_height. The result is either a duplicate-registration warning with one side silently dropped, or two coexisting series differentiated only by otel_scope_name, where a naive PromQL sum(sei_chain_plan_height) double-counts.

The x/upgrade/abci.go BeginBlocker site is the load-bearing one (fires on every block while a plan exists). The keeper-side recording at ScheduleUpgrade is one-shot and redundant — drop it, or rename to something like upgrade_pending_plan_height.

Separately: the info label is a free-form proposal string. Cardinality-unfriendly and not useful as a metric axis — keep name (bounded by released upgrade handlers, ~20 lifetime values), drop info. Move it to a span attribute / event at upgrade-schedule time if you need it.
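
One way to catch this class of collision before it ships is a small check over the declared instruments. The sketch below assumes, as the comment above describes, that the exported series name is built from a namespace plus the instrument name and ignores the meter scope. The instrument type, the collisions function, and the way the sei_chain namespace is joined are illustrative only and are not part of the PR or of any exporter API.

```go
package main

import (
	"fmt"
	"sort"
)

// instrument pairs a meter scope with an instrument name (hypothetical type).
type instrument struct {
	scope string
	name  string
}

// collisions returns every exported series name that is declared under more
// than one entry. It assumes the exported name is namespace + "_" + name and
// that the meter scope does not take part in it.
func collisions(namespace string, instruments []instrument) map[string][]string {
	byseries := map[string][]string{}
	for _, in := range instruments {
		series := namespace + "_" + in.name
		byseries[series] = append(byseries[series], in.scope)
	}
	out := map[string][]string{}
	for series, scopes := range byseries {
		if len(scopes) > 1 {
			sort.Strings(scopes)
			out[series] = scopes
		}
	}
	return out
}

func main() {
	declared := []instrument{
		{"seicosmos_x_upgrade", "plan_height"},
		{"seicosmos_x_upgrade_keeper", "plan_height"},
		{"seicosmos_tasks", "scheduler_retries"},
	}
	for series, scopes := range collisions("sei_chain", declared) {
		fmt.Printf("%s is declared by %v\n", series, scopes)
	}
}
```

A check like this could run as a unit test over a hand-maintained list of instruments, so a second plan_height fails fast instead of surfacing as a dropped or double-counted series.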

Comment thread sei-cosmos/x/gov/keeper/msg_server.go Outdated
},
)
defer func() {
govMetrics.voteTotal.Add(goCtx, 1, otelmetric.WithAttributes(attribute.String("proposal_id", strconv.FormatUint(msg.ProposalId, 10))))
Contributor


Unbounded proposal_id label on voteTotal (and depositTotal below, and VoteWeighted).

Every governance proposal that ever receives a vote lives in the TSDB forever; multiplied by per-validator series, this grows monotonically without bound. Sei mainnet is at ~200 historical proposals today and continues to grow linearly.

Drop the label. A plain vote_total / deposit_total counter answers "how many votes per block?" which is the only thing the metric is actually useful for. Per-proposal vote/deposit breakdown is a query-time concern against on-chain events — not a metric axis.

Comment thread sei-cosmos/x/bank/keeper/msg_server.go Outdated
defer func() {
for _, a := range msg.Amount {
if a.Amount.IsInt64() {
bankMetrics.sendAmount.Record(goCtx, a.Amount.Int64(), otelmetric.WithAttributes(attribute.String("denom", a.Denom)))
Contributor


Unbounded user-controlled denom label — most severe cardinality issue in this PR.

Sei supports tokenfactory denoms (factory/<creator>/<subdenom> — any address can mint) plus IBC denoms with ~100-char hash traces. Sei mainnet already has 10k+ factory denoms, growing unbounded. This same pattern appears on delegateAmount/redelegateAmount/undelegateAmount (staking), withdrawRewardAmount/withdrawCommissionAmount (distribution), and accountAmount (vesting) — every *Amount gauge across this PR.

Options, in order of preference:

  1. Drop denom entirely — record a per-module amount metric without the denom axis. If you need a breakdown, that's a log/span attribute.
  2. Bucket into denom_class — small helper that returns "usei" / "ibc" / "factory" / "other" based on prefix. Preserves the useful axis (native vs synthetic flow) without exposing the registry to thousands of long-string labels.
  3. If a specific dashboard needs aggregate-by-denom for native denoms, allowlist a fixed set via Prometheus metric_relabel_configs rather than emitting them as labels.

This would be a hard block on a single-metric PR; flagging now so it doesn't go out as part of a larger migration.
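
Option 2 can be sketched as a prefix check. This is a minimal illustration under stated assumptions: the function name denomClass and the exact matching rules are hypothetical, and the PR's telemetry.DenomClass may differ in details that are not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// denomClass collapses an arbitrary denom into one of four fixed classes so
// that the metric label stays bounded. Hypothetical sketch; the matching
// rules are assumptions based on the prefixes named in the review.
func denomClass(denom string) string {
	switch {
	case denom == "usei":
		return "usei"
	case strings.HasPrefix(denom, "ibc/"):
		return "ibc"
	case strings.HasPrefix(denom, "factory/"):
		return "factory"
	default:
		return "other"
	}
}

func main() {
	// The ibc and factory values are made-up examples, not real denoms.
	for _, d := range []string{"usei", "ibc/ABC123", "factory/sei1creator/sub", "uatom"} {
		fmt.Printf("%-26s -> %s\n", d, denomClass(d))
	}
}
```

However many tokenfactory or IBC denoms exist on chain, the label takes at most four values, so the number of series per gauge no longer depends on user-created denoms.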

}

func (c *BoundedCache) emitKeysEvictedMetrics(keysToEvict int) {
storeMetrics.boundedCache.Record(context.Background(), int64(keysToEvict), otelmetric.WithAttributes(attribute.String("type", "keys_evicted")))
Contributor


nit: can we take a follow-up to wire the context through here or is there a blocker?

Contributor Author


I also looked into not using context.Background(), but this Cache type is currently only used in tests, and threading context meaningfully from ABCI → KV store → cache backend is a non-trivial SDK-wide change with marginal metric value. I would prefer to leave that for now if that's ok?

Contributor


Yeah, definitely okay to take as a follow-up for getting tracing wired


// cosmos_tx_gas_exceeded
func (g *basicGasMeter) incrGasExceededCounter(errorType string, descriptor string) {
storeMetrics.gasExceeded.Add(context.Background(), 1, otelmetric.WithAttributes(
Contributor


Same callout for context wiring.

commitStartTime := time.Now()
defer telemetry.MeasureSince(commitStartTime, "storeV2", "sc", "commit", "latency")
defer func() {
storev2Metrics.scCommitLatency.Record(context.Background(), time.Since(commitStartTime).Seconds())
Contributor


Same context wiring callout

Contributor Author


I can address this one since the historicalAbciQuery counter would get exemplar linkage — we could correlate "this rate-limited query" directly to a specific gRPC trace and that's the most actionable OTel win here.

Comment thread sei-cosmos/baseapp/metrics.go
Comment thread sei-cosmos/storev2/rootmulti/store.go

upgradeMetrics.planHeight.Record(ctx.Context(), plan.Height, otelmetric.WithAttributes(
attribute.String("name", plan.Name),
))

Missing info attribute on upgrade plan_height metric

Low Severity

The PR description specifies plan_height (and pendingPlanHeight) gauges carry both name + info labels, and the legacy telemetry call right below emits both. However, the new OTel Record calls only include attribute.String("name", plan.Name) — the info attribute is silently dropped, so the OTel metric loses the upgrade metadata that the legacy metric provides. This applies to both x/upgrade/abci.go and x/upgrade/keeper/keeper.go.

Additional Locations (1)

Reviewed by Cursor Bugbot for commit e1348dc.

@bdchatham
Contributor

Thanks for the quick turnaround — all four blockers look resolved. A few non-blocking callouts to consider for this PR or a followup:

  • Staking dropped the denom label entirely while bank/distribution/vesting went with denom_class. Worth adding denom_class on staking too for forward-compat with LSM / non-usei staking — cheaper than a schema break later.
  • DenomClass lives in sei-cosmos/telemetry/wrapper.go (the legacy go-metrics bridge) but it's OTel-only — probably belongs in its own file in the telemetry package.
  • Grafana panels don't surface OTel descriptions. The tightened "...in the last X" wording is great for the data sheet but invisible in panel legends. Either rename to *_amount_last or document on the dashboards when we author them.
  • abci_query_duration bucket ceiling tops at 10s. Archive store queries can run 10–60s — worth adding 30s/60s/120s buckets for this histogram now that path is bounded.
  • Recording rule + alert opportunity. With path bounded, route:sei_chain_abci_query_duration:p99_5m is cheap to author, and an AbciQueryRouteUnknownRatio alert would catch new paths leaking into */unknown before they go invisible.

LGTM otherwise. Happy to file these as separate issues if useful.

stakingMetrics.delegateTotal.Add(goCtx, 1)
// TODO(PLT-353): remove once staking_delegate_total verified
telemetry.IncrCounter(1, types.ModuleName, "delegate")
stakingMetrics.delegateAmount.Record(goCtx, msg.Amount.Amount.Int64())

Staking OTel amount gauges missing denom_class attribute

Medium Severity

The staking OTel gauges (delegateAmount, redelegateAmount, undelegateAmount) record without any denomination attribute, while the equivalent bank, distribution, and vesting OTel gauges all consistently include a denom_class attribute via telemetry.DenomClass(). The legacy metrics at the same call sites still emit a denom label. The PR description also lists a denom label for these staking metrics. This inconsistency means the staking OTel metrics lose denom-level observability that the other modules preserve.

Additional Locations (2)

Reviewed by Cursor Bugbot for commit 47ba551.

Contributor Author


The reason is that delegateAmount, redelegateAmount, and undelegateAmount are always in the base denom (usei), so there is no need to include the attribute in this case.

@amir-deris
Contributor Author

> Staking dropped the denom label entirely while bank/distribution/vesting went with denom_class. Worth adding denom_class on staking too for forward-compat with LSM / non-usei staking; cheaper than a schema break later.

For denom class: Delegate/Redelegate/Undelegate all operate on msg.Amount, which is validated against k.BondDenom(ctx), always usei on sei-chain. The metric unit {usei} reflects that assumption, so I didn't add the denom class attribute to those metrics. I also updated the metric unit to {utoken} for the cases where a denom is specified.

> DenomClass lives in sei-cosmos/telemetry/wrapper.go (the legacy go-metrics bridge) but it's OTel-only, so it probably belongs in its own file in the telemetry package.

Good point! Moved it to its own file.

> Grafana panels don't surface OTel descriptions. The tightened "...in the last X" wording is great for the data sheet but invisible in panel legends. Either rename to *_amount_last or document on the dashboards when we author them.

Sounds good! Added last to metric names.

> abci_query_duration bucket ceiling tops at 10s. Archive store queries can run 10–60s, so it is worth adding 30s/60s/120s buckets for this histogram now that path is bounded.

Added extra buckets for abci_query_duration.

> Recording rule + alert opportunity. With path bounded, route:sei_chain_abci_query_duration:p99_5m is cheap to author, and an AbciQueryRouteUnknownRatio alert would catch new paths leaking into */unknown before they go invisible.

Good point. Once we get to new dashboards/alerts, we can revisit this topic.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit 6fdfa8f.

Comment thread sei-cosmos/baseapp/abci.go
@amir-deris amir-deris added this pull request to the merge queue May 26, 2026
@amir-deris amir-deris removed this pull request from the merge queue due to a manual request May 26, 2026
@amir-deris amir-deris added this pull request to the merge queue May 26, 2026
Merged via the queue into main with commit 2667d96 May 26, 2026
50 checks passed
@amir-deris amir-deris deleted the amir/plt-352-migrate-sei-cosmos-to-otel branch May 26, 2026 16:44

3 participants