Refactor and fix the Prometheus metrics code by aditya1702 · Pull Request #554 · stellar/wallet-backend

aditya1702 · 2026-03-30T19:28:51Z

PRs:

* metrics: add concrete metric structs with wallet_ namespace prefix Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs. * metrics: migrate data models to use concrete *DBMetrics struct Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go. * metrics: migrate RPC service to use concrete *RPCMetrics struct Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers. * metrics: migrate middleware to use concrete metric structs Phase 4: Replace MetricsService interface in all middleware: - MetricsMiddleware: accepts *metrics.HTTPMetrics - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics - ComplexityLogger: accepts *metrics.GraphQLMetrics - AuthenticationMiddleware: accepts *metrics.AuthMetrics Update serve.go wiring to pass sub-structs from *metrics.Metrics. * metrics: migrate ingestion, indexer, and processors to concrete structs Phase 5+7: Replace MetricsService in ingestion pipeline: - IngestServiceConfig.Metrics now holds *metrics.Metrics - ingestService uses m.appMetrics.Ingestion.* for all metric calls - Indexer accepts *metrics.IngestionMetrics directly - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface * metrics: migrate all tests to real registries, delete legacy interface Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed. * make check * Add metrics tests * Add CollectAndCompare tests

* metrics: add concrete metric structs with wallet_ namespace prefix Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs. * metrics: migrate data models to use concrete *DBMetrics struct Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go. * metrics: migrate RPC service to use concrete *RPCMetrics struct Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers. * metrics: migrate middleware to use concrete metric structs Phase 4: Replace MetricsService interface in all middleware: - MetricsMiddleware: accepts *metrics.HTTPMetrics - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics - ComplexityLogger: accepts *metrics.GraphQLMetrics - AuthenticationMiddleware: accepts *metrics.AuthMetrics Update serve.go wiring to pass sub-structs from *metrics.Metrics. * metrics: migrate ingestion, indexer, and processors to concrete structs Phase 5+7: Replace MetricsService in ingestion pipeline: - IngestServiceConfig.Metrics now holds *metrics.Metrics - ingestService uses m.appMetrics.Ingestion.* for all metric calls - Indexer accepts *metrics.IngestionMetrics directly - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface * metrics: migrate all tests to real registries, delete legacy interface Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed. * refactor db metrics * make check * Add metrics tests * Add CollectAndCompare tests * fix db test * Add operation-level GraphQL metrics and middleware Introduce operation-level Prometheus collectors (operation duration histogram, operations counter, in-flight gauge, response size histogram) and rename the constructor to NewGraphQLMetrics. Replace heavy per-field timing/counters with a lightweight deprecated-field counter and complexity/response histograms to reduce cardinality and provide SLO-friendly metrics. Add GraphQLOperationMetrics middleware to record duration, throughput, errors and response size; add tests for operation and field middleware and update existing tests and registrations. Wire the new operation and field middlewares into the server handler. * Create graphql_field_metrics_test.go * make check * Add comments for DB metrics * Refactor ingestion metrics; add retries/errors Refactors Prometheus ingestion metrics and updates instrumentation across ingestion code. Duration was changed from a HistogramVec to a Histogram (calls updated), several metric names were renamed (ledgers/transactions/operations totals), BatchSize removed, and new metrics added: LagLedgers, LedgerFetchDuration, RetriesTotal, RetryExhaustionsTotal, ErrorsTotal (and adjusted Participants metric name/buckets). Instrumentation now observes ledger fetch duration, increments retry and exhaustion counters in fetch/flush/persist paths, reports errors on live ingestion failures, and updates lag when available. Tests updated to match new metric types, bucket counts, and include unit tests for the new metrics. * Enhance RPC metrics with histograms and gauges Refactor and expand RPC Prometheus instrumentation for better SLOs and observability. - Replace per-endpoint summary metrics and separate success/failure counters with: - wallet_rpc_request_duration_seconds (HistogramVec by method) - wallet_rpc_request_duration_seconds and wallet_rpc_method_duration_seconds use explicit rpcDurationBuckets - wallet_rpc_requests_total now has (method,status) labels for success/failure - Add wallet_rpc_in_flight_requests (Gauge) and wallet_rpc_response_size_bytes (HistogramVec) - Convert MethodDuration to a histogram and keep MethodErrorsTotal and MethodCallsTotal counters - Update registration to include new collectors and remove deprecated ones. - Update tests to assert new metrics, add histogram and bucket checks, and adjust transport counter tests to use (method,status) labels. - RPC service changes: - Remove heartbeat channel accessor from the interface and implementation - GetHealth now sets ServiceHealth and LatestLedger based on response and marks health=0 on errors - sendRPCRequest now tracks InFlightRequests, observes RequestDuration, records ResponseSizeBytes, and increments RequestsTotal with success/failure labels instead of old endpoint counters These changes improve latency and size visibility, simplify error/success accounting, and provide gauges useful for detecting RPC node stalls or connection exhaustion. * Update rpc.go * Rename pool label and expand pool/DB metrics Replace the pond pool "channel" label with a clearer "pool_name" label and rename the RegisterPoolMetrics parameter accordingly. Update pool metrics (use wallet_pool_tasks_dropped_total instead of tasks_completed) and tests to reflect the label/name changes. Add extensive documentation comments and new Prometheus metrics for pgxpool (constructing_conns gauge, acquire/empty-acquire counters, wait time counters, new_conns/canceled/max_lifetime/max_idle destroy counters) and improve help text for several metrics to provide better observability of pool and DB connection behavior. * Add QueryExecMode to DB pool config Expose pgx.QueryExecMode on PoolConfig and apply it when opening the connection pool. If non-zero, the value is copied into cfg.ConnConfig.DefaultQueryExecMode so callers can override pgx's default (cached prepared statements). The serve config now sets QueryExecMode to Exec to avoid server-side prepared statement caching which conflicts with PgBouncer in transaction pooling mode (SQLSTATE 42P05), and imports github.com/jackc/pgx/v5. * Refactor GraphQL metrics and remove RPC heartbeat Ensure GraphQL operation metrics properly decrement InFlightOperations exactly once by adding a responded guard and defer. Normalize GraphQL error labels: unrecognized extension codes now map to "unknown" (and the comment documents the closed set). Remove the heartbeatChannel from rpcService and its mock/tests, simplifying the RPC service surface and cleaning up related test assertions.

Copilot

Pull request overview

Refactors the wallet-backend observability layer by replacing the old MetricsService interface with a concrete metrics.Metrics struct (and sub-metrics structs), and updates ingestion/serve/loadtest/indexer/data paths to emit Prometheus metrics directly via those collectors.

Changes:

Introduces typed metrics groups (DB, RPC, Ingestion, HTTP, GraphQL, Auth) and updates call sites to use Prometheus collectors directly.
Adds/updates ingestion metrics (retries/exhaustions/errors, fetch duration, lag) and GraphQL operation/field middleware metrics.
Updates DB pool configuration to support QueryExecMode and defaults serve to QueryExecModeExec for PgBouncer compatibility.

Reviewed changes

Copilot reviewed 81 out of 81 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
internal/services/token_ingestion_test.go	Switch tests to real DB metrics
internal/services/mocks.go	Remove RPC heartbeat channel mock
internal/services/ingest.go	Wire typed metrics into ingest service
internal/services/ingest_live.go	Emit ingestion metrics (lag/errors/durations)
internal/services/ingest_backfill.go	Emit ingestion metrics in backfill paths
internal/services/fee_bump_service_test.go	Switch tests to real DB metrics
internal/serve/serve.go	Instantiate metrics registry; wire middleware/resolvers
internal/serve/middleware/middleware.go	Use AuthMetrics instead of MetricsService
internal/serve/middleware/middleware_test.go	Update auth middleware test wiring
internal/serve/middleware/metrics_middleware.go	Use HTTPMetrics collectors
internal/serve/middleware/graphql_operation_metrics.go	New operation-level GraphQL metrics
internal/serve/middleware/graphql_operation_metrics_test.go	Tests for op-level GraphQL metrics
internal/serve/middleware/graphql_field_metrics.go	Simplify to deprecated-field counting
internal/serve/middleware/graphql_field_metrics_test.go	Tests for deprecated-field metrics
internal/serve/middleware/complexity_logger.go	Use GraphQLMetrics complexity histogram
internal/serve/httphandler/health_test.go	Switch tests to real DB metrics
internal/serve/graphql/resolvers/transaction_resolvers_test.go	Switch tests to real DB metrics
internal/serve/graphql/resolvers/statechange_resolvers_test.go	Switch tests to real DB metrics
internal/serve/graphql/resolvers/resolver.go	Inject *metrics.Metrics into resolvers
internal/serve/graphql/resolvers/queries_resolvers_test.go	Switch tests to real DB metrics
internal/serve/graphql/resolvers/operation_resolvers_test.go	Switch tests to real DB metrics
internal/serve/graphql/resolvers/account_resolvers_test.go	Switch tests to real DB metrics
internal/metrics/metrics.go	Define Metrics root struct + helpers
internal/metrics/metrics_test.go	Tests for Metrics root struct
internal/metrics/db.go	New DBMetrics collectors
internal/metrics/db_test.go	DBMetrics tests
internal/metrics/rpc.go	New RPCMetrics collectors
internal/metrics/rpc_test.go	RPCMetrics tests
internal/metrics/ingestion.go	New IngestionMetrics collectors
internal/metrics/ingestion_test.go	IngestionMetrics tests
internal/metrics/http.go	New HTTPMetrics collectors
internal/metrics/http_test.go	HTTPMetrics tests
internal/metrics/graphql.go	New GraphQLMetrics collectors
internal/metrics/graphql_test.go	GraphQLMetrics tests
internal/metrics/auth.go	New AuthMetrics collectors
internal/metrics/auth_test.go	AuthMetrics tests
internal/metrics/pool.go	New pool + DB pool stat collectors
internal/metrics/pool_test.go	Pool metrics tests
internal/metrics/mocks.go	Remove old MetricsService mock
internal/loadtest/runner.go	Wire typed metrics into loadtest
internal/integrationtests/infrastructure/main_setup.go	Wire typed metrics into integration setup
internal/ingest/ingest.go	Wire typed metrics into ingest cmd
internal/indexer/indexer.go	Pass IngestionMetrics into processors
internal/indexer/processors/metrics.go	Remove old processor metrics interface
internal/indexer/processors/state_change_builder.go	Use IngestionMetrics in builder
internal/indexer/processors/state_change_builder_test.go	Update builder tests to typed metrics
internal/indexer/processors/accounts.go	Emit state-change processor duration
internal/indexer/processors/trustlines.go	Emit state-change processor duration
internal/indexer/processors/effects.go	Emit state-change processor duration
internal/indexer/processors/contract_deploy.go	Emit state-change processor duration
internal/indexer/processors/token_transfer.go	Emit state-change processor duration
internal/indexer/processors/sac_balances.go	Use typed ingestion metrics
internal/indexer/processors/contracts/sac.go	Use typed ingestion metrics
internal/db/db.go	Add PoolConfig.QueryExecMode support
internal/data/models.go	Inject *metrics.DBMetrics into models
internal/data/accounts.go	Switch to DBMetrics collectors
internal/data/accounts_test.go	Switch tests to real DB metrics
internal/data/transactions.go	Switch to DBMetrics collectors
internal/data/transactions_test.go	Switch tests to real DB metrics
internal/data/operations.go	Switch to DBMetrics collectors
internal/data/operations_test.go	Switch tests to real DB metrics
internal/data/statechanges.go	Switch to DBMetrics collectors
internal/data/ingest_store.go	Switch to DBMetrics collectors
internal/data/ingest_store_test.go	Switch tests to real DB metrics
internal/data/trustline_balances.go	Switch to DBMetrics collectors
internal/data/trustline_balances_test.go	Switch tests to real DB metrics
internal/data/trustline_assets.go	Switch to DBMetrics collectors
internal/data/trustline_assets_test.go	Switch tests to real DB metrics
internal/data/native_balances.go	Switch to DBMetrics collectors
internal/data/native_balances_test.go	Switch tests to real DB metrics
internal/data/sac_balances.go	Switch to DBMetrics collectors
internal/data/sac_balances_test.go	Switch tests to real DB metrics
internal/data/contract_tokens.go	Switch to DBMetrics collectors
internal/data/contract_tokens_test.go	Switch tests to real DB metrics
internal/data/account_contract_tokens.go	Switch to DBMetrics collectors
cmd/channel_account.go	Wire typed metrics into CLI
go.mod	Add indirect dependency

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-30T19:33:52Z

internal/services/ingest_live.go

+	// Set initial lag now that the backend buffer is populated
+	if backendTip, lagErr := m.ledgerBackend.GetLatestLedgerSequence(ctx); lagErr == nil {
+		m.appMetrics.Ingestion.LagLedgers.Set(float64(backendTip - startLedger))
+	}


backendTip - startLedger can underflow if the backend tip is behind startLedger (or if startLedger is misconfigured ahead of the tip), producing a huge lag value due to uint wraparound. Consider guarding with if backendTip >= startLedger { ... } else { LagLedgers.Set(0) } (or use signed arithmetic) so the gauge never reports a bogus negative/overflowed lag.

Copilot · 2026-03-30T19:33:52Z

internal/services/ingest_live.go

+		// Update lag metric (non-blocking atomic read)
+		if backendTip, lagErr := m.ledgerBackend.GetLatestLedgerSequence(ctx); lagErr == nil {
+			m.appMetrics.Ingestion.LagLedgers.Set(float64(backendTip - currentLedger))
+		}


Same potential underflow here: backendTip - currentLedger will wrap if currentLedger ever exceeds backendTip, causing LagLedgers to spike to a huge value. Please clamp at 0 (or use signed arithmetic) before setting the gauge.

Copilot · 2026-03-30T19:33:53Z

internal/data/transactions.go

+			duration := time.Since(start).Seconds()
+			m.Metrics.QueryDuration.WithLabelValues("BatchCopy", "transactions").Observe(duration)
+			m.Metrics.BatchSize.WithLabelValues("BatchCopy", "transactions").Observe(float64(len(txs)))
+			m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "transactions").Inc()


On the transactions_accounts COPY error path, QueryErrors is incremented for transactions_accounts but QueriesTotal is incremented for transactions (and transactions_accounts total is never incremented). This makes per-table error rates inconsistent (errors without corresponding totals). Consider incrementing QueriesTotal for transactions_accounts (and, if desired, observing duration for that table) in this error branch.

Suggested change

m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "transactions").Inc()

m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "transactions_accounts").Inc()

Copilot · 2026-03-30T19:33:53Z

internal/data/operations.go

+			duration := time.Since(start).Seconds()
+			m.Metrics.QueryDuration.WithLabelValues("BatchCopy", "operations").Observe(duration)
+			m.Metrics.BatchSize.WithLabelValues("BatchCopy", "operations").Observe(float64(len(operations)))
+			m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "operations").Inc()


On the operations_accounts COPY error path, QueryErrors is incremented for operations_accounts but QueriesTotal is incremented for operations (and operations_accounts total is never incremented). This skews per-table error-rate calculations. Consider incrementing QueriesTotal for operations_accounts (and optionally recording duration for that table) in this error branch.

Suggested change

m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "operations").Inc()

m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "operations_accounts").Inc()

Copilot · 2026-03-30T19:33:53Z

internal/metrics/pool_test.go

+func TestRegisterPoolMetrics_Lint(t *testing.T) {
+	reg := prometheus.NewRegistry()
+	pool := pond.NewPool(1)
+	defer pool.StopAndWait()
+
+	RegisterPoolMetrics(reg, "test", pool)
+
+	families, err := reg.Gather()
+	require.NoError(t, err)
+
+	for _, f := range families {
+		for _, c := range collectorsFromFamily(f) {
+			problems, lintErr := testutil.CollectAndLint(c)
+			require.NoError(t, lintErr)
+			assert.Empty(t, problems, "lint problems for %s", f.GetName())
+		}
+	}
+}
+
+// collectorsFromFamily creates minimal GaugeFunc/CounterFunc wrappers to lint gathered families.
+// Since GaugeFunc/CounterFunc are not directly lintable after registration, we lint via Gather.
+func collectorsFromFamily(f *dto.MetricFamily) []prometheus.Collector {
+	// For GaugeFunc/CounterFunc, the simplest approach is to skip per-collector lint
+	// and rely on the gather-level validation. Return empty.
+	_ = f
+	return nil
+}


This test is named as if it lints the registered pool metrics, but collectorsFromFamily always returns nil, so no linting is actually performed and the assertions never run. Either remove this test, rename it to reflect what it does, or rework RegisterPoolMetrics/the test to lint the actual collectors before/after registration.

Copilot · 2026-03-30T19:33:54Z

internal/metrics/db.go

+	// QueryDuration tracks the latency of individual database queries — the primary DB performance metric.
+	// Use to detect slow queries and set SLOs on database response times.
+	//
+	//	histogram_quantile(0.99, rate(wallet_db_query_duration_seconds_bucket[5m]))
+	//
+	// Labels: query_type (e.g. "select", "insert", "upsert"), table.
+	QueryDuration *prometheus.HistogramVec
+
+	// QueriesTotal counts completed database queries.
+	// Use for throughput dashboards and per-table query volume analysis.
+	// Labels: query_type (e.g. "select", "insert", "upsert"), table.
+	QueriesTotal *prometheus.CounterVec


The doc comments say the query_type label is values like "select"/"insert"/"upsert", but call sites in this PR use function names such as "GetByHash", "BatchCopy", etc. Please update the label docs (and examples) to match the actual label cardinality/semantics, or align call sites to the documented label values.

aditya1702 added 2 commits March 30, 2026 15:16

Copilot AI review requested due to automatic review settings March 30, 2026 19:28

Copilot started reviewing on behalf of aditya1702 March 30, 2026 19:29 View session

aditya1702 merged commit e50d576 into main Mar 30, 2026
12 checks passed

aditya1702 deleted the feature/finalize-metrics branch March 30, 2026 19:33

Copilot AI reviewed Mar 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor and fix the Prometheus metrics code#554

Refactor and fix the Prometheus metrics code#554
aditya1702 merged 2 commits intomainfrom
feature/finalize-metrics

aditya1702 commented Mar 30, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

Copilot AI Mar 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "transactions").Inc()
	m.Metrics.QueriesTotal.WithLabelValues("BatchCopy", "transactions_accounts").Inc()

Conversation

aditya1702 commented Mar 30, 2026

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants