Fix all metrics by aditya1702 · Pull Request #545 · stellar/wallet-backend

aditya1702 · 2026-03-19T14:55:48Z

Closes #548

What

This PR overhauls the observability layer across all subsystems of the wallet-backend — DB queries, connection pools, ingestion pipeline, RPC client, and GraphQL API. The changes fall into six areas:

1. DB Query Metrics — Correctness & Error Classification

Fixed metric emission ordering: QueriesTotal is now incremented before the error check, so every query (success or failure) is counted.
Previously, early-return error paths skipped the counter, under-reporting total query volume.
Error classification: Error labels now use utils.GetDBErrorType(err) instead of a generic "query_error" string, enabling breakdown by error
category (timeout, connection, constraint violation, etc.).
Removed unused TransactionsTotal / TransactionDuration metrics from DBMetrics — these were never populated by any call site.
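
A minimal sketch of the ordering fix and the error classification, using only the standard library. The map-based counter stands in for a Prometheus CounterVec, and `recordQuery`, `classifyDBError`, and the label values are hypothetical names for illustration; the real `utils.GetDBErrorType` may return different categories.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// counters is a stand-in for Prometheus CounterVecs, keyed by "name{labels}".
type counters map[string]int

// classifyDBError is a hypothetical classifier in the spirit of
// utils.GetDBErrorType; the real categories may differ.
func classifyDBError(err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "timeout"
	case errors.Is(err, context.Canceled):
		return "canceled"
	default:
		return "unknown"
	}
}

// recordQuery counts every query first, then classifies the failure, if any.
func recordQuery(c counters, queryType string, run func() error) error {
	err := run()
	// Increment before the error check so failed queries are counted too.
	c["queries_total{"+queryType+"}"]++
	if err != nil {
		c["query_errors_total{"+queryType+","+classifyDBError(err)+"}"]++
		return fmt.Errorf("running %s query: %w", queryType, err)
	}
	return nil
}

// runExample records one successful and one timed-out query.
func runExample() counters {
	c := counters{}
	_ = recordQuery(c, "SELECT", func() error { return nil })
	_ = recordQuery(c, "SELECT", func() error { return context.DeadlineExceeded })
	return c
}

func main() {
	c := runExample()
	// Prints "2 1": both queries counted, one classified as a timeout.
	fmt.Println(c["queries_total{SELECT}"], c["query_errors_total{SELECT,timeout}"])
}
```

Because the total is incremented unconditionally, an error rate can be derived as errors divided by total without the denominator missing failed queries.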

2. Ingestion Metrics — Retry, Error & Lag Observability

Added RetriesTotal, RetryExhaustionsTotal, and ErrorsTotal counters with a phase label (ledger_fetch, db_persist, batch_flush,
ingest_live) to track retry behavior and failure modes.
Added LedgerFetchDuration histogram to measure the full retry-inclusive latency of fetching a ledger from the backend.
Added LagLedgers gauge that captures how far behind live ingestion is from the RPC tip at startup.
Changed Duration from a HistogramVec (with no labels) to a plain Histogram — removing the unnecessary label dimension.
Tuned histogram buckets to better match real-world ingestion latencies.
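
The retry counters can be pictured with the following sketch, which assumes a generic retry helper. `withRetry` and `ingestionCounters` are hypothetical; the maps stand in for the RetriesTotal and RetryExhaustionsTotal CounterVecs keyed by the phase label, and backoff between attempts is omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// ingestionCounters stands in for the RetriesTotal and RetryExhaustionsTotal
// CounterVecs; each map is keyed by the phase label.
type ingestionCounters struct {
	retries     map[string]int
	exhaustions map[string]int
}

func newIngestionCounters() *ingestionCounters {
	return &ingestionCounters{retries: map[string]int{}, exhaustions: map[string]int{}}
}

// withRetry is a hypothetical helper: it runs op up to maxAttempts times,
// counting each retry and a single exhaustion if every attempt fails.
func withRetry(c *ingestionCounters, phase string, maxAttempts int, op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			c.retries[phase]++ // another attempt will follow
		}
	}
	c.exhaustions[phase]++
	return fmt.Errorf("%s: retries exhausted after %d attempts: %w", phase, maxAttempts, err)
}

// runExample exercises one recovering phase and one phase that gives up.
func runExample() *ingestionCounters {
	c := newIngestionCounters()

	// ledger_fetch fails twice, then succeeds on the third attempt.
	calls := 0
	_ = withRetry(c, "ledger_fetch", 3, func() error {
		calls++
		if calls < 3 {
			return errors.New("backend unavailable")
		}
		return nil
	})

	// db_persist never succeeds, so its retries are exhausted.
	_ = withRetry(c, "db_persist", 3, func() error { return errors.New("deadlock") })
	return c
}

func main() {
	c := runExample()
	fmt.Println(c.retries["ledger_fetch"], c.exhaustions["ledger_fetch"])
	fmt.Println(c.retries["db_persist"], c.exhaustions["db_persist"])
}
```

Keeping retries and exhaustions as separate counters lets an operator distinguish a phase that is recovering (retries rising, exhaustions flat) from one that is giving up.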

3. RPC Metrics — Histograms, Gauges & Structure

Added ServiceHealth gauge (1 = healthy, 0 = unhealthy) set during the heartbeat loop and on GetHealth failures.
Added LatestLedger gauge updated on each successful health check.
Replaced the transport-level RequestDuration summary with a histogram and added explicit RequestsTotal and RequestErrors counters at
the transport layer.
Added tailored histogram buckets (rpcDurationBuckets) covering 10ms–10s to match Stellar RPC latency profiles.
Removed the unused GetHeartbeatChannel() method from the RPCServiceInterface.
Added PromQL-ready doc comments to every metric field.
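
As a rough illustration of the health gauge semantics described above, the sketch below updates two gauge stand-ins from a health check result. `rpcGauges`, `healthResult`, and `recordHealth` are hypothetical names, the plain struct fields replace real Prometheus gauges (and are not safe for concurrent use), and treating any status other than "healthy" as unhealthy is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// rpcGauges stands in for the ServiceHealth and LatestLedger gauges.
type rpcGauges struct {
	serviceHealth float64 // 1 = healthy, 0 = unhealthy
	latestLedger  float64
}

// healthResult is a simplified, hypothetical health check response.
type healthResult struct {
	Status       string
	LatestLedger uint32
}

// recordHealth updates the gauges from one health check. On failure only the
// health gauge changes, so LatestLedger keeps its last known value.
func recordHealth(g *rpcGauges, res healthResult, err error) {
	if err != nil || res.Status != "healthy" {
		g.serviceHealth = 0
		return
	}
	g.serviceHealth = 1
	g.latestLedger = float64(res.LatestLedger)
}

// runExample returns gauge snapshots after a success and after a failure.
func runExample() (afterOK, afterErr rpcGauges) {
	g := &rpcGauges{}
	recordHealth(g, healthResult{Status: "healthy", LatestLedger: 100}, nil)
	afterOK = *g
	recordHealth(g, healthResult{}, errors.New("connection refused"))
	afterErr = *g
	return afterOK, afterErr
}

func main() {
	ok, failed := runExample()
	fmt.Println(ok.serviceHealth, ok.latestLedger)
	fmt.Println(failed.serviceHealth, failed.latestLedger)
}
```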

4. Pool Metrics — Worker Pool + DB Connection Pool

Worker pool: renamed the channel label to pool_name (using ConstLabels) and added wallet_pool_tasks_waiting and
wallet_pool_tasks_submitted_total gauges for queue-depth and throughput visibility.
DB connection pool: expanded from 4 to 12 metrics — added constructing_conns, acquire_total, empty_acquire_total, canceled_acquire_total,
new_conns_total, max_lifetime_destroy_total, max_idle_destroy_total to cover the full pgxpool stat surface.
Replaced the misleading acquire_duration_seconds (in fact a monotonic counter of cumulative wait time) with explicitly named wait-time counters.

5. GraphQL Metrics — Operation-Level Observability

Added a new operation-level metrics middleware (graphql_operation_metrics.go) that tracks end-to-end operation latency, in-flight operations,
response size, and error counts — all labeled by operation name and type (query/mutation).
Refactored the existing field-level middleware to be simpler: removed error tracking and duration from field interceptors (now handled at the operation
level) and kept only field-level resolution counts.
Added ResponseSize histogram and InFlightOperations gauge.
Added comprehensive test coverage for both middleware layers.

6. DB Pool Config

Added QueryExecMode to PoolConfig, allowing callers to override pgx's default CacheStatement mode when needed.

Why

The existing metrics had several gaps that made production debugging and alerting difficult:

Silent failures: The query counter was skipped on early-return error paths, so failed queries were missing from dashboard totals.
No retry visibility: Ingestion retries and exhaustions were invisible — an operator couldn't tell if the system was recovering or silently failing.
Coarse RPC observability: Only summaries (no histograms) were available, making it impossible to compute proper percentile alerts. There was no
health gauge, so degraded RPC state required log-grepping.
Pool blind spots: Worker pool queue depth and DB connection pool lifecycle stats (new connections, destroys, acquire contention) weren't tracked —
making it hard to diagnose connection exhaustion or pool saturation.
GraphQL metrics were field-only: There was no way to see operation-level latency, throughput, or in-flight concurrency — critical for API SLO
tracking.

Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs.

Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go.

Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers.

Phase 4: Replace MetricsService interface in all middleware:
- MetricsMiddleware: accepts *metrics.HTTPMetrics
- GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics
- ComplexityLogger: accepts *metrics.GraphQLMetrics
- AuthenticationMiddleware: accepts *metrics.AuthMetrics

Update serve.go wiring to pass sub-structs from *metrics.Metrics.

Phase 5+7: Replace MetricsService in ingestion pipeline:
- IngestServiceConfig.Metrics now holds *metrics.Metrics
- ingestService uses m.appMetrics.Ingestion.* for all metric calls
- Indexer accepts *metrics.IngestionMetrics directly
- All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly
- loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface

Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed.

Introduce operation-level Prometheus collectors (operation duration histogram, operations counter, in-flight gauge, response size histogram) and rename the constructor to NewGraphQLMetrics. Replace heavy per-field timing/counters with a lightweight deprecated-field counter and complexity/response histograms to reduce cardinality and provide SLO-friendly metrics. Add GraphQLOperationMetrics middleware to record duration, throughput, errors and response size; add tests for operation and field middleware and update existing tests and registrations. Wire the new operation and field middlewares into the server handler.

Refactors Prometheus ingestion metrics and updates instrumentation across ingestion code. Duration was changed from a HistogramVec to a Histogram (calls updated), several metric names were renamed (ledgers/transactions/operations totals), BatchSize removed, and new metrics added: LagLedgers, LedgerFetchDuration, RetriesTotal, RetryExhaustionsTotal, ErrorsTotal (and adjusted Participants metric name/buckets). Instrumentation now observes ledger fetch duration, increments retry and exhaustion counters in fetch/flush/persist paths, reports errors on live ingestion failures, and updates lag when available. Tests updated to match new metric types, bucket counts, and include unit tests for the new metrics.

Refactor and expand RPC Prometheus instrumentation for better SLOs and observability.

- Replace per-endpoint summary metrics and separate success/failure counters with:
  - wallet_rpc_request_duration_seconds (HistogramVec by method)
  - wallet_rpc_request_duration_seconds and wallet_rpc_method_duration_seconds use explicit rpcDurationBuckets
  - wallet_rpc_requests_total now has (method,status) labels for success/failure
- Add wallet_rpc_in_flight_requests (Gauge) and wallet_rpc_response_size_bytes (HistogramVec)
- Convert MethodDuration to a histogram and keep MethodErrorsTotal and MethodCallsTotal counters
- Update registration to include new collectors and remove deprecated ones.
- Update tests to assert new metrics, add histogram and bucket checks, and adjust transport counter tests to use (method,status) labels.
- RPC service changes:
  - Remove heartbeat channel accessor from the interface and implementation
  - GetHealth now sets ServiceHealth and LatestLedger based on response and marks health=0 on errors
  - sendRPCRequest now tracks InFlightRequests, observes RequestDuration, records ResponseSizeBytes, and increments RequestsTotal with success/failure labels instead of old endpoint counters

These changes improve latency and size visibility, simplify error/success accounting, and provide gauges useful for detecting RPC node stalls or connection exhaustion.

Replace the pond pool "channel" label with a clearer "pool_name" label and rename the RegisterPoolMetrics parameter accordingly. Update pool metrics (use wallet_pool_tasks_dropped_total instead of tasks_completed) and tests to reflect the label/name changes. Add extensive documentation comments and new Prometheus metrics for pgxpool (constructing_conns gauge, acquire/empty-acquire counters, wait time counters, new_conns/canceled/max_lifetime/max_idle destroy counters) and improve help text for several metrics to provide better observability of pool and DB connection behavior.

Expose pgx.QueryExecMode on PoolConfig and apply it when opening the connection pool. If non-zero, the value is copied into cfg.ConnConfig.DefaultQueryExecMode so callers can override pgx's default (cached prepared statements). The serve config now sets QueryExecMode to Exec to avoid server-side prepared statement caching which conflicts with PgBouncer in transaction pooling mode (SQLSTATE 42P05), and imports github.com/jackc/pgx/v5.

aristidesstaffieri · 2026-03-30T16:41:11Z

Code review

Found 3 issues:

ErrorsTotal comment claims a closed set of 7 error types, but classifyGraphQLError has an unbounded default branch. The default case at line 90 passes the raw code string through as a label value. Any unrecognized extension code creates a new time series, risking unbounded Prometheus cardinality. Fix: map unknown codes to "unknown" in the default branch.

wallet-backend/internal/metrics/graphql.go

Lines 37 to 40 in 6599e2f

    
    // ErrorsTotal counts GraphQL errors classified by type at the operation level.
    // Types: validation_error, parse_error, bad_input, auth_error, forbidden, internal_error, unknown.
    // Labels: operation_name, error_type.
    ErrorsTotal *prometheus.CounterVec

wallet-backend/internal/serve/middleware/graphql_operation_metrics.go

Lines 88 to 91 in 6599e2f

    
    	return "internal_error"
    default:
    	return code
    }
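
The suggested fix for the unbounded default branch could look roughly like this. The extension code strings on the left-hand side are assumptions for illustration (only the seven label values come from the ErrorsTotal comment), so this is a sketch of the closed-set idea rather than the project's actual mapping.

```go
package main

import "fmt"

// classifyGraphQLError maps an error extension code to one of a fixed set of
// label values. The code strings matched here are illustrative assumptions.
func classifyGraphQLError(code string) string {
	switch code {
	case "GRAPHQL_VALIDATION_FAILED":
		return "validation_error"
	case "GRAPHQL_PARSE_FAILED":
		return "parse_error"
	case "BAD_USER_INPUT":
		return "bad_input"
	case "UNAUTHENTICATED":
		return "auth_error"
	case "FORBIDDEN":
		return "forbidden"
	case "INTERNAL_SERVER_ERROR":
		return "internal_error"
	default:
		// Never pass the raw code through: an unrecognized code would
		// otherwise create a new time series per distinct value.
		return "unknown"
	}
}

func main() {
	fmt.Println(classifyGraphQLError("FORBIDDEN"))        // forbidden
	fmt.Println(classifyGraphQLError("SOME_FUTURE_CODE")) // unknown
}
```

With the default branch collapsed to "unknown", the error_type label can take at most seven values regardless of what codes resolvers or clients introduce later.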

heartbeatChannel is dead code -- allocated but unreachable after GetHeartbeatChannel() was removed from the interface. The channel is still created in NewRPCService (line 67) and stored in the struct (line 42), but no production code reads or writes to it. Additionally, RPCServiceMock.GetHeartbeatChannel() in mocks.go:61 is an orphaned mock method for a method that no longer exists on the interface.

wallet-backend/internal/services/rpc_service.go

Lines 41 to 44 in 6599e2f

    
    httpClient                 utils.HTTPClient
    heartbeatChannel           chan entities.RPCGetHealthResult
    metrics                    *metrics.RPCMetrics
    healthCheckWarningInterval time.Duration

wallet-backend/internal/services/rpc_service.go

Lines 66 to 72 in 6599e2f

    
    heartbeatChannel := make(chan entities.RPCGetHealthResult, 1)
    return &rpcService{
    	rpcURL:                     rpcURL,
    	httpClient:                 httpClient,
    	heartbeatChannel:           heartbeatChannel,
    	metrics:                    rpcMetrics,

InFlightOperations gauge can permanently inflate on panic. Inc() fires at line 30 before next(ctx) is called. Dec() only fires inside the returned response handler at line 64. If next(ctx) or the response handler panics, gqlgen's RecoverHandler catches it at a higher level but Dec() never runs. The codebase has explicit panic() calls in generated GraphQL code and resolvers, making this realistic in practice. Fix: use defer m.metrics.InFlightOperations.Dec() immediately after Inc(), or guard with a defer-recover.

wallet-backend/internal/serve/middleware/graphql_operation_metrics.go

Lines 29 to 35 in 6599e2f

    
    func (m *GraphQLOperationMetrics) Middleware(ctx context.Context, next graphql.OperationHandler) graphql.ResponseHandler {
    	m.metrics.InFlightOperations.Inc()
    	startTime := time.Now()
    	responseHandler := next(ctx)
    	return func(ctx context.Context) *graphql.Response {
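
One way to make the decrement panic-safe is sketched below with simplified stand-in types rather than the real gqlgen signatures. `operationHandler`, `responseHandler`, and `middleware` are hypothetical, an `atomic.Int64` replaces the Prometheus gauge, and the sketch assumes a response handler may be invoked more than once. Note that a plain defer of the decrement inside the middleware body would fire as soon as the middleware returns, before the response handler runs, so the defer sits inside the returned handler and a `sync.Once` guard keeps the decrement to exactly one.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Simplified stand-ins for the gqlgen handler types (assumptions).
type response struct{ Size int }
type responseHandler func() *response
type operationHandler func() responseHandler

// middleware increments the in-flight gauge and guarantees exactly one
// decrement, whether next panics, the response handler panics, or the
// response handler is called several times.
func middleware(inFlight *atomic.Int64, next operationHandler) responseHandler {
	inFlight.Add(1)
	var once sync.Once
	done := func() { once.Do(func() { inFlight.Add(-1) }) }

	// If next panics, handedOff stays false and the deferred call decrements
	// while the panic continues to propagate.
	handedOff := false
	defer func() {
		if !handedOff {
			done()
		}
	}()
	rh := next()
	handedOff = true

	return func() *response {
		defer done() // runs even if rh panics
		return rh()
	}
}

// runExample returns the gauge value after a panicking operation and after a
// normal operation whose response handler is called twice.
func runExample() (afterPanic, afterNormal int64) {
	var inFlight atomic.Int64

	func() {
		defer func() { _ = recover() }()
		middleware(&inFlight, func() responseHandler { panic("resolver panic") })
	}()
	afterPanic = inFlight.Load()

	rh := middleware(&inFlight, func() responseHandler {
		return func() *response { return &response{Size: 2} }
	})
	rh()
	rh()
	afterNormal = inFlight.Load()
	return afterPanic, afterNormal
}

func main() {
	fmt.Println(runExample()) // 0 0
}
```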

🤖 Generated with Claude Code


Ensure GraphQL operation metrics properly decrement InFlightOperations exactly once by adding a responded guard and defer. Normalize GraphQL error labels: unrecognized extension codes now map to "unknown" (and the comment documents the closed set). Remove the heartbeatChannel from rpcService and its mock/tests, simplifying the RPC service surface and cleaning up related test assertions.

* Break up the huge `MetricsService` interface (#543) * metrics: add concrete metric structs with wallet_ namespace prefix Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs. * metrics: migrate data models to use concrete *DBMetrics struct Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go. * metrics: migrate RPC service to use concrete *RPCMetrics struct Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers. * metrics: migrate middleware to use concrete metric structs Phase 4: Replace MetricsService interface in all middleware: - MetricsMiddleware: accepts *metrics.HTTPMetrics - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics - ComplexityLogger: accepts *metrics.GraphQLMetrics - AuthenticationMiddleware: accepts *metrics.AuthMetrics Update serve.go wiring to pass sub-structs from *metrics.Metrics. 
* metrics: migrate ingestion, indexer, and processors to concrete structs Phase 5+7: Replace MetricsService in ingestion pipeline: - IngestServiceConfig.Metrics now holds *metrics.Metrics - ingestService uses m.appMetrics.Ingestion.* for all metric calls - Indexer accepts *metrics.IngestionMetrics directly - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface * metrics: migrate all tests to real registries, delete legacy interface Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed. * make check * Add metrics tests * Add CollectAndCompare tests * Fix all metrics (#545) * metrics: add concrete metric structs with wallet_ namespace prefix Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs. * metrics: migrate data models to use concrete *DBMetrics struct Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). 
Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go. * metrics: migrate RPC service to use concrete *RPCMetrics struct Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers. * metrics: migrate middleware to use concrete metric structs Phase 4: Replace MetricsService interface in all middleware: - MetricsMiddleware: accepts *metrics.HTTPMetrics - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics - ComplexityLogger: accepts *metrics.GraphQLMetrics - AuthenticationMiddleware: accepts *metrics.AuthMetrics Update serve.go wiring to pass sub-structs from *metrics.Metrics. * metrics: migrate ingestion, indexer, and processors to concrete structs Phase 5+7: Replace MetricsService in ingestion pipeline: - IngestServiceConfig.Metrics now holds *metrics.Metrics - ingestService uses m.appMetrics.Ingestion.* for all metric calls - Indexer accepts *metrics.IngestionMetrics directly - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface * metrics: migrate all tests to real registries, delete legacy interface Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. 
Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed. * refactor db metrics * make check * Add metrics tests * Add CollectAndCompare tests * fix db test * Add operation-level GraphQL metrics and middleware Introduce operation-level Prometheus collectors (operation duration histogram, operations counter, in-flight gauge, response size histogram) and rename the constructor to NewGraphQLMetrics. Replace heavy per-field timing/counters with a lightweight deprecated-field counter and complexity/response histograms to reduce cardinality and provide SLO-friendly metrics. Add GraphQLOperationMetrics middleware to record duration, throughput, errors and response size; add tests for operation and field middleware and update existing tests and registrations. Wire the new operation and field middlewares into the server handler. * Create graphql_field_metrics_test.go * make check * Add comments for DB metrics * Refactor ingestion metrics; add retries/errors Refactors Prometheus ingestion metrics and updates instrumentation across ingestion code. Duration was changed from a HistogramVec to a Histogram (calls updated), several metric names were renamed (ledgers/transactions/operations totals), BatchSize removed, and new metrics added: LagLedgers, LedgerFetchDuration, RetriesTotal, RetryExhaustionsTotal, ErrorsTotal (and adjusted Participants metric name/buckets). Instrumentation now observes ledger fetch duration, increments retry and exhaustion counters in fetch/flush/persist paths, reports errors on live ingestion failures, and updates lag when available. Tests updated to match new metric types, bucket counts, and include unit tests for the new metrics. * Enhance RPC metrics with histograms and gauges Refactor and expand RPC Prometheus instrumentation for better SLOs and observability. 
- Replace per-endpoint summary metrics and separate success/failure counters with: - wallet_rpc_request_duration_seconds (HistogramVec by method) - wallet_rpc_request_duration_seconds and wallet_rpc_method_duration_seconds use explicit rpcDurationBuckets - wallet_rpc_requests_total now has (method,status) labels for success/failure - Add wallet_rpc_in_flight_requests (Gauge) and wallet_rpc_response_size_bytes (HistogramVec) - Convert MethodDuration to a histogram and keep MethodErrorsTotal and MethodCallsTotal counters - Update registration to include new collectors and remove deprecated ones. - Update tests to assert new metrics, add histogram and bucket checks, and adjust transport counter tests to use (method,status) labels. - RPC service changes: - Remove heartbeat channel accessor from the interface and implementation - GetHealth now sets ServiceHealth and LatestLedger based on response and marks health=0 on errors - sendRPCRequest now tracks InFlightRequests, observes RequestDuration, records ResponseSizeBytes, and increments RequestsTotal with success/failure labels instead of old endpoint counters These changes improve latency and size visibility, simplify error/success accounting, and provide gauges useful for detecting RPC node stalls or connection exhaustion. * Update rpc.go * Rename pool label and expand pool/DB metrics Replace the pond pool "channel" label with a clearer "pool_name" label and rename the RegisterPoolMetrics parameter accordingly. Update pool metrics (use wallet_pool_tasks_dropped_total instead of tasks_completed) and tests to reflect the label/name changes. Add extensive documentation comments and new Prometheus metrics for pgxpool (constructing_conns gauge, acquire/empty-acquire counters, wait time counters, new_conns/canceled/max_lifetime/max_idle destroy counters) and improve help text for several metrics to provide better observability of pool and DB connection behavior. 
* Add QueryExecMode to DB pool config Expose pgx.QueryExecMode on PoolConfig and apply it when opening the connection pool. If non-zero, the value is copied into cfg.ConnConfig.DefaultQueryExecMode so callers can override pgx's default (cached prepared statements). The serve config now sets QueryExecMode to Exec to avoid server-side prepared statement caching which conflicts with PgBouncer in transaction pooling mode (SQLSTATE 42P05), and imports github.com/jackc/pgx/v5. * Refactor GraphQL metrics and remove RPC heartbeat Ensure GraphQL operation metrics properly decrement InFlightOperations exactly once by adding a responded guard and defer. Normalize GraphQL error labels: unrecognized extension codes now map to "unknown" (and the comment documents the closed set). Remove the heartbeatChannel from rpcService and its mock/tests, simplifying the RPC service surface and cleaning up related test assertions.

* metrics: add concrete metric structs with wallet_ namespace prefix Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs. * metrics: migrate data models to use concrete *DBMetrics struct Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go. * metrics: migrate RPC service to use concrete *RPCMetrics struct Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers. * metrics: migrate middleware to use concrete metric structs Phase 4: Replace MetricsService interface in all middleware: - MetricsMiddleware: accepts *metrics.HTTPMetrics - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics - ComplexityLogger: accepts *metrics.GraphQLMetrics - AuthenticationMiddleware: accepts *metrics.AuthMetrics Update serve.go wiring to pass sub-structs from *metrics.Metrics. 
* metrics: migrate ingestion, indexer, and processors to concrete structs Phase 5+7: Replace MetricsService in ingestion pipeline: - IngestServiceConfig.Metrics now holds *metrics.Metrics - ingestService uses m.appMetrics.Ingestion.* for all metric calls - Indexer accepts *metrics.IngestionMetrics directly - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface * metrics: migrate all tests to real registries, delete legacy interface Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics. Net effect: -2050 lines of mock boilerplate removed. * refactor db metrics * make check * Add metrics tests * Add CollectAndCompare tests * fix db test * Add operation-level GraphQL metrics and middleware Introduce operation-level Prometheus collectors (operation duration histogram, operations counter, in-flight gauge, response size histogram) and rename the constructor to NewGraphQLMetrics. Replace heavy per-field timing/counters with a lightweight deprecated-field counter and complexity/response histograms to reduce cardinality and provide SLO-friendly metrics. Add GraphQLOperationMetrics middleware to record duration, throughput, errors and response size; add tests for operation and field middleware and update existing tests and registrations. Wire the new operation and field middlewares into the server handler. 
* Create graphql_field_metrics_test.go * make check * Add comments for DB metrics * Refactor ingestion metrics; add retries/errors Refactors Prometheus ingestion metrics and updates instrumentation across ingestion code. Duration was changed from a HistogramVec to a Histogram (calls updated), several metric names were renamed (ledgers/transactions/operations totals), BatchSize removed, and new metrics added: LagLedgers, LedgerFetchDuration, RetriesTotal, RetryExhaustionsTotal, ErrorsTotal (and adjusted Participants metric name/buckets). Instrumentation now observes ledger fetch duration, increments retry and exhaustion counters in fetch/flush/persist paths, reports errors on live ingestion failures, and updates lag when available. Tests updated to match new metric types, bucket counts, and include unit tests for the new metrics. * Enhance RPC metrics with histograms and gauges Refactor and expand RPC Prometheus instrumentation for better SLOs and observability. - Replace per-endpoint summary metrics and separate success/failure counters with: - wallet_rpc_request_duration_seconds (HistogramVec by method) - wallet_rpc_request_duration_seconds and wallet_rpc_method_duration_seconds use explicit rpcDurationBuckets - wallet_rpc_requests_total now has (method,status) labels for success/failure - Add wallet_rpc_in_flight_requests (Gauge) and wallet_rpc_response_size_bytes (HistogramVec) - Convert MethodDuration to a histogram and keep MethodErrorsTotal and MethodCallsTotal counters - Update registration to include new collectors and remove deprecated ones. - Update tests to assert new metrics, add histogram and bucket checks, and adjust transport counter tests to use (method,status) labels. 
- RPC service changes: - Remove heartbeat channel accessor from the interface and implementation - GetHealth now sets ServiceHealth and LatestLedger based on response and marks health=0 on errors - sendRPCRequest now tracks InFlightRequests, observes RequestDuration, records ResponseSizeBytes, and increments RequestsTotal with success/failure labels instead of old endpoint counters These changes improve latency and size visibility, simplify error/success accounting, and provide gauges useful for detecting RPC node stalls or connection exhaustion. * Update rpc.go * Rename pool label and expand pool/DB metrics Replace the pond pool "channel" label with a clearer "pool_name" label and rename the RegisterPoolMetrics parameter accordingly. Update pool metrics (use wallet_pool_tasks_dropped_total instead of tasks_completed) and tests to reflect the label/name changes. Add extensive documentation comments and new Prometheus metrics for pgxpool (constructing_conns gauge, acquire/empty-acquire counters, wait time counters, new_conns/canceled/max_lifetime/max_idle destroy counters) and improve help text for several metrics to provide better observability of pool and DB connection behavior. * Add QueryExecMode to DB pool config Expose pgx.QueryExecMode on PoolConfig and apply it when opening the connection pool. If non-zero, the value is copied into cfg.ConnConfig.DefaultQueryExecMode so callers can override pgx's default (cached prepared statements). The serve config now sets QueryExecMode to Exec to avoid server-side prepared statement caching which conflicts with PgBouncer in transaction pooling mode (SQLSTATE 42P05), and imports github.com/jackc/pgx/v5. * remove envelope_xdr and meta_xdr - 1 * fix all tests * Add back the envelopeXDR and metaXDR temporarily for tests * Refactor GraphQL metrics and remove RPC heartbeat Ensure GraphQL operation metrics properly decrement InFlightOperations exactly once by adding a responded guard and defer. 
  Normalize GraphQL error labels: unrecognized extension codes now map to "unknown" (and the comment documents the closed set).
  Remove the heartbeatChannel from rpcService and its mock/tests, simplifying the RPC service surface and cleaning up related test assertions.
* Break up the huge `MetricsService` interface (#543)
* metrics: add concrete metric structs with wallet_ namespace prefix
  Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct. The legacy MetricsService interface is kept temporarily and now delegates to the new structs.
* metrics: migrate data models to use concrete *DBMetrics struct
  Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go.
* metrics: migrate RPC service to use concrete *RPCMetrics struct
  Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers.
* metrics: migrate middleware to use concrete metric structs
  Phase 4: Replace MetricsService interface in all middleware:
  - MetricsMiddleware: accepts *metrics.HTTPMetrics
  - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics
  - ComplexityLogger: accepts *metrics.GraphQLMetrics
  - AuthenticationMiddleware: accepts *metrics.AuthMetrics
  Update serve.go wiring to pass sub-structs from *metrics.Metrics.
* metrics: migrate ingestion, indexer, and processors to concrete structs
  Phase 5+7: Replace MetricsService in ingestion pipeline:
  - IngestServiceConfig.Metrics now holds *metrics.Metrics
  - ingestService uses m.appMetrics.Ingestion.* for all metric calls
  - Indexer accepts *metrics.IngestionMetrics directly
  - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly
  - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface
* metrics: migrate all tests to real registries, delete legacy interface
  Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics.
  Net effect: -2050 lines of mock boilerplate removed.
* make check
* Add metrics tests
* Add CollectAndCompare tests
* Fix all metrics (#545)
* metrics: add concrete metric structs with wallet_ namespace prefix
  Phase 1 of metrics refactor: create domain-specific metric structs (DBMetrics, RPCMetrics, IngestionMetrics, HTTPMetrics, GraphQLMetrics, AuthMetrics) with constructors taking prometheus.Registerer. Add pool registration functions. Rewrite metrics.go to compose sub-structs in a top-level Metrics struct.
  The legacy MetricsService interface is kept temporarily and now delegates to the new structs.
* metrics: migrate data models to use concrete *DBMetrics struct
  Phase 2: Replace MetricsService interface with *metrics.DBMetrics in all 11 data model structs. Call sites now use direct Prometheus API (e.g., m.Metrics.QueryDuration.WithLabelValues(...).Observe(...)). Add DBMetrics() bridge method to legacy MetricsService interface for callers that still create via NewMetricsService(). Update NewModels() signature and all wiring in serve.go, ingest.go, and loadtest/runner.go.
* metrics: migrate RPC service to use concrete *RPCMetrics struct
  Phase 3: Replace MetricsService interface with *metrics.RPCMetrics in rpcService. Call sites now use direct Prometheus API (e.g., r.metrics.MethodCallsTotal.WithLabelValues(...).Inc()). Add RPCMetrics() bridge method to legacy MetricsService interface. Update all NewRPCService callers.
* metrics: migrate middleware to use concrete metric structs
  Phase 4: Replace MetricsService interface in all middleware:
  - MetricsMiddleware: accepts *metrics.HTTPMetrics
  - GraphQLFieldMetrics: accepts *metrics.GraphQLMetrics
  - ComplexityLogger: accepts *metrics.GraphQLMetrics
  - AuthenticationMiddleware: accepts *metrics.AuthMetrics
  Update serve.go wiring to pass sub-structs from *metrics.Metrics.
* metrics: migrate ingestion, indexer, and processors to concrete structs
  Phase 5+7: Replace MetricsService in ingestion pipeline:
  - IngestServiceConfig.Metrics now holds *metrics.Metrics
  - ingestService uses m.appMetrics.Ingestion.* for all metric calls
  - Indexer accepts *metrics.IngestionMetrics directly
  - All processors accept *metrics.IngestionMetrics instead of MetricsServiceInterface, calling StateChangeProcessingDuration directly
  - loadtest/runner.go and ingest/ingest.go create *metrics.Metrics directly instead of going through the legacy interface
* metrics: migrate all tests to real registries, delete legacy interface
  Phase 6: Replace MockMetricsService + .On().Maybe() chains with real prometheus.NewRegistry() + metrics.NewMetrics(reg) in all 23 test files. Delete MetricsService interface, metricsService struct, mocks.go, processors/metrics.go, and metrics_test.go (to be rewritten). Update resolver.go to accept *metrics.Metrics directly. Remove legacy MetricsService field from serve.go handlerDeps. Update cmd/channel_account to use *metrics.Metrics.
  Net effect: -2050 lines of mock boilerplate removed.
* refactor db metrics
* make check
* Add metrics tests
* Add CollectAndCompare tests
* fix db test
* Add operation-level GraphQL metrics and middleware
  Introduce operation-level Prometheus collectors (operation duration histogram, operations counter, in-flight gauge, response size histogram) and rename the constructor to NewGraphQLMetrics. Replace heavy per-field timing/counters with a lightweight deprecated-field counter and complexity/response histograms to reduce cardinality and provide SLO-friendly metrics. Add GraphQLOperationMetrics middleware to record duration, throughput, errors and response size; add tests for operation and field middleware and update existing tests and registrations. Wire the new operation and field middlewares into the server handler.
* Create graphql_field_metrics_test.go
* make check
* Add comments for DB metrics
* Refactor ingestion metrics; add retries/errors
  Refactors Prometheus ingestion metrics and updates instrumentation across ingestion code. Duration was changed from a HistogramVec to a Histogram (calls updated), several metric names were renamed (ledgers/transactions/operations totals), BatchSize removed, and new metrics added: LagLedgers, LedgerFetchDuration, RetriesTotal, RetryExhaustionsTotal, ErrorsTotal (and adjusted Participants metric name/buckets). Instrumentation now observes ledger fetch duration, increments retry and exhaustion counters in fetch/flush/persist paths, reports errors on live ingestion failures, and updates lag when available. Tests updated to match new metric types, bucket counts, and include unit tests for the new metrics.
* Enhance RPC metrics with histograms and gauges
  Refactor and expand RPC Prometheus instrumentation for better SLOs and observability.
  - Replace per-endpoint summary metrics and separate success/failure counters with:
    - wallet_rpc_request_duration_seconds (HistogramVec by method)
    - wallet_rpc_request_duration_seconds and wallet_rpc_method_duration_seconds use explicit rpcDurationBuckets
    - wallet_rpc_requests_total now has (method,status) labels for success/failure
  - Add wallet_rpc_in_flight_requests (Gauge) and wallet_rpc_response_size_bytes (HistogramVec)
  - Convert MethodDuration to a histogram and keep MethodErrorsTotal and MethodCallsTotal counters
  - Update registration to include new collectors and remove deprecated ones.
  - Update tests to assert new metrics, add histogram and bucket checks, and adjust transport counter tests to use (method,status) labels.
  - RPC service changes:
    - Remove heartbeat channel accessor from the interface and implementation
    - GetHealth now sets ServiceHealth and LatestLedger based on response and marks health=0 on errors
    - sendRPCRequest now tracks InFlightRequests, observes RequestDuration, records ResponseSizeBytes, and increments RequestsTotal with success/failure labels instead of old endpoint counters
  These changes improve latency and size visibility, simplify error/success accounting, and provide gauges useful for detecting RPC node stalls or connection exhaustion.
* Update rpc.go
* Rename pool label and expand pool/DB metrics
  Replace the pond pool "channel" label with a clearer "pool_name" label and rename the RegisterPoolMetrics parameter accordingly. Update pool metrics (use wallet_pool_tasks_dropped_total instead of tasks_completed) and tests to reflect the label/name changes. Add extensive documentation comments and new Prometheus metrics for pgxpool (constructing_conns gauge, acquire/empty-acquire counters, wait time counters, new_conns/canceled/max_lifetime/max_idle destroy counters) and improve help text for several metrics to provide better observability of pool and DB connection behavior.
* Add QueryExecMode to DB pool config
  Expose pgx.QueryExecMode on PoolConfig and apply it when opening the connection pool. If non-zero, the value is copied into cfg.ConnConfig.DefaultQueryExecMode so callers can override pgx's default (cached prepared statements). The serve config now sets QueryExecMode to Exec to avoid server-side prepared statement caching which conflicts with PgBouncer in transaction pooling mode (SQLSTATE 42P05), and imports github.com/jackc/pgx/v5.
* Refactor GraphQL metrics and remove RPC heartbeat
  Ensure GraphQL operation metrics properly decrement InFlightOperations exactly once by adding a responded guard and defer.
  Normalize GraphQL error labels: unrecognized extension codes now map to "unknown" (and the comment documents the closed set).
  Remove the heartbeatChannel from rpcService and its mock/tests, simplifying the RPC service surface and cleaning up related test assertions.
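
The last commit message mentions two small patterns: decrementing InFlightOperations exactly once via a responded guard plus defer, and mapping unrecognized error codes to "unknown". The sketch below illustrates both under stated assumptions and is not the PR's code: an atomic counter stands in for the Prometheus gauge, and `trackOperation`, `normalizeCode`, `handle`, and the contents of `knownCodes` are hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// inFlight is a stand-in for a gauge such as InFlightOperations (assumption:
// the real code uses a Prometheus gauge, not this counter).
var inFlight atomic.Int64

// knownCodes is a hypothetical closed set of error labels.
var knownCodes = map[string]bool{
	"BAD_USER_INPUT":        true,
	"INTERNAL_SERVER_ERROR": true,
}

// normalizeCode maps any code outside the closed set to "unknown",
// which keeps the number of distinct label values bounded.
func normalizeCode(code string) string {
	if knownCodes[code] {
		return code
	}
	return "unknown"
}

// trackOperation increments the counter and returns a finish function that
// decrements it at most once, however many times it is called.
// The guard is a plain bool, so finish must stay on a single goroutine.
func trackOperation() (finish func()) {
	inFlight.Add(1)
	responded := false
	return func() {
		if responded {
			return
		}
		responded = true
		inFlight.Add(-1)
	}
}

// handle simulates one operation. The deferred call covers paths that never
// reach the response step; the guard makes the second call a no-op.
func handle(respond bool) {
	finish := trackOperation()
	defer finish()
	if respond {
		finish() // normal response path
	}
}

func main() {
	handle(true)
	handle(false)
	fmt.Println(inFlight.Load())
	fmt.Println(normalizeCode("SOME_NEW_CODE"))
}
```

With a real gauge, the same shape applies: the increment happens on entry, and both the response path and the deferred cleanup call the guarded function.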

aditya1702 added 21 commits March 18, 2026 16:45

refactor db metrics

3d65ee8

make check

c3b624d

Add metrics tests

c91fa31

Add CollectAndCompare tests

a3ad0a6

Merge branch 'feature/remove-metrics-interface' into feature/db-metrics

4c6d429

fix db test

14661b9

Create graphql_field_metrics_test.go

bc117ef

make check

12f661a

Add comments for DB metrics

40655b2

Update rpc.go

9908e41

aditya1702 marked this pull request as ready for review March 27, 2026 20:47

aristidesstaffieri approved these changes Mar 30, 2026

Base automatically changed from feature/remove-metrics-interface to feature/finalize-metrics March 30, 2026 19:16

Merge branch 'feature/finalize-metrics' into feature/fix-all-metrics

5d50854

aditya1702 merged commit 4f95a33 into feature/finalize-metrics Mar 30, 2026
5 checks passed

aditya1702 deleted the feature/fix-all-metrics branch March 30, 2026 19:23

aditya1702 mentioned this pull request Mar 30, 2026

Refactor and fix the Prometheus metrics code #554

Merged

aditya1702 merged 23 commits into feature/finalize-metrics from feature/fix-all-metrics



aristidesstaffieri commented Mar 30, 2026

Code review
