
Architecture Observability

Tiana_ edited this page May 30, 2026 · 1 revision

Architecture - Observability

Logs, metrics, traces, dashboards, alerting. The three pillars done right for fintech. Companion to Architecture-Overview, Architecture-SLA-SLI-SLO, Architecture-Resilience.


Three pillars

| Pillar | Purpose | Tool | Storage |
| --- | --- | --- | --- |
| Metrics | What's happening at scale | Micrometer → Prometheus | Prometheus 30d + remote-write to long-term store |
| Logs | What happened to a specific request | SLF4J + Logback + structured JSON → Loki | Loki 14d + S3 archive |
| Traces | How a request flowed across services | OpenTelemetry → Tempo | Tempo 7d, sampled |

Plus a fourth implicit pillar: Audit - what humans/services did. Stored separately, regulatory retention. See Architecture-Security.


1. Metrics (Micrometer + Prometheus)

1.1. What we measure

Three categories per service:

Golden signals (Google SRE)

  • Latency - request duration histogram per endpoint
  • Traffic - requests per second per endpoint
  • Errors - error rate per endpoint, broken down by status code
  • Saturation - DB pool usage, thread pool usage, queue depths

Business metrics

  • ledger.transactions.posted.total - counter, label by currency
  • ledger.transactions.reversed.total - counter
  • ledger.entries.written.total - counter
  • ledger.invariant.violation.total - counter (should always be 0; alert on any)
  • payments.initiated.total - counter
  • payments.completed.total - counter
  • payments.failed.total - counter, label by reason
  • payments.permanently_failed.total - counter
  • decision.evaluations.total - counter, label by decision (APPROVE/REJECT/REVIEW)
  • decision.evaluation.duration - histogram (target p99 < 10ms)
  • aml.alerts.created.total - counter, label by riskScoreBucket
  • compliance.cases.opened.total - counter
  • compliance.cases.resolved.total - counter, label by decision
  • kyc.sessions.created.total - counter
  • kyc.sessions.approved.total - counter
  • kyc.sessions.rejected.total - counter
  • webhooks.delivered.total - counter, label by subscriptionId, status
  • webhooks.permanently_failed.total - counter

Resilience metrics

  • outbox.events.pending - gauge per schema
  • outbox.events.failed - counter (alert on >0)
  • outbox.dispatcher.lag.seconds - histogram (publish-time minus row-create-time)
  • kafka.consumer.lag - gauge per consumer group
  • resilience4j.circuitbreaker.state - gauge per breaker
  • resilience4j.circuitbreaker.failure.rate - gauge per breaker
  • cache.hit.rate - gauge per cache
  • hikaricp.connections.active / idle / pending - gauges per pool
  • saga.requires_manual_intervention.total - counter (alert on any)

JVM metrics (Spring Boot Actuator default)

  • Heap, non-heap, GC pauses, thread states, file descriptors
  • Class loading
  • HikariCP pool metrics
  • Spring's http.server.requests histogram

1.2. Naming conventions

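  • Dotted lowercase names in code (ledger.transactions.posted.total); the Prometheus registry exports them in snake_case (ledger_transactions_posted_total)
  • Past tense for counters: created, posted, delivered, failed
  • Units in name when relevant: duration_seconds, lag_seconds, size_bytes. Micrometer timers get the _seconds suffix on export, so a timer named decision.evaluation.duration is scraped as decision_evaluation_duration_seconds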
  • Cardinality bounded: max ~100 unique label values per label
  • No PII as labels: only IDs, status codes, enum values
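
The conventions above can be checked mechanically. The following is a minimal sketch, not Micrometer's actual naming code: `toPrometheusName` and `labelsOverBudget` are hypothetical helpers, and the conversion covers only the dot-to-underscore step (Micrometer's Prometheus naming convention also appends unit and type suffixes).

```kotlin
// Hypothetical helper: mirrors only the dot-to-underscore part of the export naming.
fun toPrometheusName(micrometerName: String): String =
    micrometerName.lowercase().replace('.', '_')

// Hypothetical helper: returns the labels whose number of distinct values
// exceeds the cardinality budget (about 100 per label in this document).
fun labelsOverBudget(distinctValues: Map<String, Int>, budget: Int = 100): List<String> =
    distinctValues.filterValues { it > budget }.keys.sorted()
```

For example, `labelsOverBudget(mapOf("currency" to 30, "userId" to 50_000))` flags only `userId`, which is the kind of label the cardinality rule is meant to keep out.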

1.3. Configuration

management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus, threaddump, heapdump
      base-path: /actuator
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
    metrics:
      enabled: true
    prometheus:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
      slo:
        http.server.requests: 50ms, 100ms, 500ms, 1s, 5s
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:unknown}
      version: ${BUILD_VERSION:dev}

1.4. Custom counters & timers

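import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import org.springframework.stereotype.Component

@Component
class TransactionMetrics(private val meterRegistry: MeterRegistry) {

    private val postLatency: Timer = Timer.builder("ledger.transactions.post.duration")
        .description("Time to post a double-entry transaction")
        .publishPercentiles(0.5, 0.95, 0.99)
        .publishPercentileHistogram()
        .register(meterRegistry)

    // One counter per currency; register() returns the existing meter on repeat calls.
    // Every registration of this name must use the same tag keys, otherwise the
    // Prometheus registry rejects the second one. The "application" tag comes from
    // the common tags in 1.3, so it is not repeated here.
    fun recordPosted(currency: String) =
        Counter.builder("ledger.transactions.posted.total")
            .description("Total ledger transactions successfully posted")
            .tag("currency", currency)
            .register(meterRegistry)
            .increment()

    // Start a sample before posting, stop it afterwards to record the duration.
    fun timePosting(): Timer.Sample = Timer.start(meterRegistry)
    fun stopTimer(sample: Timer.Sample) = sample.stop(postLatency)
}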

1.5. Prometheus scrape

ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ledger-service
spec:
  selector:
    matchLabels:
      app: ledger-service
  endpoints:
    - port: actuator
      path: /actuator/prometheus
      interval: 15s
      scrapeTimeout: 10s

Or static config:

scrape_configs:
  - job_name: fincore-ledger
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: ['ledger-service:8080']
        labels: { service: ledger }

2. Logs (Logback + JSON + Loki)

2.1. Format

All production logs are structured JSON. Example:

{
  "@timestamp": "2026-04-25T10:00:00.123Z",
  "level": "INFO",
  "logger": "com.fincore.ledger.application.TransactionServiceImpl",
  "message": "Transaction posted",
  "thread": "http-nio-8080-exec-3",
  "application": "ledger-service",
  "environment": "production",
  "version": "0.1.0",
  "correlationId": "01HX...",
  "requestId": "01HX...",
  "userId": "01HX...",
  "actorType": "USER",
  "transactionId": "tx_01HX...",
  "reference": "demo-001",
  "currency": "EUR",
  "entriesCount": 2,
  "duration_ms": 47
}

2.2. Logback configuration

<configuration>
    <!-- Spring properties are only visible to Logback through springProperty,
         which requires this file to be named logback-spring.xml. -->
    <springProperty name="appName" source="spring.application.name"/>

    <appender name="JSON_STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <includeMdcKeyName>correlationId</includeMdcKeyName>
            <includeMdcKeyName>requestId</includeMdcKeyName>
            <includeMdcKeyName>userId</includeMdcKeyName>
            <includeMdcKeyName>actorType</includeMdcKeyName>
            <includeMdcKeyName>tenantId</includeMdcKeyName>
            <fieldNames>
                <timestamp>@timestamp</timestamp>
                <message>message</message>
                <thread>thread</thread>
                <level>level</level>
                <logger>logger</logger>
            </fieldNames>
            <customFields>{"application":"${appName}","environment":"${ENVIRONMENT:-unknown}","version":"${BUILD_VERSION:-dev}"}</customFields>
            <throwableConverter class="net.logstash.logback.stacktrace.ShortenedThrowableConverter">
                <maxDepthPerThrowable>20</maxDepthPerThrowable>
                <maxLength>2048</maxLength>
                <shortenedClassNameLength>30</shortenedClassNameLength>
            </throwableConverter>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON_STDOUT"/>
    </root>
</configuration>

2.3. MDC propagation

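A request-scoped filter puts correlationId and requestId into MDC on entry and clears MDC on exit. This filter does not set userId; it has to be added to MDC once the caller is authenticated:

// jakarta.servlet.* applies to Spring Boot 3; on Boot 2 the packages are javax.servlet.*
import jakarta.servlet.FilterChain
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
import org.slf4j.MDC
import org.springframework.stereotype.Component
import org.springframework.web.filter.OncePerRequestFilter
import java.util.UUID

@Component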
class CorrelationIdFilter : OncePerRequestFilter() {
    override fun doFilterInternal(req: HttpServletRequest, resp: HttpServletResponse, chain: FilterChain) {
        val correlationId = req.getHeader("X-Correlation-Id") ?: UUID.randomUUID().toString()
        val requestId = UUID.randomUUID().toString()

        try {
            MDC.put("correlationId", correlationId)
            MDC.put("requestId", requestId)
            resp.setHeader("X-Correlation-Id", correlationId)
            chain.doFilter(req, resp)
        } finally {
            MDC.clear()
        }
    }
}

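Kafka consumers run on listener threads with no caller MDC to copy, so they need to restore correlationId from record headers. For @Async tasks and other work handed to an executor, MDC is propagated via TaskDecorator: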

@Bean
fun mdcTaskDecorator() = TaskDecorator { runnable ->
    val context = MDC.getCopyOfContextMap()
    Runnable {
        val previous = MDC.getCopyOfContextMap()
        if (context != null) MDC.setContextMap(context) else MDC.clear()
        try { runnable.run() } finally {
            if (previous != null) MDC.setContextMap(previous) else MDC.clear()
        }
    }
}
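
The capture-and-restore pattern behind the decorator can be tried without Spring or SLF4J. This is a dependency-free sketch under stated assumptions: `FakeMdc` is a hypothetical stand-in for MDC built on a `ThreadLocal`, and `withCallerContext` and `runOnWorker` are illustrative names, not part of any library.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

// Hypothetical stand-in for SLF4J's MDC: one immutable map per thread.
object FakeMdc {
    private val ctx = ThreadLocal.withInitial { emptyMap<String, String>() }
    fun get(): Map<String, String> = ctx.get()
    fun set(values: Map<String, String>) = ctx.set(values)
    fun clear() = ctx.remove()
}

// Captures the submitting thread's context now, installs it on the worker
// thread when the task runs, and restores the worker's previous context after.
fun withCallerContext(task: () -> Unit): Runnable {
    val captured = FakeMdc.get()
    return Runnable {
        val previous = FakeMdc.get()
        FakeMdc.set(captured)
        try { task() } finally { FakeMdc.set(previous) }
    }
}

// Runs a task on another thread and reports the correlationId it observed there.
fun runOnWorker(correlationId: String): String? {
    val pool = Executors.newSingleThreadExecutor()
    try {
        FakeMdc.set(mapOf("correlationId" to correlationId))
        var seen: String? = null
        pool.submit(withCallerContext { seen = FakeMdc.get()["correlationId"] })
            .get(5, TimeUnit.SECONDS)
        return seen
    } finally {
        FakeMdc.clear()
        pool.shutdown()
    }
}
```

Without the wrapper the worker thread would see an empty context, which is why log lines from async work lose their correlationId unless the executor is decorated.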

2.4. Log levels

| Level | Use |
| --- | --- |
| ERROR | System failure that requires investigation. Always include exception. |
| WARN | Degraded behavior, fallback used, retry scheduled. |
| INFO | Significant state transitions: account created, payment completed, rule activated. |
| DEBUG | Flow detail useful for support. Off in production by default. |
| TRACE | Verbose data dumps. Off in production. |

Production default: INFO. Adjustable per logger via Spring Boot Admin or env var:

LOGGING_LEVEL_COM_FINCORE_PAYMENTS=DEBUG

2.5. Ship to Loki

# Promtail config (or Vector / Fluent Bit)
clients:
  - url: https://loki.example.com/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            correlationId: correlationId
            application: application
      - labels:
          level:
          application:

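Loki labels are bounded: only application, environment, level, pod. Never use correlationId or userId as a label (cardinality explosion); the json stage above extracts correlationId for use in the pipeline without promoting it to a label.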

2.6. Sensitive data scrubbing

Forbidden in logs:

  • Full IBAN, card PAN, SSN, government ID, full name, full address
  • API tokens, passwords, secrets
  • Full JWT bearer tokens
  • Customer email/phone in clear

Allowed:

  • IDs (UUIDs)
  • Last 4 of IBAN/card (****-1234)
  • Hashed identifiers (sha256(email))
  • Partial info with intent: "Account in EU country" instead of country code
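
As an illustration of the allowed forms, here is a minimal sketch of hashing an identifier and masking all but the last four characters. `sha256Hex` and `lastFour` are hypothetical helper names, and normalizing the input (trim, lowercase) before hashing is an assumption made so that the same email always yields the same hash.

```kotlin
import java.security.MessageDigest

// Hypothetical helper: SHA-256 of a normalized identifier, hex-encoded.
fun sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(value.trim().lowercase().toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// Hypothetical helper: keeps only the last four characters, e.g. ****-1234.
fun lastFour(value: String): String = "****-" + value.takeLast(4)
```

An unsalted hash of a guessable value such as an email can be reversed by hashing candidate addresses, so a keyed hash may be the safer choice where that matters.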

A custom Logback converter scrubs known patterns:

class PiiScrubbingConverter : ClassicConverter() {
    private val IBAN_PATTERN = Regex("""\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b""")
    private val CARD_PATTERN = Regex("""\b\d{13,19}\b""")
    private val EMAIL_PATTERN = Regex("""\b[\w.+-]+@[\w-]+\.[\w.-]+\b""")

    override fun convert(event: ILoggingEvent): String =
        event.formattedMessage
            .replace(IBAN_PATTERN) { it.value.takeLast(4).padStart(it.value.length, '*') }
            .replace(CARD_PATTERN) { "****-****-****-${it.value.takeLast(4)}" }
            .replace(EMAIL_PATTERN) { it.value.replace(Regex("""(?<=.{2}).+(?=@)"""), "***") }
}

Better: don't log PII in the first place.
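
The scrubbing rules can be exercised outside Logback. This is a sketch under stated assumptions rather than the project's converter: `scrub` and `luhnValid` are hypothetical names, the Luhn checksum is an addition not present in the converter above (it reduces false positives on long digit runs such as millisecond timestamps, but does not eliminate them), and the IBAN pattern is tightened to the 15 to 34 character range.

```kotlin
private val IBAN = Regex("""\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b""")
private val CARD = Regex("""\b\d{13,19}\b""")
private val EMAIL = Regex("""\b[\w.+-]+@[\w-]+\.[\w.-]+\b""")

// Luhn checksum: double every second digit from the right, sum, check mod 10.
fun luhnValid(digits: String): Boolean {
    var sum = 0
    digits.reversed().forEachIndexed { i, c ->
        var d = c - '0'
        if (i % 2 == 1) { d *= 2; if (d > 9) d -= 9 }
        sum += d
    }
    return sum % 10 == 0
}

fun scrub(message: String): String =
    message
        // IBAN: keep the last 4 characters, star out the rest.
        .replace(IBAN) { "*".repeat(it.value.length - 4) + it.value.takeLast(4) }
        // Card: mask only digit runs that pass the Luhn check.
        .replace(CARD) {
            if (luhnValid(it.value)) "****-****-****-${it.value.takeLast(4)}" else it.value
        }
        // Email: keep at most the first 2 characters of the local part.
        .replace(EMAIL) { m ->
            val local = m.value.substringBefore('@')
            val masked = if (local.length <= 2) "***" else local.take(2) + "***"
            masked + "@" + m.value.substringAfter('@')
        }
```

Short local parts are masked entirely here, since showing two characters of a two-character mailbox would reveal it in full.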


3. Traces (OpenTelemetry + Tempo)

3.1. What we trace

  • HTTP server requests (every inbound)
  • HTTP client requests (calls to external providers, KYC, bank, LLM)
  • JDBC (every DB query, sampled)
  • Kafka producer/consumer (every published/consumed event)
  • Redis operations (sampled)
  • Spring @Async boundaries
  • Custom spans for use-case orchestration

3.2. Sampling strategy

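Sampling every trace is expensive at scale. Strategy:

  • Errors: 100% sampled (always)
  • Slow requests (>1s, the p95 boundary): 100% sampled (tail-based)
  • Normal: 10% head-based sampling
  • Trace continuation: if upstream sampled, downstream also samples

The SDK configuration below covers the head-based part only. Keeping every error and slow trace is a tail-based decision, which a 10% head sampler cannot make on its own: it typically runs in a collector that receives all spans and decides once the trace is complete.
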
otel:
  exporter:
    otlp:
      endpoint: http://tempo:4317
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: 0.1
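
To make the head-based rule concrete, here is a simplified sketch in the spirit of parentbased_traceidratio. It is not the OpenTelemetry SDK code: `shouldSample` is a hypothetical function, and the exact bit handling in the SDK differs.

```kotlin
// Simplified head-sampling decision: follow the parent if there is one,
// otherwise derive a deterministic decision from the trace ID.
fun shouldSample(traceIdHex: String, ratio: Double, parentSampled: Boolean? = null): Boolean {
    // Parent-based: if an upstream service already decided, follow it.
    if (parentSampled != null) return parentSampled
    if (ratio >= 1.0) return true
    if (ratio <= 0.0) return false
    require(traceIdHex.length == 32) { "trace ID must be 32 hex characters" }
    // Use the low 64 bits of the trace ID so every service reaches the same
    // decision for the same trace without coordination.
    val low = java.lang.Long.parseUnsignedLong(traceIdHex.substring(16), 16)
    val bound = (ratio * Long.MAX_VALUE).toLong()
    return (low and Long.MAX_VALUE) < bound
}
```

Because the decision is a pure function of the trace ID, a trace is either kept by every service or dropped by every service, which is what keeps sampled traces complete.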

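For high-value flows (payment initiation), force sampling. The @Trace annotation below is not part of the OpenTelemetry API; it is assumed to be a project-specific annotation backed by a custom sampler: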

@Trace(samplingPriority = TraceSamplingPriority.HIGH)
suspend fun initiatePayment(cmd: InitiatePaymentCommand): Payment { ... }

3.3. Span attributes

Span names follow lowercase dotted convention:

  • http.request POST /v1/payments
  • db.query INSERT INTO payments
  • kafka.publish ledger.events
  • payment.initiate
  • decision.evaluate

Standard attributes (semconv-aligned):

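  • http.method, http.status_code, http.url (named http.request.method, http.response.status_code, url.full in the stable HTTP semantic conventions)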
  • db.system, db.statement (parameterized, never with values)
  • messaging.system, messaging.destination
  • Custom: fincore.aggregate.type, fincore.aggregate.id, fincore.tenant.id

3.4. Cross-service propagation

W3C Trace Context (traceparent, tracestate) propagates through:

  • HTTP headers (Spring auto-instruments)
  • Kafka headers (Spring Kafka auto-instruments)
  • Async boundaries (TaskDecorator)

3.5. Configuration

otel:
  service:
    name: ${spring.application.name}
  resource:
    attributes:
      service.namespace: fincore-engine
      service.version: ${BUILD_VERSION}
      deployment.environment: ${ENVIRONMENT}
  exporter:
    otlp:
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://tempo:4317}
      protocol: grpc
  instrumentation:
    spring-webmvc.enabled: true
    jdbc.enabled: true
    kafka.enabled: true
    redisson.enabled: true

4. Dashboards (Grafana)

4.1. Dashboard set

| Dashboard | Audience | Update frequency |
| --- | --- | --- |
| Service Health Overview | on-call, eng manager | every 30s |
| API Latency & Errors | on-call | every 30s |
| Ledger Throughput | eng, finance | every 1m |
| Payments Lifecycle | eng, finance, ops | every 1m |
| Compliance & AML | compliance officer | every 5m |
| Decision Engine | risk team | every 1m |
| Outbox & Event Flow | eng | every 30s |
| Resilience | on-call | every 30s |
| Database (HikariCP, Postgres) | DBA, eng | every 1m |
| Kafka & Consumers | platform eng | every 30s |
| Cost / capacity | eng manager | every 1h |

4.2. Service Health Overview (key panels)

| Panel | Query |
| --- | --- |
| RPS by service | sum by (application) (rate(http_server_requests_seconds_count[1m])) |
| Error rate | sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (application) (rate(http_server_requests_seconds_count[5m])) |
| p99 latency | histogram_quantile(0.99, sum by (le, application) (rate(http_server_requests_seconds_bucket[5m]))) |
| Pod count | kube_deployment_status_replicas{namespace="fincore-engine"} |
| Heap usage | jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} |
| GC pauses | rate(jvm_gc_pause_seconds_sum[5m]) |
| DB pool utilization | hikaricp_connections_active / hikaricp_connections_max |

4.3. Dashboard JSONs

Stored in deploy/grafana/dashboards/*.json, applied via Grafana provisioning:

apiVersion: 1
providers:
  - name: fincore
    folder: 'FinCore Engine'
    type: file
    options:
      path: /var/lib/grafana/dashboards/fincore

4.4. Logs/Traces correlation in Grafana

Loki's derivedFields:

derivedFields:
  - datasourceName: Tempo
    matcherRegex: '"correlationId":"([^"]+)"'
    name: correlationId
    url: '$${__value.raw}'

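Clicking correlationId in a log line opens the linked Tempo query for that value; clicking a span in a trace jumps to the logs for that span. Tempo looks a bare value up as a trace ID, so the link lands on a trace directly only when the matched value is the trace ID. Critical for debugging.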


5. Alerting

5.1. Alert classification

| Severity | Definition | Channel | SLA |
| --- | --- | --- | --- |
| P0 - Critical | Customer-impacting, money at risk, data loss imminent | PagerDuty + phone call | 5 min ack |
| P1 - High | Service degraded, SLO at risk | PagerDuty | 15 min ack |
| P2 - Medium | Single subsystem degraded, error budget burning | Slack #engineering | 1 hour ack |
| P3 - Low | Anomaly, no immediate impact | Slack #monitoring | next business day |

5.2. Alert catalog (Prometheus rules)

groups:
  - name: fincore-availability
    rules:
      - alert: ServiceDown
        expr: up{job=~"fincore-.*"} == 0
        for: 2m
        labels:
          severity: P0
        annotations:
          summary: "Service {{ $labels.job }} is down"

      - alert: HighErrorRate
        expr: |
          sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          / sum by (application) (rate(http_server_requests_seconds_count[5m]))
          > 0.05
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Error rate >5% for {{ $labels.application }}"

      - alert: LedgerInvariantViolation
        expr: rate(ledger_invariant_violation_total[1m]) > 0
        for: 0m   # immediate
        labels: { severity: P0 }
        annotations:
          summary: "Ledger invariant violation - possible data corruption"
          runbook: "https://github.com/tiana-code/fincore-engine/wiki/Runbook#ledger-invariant"

      - alert: OutboxBacklog
        expr: outbox_events_pending > 1000
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Outbox backlog growing in {{ $labels.schema }}"

      - alert: ConsumerLag
        expr: kafka_consumergroup_lag > 10000
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Consumer lag for {{ $labels.consumergroup }}"

      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 2m
        labels: { severity: P1 }
        annotations:
          summary: "Circuit OPEN: {{ $labels.name }}"

      - alert: DLQNonZero
        expr: kafka_topic_log_size{topic=~".*\\.dlq"} > 0
        for: 0m
        labels: { severity: P2 }
        annotations:
          summary: "DLQ {{ $labels.topic }} has messages"

      - alert: SagaManualIntervention
        expr: increase(saga_requires_manual_intervention_total[15m]) > 0   # a raw counter stays > 0 until restart; the 15m window is a suggested value
        for: 0m
        labels: { severity: P1 }
        annotations:
          summary: "Saga requires manual intervention"

5.3. Burn rate alerts (SLO-driven)

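For SLO-based alerting (multi-burn-rate, following the Google SRE workbook; the workbook's full multi-window form also ANDs each condition with a shorter window, which these rules omit):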

- alert: AvailabilityBurnRateFast
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{application="ledger-service",status=~"5.."}[1h]))
      /
      sum(rate(http_server_requests_seconds_count{application="ledger-service"}[1h]))
    ) > (14.4 * 0.0005)  # 14.4× burn spends 2% of the 30d budget per hour (exhausted in ~50h)
  for: 2m
  labels: { severity: P1, slo: availability }

- alert: AvailabilityBurnRateSlow
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{application="ledger-service",status=~"5.."}[6h]))
      /
      sum(rate(http_server_requests_seconds_count{application="ledger-service"}[6h]))
    ) > (3 * 0.0005)
  for: 1h
  labels: { severity: P2, slo: availability }
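
The thresholds in these rules follow from simple arithmetic. The sketch below assumes the 0.0005 factor is the error budget of a 99.95% availability SLO over a 30-day window; the function names are hypothetical.

```kotlin
// Error budget is the allowed failure ratio, e.g. 0.0005 for a 99.95% SLO.
fun errorBudget(slo: Double): Double = 1.0 - slo

// Error-ratio threshold at which an alert with the given burn rate fires.
fun burnRateThreshold(burnRate: Double, slo: Double): Double =
    burnRate * errorBudget(slo)

// Hours until the whole budget is gone if the burn rate is sustained.
fun hoursToExhaustion(burnRate: Double, sloWindowDays: Int = 30): Double =
    sloWindowDays * 24.0 / burnRate

// Fraction of the budget spent during one alert window at this burn rate.
fun budgetConsumed(burnRate: Double, alertWindowHours: Double, sloWindowDays: Int = 30): Double =
    burnRate * alertWindowHours / (sloWindowDays * 24.0)
```

At 14.4× the threshold is an error ratio of 0.0072, one hour spends 2% of the budget, and the budget lasts about 50 hours. At 3× over a 6h window, 2.5% of the budget is spent before the slow alert's window is full.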

5.4. Alert hygiene rules

  • Every alert has a runbook link in annotations
  • Alerts that fire >5 times/week without action are reviewed (delete or fix)
  • Silence policy: silences expire (max 24h), require justification
  • Alert fatigue is a P1 risk - we measure noise per on-call shift

6. Audit (separate from logs)

Distinct from operational logs. See Architecture-Security#audit.

  • Stored in dedicated audit_events table (per service or central)
  • Retention: 7 years (regulatory)
  • Format: JSON, append-only
  • Shipped to SIEM in addition to local storage
  • Never displayed in operational dashboards
  • Queryable by auditors via dedicated read replica

7. Runbooks

Every alert links to a runbook. Runbook structure:

# Runbook: Ledger Invariant Violation

## Severity: P0 - Customer-impacting, possible data corruption

## What this means
The deferred trigger detected a transaction where SUM(entries.amount) ≠ 0 per currency.
This should be impossible - investigate immediately.

## Detection
- Alert: LedgerInvariantViolation
- Metric: ledger_invariant_violation_total
- Logs: `application="ledger-service" AND level=ERROR AND message~"invariant"`

## Immediate actions
1. Check the metric is real, not a glitch (compare instances)
2. Identify the offending transaction:

   SELECT *
   FROM transactions
   WHERE id IN (
       SELECT transaction_id
       FROM entries
       GROUP BY transaction_id, currency
       HAVING SUM(amount) <> 0
   );

3. If a recent deploy: roll back
4. If not, page eng manager + DBA

## Investigation
- Was the deferred trigger disabled? `\df+ verify_double_entry_invariant` shows the function; the trigger's status is in `pg_trigger.tgenabled` (`D` means disabled)
- Is the materialized view stale?
- Is there a data import that bypassed the trigger?

## Recovery
- DO NOT delete the offending entries (immutable journal)
- Determine intended correct state
- Post a compensating transaction if reverse is mathematically valid
- Otherwise: escalate to data-integrity working group

## Post-incident
- Mandatory post-mortem within 48h
- Root cause must include: how was invariant bypassed, why didn't tests catch it

Runbooks live in runbooks/ directory and on Wiki.
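
The invariant the runbook query checks can also be expressed in application code, for example in a test or a reconciliation job. This is a minimal sketch: the `Entry` type and `unbalancedTransactions` function are hypothetical and only mirror the columns used in the SQL above.

```kotlin
import java.math.BigDecimal

// Hypothetical shape of a ledger entry, limited to the columns the query uses.
data class Entry(val transactionId: String, val currency: String, val amount: BigDecimal)

// Returns the IDs of transactions whose entries do not sum to zero per currency.
fun unbalancedTransactions(entries: List<Entry>): Set<String> =
    entries.groupBy { it.transactionId to it.currency }
        .filterValues { group ->
            // signum() ignores scale, so 0.00 and 0 both count as balanced;
            // BigDecimal.equals would treat them as different values.
            group.fold(BigDecimal.ZERO) { acc, e -> acc + e.amount }.signum() != 0
        }
        .keys.map { it.first }.toSet()
```

Grouping by currency matters: a transaction that debits 5 EUR and credits 5 USD sums to zero numerically but is still unbalanced.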


8. Local development observability

docker compose --profile observability up brings up:

  • Grafana (port 3000, default creds: admin/admin)
  • Prometheus (port 9090)
  • Loki (port 3100)
  • Tempo (port 4317 OTLP, 3200 query)
  • All services pre-wired with metrics scrape, log shipping, OTLP export

For dev, sampling = 100%, log level = DEBUG, retention shortened.


9. Cost considerations

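Observability costs tend to grow faster than traffic, since every new label value multiplies the number of series. Budgets per service:

| Signal | Budget |
| --- | --- |
| Metrics | < 100k active series per service |
| Logs | < 10 GB/day per service |
| Traces | < 1 GB/day per service (after sampling) |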

Cost optimization:

  • Drop high-cardinality labels (per-userId, per-correlationId)
  • Sample at source (don't ship 100% to drop 90% downstream)
  • Aggregate before shipping (Vector / Fluent Bit)
  • Use exemplars (single trace ID per histogram bucket) instead of full traces
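
A rough way to see why high-cardinality labels dominate the metrics budget is to multiply out the label values. The sketch below is a worst-case estimate under the assumption that every label combination actually occurs; `estimatedSeries` is a hypothetical helper.

```kotlin
// Worst-case series count for one metric: the product of distinct values per
// label. For a histogram, each combination exports one series per bucket
// (including +Inf) plus _sum and _count.
fun estimatedSeries(labelCardinalities: Map<String, Int>, histogramBuckets: Int = 0): Long {
    val combinations = labelCardinalities.values.fold(1L) { acc, n -> acc * n }
    val perCombination = if (histogramBuckets > 0) histogramBuckets + 2L else 1L
    return combinations * perCombination
}
```

A counter labeled by currency (30 values) and status (5 values) stays at 150 series. Adding a userId label with 100,000 values takes the same counter to 15 million, far past the 100k budget.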

10. SLI ↔ Metric mapping

Specific SLIs from Architecture-SLA-SLI-SLO map to:

| SLI | Metric expression |
| --- | --- |
| Ledger Post Availability | sum(rate(http_server_requests_seconds_count{uri="/v1/transactions",status!~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count{uri="/v1/transactions"}[5m])) |
| Ledger Post p99 Latency | histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{uri="/v1/transactions"}[5m]))) |
| Decision Eval p99 Latency | histogram_quantile(0.99, sum by (le) (rate(decision_evaluation_duration_seconds_bucket[5m]))) |
| Webhook Delivery Success Within 30s | sum(rate(webhooks_delivered_within_30s_total[5m])) / sum(rate(webhooks_delivered_total[5m])) |
| Outbox Lag p99 | histogram_quantile(0.99, sum by (le) (rate(outbox_dispatcher_lag_seconds_bucket[5m]))) |
| Idempotency Correctness | 1 - (rate(idempotency_violation_total[5m]) / rate(idempotency_check_total[5m])) |

These power both the SLO dashboard and burn-rate alerts.
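
The latency SLIs depend on how histogram_quantile estimates a percentile from cumulative buckets, which is why bucket boundaries matter. The following is a simplified sketch of that linear interpolation, not Prometheus's implementation; `Bucket` and `histogramQuantile` are hypothetical names, and edge cases such as empty leading buckets are not handled.

```kotlin
// One cumulative bucket: observations with value <= le.
data class Bucket(val le: Double, val cumulativeCount: Double)

fun histogramQuantile(q: Double, buckets: List<Bucket>): Double {
    require(q in 0.0..1.0) { "q must be between 0 and 1" }
    val sorted = buckets.sortedBy { it.le }
    require(sorted.size >= 2 && sorted.last().le == Double.POSITIVE_INFINITY) {
        "need at least one finite bucket plus the +Inf bucket"
    }
    val total = sorted.last().cumulativeCount
    if (total == 0.0) return Double.NaN
    val rank = q * total
    // First bucket whose cumulative count reaches the target rank.
    val idx = sorted.indexOfFirst { it.cumulativeCount >= rank }
    // If the rank falls in +Inf, the best available answer is the highest finite bound.
    if (idx == sorted.lastIndex) return sorted[idx - 1].le
    val lowerBound = if (idx == 0) 0.0 else sorted[idx - 1].le
    val countBelow = if (idx == 0) 0.0 else sorted[idx - 1].cumulativeCount
    val bucket = sorted[idx]
    // Assume observations are spread evenly inside the bucket.
    return lowerBound + (bucket.le - lowerBound) * (rank - countBelow) / (bucket.cumulativeCount - countBelow)
}
```

With the boundaries from section 1.3 (50ms, 100ms, 500ms, 1s, 5s), any percentile that lands between 100ms and 500ms is interpolated across a 400ms range, so the reported p99 can be well off the true value unless a boundary sits near the SLO target.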


11. Observability checklist for new code

When adding a feature:

  • Counter/timer added for the operation
  • Logs at INFO for state transitions
  • Logs at DEBUG for flow detail
  • No PII in logs
  • Span created via @WithSpan or programmatic API
  • Span attributes set (aggregate.type, aggregate.id, business context)
  • Error path emits ERROR log with stacktrace
  • Health check considers this feature if critical
  • Integration test verifies metric is incremented
  • If failure mode added: alert configured
  • If alert added: runbook written

12. Tools summary

| Concern | Tool | Why |
| --- | --- | --- |
| Metrics scrape | Prometheus | Industry standard, pull-based, multi-dimensional |
| Long-term metrics | Mimir / Cortex / VictoriaMetrics | Scale Prometheus to years |
| Logs aggregation | Loki | Cheap, label-based, integrates with Grafana |
| Traces | Tempo | Cheap, no external deps, integrates with Grafana |
| Visualization | Grafana | Single pane of glass for metrics + logs + traces |
| Alerting | Prometheus Alertmanager → PagerDuty/Slack | Standard, reliable |
| Profiling (continuous) | Pyroscope (optional, Y1 H2) | CPU/memory hotspots in production |
| Synthetic monitoring | k6 / Blackbox Exporter | External "is the API up" checks |
| Real User Monitoring | Sentry (optional) | Frontend errors when dashboard ships (Phase E v0.4) |
