Architecture Observability

Logs, metrics, traces, dashboards, alerting. The three pillars done right for fintech. Companion to Architecture-Overview, Architecture-SLA-SLI-SLO, Architecture-Resilience.

Three pillars

Pillar	Purpose	Tool	Storage
Metrics	What's happening at scale	Micrometer → Prometheus	Prometheus 30d + remote-write to long-term store
Logs	What happened to a specific request	SLF4J + Logback + structured JSON → Loki	Loki 14d + S3 archive
Traces	How a request flowed across services	OpenTelemetry → Tempo	Tempo 7d, sampled

Plus a fourth implicit pillar: Audit - what humans/services did. Stored separately, regulatory retention. See Architecture-Security.

1. Metrics (Micrometer + Prometheus)

1.1. What we measure

Three categories per service:

Golden signals (Google SRE)

Latency - request duration histogram per endpoint
Traffic - requests per second per endpoint
Errors - error rate per endpoint, broken down by status code
Saturation - DB pool usage, thread pool usage, queue depths

Business metrics

ledger.transactions.posted.total - counter, label by currency
ledger.transactions.reversed.total - counter
ledger.entries.written.total - counter
ledger.invariant.violation.total - counter (should always be 0; alert on any)
payments.initiated.total - counter
payments.completed.total - counter
payments.failed.total - counter, label by reason
payments.permanently_failed.total - counter
decision.evaluations.total - counter, label by decision (APPROVE/REJECT/REVIEW)
decision.evaluation.duration - histogram (target p99 < 10ms)
aml.alerts.created.total - counter, label by riskScoreBucket
compliance.cases.opened.total - counter
compliance.cases.resolved.total - counter, label by decision
kyc.sessions.created.total - counter
kyc.sessions.approved.total - counter
kyc.sessions.rejected.total - counter
webhooks.delivered.total - counter, label by subscriptionId, status
webhooks.permanently_failed.total - counter

Resilience metrics

outbox.events.pending - gauge per schema
outbox.events.failed - counter (alert on >0)
outbox.dispatcher.lag.seconds - histogram (publish-time minus row-create-time)
kafka.consumer.lag - gauge per consumer group
resilience4j.circuitbreaker.state - gauge per breaker
resilience4j.circuitbreaker.failure.rate - gauge per breaker
cache.hit.rate - gauge per cache
hikaricp.connections.active / idle / pending - gauges per pool
saga.requires_manual_intervention.total - counter (alert on any)

JVM metrics (Spring Boot Actuator default)

Heap, non-heap, GC pauses, thread states, file descriptors
Class loading
HikariCP pool metrics
Spring's http.server.requests histogram

1.2. Naming conventions

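Dot-separated lowercase names in Micrometer (ledger.transactions.posted.total); the Prometheus registry exposes them in snake_case (ledger_transactions_posted_total)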
Past tense for counters: created, posted, delivered, failed
Units in name when relevant: duration_seconds, lag_seconds, size_bytes
Cardinality bounded: max ~100 unique label values per label
No PII as labels: only status codes, enum values, and IDs drawn from a bounded set
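
To make the link between the dotted names in 1.1 and the names that appear in PromQL concrete, here is a minimal sketch of the conversion. It is a simplified approximation, not Micrometer's actual naming convention class; toPrometheusName and MeterKind are hypothetical names introduced for illustration.

```kotlin
// Hypothetical helper: approximates how a dotted Micrometer meter name
// shows up in Prometheus. Not the real Micrometer implementation.
enum class MeterKind { COUNTER, TIMER, GAUGE }

fun toPrometheusName(name: String, kind: MeterKind): String {
    // Dots (and any other character outside [a-zA-Z0-9_]) become underscores.
    val snake = name.replace(Regex("[^a-zA-Z0-9_]"), "_")
    return when (kind) {
        // Counters carry a _total suffix, added once rather than twice.
        MeterKind.COUNTER -> if (snake.endsWith("_total")) snake else "${snake}_total"
        // Timers are exposed in seconds.
        MeterKind.TIMER -> if (snake.endsWith("_seconds")) snake else "${snake}_seconds"
        MeterKind.GAUGE -> snake
    }
}

fun main() {
    println(toPrometheusName("ledger.transactions.posted.total", MeterKind.COUNTER))
    println(toPrometheusName("http.server.requests", MeterKind.TIMER))
    println(toPrometheusName("outbox.events.pending", MeterKind.GAUGE))
}
```

A timer is exposed as several series sharing that base name: http_server_requests_seconds_count, http_server_requests_seconds_sum and, when histograms are enabled, http_server_requests_seconds_bucket.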

1.3. Configuration

management:
  endpoints:
    web:
      exposure:
        # heapdump and threaddump can expose secrets and PII; keep the actuator port off public ingress
        include: health, info, metrics, prometheus, threaddump, heapdump
      base-path: /actuator
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
    metrics:
      enabled: true
    prometheus:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
      slo:  # this key was named "sla" before Spring Boot 2.3
        http.server.requests: 50ms, 100ms, 500ms, 1s, 5s
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:unknown}
      version: ${BUILD_VERSION:dev}
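
The bucket boundaries configured above determine how precise any percentile computed in Prometheus can be. The sketch below is a simplified version of the estimate that PromQL's histogram_quantile makes for classic histograms: locate the bucket that contains the requested rank, then interpolate linearly inside it. Bucket and estimateQuantile are hypothetical names, and the sample counts are made up.

```kotlin
// Cumulative bucket: number of observations with value <= upperBound.
data class Bucket(val upperBound: Double, val cumulativeCount: Double)

// Buckets must be sorted by upperBound and end with the +Inf bucket.
fun estimateQuantile(q: Double, buckets: List<Bucket>): Double {
    require(q in 0.0..1.0) { "q must be within [0, 1]" }
    require(buckets.size >= 2 && buckets.last().upperBound.isInfinite()) {
        "need at least one finite bucket plus the +Inf bucket"
    }
    val total = buckets.last().cumulativeCount
    if (total == 0.0) return Double.NaN
    val rank = q * total
    val index = buckets.indexOfFirst { it.cumulativeCount >= rank }
    // Rank falls into +Inf: the best available answer is the highest finite bound.
    if (index == buckets.lastIndex) return buckets[buckets.lastIndex - 1].upperBound
    // The first bucket is assumed to start at 0.
    val lowerBound = if (index == 0) 0.0 else buckets[index - 1].upperBound
    val countBelow = if (index == 0) 0.0 else buckets[index - 1].cumulativeCount
    val countInBucket = buckets[index].cumulativeCount - countBelow
    if (countInBucket == 0.0) return lowerBound
    val fraction = (rank - countBelow) / countInBucket
    return lowerBound + (buckets[index].upperBound - lowerBound) * fraction
}

fun main() {
    // Made-up counts for the 50ms, 100ms, 500ms, 1s and 5s boundaries.
    val buckets = listOf(
        Bucket(0.05, 900.0), Bucket(0.1, 950.0), Bucket(0.5, 990.0),
        Bucket(1.0, 998.0), Bucket(5.0, 1000.0), Bucket(Double.POSITIVE_INFINITY, 1000.0),
    )
    println("p99 estimate: ${estimateQuantile(0.99, buckets)} s")
}
```

With these counts the p99 lands at the top of the 100ms to 500ms bucket. The estimate can never be finer than the bucket it falls into, which is the reason to place boundaries close to the latency targets that matter.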

1.4. Custom counters & timers

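import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import org.springframework.stereotype.Component

@Component
class TransactionMetrics(private val meterRegistry: MeterRegistry) {
    // No "application" tag here: management.metrics.tags already adds it to every meter.
    private val postLatency = Timer.builder("ledger.transactions.post.duration")
        .description("Time to post a double-entry transaction")
        .publishPercentiles(0.5, 0.95, 0.99)
        .publishPercentileHistogram(true)
        .register(meterRegistry)

    // register() returns the already-registered counter for a known name + tag set,
    // so one counter exists per currency. All series of one metric must share the
    // same tag keys, hence no separate "posted" counter without the currency tag.
    fun recordPosted(currency: String) =
        Counter.builder("ledger.transactions.posted.total")
            .description("Total ledger transactions successfully posted")
            .tag("currency", currency)
            .register(meterRegistry)
            .increment()

    fun timePosting(): Timer.Sample = Timer.start(meterRegistry)
    fun stopTimer(sample: Timer.Sample) = sample.stop(postLatency)
}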

1.5. Prometheus scrape

ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ledger-service
spec:
  selector:
    matchLabels:
      app: ledger-service
  endpoints:
    - port: actuator
      path: /actuator/prometheus
      interval: 15s
      scrapeTimeout: 10s

Or static config:

scrape_configs:
  - job_name: fincore-ledger
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: ['ledger-service:8080']
        labels: { service: ledger }

2. Logs (Logback + JSON + Loki)

2.1. Format

All production logs are structured JSON. Example:

{
  "@timestamp": "2026-04-25T10:00:00.123Z",
  "level": "INFO",
  "logger": "com.fincore.ledger.application.TransactionServiceImpl",
  "message": "Transaction posted",
  "thread": "http-nio-8080-exec-3",
  "application": "ledger-service",
  "environment": "production",
  "version": "0.1.0",
  "correlationId": "01HX...",
  "requestId": "01HX...",
  "userId": "01HX...",
  "actorType": "USER",
  "transactionId": "tx_01HX...",
  "reference": "demo-001",
  "currency": "EUR",
  "entriesCount": 2,
  "duration_ms": 47
}

2.2. Logback configuration

<!-- Save as logback-spring.xml: <springProperty> is only resolved by Spring Boot's Logback extension. -->
<configuration>
    <springProperty name="appName" source="spring.application.name"/>

    <appender name="JSON_STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder class="net.logstash.logback.encoder.LogstashEncoder">
            <includeMdcKeyName>correlationId</includeMdcKeyName>
            <includeMdcKeyName>requestId</includeMdcKeyName>
            <includeMdcKeyName>userId</includeMdcKeyName>
            <includeMdcKeyName>actorType</includeMdcKeyName>
            <includeMdcKeyName>tenantId</includeMdcKeyName>
            <fieldNames>
                <timestamp>@timestamp</timestamp>
                <message>message</message>
                <thread>thread</thread>
                <level>level</level>
                <logger>logger</logger>
            </fieldNames>
            <customFields>{"application":"${appName}","environment":"${ENVIRONMENT:-unknown}","version":"${BUILD_VERSION:-dev}"}</customFields>
            <throwableConverter class="net.logstash.logback.stacktrace.ShortenedThrowableConverter">
                <maxDepthPerThrowable>20</maxDepthPerThrowable>
                <maxLength>2048</maxLength>
                <shortenedClassNameLength>30</shortenedClassNameLength>
            </throwableConverter>
        </encoder>
    </appender>

    <root level="INFO">
        <appender-ref ref="JSON_STDOUT"/>
    </root>
</configuration>

2.3. MDC propagation

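A request-scoped filter adds correlationId and requestId to MDC on entry and clears MDC on exit. userId is not set here; it has to be added once the caller is authenticated:

// jakarta.* applies to Spring Boot 3; Spring Boot 2 uses javax.servlet.* instead
import jakarta.servlet.FilterChain
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
import org.slf4j.MDC
import org.springframework.stereotype.Component
import org.springframework.web.filter.OncePerRequestFilter
import java.util.UUID

@Component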
class CorrelationIdFilter : OncePerRequestFilter() {
    override fun doFilterInternal(req: HttpServletRequest, resp: HttpServletResponse, chain: FilterChain) {
        val correlationId = req.getHeader("X-Correlation-Id") ?: UUID.randomUUID().toString()
        val requestId = UUID.randomUUID().toString()

        try {
            MDC.put("correlationId", correlationId)
            MDC.put("requestId", requestId)
            resp.setHeader("X-Correlation-Id", correlationId)
            chain.doFilter(req, resp)
        } finally {
            MDC.clear()
        }
    }
}

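For @Async methods and other executor-run tasks, MDC is propagated via a TaskDecorator. Kafka listener threads are not started per request, so a decorator has nothing to copy from there: the correlation ID has to travel in a record header and be put back into MDC inside the listener.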

@Bean
fun mdcTaskDecorator() = TaskDecorator { runnable ->
    val context = MDC.getCopyOfContextMap()
    Runnable {
        val previous = MDC.getCopyOfContextMap()
        if (context != null) MDC.setContextMap(context) else MDC.clear()
        try { runnable.run() } finally {
            if (previous != null) MDC.setContextMap(previous) else MDC.clear()
        }
    }
}
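
The capture-and-restore pattern used by the decorator can be tried without Spring or SLF4J. The following is a minimal sketch in which FakeMdc is a hypothetical stand-in for MDC and withCallerContext plays the role of the TaskDecorator; neither name exists in the codebase.

```kotlin
import java.util.concurrent.Executors

// Stand-in for SLF4J's MDC so the sketch runs without dependencies.
object FakeMdc {
    private val ctx = ThreadLocal.withInitial { emptyMap<String, String>() }
    fun put(key: String, value: String) { ctx.set(ctx.get() + (key to value)) }
    fun get(key: String): String? = ctx.get()[key]
    fun copy(): Map<String, String> = ctx.get()
    fun restore(map: Map<String, String>) { ctx.set(map) }
}

// Same shape as the TaskDecorator: capture on the submitting thread, install on
// the worker thread, then restore whatever the worker had before.
fun withCallerContext(task: () -> Unit): Runnable {
    val captured = FakeMdc.copy()
    return Runnable {
        val previous = FakeMdc.copy()
        FakeMdc.restore(captured)
        try { task() } finally { FakeMdc.restore(previous) }
    }
}

// Returns what a worker thread sees without and with the wrapper.
fun runDemo(): Pair<String?, String?> {
    val pool = Executors.newSingleThreadExecutor()
    FakeMdc.put("correlationId", "demo-123")
    var seenWithout: String? = "unset"
    var seenWith: String? = "unset"
    try {
        pool.submit(Runnable { seenWithout = FakeMdc.get("correlationId") }).get()
        pool.submit(withCallerContext { seenWith = FakeMdc.get("correlationId") }).get()
    } finally {
        pool.shutdown()
    }
    return seenWithout to seenWith
}

fun main() {
    val (without, with) = runDemo()
    println("without wrapper: $without, with wrapper: $with")
}
```

The undecorated task sees no correlation ID because thread-local state does not follow work onto a pool thread. Restoring the previous context in finally matters on pooled threads, which would otherwise leak one request's IDs into the next task.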

2.4. Log levels

Level	Use
ERROR	System failure that requires investigation. Always include exception.
WARN	Degraded behavior, fallback used, retry scheduled.
INFO	Significant state transitions: account created, payment completed, rule activated.
DEBUG	Flow detail useful for support. Off in production by default.
TRACE	Verbose data dumps. Off in production.

Production default: INFO. Adjustable per logger via Spring Boot Admin or env var:

LOGGING_LEVEL_COM_FINCORE_PAYMENTS=DEBUG

2.5. Ship to Loki

# Promtail config (or Vector / Fluent Bit)
clients:
  - url: https://loki.example.com/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            correlationId: correlationId
            application: application
      - labels:
          level:
          application:

Loki labels are bounded: only application, environment, level, pod. Never correlationId or userId as label (cardinality explosion).

2.6. Sensitive data scrubbing

Forbidden in logs:

Full IBAN, card PAN, SSN, government ID, full name, full address
API tokens, passwords, secrets
Full JWT bearer tokens
Customer email/phone in clear

Allowed:

IDs (UUIDs)
Last 4 of IBAN/card (****-1234)
Hashed identifiers (sha256(email))
Partial info with intent: "Account in EU country" instead of country code
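
As a minimal sketch of the two allowed forms, the helpers below produce a hashed identifier and a last-4 mask. sha256Hex and maskLast4 are hypothetical names, not functions from the codebase, and the IBAN in the usage line is a sample value.

```kotlin
import java.security.MessageDigest

// Hex-encoded SHA-256 of a value, for example a normalized email address.
fun sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(value.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

// Keeps only the last four characters of an IBAN or card number.
fun maskLast4(value: String): String {
    val compact = value.filterNot { it.isWhitespace() || it == '-' }
    return "****-" + compact.takeLast(4)
}

fun main() {
    println(maskLast4("DE89 3704 0044 0532 0130 00"))
    println(sha256Hex("user@example.com".trim().lowercase()))
}
```

A plain SHA-256 of a guessable value such as an email address can be reversed by hashing candidate values, so a keyed hash (HMAC with a secret key) is the usual hardening when the logs leave a trusted boundary.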

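A custom Logback converter scrubs known patterns. A converter runs only where a Logback pattern references it (via a conversionRule), which the plain LogstashEncoder in 2.2 does not do for its message field:

import ch.qos.logback.classic.pattern.ClassicConverter
import ch.qos.logback.classic.spi.ILoggingEvent

class PiiScrubbingConverter : ClassicConverter() {
    private val ibanPattern = Regex("""\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b""")
    // Matches any 13-19 digit run, so long numeric IDs and epoch-millisecond timestamps are masked too.
    private val cardPattern = Regex("""\b\d{13,19}\b""")
    private val emailPattern = Regex("""\b[\w.+-]+@[\w-]+\.[\w.-]+\b""")

    override fun convert(event: ILoggingEvent): String =
        event.formattedMessage
            .replace(ibanPattern) { it.value.takeLast(4).padStart(it.value.length, '*') }
            .replace(cardPattern) { "****-****-****-${it.value.takeLast(4)}" }
            // Keep at most the first two characters of the local part. The original
            // lookbehind regex left local parts of one or two characters unmasked.
            .replace(emailPattern) { it.value.substringBefore('@').take(2) + "***@" + it.value.substringAfter('@') }
}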

Better: don't log PII in the first place.

3. Traces (OpenTelemetry + Tempo)

3.1. What we trace

HTTP server requests (every inbound)
HTTP client requests (calls to external providers, KYC, bank, LLM)
JDBC (every DB query, sampled)
Kafka producer/consumer (every published/consumed event)
Redis operations (sampled)
Spring @Async boundaries
Custom spans for use-case orchestration

3.2. Sampling strategy

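Sampling every trace is expensive at scale. Strategy:

Errors: 100% sampled (always)
Slow requests (>1s p95 boundary): 100% sampled (tail-based)
Normal: 10% head-based sampling
Trace continuation: if upstream sampled, downstream also samples

The error and slow-request rules cannot be enforced by the SDK sampler configured below, because a head-based sampler decides before the status or duration of the request is known. They need a tail-sampling stage that sees complete traces, for example the OpenTelemetry Collector's tail_sampling processor placed between the services and Tempo, and the services must export those spans to it unsampled.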

otel:
  exporter:
    otlp:
      endpoint: http://tempo:4317
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: 0.1
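
The behaviour that parentbased_traceidratio names can be sketched in a few lines. This is a simplified illustration, not the OpenTelemetry SDK implementation; ratioSampled and shouldSample are hypothetical names.

```kotlin
// Head-based ratio sampling: derive a number in [0, 1] from the trace ID and
// keep the trace when that number is below the configured ratio.
fun ratioSampled(traceIdHex: String, ratio: Double): Boolean {
    require(traceIdHex.length == 32) { "trace ID must be 32 hex characters" }
    require(ratio in 0.0..1.0) { "ratio must be within [0, 1]" }
    if (ratio >= 1.0) return true
    // The low 64 bits of a random trace ID are treated as uniformly distributed.
    val low64 = traceIdHex.substring(16).toULong(16)
    val position = low64.toDouble() / ULong.MAX_VALUE.toDouble()
    return position < ratio
}

// Parent-based: when the caller already decided, follow that decision;
// only root spans (no parent) fall back to the ratio.
fun shouldSample(traceIdHex: String, parentSampled: Boolean?, ratio: Double): Boolean =
    parentSampled ?: ratioSampled(traceIdHex, ratio)

fun main() {
    val traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
    println("root span at 10%: ${shouldSample(traceId, null, 0.1)}")
    println("child of a sampled parent: ${shouldSample(traceId, true, 0.1)}")
}
```

Because the decision depends only on the trace ID, every service that sees the same trace reaches the same answer. The decision is also made before the request has run, so it has no access to its status or duration.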

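For high-value flows (payment initiation), force sampling. OpenTelemetry ships no @Trace annotation with a sampling priority, so the snippet below presumes a project-defined annotation backed by a custom Sampler: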

@Trace(samplingPriority = TraceSamplingPriority.HIGH)
suspend fun initiatePayment(cmd: InitiatePaymentCommand): Payment { ... }

3.3. Span attributes

Span names follow lowercase dotted convention:

http.request POST /v1/payments
db.query INSERT INTO payments
kafka.publish ledger.events
payment.initiate
decision.evaluate

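Standard attributes (semconv-aligned; the HTTP names below are the older forms, which current semantic conventions rename, for example http.method to http.request.method and http.status_code to http.response.status_code):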

http.method, http.status_code, http.url
db.system, db.statement (parameterized, never with values)
messaging.system, messaging.destination
Custom: fincore.aggregate.type, fincore.aggregate.id, fincore.tenant.id

3.4. Cross-service propagation

W3C Trace Context (traceparent, tracestate) propagates through:

HTTP headers (Spring auto-instruments)
Kafka headers (Spring Kafka auto-instruments)
Async boundaries (TaskDecorator)
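
For reference, the traceparent header has four dash-separated fields: version, trace ID (32 hex characters), parent span ID (16 hex characters) and flags (2 hex characters, lowest bit = sampled). The sketch below parses and formats version 00 only; TraceParent, parseTraceParent and formatTraceParent are hypothetical names, and in practice the instrumentation libraries handle this.

```kotlin
data class TraceParent(val traceId: String, val parentId: String, val sampled: Boolean)

private val traceParentRegex = Regex("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

// Returns null for anything that is not a valid version-00 header.
fun parseTraceParent(header: String): TraceParent? {
    val match = traceParentRegex.matchEntire(header.trim()) ?: return null
    val (traceId, parentId, flags) = match.destructured
    // All-zero trace or span IDs are invalid.
    if (traceId.all { it == '0' } || parentId.all { it == '0' }) return null
    val sampled = (flags.toInt(16) and 0x01) == 0x01
    return TraceParent(traceId, parentId, sampled)
}

// Only the sampled flag is written back; other flag bits are dropped.
fun formatTraceParent(tp: TraceParent): String =
    "00-${tp.traceId}-${tp.parentId}-${if (tp.sampled) "01" else "00"}"

fun main() {
    val header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    val parsed = parseTraceParent(header)
    println(parsed)
    println(parsed?.let(::formatTraceParent))
}
```

A downstream service keeps the trace ID, replaces the parent ID with the ID of its own span, and forwards the flags, which is how the sampling decision of the first service reaches every later hop.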

3.5. Configuration

otel:
  service:
    name: ${spring.application.name}
  resource:
    attributes:
      service.namespace: fincore-engine
      service.version: ${BUILD_VERSION}
      deployment.environment: ${ENVIRONMENT}
  exporter:
    otlp:
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://tempo:4317}
      protocol: grpc
  instrumentation:
    spring-webmvc.enabled: true
    jdbc.enabled: true
    kafka.enabled: true
    redisson.enabled: true

4. Dashboards (Grafana)

4.1. Dashboard set

Dashboard	Audience	Update frequency
Service Health Overview	on-call, eng manager	every 30s
API Latency & Errors	on-call	every 30s
Ledger Throughput	eng, finance	every 1m
Payments Lifecycle	eng, finance, ops	every 1m
Compliance & AML	compliance officer	every 5m
Decision Engine	risk team	every 1m
Outbox & Event Flow	eng	every 30s
Resilience	on-call	every 30s
Database (HikariCP, Postgres)	DBA, eng	every 1m
Kafka & Consumers	platform eng	every 30s
Cost / capacity	eng manager	every 1h

4.2. Service Health Overview (key panels)

Panel	Query
RPS by service	`sum by (application) (rate(http_server_requests_seconds_count[1m]))`
Error rate	`sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (application) (rate(http_server_requests_seconds_count[5m]))`
p99 latency	`histogram_quantile(0.99, sum by (le, application) (rate(http_server_requests_seconds_bucket[5m])))`
Pod count	`kube_deployment_status_replicas{namespace="fincore-engine"}`
Heap usage	`jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"}`
GC pauses	`rate(jvm_gc_pause_seconds_sum[5m])`
DB pool utilization	`hikaricp_connections_active / hikaricp_connections_max`

4.3. Dashboard JSONs

Stored in deploy/grafana/dashboards/*.json, applied via Grafana provisioning:

apiVersion: 1
providers:
  - name: fincore
    folder: 'FinCore Engine'
    type: file
    options:
      path: /var/lib/grafana/dashboards/fincore

4.4. Logs/Traces correlation in Grafana

Loki's derivedFields:

derivedFields:
  - datasourceUid: tempo  # uid of the provisioned Tempo data source; "tempo" is an assumed value
    matcherRegex: '"correlationId":"([^"]+)"'
    name: correlationId
    url: '$${__value.raw}'

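Click correlationId in log → jump to the trace with that ID. Click span in trace → jump to logs of that span. The first link resolves only when the extracted value is a trace ID that Tempo knows; the log format in 2.1 has no trace ID field, so the trace ID would have to be logged as well or reused as the correlation ID. Critical for debugging.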

5. Alerting

5.1. Alert classification

Severity	Definition	Channel	SLA
P0 - Critical	Customer-impacting, money at risk, data loss imminent	PagerDuty + phone call	5 min ack
P1 - High	Service degraded, SLO at risk	PagerDuty	15 min ack
P2 - Medium	Single subsystem degraded, error budget burning	Slack #engineering	1 hour ack
P3 - Low	Anomaly, no immediate impact	Slack #monitoring	next business day

5.2. Alert catalog (Prometheus rules)

groups:
  - name: fincore-availability
    rules:
      - alert: ServiceDown
        expr: up{job=~"fincore-.*"} == 0
        for: 2m
        labels:
          severity: P0
        annotations:
          summary: "Service {{ $labels.job }} is down"

      - alert: HighErrorRate
        expr: |
          sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          / sum by (application) (rate(http_server_requests_seconds_count[5m]))
          > 0.05
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Error rate >5% for {{ $labels.application }}"

      - alert: LedgerInvariantViolation
        expr: rate(ledger_invariant_violation_total[1m]) > 0
        for: 0m   # immediate
        labels: { severity: P0 }
        annotations:
          summary: "Ledger invariant violation - possible data corruption"
          runbook: "https://github.com/tiana-code/fincore-engine/wiki/Runbook#ledger-invariant"

      - alert: OutboxBacklog
        expr: outbox_events_pending > 1000
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Outbox backlog growing in {{ $labels.schema }}"

      - alert: ConsumerLag
        expr: kafka_consumergroup_lag > 10000
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Consumer lag for {{ $labels.consumergroup }}"

      - alert: CircuitBreakerOpen
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 2m
        labels: { severity: P1 }
        annotations:
          summary: "Circuit OPEN: {{ $labels.name }}"

      - alert: DLQNonZero
        expr: kafka_topic_log_size{topic=~".*\\.dlq"} > 0
        for: 0m
        labels: { severity: P2 }
        annotations:
          summary: "DLQ {{ $labels.topic }} has messages"

      - alert: SagaManualIntervention
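        expr: increase(saga_requires_manual_intervention_total[15m]) > 0  # a raw counter never returns to 0, so comparing it directly would fire until restart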
        for: 0m
        labels: { severity: P1 }
        annotations:
          summary: "Saga requires manual intervention"

5.3. Burn rate alerts (SLO-driven)

For SLO-based alerting (multi-window, multi-burn-rate per Google SRE workbook):

- alert: AvailabilityBurnRateFast
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{application="ledger-service",status=~"5.."}[1h]))
      /
      sum(rate(http_server_requests_seconds_count{application="ledger-service"}[1h]))
    ) > (14.4 * 0.0005)  # 14.4× burn uses 2% of the 30d budget per hour (all of it in ~50h)
  for: 2m
  labels: { severity: P1, slo: availability }

- alert: AvailabilityBurnRateSlow
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{application="ledger-service",status=~"5.."}[6h]))
      /
      sum(rate(http_server_requests_seconds_count{application="ledger-service"}[6h]))
    ) > (3 * 0.0005)
  for: 1h
  labels: { severity: P2, slo: availability }
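
The thresholds in these rules follow from simple error budget arithmetic. The sketch below works through it for a 30-day window; the function names are hypothetical, and 0.0005 is the error budget taken from the rules above (it corresponds to a 99.95% target).

```kotlin
// Length of the SLO window in hours (30 days).
const val SLO_WINDOW_HOURS = 30.0 * 24.0

// Error ratio at which the budget burns burnRate times faster than allowed.
fun errorRatioThreshold(burnRate: Double, errorBudget: Double): Double =
    burnRate * errorBudget

// Hours until the whole budget is gone if the burn rate is sustained.
fun hoursToExhaustion(burnRate: Double): Double = SLO_WINDOW_HOURS / burnRate

// Fraction of the budget consumed after windowHours at the given burn rate.
fun budgetConsumed(burnRate: Double, windowHours: Double): Double =
    burnRate * windowHours / SLO_WINDOW_HOURS

fun main() {
    val budget = 0.0005
    for ((burnRate, windowHours) in listOf(14.4 to 1.0, 3.0 to 6.0)) {
        println(
            "burn ${burnRate}x over ${windowHours}h: " +
                "threshold=${errorRatioThreshold(burnRate, budget)}, " +
                "budget used=${budgetConsumed(burnRate, windowHours)}, " +
                "exhausted after ${hoursToExhaustion(burnRate)}h"
        )
    }
}
```

For the fast rule this gives an error-ratio threshold of about 0.0072, 2% of the budget used per hour and exhaustion after roughly 50 hours. For the slow rule it gives about 0.0015, 2.5% used per 6 hours and exhaustion after 240 hours (10 days).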

5.4. Alert hygiene rules

Every alert has a runbook link in annotations
Alerts that fire >5 times/week without action are reviewed (delete or fix)
Silence policy: silences expire (max 24h), require justification
Alert fatigue is a P1 risk - we measure noise per on-call shift

6. Audit (separate from logs)

Distinct from operational logs. See Architecture-Security#audit.

Stored in dedicated audit_events table (per service or central)
Retention: 7 years (regulatory)
Format: JSON, append-only
Shipped to SIEM in addition to local storage
Never displayed in operational dashboards
Queryable by auditors via dedicated read replica

7. Runbooks

Every alert links to a runbook. Runbook structure:

# Runbook: Ledger Invariant Violation

## Severity: P0 - Customer-impacting, possible data corruption

## What this means
The deferred trigger detected a transaction where SUM(entries.amount) ≠ 0 per currency.
This should be impossible - investigate immediately.

## Detection
- Alert: LedgerInvariantViolation
- Metric: ledger_invariant_violation_total
- Logs: `application="ledger-service" AND level=ERROR AND message~"invariant"`

## Immediate actions
1. Check the metric is real, not a glitch (compare instances)
2. Identify the offending transaction:

SELECT *
FROM transactions
WHERE id IN (
    SELECT transaction_id
    FROM entries
    GROUP BY transaction_id, currency
    HAVING SUM(amount) <> 0
);

3. If a recent deploy: roll back
4. If not, page eng manager + DBA

## Investigation
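- Was the deferred trigger disabled? `\df+ verify_double_entry_invariant` shows only the function; `\d entries` lists the table's triggers and marks disabled ones (assuming the trigger is attached to entries)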
- Is the materialized view stale?
- Is there a data import that bypassed the trigger?

## Recovery
- DO NOT delete the offending entries (immutable journal)
- Determine intended correct state
- Post a compensating transaction if reverse is mathematically valid
- Otherwise: escalate to data-integrity working group

## Post-incident
- Mandatory post-mortem within 48h
- Root cause must include: how was invariant bypassed, why didn't tests catch it

Runbooks live in runbooks/ directory and on Wiki.

8. Local development observability

docker compose --profile observability up brings up:

Grafana (port 3000, default creds: admin/admin)
Prometheus (port 9090)
Loki (port 3100)
Tempo (port 4317 OTLP, 3200 query)
All services pre-wired with metrics scrape, log shipping, OTLP export

For dev, sampling = 100%, log level = DEBUG, retention shortened.

9. Cost considerations

Observability costs can grow faster than traffic, driven mostly by label cardinality, log verbosity, and trace volume. Budgets:

Tier	Volume
Metrics	< 100k active series per service
Logs	< 10 GB/day per service
Traces	< 1 GB/day per service (after sampling)

Cost optimization:

Drop high-cardinality labels (per-userId, per-correlationId)
Sample at source (don't ship 100% to drop 90% downstream)
Aggregate before shipping (Vector / Fluent Bit)
Use exemplars (single trace ID per histogram bucket) instead of full traces
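
To see why high-cardinality labels come first in that list, consider the worst-case series count of a single metric: the product of the distinct values of each label, multiplied by the bucket count for histograms. The sketch below is illustrative; worstCaseSeries is a hypothetical helper and the label counts are made-up numbers.

```kotlin
// Worst-case number of time series for one metric.
fun worstCaseSeries(labelValueCounts: Map<String, Int>, histogramBuckets: Int = 1): Long =
    labelValueCounts.values.fold(histogramBuckets.toLong()) { acc, n -> acc * n }

fun main() {
    // Hypothetical label counts for one HTTP latency histogram with 20 buckets.
    val bounded = mapOf("uri" to 50, "method" to 4, "status" to 10)
    println("bounded labels: ${worstCaseSeries(bounded, histogramBuckets = 20)} series")

    // The same histogram with a per-user label added.
    val withUserId = bounded + ("userId" to 10_000)
    println("with userId: ${worstCaseSeries(withUserId, histogramBuckets = 20)} series")
}
```

With the bounded labels the worst case is 40,000 series, already a large share of a 100k budget. Adding a label with 10,000 values multiplies that to 400,000,000, which no sampling or retention setting can compensate for.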

10. SLI ↔ Metric mapping

Specific SLIs from Architecture-SLA-SLI-SLO map to:

SLI	Metric expression
Ledger Post Availability	`sum(rate(http_server_requests_seconds_count{uri="/v1/transactions",status!~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count{uri="/v1/transactions"}[5m]))`
Ledger Post p99 Latency	`histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{uri="/v1/transactions"}[5m])))`
Decision Eval p99 Latency	`histogram_quantile(0.99, rate(decision_evaluation_duration_bucket[5m]))`
Webhook Delivery Success Within 30s	`sum(rate(webhooks_delivered_within_30s_total[5m])) / sum(rate(webhooks_delivered_total[5m]))`
Outbox Lag p99	`histogram_quantile(0.99, rate(outbox_dispatcher_lag_seconds_bucket[5m]))`
Idempotency Correctness	`1 - (rate(idempotency_violation_total[5m]) / rate(idempotency_check_total[5m]))`

These power both the SLO dashboard and burn-rate alerts.

11. Observability checklist for new code

When adding a feature:

12. Tools summary

Concern	Tool	Why
Metrics scrape	Prometheus	Industry standard, pull-based, multi-dimensional
Long-term metrics	Mimir / Cortex / VictoriaMetrics	Scale Prometheus to years
Logs aggregation	Loki	Cheap, label-based, integrates with Grafana
Traces	Tempo	Cheap, no external deps, integrates with Grafana
Visualization	Grafana	Single pane of glass for metrics + logs + traces
Alerting	Prometheus Alertmanager → PagerDuty/Slack	Standard, reliable
Profiling (continuous)	Pyroscope (optional, Y1 H2)	CPU/memory hotspots in production
Synthetic monitoring	k6 / Blackbox Exporter	External "is the API up" checks
Real User Monitoring	Sentry (optional)	Frontend errors when dashboard ships (Phase E v0.4)
