Architecture Observability
Logs, metrics, traces, dashboards, alerting. The three pillars done right for fintech. Companion to Architecture-Overview, Architecture-SLA-SLI-SLO, Architecture-Resilience.
| Pillar | Purpose | Tool | Storage |
|---|---|---|---|
| Metrics | What's happening at scale | Micrometer → Prometheus | Prometheus 30d + remote-write to long-term store |
| Logs | What happened to a specific request | SLF4J + Logback + structured JSON → Loki | Loki 14d + S3 archive |
| Traces | How a request flowed across services | OpenTelemetry → Tempo | Tempo 7d, sampled |
Plus a fourth implicit pillar: Audit - what humans/services did. Stored separately, regulatory retention. See Architecture-Security.
Three categories per service:
- Latency - request duration histogram per endpoint
- Traffic - requests per second per endpoint
- Errors - error rate per endpoint, broken down by status code
- Saturation - DB pool usage, thread pool usage, queue depths
- `ledger.transactions.posted.total` - counter, label by currency
- `ledger.transactions.reversed.total` - counter
- `ledger.entries.written.total` - counter
- `ledger.invariant.violation.total` - counter (should always be 0; alert on any)
- `payments.initiated.total` - counter
- `payments.completed.total` - counter
- `payments.failed.total` - counter, label by reason
- `payments.permanently_failed.total` - counter
- `decision.evaluations.total` - counter, label by decision (APPROVE/REJECT/REVIEW)
- `decision.evaluation.duration` - histogram (target p99 < 10ms)
- `aml.alerts.created.total` - counter, label by riskScoreBucket
- `compliance.cases.opened.total` - counter
- `compliance.cases.resolved.total` - counter, label by decision
- `kyc.sessions.created.total` - counter
- `kyc.sessions.approved.total` - counter
- `kyc.sessions.rejected.total` - counter
- `webhooks.delivered.total` - counter, label by subscriptionId, status
- `webhooks.permanently_failed.total` - counter
- `outbox.events.pending` - gauge per schema
- `outbox.events.failed` - counter (alert on >0)
- `outbox.dispatcher.lag.seconds` - histogram (publish-time minus row-create-time)
- `kafka.consumer.lag` - gauge per consumer group
- `resilience4j.circuitbreaker.state` - gauge per breaker
- `resilience4j.circuitbreaker.failure.rate` - gauge per breaker
- `cache.hit.rate` - gauge per cache
- `hikari.connections.active/idle/pending` - gauges per pool
- `saga.requires_manual_intervention.total` - counter (alert on any)
- Heap, non-heap, GC pauses, thread states, file descriptors
- Class loading
- HikariCP pool metrics
- Spring's `http.server.requests` histogram
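- Snake_case for metric names as exposed to Prometheus; Micrometer exports the dotted names above with dots replaced by underscores (`ledger.transactions.posted.total` becomes `ledger_transactions_posted_total`)
- Past tense for counters: `created`, `posted`, `delivered`, `failed`
- Units in name when relevant: `duration_seconds`, `lag_seconds`, `size_bytes`
- Cardinality bounded: max ~100 unique label values per label
- No PII as labels: only IDs, status codes, enum values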
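The relationship between the dotted meter names and the snake_case names used in PromQL can be sketched as follows. This is a simplified illustration, not Micrometer's own naming convention code; `MeterKind` and `toPrometheusName` are hypothetical names introduced only for this example.

```kotlin
// Hypothetical helper: approximates how a dotted meter name shows up in Prometheus.
// Simplified on purpose; it is not Micrometer's implementation.
enum class MeterKind { COUNTER, GAUGE, TIMER }

fun toPrometheusName(meterName: String, kind: MeterKind): String {
    // Dots are not valid in Prometheus metric names, so they become underscores.
    val snake = meterName.replace('.', '_')
    return when (kind) {
        // Counters carry a _total suffix; do not duplicate it if already present.
        MeterKind.COUNTER -> if (snake.endsWith("_total")) snake else "${snake}_total"
        // Timers are exported in seconds; keep an existing unit suffix as-is.
        MeterKind.TIMER -> if (snake.endsWith("_seconds")) snake else "${snake}_seconds"
        MeterKind.GAUGE -> snake
    }
}
```

For example, `toPrometheusName("outbox.dispatcher.lag.seconds", MeterKind.TIMER)` yields `outbox_dispatcher_lag_seconds`, which is the prefix of the `_bucket` series queried in the SLI table.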
management:
  endpoints:
    web:
      exposure:
        include: health, info, metrics, prometheus, threaddump, heapdump
      base-path: /actuator
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
    metrics:
      enabled: true
    prometheus:
      enabled: true
  metrics:
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.95, 0.99
      slo:  # this property was named `sla` before Spring Boot 2.3
        http.server.requests: 50ms, 100ms, 500ms, 1s, 5s
    tags:
      application: ${spring.application.name}
      environment: ${ENVIRONMENT:unknown}
      version: ${BUILD_VERSION:dev}

import io.micrometer.core.instrument.Counter
import io.micrometer.core.instrument.MeterRegistry
import io.micrometer.core.instrument.Timer
import org.springframework.stereotype.Component

@Component
class TransactionMetrics(private val meterRegistry: MeterRegistry) {

    // Latency of posting a double-entry transaction.
    private val postLatency = Timer.builder("ledger.transactions.post.duration")
        .description("Time to post a double-entry transaction")
        .publishPercentiles(0.5, 0.95, 0.99)
        .publishPercentileHistogram(true)
        .register(meterRegistry)

    // One counter per currency. register() returns the existing meter for a known
    // name + tag combination, so calling it on every increment is safe.
    // All meters sharing a name must use the same tag keys, so the "application"
    // tag comes from management.metrics.tags rather than being set here.
    fun recordPosted(currency: String) =
        Counter.builder("ledger.transactions.posted.total")
            .description("Total ledger transactions successfully posted")
            .tag("currency", currency)
            .register(meterRegistry)
            .increment()

    fun timePosting(): Timer.Sample = Timer.start(meterRegistry)
    fun stopTimer(sample: Timer.Sample) = sample.stop(postLatency)
}

ServiceMonitor (Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ledger-service
spec:
  selector:
    matchLabels:
      app: ledger-service
  endpoints:
    - port: actuator
      path: /actuator/prometheus
      interval: 15s
      scrapeTimeout: 10s

Or static config:
scrape_configs:
  - job_name: fincore-ledger
    metrics_path: /actuator/prometheus
    scrape_interval: 15s
    static_configs:
      - targets: ['ledger-service:8080']
        labels: { service: ledger }

All production logs are structured JSON. Example:
{
"@timestamp": "2026-04-25T10:00:00.123Z",
"level": "INFO",
"logger": "com.fincore.ledger.application.TransactionServiceImpl",
"message": "Transaction posted",
"thread": "http-nio-8080-exec-3",
"application": "ledger-service",
"environment": "production",
"version": "0.1.0",
"correlationId": "01HX...",
"requestId": "01HX...",
"userId": "01HX...",
"actorType": "USER",
"transactionId": "tx_01HX...",
"reference": "demo-001",
"currency": "EUR",
"entriesCount": 2,
"duration_ms": 47
}

<configuration>
<appender name="JSON_STDOUT" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<includeMdcKeyName>correlationId</includeMdcKeyName>
<includeMdcKeyName>requestId</includeMdcKeyName>
<includeMdcKeyName>userId</includeMdcKeyName>
<includeMdcKeyName>actorType</includeMdcKeyName>
<includeMdcKeyName>tenantId</includeMdcKeyName>
<fieldNames>
<timestamp>@timestamp</timestamp>
<message>message</message>
<thread>thread</thread>
<level>level</level>
<logger>logger</logger>
</fieldNames>
<customFields>{"application":"${spring.application.name}","environment":"${ENVIRONMENT}","version":"${BUILD_VERSION}"}</customFields>
<stackTraceConverter class="net.logstash.logback.stacktrace.ShortenedThrowableConverter">
<maxDepthPerThrowable>20</maxDepthPerThrowable>
<maxLength>2048</maxLength>
<shortenedClassNameLength>30</shortenedClassNameLength>
</stackTraceConverter>
</encoder>
</appender>
<root level="INFO">
<appender-ref ref="JSON_STDOUT"/>
</root>
</configuration>

A request-scoped filter adds correlationId and requestId to MDC on entry and clears MDC on exit:
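// jakarta.* assumes Spring Boot 3; use javax.* on Spring Boot 2
import jakarta.servlet.FilterChain
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
import org.slf4j.MDC
import org.springframework.stereotype.Component
import org.springframework.web.filter.OncePerRequestFilter
import java.util.UUID

@Component
class CorrelationIdFilter : OncePerRequestFilter() {
    override fun doFilterInternal(req: HttpServletRequest, resp: HttpServletResponse, chain: FilterChain) {
        // Reuse the caller's correlation ID when present so one ID spans the whole call chain.
        val correlationId = req.getHeader("X-Correlation-Id") ?: UUID.randomUUID().toString()
        val requestId = UUID.randomUUID().toString()
        try {
            MDC.put("correlationId", correlationId)
            MDC.put("requestId", requestId)
            resp.setHeader("X-Correlation-Id", correlationId)
            chain.doFilter(req, resp)
        } finally {
            // Servlet threads are pooled; clear MDC so values do not leak into the next request.
            MDC.clear()
        }
    }
}

For Kafka consumers and async tasks, MDC is propagated via TaskDecorator: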
@Bean
fun mdcTaskDecorator() = TaskDecorator { runnable ->
    val context = MDC.getCopyOfContextMap()
    Runnable {
        val previous = MDC.getCopyOfContextMap()
        if (context != null) MDC.setContextMap(context) else MDC.clear()
        try { runnable.run() } finally {
            if (previous != null) MDC.setContextMap(previous) else MDC.clear()
        }
    }
}

| Level | Use |
|---|---|
| ERROR | System failure that requires investigation. Always include exception. |
| WARN | Degraded behavior, fallback used, retry scheduled. |
| INFO | Significant state transitions: account created, payment completed, rule activated. |
| DEBUG | Flow detail useful for support. Off in production by default. |
| TRACE | Verbose data dumps. Off in production. |
Production default: INFO. Adjustable per logger via Spring Boot Admin or env var:
LOGGING_LEVEL_COM_FINCORE_PAYMENTS=DEBUG
# Promtail config (or Vector / Fluent Bit)
clients:
  - url: https://loki.example.com/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    pipeline_stages:
      - json:
          expressions:
            level: level
            correlationId: correlationId
            application: application
      - labels:
          level:
          application:

Loki labels are bounded: only application, environment, level, pod. Never correlationId or userId as label (cardinality explosion).
Forbidden in logs:
- Full IBAN, card PAN, SSN, government ID, full name, full address
- API tokens, passwords, secrets
- Full JWT bearer tokens
- Customer email/phone in clear
Allowed:
- IDs (UUIDs)
- Last 4 of IBAN/card (`****-1234`)
- Hashed identifiers (`sha256(email)`)
- Partial info with intent: "Account in EU country" instead of country code
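The two allowed forms above, a hashed identifier and a last-4 mask, might be produced with helpers like the following. This is a minimal sketch: `sha256Hex` and `maskLast4` are hypothetical names for illustration, not functions from the codebase.

```kotlin
import java.security.MessageDigest

// Hypothetical helpers for log-safe identifiers; names are illustrative only.

/** Hex-encoded SHA-256 of the UTF-8 bytes of [value]. */
fun sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
        .digest(value.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }

/** Keeps only the last four characters, in the ****-1234 form shown above. */
fun maskLast4(value: String): String = "****-" + value.takeLast(4)
```

An unsalted hash of a low-entropy value such as an email address can be reversed with a dictionary attack, so a keyed hash (HMAC with a secret) may be the safer choice where the hash must not be linkable outside the system.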
A custom Logback converter scrubs known patterns:
import ch.qos.logback.classic.pattern.ClassicConverter
import ch.qos.logback.classic.spi.ILoggingEvent

class PiiScrubbingConverter : ClassicConverter() {
    private val IBAN_PATTERN = Regex("""\b[A-Z]{2}\d{2}[A-Z0-9]{4,30}\b""")
    private val CARD_PATTERN = Regex("""\b\d{13,19}\b""")
    private val EMAIL_PATTERN = Regex("""\b[\w.+-]+@[\w-]+\.[\w.-]+\b""")

    override fun convert(event: ILoggingEvent): String =
        event.formattedMessage
            // IBAN: keep the last 4 characters, mask the rest at the same length
            .replace(IBAN_PATTERN) { it.value.takeLast(4).padStart(it.value.length, '*') }
            // Card number: keep the last 4 digits
            .replace(CARD_PATTERN) { "****-****-****-${it.value.takeLast(4)}" }
            // Email: keep the first 2 characters of the local part and the domain
            .replace(EMAIL_PATTERN) { it.value.replace(Regex("""(?<=.{2}).+(?=@)"""), "***") }
}

Better: don't log PII in the first place.
- HTTP server requests (every inbound)
- HTTP client requests (calls to external providers, KYC, bank, LLM)
- JDBC (every DB query, sampled)
- Kafka producer/consumer (every published/consumed event)
- Redis operations (sampled)
- Spring `@Async` boundaries
- Custom spans for use-case orchestration
100% sampling = expensive at scale. Strategy:
- Errors: 100% sampled (always)
- Slow requests (>1s p95 boundary): 100% sampled (tail-based)
- Normal: 10% head-based sampling
- Trace continuation: if upstream sampled, downstream also samples
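The head-based part of this strategy (10% of normal traffic, with the upstream decision honored) can be sketched as below. This only illustrates the idea behind a parent-based trace-ID-ratio sampler; it is not the OpenTelemetry SDK's code, and `sampledByRatio` and `shouldSample` are hypothetical names. The tail-based rules (errors, slow requests) need the finished trace and are usually applied in a collector rather than inside the service.

```kotlin
// Hypothetical sketch of a parent-based, ratio-based head sampling decision.

/** Deterministic per-trace decision: the same trace ID always gets the same answer. */
fun sampledByRatio(traceIdHex: String, ratio: Double): Boolean {
    require(traceIdHex.length == 32) { "trace ID must be 32 hex characters" }
    require(ratio in 0.0..1.0) { "ratio must be within [0, 1]" }
    // Treat the low 48 bits of the trace ID as a uniform value in [0, 2^48).
    val low = traceIdHex.takeLast(12).toLong(16)
    val threshold = ratio * (1L shl 48).toDouble()
    return low.toDouble() < threshold
}

/** The parent's decision wins when there is a parent; root spans fall back to the ratio. */
fun shouldSample(traceIdHex: String, parentSampled: Boolean?, ratio: Double): Boolean =
    parentSampled ?: sampledByRatio(traceIdHex, ratio)
```

Because the decision depends only on the trace ID, every service that sees the same root trace reaches the same verdict even without coordination.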
otel:
  exporter:
    otlp:
      endpoint: http://tempo:4317
  traces:
    sampler: parentbased_traceidratio
    sampler-arg: 0.1

For high-value flows (payment initiation), force sampling:
@Trace(samplingPriority = TraceSamplingPriority.HIGH)
suspend fun initiatePayment(cmd: InitiatePaymentCommand): Payment { ... }

Span names follow lowercase dotted convention:
http.request POST /v1/payments
db.query INSERT INTO payments
kafka.publish ledger.events
payment.initiate
decision.evaluate
Standard attributes (semconv-aligned):
- `http.method`, `http.status_code`, `http.url`
- `db.system`, `db.statement` (parameterized, never with values)
- `messaging.system`, `messaging.destination`
- Custom: `fincore.aggregate.type`, `fincore.aggregate.id`, `fincore.tenant.id`
W3C Trace Context (traceparent, tracestate) propagates through:
- HTTP headers (Spring auto-instruments)
- Kafka headers (Spring Kafka auto-instruments)
- Async boundaries (TaskDecorator)
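For reference, the `traceparent` header carried across these boundaries has four dash-separated hex fields: version, trace ID, parent span ID, and flags. The parser below is a minimal sketch for the version `00` layout; `TraceParent` and `parseTraceParent` are hypothetical names, and in practice the instrumentation libraries handle this parsing.

```kotlin
// Hypothetical parser for the W3C traceparent header (version 00 layout).
data class TraceParent(val traceId: String, val parentId: String, val sampled: Boolean)

val TRACEPARENT_PATTERN = Regex("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

fun parseTraceParent(header: String): TraceParent? {
    val match = TRACEPARENT_PATTERN.matchEntire(header.trim()) ?: return null
    val (version, traceId, parentId, flags) = match.destructured
    // Version ff and all-zero IDs are invalid per the specification.
    if (version == "ff" || traceId.all { it == '0' } || parentId.all { it == '0' }) return null
    // The lowest flag bit is the "sampled" flag.
    val sampled = (flags.toInt(16) and 0x01) == 0x01
    return TraceParent(traceId, parentId, sampled)
}
```

The `sampled` flag is what lets a downstream service continue a trace that an upstream service already decided to sample.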
otel:
  service:
    name: ${spring.application.name}
  resource:
    attributes:
      service.namespace: fincore-engine
      service.version: ${BUILD_VERSION}
      deployment.environment: ${ENVIRONMENT}
  exporter:
    otlp:
      endpoint: ${OTEL_EXPORTER_OTLP_ENDPOINT:http://tempo:4317}
      protocol: grpc
  instrumentation:
    spring-webmvc.enabled: true
    jdbc.enabled: true
    kafka.enabled: true
    redisson.enabled: true

| Dashboard | Audience | Update frequency |
|---|---|---|
| Service Health Overview | on-call, eng manager | every 30s |
| API Latency & Errors | on-call | every 30s |
| Ledger Throughput | eng, finance | every 1m |
| Payments Lifecycle | eng, finance, ops | every 1m |
| Compliance & AML | compliance officer | every 5m |
| Decision Engine | risk team | every 1m |
| Outbox & Event Flow | eng | every 30s |
| Resilience | on-call | every 30s |
| Database (HikariCP, Postgres) | DBA, eng | every 1m |
| Kafka & Consumers | platform eng | every 30s |
| Cost / capacity | eng manager | every 1h |
| Panel | Query |
|---|---|
| RPS by service | sum by (application) (rate(http_server_requests_seconds_count[1m])) |
| Error rate | sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m])) / sum by (application) (rate(http_server_requests_seconds_count[5m])) |
| p99 latency | histogram_quantile(0.99, sum by (le, application) (rate(http_server_requests_seconds_bucket[5m]))) |
| Pod count | kube_deployment_status_replicas{namespace="fincore-engine"} |
| Heap usage | jvm_memory_used_bytes{area="heap"} / jvm_memory_max_bytes{area="heap"} |
| GC pauses | rate(jvm_gc_pause_seconds_sum[5m]) |
| DB pool utilization | hikari_connections_active / hikari_connections_max |
Stored in deploy/grafana/dashboards/*.json, applied via Grafana provisioning:
apiVersion: 1
providers:
  - name: fincore
    folder: 'FinCore Engine'
    type: file
    options:
      path: /var/lib/grafana/dashboards/fincore

Loki's derivedFields:
derivedFields:
  - datasourceName: Tempo
    matcherRegex: '"correlationId":"([^"]+)"'
    name: correlationId
    url: '$${__value.raw}'

Click correlationId in log → jump to all traces with that ID. Click span in trace → jump to logs of that span. Critical for debugging.
| Severity | Definition | Channel | SLA |
|---|---|---|---|
| P0 - Critical | Customer-impacting, money at risk, data loss imminent | PagerDuty + phone call | 5 min ack |
| P1 - High | Service degraded, SLO at risk | PagerDuty | 15 min ack |
| P2 - Medium | Single subsystem degraded, error budget burning | Slack #engineering | 1 hour ack |
| P3 - Low | Anomaly, no immediate impact | Slack #monitoring | next business day |
groups:
  - name: fincore-availability
    rules:
      - alert: ServiceDown
        expr: up{job=~"fincore-.*"} == 0
        for: 2m
        labels:
          severity: P0
        annotations:
          summary: "Service {{ $labels.job }} is down"
      - alert: HighErrorRate
        expr: |
          sum by (application) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
            / sum by (application) (rate(http_server_requests_seconds_count[5m]))
            > 0.05
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Error rate >5% for {{ $labels.application }}"
      - alert: LedgerInvariantViolation
        expr: rate(ledger_invariant_violation_total[1m]) > 0
        for: 0m  # immediate
        labels: { severity: P0 }
        annotations:
          summary: "Ledger invariant violation - possible data corruption"
          runbook: "https://github.com/tiana-code/fincore-engine/wiki/Runbook#ledger-invariant"
      - alert: OutboxBacklog
        expr: outbox_events_pending > 1000
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Outbox backlog growing in {{ $labels.schema }}"
      - alert: ConsumerLag
        expr: kafka_consumergroup_lag > 10000
        for: 5m
        labels: { severity: P1 }
        annotations:
          summary: "Consumer lag for {{ $labels.consumergroup }}"
      - alert: CircuitBreakerOpen
        # The state gauge has one series per state; the active state has value 1.
        expr: resilience4j_circuitbreaker_state{state="open"} == 1
        for: 2m
        labels: { severity: P1 }
        annotations:
          summary: "Circuit OPEN: {{ $labels.name }}"
      - alert: DLQNonZero
        expr: kafka_topic_log_size{topic=~".*\\.dlq"} > 0
        for: 0m
        labels: { severity: P2 }
        annotations:
          summary: "DLQ {{ $labels.topic }} has messages"
      - alert: SagaManualIntervention
        expr: saga_requires_manual_intervention_total > 0
        for: 0m
        labels: { severity: P1 }
        annotations:
          summary: "Saga requires manual intervention"

For SLO-based alerting (multi-window, multi-burn-rate per Google SRE workbook):
- alert: AvailabilityBurnRateFast
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{application="ledger-service",status=~"5.."}[1h]))
      /
      sum(rate(http_server_requests_seconds_count{application="ledger-service"}[1h]))
    ) > (14.4 * 0.0005)  # 14.4× burn spends 2% of the 30d budget per hour (budget exhausted in ~50h)
  for: 2m
  labels: { severity: P1, slo: availability }
- alert: AvailabilityBurnRateSlow
  expr: |
    (
      sum(rate(http_server_requests_seconds_count{application="ledger-service",status=~"5.."}[6h]))
      /
      sum(rate(http_server_requests_seconds_count{application="ledger-service"}[6h]))
    ) > (3 * 0.0005)
  for: 1h
  labels: { severity: P2, slo: availability }

- Every alert has a runbook link in annotations
- Alerts that fire >5 times/week without action are reviewed (delete or fix)
- Silence policy: silences expire (max 24h), require justification
- Alert fatigue is a P1 risk - we measure noise per on-call shift
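The arithmetic behind the burn-rate thresholds in the SLO alerts can be checked with a few lines. The sketch assumes that the `0.0005` in the rule expressions is the error budget of a 99.95% availability SLO measured over 30 days; the function names are hypothetical.

```kotlin
// Hypothetical helpers for reasoning about burn-rate thresholds.

/** Error-rate threshold that corresponds to a given burn rate. */
fun errorRateThreshold(burnRate: Double, errorBudget: Double): Double =
    burnRate * errorBudget

/** Hours until the whole budget is spent if the burn rate is sustained. */
fun hoursToExhaustion(burnRate: Double, sloWindowDays: Int): Double =
    sloWindowDays * 24.0 / burnRate

/** Fraction of the budget consumed after burning at [burnRate] for [windowHours]. */
fun budgetConsumed(burnRate: Double, windowHours: Double, sloWindowDays: Int): Double =
    burnRate * windowHours / (sloWindowDays * 24.0)
```

With these definitions, a 14.4× burn rate corresponds to an error rate of about 0.72%, consumes about 2% of a 30-day budget in one hour, and would exhaust it in roughly 50 hours. The 3× rate over a 6h window consumes about 2.5%.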
Distinct from operational logs. See Architecture-Security#audit.
- Stored in dedicated
audit_eventstable (per service or central) - Retention: 7 years (regulatory)
- Format: JSON, append-only
- Shipped to SIEM in addition to local storage
- Never displayed in operational dashboards
- Queryable by auditors via dedicated read replica
Every alert links to a runbook. Runbook structure:
# Runbook: Ledger Invariant Violation
## Severity: P0 - Customer-impacting, possible data corruption
## What this means
The deferred trigger detected a transaction where SUM(entries.amount) ≠ 0 per currency.
This should be impossible - investigate immediately.
## Detection
- Alert: LedgerInvariantViolation
- Metric: ledger_invariant_violation_total
- Logs: `application="ledger-service" AND level=ERROR AND message~"invariant"`
## Immediate actions
1. Check the metric is real, not a glitch (compare instances)
2. Identify the offending transaction:
   SELECT * FROM transactions WHERE id IN (
     SELECT transaction_id FROM entries
     GROUP BY transaction_id, currency
     HAVING SUM(amount) <> 0
   );
3. If a recent deploy: roll back
4. If not, page eng manager + DBA
## Investigation
- Was the deferred trigger disabled? `\df+ verify_double_entry_invariant`
- Is the materialized view stale?
- Is there a data import that bypassed the trigger?
## Recovery
- DO NOT delete the offending entries (immutable journal)
- Determine intended correct state
- Post a compensating transaction if reverse is mathematically valid
- Otherwise: escalate to data-integrity working group
## Post-incident
- Mandatory post-mortem within 48h
- Root cause must include: how was invariant bypassed, why didn't tests catch it
Runbooks live in runbooks/ directory and on Wiki.
docker compose --profile observability up brings up:
- Grafana (port 3000, default creds: admin/admin)
- Prometheus (port 9090)
- Loki (port 3100)
- Tempo (port 4317 OTLP, 3200 query)
- All services pre-wired with metrics scrape, log shipping, OTLP export
For dev, sampling = 100%, log level = DEBUG, retention shortened.
Observability costs grow superlinearly. Budgets:
| Tier | Volume |
|---|---|
| Metrics | < 100k active series per service |
| Logs | < 10 GB/day per service |
| Traces | < 1 GB/day per service (after sampling) |
Cost optimization:
- Drop high-cardinality labels (per-userId, per-correlationId)
- Sample at source (don't ship 100% to drop 90% downstream)
- Aggregate before shipping (Vector / Fluent Bit)
- Use exemplars (single trace ID per histogram bucket) instead of full traces
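To see how the 100k active-series budget interacts with the guideline of roughly 100 values per label, the worst case for a single metric can be estimated as the product of its label cardinalities. The helper below is a rough sketch with a hypothetical name; real series counts are usually lower because not every label combination occurs.

```kotlin
// Hypothetical estimate of the worst-case number of series for one metric:
// the product of label cardinalities, times the series exported per label set
// (for example, one per histogram bucket plus count and sum).
fun worstCaseSeries(labelCardinalities: List<Int>, seriesPerLabelSet: Int = 1): Long =
    labelCardinalities.fold(seriesPerLabelSet.toLong()) { acc, n -> acc * n }
```

Two labels at 100 values each give at most 10,000 series, but a third such label allows up to 1,000,000, ten times the per-service budget. This is why a per-label limit alone is not enough and why high-cardinality labels are the first thing to drop.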
Specific SLIs from Architecture-SLA-SLI-SLO map to:
| SLI | Metric expression |
|---|---|
| Ledger Post Availability | sum(rate(http_server_requests_seconds_count{uri="/v1/transactions",status!~"5.."}[5m])) / sum(rate(http_server_requests_seconds_count{uri="/v1/transactions"}[5m])) |
| Ledger Post p99 Latency | histogram_quantile(0.99, sum by (le) (rate(http_server_requests_seconds_bucket{uri="/v1/transactions"}[5m]))) |
| Decision Eval p99 Latency | histogram_quantile(0.99, rate(decision_evaluation_duration_bucket[5m])) |
| Webhook Delivery Success Within 30s | sum(rate(webhooks_delivered_within_30s_total[5m])) / sum(rate(webhooks_delivered_total[5m])) |
| Outbox Lag p99 | histogram_quantile(0.99, rate(outbox_dispatcher_lag_seconds_bucket[5m])) |
| Idempotency Correctness | 1 - (rate(idempotency_violation_total[5m]) / rate(idempotency_check_total[5m])) |
These power both the SLO dashboard and burn-rate alerts.
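The latency SLIs rely on `histogram_quantile`, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below follows that idea for classic cumulative buckets; it is a simplified illustration, not Prometheus source code, and `Bucket` and `histogramQuantile` are hypothetical names.

```kotlin
// Hypothetical re-implementation of the interpolation idea behind histogram_quantile.
data class Bucket(val le: Double, val cumulativeCount: Double)

fun histogramQuantile(q: Double, buckets: List<Bucket>): Double {
    require(q in 0.0..1.0) { "q must be within [0, 1]" }
    val sorted = buckets.sortedBy { it.le }
    require(sorted.size >= 2 && sorted.last().le == Double.POSITIVE_INFINITY) {
        "need at least one finite bucket and a +Inf bucket"
    }
    val total = sorted.last().cumulativeCount
    if (total == 0.0) return Double.NaN
    val rank = q * total
    // First bucket whose cumulative count reaches the target rank.
    val idx = sorted.indexOfFirst { it.cumulativeCount >= rank }
    // If the rank falls into +Inf, the best answer is the highest finite bound.
    if (idx == sorted.lastIndex) return sorted[sorted.size - 2].le
    val lower = if (idx == 0) 0.0 else sorted[idx - 1].le
    val below = if (idx == 0) 0.0 else sorted[idx - 1].cumulativeCount
    val inBucket = sorted[idx].cumulativeCount - below
    // Assume observations are spread evenly inside the bucket.
    return lower + (sorted[idx].le - lower) * ((rank - below) / inBucket)
}
```

The estimate is only as precise as the bucket boundaries: a p99 that falls between the 100ms and 500ms boundaries is an interpolation across that whole range, and a 10ms target such as the decision evaluation one can only be verified if a bucket boundary sits at or below 10ms.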
When adding a feature:
- Counter/timer added for the operation
- Logs at INFO for state transitions
- Logs at DEBUG for flow detail
- No PII in logs
- Span created via @WithSpan or programmatic API
- Span attributes set (aggregate.type, aggregate.id, business context)
- Error path emits ERROR log with stacktrace
- Health check considers this feature if critical
- Integration test verifies metric is incremented
- If failure mode added: alert configured
- If alert added: runbook written
| Concern | Tool | Why |
|---|---|---|
| Metrics scrape | Prometheus | Industry standard, pull-based, multi-dimensional |
| Long-term metrics | Mimir / Cortex / VictoriaMetrics | Scale Prometheus to years |
| Logs aggregation | Loki | Cheap, label-based, integrates with Grafana |
| Traces | Tempo | Cheap, no external deps, integrates with Grafana |
| Visualization | Grafana | Single pane of glass for metrics + logs + traces |
| Alerting | Prometheus Alertmanager → PagerDuty/Slack | Standard, reliable |
| Profiling (continuous) | Pyroscope (optional, Y1 H2) | CPU/memory hotspots in production |
| Synthetic monitoring | k6 / Blackbox Exporter | External "is the API up" checks |
| Real User Monitoring | Sentry (optional) | Frontend errors when dashboard ships (Phase E v0.4) |
- Architecture-SLA-SLI-SLO - SLO definitions and error budgets
- Architecture-Resilience - what resilience metrics measure
- Architecture-Security - audit log distinction
- Runbook - operational procedures linked from alerts
- Incident-Response - when alerts fire