-
Notifications
You must be signed in to change notification settings - Fork 0
Architecture Resilience
Detailed treatment of resilience patterns: caching, sagas, circuit breakers, bulkheads, backpressure, graceful shutdown, connection pools, disaster recovery, chaos engineering hooks.
Companion to Architecture-Overview, Architecture-Event-Flow, Architecture-SLA-SLI-SLO.
flowchart LR
Request --> L1[L1: In-process<br/>Caffeine]
L1 -->|miss| L2[L2: Redis<br/>shared]
L2 -->|miss| Source[(Source of truth<br/>PostgreSQL / Keycloak / external)]
Source --> L2
L2 --> L1
L1 --> Response[Response]
L1 - Caffeine (in-JVM)
- Hot data, sub-millisecond access
- Per-pod (no network)
- Bounded size (eviction: LRU + TTL)
- Used for: parsed JWT validation cache, decision rule definitions, currency reference data
L2 - Redis (shared)
- Cross-pod consistency
- Network-bound (1-2ms typical)
- Used for: idempotency keys, rate-limit counters, JWKS, distributed locks (Redlock)
L3 - DB materialized views
- Source of truth for derived data
- Refreshed on writes (or scheduled)
- Used for:
account_balances(sum of entries)
| Data | Layer | TTL | Invalidation | Notes |
|---|---|---|---|---|
| JWKS (Keycloak public keys) | L1 + L2 | 10 min | On HTTP 401 from Keycloak | Two layers - even if Redis is down, L1 keeps signed validation working |
| Decoded JWT principal | L1 | 5 min (= JWT TTL) | Auto-expiry | Skip re-decode for repeat requests |
| Decision rules (active) | L1 + L2 | 5 min | Event-based on decision.rule.activated / decision.rule.deprecated
|
Stale rules = wrong decisions, so event invalidation |
| Idempotency response | L2 | 24 h | Auto-expire | Per-key TTL in Redis SET EX |
| Rate-limit counters | L2 | 1 min sliding window | Auto-expire | Sliding window log algorithm |
| Currency reference data (ISO 4217) | L1 | 1 day | Manual flush via admin endpoint | Static-ish, low frequency |
| Sanctions list snapshot | L2 | 1 hour | Event-based on provider webhook | Critical to be fresh |
| External provider responses (KYC, AML data) | L2 | provider-specific | Per-provider config | Some are 24h, some are 1m |
| Account balance | L3 (Postgres MV) | refreshed CONCURRENTLY after each post | - | NOT in Redis (consistency hazard) |
| Account metadata | NOT cached | - | - | Always read from DB (small table, indexed) |
| Transactions / entries | NOT cached | - | - | Append-only, queries are infrequent reads |
| Payment lifecycle state | NOT cached | - | - | Strict consistency required |
@Component
class CachedDecisionRuleRepository(
private val ruleRepo: DecisionRuleRepository,
private val cache: Cache<String, List<DecisionRule>>, // Caffeine
private val redis: ReactiveStringRedisTemplate,
) {
suspend fun getActiveRules(ruleSetId: String): List<DecisionRule> {
// L1
cache.getIfPresent(ruleSetId)?.let { return it }
// L2
val redisKey = "decision:rules:active:$ruleSetId"
redis.opsForValue().get(redisKey).awaitFirstOrNull()
?.let { json ->
val rules = deserialize(json)
cache.put(ruleSetId, rules)
return rules
}
// Source
val rules = ruleRepo.findByRuleSetIdAndStatus(ruleSetId, ACTIVE)
cache.put(ruleSetId, rules)
redis.opsForValue().set(redisKey, serialize(rules), Duration.ofMinutes(5)).awaitSingle()
return rules
}
@KafkaListener(topics = ["decision.events"])
fun onRuleChange(event: EventEnvelope) {
if (event.type in setOf("decision.rule.activated", "decision.rule.deprecated")) {
val ruleSetId = event.data["ruleSetId"] as String
cache.invalidate(ruleSetId)
redis.delete("decision:rules:active:$ruleSetId").subscribe()
}
}
}@Transactional
fun storeIdempotencyResponse(key: IdempotencyKey, hash: String, response: Response) {
// DB first (source of truth) - atomic with business state
idempotencyRepo.save(IdempotencyRecord(key, hash, response, expiresAt = now + 24h))
// Redis cache update (best-effort, eventually consistent)
redis.opsForValue().set(
"idem:${key.value}",
serialize(IdempotencyCacheEntry(hash, response)),
Duration.ofHours(24),
).subscribe()
}If Redis write fails - log, continue. Next read falls through to DB. No correctness impact.
For hot keys (e.g., active sanctions list), simultaneous misses can hammer the source. Mitigation:
// Single-flight pattern via Caffeine
val cache: AsyncLoadingCache<String, SanctionsSnapshot> = Caffeine.newBuilder()
.expireAfterWrite(Duration.ofHours(1))
.buildAsync { key, _ -> loadSanctionsFromProvider(key) }
// Multiple concurrent requests for the same key share a single load
suspend fun getSanctions(): SanctionsSnapshot = cache.get("global").await()For Redis L2 stampede: use SETNX with short TTL as advisory lock during refresh.
Every cache read is wrapped:
suspend fun fromCacheOrFallback(key: String, fallback: suspend () -> T): T =
try { cache.get(key) ?: fallback() }
catch (e: Exception) { log.warn("cache miss/error: ${e.message}"); fallback() }If Redis is unreachable, services degrade gracefully - slower (DB-only) but correct. Health probe doesn't fail just because Redis is down (services are still useful).
- No Caching
Accountentity by ID - even with short TTL, balance staleness is a fintech bug. - No Caching API responses globally - RFC 7807 problem details leak
correlationIdwhich is per-request. - No Caching across tenants without explicit tenant key.
- No Storing raw money amounts as strings in Redis (use serialized
Moneywith explicit currency, never trust the cache to preserve scale).
In modular monolith mode, all aggregates share one DB transaction when needed. Payment + Ledger writes commit atomically. This is correctness-by-default - no compensating logic, no eventual consistency surprises.
Adding sagas now would be premature complexity.
After service extraction (Y1 H2 onward), Payment Service and Ledger Service no longer share a DB transaction. The cross-service operations that need sagas:
| Operation | Steps | Compensation needed if step N fails |
|---|---|---|
| Initiate Payment | (1) Payment.create → (2) Decision.evaluate → (3) Ledger.postTransaction → (4) Payment.markProcessing → (5) BankAdapter.send | Reverse step 3 if step 5 fails permanently |
| Resolve Compliance Case (REJECT) | (1) Case.resolve → (2) Payment.markRejected → (3) Ledger.reverseTransaction | Re-open case if step 3 fails |
| External Refund | (1) BankAdapter.refund → (2) Ledger.postReversal → (3) Payment.markRefunded | Mark "manual review" if step 2 fails (rare, requires operator) |
| KYC-gated Account Activation | (1) KYC.approve → (2) User.markVerified → (3) Account.activate | Roll back step 2 if step 3 fails |
We adopt orchestrated sagas (vs choreographed) for clarity and operability:
-
Orchestrated - a central
SagaCoordinatorper saga type, explicit state machine, all steps and compensations in one place. - Choreographed - services react to events, no central state. Simpler at small scale, harder to debug at scale, harder to operate.
For fintech, orchestrated wins because:
- Operators need a clear "what state is this saga in?" answer
- Compensation logic is explicit and testable
- Saga history is easy to query (single state table)
@Entity
@Table(name = "sagas")
class Saga(
@Id val id: SagaId,
val sagaType: String, // e.g. "payment.initiate"
val correlationId: UUID,
var status: SagaStatus,
@Type(JsonBinaryType::class)
val context: JsonObject, // saga-specific state (paymentId, accountIds, etc.)
var currentStep: Int,
val totalSteps: Int,
val createdAt: Instant,
var updatedAt: Instant,
val completedAt: Instant?,
@Version var version: Long = 0,
@OneToMany(mappedBy = "saga", cascade = [PERSIST])
val executions: MutableList<SagaStepExecution>,
)
@Entity
@Table(name = "saga_step_executions")
class SagaStepExecution(
@Id val id: UUID,
@ManyToOne val saga: Saga,
val stepNumber: Int,
val stepName: String,
var status: SagaStepStatus, // PENDING, IN_PROGRESS, SUCCEEDED, FAILED, COMPENSATED
val request: JsonObject,
var response: JsonObject?,
var error: String?,
val attempts: Int = 0,
val maxAttempts: Int = 5,
val startedAt: Instant,
var completedAt: Instant?,
)
enum class SagaStatus { RUNNING, SUCCEEDED, COMPENSATING, FAILED, REQUIRES_MANUAL_INTERVENTION }
enum class SagaStepStatus { PENDING, IN_PROGRESS, SUCCEEDED, FAILED, COMPENSATED }class InitiatePaymentSaga(
private val coordinator: SagaCoordinator,
private val paymentClient: PaymentServiceClient,
private val decisionClient: DecisionEngineClient,
private val ledgerClient: LedgerServiceClient,
private val bankAdapter: BankAdapter,
) {
val steps: List<SagaStep<*, *>> = listOf(
SagaStep("create-payment",
forward = { ctx -> paymentClient.create(ctx.cmd) },
compensate = { ctx -> paymentClient.markCancelled(ctx.paymentId) }),
SagaStep("evaluate-decision",
forward = { ctx -> decisionClient.evaluate(ctx.toContext()) },
compensate = { /* no compensation - pure read */ }),
SagaStep("post-ledger-transaction",
forward = { ctx -> ledgerClient.post(ctx.toLedgerCmd()) },
compensate = { ctx -> ledgerClient.reverse(ctx.ledgerTxnId, "saga compensation") }),
SagaStep("mark-payment-processing",
forward = { ctx -> paymentClient.markProcessing(ctx.paymentId) },
compensate = { ctx -> paymentClient.markFailed(ctx.paymentId) }),
SagaStep("send-to-bank",
forward = { ctx -> bankAdapter.send(ctx.toBankCmd()) },
compensate = { /* terminal: bank ack received, can't compensate */
throw NonCompensableException("bank already accepted")
}),
)
suspend fun execute(cmd: InitiatePaymentCommand): Saga = coordinator.run(this, cmd)
}| Scenario | Handling |
|---|---|
| Step succeeds, app crashes before recording | Saga is replayed on recovery - step's idempotencyKey ensures no duplicate effect |
| Step fails transiently | Retry up to maxAttempts per step (exponential backoff) |
| Step fails permanently, compensation succeeds | Saga ends in FAILED state, original command rolled back |
| Step fails permanently, compensation fails | Saga ends in REQUIRES_MANUAL_INTERVENTION - alert fires, operator console shows details |
| Compensation impossible (e.g., money already left bank) | Saga ends in REQUIRES_MANUAL_INTERVENTION - explicit operator workflow |
Every step's forward and compensate action receives an idempotencyKey derived from saga context:
-
forward("create-payment")usespayment-create-${sagaId} -
compensate("post-ledger-transaction")usesledger-reverse-${sagaId}-${stepN}
Replays are safe - no double-effect.
- Saga state queryable via
GET /v1/admin/sagas/{id}(admin role) - Each step execution emits structured log + metric (
saga.step.<name>.duration) -
REQUIRES_MANUAL_INTERVENTIONstate triggers PagerDuty/Slack alert - Saga timeline rendered in operator UI (Phase E)
Considered. Rejected because:
- Heavyweight (extra dep, XML config historically)
- Doesn't fit naturally into a "saga is a sequence of remote calls" model
- Custom coordinator is ~300 lines of Kotlin and we own it
Kept option: if community contributes Spring State Machine integration for sagas, we accept it as optional dependency.
For every external call (bank, KYC, LLM, sanctions, payment provider), we wrap with a circuit breaker via Resilience4j.
resilience4j:
circuitbreaker:
instances:
bank-adapter:
failure-rate-threshold: 50 # %
slow-call-rate-threshold: 60
slow-call-duration-threshold: 5s
permitted-number-of-calls-in-half-open-state: 3
sliding-window-size: 100
sliding-window-type: COUNT_BASED
wait-duration-in-open-state: 30s
minimum-number-of-calls: 20
record-exceptions:
- java.io.IOException
- org.springframework.web.client.HttpServerErrorException
ignore-exceptions:
- com.fincore.payments.exception.PaymentValidationException
kyc-provider:
failure-rate-threshold: 50
sliding-window-size: 50
wait-duration-in-open-state: 60s
llm-provider:
failure-rate-threshold: 40 # tighter - LLM expensive
slow-call-duration-threshold: 30s # LLMs are slow by nature
wait-duration-in-open-state: 120s@Component
class BankAdapterAdapter(
private val httpClient: BankProviderHttpClient,
private val circuitBreakerRegistry: CircuitBreakerRegistry,
) : BankProvider {
private val cb = circuitBreakerRegistry.circuitBreaker("bank-adapter")
override suspend fun send(payment: Payment): BankResponse =
cb.executeSuspendFunction { httpClient.post(payment) }
}- New requests fail-fast with
CircuitBreakerOpenException→ mapped to503 Service UnavailablewithRetry-After: 30 - Inflight payments enter retry topic with backoff
- Health probe reflects circuit state: if all critical providers' circuits are OPEN, readiness probe fails (load balancer drains)
After wait duration, circuit allows up to N probe calls. Successful probes close the circuit. Failed probes re-open it.
-
resilience4j.circuitbreaker.state{name=bank-adapter}- 0=CLOSED, 1=OPEN, 2=HALF_OPEN - Alert: circuit OPEN for > 2 minutes
- Alert: failure rate > 30% for > 5 minutes
We isolate thread pools so that one slow dependency doesn't starve others.
- One thread pool per external dependency
- Separate connection pools per workload type (read-heavy vs write-heavy)
- Kafka consumer threads isolated from REST handlers
@Configuration
class BulkheadConfig {
// For bank adapter calls (potentially slow)
@Bean("bankAdapterExecutor")
fun bankAdapterExecutor(): Executor =
Executors.newFixedThreadPool(20, ThreadFactoryBuilder().setNameFormat("bank-adapter-%d").build())
// For KYC provider calls (slow, but lower volume)
@Bean("kycProviderExecutor")
fun kycProviderExecutor(): Executor =
Executors.newFixedThreadPool(10, ThreadFactoryBuilder().setNameFormat("kyc-%d").build())
// For LLM calls (very slow)
@Bean("llmExecutor")
fun llmExecutor(): Executor =
Executors.newFixedThreadPool(5, ThreadFactoryBuilder().setNameFormat("llm-%d").build())
}@Async("bankAdapterExecutor")
suspend fun sendToBank(payment: Payment): BankResponse = ...Virtual threads (JDK 21) are great for high-volume IO workloads. But:
- They share a single ForkJoinPool - no isolation between dependencies
- A slow LLM call could pin many virtual threads, starving banks
- Bulkheads with platform threads give isolation per dependency
Strategy: virtual threads for REST handlers (high volume, mixed IO), platform threads with bulkheads for external dependencies (isolation needed).
spring:
threads:
virtual:
enabled: true # for Tomcat / Spring MVCWhen a consumer can't keep up with topic throughput, naive consumption leads to OOM or rebalance storms.
Pause/resume on lag
@Component
class BackpressureAwareConsumer(
private val kafkaContainer: ConcurrentMessageListenerContainer<*, *>,
) {
@Scheduled(fixedDelay = 5_000)
fun checkLag() {
val lagMetric = meterRegistry.find("kafka.consumer.lag").gauge()?.value() ?: 0.0
when {
lagMetric > 10_000 -> {
log.warn("Backpressure: consumer lag $lagMetric - pausing")
kafkaContainer.pause()
}
lagMetric < 1_000 -> kafkaContainer.resume()
}
}
}Bounded in-flight processing
spring:
kafka:
listener:
concurrency: 4 # 4 concurrent partitions
poll-timeout: 1s
ack-mode: MANUAL_IMMEDIATE
consumer:
max-poll-records: 50 # bounded batch
max-poll-interval-ms: 300000 # 5min - must finish batch
fetch-min-bytes: 1024Outbox dispatcher rate-limited writes
- Limit Kafka publish rate to broker capacity (avoid producer queue overflow)
- Per-pod token bucket: 1000 events/sec by default
When a pod receives SIGTERM (K8s rolling update, scale-down), it must:
- Stop accepting new requests
- Complete in-flight requests (up to deadline)
- Drain Kafka consumers (stop polling, finish current batch)
- Flush outbox dispatcher (don't lose unpublished events)
- Close DB connections cleanly
- Exit
server:
shutdown: graceful
netty:
connection-timeout: 30s
spring:
lifecycle:
timeout-per-shutdown-phase: 30s@Component
class OutboxDispatcherShutdown(
private val dispatcher: OutboxDispatcher,
) : SmartLifecycle {
override fun stop(callback: Runnable) {
log.info("Draining outbox dispatcher")
dispatcher.shutdown(timeout = Duration.ofSeconds(20))
callback.run()
}
override fun isRunning() = dispatcher.isRunning()
override fun getPhase() = SmartLifecycle.DEFAULT_PHASE - 100 // shut down before web layer
}spec:
terminationGracePeriodSeconds: 60
containers:
- name: ledger
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 5"] # let load balancer drainFor read-heavy services (Ledger queries), separate pools:
spring:
datasource:
primary: # for writes
url: jdbc:postgresql://primary:5432/fincore
hikari:
maximum-pool-size: 20
minimum-idle: 5
connection-timeout: 5000
max-lifetime: 1800000 # 30 min
idle-timeout: 600000 # 10 min
validation-timeout: 5000
leak-detection-threshold: 10000
read-replica: # for reads (scale-out)
url: jdbc:postgresql://replica:5432/fincore
hikari:
maximum-pool-size: 50 # more for reads
minimum-idle: 10
# ... same as above- Total pool size ≤ Postgres max_connections / N services - avoid exhaustion
- Per service: 20-50 for primary writes, 50-100 for read replicas
-
leak-detection-threshold: 10s- log unreturned connections - Avoid connections in transactions across multiple HTTP calls - one connection per request
| Layer | Mechanism | Frequency | Retention |
|---|---|---|---|
| Postgres data | Logical replication to standby + WAL archiving to S3 | Continuous (WAL) + nightly snapshot | 35 days PITR window |
| Postgres schema | Liquibase changelog versioned in git | On every deploy | Forever (git history) |
| Decision rules | DB backup + git export weekly | Weekly | 1 year |
| Kafka topics | Topic compaction + tiered storage to S3 | Real-time | 7 days hot, 90 days cold |
| Object storage (KYC docs metadata refs) | Provider-side replication | Continuous | Per provider SLA |
Point-in-time recovery (PITR):
# Restore Postgres to a specific second
pg_basebackup --pgdata=/var/lib/postgresql/restore --xlog-method=stream
recovery.conf:
restore_command = 'aws s3 cp s3://fincore-wal/%f %p'
recovery_target_time = '2026-04-25 14:32:00 UTC'Region failover (out of OSS scope, on roadmap):
- Multi-region active-passive via Postgres logical replication
- Kafka mirror maker 2.0 to standby region
- DNS-level traffic shift with health checks
- RPO (recovery point objective): ≤ 1 second (synchronous replication for primary, WAL streaming for archive)
- RTO (recovery time objective): ≤ 30 minutes for PITR; ≤ 5 minutes for replica promotion
Mandatory quarterly drills:
- Promote standby → verify writes on new primary
- Restore from WAL archive to point-in-time → verify ledger invariants hold
- Replay outbox events from snapshot → verify no event loss
- Simulate Kafka broker outage → verify outbox accumulation works, no data loss
Documented in runbooks/disaster-recovery-drill.md.
Not in v0.1. Mentioned for completeness.
- Primary region serves all writes
- Secondary region is read-only standby
- Failover via DNS + Postgres replication promote
- RTO 5 min, RPO 1 sec
- Per-account region affinity (account "lives" in one region for writes)
- Cross-region payments use distributed sagas
- Conflict resolution at outbox level (later-write-wins on metadata; ledger never conflicts because account is region-pinned)
- Requires architectural ADR before adoption
- Reads served from local-region replica
- Eventual consistency tolerated for read-only paths (balance lookups)
- For strict consistency reads (post-payment confirmation), use primary region directly
A Tier-2 killer feature (chaos engineering hooks for dev/staging fault injection).
Optional sub-project for inducing failures in dev/staging:
fincore:
chaos:
enabled: ${FINCORE_CHAOS_ENABLED:false}
bank-adapter:
timeout-rate: 0.05 # 5% of calls timeout randomly
partial-success-rate: 0.01 # 1% return partial success
duplicate-callback-rate: 0.02 # 2% trigger duplicate webhook
db:
slow-query-rate: 0.001 # 0.1% queries delayed by 5s
deadlock-rate: 0.0001
kafka:
duplicate-message-rate: 0.05
lost-message-rate: 0.001Test suite runs end-to-end with chaos enabled, asserts:
- Ledger SUM=0 invariant holds
- No duplicate payments (idempotency works)
- Outbox dispatch eventually succeeds (no event loss)
- Saga compensation correctly unwinds
Failures here are release blockers.
All resilience patterns emit metrics. Central dashboard:
| Metric | Source | Alert threshold |
|---|---|---|
resilience4j.circuitbreaker.state |
each external dep | OPEN > 2 min |
resilience4j.circuitbreaker.failure.rate |
each external dep | > 30% for 5 min |
kafka.consumer.lag |
each consumer group | > 10000 for 5 min |
outbox.events.pending |
outbox dispatcher | > 1000 for 5 min |
outbox.events.failed |
outbox dispatcher | > 0 (any failure) |
saga.requires_manual_intervention.count |
saga coordinator | > 0 (any) |
cache.hit.rate{cache=...} |
each cache | < 80% for 1 hour |
hikari.connections.active.usage |
each pool | > 80% for 5 min |
payment.retry.scheduled.depth |
payment retry job | > 1000 |
Full dashboard JSON in deploy/grafana/dashboards/resilience.json.
Before tagging v0.x.0:
- All external dependencies wrapped in circuit breakers
- Bulkhead executors configured per dependency
- Graceful shutdown verified - drain test with outbox + consumers
- Connection pool sized appropriately
- Backup strategy in place, RPO < 1 sec verified
- DR drill executed in last quarter
- Chaos test suite passing (in chaos profile)
- Saga state recoverable from crash mid-execution
- Cache invalidation paths tested for staleness
- Health probes accurate (don't fail just from Redis miss)
- Rate limiting tuned per endpoint
- Resilience metrics on Grafana dashboard
- Runbook entries for each "REQUIRES_MANUAL_INTERVENTION" scenario
- No Multi-region active-active (v1.5+ with explicit ADR)
- No Full Saga implementation (waiting for service extraction)
- No Distributed locks across DB+Kafka (not needed in modular monolith)
- No Database sharding (single Postgres scales to ~10M accounts; partitioning earlier)
- No Eventual consistency at API boundaries (we offer strict)
- No Cross-data-center replication of caches (Redis L2 is per-region)
These show up in the roadmap when actual demand emerges. Resilience is a journey, not a destination.
- Overview
- Services
- Data Model
- Domain Model
- Event Flow
- Security
- Observability
- Resilience
- SLA / SLI / SLO