Architecture Resilience

Architecture - Resilience, Caching, Sagas

Detailed treatment of resilience patterns: caching, sagas, circuit breakers, bulkheads, backpressure, graceful shutdown, connection pools, disaster recovery, chaos engineering hooks.

Companion to Architecture-Overview, Architecture-Event-Flow, Architecture-SLA-SLI-SLO.

1. Caching strategy

1.1. Cache levels

flowchart LR
    Request --> L1[L1: In-process<br/>Caffeine]
    L1 -->|miss| L2[L2: Redis<br/>shared]
    L2 -->|miss| Source[(Source of truth<br/>PostgreSQL / Keycloak / external)]
    Source --> L2
    L2 --> L1
    L1 --> Response[Response]

L1 - Caffeine (in-JVM)

Hot data, sub-millisecond access
Per-pod (no network)
Bounded size (eviction: size-based via Caffeine's Window TinyLFU policy + TTL)
Used for: parsed JWT validation cache, decision rule definitions, currency reference data

L2 - Redis (shared)

Cross-pod consistency
Network-bound (1-2ms typical)
Used for: idempotency keys, rate-limit counters, JWKS, distributed locks (Redlock)

L3 - DB materialized views

Source of truth for derived data
Refreshed on writes (or scheduled)
Used for: account_balances (sum of entries)

1.2. What we cache (and why)

Data	Layer	TTL	Invalidation	Notes
JWKS (Keycloak public keys)	L1 + L2	10 min	On HTTP 401 from Keycloak	Two layers - even if Redis is down, L1 keeps signature validation working
Decoded JWT principal	L1	5 min (= JWT TTL)	Auto-expiry	Skip re-decode for repeat requests
Decision rules (active)	L1 + L2	5 min	Event-based on `decision.rule.activated` / `decision.rule.deprecated`	Stale rules = wrong decisions, so event invalidation
Idempotency response	L2	24 h	Auto-expire	Per-key TTL in Redis SET EX
Rate-limit counters	L2	1 min sliding window	Auto-expire	Sliding window log algorithm
Currency reference data (ISO 4217)	L1	1 day	Manual flush via admin endpoint	Static-ish, low frequency
Sanctions list snapshot	L2	1 hour	Event-based on provider webhook	Critical to be fresh
External provider responses (KYC, AML data)	L2	provider-specific	Per-provider config	Some are 24h, some are 1m
Account balance	L3 (Postgres MV)	refreshed CONCURRENTLY after each post	-	NOT in Redis (consistency hazard)
Account metadata	NOT cached	-	-	Always read from DB (small table, indexed)
Transactions / entries	NOT cached	-	-	Append-only; reads are infrequent
Payment lifecycle state	NOT cached	-	-	Strict consistency required

1.3. Pattern: cache-aside (default for reads)

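import com.github.benmanes.caffeine.cache.Cache
import kotlinx.coroutines.reactor.awaitSingle
import kotlinx.coroutines.reactor.awaitSingleOrNull
import org.springframework.data.redis.core.ReactiveStringRedisTemplate
import org.springframework.kafka.annotation.KafkaListener
import org.springframework.stereotype.Component
import java.time.Duration

@Component
class CachedDecisionRuleRepository(
    private val ruleRepo: DecisionRuleRepository,
    private val cache: Cache<String, List<DecisionRule>>,  // Caffeine
    private val redis: ReactiveStringRedisTemplate,
) {
    suspend fun getActiveRules(ruleSetId: String): List<DecisionRule> {
        // L1: in-process Caffeine
        cache.getIfPresent(ruleSetId)?.let { return it }

        // L2: shared Redis; a hit is copied into L1
        val redisKey = redisKey(ruleSetId)
        redis.opsForValue().get(redisKey).awaitSingleOrNull()
            ?.let { json ->
                val rules = deserialize(json)
                cache.put(ruleSetId, rules)
                return rules
            }

        // Source of truth; populate both cache levels on the way back
        val rules = ruleRepo.findByRuleSetIdAndStatus(ruleSetId, ACTIVE)
        cache.put(ruleSetId, rules)
        redis.opsForValue().set(redisKey, serialize(rules), Duration.ofMinutes(5)).awaitSingle()
        return rules
    }

    // Each pod has to see every event to clear its own L1, so this listener is assumed
    // to run in a consumer group that is unique per pod (group id configured elsewhere).
    @KafkaListener(topics = ["decision.events"])
    fun onRuleChange(event: EventEnvelope) {
        if (event.type in setOf("decision.rule.activated", "decision.rule.deprecated")) {
            // Ignore malformed events instead of failing the listener on a bad cast
            val ruleSetId = event.data["ruleSetId"] as? String ?: return
            cache.invalidate(ruleSetId)
            redis.delete(redisKey(ruleSetId)).subscribe()
        }
    }

    private fun redisKey(ruleSetId: String) = "decision:rules:active:$ruleSetId"
}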

1.4. Pattern: write-through (idempotency keys)

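@Transactional
fun storeIdempotencyResponse(key: IdempotencyKey, hash: String, response: Response) {
    // DB first (source of truth) - atomic with business state
    val expiresAt = Instant.now().plus(Duration.ofHours(24))
    idempotencyRepo.save(IdempotencyRecord(key, hash, response, expiresAt = expiresAt))

    // Redis cache update (best-effort, eventually consistent).
    // Caveat: this fires before the DB transaction commits, so a rollback can leave
    // a cache entry without a DB row; moving it to an after-commit hook avoids that.
    redis.opsForValue().set(
        "idem:${key.value}",
        serialize(IdempotencyCacheEntry(hash, response)),
        Duration.ofHours(24),
    ).subscribe({}, { e -> log.warn("idempotency cache write failed for ${key.value}: ${e.message}") })
}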

If the Redis write fails, log and continue. The next read falls through to the DB, so a failed cache write has no correctness impact.

1.5. Cache stampede protection

For hot keys (e.g., active sanctions list), simultaneous misses can hammer the source. Mitigation:

// Single-flight pattern via Caffeine
val cache: AsyncLoadingCache<String, SanctionsSnapshot> = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofHours(1))
    .buildAsync { key, _ -> loadSanctionsFromProvider(key) }

// Multiple concurrent requests for the same key share a single load
suspend fun getSanctions(): SanctionsSnapshot = cache.get("global").await()

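For Redis L2 stampede: use `SET key value NX PX <ttl>` (set-if-absent with a short expiry, in one atomic command) as an advisory lock during refresh. Plain SETNX takes no TTL, so a crashed holder would never release the lock.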
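
To make the single-flight idea concrete without Caffeine, here is a minimal in-process sketch. `SingleFlight` and its `load` method are hypothetical names, not part of the codebase; the sketch only shows how concurrent callers for the same key can share one pending load.

```kotlin
import java.util.concurrent.CompletableFuture
import java.util.concurrent.ConcurrentHashMap

// Hypothetical helper: coalesces concurrent loads of the same key into one call.
class SingleFlight<K : Any, V> {
    private val inFlight = ConcurrentHashMap<K, CompletableFuture<V>>()

    // Returns the pending future for `key` if a load is running; otherwise starts `loader` once.
    fun load(key: K, loader: () -> CompletableFuture<V>): CompletableFuture<V> {
        val created = CompletableFuture<V>()
        val existing = inFlight.putIfAbsent(key, created)
        if (existing != null) return existing          // another caller is already loading
        try {
            loader().whenComplete { value, error ->
                inFlight.remove(key, created)          // the next miss triggers a fresh load
                if (error != null) created.completeExceptionally(error) else created.complete(value)
            }
        } catch (e: Exception) {
            inFlight.remove(key, created)
            created.completeExceptionally(e)
        }
        return created
    }
}
```

Two callers that ask for `"global"` while the first load is still pending receive the same future, so the provider is hit once. This only deduplicates within one pod; the Redis advisory lock covers the cross-pod case.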

1.6. Cache as optional dependency

Every cache read is wrapped:

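suspend fun <T : Any> fromCacheOrFallback(key: String, fallback: suspend () -> T): T {
    val cached: T? = try { cache.get(key) }
        catch (e: CancellationException) { throw e }  // never swallow coroutine cancellation
        catch (e: Exception) { log.warn("cache error for key=$key: ${e.message}"); null }
    return cached ?: fallback()  // fallback runs once, outside the try, so its own errors propagate
}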

If Redis is unreachable, services degrade gracefully - slower (DB-only) but correct. Health probe doesn't fail just because Redis is down (services are still useful).

1.7. Anti-patterns (forbidden)

No caching of the Account entity by ID - even with a short TTL, a stale balance is a fintech bug.
No caching of API responses globally - RFC 7807 problem details carry a per-request correlationId, which would leak between requests.
No caching across tenants without an explicit tenant key.
No storing of raw money amounts as strings in Redis - use a serialized Money with explicit currency, and never trust the cache to preserve scale.
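
The last two rules can be sketched as follows. The `Money` type, the key layout, and the wire format are all assumptions made for illustration; the real classes in the codebase may look different.

```kotlin
import java.math.BigDecimal

// Hypothetical value type; the real Money class may differ.
data class Money(val amount: BigDecimal, val currency: String)

// Cache keys always carry the tenant, so entries cannot be shared across tenants by accident.
fun tenantCacheKey(tenantId: String, namespace: String, id: String): String {
    require(tenantId.isNotBlank()) { "tenantId is required for every cache key" }
    return "t:$tenantId:$namespace:$id"
}

// Serializes as "<currency>|<unscaledValue>|<scale>" so the scale survives a round trip.
fun serializeMoney(m: Money): String =
    "${m.currency}|${m.amount.unscaledValue()}|${m.amount.scale()}"

fun deserializeMoney(s: String): Money {
    val parts = s.split("|")
    require(parts.size == 3) { "malformed money value: $s" }
    return Money(BigDecimal(parts[1].toBigInteger(), parts[2].toInt()), parts[0])
}
```

Storing the unscaled value and the scale separately means `10.50 EUR` comes back as `10.50`, not `10.5`, regardless of how an intermediate layer formats numbers.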

2. Saga pattern

2.1. Why we don't use sagas in v0.1

In modular monolith mode, all aggregates share one DB transaction when needed. Payment + Ledger writes commit atomically. This is correctness-by-default - no compensating logic, no eventual consistency surprises.

Adding sagas now would be premature complexity.

2.2. When sagas become necessary

After service extraction (Y1 H2 onward), Payment Service and Ledger Service no longer share a DB transaction. The cross-service operations that need sagas:

Operation	Steps	Compensation needed if step N fails
Initiate Payment	(1) Payment.create → (2) Decision.evaluate → (3) Ledger.postTransaction → (4) Payment.markProcessing → (5) BankAdapter.send	Reverse step 3 if step 5 fails permanently
Resolve Compliance Case (REJECT)	(1) Case.resolve → (2) Payment.markRejected → (3) Ledger.reverseTransaction	Re-open case if step 3 fails
External Refund	(1) BankAdapter.refund → (2) Ledger.postReversal → (3) Payment.markRefunded	Mark "manual review" if step 2 fails (rare, requires operator)
KYC-gated Account Activation	(1) KYC.approve → (2) User.markVerified → (3) Account.activate	Roll back step 2 if step 3 fails

2.3. Pattern choice: Orchestrated Saga

We adopt orchestrated sagas (vs choreographed) for clarity and operability:

Orchestrated - a central SagaCoordinator per saga type, explicit state machine, all steps and compensations in one place.
Choreographed - services react to events, no central state. Simpler at small scale, harder to debug at scale, harder to operate.

For fintech, orchestrated wins because:

Operators need a clear "what state is this saga in?" answer
Compensation logic is explicit and testable
Saga history is easy to query (single state table)

2.4. Saga state model

@Entity
@Table(name = "sagas")
class Saga(
    @Id val id: SagaId,
    val sagaType: String,                  // e.g. "payment.initiate"
    val correlationId: UUID,
    var status: SagaStatus,
    @Type(JsonBinaryType::class)
    val context: JsonObject,               // saga-specific state (paymentId, accountIds, etc.)
    var currentStep: Int,
    val totalSteps: Int,
    val createdAt: Instant,
    var updatedAt: Instant,
    var completedAt: Instant?,             // set when the saga reaches a terminal state
    @Version var version: Long = 0,
    @OneToMany(mappedBy = "saga", cascade = [PERSIST])
    val executions: MutableList<SagaStepExecution>,
)

@Entity
@Table(name = "saga_step_executions")
class SagaStepExecution(
    @Id val id: UUID,
    @ManyToOne val saga: Saga,
    val stepNumber: Int,
    val stepName: String,
    var status: SagaStepStatus,            // PENDING, IN_PROGRESS, SUCCEEDED, FAILED, COMPENSATED
    val request: JsonObject,
    var response: JsonObject?,
    var error: String?,
    var attempts: Int = 0,                 // incremented on every retry
    val maxAttempts: Int = 5,
    val startedAt: Instant,
    var completedAt: Instant?,
)

enum class SagaStatus { RUNNING, SUCCEEDED, COMPENSATING, FAILED, REQUIRES_MANUAL_INTERVENTION }
enum class SagaStepStatus { PENDING, IN_PROGRESS, SUCCEEDED, FAILED, COMPENSATED }

2.5. Coordinator example: Initiate Payment Saga

class InitiatePaymentSaga(
    private val coordinator: SagaCoordinator,
    private val paymentClient: PaymentServiceClient,
    private val decisionClient: DecisionEngineClient,
    private val ledgerClient: LedgerServiceClient,
    private val bankAdapter: BankAdapter,
) {

    val steps: List<SagaStep<*, *>> = listOf(
        SagaStep("create-payment",
            forward = { ctx -> paymentClient.create(ctx.cmd) },
            compensate = { ctx -> paymentClient.markCancelled(ctx.paymentId) }),

        SagaStep("evaluate-decision",
            forward = { ctx -> decisionClient.evaluate(ctx.toContext()) },
            compensate = { /* no compensation - pure read */ }),

        SagaStep("post-ledger-transaction",
            forward = { ctx -> ledgerClient.post(ctx.toLedgerCmd()) },
            compensate = { ctx -> ledgerClient.reverse(ctx.ledgerTxnId, "saga compensation") }),

        SagaStep("mark-payment-processing",
            forward = { ctx -> paymentClient.markProcessing(ctx.paymentId) },
            compensate = { ctx -> paymentClient.markFailed(ctx.paymentId) }),

        SagaStep("send-to-bank",
            forward = { ctx -> bankAdapter.send(ctx.toBankCmd()) },
            compensate = { /* terminal: bank ack received, can't compensate */
                throw NonCompensableException("bank already accepted")
            }),
    )

    suspend fun execute(cmd: InitiatePaymentCommand): Saga = coordinator.run(this, cmd)
}

2.6. Failure modes

Scenario	Handling
Step succeeds, app crashes before recording	Saga is replayed on recovery - step's `idempotencyKey` ensures no duplicate effect
Step fails transiently	Retry up to `maxAttempts` per step (exponential backoff)
Step fails permanently, compensation succeeds	Saga ends in `FAILED` state, original command rolled back
Step fails permanently, compensation fails	Saga ends in `REQUIRES_MANUAL_INTERVENTION` - alert fires, operator console shows details
Compensation impossible (e.g., money already left bank)	Saga ends in `REQUIRES_MANUAL_INTERVENTION` - explicit operator workflow
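
The table above can be read as a small control loop. The sketch below is an in-memory illustration under simplifying assumptions (synchronous steps, no persistence, no backoff); `Step`, `Outcome`, and `runSaga` are hypothetical names and do not match the real SagaCoordinator.

```kotlin
// Hypothetical, in-memory sketch of the orchestration loop.
class Step(
    val name: String,
    val forward: () -> Unit,
    val compensate: () -> Unit = {},
    val maxAttempts: Int = 3,
)

enum class Outcome { SUCCEEDED, FAILED, REQUIRES_MANUAL_INTERVENTION }

fun runSaga(steps: List<Step>): Outcome {
    val completed = mutableListOf<Step>()
    for (step in steps) {
        var attempt = 0
        var done = false
        // Transient failures: retry up to maxAttempts (backoff omitted for brevity).
        while (!done && attempt < step.maxAttempts) {
            attempt++
            try { step.forward(); done = true } catch (e: Exception) { /* retry */ }
        }
        if (done) { completed.add(step); continue }
        // Permanent failure: undo the completed steps in reverse order.
        while (completed.isNotEmpty()) {
            val undo = completed.removeAt(completed.lastIndex)
            try { undo.compensate() } catch (e: Exception) {
                return Outcome.REQUIRES_MANUAL_INTERVENTION   // compensation failed or is impossible
            }
        }
        return Outcome.FAILED
    }
    return Outcome.SUCCEEDED
}
```

A real coordinator would persist each transition before acting on it, so that a crash between "step succeeded" and "state recorded" leads to a replay rather than a lost step.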

2.7. Idempotency in steps

Every step's forward and compensate action receives an idempotencyKey derived from saga context:

forward("create-payment") uses payment-create-${sagaId}
compensate("post-ledger-transaction") uses ledger-reverse-${sagaId}-${stepN}

Replays are safe - no double-effect.
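
A minimal sketch of why replays are safe, assuming the receiving service remembers the result per key. `stepIdempotencyKey` and `IdempotentExecutor` are hypothetical names; a real service would keep the results in its database rather than in memory.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Builds keys in the shape shown above, e.g. "payment-create-<sagaId>".
fun stepIdempotencyKey(action: String, sagaId: String, stepNumber: Int? = null): String =
    if (stepNumber == null) "$action-$sagaId" else "$action-$sagaId-$stepNumber"

// Hypothetical receiver-side store: the first call per key runs the action,
// replays with the same key return the stored result without running it again.
class IdempotentExecutor<R : Any> {
    private val results = ConcurrentHashMap<String, R>()

    fun execute(key: String, action: () -> R): R =
        results.computeIfAbsent(key) { action() }
}
```

If the coordinator crashes after a step succeeded but before it recorded the result, the replayed call carries the same key and gets the original response back.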

2.8. Saga observability

Saga state queryable via GET /v1/admin/sagas/{id} (admin role)
Each step execution emits structured log + metric (saga.step.<name>.duration)
REQUIRES_MANUAL_INTERVENTION state triggers PagerDuty/Slack alert
Saga timeline rendered in operator UI (Phase E)

2.9. Why not Spring State Machine?

Considered. Rejected because:

Heavyweight (extra dep, XML config historically)
Doesn't fit naturally into a "saga is a sequence of remote calls" model
Custom coordinator is ~300 lines of Kotlin and we own it

Open option: if the community contributes a Spring State Machine integration for sagas, we will accept it as an optional dependency.

3. Circuit breakers

For every external call (bank, KYC, LLM, sanctions, payment provider), we wrap with a circuit breaker via Resilience4j.

3.1. Configuration

resilience4j:
  circuitbreaker:
    instances:
      bank-adapter:
        failure-rate-threshold: 50              # %
        slow-call-rate-threshold: 60
        slow-call-duration-threshold: 5s
        permitted-number-of-calls-in-half-open-state: 3
        sliding-window-size: 100
        sliding-window-type: COUNT_BASED
        wait-duration-in-open-state: 30s
        minimum-number-of-calls: 20
        record-exceptions:
          - java.io.IOException
          - org.springframework.web.client.HttpServerErrorException
        ignore-exceptions:
          - com.fincore.payments.exception.PaymentValidationException

      kyc-provider:
        failure-rate-threshold: 50
        sliding-window-size: 50
        wait-duration-in-open-state: 60s

      llm-provider:
        failure-rate-threshold: 40              # tighter - LLM expensive
        slow-call-duration-threshold: 30s       # LLMs are slow by nature
        wait-duration-in-open-state: 120s

3.2. Usage in code

@Component
class BankAdapterAdapter(
    private val httpClient: BankProviderHttpClient,
    private val circuitBreakerRegistry: CircuitBreakerRegistry,
) : BankProvider {

    private val cb = circuitBreakerRegistry.circuitBreaker("bank-adapter")

    override suspend fun send(payment: Payment): BankResponse =
        cb.executeSuspendFunction { httpClient.post(payment) }
}

3.3. Circuit OPEN behavior

New requests fail fast with Resilience4j's CallNotPermittedException → mapped to 503 Service Unavailable with Retry-After: 30
In-flight payments enter the retry topic with backoff
Health probe reflects circuit state: if all critical providers' circuits are OPEN, readiness probe fails (load balancer drains)

3.4. Half-open state

After wait duration, circuit allows up to N probe calls. Successful probes close the circuit. Failed probes re-open it.
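
The state cycle can be illustrated with a deliberately simplified breaker. This is not Resilience4j's implementation: it counts consecutive failures instead of a sliding-window failure rate, is not thread-safe, and every name in it (`SimpleBreaker`, `State`, the constructor parameters) is hypothetical.

```kotlin
enum class State { CLOSED, OPEN, HALF_OPEN }

class SimpleBreaker(
    private val failureThreshold: Int,   // consecutive failures that open the circuit (simplification)
    private val waitMillis: Long,        // how long to stay OPEN before probing
    private val probes: Int,             // successful probes needed to close again
    private val clock: () -> Long,       // injectable clock, in milliseconds
) {
    var state = State.CLOSED
        private set
    private var failures = 0
    private var probeSuccesses = 0
    private var openedAt = 0L

    fun <T> call(block: () -> T): T {
        if (state == State.OPEN) {
            // Fail fast until the wait duration has elapsed, then allow probe calls.
            if (clock() - openedAt < waitMillis) throw IllegalStateException("circuit open")
            state = State.HALF_OPEN
            probeSuccesses = 0
        }
        try {
            val result = block()
            onSuccess()
            return result
        } catch (e: Exception) {
            onFailure()
            throw e
        }
    }

    private fun onSuccess() {
        if (state == State.HALF_OPEN && ++probeSuccesses >= probes) state = State.CLOSED
        failures = 0
    }

    private fun onFailure() {
        failures++
        // Any failed probe re-opens the circuit; in CLOSED the threshold decides.
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN
            openedAt = clock()
            failures = 0
        }
    }
}
```

With `failureThreshold = 2` and `waitMillis = 1000`, two failing calls open the circuit, calls during the next second are rejected without touching the dependency, and the first successful call after that closes it again (given `probes = 1`).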

3.5. Metrics & alerts

resilience4j.circuitbreaker.state{name=bank-adapter, state=open} - gauge per state tag (closed, open, half_open, ...); 1 while the breaker is in that state, otherwise 0
Alert: circuit OPEN for > 2 minutes
Alert: failure rate > 30% for > 5 minutes

4. Bulkheads (resource isolation)

We isolate thread pools so that one slow dependency doesn't starve others.

4.1. Approach

One thread pool per external dependency
Separate connection pools per workload type (read-heavy vs write-heavy)
Kafka consumer threads isolated from REST handlers

4.2. Configuration

@Configuration
class BulkheadConfig {

    // For bank adapter calls (potentially slow)
    @Bean("bankAdapterExecutor")
    fun bankAdapterExecutor(): Executor =
        Executors.newFixedThreadPool(20, ThreadFactoryBuilder().setNameFormat("bank-adapter-%d").build())

    // For KYC provider calls (slow, but lower volume)
    @Bean("kycProviderExecutor")
    fun kycProviderExecutor(): Executor =
        Executors.newFixedThreadPool(10, ThreadFactoryBuilder().setNameFormat("kyc-%d").build())

    // For LLM calls (very slow)
    @Bean("llmExecutor")
    fun llmExecutor(): Executor =
        Executors.newFixedThreadPool(5, ThreadFactoryBuilder().setNameFormat("llm-%d").build())
}
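
One caveat with `Executors.newFixedThreadPool`: it caps the threads but queues work in an unbounded `LinkedBlockingQueue`, so a stalled dependency can still pile up tasks. A possible variant, sketched here with plain JDK classes and a hypothetical factory name, bounds the queue and rejects work once the bulkhead is full.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.ThreadPoolExecutor
import java.util.concurrent.TimeUnit

// Hypothetical factory: fixed-size pool with a bounded queue.
// When all threads are busy and the queue is full, execute() throws
// RejectedExecutionException instead of queueing indefinitely.
fun boundedBulkhead(threads: Int, queueCapacity: Int): ThreadPoolExecutor =
    ThreadPoolExecutor(
        threads, threads,                        // core size == max size: fixed pool
        0L, TimeUnit.MILLISECONDS,               // no keep-alive needed for a fixed pool
        ArrayBlockingQueue<Runnable>(queueCapacity),
        ThreadPoolExecutor.AbortPolicy(),        // reject when saturated
    )
```

Callers can map the rejection to a fast 503, which keeps the failure local to the saturated dependency. The thread counts and queue sizes are tuning decisions and are not taken from the configuration above.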

4.3. Spring async usage

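// @Async requires a void or Future return type, so it cannot wrap a suspend function.
// Coroutine callers can await() the future, or use withContext(executor.asCoroutineDispatcher()).
@Async("bankAdapterExecutor")
fun sendToBank(payment: Payment): CompletableFuture<BankResponse> = ...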

4.4. Why not virtual threads exclusively?

Virtual threads (JDK 21) are great for high-volume IO workloads. But:

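They run on one shared scheduler (a ForkJoinPool of carrier threads) with no per-dependency limit
A slow LLM provider can accumulate an unbounded number of in-flight calls, and calls that pin their carrier thread (for example, blocking inside a synchronized block on JDK 21) can starve bank traffic
Bulkheads with platform threads give a hard concurrency cap per dependency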

Strategy: virtual threads for REST handlers (high volume, mixed IO), platform threads with bulkheads for external dependencies (isolation needed).

spring:
  threads:
    virtual:
      enabled: true   # for Tomcat / Spring MVC

5. Backpressure (Kafka consumer flow control)

When a consumer can't keep up with topic throughput, naive consumption leads to OOM or rebalance storms.

5.1. Mechanisms

Pause/resume on lag

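import io.micrometer.core.instrument.MeterRegistry
import org.slf4j.LoggerFactory
import org.springframework.kafka.listener.ConcurrentMessageListenerContainer
import org.springframework.scheduling.annotation.Scheduled
import org.springframework.stereotype.Component

@Component
class BackpressureAwareConsumer(
    private val kafkaContainer: ConcurrentMessageListenerContainer<*, *>,
    private val meterRegistry: MeterRegistry,
) {
    private val log = LoggerFactory.getLogger(javaClass)

    @Scheduled(fixedDelay = 5_000)
    fun checkLag() {
        // "kafka.consumer.lag" is assumed to be a gauge published elsewhere in the application.
        // Topic lag does not shrink while the container is paused, so the resume branch is only
        // reachable if the gauge tracks a backlog that drains on its own (e.g. an internal queue).
        val lagMetric = meterRegistry.find("kafka.consumer.lag").gauge()?.value() ?: 0.0
        when {
            lagMetric > 10_000 && !kafkaContainer.isPauseRequested -> {
                log.warn("Backpressure: consumer lag $lagMetric - pausing")
                kafkaContainer.pause()
            }
            // Resume only below the lower threshold (hysteresis avoids pause/resume flapping)
            lagMetric < 1_000 && kafkaContainer.isPauseRequested -> kafkaContainer.resume()
        }
    }
}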

Bounded in-flight processing

spring:
  kafka:
    listener:
      concurrency: 4                     # 4 consumer threads (at most one per assigned partition)
      poll-timeout: 1s
      ack-mode: MANUAL_IMMEDIATE
    consumer:
      max-poll-records: 50                # bounded batch
      fetch-min-size: 1KB                 # Spring Boot's name for fetch.min.bytes
      properties:
        max.poll.interval.ms: 300000      # 5min - must finish batch

Outbox dispatcher rate-limited writes

Limit Kafka publish rate to broker capacity (avoid producer queue overflow)
Per-pod token bucket: 1000 events/sec by default
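
A per-pod token bucket can be as small as the sketch below. The class name, the constructor parameters, and the injectable clock are assumptions for illustration; only the "1000 events/sec" default comes from the text above.

```kotlin
// Hypothetical per-pod limiter for the outbox dispatcher.
class TokenBucket(
    private val capacity: Long,            // maximum burst size
    private val refillPerSecond: Long,     // steady-state rate, e.g. 1000
    private val nanoClock: () -> Long = System::nanoTime,
) {
    private var tokens = capacity.toDouble()
    private var lastRefill = nanoClock()

    // Returns true and consumes one token if an event may be published now.
    @Synchronized
    fun tryAcquire(): Boolean {
        val now = nanoClock()
        val elapsedSeconds = (now - lastRefill) / 1_000_000_000.0
        // Refill proportionally to elapsed time, never above capacity.
        tokens = minOf(capacity.toDouble(), tokens + elapsedSeconds * refillPerSecond)
        lastRefill = now
        if (tokens < 1.0) return false
        tokens -= 1.0
        return true
    }
}
```

The dispatcher would call `tryAcquire()` before each publish and leave the event in the outbox table for the next polling cycle when it returns false, so the limit delays events but never drops them.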

6. Graceful shutdown

When a pod receives SIGTERM (K8s rolling update, scale-down), it must:

Stop accepting new requests
Complete in-flight requests (up to deadline)
Drain Kafka consumers (stop polling, finish current batch)
Flush outbox dispatcher (don't lose unpublished events)
Close DB connections cleanly
Exit

6.1. Spring Boot config

server:
  shutdown: graceful
  netty:
    connection-timeout: 30s

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

6.2. Custom shutdown hook for outbox

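@Component
class OutboxDispatcherShutdown(
    private val dispatcher: OutboxDispatcher,
) : SmartLifecycle {

    private val log = LoggerFactory.getLogger(javaClass)

    // Lifecycle requires start() and stop(); the dispatcher is assumed to be started elsewhere
    override fun start() {}
    override fun stop() = stop(Runnable {})

    override fun stop(callback: Runnable) {
        log.info("Draining outbox dispatcher")
        try {
            dispatcher.shutdown(timeout = Duration.ofSeconds(20))
        } finally {
            callback.run()  // always signal completion, otherwise shutdown waits for the phase timeout
        }
    }

    override fun isRunning() = dispatcher.isRunning()
    // Beans with a higher phase stop first; verify this value against the web server's phase
    override fun getPhase() = SmartLifecycle.DEFAULT_PHASE - 100  // shut down before web layer
}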

6.3. Kubernetes integration

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: ledger
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]   # let load balancer drain

7. Connection pool management (HikariCP)

7.1. Per-workload pools

For read-heavy services (Ledger queries), separate pools:

spring:
  datasource:
    primary:                # for writes
      url: jdbc:postgresql://primary:5432/fincore
      hikari:
        maximum-pool-size: 20
        minimum-idle: 5
        connection-timeout: 5000
        max-lifetime: 1800000           # 30 min
        idle-timeout: 600000            # 10 min
        validation-timeout: 5000
        leak-detection-threshold: 10000

    read-replica:           # for reads (scale-out)
      url: jdbc:postgresql://replica:5432/fincore
      hikari:
        maximum-pool-size: 50           # more for reads
        minimum-idle: 10
        # ... same as above

7.2. Sizing rules

Per service, total pool size (pool size × pod count) ≤ Postgres max_connections / N services - avoid exhaustion
Per service: 20-50 for primary writes, 50-100 for read replicas
leak-detection-threshold: 10s - log connections that are held longer than 10 seconds without being returned
Never keep a transaction (and its connection) open across multiple HTTP calls - one connection per request
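
The first rule is plain arithmetic, and it is easy to get wrong once pods scale out. A small sketch with hypothetical names and illustrative numbers:

```kotlin
// Hypothetical budget check: every pod of every service can open up to `poolSize` connections.
data class ServicePool(val service: String, val pods: Int, val poolSize: Int)

fun totalConnections(pools: List<ServicePool>): Int =
    pools.sumOf { it.pods * it.poolSize }

// `reserved` leaves headroom for superuser sessions, migrations and replication.
fun fitsBudget(pools: List<ServicePool>, maxConnections: Int, reserved: Int): Boolean =
    totalConnections(pools) <= maxConnections - reserved
```

For example, 3 ledger pods and 2 payment pods with a pool of 20 each can open 100 connections in total. That fits `max_connections = 200` with 10 reserved, but not `max_connections = 100`. The check has to be repeated whenever the autoscaler's maximum replica count changes.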

8. Disaster recovery

8.1. Backup strategy

Layer	Mechanism	Frequency	Retention
Postgres data	Streaming (physical) replication to standby + WAL archiving to S3	Continuous (WAL) + nightly snapshot	35 days PITR window
Postgres schema	Liquibase changelog versioned in git	On every deploy	Forever (git history)
Decision rules	DB backup + git export weekly	Weekly	1 year
Kafka topics	Topic compaction + tiered storage to S3	Real-time	7 days hot, 90 days cold
Object storage (KYC docs metadata refs)	Provider-side replication	Continuous	Per provider SLA

8.2. Recovery procedures

Point-in-time recovery (PITR):

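# Restore Postgres to a specific second (PostgreSQL 12+; recovery.conf no longer exists)
# 1. Restore the latest base backup taken before the target time into the new data directory.
#    Base backups are taken from the running primary with:
#    pg_basebackup --pgdata=/var/lib/postgresql/restore --wal-method=stream
# 2. Add the recovery settings to postgresql.conf:
restore_command = 'aws s3 cp s3://fincore-wal/%f %p'
recovery_target_time = '2026-04-25 14:32:00 UTC'
# 3. Create the recovery.signal file and start Postgres:
touch /var/lib/postgresql/restore/recovery.signal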

Region failover (out of OSS scope, on roadmap):

Multi-region active-passive via Postgres logical replication
Kafka MirrorMaker 2 to standby region
DNS-level traffic shift with health checks

8.3. RTO / RPO targets (commitments)

RPO (recovery point objective): ≤ 1 second (synchronous replication for primary, WAL streaming for archive)
RTO (recovery time objective): ≤ 30 minutes for PITR; ≤ 5 minutes for replica promotion

8.4. DR testing

Mandatory quarterly drills:

Promote standby → verify writes on new primary
Restore from WAL archive to point-in-time → verify ledger invariants hold
Replay outbox events from snapshot → verify no event loss
Simulate Kafka broker outage → verify outbox accumulation works, no data loss

Documented in runbooks/disaster-recovery-drill.md.

9. Multi-region considerations (v1.x roadmap)

Not in v0.1. Mentioned for completeness.

9.1. Active-passive (recommended starting point)

Primary region serves all writes
Secondary region is read-only standby
Failover via DNS + Postgres replication promote
RTO 5 min, RPO 1 sec

9.2. Active-active (advanced, v1.5+)

Per-account region affinity (account "lives" in one region for writes)
Cross-region payments use distributed sagas
Conflict resolution at outbox level (last-write-wins on metadata; ledger never conflicts because account is region-pinned)
Requires architectural ADR before adoption

9.3. Read-only replica promotion

Reads served from local-region replica
Eventual consistency tolerated for read-only paths (balance lookups)
For strict consistency reads (post-payment confirmation), use primary region directly

10. Chaos engineering hooks

Chaos engineering hooks for fault injection in dev and staging (a Tier-2 feature).

10.1. Module: `fincore-chaos`

Optional sub-project for inducing failures in dev/staging:

fincore:
  chaos:
    enabled: ${FINCORE_CHAOS_ENABLED:false}
    bank-adapter:
      timeout-rate: 0.05      # 5% of calls timeout randomly
      partial-success-rate: 0.01  # 1% return partial success
      duplicate-callback-rate: 0.02  # 2% trigger duplicate webhook
    db:
      slow-query-rate: 0.001  # 0.1% queries delayed by 5s
      deadlock-rate: 0.0001
    kafka:
      duplicate-message-rate: 0.05
      lost-message-rate: 0.001

10.2. Invariant verification under chaos

Test suite runs end-to-end with chaos enabled, asserts:

Ledger SUM=0 invariant holds
No duplicate payments (idempotency works)
Outbox dispatch eventually succeeds (no event loss)
Saga compensation correctly unwinds

Failures here are release blockers.
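
The first two assertions can be pictured with a tiny in-memory model. `Entry` and `InMemoryLedger` are hypothetical and stand in for the real ledger; the point is only to show what "SUM=0" and "no duplicates" mean when a message is delivered twice.

```kotlin
import java.math.BigDecimal

// Hypothetical in-memory ledger used only to illustrate the two invariants.
data class Entry(val txnId: String, val account: String, val amount: BigDecimal)

class InMemoryLedger {
    private val seenTxnIds = HashSet<String>()
    val entries = mutableListOf<Entry>()

    // Posts a balanced transaction once; a redelivered txnId is ignored and returns false.
    fun post(txnId: String, legs: Map<String, BigDecimal>): Boolean {
        val total = legs.values.fold(BigDecimal.ZERO) { acc, a -> acc + a }
        require(total.signum() == 0) { "unbalanced transaction" }
        if (!seenTxnIds.add(txnId)) return false
        legs.forEach { (account, amount) -> entries.add(Entry(txnId, account, amount)) }
        return true
    }

    // Double-entry invariant: all entries together sum to zero.
    fun sumIsZero(): Boolean =
        entries.fold(BigDecimal.ZERO) { acc, e -> acc + e.amount }.signum() == 0
}
```

A chaos run with `duplicate-message-rate` enabled should end with `sumIsZero()` still true and exactly one set of entries per transaction id.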

11. Observability of resilience

All resilience patterns emit metrics. Central dashboard:

Metric	Source	Alert threshold
`resilience4j.circuitbreaker.state`	each external dep	OPEN > 2 min
`resilience4j.circuitbreaker.failure.rate`	each external dep	> 30% for 5 min
`kafka.consumer.lag`	each consumer group	> 10000 for 5 min
`outbox.events.pending`	outbox dispatcher	> 1000 for 5 min
`outbox.events.failed`	outbox dispatcher	> 0 (any failure)
`saga.requires_manual_intervention.count`	saga coordinator	> 0 (any)
`cache.hit.rate{cache=...}`	each cache	< 80% for 1 hour
`hikari.connections.active.usage`	each pool	> 80% for 5 min
`payment.retry.scheduled.depth`	payment retry job	> 1000

Full dashboard JSON in deploy/grafana/dashboards/resilience.json.

12. Resilience checklist for release

Before tagging v0.x.0:

13. What's deliberately not in v0.1

No Multi-region active-active (v1.5+ with explicit ADR)
No Full Saga implementation (waiting for service extraction)
No Distributed locks across DB+Kafka (not needed in modular monolith)
No Database sharding (a single Postgres scales to ~10M accounts; table partitioning comes before sharding)
No Eventual consistency at API boundaries (we offer strict consistency)
No Cross-data-center replication of caches (Redis L2 is per-region)

These show up in the roadmap when actual demand emerges. Resilience is a journey, not a destination.

FinCore Engine - open-source fintech core (BSL 1.1 -> Apache 2.0). Repo - Roadmap - Vision
