Skip to content

Architecture Resilience

Tiana_ edited this page May 30, 2026 · 1 revision

Architecture - Resilience, Caching, Sagas

Detailed treatment of resilience patterns: caching, sagas, circuit breakers, bulkheads, backpressure, graceful shutdown, connection pools, disaster recovery, chaos engineering hooks.

Companion to Architecture-Overview, Architecture-Event-Flow, Architecture-SLA-SLI-SLO.


1. Caching strategy

1.1. Cache levels

flowchart LR
    Request --> L1[L1: In-process<br/>Caffeine]
    L1 -->|miss| L2[L2: Redis<br/>shared]
    L2 -->|miss| Source[(Source of truth<br/>PostgreSQL / Keycloak / external)]
    Source --> L2
    L2 --> L1
    L1 --> Response[Response]
Loading

L1 - Caffeine (in-JVM)

  • Hot data, sub-millisecond access
  • Per-pod (no network)
  • Bounded size (eviction: LRU + TTL)
  • Used for: parsed JWT validation cache, decision rule definitions, currency reference data

L2 - Redis (shared)

  • Cross-pod consistency
  • Network-bound (1-2ms typical)
  • Used for: idempotency keys, rate-limit counters, JWKS, distributed locks (Redlock)

L3 - DB materialized views

  • Source of truth for derived data
  • Refreshed on writes (or scheduled)
  • Used for: account_balances (sum of entries)

1.2. What we cache (and why)

Data Layer TTL Invalidation Notes
JWKS (Keycloak public keys) L1 + L2 10 min On HTTP 401 from Keycloak Two layers - even if Redis is down, L1 keeps signed validation working
Decoded JWT principal L1 5 min (= JWT TTL) Auto-expiry Skip re-decode for repeat requests
Decision rules (active) L1 + L2 5 min Event-based on decision.rule.activated / decision.rule.deprecated Stale rules = wrong decisions, so event invalidation
Idempotency response L2 24 h Auto-expire Per-key TTL in Redis SET EX
Rate-limit counters L2 1 min sliding window Auto-expire Sliding window log algorithm
Currency reference data (ISO 4217) L1 1 day Manual flush via admin endpoint Static-ish, low frequency
Sanctions list snapshot L2 1 hour Event-based on provider webhook Critical to be fresh
External provider responses (KYC, AML data) L2 provider-specific Per-provider config Some are 24h, some are 1m
Account balance L3 (Postgres MV) refreshed CONCURRENTLY after each post - NOT in Redis (consistency hazard)
Account metadata NOT cached - - Always read from DB (small table, indexed)
Transactions / entries NOT cached - - Append-only, queries are infrequent reads
Payment lifecycle state NOT cached - - Strict consistency required

1.3. Pattern: cache-aside (default for reads)

@Component
class CachedDecisionRuleRepository(
    private val ruleRepo: DecisionRuleRepository,
    private val cache: Cache<String, List<DecisionRule>>,  // Caffeine
    private val redis: ReactiveStringRedisTemplate,
) {
    suspend fun getActiveRules(ruleSetId: String): List<DecisionRule> {
        // L1
        cache.getIfPresent(ruleSetId)?.let { return it }

        // L2
        val redisKey = "decision:rules:active:$ruleSetId"
        redis.opsForValue().get(redisKey).awaitFirstOrNull()
            ?.let { json ->
                val rules = deserialize(json)
                cache.put(ruleSetId, rules)
                return rules
            }

        // Source
        val rules = ruleRepo.findByRuleSetIdAndStatus(ruleSetId, ACTIVE)
        cache.put(ruleSetId, rules)
        redis.opsForValue().set(redisKey, serialize(rules), Duration.ofMinutes(5)).awaitSingle()
        return rules
    }

    @KafkaListener(topics = ["decision.events"])
    fun onRuleChange(event: EventEnvelope) {
        if (event.type in setOf("decision.rule.activated", "decision.rule.deprecated")) {
            val ruleSetId = event.data["ruleSetId"] as String
            cache.invalidate(ruleSetId)
            redis.delete("decision:rules:active:$ruleSetId").subscribe()
        }
    }
}

1.4. Pattern: write-through (idempotency keys)

@Transactional
fun storeIdempotencyResponse(key: IdempotencyKey, hash: String, response: Response) {
    // DB first (source of truth) - atomic with business state
    idempotencyRepo.save(IdempotencyRecord(key, hash, response, expiresAt = now + 24h))

    // Redis cache update (best-effort, eventually consistent)
    redis.opsForValue().set(
        "idem:${key.value}",
        serialize(IdempotencyCacheEntry(hash, response)),
        Duration.ofHours(24),
    ).subscribe()
}

If Redis write fails - log, continue. Next read falls through to DB. No correctness impact.

1.5. Cache stampede protection

For hot keys (e.g., active sanctions list), simultaneous misses can hammer the source. Mitigation:

// Single-flight pattern via Caffeine
val cache: AsyncLoadingCache<String, SanctionsSnapshot> = Caffeine.newBuilder()
    .expireAfterWrite(Duration.ofHours(1))
    .buildAsync { key, _ -> loadSanctionsFromProvider(key) }

// Multiple concurrent requests for the same key share a single load
suspend fun getSanctions(): SanctionsSnapshot = cache.get("global").await()

For Redis L2 stampede: use SETNX with short TTL as advisory lock during refresh.

1.6. Cache as optional dependency

Every cache read is wrapped:

suspend fun fromCacheOrFallback(key: String, fallback: suspend () -> T): T =
    try { cache.get(key) ?: fallback() }
    catch (e: Exception) { log.warn("cache miss/error: ${e.message}"); fallback() }

If Redis is unreachable, services degrade gracefully - slower (DB-only) but correct. Health probe doesn't fail just because Redis is down (services are still useful).

1.7. Anti-patterns (forbidden)

  • No Caching Account entity by ID - even with short TTL, balance staleness is a fintech bug.
  • No Caching API responses globally - RFC 7807 problem details leak correlationId which is per-request.
  • No Caching across tenants without explicit tenant key.
  • No Storing raw money amounts as strings in Redis (use serialized Money with explicit currency, never trust the cache to preserve scale).

2. Saga pattern

2.1. Why we don't use sagas in v0.1

In modular monolith mode, all aggregates share one DB transaction when needed. Payment + Ledger writes commit atomically. This is correctness-by-default - no compensating logic, no eventual consistency surprises.

Adding sagas now would be premature complexity.

2.2. When sagas become necessary

After service extraction (Y1 H2 onward), Payment Service and Ledger Service no longer share a DB transaction. The cross-service operations that need sagas:

Operation Steps Compensation needed if step N fails
Initiate Payment (1) Payment.create → (2) Decision.evaluate → (3) Ledger.postTransaction → (4) Payment.markProcessing → (5) BankAdapter.send Reverse step 3 if step 5 fails permanently
Resolve Compliance Case (REJECT) (1) Case.resolve → (2) Payment.markRejected → (3) Ledger.reverseTransaction Re-open case if step 3 fails
External Refund (1) BankAdapter.refund → (2) Ledger.postReversal → (3) Payment.markRefunded Mark "manual review" if step 2 fails (rare, requires operator)
KYC-gated Account Activation (1) KYC.approve → (2) User.markVerified → (3) Account.activate Roll back step 2 if step 3 fails

2.3. Pattern choice: Orchestrated Saga

We adopt orchestrated sagas (vs choreographed) for clarity and operability:

  • Orchestrated - a central SagaCoordinator per saga type, explicit state machine, all steps and compensations in one place.
  • Choreographed - services react to events, no central state. Simpler at small scale, harder to debug at scale, harder to operate.

For fintech, orchestrated wins because:

  • Operators need a clear "what state is this saga in?" answer
  • Compensation logic is explicit and testable
  • Saga history is easy to query (single state table)

2.4. Saga state model

@Entity
@Table(name = "sagas")
class Saga(
    @Id val id: SagaId,
    val sagaType: String,                  // e.g. "payment.initiate"
    val correlationId: UUID,
    var status: SagaStatus,
    @Type(JsonBinaryType::class)
    val context: JsonObject,               // saga-specific state (paymentId, accountIds, etc.)
    var currentStep: Int,
    val totalSteps: Int,
    val createdAt: Instant,
    var updatedAt: Instant,
    val completedAt: Instant?,
    @Version var version: Long = 0,
    @OneToMany(mappedBy = "saga", cascade = [PERSIST])
    val executions: MutableList<SagaStepExecution>,
)

@Entity
@Table(name = "saga_step_executions")
class SagaStepExecution(
    @Id val id: UUID,
    @ManyToOne val saga: Saga,
    val stepNumber: Int,
    val stepName: String,
    var status: SagaStepStatus,            // PENDING, IN_PROGRESS, SUCCEEDED, FAILED, COMPENSATED
    val request: JsonObject,
    var response: JsonObject?,
    var error: String?,
    val attempts: Int = 0,
    val maxAttempts: Int = 5,
    val startedAt: Instant,
    var completedAt: Instant?,
)

enum class SagaStatus { RUNNING, SUCCEEDED, COMPENSATING, FAILED, REQUIRES_MANUAL_INTERVENTION }
enum class SagaStepStatus { PENDING, IN_PROGRESS, SUCCEEDED, FAILED, COMPENSATED }

2.5. Coordinator example: Initiate Payment Saga

class InitiatePaymentSaga(
    private val coordinator: SagaCoordinator,
    private val paymentClient: PaymentServiceClient,
    private val decisionClient: DecisionEngineClient,
    private val ledgerClient: LedgerServiceClient,
    private val bankAdapter: BankAdapter,
) {

    val steps: List<SagaStep<*, *>> = listOf(
        SagaStep("create-payment",
            forward = { ctx -> paymentClient.create(ctx.cmd) },
            compensate = { ctx -> paymentClient.markCancelled(ctx.paymentId) }),

        SagaStep("evaluate-decision",
            forward = { ctx -> decisionClient.evaluate(ctx.toContext()) },
            compensate = { /* no compensation - pure read */ }),

        SagaStep("post-ledger-transaction",
            forward = { ctx -> ledgerClient.post(ctx.toLedgerCmd()) },
            compensate = { ctx -> ledgerClient.reverse(ctx.ledgerTxnId, "saga compensation") }),

        SagaStep("mark-payment-processing",
            forward = { ctx -> paymentClient.markProcessing(ctx.paymentId) },
            compensate = { ctx -> paymentClient.markFailed(ctx.paymentId) }),

        SagaStep("send-to-bank",
            forward = { ctx -> bankAdapter.send(ctx.toBankCmd()) },
            compensate = { /* terminal: bank ack received, can't compensate */
                throw NonCompensableException("bank already accepted")
            }),
    )

    suspend fun execute(cmd: InitiatePaymentCommand): Saga = coordinator.run(this, cmd)
}

2.6. Failure modes

Scenario Handling
Step succeeds, app crashes before recording Saga is replayed on recovery - step's idempotencyKey ensures no duplicate effect
Step fails transiently Retry up to maxAttempts per step (exponential backoff)
Step fails permanently, compensation succeeds Saga ends in FAILED state, original command rolled back
Step fails permanently, compensation fails Saga ends in REQUIRES_MANUAL_INTERVENTION - alert fires, operator console shows details
Compensation impossible (e.g., money already left bank) Saga ends in REQUIRES_MANUAL_INTERVENTION - explicit operator workflow

2.7. Idempotency in steps

Every step's forward and compensate action receives an idempotencyKey derived from saga context:

  • forward("create-payment") uses payment-create-${sagaId}
  • compensate("post-ledger-transaction") uses ledger-reverse-${sagaId}-${stepN}

Replays are safe - no double-effect.

2.8. Saga observability

  • Saga state queryable via GET /v1/admin/sagas/{id} (admin role)
  • Each step execution emits structured log + metric (saga.step.<name>.duration)
  • REQUIRES_MANUAL_INTERVENTION state triggers PagerDuty/Slack alert
  • Saga timeline rendered in operator UI (Phase E)

2.9. Why not Spring State Machine?

Considered. Rejected because:

  • Heavyweight (extra dep, XML config historically)
  • Doesn't fit naturally into a "saga is a sequence of remote calls" model
  • Custom coordinator is ~300 lines of Kotlin and we own it

Kept option: if community contributes Spring State Machine integration for sagas, we accept it as optional dependency.


3. Circuit breakers

For every external call (bank, KYC, LLM, sanctions, payment provider), we wrap with a circuit breaker via Resilience4j.

3.1. Configuration

resilience4j:
  circuitbreaker:
    instances:
      bank-adapter:
        failure-rate-threshold: 50              # %
        slow-call-rate-threshold: 60
        slow-call-duration-threshold: 5s
        permitted-number-of-calls-in-half-open-state: 3
        sliding-window-size: 100
        sliding-window-type: COUNT_BASED
        wait-duration-in-open-state: 30s
        minimum-number-of-calls: 20
        record-exceptions:
          - java.io.IOException
          - org.springframework.web.client.HttpServerErrorException
        ignore-exceptions:
          - com.fincore.payments.exception.PaymentValidationException

      kyc-provider:
        failure-rate-threshold: 50
        sliding-window-size: 50
        wait-duration-in-open-state: 60s

      llm-provider:
        failure-rate-threshold: 40              # tighter - LLM expensive
        slow-call-duration-threshold: 30s       # LLMs are slow by nature
        wait-duration-in-open-state: 120s

3.2. Usage in code

@Component
class BankAdapterAdapter(
    private val httpClient: BankProviderHttpClient,
    private val circuitBreakerRegistry: CircuitBreakerRegistry,
) : BankProvider {

    private val cb = circuitBreakerRegistry.circuitBreaker("bank-adapter")

    override suspend fun send(payment: Payment): BankResponse =
        cb.executeSuspendFunction { httpClient.post(payment) }
}

3.3. Circuit OPEN behavior

  • New requests fail-fast with CircuitBreakerOpenException → mapped to 503 Service Unavailable with Retry-After: 30
  • Inflight payments enter retry topic with backoff
  • Health probe reflects circuit state: if all critical providers' circuits are OPEN, readiness probe fails (load balancer drains)

3.4. Half-open state

After wait duration, circuit allows up to N probe calls. Successful probes close the circuit. Failed probes re-open it.

3.5. Metrics & alerts

  • resilience4j.circuitbreaker.state{name=bank-adapter} - 0=CLOSED, 1=OPEN, 2=HALF_OPEN
  • Alert: circuit OPEN for > 2 minutes
  • Alert: failure rate > 30% for > 5 minutes

4. Bulkheads (resource isolation)

We isolate thread pools so that one slow dependency doesn't starve others.

4.1. Approach

  • One thread pool per external dependency
  • Separate connection pools per workload type (read-heavy vs write-heavy)
  • Kafka consumer threads isolated from REST handlers

4.2. Configuration

@Configuration
class BulkheadConfig {

    // For bank adapter calls (potentially slow)
    @Bean("bankAdapterExecutor")
    fun bankAdapterExecutor(): Executor =
        Executors.newFixedThreadPool(20, ThreadFactoryBuilder().setNameFormat("bank-adapter-%d").build())

    // For KYC provider calls (slow, but lower volume)
    @Bean("kycProviderExecutor")
    fun kycProviderExecutor(): Executor =
        Executors.newFixedThreadPool(10, ThreadFactoryBuilder().setNameFormat("kyc-%d").build())

    // For LLM calls (very slow)
    @Bean("llmExecutor")
    fun llmExecutor(): Executor =
        Executors.newFixedThreadPool(5, ThreadFactoryBuilder().setNameFormat("llm-%d").build())
}

4.3. Spring async usage

@Async("bankAdapterExecutor")
suspend fun sendToBank(payment: Payment): BankResponse = ...

4.4. Why not virtual threads exclusively?

Virtual threads (JDK 21) are great for high-volume IO workloads. But:

  • They share a single ForkJoinPool - no isolation between dependencies
  • A slow LLM call could pin many virtual threads, starving banks
  • Bulkheads with platform threads give isolation per dependency

Strategy: virtual threads for REST handlers (high volume, mixed IO), platform threads with bulkheads for external dependencies (isolation needed).

spring:
  threads:
    virtual:
      enabled: true   # for Tomcat / Spring MVC

5. Backpressure (Kafka consumer flow control)

When a consumer can't keep up with topic throughput, naive consumption leads to OOM or rebalance storms.

5.1. Mechanisms

Pause/resume on lag

@Component
class BackpressureAwareConsumer(
    private val kafkaContainer: ConcurrentMessageListenerContainer<*, *>,
) {

    @Scheduled(fixedDelay = 5_000)
    fun checkLag() {
        val lagMetric = meterRegistry.find("kafka.consumer.lag").gauge()?.value() ?: 0.0
        when {
            lagMetric > 10_000 -> {
                log.warn("Backpressure: consumer lag $lagMetric - pausing")
                kafkaContainer.pause()
            }
            lagMetric < 1_000 -> kafkaContainer.resume()
        }
    }
}

Bounded in-flight processing

spring:
  kafka:
    listener:
      concurrency: 4                     # 4 concurrent partitions
      poll-timeout: 1s
      ack-mode: MANUAL_IMMEDIATE
    consumer:
      max-poll-records: 50                # bounded batch
      max-poll-interval-ms: 300000        # 5min - must finish batch
      fetch-min-bytes: 1024

Outbox dispatcher rate-limited writes

  • Limit Kafka publish rate to broker capacity (avoid producer queue overflow)
  • Per-pod token bucket: 1000 events/sec by default

6. Graceful shutdown

When a pod receives SIGTERM (K8s rolling update, scale-down), it must:

  1. Stop accepting new requests
  2. Complete in-flight requests (up to deadline)
  3. Drain Kafka consumers (stop polling, finish current batch)
  4. Flush outbox dispatcher (don't lose unpublished events)
  5. Close DB connections cleanly
  6. Exit

6.1. Spring Boot config

server:
  shutdown: graceful
  netty:
    connection-timeout: 30s

spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

6.2. Custom shutdown hook for outbox

@Component
class OutboxDispatcherShutdown(
    private val dispatcher: OutboxDispatcher,
) : SmartLifecycle {

    override fun stop(callback: Runnable) {
        log.info("Draining outbox dispatcher")
        dispatcher.shutdown(timeout = Duration.ofSeconds(20))
        callback.run()
    }

    override fun isRunning() = dispatcher.isRunning()
    override fun getPhase() = SmartLifecycle.DEFAULT_PHASE - 100  // shut down before web layer
}

6.3. Kubernetes integration

spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: ledger
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 5"]   # let load balancer drain

7. Connection pool management (HikariCP)

7.1. Per-workload pools

For read-heavy services (Ledger queries), separate pools:

spring:
  datasource:
    primary:                # for writes
      url: jdbc:postgresql://primary:5432/fincore
      hikari:
        maximum-pool-size: 20
        minimum-idle: 5
        connection-timeout: 5000
        max-lifetime: 1800000           # 30 min
        idle-timeout: 600000            # 10 min
        validation-timeout: 5000
        leak-detection-threshold: 10000

    read-replica:           # for reads (scale-out)
      url: jdbc:postgresql://replica:5432/fincore
      hikari:
        maximum-pool-size: 50           # more for reads
        minimum-idle: 10
        # ... same as above

7.2. Sizing rules

  • Total pool size ≤ Postgres max_connections / N services - avoid exhaustion
  • Per service: 20-50 for primary writes, 50-100 for read replicas
  • leak-detection-threshold: 10s - log unreturned connections
  • Avoid connections in transactions across multiple HTTP calls - one connection per request

8. Disaster recovery

8.1. Backup strategy

Layer Mechanism Frequency Retention
Postgres data Logical replication to standby + WAL archiving to S3 Continuous (WAL) + nightly snapshot 35 days PITR window
Postgres schema Liquibase changelog versioned in git On every deploy Forever (git history)
Decision rules DB backup + git export weekly Weekly 1 year
Kafka topics Topic compaction + tiered storage to S3 Real-time 7 days hot, 90 days cold
Object storage (KYC docs metadata refs) Provider-side replication Continuous Per provider SLA

8.2. Recovery procedures

Point-in-time recovery (PITR):

# Restore Postgres to a specific second
pg_basebackup --pgdata=/var/lib/postgresql/restore --xlog-method=stream
recovery.conf:
  restore_command = 'aws s3 cp s3://fincore-wal/%f %p'
  recovery_target_time = '2026-04-25 14:32:00 UTC'

Region failover (out of OSS scope, on roadmap):

  • Multi-region active-passive via Postgres logical replication
  • Kafka mirror maker 2.0 to standby region
  • DNS-level traffic shift with health checks

8.3. RTO / RPO targets (commitments)

  • RPO (recovery point objective): ≤ 1 second (synchronous replication for primary, WAL streaming for archive)
  • RTO (recovery time objective): ≤ 30 minutes for PITR; ≤ 5 minutes for replica promotion

8.4. DR testing

Mandatory quarterly drills:

  1. Promote standby → verify writes on new primary
  2. Restore from WAL archive to point-in-time → verify ledger invariants hold
  3. Replay outbox events from snapshot → verify no event loss
  4. Simulate Kafka broker outage → verify outbox accumulation works, no data loss

Documented in runbooks/disaster-recovery-drill.md.


9. Multi-region considerations (v1.x roadmap)

Not in v0.1. Mentioned for completeness.

9.1. Active-passive (recommended starting point)

  • Primary region serves all writes
  • Secondary region is read-only standby
  • Failover via DNS + Postgres replication promote
  • RTO 5 min, RPO 1 sec

9.2. Active-active (advanced, v1.5+)

  • Per-account region affinity (account "lives" in one region for writes)
  • Cross-region payments use distributed sagas
  • Conflict resolution at outbox level (later-write-wins on metadata; ledger never conflicts because account is region-pinned)
  • Requires architectural ADR before adoption

9.3. Read-only replica promotion

  • Reads served from local-region replica
  • Eventual consistency tolerated for read-only paths (balance lookups)
  • For strict consistency reads (post-payment confirmation), use primary region directly

10. Chaos engineering hooks

A Tier-2 killer feature (chaos engineering hooks for dev/staging fault injection).

10.1. Module: fincore-chaos

Optional sub-project for inducing failures in dev/staging:

fincore:
  chaos:
    enabled: ${FINCORE_CHAOS_ENABLED:false}
    bank-adapter:
      timeout-rate: 0.05      # 5% of calls timeout randomly
      partial-success-rate: 0.01  # 1% return partial success
      duplicate-callback-rate: 0.02  # 2% trigger duplicate webhook
    db:
      slow-query-rate: 0.001  # 0.1% queries delayed by 5s
      deadlock-rate: 0.0001
    kafka:
      duplicate-message-rate: 0.05
      lost-message-rate: 0.001

10.2. Invariant verification under chaos

Test suite runs end-to-end with chaos enabled, asserts:

  • Ledger SUM=0 invariant holds
  • No duplicate payments (idempotency works)
  • Outbox dispatch eventually succeeds (no event loss)
  • Saga compensation correctly unwinds

Failures here are release blockers.


11. Observability of resilience

All resilience patterns emit metrics. Central dashboard:

Metric Source Alert threshold
resilience4j.circuitbreaker.state each external dep OPEN > 2 min
resilience4j.circuitbreaker.failure.rate each external dep > 30% for 5 min
kafka.consumer.lag each consumer group > 10000 for 5 min
outbox.events.pending outbox dispatcher > 1000 for 5 min
outbox.events.failed outbox dispatcher > 0 (any failure)
saga.requires_manual_intervention.count saga coordinator > 0 (any)
cache.hit.rate{cache=...} each cache < 80% for 1 hour
hikari.connections.active.usage each pool > 80% for 5 min
payment.retry.scheduled.depth payment retry job > 1000

Full dashboard JSON in deploy/grafana/dashboards/resilience.json.


12. Resilience checklist for release

Before tagging v0.x.0:

  • All external dependencies wrapped in circuit breakers
  • Bulkhead executors configured per dependency
  • Graceful shutdown verified - drain test with outbox + consumers
  • Connection pool sized appropriately
  • Backup strategy in place, RPO < 1 sec verified
  • DR drill executed in last quarter
  • Chaos test suite passing (in chaos profile)
  • Saga state recoverable from crash mid-execution
  • Cache invalidation paths tested for staleness
  • Health probes accurate (don't fail just from Redis miss)
  • Rate limiting tuned per endpoint
  • Resilience metrics on Grafana dashboard
  • Runbook entries for each "REQUIRES_MANUAL_INTERVENTION" scenario

13. What's deliberately not in v0.1

  • No Multi-region active-active (v1.5+ with explicit ADR)
  • No Full Saga implementation (waiting for service extraction)
  • No Distributed locks across DB+Kafka (not needed in modular monolith)
  • No Database sharding (single Postgres scales to ~10M accounts; partitioning earlier)
  • No Eventual consistency at API boundaries (we offer strict)
  • No Cross-data-center replication of caches (Redis L2 is per-region)

These show up in the roadmap when actual demand emerges. Resilience is a journey, not a destination.

Clone this wiki locally