
ADR 0003 Outbox over Direct Kafka

Tiana_ edited this page May 30, 2026 · 1 revision

ADR-0003: Outbox pattern over direct Kafka publish

Status: Accepted
Date: 2026-04-25
Decider: Maintainer

Context

FinCore services produce events for downstream consumers (AML check, webhook delivery, analytics). The naive approach is to publish directly to Kafka alongside the DB transaction. This creates the dual-write problem: the two writes cannot be made atomic. If the publish happens after the DB commit and fails, the event is silently lost. If the publish happens before the commit and the transaction then rolls back, downstream consumers see phantom events.
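The lost-event half of the dual-write problem can be reproduced in a few lines. This is a minimal sketch, not FinCore code: SQLite stands in for Postgres, and `publish`, `BrokerDown`, `naive_transfer`, and the `transfers` table are hypothetical names invented for the illustration.

```python
import sqlite3


class BrokerDown(Exception):
    """Hypothetical stand-in for a Kafka publish failure."""


def publish(topic, payload, broker_up):
    # Hypothetical stand-in for a Kafka producer send.
    if not broker_up:
        raise BrokerDown(topic)
    return payload


def naive_transfer(db, sent, broker_up):
    # 1) Commit the business state first ...
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT INTO transfers (id, amount) VALUES (?, ?)", ("t-1", 100))
    # 2) ... then publish. A failure here cannot undo the commit above.
    try:
        sent.append(publish("transfers", "t-1", broker_up))
    except BrokerDown:
        pass  # the row exists, but no event will ever describe it


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE transfers (id TEXT PRIMARY KEY, amount INTEGER)")
sent = []
naive_transfer(db, sent, broker_up=False)
rows = db.execute("SELECT COUNT(*) FROM transfers").fetchone()[0]
```

After the run, `rows` is 1 while `sent` is empty: the transfer is durable, but no downstream consumer will ever hear about it.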

Common "solutions" that don't actually work:

  • Publishing inside @Transactional - Kafka isn't part of the JDBC transaction
  • Kafka transactions (KIP-98) - cover only Kafka; they don't span Kafka + Postgres
  • 2PC / XA - slow, brittle, universally regretted
  • Hoping Kafka publish never fails - it does, regularly, especially during broker rolling updates

Decision

Adopt the transactional outbox pattern:

  1. Services write business state + an outbox_events row in the same DB transaction
  2. A separate Outbox Dispatcher worker polls outbox_events, publishes to Kafka, marks rows PUBLISHED
  3. Consumers deduplicate via a processed_events table, turning at-least-once delivery into effectively-once processing
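Step 1 is the heart of the pattern, so a small sketch may help. It assumes an `accounts` table with a non-negative balance constraint and a simplified `outbox_events` layout; both schemas, the `debit` function, and the `AccountDebited` event name are hypothetical, and SQLite stands in for Postgres.

```python
import json
import sqlite3
import uuid


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE accounts (
            id TEXT PRIMARY KEY,
            balance INTEGER NOT NULL CHECK (balance >= 0)
        );
        CREATE TABLE outbox_events (
            id TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'PENDING'
        );
    """)
    db.execute("INSERT INTO accounts VALUES ('acc-1', 100)")
    db.commit()
    return db


def debit(db, account_id, amount):
    """Business state and outbox row commit or roll back together."""
    with db:  # one transaction: commit on success, rollback on exception
        db.execute(
            "INSERT INTO outbox_events (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), account_id, "AccountDebited",
             json.dumps({"amount": amount})),
        )
        # If this violates the CHECK constraint, the outbox row above
        # is rolled back with it.
        db.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, account_id),
        )


db = open_db()
debit(db, "acc-1", 30)
try:
    debit(db, "acc-1", 500)  # overdraft: whole transaction rolls back
except sqlite3.IntegrityError:
    pass
balance = db.execute("SELECT balance FROM accounts WHERE id = 'acc-1'").fetchone()[0]
pending = db.execute("SELECT COUNT(*) FROM outbox_events").fetchone()[0]
```

The successful debit leaves one PENDING outbox row and a balance of 70; the rejected debit leaves neither a balance change nor an event.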

Implementation specifics:

  • Each schema has its own outbox_events table (or a centralized platform.outbox_events)
  • Dispatcher uses SELECT ... FOR UPDATE SKIP LOCKED for safe parallel workers (no leader election)
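  • Kafka partition key = aggregate_id, so each aggregate's events land on one partition in publish order (parallel dispatchers must still publish a given aggregate's rows in insertion order for this to hold)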
  • 100ms polling interval (configurable; can be lowered below 100ms for latency-critical paths)
  • Nightly cleanup job: DELETE rows in PUBLISHED state older than 30 days
  • On publish failure: retry up to 5 times, then mark the row FAILED, raise an alert, and wait for manual intervention
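One poll cycle of the dispatcher might look like the sketch below. Several things here are assumptions rather than the ADR's design: the `seq` and `attempts` columns, the `dispatch_batch` and `fake_publish` names, and the reading of "5 retries" as five failed attempts in total. SQLite has no SKIP LOCKED, so the Postgres claim query is included only as a reference string and is never executed.

```python
import sqlite3

MAX_ATTEMPTS = 5  # assumption: "5 retries" read as 5 failed attempts in total

# Postgres claim query of the kind the ADR describes; reference only.
POSTGRES_CLAIM_SQL = """
SELECT seq, aggregate_id, payload, attempts
FROM outbox_events
WHERE status = 'PENDING'
ORDER BY seq
LIMIT %s
FOR UPDATE SKIP LOCKED
"""


def dispatch_batch(db, publish, batch_size=100):
    """One poll cycle: read pending rows, publish each, record the outcome."""
    with db:
        rows = db.execute(
            "SELECT seq, aggregate_id, payload, attempts FROM outbox_events "
            "WHERE status = 'PENDING' ORDER BY seq LIMIT ?",
            (batch_size,),
        ).fetchall()
        for seq, aggregate_id, payload, attempts in rows:
            try:
                publish(key=aggregate_id, value=payload)
            except Exception:
                attempts += 1
                status = "FAILED" if attempts >= MAX_ATTEMPTS else "PENDING"
                db.execute(
                    "UPDATE outbox_events SET attempts = ?, status = ? WHERE seq = ?",
                    (attempts, status, seq),
                )
            else:
                db.execute(
                    "UPDATE outbox_events SET status = 'PUBLISHED' WHERE seq = ?",
                    (seq,),
                )
    return len(rows)


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox_events ("
    "seq INTEGER PRIMARY KEY AUTOINCREMENT, aggregate_id TEXT NOT NULL, "
    "payload TEXT NOT NULL, status TEXT NOT NULL DEFAULT 'PENDING', "
    "attempts INTEGER NOT NULL DEFAULT 0)"
)
db.executemany(
    "INSERT INTO outbox_events (aggregate_id, payload) VALUES (?, ?)",
    [("acc-1", "ok"), ("acc-2", "poison")],
)
db.commit()

sent = []


def fake_publish(key, value):
    # Hypothetical stand-in for a Kafka producer; rejects one payload forever.
    if value == "poison":
        raise RuntimeError("broker rejected message")
    sent.append((key, value))


for _ in range(MAX_ATTEMPTS):
    dispatch_batch(db, fake_publish)
statuses = dict(db.execute("SELECT payload, status FROM outbox_events"))
```

In Postgres, the row locks taken by FOR UPDATE SKIP LOCKED last until the transaction ends, so the claim and the status update would share one transaction, much as the `with db:` block does here. A real worker would sleep for the poll interval between cycles. The sketch also does not hold back later events of an aggregate whose earlier event is still failing.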

Consequences

Positive

  • No event loss - the outbox row is committed atomically with business state. If the DB transaction rolls back, no event is published
  • No phantom events - Kafka publish only happens after DB commit succeeds
  • Dispatcher restart safe - a row that was sent to Kafka but not yet marked PUBLISHED is simply re-published after restart; consumer dedup absorbs the duplicate
  • Multi-pod safe - SKIP LOCKED allows N dispatcher pods without leader coordination
  • Source of truth is DB - if Kafka data is lost (broker disaster), events still held in the outbox (PUBLISHED rows are kept for 30 days) can be re-published from DB
  • Audit-friendly - within that retention window, outbox_events is a complete log of "what events have happened"
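Several of these guarantees lean on consumer-side dedup, so here is a minimal sketch of it. The single-column `processed_events` layout, the `aml_checks` table, and the `handle_once` helper are hypothetical; SQLite's `INSERT OR IGNORE` plays the role that `INSERT ... ON CONFLICT DO NOTHING` would play in Postgres.

```python
import sqlite3


def handle_once(db, event_id, apply_effect):
    """Apply the side effect only if event_id has not been processed before."""
    with db:  # dedup marker and side effect commit together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: already handled, skip
        apply_effect(db)
    return True


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE aml_checks (event_id TEXT NOT NULL);
""")


def record_check(event_id):
    # Hypothetical side effect: record that an AML check ran for this event.
    return lambda conn: conn.execute(
        "INSERT INTO aml_checks VALUES (?)", (event_id,)
    )


# The same event is delivered twice, as after a dispatcher restart.
results = [handle_once(db, "evt-1", record_check("evt-1")) for _ in range(2)]
checks = db.execute("SELECT COUNT(*) FROM aml_checks").fetchone()[0]
```

The first delivery returns True and records the check; the second returns False and changes nothing. This only works when the side effect lives in the same database as `processed_events`. An external effect such as an outgoing webhook call would need its own idempotency handling.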

Negative

  • Latency: the poll interval adds up to ~100ms before an event reaches Kafka. Acceptable for our SLO (webhook delivery within 30s, AML check within seconds)
  • DB load: every publish becomes 1 INSERT + 1 UPDATE. Acceptable at our target scale
  • Cleanup overhead: nightly DELETE of old PUBLISHED rows
  • Storage growth: the outbox table grows during Kafka outages - size it for 24h of normal event volume with a 3× safety factor
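The sizing rule is plain arithmetic. The sketch below works one example with made-up inputs: 50 events/sec and 1 KiB per row are illustrative assumptions, not measured FinCore figures, and `outbox_capacity_rows` is a hypothetical helper.

```python
def outbox_capacity_rows(events_per_sec, outage_hours=24, safety_factor=3):
    """Rows the outbox must hold: outage window x normal volume x safety factor."""
    return events_per_sec * 3600 * outage_hours * safety_factor


# Illustrative inputs only.
ASSUMED_EVENTS_PER_SEC = 50
ASSUMED_BYTES_PER_ROW = 1024

rows = outbox_capacity_rows(ASSUMED_EVENTS_PER_SEC)
approx_bytes = rows * ASSUMED_BYTES_PER_ROW
```

With these inputs the table must hold 12,960,000 rows, roughly 12 GiB before indexes and table bloat.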

Neutral

  • Some teams use Debezium (CDC reading the WAL directly) instead of an application-level dispatcher. That gives lower latency but adds the operational burden of Kafka Connect. We chose the application dispatcher for simplicity; Debezium is on the Y2 roadmap if scale demands.

Alternatives considered

Direct Kafka publish from @Transactional method

  • Rejected: dual-write problem; one of the most common production bugs in microservices

Kafka transactional API (KIP-98)

  • Rejected: doesn't span Kafka + Postgres
  • Adds significant complexity (transactional IDs, fencing, abort handling)
  • Useful only when ALL writes are to Kafka

XA distributed transactions

  • Rejected: legacy, brittle, slow
  • Coordinator-failure recovery is operationally painful

Event sourcing as primary state

  • Rejected: too radical for v0.1
  • Most teams are not ready for the mental model shift
  • Outbox pattern provides ~80% of benefits with 10% of complexity

Debezium CDC from WAL

  • Considered for future: lower latency, no application code in publish path
  • Rejected for v0.1: heavyweight (Kafka Connect cluster), schema-coupling (consumers see DB rows, not domain events)
  • May adopt for v0.5+ if latency or throughput warrants

Validation

  • Integration test: post 100 transactions, kill Kafka mid-flight, restart, verify all 100 events eventually published
  • Property test: any sequence of writes preserves "every business state change has exactly one corresponding event in outbox_events"
  • Load test: dispatcher sustains 1000 events/sec/pod
  • Disaster drill: restore Postgres from backup, verify all unpublished events get published
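The first integration test in the list can be prototyped in memory before wiring up real infrastructure. This is a sketch under stated assumptions: `FakeBroker` is a hypothetical stand-in for Kafka, SQLite stands in for Postgres, and the `drain` helper leaves out the retry cap so that the example stays focused on the outage path.

```python
import sqlite3


class FakeBroker:
    """Hypothetical in-memory stand-in for Kafka that can be switched off."""

    def __init__(self):
        self.up = True
        self.log = []

    def send(self, key, value):
        if not self.up:
            raise ConnectionError("broker unavailable")
        self.log.append((key, value))


def post_transaction(db, n):
    # Business row and outbox row in one DB transaction.
    with db:
        db.execute("INSERT INTO transactions (id) VALUES (?)", (n,))
        db.execute(
            "INSERT INTO outbox_events (aggregate_id, payload) VALUES (?, ?)",
            (f"tx-{n}", f"created:{n}"),
        )


def drain(db, broker):
    """Publish pending rows in order; stop at the first failure."""
    rows = db.execute(
        "SELECT seq, aggregate_id, payload FROM outbox_events "
        "WHERE status = 'PENDING' ORDER BY seq"
    ).fetchall()
    for seq, key, value in rows:
        try:
            broker.send(key, value)
        except ConnectionError:
            return  # leave this row and the rest PENDING for the next cycle
        with db:
            db.execute(
                "UPDATE outbox_events SET status = 'PUBLISHED' WHERE seq = ?",
                (seq,),
            )


db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE transactions (id INTEGER PRIMARY KEY);
    CREATE TABLE outbox_events (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        aggregate_id TEXT NOT NULL,
        payload TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING'
    );
""")
broker = FakeBroker()

for n in range(100):
    if n == 40:
        broker.up = False  # "kill Kafka mid-flight"
    post_transaction(db, n)
    drain(db, broker)

published_during_outage = len(broker.log)
broker.up = True  # restart the broker
drain(db, broker)

published_values = [value for _, value in broker.log]
still_pending = db.execute(
    "SELECT COUNT(*) FROM outbox_events WHERE status = 'PENDING'"
).fetchone()[0]
```

Only the 40 events posted before the outage reach the broker while it is down. After the restart a single drain publishes the remaining 60 in insertion order, and no rows remain PENDING.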

Related
