ADR 0003 Outbox over Direct Kafka
Tiana_ edited this page May 30, 2026
Status: Accepted
Date: 2026-04-25
Decider: Maintainer
FinCore services produce events for downstream consumers (AML check, webhook delivery, analytics). The naive approach is to publish directly to Kafka after the DB transaction commits. This has the dual-write problem: if the Kafka publish fails after the DB commits, the event is silently lost. Reordering does not help: if the service publishes first and the DB transaction then rolls back, downstream consumers see phantom events.
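The lost-event half of the dual-write problem can be reproduced in a few lines. This is a minimal sketch, not FinCore code: SQLite stands in for Postgres, and `FlakyBroker`, `naive_create_payment`, and the `payments` table are hypothetical names introduced only for illustration.

```python
import sqlite3


class FlakyBroker:
    """Hypothetical stand-in for a Kafka producer that can be made to fail."""

    def __init__(self, fail=False):
        self.fail = fail
        self.published = []

    def publish(self, event):
        if self.fail:
            raise ConnectionError("broker unavailable")
        self.published.append(event)


def naive_create_payment(db, broker, payment_id):
    # Step 1: commit the business state.
    with db:  # the sqlite3 connection context manager commits on success
        db.execute("INSERT INTO payments (id) VALUES (?)", (payment_id,))
    # Step 2: publish after the commit. A failure here cannot undo step 1.
    broker.publish({"type": "PaymentCreated", "id": payment_id})


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE payments (id TEXT PRIMARY KEY)")
broker = FlakyBroker(fail=True)

try:
    naive_create_payment(db, broker, "p-1")
except ConnectionError:
    pass  # the caller can log this, but the payment row is already committed

committed_rows = db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
lost_events = committed_rows - len(broker.published)
```

After the run, the payment exists in the database but no event was ever published for it, which is the silent loss described above.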
Common "solutions" that don't actually work:
- Publishing inside @Transactional - Kafka isn't part of the JDBC transaction
- Kafka transactions (KIP-98) - only covers Kafka, doesn't span Kafka + Postgres
- 2PC / XA - slow, brittle, universally regretted
- Hoping Kafka publish never fails - it does, regularly, especially during broker rolling updates
Adopt the transactional outbox pattern:
- Services write business state + an outbox_events row in the same DB transaction
- A separate Outbox Dispatcher worker polls outbox_events, publishes to Kafka, marks rows PUBLISHED
- Consumers dedup via processed_events table for at-least-once → effectively-once
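The first step, writing business state and the outbox row atomically, might look like the sketch below. It assumes SQLite in place of Postgres. The column layout of `outbox_events`, the `payments` table, and the `create_payment` helper are illustrative guesses, not the project's actual schema.

```python
import json
import sqlite3
import uuid


def open_db():
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE payments (id TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL);
        CREATE TABLE outbox_events (
            id           TEXT PRIMARY KEY,
            aggregate_id TEXT NOT NULL,
            event_type   TEXT NOT NULL,
            payload      TEXT NOT NULL,
            status       TEXT NOT NULL DEFAULT 'PENDING'
        );
    """)
    return db


def create_payment(db, payment_id, amount_cents, fail_before_commit=False):
    # Both inserts share one transaction: they commit together or not at all.
    with db:
        db.execute(
            "INSERT INTO payments (id, amount_cents) VALUES (?, ?)",
            (payment_id, amount_cents),
        )
        db.execute(
            "INSERT INTO outbox_events (id, aggregate_id, event_type, payload) "
            "VALUES (?, ?, ?, ?)",
            (
                str(uuid.uuid4()),
                payment_id,
                "PaymentCreated",
                json.dumps({"id": payment_id, "amount_cents": amount_cents}),
            ),
        )
        if fail_before_commit:
            raise RuntimeError("simulated failure before commit")


db = open_db()
create_payment(db, "p-1", 1250)
try:
    create_payment(db, "p-2", 990, fail_before_commit=True)
except RuntimeError:
    pass  # the rollback removed both the payment and its outbox row

payments = [row[0] for row in db.execute("SELECT id FROM payments")]
outbox = [row[0] for row in db.execute("SELECT aggregate_id FROM outbox_events")]
```

Only `p-1` survives in both tables. The rolled-back `p-2` leaves neither a payment nor an event, so the dispatcher never sees a row for state that does not exist.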
Implementation specifics:
- Each schema has its own outbox_events table (or a centralized platform.outbox_events)
- Dispatcher uses SELECT ... FOR UPDATE SKIP LOCKED for safe parallel workers (no leader election)
- Kafka partition key = aggregate_id to preserve per-aggregate ordering
- 100ms polling interval (configurable; sub-100ms when latency-critical)
- Cleanup job nightly: DELETE PUBLISHED rows older than 30 days
- Failure → 5 retries → mark FAILED, alert, manual intervention
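A single dispatcher pass under these rules could be sketched as follows. Several things here are assumptions: SQLite replaces Postgres, so the `SKIP LOCKED` claim query is only shown as a string and never executed; "5 retries" is read as five failed attempts in total; and `dispatch_once`, `fake_publish`, and the `attempts` column are hypothetical names.

```python
import sqlite3

MAX_ATTEMPTS = 5  # assumption: "5 retries" means five failed attempts in total

# Postgres claim query of the kind the ADR describes, for reference only.
# It is not executed here because SQLite has no SKIP LOCKED.
CLAIM_SQL_POSTGRES = """
SELECT id, aggregate_id, payload, attempts
FROM outbox_events
WHERE status = 'PENDING'
ORDER BY id
LIMIT 100
FOR UPDATE SKIP LOCKED
"""


def dispatch_once(db, publish, batch_size=100):
    """Publish one batch of PENDING rows and return how many succeeded."""
    published = 0
    with db:  # status updates for the batch commit together
        rows = db.execute(
            "SELECT id, aggregate_id, payload, attempts FROM outbox_events "
            "WHERE status = 'PENDING' ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        for event_id, aggregate_id, payload, attempts in rows:
            try:
                # The partition key is the aggregate id, as the ADR specifies.
                publish(key=aggregate_id, value=payload)
            except ConnectionError:
                attempts += 1
                status = "FAILED" if attempts >= MAX_ATTEMPTS else "PENDING"
                db.execute(
                    "UPDATE outbox_events SET attempts = ?, status = ? WHERE id = ?",
                    (attempts, status, event_id),
                )
            else:
                db.execute(
                    "UPDATE outbox_events SET status = 'PUBLISHED' WHERE id = ?",
                    (event_id,),
                )
                published += 1
    return published


sent = []


def fake_publish(key, value):
    # Hypothetical publisher: one aggregate always fails, to exercise retries.
    if key == "acct-bad":
        raise ConnectionError("simulated broker failure")
    sent.append((key, value))


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox_events ("
    "id INTEGER PRIMARY KEY, aggregate_id TEXT NOT NULL, payload TEXT NOT NULL, "
    "status TEXT NOT NULL DEFAULT 'PENDING', attempts INTEGER NOT NULL DEFAULT 0)"
)
with db:
    db.executemany(
        "INSERT INTO outbox_events (aggregate_id, payload) VALUES (?, ?)",
        [("acct-1", "tx-1"), ("acct-bad", "tx-2")],
    )

for _ in range(MAX_ATTEMPTS):
    dispatch_once(db, fake_publish)

statuses = dict(db.execute("SELECT aggregate_id, status FROM outbox_events"))
```

A real worker would call something like `dispatch_once` in a loop, sleeping for the configured poll interval (100ms by default) between passes, and would raise an alert whenever a row moves to FAILED.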
- No event loss - outbox row is committed atomically with business state. If DB rolls back, no event published
- No phantom events - Kafka publish only happens after DB commit succeeds
- Dispatcher restart safe - if the dispatcher stops after publishing to Kafka but before marking the row PUBLISHED, it re-publishes the event on restart; consumer dedup absorbs the duplicate
- Multi-pod safe - SKIP LOCKED allows N dispatcher pods without leader coordination
- Source of truth is DB - if Kafka data is lost (broker disaster), events still held in the outbox (PUBLISHED rows are kept for 30 days) can be re-published from DB
- Audit-friendly - outbox_events is a complete log of "what events have happened"
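Restart safety depends on consumers tolerating duplicates. One way to sketch the `processed_events` dedup is to record the event id and apply the side effect in the same consumer-side transaction. This assumes SQLite in place of Postgres. `handle_once`, `credit_account`, and the `ledger` table are hypothetical. In Postgres the equivalent insert would typically use `ON CONFLICT DO NOTHING`.

```python
import sqlite3


def handle_once(db, event_id, apply_effect):
    """Apply apply_effect at most once per event_id; return True if applied."""
    with db:  # dedup marker and side effect commit (or roll back) together
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: already processed, skip
        apply_effect(db)
        return True


def credit_account(db):
    # Hypothetical side effect that must not be applied twice.
    db.execute("INSERT INTO ledger (account, amount_cents) VALUES ('acct-1', 100)")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE ledger (account TEXT NOT NULL, amount_cents INTEGER NOT NULL)")

first = handle_once(db, "evt-1", credit_account)
second = handle_once(db, "evt-1", credit_account)  # simulated re-publish

balance = db.execute("SELECT SUM(amount_cents) FROM ledger").fetchone()[0]
```

The second delivery is ignored, so at-least-once delivery from the dispatcher becomes effectively-once from the consumer's point of view, provided the side effect lives in the same database as `processed_events`.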
- Latency: up to ~100ms added before an event reaches Kafka (one poll interval). Acceptable for our SLO (webhook delivery within 30s, AML check within seconds)
- DB load: every publish becomes 1 INSERT + 1 UPDATE. Acceptable at our target scale
- Cleanup overhead: nightly DELETE of old PUBLISHED rows
- Storage growth: outbox table grows during Kafka outages - must size for 24h × normal volume × 3 capacity
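As a worked example of the sizing rule, the arithmetic below plugs in placeholder numbers. The 50 events/sec rate and the 1 KiB average row size are assumptions chosen for illustration, not measured FinCore figures, and "normal volume" is read here as a steady per-second rate accumulated over 24 hours.

```python
# Assumed inputs (hypothetical, replace with measured values).
EVENTS_PER_SECOND = 50   # "normal volume", as a steady rate
AVG_ROW_BYTES = 1024     # average outbox row size including payload
SAFETY_FACTOR = 3        # the "x 3" in the sizing rule
OUTAGE_HOURS = 24        # the "24h" in the sizing rule

rows_per_outage = EVENTS_PER_SECOND * OUTAGE_HOURS * 3600
rows_to_size_for = rows_per_outage * SAFETY_FACTOR
bytes_to_size_for = rows_to_size_for * AVG_ROW_BYTES
gib_to_size_for = bytes_to_size_for / 2**30

print(f"rows: {rows_to_size_for:,}")
print(f"storage: {gib_to_size_for:.2f} GiB (excluding indexes and table bloat)")
```

Under these assumptions the table needs room for roughly 13 million rows, or about 12 GiB of row data. Indexes and dead tuples awaiting vacuum would add to that.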
- Some teams use Debezium (CDC reading the WAL directly) instead of an application-level dispatcher. This gives lower latency but adds the operational burden of Kafka Connect. We chose the application dispatcher for simplicity; Debezium is on the Y2 roadmap if scale demands.
- Rejected: dual-write problem; one of the most common production bugs in microservices
- Rejected: doesn't span Kafka + Postgres
- Adds significant complexity (transactional IDs, fencing, abort handling)
- Useful only when ALL writes are to Kafka
- Rejected: legacy, brittle, slow
- Coordinator-failure recovery is operationally painful
- Rejected: too radical for v0.1
- Most teams not ready for the mental model shift
- Outbox pattern provides ~80% of benefits with 10% of complexity
- Considered for future: lower latency, no application code in publish path
- Rejected for v0.1: heavyweight (Kafka Connect cluster), schema-coupling (consumers see DB rows, not domain events)
- May adopt for v0.5+ if latency or throughput warrants
- Integration test: post 100 transactions, kill Kafka mid-flight, restart, verify all 100 events eventually published
- Property test: any sequence of writes preserves "every business state change has exactly one corresponding event"
- Load test: dispatcher sustains 1000 events/sec/pod
- Disaster drill: restore Postgres from backup, verify all unpublished events get published
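The integration test in the first bullet can be prototyped without a real broker. The sketch below makes several assumptions: SQLite stands in for Postgres, `FakeBroker` and `drain` are hypothetical stand-ins for Kafka and the dispatcher, and the retry limit is ignored so that the outage alone decides what gets published.

```python
import sqlite3


class FakeBroker:
    """Hypothetical broker that goes down after a fixed number of publishes."""

    def __init__(self, outage_after):
        self.outage_after = outage_after
        self.log = []

    def publish(self, key, value):
        if self.outage_after is not None and len(self.log) >= self.outage_after:
            raise ConnectionError("broker down")
        self.log.append((key, value))

    def restart(self):
        self.outage_after = None  # the broker stays up from now on


def drain(db, broker):
    """Publish PENDING rows in id order; stop at the first broker error."""
    rows = db.execute(
        "SELECT id, aggregate_id, payload FROM outbox_events "
        "WHERE status = 'PENDING' ORDER BY id"
    ).fetchall()
    for event_id, key, payload in rows:
        try:
            broker.publish(key, payload)
        except ConnectionError:
            return
        with db:
            db.execute(
                "UPDATE outbox_events SET status = 'PUBLISHED' WHERE id = ?",
                (event_id,),
            )


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE outbox_events ("
    "id INTEGER PRIMARY KEY, aggregate_id TEXT NOT NULL, payload TEXT NOT NULL, "
    "status TEXT NOT NULL DEFAULT 'PENDING')"
)
with db:
    db.executemany(
        "INSERT INTO outbox_events (aggregate_id, payload) VALUES (?, ?)",
        [(f"acct-{i % 10}", f"tx-{i}") for i in range(100)],
    )

broker = FakeBroker(outage_after=40)
drain(db, broker)                      # the broker dies mid-flight
published_before_restart = len(broker.log)

broker.restart()
drain(db, broker)                      # the dispatcher resumes after the outage

pending = db.execute(
    "SELECT COUNT(*) FROM outbox_events WHERE status = 'PENDING'"
).fetchone()[0]
delivered = {value for _, value in broker.log}
```

The check that matters is that `delivered` contains all 100 payloads and no row is left PENDING. A real test would additionally kill the dispatcher between publish and status update to confirm that duplicates are absorbed by consumer dedup.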
- Architecture-Event-Flow - full dispatcher/consumer mechanics
- Architecture-Resilience#11-outbox-dispatcher - operational details
- ADR-0006 - broker choice (works with any Kafka-compatible broker)