ADR 0003 Outbox over Direct Kafka

ADR-0003: Outbox pattern over direct Kafka publish

Status: Accepted Date: 2026-04-25 Decider: Maintainer

Context

FinCore services produce events for downstream consumers (AML check, webhook delivery, analytics). The naive approach: publish directly to Kafka after a DB transaction commits. This has the dual-write problem: if the Kafka publish fails after the DB commits, the event is silently lost. If the DB rollback happens after Kafka publishes, downstream sees phantom events.

Common "solutions" that don't actually work:

Publishing inside @Transactional - Kafka isn't part of the JDBC transaction
Kafka transactions (KIP-98) - only covers Kafka, doesn't span Kafka + Postgres
2PC / XA - slow, brittle, universally regretted
Hoping Kafka publish never fails - it does, regularly, especially during broker rolling updates

Decision

Adopt the transactional outbox pattern:

Services write business state + an outbox_events row in the same DB transaction
A separate Outbox Dispatcher worker polls outbox_events, publishes to Kafka, marks rows PUBLISHED
Consumers dedup via processed_events table for at-least-once → effectively-once

Implementation specifics:

Each schema has its own outbox_events table (or a centralized platform.outbox_events)
Dispatcher uses SELECT ... FOR UPDATE SKIP LOCKED for safe parallel workers (no leader election)
Kafka partition key = aggregate_id to preserve per-aggregate ordering
100ms polling interval (configurable; sub-100ms when latency-critical)
Cleanup job nightly: DELETE PUBLISHED rows older than 30 days
Failure → 5 retries → mark FAILED, alert, manual intervention

Consequences

Positive

No event loss - outbox row is committed atomically with business state. If DB rolls back, no event published
No phantom events - Kafka publish only happens after DB commit succeeds
Dispatcher restart safe - uncommitted Kafka publish on restart simply re-publishes; consumer dedup handles
Multi-pod safe - SKIP LOCKED allows N dispatcher pods without leader coordination
Source of truth is DB - if Kafka data is lost (broker disaster), all events can be re-published from DB
Audit-friendly - outbox_events is a complete log of "what events have happened"

Negative

Latency: ~100ms to publish a Kafka event (poll interval). Acceptable for our SLO (webhook delivery within 30s, AML check within seconds)
DB load: every publish becomes 1 INSERT + 1 UPDATE. Acceptable at our target scale
Cleanup overhead: nightly DELETE of old PUBLISHED rows
Storage growth: outbox table grows during Kafka outages - must size for 24h × normal volume × 3 capacity

Neutral

Some teams use Debezium (CDC reading WAL directly) instead of application-level dispatcher. Lower latency, but adds operational burden of Kafka Connect. We chose application dispatcher for simplicity; Debezium is on Y2 roadmap if scale demands.

Alternatives considered

Direct Kafka publish from `@Transactional` method

Rejected: dual-write problem; one of the most common production bugs in microservices

Kafka transactional API (KIP-98)

Rejected: doesn't span Kafka + Postgres
Adds significant complexity (transactional IDs, fencing, abort handling)
Useful only when ALL writes are to Kafka

XA distributed transactions

Rejected: legacy, brittle, slow
Coordinator-failure recovery is operationally painful

Event sourcing as primary state

Rejected: too radical for v0.1
Most teams not ready for the mental model shift
Outbox pattern provides ~80% of benefits with 10% of complexity

Debezium CDC from WAL

Considered for future: lower latency, no application code in publish path
Rejected for v0.1: heavyweight (Kafka Connect cluster), schema-coupling (consumers see DB rows, not domain events)
May adopt for v0.5+ if latency or throughput warrants

Validation

Integration test: post 100 transactions, kill Kafka mid-flight, restart, verify all 100 events eventually published
Property test: any sequence of writes preserves "every business state change has exactly one corresponding event"
Load test: dispatcher sustains 1000 events/sec/pod
Disaster drill: restore Postgres from backup, verify all unpublished events get published

ADR 0003 Outbox over Direct Kafka

ADR-0003: Outbox pattern over direct Kafka publish

Context

Decision

Consequences

Positive

Negative

Neutral

Alternatives considered

Direct Kafka publish from @Transactional method

Kafka transactional API (KIP-98)

XA distributed transactions

Event sourcing as primary state

Debezium CDC from WAL

Validation

Related

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Product

Architecture

Decisions (ADR)

Engineering

Delivery

Risk and Ops

Clone this wiki locally

Direct Kafka publish from `@Transactional` method