Bug
`@workflow/world-postgres` currently relies on PostgreSQL `LISTEN`/`NOTIFY` for live stream chunk delivery.
This is fragile: `NOTIFY` is only a wake-up signal for currently connected listeners, not a durable backlog. If the dedicated `LISTEN workflow_event_chunk` client disconnects, or if a notification is missed during reconnect, chunks can still be written to the `streams` table successfully while active readers stop receiving live updates indefinitely.
In other words: the `streams` table is the source of truth, and `LISTEN`/`NOTIFY` should only be used to wake readers up to re-query chunks newer than their last delivered `chunk_id`.
Symptoms
In production, after the dedicated `LISTEN` client is dropped:
- `writeToStream(...)` continues inserting chunk rows successfully.
- `pg_notify(...)` continues executing successfully.
- `readFromStream(...)` readers may receive the initial query batch, then never receive subsequent chunks.
- Restarting the pod restores delivery until the next `LISTEN` disconnect.
This silently halts live in-process delivery in the affected process, while persisted stream rows remain intact.
Proposed fix
This needs two layers:
- Make `listenChannel` resilient:
  - attach `error` and `end` handlers to the dedicated `pg.Client`
  - reconnect with bounded exponential backoff
  - re-run `LISTEN workflow_event_chunk` after reconnect
  - stop reconnect attempts on `close()`
- Make `readFromStream` resilient to missed notifications:
  - keep a per-reader `lastChunkId`
  - load initial chunks from the `streams` table
  - on notification, query `streams WHERE chunk_id > lastChunkId`
  - periodically run the same query as a polling fallback
  - dedupe/order by `chunk_id`
  - stop polling on EOF, cancel, or controller close
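The first layer (a resilient `listenChannel`) could look roughly like the sketch below. The `ListenClient` interface, the `makeClient` factory, and the backoff constants are assumptions standing in for the real `pg.Client` wiring; the point is the placement of the `error`/`end` handlers, the bounded backoff, and re-issuing `LISTEN` on every reconnect:

```typescript
// Minimal shape of the parts of pg.Client this sketch relies on (assumption,
// not the real world-postgres types).
interface ListenClient {
  connect(): Promise<void>;
  query(sql: string): Promise<unknown>;
  on(event: "error" | "end", cb: (err?: Error) => void): void;
}

// Bounded exponential backoff: 250ms, 500ms, 1s, ... capped at 30s.
function backoffMs(attempt: number, baseMs = 250, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

function listenChannel(makeClient: () => ListenClient, channel: string) {
  let closed = false;

  async function connectLoop(): Promise<void> {
    for (let attempt = 0; !closed; attempt++) {
      try {
        const client = makeClient();
        // Any connection error or end triggers a fresh reconnect cycle.
        const reconnect = () => {
          if (!closed) void connectLoop();
        };
        client.on("error", reconnect);
        client.on("end", reconnect);
        await client.connect();
        // LISTEN subscriptions do not survive reconnects, so re-issue it
        // on every (re)connect.
        await client.query(`LISTEN ${channel}`);
        return; // connected; the handlers take over from here
      } catch {
        await new Promise((r) => setTimeout(r, backoffMs(attempt)));
      }
    }
  }

  void connectLoop();
  // close() flips the flag so no further reconnect attempts are made.
  return { close: () => { closed = true; } };
}
```

Because notifications may still be lost in the window between a drop and a successful reconnect, this layer alone is not enough; it only shortens the outage.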
This makes `world-postgres` stream delivery durable even when the `LISTEN` connection is interrupted.
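For the second layer, the cursor/ordering step is what makes notification-driven and poll-driven queries interchangeable. A minimal sketch (the row shape and the `advanceReader` helper name are assumptions, not the actual world-postgres types):

```typescript
// Assumed row shape for chunks queried from the streams table.
interface ChunkRow {
  chunk_id: number;
  payload: string;
}

// Given the reader's lastChunkId and a freshly queried batch (which may
// overlap with chunks already delivered, e.g. when a notification races the
// polling fallback), return only the new chunks in chunk_id order plus the
// advanced cursor. Pure, so both code paths can share it.
function advanceReader(
  lastChunkId: number,
  batch: ChunkRow[],
): { deliver: ChunkRow[]; lastChunkId: number } {
  const deliver = batch
    .filter((row) => row.chunk_id > lastChunkId)
    .sort((a, b) => a.chunk_id - b.chunk_id);
  const next = deliver.length > 0 ? deliver[deliver.length - 1].chunk_id : lastChunkId;
  return { deliver, lastChunkId: next };
}
```

Both the notification handler and the polling timer would run the same `chunk_id > lastChunkId` query and feed the result through this function, so a missed `NOTIFY` costs at most one polling interval of latency instead of halting delivery.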
Relation to other work
This is compatible with #1847, but it is a lower-level `world-postgres` reliability issue. Core-level stream reconnect cannot recover notifications that PostgreSQL never delivered to a disconnected `LISTEN` client. The Postgres world still needs to treat the table as the durable source of truth.