world-postgres: stream readers can stall after LISTEN disconnects or missed NOTIFY events #1855

@Pom4H

Bug

@workflow/world-postgres currently relies on PostgreSQL LISTEN/NOTIFY for live stream chunk delivery.

This is fragile: NOTIFY is only a wake-up signal for currently connected listeners, not a durable backlog. If the dedicated LISTEN workflow_event_chunk client disconnects, or if a notification is missed during reconnect, chunks can still be written to the streams table successfully while active readers stop receiving live updates indefinitely.

In other words: the streams table is the source of truth, and LISTEN/NOTIFY should only be used to wake readers up to re-query chunks newer than their last delivered chunk_id.
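
A minimal sketch of that idea, not the package's actual code: the reader keeps a cursor and treats every wake-up as "go re-query", never as the data itself. The `ChunkRow` shape, the `FetchAfter` helper (imagined as wrapping a query like `... WHERE chunk_id > $2 ORDER BY chunk_id`), and the `StreamCursor` name are all hypothetical, and the sketch assumes `chunk_id` values compare in insertion order.

```typescript
// Hypothetical row shape; the real streams table columns may differ.
interface ChunkRow {
  chunkId: string; // assumed to compare in insertion order
  data: string;
  eof: boolean;
}

// Assumed helper: returns rows with chunk_id > afterId (all rows when null).
type FetchAfter = (afterId: string | null) => Promise<ChunkRow[]>;

class StreamCursor {
  private lastChunkId: string | null = null;
  private done = false;
  private readonly fetchAfter: FetchAfter;
  private readonly emit: (row: ChunkRow) => void;

  constructor(fetchAfter: FetchAfter, emit: (row: ChunkRow) => void) {
    this.fetchAfter = fetchAfter;
    this.emit = emit;
  }

  // Safe to call for the initial load, on every NOTIFY, and from a timer.
  // Resolves true once the EOF chunk has been delivered.
  async catchUp(): Promise<boolean> {
    if (this.done) return true;
    const rows = await this.fetchAfter(this.lastChunkId);
    if (this.done) return true; // another call finished while we were waiting
    rows.sort((a, b) => (a.chunkId < b.chunkId ? -1 : a.chunkId > b.chunkId ? 1 : 0));
    for (const row of rows) {
      // Skip anything at or before the cursor so repeated wake-ups are harmless.
      if (this.lastChunkId !== null && row.chunkId <= this.lastChunkId) continue;
      this.lastChunkId = row.chunkId;
      this.emit(row);
      if (row.eof) {
        this.done = true;
        break;
      }
    }
    return this.done;
  }
}
```

With this shape a lost notification costs latency only: the next wake-up, from whatever source, picks up every row written since the cursor.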

Symptoms

In production, after the dedicated LISTEN client is dropped:

  • writeToStream(...) continues inserting chunk rows successfully.
  • pg_notify(...) continues executing successfully.
  • readFromStream(...) readers may receive the initial query batch, then never receive subsequent chunks.
  • Restarting the pod restores delivery until the next LISTEN disconnect.

Live delivery silently halts in the affected process, while the persisted stream rows remain intact.

Proposed fix

This needs two layers:

  1. Make listenChannel resilient:
    • attach error and end handlers to the dedicated pg.Client
    • reconnect with bounded exponential backoff
    • re-run LISTEN workflow_event_chunk after reconnect
    • stop reconnect attempts on close()
  2. Make readFromStream resilient to missed notifications:
    • keep a per-reader lastChunkId
    • load initial chunks from the streams table
    • on notification, query streams WHERE chunk_id > lastChunkId
    • periodically run the same query as a polling fallback
    • dedupe/order by chunk_id
    • stop polling on EOF, cancel, or controller close
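
The first layer could look roughly like the sketch below. It is an illustration under assumptions, not the current listenChannel implementation: `ListenClient` is a hand-written subset of the pg.Client surface (`connect`, `query`, `on`, `end`), while `listenWithReconnect`, `backoffDelayMs`, the injectable `sleep`, and the delay constants are hypothetical. Calling `onNotify(undefined)` after each successful LISTEN is an extra assumption, so that readers re-query right after a reconnect.

```typescript
// Hand-written subset of the pg.Client surface this sketch relies on.
interface ListenClient {
  connect(): Promise<unknown>;
  query(sql: string): Promise<unknown>;
  on(event: string, handler: (arg?: any) => void): unknown;
  end(): Promise<unknown>;
}

// Bounded exponential backoff: 250ms, 500ms, 1s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 250, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

function listenWithReconnect(
  createClient: () => ListenClient,
  channel: string, // must be a trusted identifier; it is interpolated into SQL
  onNotify: (payload: string | undefined) => void,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms)),
): { close(): Promise<void> } {
  let closed = false;
  let current: ListenClient | null = null;

  const run = async () => {
    let attempt = 0;
    while (!closed) {
      const client = createClient();
      current = client;
      try {
        // Resolves when this connection reports an error or ends.
        const lost = new Promise<void>((resolve) => {
          client.on("error", () => resolve());
          client.on("end", () => resolve());
        });
        client.on("notification", (msg) => onNotify(msg.payload));
        await client.connect();
        await client.query(`LISTEN ${channel}`);
        attempt = 0;
        onNotify(undefined); // wake readers: notifications may have been missed
        await lost;
      } catch {
        // Connect or LISTEN failed; fall through to the backoff below.
      }
      await client.end().catch(() => {});
      if (closed) break;
      await sleep(backoffDelayMs(attempt++));
    }
  };
  void run();

  return {
    async close() {
      closed = true; // no further reconnect attempts after this point
      await current?.end().catch(() => {});
    },
  };
}
```

With node-postgres, `createClient` would presumably be something like `() => new pg.Client({ connectionString })`. A pending backoff sleep is not cancelled by `close()` in this sketch; the loop simply exits once the sleep finishes.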

This makes world-postgres stream delivery durable even when the LISTEN connection is interrupted.
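
To illustrate how notifications and the polling fallback could share one code path, here is a rough sketch. Everything in it is hypothetical: `startWakeLoop`, the `catchUp` callback (assumed to re-query the table and resolve `true` at EOF), the `subscribe` hook (assumed to register a wake-up callback and return an unsubscribe function), and the `pollMs` value are illustrative names, not existing APIs.

```typescript
interface WakeLoopOptions {
  // Assumed catch-up routine: re-queries the table, resolves true at EOF.
  catchUp: () => Promise<boolean>;
  // Assumed hook: registers a wake-up callback, returns an unsubscribe function.
  subscribe: (wake: () => void) => () => void;
  // Polling interval for the fallback timer, in milliseconds.
  pollMs: number;
}

function startWakeLoop(opts: WakeLoopOptions): { stop(): void } {
  let stopped = false;
  let running = false;
  let pending = false;

  // Called on EOF, and by the owner on cancel or controller close.
  const stop = () => {
    if (stopped) return;
    stopped = true;
    clearInterval(timer);
    unsubscribe();
  };

  const wake = () => {
    if (stopped) return;
    if (running) {
      // Coalesce bursts: remember that one more pass is needed.
      pending = true;
      return;
    }
    running = true;
    void (async () => {
      try {
        do {
          pending = false;
          if (await opts.catchUp()) stop();
        } while (pending && !stopped);
      } catch {
        // A failed query is retried by the next notification or poll tick.
      } finally {
        running = false;
      }
    })();
  };

  const unsubscribe = opts.subscribe(wake);
  const timer = setInterval(wake, opts.pollMs);
  wake(); // initial load from the table
  return { stop };
}
```

Because the timer and the notification handler call the same `wake`, a dropped LISTEN connection degrades delivery to at most `pollMs` of extra latency instead of stalling the reader.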

Relation to other work

This is compatible with #1847, but it is a lower-level world-postgres reliability issue. Core-level stream reconnect cannot recover notifications that PostgreSQL never delivered to a disconnected LISTEN client. The Postgres world still needs to treat the table as the durable source of truth.
