feat(core): Implement memory based backpressure mechanism #605
iambriccardo wants to merge 18 commits into main
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough

This PR implements memory-aware processing by adding a MemoryMonitor that samples system memory and broadcasts a hysteresis-based blocked/unblocked signal. Two stream wrappers, BackpressureStream and BatchBackpressureStream, observe the memory signal and pause/resume or flush streams and batches accordingly. MemoryMonitor is created at pipeline startup and threaded through ApplyWorker/ApplyLoop/TableSyncWorker and table_copy call paths so table-copy and streaming flows respond to memory pressure. Several configuration fields and plumbing (config, workers, replication client updates) were added to propagate the monitor.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Monitor as MemoryMonitor (bg task)
    participant System as sysinfo::System
    participant Watch as watch::Sender<bool>
    participant Subscriber as MemoryMonitorSubscription
    loop Every MEMORY_REFRESH_INTERVAL
        Monitor->>System: sample memory stats
        System-->>Monitor: MemorySnapshot
        Monitor->>Monitor: compute_next_blocked(used_percent)
        Monitor->>Watch: send blocked state (if changed)
        Watch-->>Subscriber: broadcast update
    end
    Subscriber->>Subscriber: poll_update()/current_blocked()
```
```mermaid
sequenceDiagram
    participant Consumer as Consumer
    participant Stream as BackpressureStream
    participant Inner as EventsStream
    participant Memory as MemoryMonitorSubscription
    Consumer->>Stream: poll_next()
    Stream->>Memory: poll_update(cx)
    alt memory blocked
        Memory-->>Stream: Some(true)
        Stream->>Consumer: Pending
    else memory not blocked
        Memory-->>Stream: None/false
        Stream->>Inner: poll_next()
        Inner-->>Stream: Item / Pending / Done
        Stream-->>Consumer: Item / Pending / Done
    end
```
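The hysteresis in compute_next_blocked is the key detail: the monitor blocks above one threshold and only unblocks below a lower one, so the signal does not flap around a single cutoff. A minimal sketch of that state function, assuming the two configured percentages from this PR (the function signature itself is illustrative, not the PR's code):

```rust
/// Illustrative hysteresis transition. `activate` and `resume` mirror the
/// PR's `memory_backpressure_activate_percentage` /
/// `memory_backpressure_resume_percentage` settings; the signature is
/// hypothetical.
fn compute_next_blocked(
    currently_blocked: bool,
    used_percent: f64,
    activate: f64,
    resume: f64,
) -> bool {
    if currently_blocked {
        // Stay blocked until usage falls below the (lower) resume threshold.
        used_percent >= resume
    } else {
        // Only block once usage exceeds the activation threshold.
        used_percent >= activate
    }
}
```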
Pull Request Test Coverage Report for Build 22140054258

💛 - Coveralls
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@etl/src/concurrency/stream.rs`:
- Around line 47-70: The task stalls because when memory_monitor.poll_update(cx)
yields Ready(...) we set this.paused_for_memory but do not register the waker
for future updates; before returning Poll::Pending you must continue polling
memory_monitor.poll_update(cx) (or loop until it returns Poll::Pending) so the
watch channel registers the current waker. Modify the poll logic around
memory_monitor.poll_update(cx) in the stream's poll method (the match that sets
*this.paused_for_memory and calls this.memory_monitor.current_blocked()) to
consume Ready variants and only stop when poll_update returns Poll::Pending,
updating *this.paused_for_memory on each Ready, and then return Poll::Pending
(so the waker is registered for the next change).
- Around line 190-202: In BatchBackpressureStream's poll implementation, when
*this.paused_for_memory is true and this.items is empty you currently return
Poll::Pending without registering the waker; change the branch so you capture
and store the current task waker (e.g. this.waker = Some(cx.waker().clone()) or
equivalent) before returning Poll::Pending so the stream can be woken when
memory state changes, keeping the existing behavior of flushing when items exist
(the symbols to modify are this.paused_for_memory, this.items, this.reset_timer
and the Poll::Pending return).
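A minimal sketch of the waker-storage fix this prompt describes (the field names follow the symbols named in the review comment; the surrounding struct layout is an assumption):

```rust
// Sketch of the suggested branch in BatchBackpressureStream's poll method.
if *this.paused_for_memory && this.items.is_empty() {
    // Register the current task's waker so a memory-state change can wake
    // this task; without it, Poll::Pending is never followed by a wakeup.
    *this.waker = Some(cx.waker().clone());
    return Poll::Pending;
}
```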
etl/src/concurrency/stream.rs
Outdated
```rust
match this.memory_monitor.poll_update(cx) {
    Poll::Ready(Some(blocked)) => {
        *this.paused_for_memory = blocked;
    }
    Poll::Ready(None) => {
        *this.paused_for_memory = false;
    }
    Poll::Pending => {
        let currently_blocked = this.memory_monitor.current_blocked();
        if *this.paused_for_memory != currently_blocked {
            *this.paused_for_memory = currently_blocked;
        }
    }
}

if !was_paused && *this.paused_for_memory {
    info!("backpressure active, stream paused");
} else if was_paused && !*this.paused_for_memory {
    info!("backpressure released, stream resumed");
}

if *this.paused_for_memory {
    return Poll::Pending;
}
```
Critical: Missing waker registration causes indefinite task stall when backpressure activates.
When poll_update(cx) returns Ready(Some(true)), the waker is not registered with the watch channel because a Ready result was obtained. Returning Poll::Pending on line 69 without a registered waker means the task will never be woken when memory pressure is released.
After receiving a Ready from the watch stream, you must poll again to register for the next update before returning Pending.
Proposed fix:

```diff
 if *this.paused_for_memory {
+    // Ensure waker is registered for the next state change.
+    let _ = this.memory_monitor.poll_update(cx);
     return Poll::Pending;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
match this.memory_monitor.poll_update(cx) {
    Poll::Ready(Some(blocked)) => {
        *this.paused_for_memory = blocked;
    }
    Poll::Ready(None) => {
        *this.paused_for_memory = false;
    }
    Poll::Pending => {
        let currently_blocked = this.memory_monitor.current_blocked();
        if *this.paused_for_memory != currently_blocked {
            *this.paused_for_memory = currently_blocked;
        }
    }
}

if !was_paused && *this.paused_for_memory {
    info!("backpressure active, stream paused");
} else if was_paused && !*this.paused_for_memory {
    info!("backpressure released, stream resumed");
}

if *this.paused_for_memory {
    // Ensure waker is registered for the next state change.
    let _ = this.memory_monitor.poll_update(cx);
    return Poll::Pending;
}
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@etl/src/concurrency/stream.rs` around lines 47 - 70, The task stalls because
when memory_monitor.poll_update(cx) yields Ready(...) we set
this.paused_for_memory but do not register the waker for future updates; before
returning Poll::Pending you must continue polling memory_monitor.poll_update(cx)
(or loop until it returns Poll::Pending) so the watch channel registers the
current waker. Modify the poll logic around memory_monitor.poll_update(cx) in
the stream's poll method (the match that sets *this.paused_for_memory and calls
this.memory_monitor.current_blocked()) to consume Ready variants and only stop
when poll_update returns Poll::Pending, updating *this.paused_for_memory on each
Ready, and then return Poll::Pending (so the waker is registered for the next
change).
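A minimal sketch of the drain-until-Pending pattern this prompt calls for (symbol names follow the review comment; the surrounding poll method is assumed):

```rust
// Consume every Ready update so the watch channel ends up holding this
// task's waker; only a Poll::Pending result guarantees registration.
loop {
    match this.memory_monitor.poll_update(cx) {
        Poll::Ready(Some(blocked)) => *this.paused_for_memory = blocked,
        // The monitor (sender side) is gone; treat as unblocked and stop.
        Poll::Ready(None) => {
            *this.paused_for_memory = false;
            break;
        }
        // The waker is now registered for the next state change.
        Poll::Pending => break,
    }
}
if *this.paused_for_memory {
    return Poll::Pending;
}
```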
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@etl/src/concurrency/stream.rs`:
- Around line 95-121: The code sets paused_for_memory on Ready from
memory_subscription.poll_update but doesn't register the current waker for
future updates, so returning Poll::Pending can sleep forever; fix by repeatedly
calling memory_subscription.poll_update(cx) in a loop (or otherwise re-invoking
it) until it returns Poll::Pending so the waker is registered for the next
change, updating *this.paused_for_memory on each Ready(Some/None) result; apply
this change to the same polling logic in both BackpressureStream and
BatchBackpressureStream (use the existing memory_subscription.poll_update,
paused_for_memory and was_paused symbols to locate and update the code).
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@etl/src/lib.rs`:
- Around line 98-99: Remove the inline code example from the Rust doc comment
that contains the settings memory_backpressure_activate_percentage and
memory_backpressure_resume_percentage; locate the doc comment block in the
module/lib doc (the block that lists those two example lines) and delete the
example lines (or move them to external docs) so the doc comment no longer
contains runnable code while preserving any plain-text descriptions.
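For reference, the two settings named above define the hysteresis band. A hypothetical sketch of how they might be grouped (only the two field names come from the PR; the struct name and example values are illustrative):

```rust
// Hypothetical config fragment; only the field names below appear in the PR.
pub struct MemoryBackpressureConfig {
    /// Usage percentage at which backpressure activates (streams pause), e.g. 85.0.
    pub memory_backpressure_activate_percentage: f64,
    /// Lower usage percentage at which backpressure releases (streams resume), e.g. 75.0.
    pub memory_backpressure_resume_percentage: f64,
}
```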
---
Duplicate comments:
In `@etl/src/concurrency/stream.rs`:
- Around line 42-72: The stream's poll_next currently can return Poll::Pending
while pausing without ensuring the watch has registered the current task waker;
fix poll_next (and the analogous wrapper at the other location) so that before
returning Poll::Pending when *paused_for_memory is true you repeatedly call
this.memory_subscription.poll_update(cx) until it returns Poll::Pending (or
Ready(None)/unblocked) to allow the watch to register the waker, updating
*paused_for_memory from poll_update results (use
memory_subscription.current_blocked() as fallback) and only then return
Poll::Pending if still blocked.
```rust
#[derive(Clone, Debug, Serialize, Deserialize)]
#[cfg_attr(feature = "utoipa", derive(ToSchema))]
#[serde(rename_all = "snake_case")]
pub struct BatchConfig {
```
Just moved it to align with how/where backpressure is written.
This PR introduces a new memory backpressure mechanism that monitors memory usage using the `sysinfo` crate, taking cgroup limits into account. Memory measurements are taken every 100 ms. When memory usage exceeds defined thresholds, a watch channel emits the current backpressure state. This state is consumed by the two main data ingestion streams: table copy and table streaming.

As part of this change, the existing stream abstractions have been improved as follows:

- `BackpressureStream` has been added. It listens for memory backpressure signals and pauses consumption of elements until memory pressure is relieved.
- A `BatchBackpressureStream` variant has been introduced, combining the previous timed batch behavior with support for memory-based backpressure alerts.

One important consequence of this design is that, while the system is paused due to high memory usage, it may delay detection of connection closure messages or errors. To avoid forcing streams to be aware of connection status, or to poll unnecessarily during backpressure periods just to detect termination signals, a cleaner solution was chosen: the connection task now broadcasts its status changes (via a watch channel or similar). In the `select!` branches of both the table copy and table streaming tasks, we immediately terminate processing upon receiving a connection error or closure signal, as the sketch below illustrates. This approach is preferred over the previous reliance on errors propagating through the `tokio-postgres` stream/table-copy layers, which often led to semantically confusing error handling.

For more context on the underlying behavior, see the `tokio-postgres` internals, particularly `connection.rs`, `copy_out.rs`, and `copy_both.rs`.

Another improvement: the batch stream no longer waits for shutdown signals internally (which was unclean). Instead, shutdown handling is now consistently managed via `select!` in the table copy logic as well.
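A minimal sketch of that `select!`-based termination, assuming the connection task exposes a `tokio::sync::watch` receiver (all names and signatures here are illustrative, not the PR's exact API):

```rust
use futures::StreamExt;
use tokio::sync::watch;

/// Hypothetical driver loop: `conn_status` flips to `true` once the
/// connection task reports closure or an error.
async fn drive_table_copy(
    mut events: impl futures::Stream<Item = Vec<u8>> + Unpin,
    mut conn_status: watch::Receiver<bool>,
) {
    loop {
        tokio::select! {
            // React to connection closure/error immediately, even while the
            // stream is paused by memory backpressure.
            res = conn_status.changed() => {
                if res.is_err() || *conn_status.borrow() {
                    break;
                }
            }
            maybe_row = events.next() => {
                match maybe_row {
                    Some(_row) => { /* apply the row */ }
                    None => break, // stream completed
                }
            }
        }
    }
}
```

This keeps termination handling out of the stream wrappers themselves: the streams only know about memory pressure, while connection lifecycle is observed at the `select!` call site.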