fix: potential data corruption bug and improve message sending behaviour #174

imor · 2025-07-09T11:13:56Z

This PR:

Fixes a bug in which table states were updated before we got an ack back from the destination. This could have led to data corruption if the process crashed after updating the table state and before sending the batch.
No longer batches the ReplicationMessage<LogicalReplicationMessage> messages into a vec that was later copied into another vec. This reduces the number of allocations. Allocations can be reduced further if we send a slice to the destination instead of a vec, but that's not yet handled.

I also wanted to create a stream that encapsulates the whole message filtering logic and produces LogicalReplicationMessages, and then batch the output of this stream, but this became complicated because the stream would need to call async methods. This will be done in a separate PR later on.
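
To illustrate the ordering fix described above, here is a minimal synchronous sketch, not the code in apply.rs. All names in it (`Applier`, `Destination`, `MockDestination`, `TableState`, `flush`) are hypothetical stand-ins, and the real apply loop is async. The point is only that the table state and flushed LSN are touched after the destination has acknowledged the batch, and that the pending events are moved out rather than copied into a second vec.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types; assumptions for illustration only.
type TableId = u32;
type Lsn = u64;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TableState {
    Syncing,
    Ready,
}

/// A destination acknowledges a batch by returning `Ok(())`.
trait Destination {
    fn write_batch(&mut self, events: Vec<String>) -> Result<(), String>;
}

/// Test double: rejects batches while `fail` is true, otherwise records them.
struct MockDestination {
    fail: bool,
    written: Vec<Vec<String>>,
}

impl Destination for MockDestination {
    fn write_batch(&mut self, events: Vec<String>) -> Result<(), String> {
        if self.fail {
            return Err("destination unavailable".to_string());
        }
        self.written.push(events);
        Ok(())
    }
}

struct Applier<D: Destination> {
    destination: D,
    pending: Vec<String>,
    table_states: HashMap<TableId, TableState>,
    flushed_lsn: Lsn,
}

impl<D: Destination> Applier<D> {
    /// Sends the pending batch and records progress only after the ack.
    fn flush(&mut self, batch_end_lsn: Lsn, caught_up: &[TableId]) -> Result<(), String> {
        // Move the events out of `pending` instead of copying them.
        let batch = std::mem::take(&mut self.pending);

        // If this returns an error (or the process dies here), neither the
        // LSN nor any table state has been advanced yet.
        self.destination.write_batch(batch)?;

        // Ack received: it is now safe to record progress.
        self.flushed_lsn = batch_end_lsn;
        for id in caught_up {
            self.table_states.insert(*id, TableState::Ready);
        }
        Ok(())
    }
}
```

With this ordering, a failed or interrupted write leaves the recorded progress behind the destination rather than ahead of it, which is the safe direction as long as the source can replay events from the last flushed LSN.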

iambriccardo

Left some comments

etl/src/v2/replication/apply.rs

iambriccardo · 2025-07-11T12:16:42Z

+            end_loop |= !hook
+                .process_syncing_tables(state.next_status_update.flush_lsn, true)
+                .await?;


With the process syncing here, the system becomes much harder to reason about, but I am failing to see any alternative that would work well with batching.

Yes, it has become harder to see the logic, but as you said I can't think of a real alternative either.

etl/src/v2/replication/apply.rs

iambriccardo · 2025-07-11T12:53:18Z

+        if let Some(table_id) = skip_table {
+            end_loop |= !hook.skip_table(table_id).await?;


Shouldn't this be outside of the sending of the batch? As far as I know we need to skip the table as soon as we get the event.

This will be called as soon as the relation event indicates a change in schema: in that case end_batch is set to EndBatch::Exclusive, so the end_batch.is_some() part of the condition if time_to_send_batch || state.events_batch.len() >= max_batch_size || end_batch.is_some() is true.

Again, it is important to call this only after the batch is acked: if we set the table state to skipped before that, we run the risk of missing events.

Oh yeah, nvm, I missed the last condition.
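
The flush condition discussed in this thread can be sketched roughly as follows. This is a simplified, synchronous illustration under stated assumptions. `should_send_batch` and `flush_then_skip` are hypothetical helpers, the `EndBatch` enum here carries only the variant named in the thread, and using `saturating_duration_since` is a choice made for this sketch rather than something taken from the PR.

```rust
use std::time::{Duration, Instant};

/// Hypothetical mirror of the `EndBatch` value mentioned in the thread.
/// Only the variant named there is modelled; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EndBatch {
    Exclusive,
}

/// Returns true when the current batch should be sent to the destination.
fn should_send_batch(
    batch_started: Instant,
    now: Instant,
    max_fill: Duration,
    batch_len: usize,
    max_batch_size: usize,
    end_batch: Option<EndBatch>,
) -> bool {
    // Yields zero instead of panicking if `now` is earlier than `batch_started`.
    let time_to_send_batch = now.saturating_duration_since(batch_started) >= max_fill;

    // Same shape as the condition quoted above: any one trigger is enough.
    time_to_send_batch || batch_len >= max_batch_size || end_batch.is_some()
}

/// Sends the batch first and marks the table as skipped only after the ack.
fn flush_then_skip(
    send: impl FnOnce() -> Result<(), String>,
    skip_table: Option<u32>,
    skipped: &mut Vec<u32>,
) -> Result<(), String> {
    // An error here returns early, so the table is not marked as skipped.
    send()?;
    if let Some(table_id) = skip_table {
        skipped.push(table_id);
    }
    Ok(())
}
```

Because a pending `end_batch` forces the condition to be true, the skip happens in the same iteration as the relation event, but still after the batch has been acknowledged.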

imor added 7 commits July 10, 2025 17:10

use result instead of state to indicate early breaking

606c7f3

call process_syncing_tables after ack from destination

5e84a93

add more comments and cleaned up code

2cf750a

wip

2753326

add more comments and cleaned up code

e8953f0

rename a method

b8c85b2

fix a bug and derive Debug for a couple of structs

3c12160

imor force-pushed the raminder/etl-148-process_syncing_tables-in-apply-loop-should-be-called-on branch from 2f20960 to 3c12160 Compare July 10, 2025 11:59

imor added 11 commits July 10, 2025 18:05

call process_syncing_tables with flush lsn value

6321fa8

simplify implementation

de7be44

Merge branch 'main' into raminder/etl-148-process_syncing_tables-in-a…

9d6b5b2

…pply-loop-should-be-called-on

only update state after a batch is written

a5b7998

rearrange conditions

a6df469

finetune logic

b9af1e8

remove a variable

019098c

factor out a try_send_batch method

2db5819

Merge branch 'main' into raminder/etl-148-process_syncing_tables-in-a…

00e804e

…pply-loop-should-be-called-on

fix potential panic due to clock skew

1d39441

simplify implementation

04c0c57

imor changed the title ~~Raminder/etl 148 process syncing tables in apply loop should be called on~~ Improve message sending behaviour Jul 11, 2025

imor changed the title ~~Improve message sending behaviour~~ fix: potential data corruption bug and improve message sending behaviour Jul 11, 2025

imor marked this pull request as ready for review July 11, 2025 11:16

imor requested a review from a team as a code owner July 11, 2025 11:16

iambriccardo reviewed Jul 11, 2025

imor added 3 commits July 11, 2025 18:39

simplify condition

8815d13

move out max batch fill duration from ApplyLoopState

3d1ca06

improve comment

4a11097

imor requested a review from iambriccardo July 11, 2025 13:37

iambriccardo approved these changes Jul 11, 2025

imor merged commit cd8b282 into main Jul 11, 2025
3 checks passed

imor deleted the raminder/etl-148-process_syncing_tables-in-apply-loop-should-be-called-on branch July 11, 2025 14:54
