Sync protocol improvements #530

gruuya · 2024-06-13T14:39:17Z

The changes in this PR involve:

new more robust replication protocol (formalized in the clade proto files).
batch compaction, meaning compressing changes from a single sync into a minimal form getting rid of temporary intermediate changes
idempotent CRUD support

TODOS:

Further segmentation of logic pieces that are not strictly needed in the sync writer
~~A lot~~ more unit testing, and a bit more integration testing

eatonphil · 2024-06-13T14:43:29Z

clade/proto/sync.proto

+  // Denotes a column that that has the new value for a primary key column
+  NEW_PK = 1;
+  // Denotes whether a particular column has changed in value
+  CHANGED = 2;


I am still not personally confident about how TOAST columns actually work. We may need to change more here or not. But nbd for now, just mentioning.

src/frontend/flight/sync/schema.rs

eatonphil · 2024-06-13T14:47:58Z

src/frontend/flight/sync/utils.rs

+// This means that if a row is changed multiple times, only the last change will be reflected in the
+// output batch (meaning the last NewPk and Value role columns and the last Value column where the
+// accompanying Changed field was `true`).
+pub(super) fn compact_batches(


While I see the point about squashing changes, isn't it pretty wasteful to be converting to rows and then back to columns?

Also, can't you squash rows without actually converting to row format?

Fair point; for one thing note this was designed to be efficiently encodable:
https://github.com/apache/arrow-rs/blob/8752e01be642bce205984e16b44e06078413dc68/arrow-row/src/lib.rs#L20-L24

In fact DataFusion itself relies on it for some core computations itself (also we only ever transpose the PKs, though that isn't a big factor anyway).

Sorting with columnar format is like the worst case for columnar format, so it doesn't surprise me they convert to row to sort, but I don't think we are trying to do that exactly. Or I'm not sure what you mean

Or I'm not sure what you mean

Ah, so what needs to happen above is the squashing (compaction) of all chains of changes to a given row (that may encompass multiple input rows), into a single output row, which will then be applied to the target table.

These chains can be of arbitrary lengths, from 1 (meaning a row is touched only once) to the batch row count (meaning the entire batch conveys changes for a single original row). In addition, various things can happen in each chain, such as value changes, pk changes and entire row deletion.

In principle I think this kind of computation can be formalized using recursive CTEs. However there are 3 things to note here:

I briefly experimented with recursive CTEs, managing to get some basic (but not full) functionality that we need here. The simplest prototype worked ok, however slightly more involved stuff made DF hang and spike the CPU, meaning there are some bugs there and I can't resolve those now.

Even if there were no issues in the execution, the recursive CTE implementation (using dataframe/plan/expression primitives) would be much more complex, since it would approach the problem in a round-about way.

Ultimately, even such an implementation would entail using the arrow-row under the hood somewhere (e.g. in the grouping/window aggregation phases).

Yeah, I think normally we could do this with something like this:

WITH latest_rows AS ( SELECT row_number() OVER (PARTITION BY pk ORDER BY timestamp DESC) AS _row_num FROM table ) SELECT [cols] FROM latest_rows WHERE _row_num = 1 AND NOT is_deleted

which would still process things row-by-row in the background but at least would reduce the amount of our custom code that we have to write.

But with PK-changing UPDATEs we have to be clever. It does raise the question of why we expect the API user to convert row-based changes into Arrow and then this code transposes it back into rows, then back again into columnar (maybe V2 of this would be Avro changes instead).

We can always see how this performs under benchmarks (append-only, update-intensive, update-intensive with PK changes). Ultimately, we'll have to do this transpose/compact somewhere - on write or on every read.

eatonphil · 2024-06-13T14:49:21Z

src/frontend/flight/sync/utils.rs

+}
+
+// Generates a pruning qualifier for the table based on the PK columns and the min/max values
+// in the data sync entries.


I don't quite understand the point of this, like what is the benefit? The comment doesn't help me much

The goal is to reduce the amount of data needed to be scanned/streamed/joined with from the base table.

Basically we only need to touch those partition which might have the extreme (min/max) PK values that are contained in the sync data.

I'd add something like that to the comment in this case

// Generate a qualifier expression that, when applied to the table, will only return // rows whose primary keys are affected by the changes in `entry`. This is so that // we can only read the partitions from Delta Lake that we need to rewrite.

Aside: if my entry has two changes to PK=1 and PK=infinity, will this make me read the entire Delta table?

Got it, changing the doc.

Aside: if my entry has two changes to PK=1 and PK=infinity, will this make me read the entire Delta table?

Yup, I don't think there's a way around it.

I think we can still work around it though - by manually reading the Delta metadata and cross-referencing it against each PK we know we have changes for, but that sounds like more of a future optimization to me.

But delta-rs already does that for us automatically, i.e. the scan uses the filter from this function to prune the partitions. So if we have PK=1..infinity (I'm assuming you mean here a really large number encompassing all partitions) then we must hit all partitions otherwise we're not guaranteeing idempotency (e.g. we may double-insert)?

I was thinking just two changes, one hitting partition #1, one hitting the last partition, not a range of changes spanning all partitions. Surely in this case we can optimize by only rewriting these two partitions?

Ah, yeah that should be possible.

Not sure how complicated it would be though, naively in the general case we'd need to have a qualifier per change/sync row and consequently a multiple scans/joins.

The less naive approach would be to segment all changes and group them by partitions they hit, and then only scan/join per group.

…interpretation

src/frontend/flight/sync/utils.rs

mildbyte · 2024-06-17T07:55:25Z

src/frontend/flight/sync/utils.rs

+}
+
+// Generates a pruning qualifier for the table based on the PK columns and the min/max values
+// in the data sync entries.


I'd add something like that to the comment in this case

// Generate a qualifier expression that, when applied to the table, will only return // rows whose primary keys are affected by the changes in `entry`. This is so that // we can only read the partitions from Delta Lake that we need to rewrite.

Aside: if my entry has two changes to PK=1 and PK=infinity, will this make me read the entire Delta table?

mildbyte · 2024-06-17T08:03:39Z

src/frontend/flight/sync/utils.rs

+// This means that if a row is changed multiple times, only the last change will be reflected in the
+// output batch (meaning the last NewPk and Value role columns and the last Value column where the
+// accompanying Changed field was `true`).
+pub(super) fn compact_batches(


Yeah, I think normally we could do this with something like this:

WITH latest_rows AS ( SELECT row_number() OVER (PARTITION BY pk ORDER BY timestamp DESC) AS _row_num FROM table ) SELECT [cols] FROM latest_rows WHERE _row_num = 1 AND NOT is_deleted

which would still process things row-by-row in the background but at least would reduce the amount of our custom code that we have to write.

But with PK-changing UPDATEs we have to be clever. It does raise the question of why we expect the API user to convert row-based changes into Arrow and then this code transposes it back into rows, then back again into columnar (maybe V2 of this would be Avro changes instead).

We can always see how this performs under benchmarks (append-only, update-intensive, update-intensive with PK changes). Ultimately, we'll have to do this transpose/compact somewhere - on write or on every read.

Cargo.toml

mildbyte · 2024-06-17T08:06:45Z

src/frontend/flight/sync/schema.rs

+    #[allow(dead_code)]
+    pub fn arrow_schema(&self) -> SchemaRef {
+        self.schema.clone()
+    }


Why keep this code if it's dead, some future impl?

Potentially, but also I use it in the tests.

You can make it non-public then (it'll still be available to unit tests in the same file)

The unit test is in another file, so it's simpler to remove and revise the test setup slightly.

src/frontend/flight/sync/schema.rs

src/frontend/flight/sync/utils.rs

src/frontend/flight/sync/writer.rs

…s instead of a hashmap

gruuya added 4 commits June 12, 2024 20:28

Add sync batch compacting

cfeec68

Implement SyncSchema helper struct

c33e845

Complete the projection logic for all cases

335f0ec

Fix tests and couple of sync bugs

ba0d569

gruuya requested a review from mildbyte June 13, 2024 14:39

eatonphil reviewed Jun 13, 2024

View reviewed changes

src/frontend/flight/sync/schema.rs Show resolved Hide resolved

eatonphil reviewed Jun 13, 2024

View reviewed changes

gruuya added 6 commits June 14, 2024 22:19

Use the same row converter when encoding old and new PK columns

4edd50a

Add map_columns method to SyncSchema

52fca2d

Construct the SyncColumns during SyncSchema init

68c9092

Setup batch compacting data in row order for easier manipulation and …

6b8d3e5

…interpretation

Add fuzz test for batch compaction

0936206

Exclude temporary rows when compacting a batch

26506af

mildbyte reviewed Jun 17, 2024

View reviewed changes

gruuya added 4 commits June 17, 2024 10:57

Clarify some comments and extend the compaction test

554815f

Remove SyncSchema method used only in test

9a736f1

Fix a bug in filter generation and add a test

0596a2e

Ensure consistent order of filter expressions by using a vec of tuple…

a5238fa

…s instead of a hashmap

mildbyte approved these changes Jun 18, 2024

View reviewed changes

gruuya merged commit 20887ea into main Jun 18, 2024
1 check passed

gruuya deleted the sync-v2 branch June 18, 2024 08:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync protocol improvements #530

Sync protocol improvements #530

gruuya commented Jun 13, 2024 •

edited

Loading

eatonphil Jun 13, 2024

eatonphil Jun 13, 2024

eatonphil Jun 13, 2024

gruuya Jun 13, 2024

eatonphil Jun 13, 2024

gruuya Jun 13, 2024

mildbyte Jun 17, 2024

eatonphil Jun 13, 2024

gruuya Jun 13, 2024

mildbyte Jun 17, 2024

gruuya Jun 17, 2024

mildbyte Jun 17, 2024

gruuya Jun 17, 2024

mildbyte Jun 18, 2024

gruuya Jun 18, 2024

mildbyte Jun 17, 2024

mildbyte Jun 17, 2024

mildbyte Jun 17, 2024

gruuya Jun 17, 2024

mildbyte Jun 17, 2024

gruuya Jun 17, 2024

Sync protocol improvements #530

Sync protocol improvements #530

Conversation

gruuya commented Jun 13, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gruuya commented Jun 13, 2024 •

edited

Loading