feat(storage): implement dedup pk deserializer #3578

kwannoel · 2022-06-30T10:16:16Z

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

Summarize your change (mandatory)
Implement dedup pk cell based deserializer.
It will be used by relational layer for dedup pk encoding.
How does this PR work? Need a brief introduction for the changed logic (optional)
DedupPkCellBasedRowDeserializer uses CellBasedRowDeserializer internally to deserialize rows.
It then replaces de-duplicated datums from pk.
Describe clearly one logical change and avoid lazy messages (optional)
Describe any limitations of the current code (optional)
Add the 'user-facing changes' label if your PR contains changes that are visible to users (optional)
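The flow summarized above (deserialize with the inner deserializer, then put the de-duplicated datums back from the pk) can be sketched roughly as follows. This is only an illustration, not the code in this PR: `Datum` is simplified to `Option<i64>`, `fill_dedup_pk` is a hypothetical helper, and the shape of `pk_to_row_mapping` is an assumption.

```rust
/// Simplified stand-in for a datum. In this sketch `None` doubles as the
/// placeholder left by the inner deserializer (an assumption; it also means
/// a real NULL cannot be told apart from a placeholder here).
type Datum = Option<i64>;

/// Hypothetical helper: overwrite the placeholders in `row` with pk datums.
///
/// `pk_to_row_mapping[i]` is `Some(row_idx)` when the i-th pk column was
/// de-duplicated and belongs at `row_idx`, and `None` when that pk column
/// is not part of the row at all.
fn fill_dedup_pk(
    mut row: Vec<Datum>,
    pk_datums: &[Datum],
    pk_to_row_mapping: &[Option<usize>],
) -> Vec<Datum> {
    for (pk_datum, row_idx) in pk_datums.iter().zip(pk_to_row_mapping) {
        if let Some(idx) = row_idx {
            row[*idx] = *pk_datum;
        }
    }
    row
}
```

With these assumptions, calling `fill_dedup_pk(vec![Some(1), None, Some(3)], &[Some(2)], &[Some(1)])` fills the single placeholder at index 1 with the pk datum.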

Checklist

I have written necessary docs and comments
I have added necessary unit tests and integration tests
Tests for dedup pk encoding ser/de will be done separately. Tracked in Tracking: Cell encoding - store a column either in pk or value, but not both #3412.
All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

#3412

codecov · 2022-06-30T16:38:04Z

Codecov Report

Merging #3578 (f9302ee) into main (ee8b0e8) will increase coverage by 0.02%.
The diff coverage is 92.26%.

@@            Coverage Diff             @@
##             main    #3578      +/-   ##
==========================================
+ Coverage   74.30%   74.32%   +0.02%     
==========================================
  Files         772      773       +1     
  Lines      109190   109358     +168     
==========================================
+ Hits        81131    81285     +154     
- Misses      28059    28073      +14

| Flag | Coverage Δ | |
|---|---|---|
| rust | `74.32% <92.26%> (+0.02%)` | ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/storage/src/lib.rs | `100.00% <ø> (ø)` | |
| ...torage/src/dedup_pk_cell_based_row_deserializer.rs | `92.26% <92.26%> (ø)` | |
| src/connector/src/filesystem/file_common.rs | `80.35% <0.00%> (-0.45%)` | ⬇️ |
| src/frontend/src/expr/utils.rs | `98.74% <0.00%> (-0.26%)` | ⬇️ |
| src/meta/src/manager/id.rs | `96.06% <0.00%> (+0.56%)` | ⬆️ |

BowenXiao1999

LGTM.

We recently proposed to use Row_encoding in https://singularity-data.quip.com/5TutAoAlk6Oa/RFC-Row-based-encoding-in-relational-table-layer. But I'm not sure whether the conclusion has been synced to you. cc @wcy-fdu

kwannoel · 2022-07-01T07:03:22Z

> LGTM.
>
> We recently proposed to use Row_encoding in https://singularity-data.quip.com/5TutAoAlk6Oa/RFC-Row-based-encoding-in-relational-table-layer. But I'm not sure whether the conclusion has been synced to you. cc @wcy-fdu

I see. Thanks for sharing. I just read through the doc. If cell-based encoding will be phased out, then dedup pk cell-based encoding won't be needed either.

Since I will be away, I will leave this unmerged for now, as it might become obsolete. In the meantime I will see how the discussion on row-based encoding proceeds.

wcy-fdu · 2022-07-01T07:09:28Z

Maybe we will keep cell-based encoding at this stage and even for a long time. In the future, after row-based encoding is implemented, we need to do more detailed benchmarks to compare cell-based and row-based encoding, and I think dedup pk will make the bench fairer as it is an optimization of cell-based encoding.

wcy-fdu · 2022-07-01T07:10:27Z

BTW, I will review this PR later😊

kwannoel · 2022-07-01T07:14:01Z

> Maybe we will keep cell-based encoding at this stage and even for a long time. In the future, after row-based encoding is implemented, we need to do more detailed benchmarks to compare cell-based and row-based encoding, and I think dedup pk will make the bench fairer as it is an optimization of cell-based encoding.

I see, thanks for clarifying this!

skyzh

Rest LGTM, good work!

skyzh · 2022-07-02T05:09:02Z

src/storage/src/dedup_pk_cell_based_row_deserializer.rs

+    /// Create a [`DedupPkCellBasedRowDeserializer`]
+    /// to decode cell based row with dedup pk encoding.
+    pub fn new(column_mapping: Desc, pk_descs: &[OrderedColumnDesc]) -> Self {
+        let (pk_data_types, pk_order_types) = pk_descs


The RowDeserializer / Serializer was designed to be created without overhead, which is why I introduced ColumnDescMapping earlier to cache the HashMap of the column mapping. For the DedupPk serializer, it would be better to have a new DedupColumnDescMapping to store all such generated info (e.g. pk_to_row_mapping). This can be done in later PRs.

The ultimate solution will be to cache all such information in a StateTableInfo struct, and pass it everywhere... I guess @BugenZhao is working on this.
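One way to picture the caching idea is the rough sketch below. It is not the actual design: `DedupColumnDescMapping` here is a hypothetical stand-in that only stores `pk_to_row_mapping`, columns are identified by plain `i32` ids, and `DedupDeserializer` is an invented name for a deserializer that just holds a shared handle.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the suggested `DedupColumnDescMapping`:
/// everything derived from the schema is computed once and then shared.
#[derive(Debug)]
struct DedupColumnDescMapping {
    /// For the i-th pk column, the position it occupies in the full row
    /// (`None` if that pk column does not appear in the row).
    pk_to_row_mapping: Vec<Option<usize>>,
}

impl DedupColumnDescMapping {
    /// Build the mapping once per table. `column_ids` lists the row's
    /// columns in row order; `pk_column_ids` lists the pk columns in pk order.
    fn new(column_ids: &[i32], pk_column_ids: &[i32]) -> Self {
        let pk_to_row_mapping = pk_column_ids
            .iter()
            .map(|pk_id| column_ids.iter().position(|id| id == pk_id))
            .collect();
        Self { pk_to_row_mapping }
    }
}

/// Hypothetical deserializer: construction only clones an `Arc`,
/// so no per-instance mapping work is repeated.
struct DedupDeserializer {
    mapping: Arc<DedupColumnDescMapping>,
}

impl DedupDeserializer {
    fn new(mapping: Arc<DedupColumnDescMapping>) -> Self {
        Self { mapping }
    }
}
```

Under these assumptions, every deserializer created for the same table reuses one precomputed mapping instead of rebuilding it in `new`.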

skyzh · 2022-07-02T05:09:52Z

src/storage/src/dedup_pk_cell_based_row_deserializer.rs

+        if let Some((_vnode, pk, row)) = raw_result {
+            let pk_datums = self.pk_deserializer.deserialize(&pk)?;
+            Ok(Some((
+                _vnode,


Suggested change

-                _vnode,
+                vnode,

If a value is actually used, we can remove the underscore.

skyzh · 2022-07-02T05:17:21Z

src/storage/src/dedup_pk_cell_based_row_deserializer.rs

+            })
+            .collect();
+
+        let inner = CellBasedRowDeserializer::new(column_mapping);


Shall we remove deduped pks from column mapping?

If we use the original column mapping, the inner CellBasedRowDeserializer will leave placeholders for the missing deduped pk datums when deserializing. Then we can just replace the placeholders with the deduped pk datums.

If we remove deduped pks from the column mapping, we can't do this. The deserialized form will be compact, so we would probably need another step to map the deduped row plus the deduped pk datums to a result row.

Both ways are possible, but I prefer the first.

For example, given [1, 2, 3] with pk_indices = [1], we store [1, 3] under dedup pk encoding.
When deserializing, the result from the inner CellBasedRowDeserializer is:

with original column mapping: [Some(1), None, Some(3)] // can just replace None with deduped pk datum
with dedup column mapping: [Some(1), Some(3)]
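For comparison, the extra step that the second option (compact form) would need might look like the sketch below. It is purely illustrative: `merge_compact_row` is a hypothetical helper, `Datum` is simplified to `Option<i64>`, and it assumes every pk column was deduplicated and that `pk_indices` holds distinct, in-range row positions.

```rust
/// Simplified stand-in for a datum (assumption for this sketch).
type Datum = Option<i64>;

/// Hypothetical helper: rebuild the full row from a compact row (non-pk
/// columns in row order) and the pk datums. `pk_indices[i]` is the row
/// position of the i-th pk column.
fn merge_compact_row(
    compact: &[Datum],
    pk_datums: &[Datum],
    pk_indices: &[usize],
) -> Vec<Datum> {
    let len = compact.len() + pk_datums.len();
    let mut full: Vec<Datum> = Vec::with_capacity(len);
    let mut rest = compact.iter();
    for row_idx in 0..len {
        match pk_indices.iter().position(|&i| i == row_idx) {
            // This position belongs to a pk column: take it from the pk.
            Some(pk_pos) => full.push(pk_datums[pk_pos]),
            // Otherwise consume the next non-pk datum from the compact row.
            None => full.push(*rest.next().expect("compact row too short")),
        }
    }
    full
}
```

With the example above, `merge_compact_row(&[Some(1), Some(3)], &[Some(2)], &[1])` interleaves the pk datum back at index 1. The placeholder approach avoids this walk, since the inner deserializer already produces a row of the final length.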

skyzh · 2022-07-02T05:18:15Z

src/storage/src/dedup_pk_cell_based_row_deserializer.rs

+impl<Desc: Deref<Target = ColumnDescMapping>> DedupPkCellBasedRowDeserializer<Desc> {
+    /// Create a [`DedupPkCellBasedRowDeserializer`]
+    /// to decode cell based row with dedup pk encoding.
+    pub fn new(column_mapping: Desc, pk_descs: &[OrderedColumnDesc]) -> Self {


It would be more intuitive to specify pk indices instead of the ColumnDescs 🤣 But this looks okay to me for now.

* add dedup deserializer skeleton
* add dedup pk deserialization logic
* fmt
* add tests
* fmt
* config test module
* add docs
* add docs
* rerun ci
* remove underscore
* add todos for DedupPkCellDeserializer instantiation

github-actions bot added the Invalid PR Title label Jun 30, 2022

kwannoel changed the title ~~add dedup deserializer skeleton~~ feat(row_deserializer): add dedup deserializer skeleton Jun 30, 2022

github-actions bot removed the Invalid PR Title label Jun 30, 2022

kwannoel changed the title ~~feat(row_deserializer): add dedup deserializer skeleton~~ feat(row_deserializer): add dedup pk deserializer Jun 30, 2022

github-actions bot added the type/feature label Jun 30, 2022

kwannoel added 6 commits June 30, 2022 23:05

add dedup deserializer skeleton

fcc8c01

add dedup pk deserialization logic

54d7daf

fmt

caf8de4

add tests

7ec64fc

fmt

b56c5da

config test module

479912c

kwannoel force-pushed the kwannoel/pk-dedup-deser branch from dea4eaa to 479912c Compare June 30, 2022 15:05

add docs

3682d54

kwannoel force-pushed the kwannoel/pk-dedup-deser branch from 3f36b3a to 3682d54 Compare June 30, 2022 15:15

add docs

b8122d7

kwannoel marked this pull request as ready for review June 30, 2022 15:23

kwannoel mentioned this pull request Jun 30, 2022

Tracking: Cell encoding - store a column either in pk or value, but not both #3412

Closed

13 tasks

kwannoel requested a review from wcy-fdu June 30, 2022 15:36

kwannoel changed the title ~~feat(row_deserializer): add dedup pk deserializer~~ feat(storage): implement dedup pk deserializer Jun 30, 2022

rerun ci

ddc8e60

kwannoel requested a review from BowenXiao1999 June 30, 2022 17:14

BowenXiao1999 approved these changes Jul 1, 2022

kwannoel marked this pull request as draft July 1, 2022 07:03

kwannoel marked this pull request as ready for review July 1, 2022 07:13

skyzh approved these changes Jul 2, 2022

wcy-fdu approved these changes Jul 2, 2022

kwannoel added 3 commits July 2, 2022 15:17

Merge remote-tracking branch 'origin/main' into kwannoel/pk-dedup-deser

0c20b60

remove underscore

7d6ba27

add todos for DedupPkCellDeserializer instantiation

f9302ee

kwannoel added the mergify/can-merge label Jul 2, 2022

mergify bot merged commit feb7e43 into main Jul 2, 2022

mergify bot deleted the kwannoel/pk-dedup-deser branch July 2, 2022 07:58

kwannoel mentioned this pull request Jul 14, 2022

refactor(encoding): reduce DedupPkCellBasedRowDeserializer initialization overhead #3855

Merged

3 tasks
