Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): implement dedup pk deserializer #3578

Merged
merged 12 commits into from
Jul 2, 2022
Merged

Conversation

kwannoel
Copy link
Contributor

@kwannoel kwannoel commented Jun 30, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

PLEASE DO NOT LEAVE THIS EMPTY !!!

Please explain IN DETAIL what the changes are in this PR and why they are needed:

  • Summarize your change (mandatory)
    Implement dedup pk cell based deserializer.
    It will be used by relational layer for dedup pk encoding.
  • How does this PR work? Need a brief introduction for the changed logic (optional)
    DedupPkCellBasedRowDeserializer uses CellBasedRowDeserializer internally to deserialize rows.
    It then replaces de-duplicated datums from pk.
  • Describe clearly one logical change and avoid lazy messages (optional)
  • Describe any limitations of the current code (optional)
  • Add the 'user-facing changes' label if your PR contains changes that are visible to users (optional)

Checklist

Refer to a related PR or issue link (optional)

#3412

@kwannoel kwannoel changed the title add dedup deserializer skeleton feat(row_deserializer): add dedup deserializer skeleton Jun 30, 2022
@kwannoel kwannoel changed the title feat(row_deserializer): add dedup deserializer skeleton feat(row_deserializer): add dedup pk deserializer Jun 30, 2022
@kwannoel kwannoel marked this pull request as ready for review June 30, 2022 15:23
@kwannoel kwannoel requested a review from wcy-fdu June 30, 2022 15:36
@kwannoel kwannoel changed the title feat(row_deserializer): add dedup pk deserializer feat(storage): implement dedup pk deserializer Jun 30, 2022
@codecov
Copy link

codecov bot commented Jun 30, 2022

Codecov Report

Merging #3578 (f9302ee) into main (ee8b0e8) will increase coverage by 0.02%.
The diff coverage is 92.26%.

@@            Coverage Diff             @@
##             main    #3578      +/-   ##
==========================================
+ Coverage   74.30%   74.32%   +0.02%     
==========================================
  Files         772      773       +1     
  Lines      109190   109358     +168     
==========================================
+ Hits        81131    81285     +154     
- Misses      28059    28073      +14     
Flag Coverage Δ
rust 74.32% <92.26%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/storage/src/lib.rs 100.00% <ø> (ø)
...torage/src/dedup_pk_cell_based_row_deserializer.rs 92.26% <92.26%> (ø)
src/connector/src/filesystem/file_common.rs 80.35% <0.00%> (-0.45%) ⬇️
src/frontend/src/expr/utils.rs 98.74% <0.00%> (-0.26%) ⬇️
src/meta/src/manager/id.rs 96.06% <0.00%> (+0.56%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Copy link
Contributor

@BowenXiao1999 BowenXiao1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

We recently proposed to use Row_encoding in https://singularity-data.quip.com/5TutAoAlk6Oa/RFC-Row-based-encoding-in-relational-table-layer. But I'm not sure whether the conclusion have been sync to you. cc @wcy-fdu

@kwannoel
Copy link
Contributor Author

kwannoel commented Jul 1, 2022

LGTM.

We recently proposed to use Row_encoding in https://singularity-data.quip.com/5TutAoAlk6Oa/RFC-Row-based-encoding-in-relational-table-layer. But I'm not sure whether the conclusion have been sync to you. cc @wcy-fdu

I see. Thanks for sharing. Just read through the doc, if cell-based encoding will be phased out, then dedup pk cell based encoding won't be needed either.

Since I will be away, will leave this unmerged for now as it might be obsolete. In the meantime I will see how discussion on Row-based encoding proceeds.

@kwannoel kwannoel marked this pull request as draft July 1, 2022 07:03
@wcy-fdu
Copy link
Contributor

wcy-fdu commented Jul 1, 2022

Maybe we will keep cell-based encoding at this stage and even for a long time. In the future, after row-based encoding is implemented, we need to do more detailed benchmarks to compare cell-based and row-based encoding, and I think dedup pk will make the bench fairer as it is an optimization of cell-based encoding.

@wcy-fdu
Copy link
Contributor

wcy-fdu commented Jul 1, 2022

BTW, I will review this PR later😊

@kwannoel kwannoel marked this pull request as ready for review July 1, 2022 07:13
@kwannoel
Copy link
Contributor Author

kwannoel commented Jul 1, 2022

Maybe we will keep cell-based encoding at this stage and even for a long time. In the future, after row-based encoding is implemented, we need to do more detailed benchmarks to compare cell-based and row-based encoding, and I think dedup pk will make the bench fairer as it is an optimization of cell-based encoding.

I see, thanks for clarifying this!

Copy link
Contributor

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM, good work!

/// Create a [`DedupPkCellBasedRowDeserializer`]
/// to decode cell based row with dedup pk encoding.
pub fn new(column_mapping: Desc, pk_descs: &[OrderedColumnDesc]) -> Self {
let (pk_data_types, pk_order_types) = pk_descs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RowDeserializer / Serializer was designed to be created without overhead, and that's why I introduced ColumnDescMapping before to cache the HashMap of column mapping. For DedupPk serializer, it would be better to have a new DedupColumnDescMapping to store all such generated info (e.g. pk_to_row_mapping). Can done in later PRs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ultimate solution will be to cache all such information in a StateTableInfo struct, and pass it everywhere... I guess @.BugenZhao is working on this.

if let Some((_vnode, pk, row)) = raw_result {
let pk_datums = self.pk_deserializer.deserialize(&pk)?;
Ok(Some((
_vnode,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_vnode,
vnode,

If a value is actually used, we can remove the underscore.

})
.collect();

let inner = CellBasedRowDeserializer::new(column_mapping);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we remove deduped pks from column mapping?

Copy link
Contributor Author

@kwannoel kwannoel Jul 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use original column mapping, when deserializing, the inner CellBasedRowDeserializer will leave placeholders for missing dedupped pk datums. Then we can just replace the placeholders with the dedupped pk datums.

If we remove deduped pks from column mapping, we can't do this. The deserialized form will be compact. Probably need another step to map the dedupped row + the dedupped pk datums to a result row.

Both ways are possible, but prefer the first.

For example, given [1, 2, 3] with pk_indices = [1], we store [1, 3] under dedup pk encoding.
When deserializing, the result from the inner CellBasedRowDeserializer is:

with original column mapping: [Some(1), None, Some(3)] // can just replace None with dedupped pk datum
with dedup column mapping: [Some(1), Some(3)]

impl<Desc: Deref<Target = ColumnDescMapping>> DedupPkCellBasedRowDeserializer<Desc> {
/// Create a [`DedupPkCellBasedRowDeserializer`]
/// to decode cell based row with dedup pk encoding.
pub fn new(column_mapping: Desc, pk_descs: &[OrderedColumnDesc]) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more intuitive to specify pk indices instead of the ColumnDescs 🤣 But this looks okay to me for now.

@kwannoel kwannoel added the mergify/can-merge Indicates that the PR can be added to the merge queue label Jul 2, 2022
@mergify mergify bot merged commit feb7e43 into main Jul 2, 2022
@mergify mergify bot deleted the kwannoel/pk-dedup-deser branch July 2, 2022 07:58
huangjw806 pushed a commit that referenced this pull request Jul 5, 2022
* add dedup deserializer skeleton

* add dedup pk deserialization logic

* fmt

* add tests

* fmt

* config test module

* add docs

* add docs

* rerun ci

* remove underscore

* add todos for DedupPkCellDeserializer instantiation
nasnoisaac pushed a commit to nasnoisaac/risingwave that referenced this pull request Aug 9, 2022
* add dedup deserializer skeleton

* add dedup pk deserialization logic

* fmt

* add tests

* fmt

* config test module

* add docs

* add docs

* rerun ci

* remove underscore

* add todos for DedupPkCellDeserializer instantiation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mergify/can-merge Indicates that the PR can be added to the merge queue type/feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants