feat(storage): implement dedup pk deserializer #3578
Codecov Report
@@ Coverage Diff @@
## main #3578 +/- ##
==========================================
+ Coverage 74.30% 74.32% +0.02%
==========================================
Files 772 773 +1
Lines 109190 109358 +168
==========================================
+ Hits 81131 81285 +154
- Misses 28059 28073 +14
LGTM.
We recently proposed using row-based encoding in https://singularity-data.quip.com/5TutAoAlk6Oa/RFC-Row-based-encoding-in-relational-table-layer, but I'm not sure whether the conclusion has been synced to you. cc @wcy-fdu
I see. Thanks for sharing. I just read through the doc. If cell-based encoding will be phased out, then dedup pk cell-based encoding won't be needed either. Since I will be away, I will leave this unmerged for now, as it might become obsolete. In the meantime I will see how discussion on
Maybe we will keep cell-based encoding at this stage and even for a long time. In the future, after row-based encoding is implemented, we need to do more detailed benchmarks to compare cell-based and row-based encoding, and I think
BTW, I will review this PR later 😊
I see, thanks for clarifying this!
Rest LGTM, good work!
/// Create a [`DedupPkCellBasedRowDeserializer`]
/// to decode cell based row with dedup pk encoding.
pub fn new(column_mapping: Desc, pk_descs: &[OrderedColumnDesc]) -> Self {
    let (pk_data_types, pk_order_types) = pk_descs
The RowDeserializer / Serializer was designed to be created without overhead, which is why I introduced ColumnDescMapping earlier to cache the HashMap of the column mapping. For the DedupPk serializer, it would be better to have a new DedupColumnDescMapping to store all such generated info (e.g. pk_to_row_mapping). This can be done in later PRs.
The ultimate solution will be to cache all such information in a StateTableInfo struct and pass it everywhere... I guess @BugenZhao is working on this.
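To make the caching idea in this thread concrete, here is a minimal sketch of what a precomputed mapping could look like. It is not code from this PR: the struct layout, the `new` signature, and the use of plain `i32` column ids are all assumptions made for illustration. Only the names DedupColumnDescMapping and pk_to_row_mapping come from the comments above.

```rust
use std::collections::HashMap;

/// Hypothetical cache of the per-table info a dedup pk (de)serializer needs.
/// The real type in the codebase may look quite different.
struct DedupColumnDescMapping {
    /// For each pk position, the index of the same column in the row,
    /// or `None` if that pk column does not appear in the row.
    pk_to_row_mapping: Vec<Option<usize>>,
}

impl DedupColumnDescMapping {
    /// Build the mapping once from column ids (assumed to be `i32` here),
    /// so that creating a deserializer later needs no extra work.
    fn new(row_column_ids: &[i32], pk_column_ids: &[i32]) -> Self {
        // Column id -> position in the row, computed a single time.
        let id_to_row_idx: HashMap<i32, usize> = row_column_ids
            .iter()
            .enumerate()
            .map(|(idx, &id)| (id, idx))
            .collect();
        let pk_to_row_mapping = pk_column_ids
            .iter()
            .map(|id| id_to_row_idx.get(id).copied())
            .collect();
        Self { pk_to_row_mapping }
    }
}
```

A value like this could be built once per table and shared behind a reference or `Arc`, which is the same role ColumnDescMapping is described as playing for the non-dedup deserializer.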
if let Some((_vnode, pk, row)) = raw_result {
    let pk_datums = self.pk_deserializer.deserialize(&pk)?;
    Ok(Some((
        _vnode,
Suggested change:
- _vnode,
+ vnode,
If a value is actually used, we can remove the underscore.
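As a small illustration of the naming convention behind this suggestion: in Rust, a leading underscore tells the compiler that a binding is intentionally unused, which silences the `unused_variables` lint. The two functions below are hypothetical and only loosely modelled on the snippet under review; the tuple shape is an assumption.

```rust
/// The vnode is part of the result, so the binding is named `vnode`.
fn vnode_and_key_len(raw: Option<(u8, Vec<u8>)>) -> Option<(u8, usize)> {
    if let Some((vnode, key)) = raw {
        Some((vnode, key.len()))
    } else {
        None
    }
}

/// The vnode is deliberately ignored here, so `_vnode` is appropriate:
/// the underscore documents that intent and avoids an unused-variable warning.
fn key_len_only(raw: Option<(u8, Vec<u8>)>) -> Option<usize> {
    raw.map(|(_vnode, key)| key.len())
}
```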
    })
    .collect();

let inner = CellBasedRowDeserializer::new(column_mapping);
Shall we remove deduped pks from column mapping?
If we use the original column mapping, the inner CellBasedRowDeserializer will leave placeholders for the missing deduped pk datums when deserializing. Then we can just replace the placeholders with the deduped pk datums.
If we remove deduped pks from the column mapping, we can't do this. The deserialized form will be compact, so we would probably need another step to map the deduped row plus the deduped pk datums to a result row.
Both ways are possible, but I prefer the first.
For example, given [1, 2, 3] with pk_indices = [1], we store [1, 3] under dedup pk encoding.
When deserializing, the result from the inner CellBasedRowDeserializer is:
with original column mapping: [Some(1), None, Some(3)] // can just replace None with the deduped pk datum
with dedup column mapping: [Some(1), Some(3)]
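The two options can be sketched side by side using the [1, 2, 3] example. This is an illustrative sketch only, not the PR's implementation: the function names are hypothetical, `i64` stands in for real datums, and `None` is used purely as the "missing cell" placeholder.

```rust
/// Option 1 (original column mapping): the inner deserializer returns a
/// full-width row with `None` where a deduped pk column was not stored.
/// Restoring the row is a direct write into each placeholder slot.
fn fill_placeholders(
    mut row: Vec<Option<i64>>,
    pk_indices: &[usize],
    pk_datums: &[i64],
) -> Vec<Option<i64>> {
    for (&row_idx, &datum) in pk_indices.iter().zip(pk_datums) {
        row[row_idx] = Some(datum);
    }
    row
}

/// Option 2 (dedup column mapping): the inner deserializer returns a compact
/// row, so an extra merge step has to interleave stored cells and pk datums.
fn merge_compact(compact: &[i64], pk_indices: &[usize], pk_datums: &[i64]) -> Vec<i64> {
    let width = compact.len() + pk_indices.len();
    let mut out = Vec::with_capacity(width);
    let mut rest = compact.iter().copied();
    for idx in 0..width {
        match pk_indices.iter().position(|&p| p == idx) {
            // This output column was deduped: take it from the pk.
            Some(pk_pos) => out.push(pk_datums[pk_pos]),
            // Otherwise take the next stored cell in order.
            None => out.push(rest.next().expect("compact row too short")),
        }
    }
    out
}
```

With pk_indices = [1] and the pk datum 2, `fill_placeholders` turns [Some(1), None, Some(3)] into [Some(1), Some(2), Some(3)], while `merge_compact` rebuilds [1, 2, 3] from [1, 3]. The first needs no index bookkeeping beyond the pk indices, which is consistent with the preference stated above.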
impl<Desc: Deref<Target = ColumnDescMapping>> DedupPkCellBasedRowDeserializer<Desc> {
    /// Create a [`DedupPkCellBasedRowDeserializer`]
    /// to decode cell based row with dedup pk encoding.
    pub fn new(column_mapping: Desc, pk_descs: &[OrderedColumnDesc]) -> Self {
It would be more intuitive to specify pk indices instead of the ColumnDescs 🤣 But this looks okay to me for now.
* add dedup deserializer skeleton
* add dedup pk deserialization logic
* fmt
* add tests
* fmt
* config test module
* add docs
* add docs
* rerun ci
* remove underscore
* add todos for DedupPkCellDeserializer instantiation
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
Implement dedup pk cell based deserializer.
It will be used by relational layer for dedup pk encoding.
DedupPkCellBasedRowDeserializer uses CellBasedRowDeserializer internally to deserialize rows. It then fills in the de-duplicated datums from the pk.
Checklist
Tests for dedup pk encoding ser/de will be done separately. Tracked in Tracking: Cell encoding - store a column either in pk or value, but not both #3412.
./risedev check (or alias, ./risedev c)
Refer to a related PR or issue link (optional)
#3412