feat(storage): introduce dedup_pk_row encoding for StateTable #3214
Codecov Report
@@ Coverage Diff @@
## main #3214 +/- ##
==========================================
+ Coverage 74.41% 74.46% +0.04%
==========================================
Files 768 769 +1
Lines 107793 108054 +261
==========================================
+ Hits 80218 80464 +246
- Misses 27575 27590 +15
license-eye has totally checked 852 files.
Valid | Invalid | Ignored | Fixed |
---|---|---|---|
850 | 1 | 1 | 0 |
Invalid file list:
- src/storage/src/table/dedup_pk_state_table.rs
Currently breaking with Simplified, the query is:
create table iii_t1 (v1 int, v2 int);
create table iii_t2 (v3 int, v4 int);
insert into iii_t1 values (2, 0), (3, 0), (0, 0), (1, 0);
insert into iii_t2 values (2, 5), (3, 4), (0, 3), (1, 2);
flush;
create index iii_index_1 on iii_t1(v1);
create index iii_index_2 on iii_t2(v3);
create materialized view iii_mv2 as select * from iii_t1, iii_t2 where iii_t1.v1 = iii_t2.v3;
select v1, v2, v3, v4 from iii_mv2;
As can be seen by the plan, this uses
Under the hood
Still thinking of how to deal with this.
All readers will need to use dedup pk decoding, as long as they read from storage which has dedup pk encoding. We need to extend lookup to use that too.
In this or a future PR, we need to make it easy to toggle dedup pk encoding.
Did further digging into how lookup executor works. When scanning from storage we first get a key_prefix: Which gets masked from the pk: This also means that the Need to think of how to get this dedupped datum.
Thinking whether to implement this e2e just yet, because of changes to various executors. Will elaborate more soon.
Partially fixed e2e tests ( Checkpointing work here for now. I will be breaking this PR up into two chunks:
src/storage/src/table/state_table.rs
Outdated
#[derive(Clone)]
pub struct StateTable<S: StateStore> {
pub struct StateTableExtended<S: StateStore, SER: CellSerializer> {
I'm not sure whether this is by design.
Maybe I missed some of the discussion, is there any RFC about this?
Ready for review again~
I'll help to resolve the conflicts brought by #3407.
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Done. @kwannoel, could you please review the conflict resolution? Besides, I've also made these changes:
Both of these changes LGTM, thanks for the help!
Reviewed and LGTM too 👍
Overall LGTM
}
}

impl CellSerializer for DedupPkCellBasedRowSerializer {
The name looks a bit confusing to me. RowSerializer suggests serializing a row (a vec), while CellSerializer suggests serializing a cell (a Datum).
/// 1. Row indices of datums not in pk,
/// 2. or datums which have to be stored regardless
///    (e.g. if memcomparable not equal to value encoding)
dedup_datum_indices: HashSet<usize>,
I'm not sure whether this considers a situation where all cells of a state table are pk columns (table_columns.len = pk.len), just like #3474. In this case we cannot dedup every pk column; at least one pk column must not be deduplicated.
Can SENTINEL_CELL_ID help with this case? Would require changes to the logic of deserialization though.
Adding some tests to verify this
Added test case: https://github.com/singularity-data/risingwave/blob/cc4d1d196313f9bebfa2fd44e2cd3436d70652b8/src/storage/src/dedup_pk_cell_based_row_serializer.rs#L200
Tests the case where all datums are dedupped. As @lmatz mentioned, the SENTINEL cell will still be pushed:
Would require changes to the logic of deserialization though.
Regarding deserialization, we can also rely on cell_based_row_deserializer to do it, so I don't think many changes are needed:
Just need to replace the deduplicated datums with datums from pk. This logic can be in dedup_pk_cell_based_row_deserializer.
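The idea discussed in this thread can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not the code in this PR: `Datum` is reduced to `Option<i64>`, the value of `SENTINEL_CELL_ID` is made up, and `serialize_dedup` / `deserialize_dedup` are hypothetical helper names. It shows that a row whose columns are all pk columns still produces one (sentinel) cell, and that decoding only needs to fill the deduplicated columns back in from the pk.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a datum; the real type is richer.
type Datum = Option<i64>;

/// Hypothetical sentinel cell id (the value is an assumption). It is
/// written for every row, so a fully deduplicated row still has a cell.
const SENTINEL_CELL_ID: i32 = -1;

/// Emit the sentinel cell plus one cell per column listed in
/// `dedup_datum_indices` (the columns that must be stored).
fn serialize_dedup(row: &[Datum], dedup_datum_indices: &HashSet<usize>) -> Vec<(i32, Datum)> {
    let mut cells = vec![(SENTINEL_CELL_ID, None)];
    for (i, d) in row.iter().enumerate() {
        if dedup_datum_indices.contains(&i) {
            cells.push((i as i32, *d));
        }
    }
    cells
}

/// Rebuild the full row: copy the stored cells, then fill every pk
/// column that was not stored with the matching datum from the pk.
fn deserialize_dedup(
    cells: &[(i32, Datum)],
    pk: &[Datum],
    pk_indices: &[usize],
    dedup_datum_indices: &HashSet<usize>,
    num_columns: usize,
) -> Vec<Datum> {
    let mut row = vec![None; num_columns];
    for &(cell_id, d) in cells {
        if cell_id != SENTINEL_CELL_ID {
            row[cell_id as usize] = d;
        }
    }
    for (pk_pos, &col) in pk_indices.iter().enumerate() {
        if !dedup_datum_indices.contains(&col) {
            row[col] = pk[pk_pos];
        }
    }
    row
}
```

With `pk_indices = [0, 1]` and an empty `dedup_datum_indices`, a two-column row serializes to just the sentinel cell, and deserialization recovers both columns from the pk.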
What's changed and what's your intention?
Alternate design for dedup_pk_row encoding to #3143.
See #3143 (comment) for context.
Please explain IN DETAIL what the changes are in this PR and why they are needed:
- StateTable, CellTable.
- CellSerializer which relational layer is parameterized on.
- DedupPkStateTable, DedupPkCellTable, they are only parameterized on serializer. This means serializing and deserializing differs. (For StateTable, CellTable, serializing and deserializing still matches, behaviour is unchanged).
- dedup_pk_iter in cell_based_table can be removed eventually.
- dedup_pk_iter and related structs / traits, replace with parameterized cell table iter.
Checklist
- ./risedev check (or alias, ./risedev c)
Refer to a related PR or issue link (optional)
Previous implementation: #3143
Issue: #588
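
To make the parameterization in the PR description more concrete, here is a minimal sketch, assuming a much-simplified `CellSerializer` trait. The trait method, the `FullRowSerializer` and `DedupPkSerializer` types, the in-memory `cells` field, and the two type aliases are all hypothetical; the `S: StateStore` parameter from the diff is omitted to keep the example self-contained.

```rust
/// Hypothetical, simplified serializer trait: turns a row into
/// (column index, value) cells.
trait CellSerializer {
    fn serialize(&self, row: &[i64]) -> Vec<(usize, i64)>;
}

/// Stores every column (unchanged behaviour).
struct FullRowSerializer;

/// Skips columns that are part of the pk.
struct DedupPkSerializer {
    pk_indices: Vec<usize>,
}

impl CellSerializer for FullRowSerializer {
    fn serialize(&self, row: &[i64]) -> Vec<(usize, i64)> {
        row.iter().copied().enumerate().collect()
    }
}

impl CellSerializer for DedupPkSerializer {
    fn serialize(&self, row: &[i64]) -> Vec<(usize, i64)> {
        row.iter()
            .copied()
            .enumerate()
            .filter(|(i, _)| !self.pk_indices.contains(i))
            .collect()
    }
}

/// Table generic over its serializer; `cells` is an in-memory
/// stand-in for the state store.
struct StateTableExtended<SER: CellSerializer> {
    serializer: SER,
    cells: Vec<Vec<(usize, i64)>>,
}

impl<SER: CellSerializer> StateTableExtended<SER> {
    fn new(serializer: SER) -> Self {
        Self { serializer, cells: Vec::new() }
    }

    fn insert(&mut self, row: &[i64]) {
        let encoded = self.serializer.serialize(row);
        self.cells.push(encoded);
    }
}

// Assumed aliases: both tables share one implementation and differ
// only in the serializer they are parameterized on.
type StateTable = StateTableExtended<FullRowSerializer>;
type DedupPkStateTable = StateTableExtended<DedupPkSerializer>;
```

Keeping the table logic in one generic struct means the dedup variant needs no separate insert path; only the serializer implementation changes.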