Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): initially introduce vnode encoding in cell-based table #3407

Merged
merged 17 commits into from
Jun 23, 2022

Conversation

BugenZhao
Copy link
Member

@BugenZhao BugenZhao commented Jun 22, 2022

I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.

What's changed and what's your intention?

This PR included multiple changes.

  • Introduce a generic AccessMode for cell-based table.
    With READ_ONLY, the write interfaces will not be exposed and column pruning will be done by specifying a subset output_columns of all columns. With READ_WRITE, the instance should operate on all columns of this table, i.e., the output columns are a complete set. Under this assumption, we can ensure the correctness of some interfaces much more easily.

  • Encode/decode the vnode in cell_based_row_serializer/deserializer, all handled by CellBasedTable.
    If a None for distribution keys is given, we'll fall back to the default vnode of 0x00 and encode/decode it anyway. So this reserve the order of real keys and is compatible with the old implementation when point getting or iterating. Besides, the vnode value of 0x00 can also be compressed by Hummock.

  • Remove const generic of ITER_TYPE.
    There'll be two dimensions of iterator types in the future: whether to wait for the epoch, and whether to preserve the order among vnodes. Let's make them runtime parameters to avoid type exercises.

There remain several things to do as well. Check #3316 for the roadmap.

  • The merge iterator of multiple vnode ranges is not implemented yet to avoid the PR being huge. Thus, all of the vnodes fall back to 0x00 now to keep the correctness of iterator behavior. 馃ぃ

  • Lookup of delta join still uses the Keyspace API to scan the arrangement. Some workarounds are introduced. May need to refactor before enabling real vnode computing.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Refer to a related PR or issue link (optional)

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
@BugenZhao BugenZhao changed the title feat(storage): initially introduce vnode encoding in cell-based table [WIP] feat(storage): initially introduce vnode encoding in cell-based table Jun 22, 2022
@BugenZhao BugenZhao marked this pull request as ready for review June 22, 2022 10:27
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
@codecov
Copy link

codecov bot commented Jun 22, 2022

Codecov Report

Merging #3407 (2fb0da7) into main (8463a0e) will decrease coverage by 0.03%.
The diff coverage is 81.96%.

@@            Coverage Diff             @@
##             main    #3407      +/-   ##
==========================================
- Coverage   73.84%   73.81%   -0.04%     
==========================================
  Files         765      765              
  Lines      105273   105351      +78     
==========================================
+ Hits        77742    77766      +24     
- Misses      27531    27585      +54     
Flag Coverage 螖
rust 73.81% <81.96%> (-0.04%) 猬囷笍

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage 螖
src/batch/src/executor/row_seq_scan.rs 19.25% <酶> (酶)
src/common/src/types/mod.rs 67.75% <酶> (酶)
src/storage/src/lib.rs 100.00% <酶> (酶)
src/storage/src/table/mem_table.rs 78.57% <酶> (酶)
src/stream/src/executor/mview/materialize.rs 88.88% <酶> (酶)
src/storage/src/table/cell_based_table.rs 70.86% <76.00%> (-4.95%) 猬囷笍
src/storage/src/cell_based_row_deserializer.rs 92.57% <95.34%> (+0.85%) 猬嗭笍
src/storage/src/cell_based_row_serializer.rs 100.00% <100.00%> (+34.48%) 猬嗭笍
src/storage/src/table/state_table.rs 95.74% <100.00%> (+0.47%) 猬嗭笍
src/storage/src/table/test_relational_table.rs 98.08% <100.00%> (酶)
... and 11 more

馃摚 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Copy link
Contributor

@wcy-fdu wcy-fdu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!
May ask someone else to review the implementation details.

use crate::{Keyspace, StateStore, StateStoreIter};

pub type AccessType = bool;
pub const READ_ONLY: AccessType = false;
pub const READ_WRITE: AccessType = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub const READ_WRITE: AccessType = true;
pub const READ_AND_WRITE: AccessType = true;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO READ_WRITE is better 馃

/// Get a [`StreamingIter`] with given `encoded_key_range`.
pub(super) async fn streaming_iter_with_encoded_key_range<R, B>(
&self,
encoded_key_range: R,
epoch: u64,
) -> StorageResult<StreamingIter<S>>
where
R: RangeBounds<B> + Send,
R: RangeBounds<B> + Send + Clone,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: why add this clone?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may construct multiple iterators if there're multiple vnodes, so we need to clone the range for all of them. We may let the caller to pass an Arc to avoid allocations.

Comment on lines 477 to 480
let iter = match iterators.len() {
0 => unreachable!(),
1 => iterators.into_iter().next().unwrap(),
_ => todo!("merge multiple vnode ranges"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the relationship between iterators.len() and vonde nums? Is it one-on-one?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on whether we need to preserve the order of primary keys. If so, we need to construct an iterator for every single vnode and merge them into a sorted one, or we can construct a single for a continuous range.

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Copy link
Contributor

@BowenXiao1999 BowenXiao1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM. Good PR!

self.deserialize_inner::<true>(raw_key, cell)
}

// TODO: remove this once we refactored lookup in delta join with cell-based table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't quite understand this workaround. I see in the code we already .extend(DEFAULT_VNODE) when calculate the prefix of scan key in lookup executor. Why still call deserialize_without_vnode instead of deserialize?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a pk prefix scan with Keyspace.append::scan instead of Keyspace::scan, so the vnode part and the pk prefix of the key passed into the deserializer are truncated.

Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
} else {
ValueMeta::default()
};
let vnode = self.compute_vnode_by_row(&row);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should vnode be computed by cell based table or state table? If we are to pass vnode from upper layer and do filtering accordingly, then memtable should also have the info of vnode.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by "do filtering accordingly"? 馃 Currently the memtable associated with the cell-based table will also be partitioned into the same vnode, so there's no need to do filtering.

Copy link
Contributor

@xx01cyx xx01cyx Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, executor only needs the data of a specific vnode. (Is that possible?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The streaming executor should tell the cell-based table the vnodes it cares about when constructing a CellBasedTable instance. The vnode bitmap changes only if the system scales in/out, and the executors will be dropped and rebuilt according to our design.

The batch executors may need this, which is unrelated to the memtable, however. Will implement in the following PRs.

Co-authored-by: Yuanxin Cao <60498509+xx01cyx@users.noreply.github.com>
Copy link
Contributor

@st1page st1page left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BugenZhao BugenZhao added the mergify/can-merge Indicates that the PR can be added to the merge queue label Jun 23, 2022
@mergify mergify bot merged commit be644d1 into main Jun 23, 2022
@mergify mergify bot deleted the bz/vnode-in-key-part-3 branch June 23, 2022 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
mergify/can-merge Indicates that the PR can be added to the merge queue type/feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

capability of encoding and decoding vnode in cell-based table
5 participants