feat(storage): initially introduce vnode encoding in cell-based table #3407
Conversation
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Codecov Report
```diff
@@            Coverage Diff             @@
##             main    #3407      +/-   ##
==========================================
- Coverage   73.84%   73.81%   -0.04%
==========================================
  Files         765      765
  Lines      105273   105351      +78
==========================================
+ Hits        77742    77766      +24
- Misses      27531    27585      +54
```
LGTM! I may ask someone else to review the implementation details.
```rust
use crate::{Keyspace, StateStore, StateStoreIter};

pub type AccessType = bool;
pub const READ_ONLY: AccessType = false;
pub const READ_WRITE: AccessType = true;
```
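As a hypothetical sketch (not the actual `CellBasedTable` code), a `bool` const generic like this can gate the write interface at compile time, which matches the PR's goal of only exposing write methods under `READ_WRITE`:

```rust
pub type AccessType = bool;
pub const READ_ONLY: AccessType = false;
pub const READ_WRITE: AccessType = true;

// Hypothetical table type, for illustration only.
pub struct Table<const MODE: AccessType> {
    rows: Vec<i32>,
}

impl<const MODE: AccessType> Table<MODE> {
    pub fn new() -> Self {
        Self { rows: Vec::new() }
    }

    // The read interface is available in both modes.
    pub fn get(&self, i: usize) -> Option<&i32> {
        self.rows.get(i)
    }
}

// The write interface is only exposed for READ_WRITE instances;
// calling `insert` on a `Table<{ READ_ONLY }>` fails to compile.
impl Table<{ READ_WRITE }> {
    pub fn insert(&mut self, v: i32) {
        self.rows.push(v);
    }
}
```

Note the braces in `Table<{ READ_WRITE }>`: a bare path in generic-argument position is resolved as a type, so a block expression is needed to force const-argument interpretation.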
```diff
-pub const READ_WRITE: AccessType = true;
+pub const READ_AND_WRITE: AccessType = true;
```
IMO `READ_WRITE` is better 🤔
```diff
 /// Get a [`StreamingIter`] with given `encoded_key_range`.
 pub(super) async fn streaming_iter_with_encoded_key_range<R, B>(
     &self,
     encoded_key_range: R,
     epoch: u64,
 ) -> StorageResult<StreamingIter<S>>
 where
-    R: RangeBounds<B> + Send,
+    R: RangeBounds<B> + Send + Clone,
```
Just curious: why add this `Clone`?
We may construct multiple iterators if there are multiple vnodes, so we need to clone the range for each of them. We may let the caller pass an `Arc` to avoid allocations.
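A rough sketch of why the bound is needed, with a hypothetical helper name: each vnode gets its own vnode-prefixed copy of the caller's range, so the range must be `Clone` (unless the caller passes an `Arc` instead):

```rust
use std::ops::{Bound, RangeBounds};

// Hypothetical helper (not from the PR): build one encoded key range per
// vnode by prefixing the vnode byte. Each vnode needs its own copy of the
// caller's range, hence the `Clone` bound. Bound kinds (inclusive vs.
// exclusive) are ignored here for brevity.
fn ranges_per_vnode<R>(range: R, vnodes: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)>
where
    R: RangeBounds<Vec<u8>> + Clone,
{
    vnodes
        .iter()
        .map(|&vnode| {
            let r = range.clone(); // this is why `Clone` is required
            let start = match r.start_bound() {
                Bound::Included(k) | Bound::Excluded(k) => k.clone(),
                Bound::Unbounded => Vec::new(),
            };
            let end = match r.end_bound() {
                Bound::Included(k) | Bound::Excluded(k) => k.clone(),
                Bound::Unbounded => vec![0xff],
            };
            // Prefix the vnode byte to both ends of the range.
            let mut s = vec![vnode];
            s.extend(start);
            let mut e = vec![vnode];
            e.extend(end);
            (s, e)
        })
        .collect()
}
```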
```rust
let iter = match iterators.len() {
    0 => unreachable!(),
    1 => iterators.into_iter().next().unwrap(),
    _ => todo!("merge multiple vnode ranges"),
};
```
What's the relationship between `iterators.len()` and the number of vnodes? Is it one-to-one?
It depends on whether we need to preserve the order of primary keys. If so, we need to construct an iterator for every single vnode and merge them into a sorted one; otherwise, we can construct a single iterator for a continuous range.
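The order-preserving case described above is essentially a k-way merge. A minimal sketch over plain sorted vectors (standing in for per-vnode iterators, which the PR leaves as `todo!`):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Hypothetical sketch: merge several streams, each already sorted by
// primary key, into one globally sorted stream using a min-heap.
fn merge_sorted(inputs: Vec<Vec<u64>>) -> Vec<u64> {
    let mut iters: Vec<_> = inputs.into_iter().map(|v| v.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    // Seed the heap with the head of every non-empty iterator.
    for (i, it) in iters.iter_mut().enumerate() {
        if let Some(k) = it.next() {
            heap.push(Reverse((k, i)));
        }
    }
    let mut out = Vec::new();
    // Repeatedly pop the smallest key and refill from its source iterator.
    while let Some(Reverse((k, i))) = heap.pop() {
        out.push(k);
        if let Some(next) = iters[i].next() {
            heap.push(Reverse((next, i)));
        }
    }
    out
}
```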
Rest LGTM. Good PR!
```rust
self.deserialize_inner::<true>(raw_key, cell)
}

// TODO: remove this once we refactored lookup in delta join with cell-based table
```
I still don't quite understand this workaround. I see in the code we already `.extend(DEFAULT_VNODE)` when calculating the prefix of the scan key in the lookup executor. Why still call `deserialize_without_vnode` instead of `deserialize`?
This is actually a pk prefix scan with `Keyspace.append::scan` instead of `Keyspace::scan`, so the vnode part and the pk prefix of the key passed into the deserializer are truncated.
```rust
} else {
    ValueMeta::default()
};
let vnode = self.compute_vnode_by_row(&row);
```
Should the vnode be computed by the cell-based table or the state table? If we are to pass the vnode from the upper layer and do filtering accordingly, then the memtable should also have the vnode info.
What do you mean by "do filtering accordingly"? 🤔 Currently the memtable associated with the cell-based table will also be partitioned into the same vnode, so there's no need to do filtering.
For example, an executor only needs the data of a specific vnode. (Is that possible?)
The streaming executor should tell the cell-based table the vnodes it cares about when constructing a `CellBasedTable` instance. The vnode bitmap changes only if the system scales in/out, and the executors will be dropped and rebuilt according to our design.
The batch executors may need this, which is, however, unrelated to the memtable. Will implement in the following PRs.
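To make `compute_vnode_by_row` concrete, here is a minimal sketch of mapping a row's distribution-key values to a vnode. The vnode count of 256 and the use of `DefaultHasher` are assumptions for illustration, not the actual hash scheme; the `None` fallback to vnode 0x00 matches the behavior described in this PR:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assumed vnode count, for illustration only.
const VNODE_COUNT: u64 = 256;

// Hypothetical sketch: hash the distribution-key values of a row to pick
// a vnode, falling back to the default vnode 0x00 when the table has no
// distribution keys.
fn compute_vnode(dist_key_values: Option<&[&str]>) -> u8 {
    match dist_key_values {
        None => 0x00, // the DEFAULT_VNODE fallback described in the PR
        Some(values) => {
            let mut hasher = DefaultHasher::new();
            for v in values {
                v.hash(&mut hasher);
            }
            (hasher.finish() % VNODE_COUNT) as u8
        }
    }
}
```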
Co-authored-by: Yuanxin Cao <60498509+xx01cyx@users.noreply.github.com>
LGTM
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
What's changed and what's your intention?
This PR includes multiple changes.

- Introduce a generic `AccessMode` for cell-based table. With `READ_ONLY`, the write interfaces will not be exposed, and column pruning will be done by specifying a subset `output_columns` of all columns. With `READ_WRITE`, the instance should operate on all columns of this table, i.e., the output columns are a complete set. Under this assumption, we can ensure the correctness of some interfaces much more easily.
- Encode/decode the vnode in `cell_based_row_serializer/deserializer`, all handled by `CellBasedTable`. If `None` is given for the distribution keys, we'll fall back to the default vnode of 0x00 and encode/decode it anyway. This preserves the order of real keys and is compatible with the old implementation for point get and iteration. Besides, the vnode value of 0x00 can also be compressed by Hummock.
- Remove the const generic `ITER_TYPE`. There'll be two dimensions of iterator types in the future: whether to wait for the epoch, and whether to preserve the order among vnodes. Let's make them runtime parameters to avoid type exercises.

There remain several things to do as well. Check #3316 for the roadmap.

- The merge iterator over multiple vnode ranges is not implemented yet, to avoid making this PR huge. Thus, all of the vnodes fall back to 0x00 for now to keep the iterator behavior correct. 🤣
- Lookup of delta join still uses the `Keyspace` API to scan the arrangement, so some workarounds are introduced. It may need refactoring before enabling real vnode computing.

Checklist

- `./risedev check` (or alias, `./risedev c`)

Refer to a related PR or issue link (optional)
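The order-preservation claim for the default vnode can be sketched in a few lines: prefixing every key with the same 0x00 byte keeps the relative order of the encoded keys (the helper name is hypothetical):

```rust
// Hypothetical helper: encode a key under the default vnode 0x00.
// Since every key gets the same one-byte prefix, the lexicographic
// order of the encoded keys matches that of the original keys.
fn with_default_vnode(key: &[u8]) -> Vec<u8> {
    let mut out = vec![0x00];
    out.extend_from_slice(key);
    out
}
```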