-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(storage): support range-filter scan by sort key #644
Conversation
Signed-off-by: Shmiwy <wyf000219@126.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you please add some unit tests for RowSetIterator
for range filters before starting review? 🤪
oh, I forgot again....🥲 |
Signed-off-by: Shmiwy <wyf000219@126.com>
For now, I only support int32 for the sort key, and only the first column can by used to support range filter scan. I will support ohter type later(maybe use macros to reuse the logic) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is basically correct, but there are some small issues. Comments below.
Also one idea come in my mind: maybe we can make range filter part of the filter condition, so that we don't need to compute which rows should be removed by ourselves, and re-use the current expression framework?
We can doc our RowSetIterator and Scan interface like: begin / end sort key must be included in filter condition as 10 < x and x < 15
, so as to make range filter scan work correctly.
Now we only need to do two additional things compared with the previous code:
- seek to the block (no need to be accurate, as filter condition will help us filter rows out-of-range)
- pre-compute the block that contains
end_sort_key
instead of checking the last item in the data chunk. That's because filter condition will help us filter rows out-of-range, and it might incur a lot I/Os before a row can be produced. So if we stop at the block instead of checking every data chunk, things could be easier.
For example, let take this SQL as example:
select * from table where x > 10 and x < 15;
The RowSetIterator will be created as:
begin_sort_key: [10], end_sort_key: [15], filter: 10 < x and x < 15
Before creating RowSetIterator, we will position the first block with 10
occurrence (namely block A), and the last block with 15
occurrence (namely block B). At the beginning, the RowSetIterator will be positioned at the first row of Block A. When the RowSetIterator goes beyond block B (we can check this by current row id > last row of block B), we return the end of iterator.
Therefore, things could be a lot easier. Also, we can easily support >
and >=
conditions.
) -> StorageResult<RowSetIterator> { | ||
RowSetIterator::new(self.clone(), column_refs, dvs, seek_pos, expr).await | ||
RowSetIterator::new(self.clone(), column_refs, dvs, seek_pos, expr, end_sort_key).await |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we only pass end_sort_key
here? I guess this function should have both begin_sort_key
and end_sort_key
as parameter?
|
||
// Todo: only suppor range-filter scan by sort key type of int32, support other type | ||
// later. | ||
if end_key - &array.get(len - 1) < DataValue::Int32(0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Directly compare by end_key < &array.get(len - 1)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I found that we have implemented addition, subtraction, multiplication and division between DataValue
(like int32, int64). so I try reuse the logic to avoid many match
arms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This thing is a little weird. I'll think of something else.😂
Signed-off-by: Shmiwy <wyf000219@126.com>
just fix some bug in this commit, I will make range filter part of the filter condition in the later pr |
Signed-off-by: Shmiwy <wyf000219@126.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rest LGTM!
/// If `begin_key` is greater than all blocks' `first_key`, we return the `first_key` of the | ||
/// last block. | ||
/// Todo: support multi sort-keys range filter | ||
pub async fn start_rowid(&self, begin_keys: &[DataValue]) -> ColumnSeekPosition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are assuming that the first column of the RowSet is pk, and pk is non-nullable. Should enforce this constraint in binder in later PRs.
break; | ||
} | ||
pre_block_first_key = index.first_rowid; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One optimization can be done in the future PRs:
- Use
partition_point
function to do binary search, which could be a lot faster.
@@ -95,13 +101,22 @@ impl RowSetIterator { | |||
dvs, | |||
column_iterators, | |||
filter_expr, | |||
start_keys: start_keys.to_vec(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can simply pass begin_row_id and end_row_id into RowSetIterator? So that we don’t need complex logic to compare chunk against sort key in next_batch. begin_row_id and end_row_id include the range to scan, and we use filter scan to filter data. For example, the outer function determines that block 2 - 3 contains data for user specified sort key. So we can pass block 2's first row_id as begin_row_id, and block 3’s first row_id + row_count as end_row_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May fix in later PRs.
.into(), | ||
vec![], | ||
ColumnSeekPosition::RowId(168), | ||
None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filter condition should include the range info. Here we can construct an expression manually -- 180 < InputRef(0) < 195
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May fix in later PRs.
Okay, let me re-do some reviews. Now I assume that filter condition doesn't include range filter. If you make range filter part of the filter condition, there are a lot of code that can be removed later :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if you don't want to include range filter in filter condition for now. Let's improve this in future PRs.
let len = array.len(); | ||
let start_key = &self.start_keys[0]; | ||
let start_row_id = | ||
(0..len).position(|idx| start_key - &array.get(idx) <= DataValue::Int32(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can do a binary search. As we will eventually remove this part, looks okay to me now.
So to summarize, there are a list of things to do after this PR gets merged:
|
Signed-off-by: Shmiwy wyf000219@126.com
close #589
for now only support range-filter scan by first column and this column should by primary key