Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(storage): support range-filter scan by sort key #644

Merged
merged 4 commits into from
May 7, 2022
Merged

feat(storage): support range-filter scan by sort key #644

merged 4 commits into from
May 7, 2022

Conversation

shmiwy
Copy link
Contributor

@shmiwy shmiwy commented May 4, 2022

Signed-off-by: Shmiwy wyf000219@126.com
close #589
for now only support range-filter scan by first column and this column should by primary key

Signed-off-by: Shmiwy <wyf000219@126.com>
@skyzh skyzh changed the title feat: support range-filter scan by sort key feat(storage): support range-filter scan by sort key May 4, 2022
@skyzh skyzh requested review from likg227 and skyzh May 4, 2022 08:26
Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you please add some unit tests for RowSetIterator for range filters before starting review? 🤪

@shmiwy
Copy link
Contributor Author

shmiwy commented May 4, 2022

Would you please add some unit tests for RowSetIterator for range filters before starting review? 🤪

oh, I forgot again....🥲

Signed-off-by: Shmiwy <wyf000219@126.com>
@shmiwy
Copy link
Contributor Author

shmiwy commented May 6, 2022

For now, I only support int32 for the sort key, and only the first column can by used to support range filter scan. I will support ohter type later(maybe use macros to reuse the logic)

Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic is basically correct, but there are some small issues. Comments below.

Also one idea come in my mind: maybe we can make range filter part of the filter condition, so that we don't need to compute which rows should be removed by ourselves, and re-use the current expression framework?

We can doc our RowSetIterator and Scan interface like: begin / end sort key must be included in filter condition as 10 < x and x < 15, so as to make range filter scan work correctly.

Now we only need to do two additional things compared with the previous code:

  1. seek to the block (no need to be accurate, as filter condition will help us filter rows out-of-range)
  2. pre-compute the block that contains end_sort_key instead of checking the last item in the data chunk. That's because filter condition will help us filter rows out-of-range, and it might incur a lot I/Os before a row can be produced. So if we stop at the block instead of checking every data chunk, things could be easier.

For example, let take this SQL as example:

select * from table where x > 10 and x < 15;

The RowSetIterator will be created as:

begin_sort_key: [10], end_sort_key: [15], filter: 10 < x and x < 15

Before creating RowSetIterator, we will position the first block with 10 occurrence (namely block A), and the last block with 15 occurrence (namely block B). At the beginning, the RowSetIterator will be positioned at the first row of Block A. When the RowSetIterator goes beyond block B (we can check this by current row id > last row of block B), we return the end of iterator.

Therefore, things could be a lot easier. Also, we can easily support > and >= conditions.

) -> StorageResult<RowSetIterator> {
RowSetIterator::new(self.clone(), column_refs, dvs, seek_pos, expr).await
RowSetIterator::new(self.clone(), column_refs, dvs, seek_pos, expr, end_sort_key).await
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we only pass end_sort_key here? I guess this function should have both begin_sort_key and end_sort_key as parameter?

src/storage/secondary/rowset/disk_rowset.rs Show resolved Hide resolved

// Todo: only suppor range-filter scan by sort key type of int32, support other type
// later.
if end_key - &array.get(len - 1) < DataValue::Int32(0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly compare by end_key < &array.get(len - 1)?

Copy link
Contributor Author

@shmiwy shmiwy May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I found that we have implemented addition, subtraction, multiplication and division between DataValue(like int32, int64). so I try reuse the logic to avoid many match arms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This thing is a little weird. I'll think of something else.😂

Signed-off-by: Shmiwy <wyf000219@126.com>
@shmiwy
Copy link
Contributor Author

shmiwy commented May 7, 2022

just fix some bug in this commit, I will make range filter part of the filter condition in the later pr

Signed-off-by: Shmiwy <wyf000219@126.com>
Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM!

/// If `begin_key` is greater than all blocks' `first_key`, we return the `first_key` of the
/// last block.
/// Todo: support multi sort-keys range filter
pub async fn start_rowid(&self, begin_keys: &[DataValue]) -> ColumnSeekPosition {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are assuming that the first column of the RowSet is pk, and pk is non-nullable. Should enforce this constraint in binder in later PRs.

break;
}
pre_block_first_key = index.first_rowid;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One optimization can be done in the future PRs:

  • Use partition_point function to do binary search, which could be a lot faster.

@@ -95,13 +101,22 @@ impl RowSetIterator {
dvs,
column_iterators,
filter_expr,
start_keys: start_keys.to_vec(),
Copy link
Member

@skyzh skyzh May 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simply pass begin_row_id and end_row_id into RowSetIterator? So that we don’t need complex logic to compare chunk against sort key in next_batch. begin_row_id and end_row_id include the range to scan, and we use filter scan to filter data. For example, the outer function determines that block 2 - 3 contains data for user specified sort key. So we can pass block 2's first row_id as begin_row_id, and block 3’s first row_id + row_count as end_row_id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May fix in later PRs.

.into(),
vec![],
ColumnSeekPosition::RowId(168),
None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filter condition should include the range info. Here we can construct an expression manually -- 180 < InputRef(0) < 195.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May fix in later PRs.

@skyzh
Copy link
Member

skyzh commented May 7, 2022

just fix some bug in this commit, I will make range filter part of the filter condition in the later pr

Okay, let me re-do some reviews. Now I assume that filter condition doesn't include range filter. If you make range filter part of the filter condition, there are a lot of code that can be removed later :)

Copy link
Member

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if you don't want to include range filter in filter condition for now. Let's improve this in future PRs.

let len = array.len();
let start_key = &self.start_keys[0];
let start_row_id =
(0..len).position(|idx| start_key - &array.get(idx) <= DataValue::Int32(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do a binary search. As we will eventually remove this part, looks okay to me now.

@skyzh
Copy link
Member

skyzh commented May 7, 2022

So to summarize, there are a list of things to do after this PR gets merged:

  • Enforce pk not null, and pks are the first, second, etc., columns in binder. Alternatives: just ensure .iter()'s first column is StorageColumnRef::Idx(x), where x is pk (can be done in optimizer).
  • Include range filter in filter condition, and only pass approximate begin/end row id to RowSetIterator. The actual filter will be done by filter condition, and begin/end row id don't need to be accurate.
  • Use partition_point (or implement binary search by ourselves) when possible.

@skyzh skyzh merged commit f140187 into risinglightdb:main May 7, 2022
@shmiwy shmiwy deleted the range_feature branch May 7, 2022 08:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

storage: support range-filter scan by sort key
2 participants