feat(storage): support range-filter scan by sort key #644

shmiwy · 2022-05-04T08:21:09Z

Signed-off-by: Shmiwy wyf000219@126.com
close #589
For now, this only supports range-filter scan on the first column, and that column must be the primary key.

Signed-off-by: Shmiwy <wyf000219@126.com>

skyzh

Would you please add some unit tests for RowSetIterator for range filters before starting review? 🤪

shmiwy · 2022-05-04T08:29:49Z

> Would you please add some unit tests for RowSetIterator for range filters before starting review? 🤪

Oh, I forgot again... 🥲

Signed-off-by: Shmiwy <wyf000219@126.com>

shmiwy · 2022-05-06T15:53:40Z

For now, I only support int32 for the sort key, and only the first column can be used for range-filter scan. I will support other types later (maybe using macros to reuse the logic).

skyzh

The logic is basically correct, but there are some small issues. Comments below.

Also, one idea comes to mind: maybe we can make the range filter part of the filter condition, so that we don't need to compute which rows should be removed ourselves and can re-use the current expression framework.

We can document our RowSetIterator and Scan interface like this: the begin / end sort key must also be included in the filter condition, e.g. 10 < x and x < 15, so that range-filter scan works correctly.

Now we only need to do two additional things compared with the previous code:

- Seek to the block (no need to be accurate, as the filter condition will filter out out-of-range rows).
- Pre-compute the block that contains end_sort_key instead of checking the last item in each data chunk. Because the filter condition filters out out-of-range rows, it might take a lot of I/O before a row can be produced. So if we stop at the block boundary instead of checking every data chunk, things could be easier.

For example, take this SQL:

select * from table where x > 10 and x < 15;

The RowSetIterator will be created as:

begin_sort_key: [10], end_sort_key: [15], filter: 10 < x and x < 15

Before creating the RowSetIterator, we locate the first block in which 10 occurs (block A) and the last block in which 15 occurs (block B). Initially, the RowSetIterator is positioned at the first row of block A. When the RowSetIterator goes beyond block B (we can check this with current row id > last row id of block B), we return end of iteration.

Therefore, things could be a lot easier. Also, we can easily support > and >= conditions.
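
A minimal sketch of the idea above, using plain in-memory types rather than the real RowSetIterator. `Block`, `scan_range`, and the field names are hypothetical and only stand in for the on-disk structures. It assumes blocks are sorted by an i32 sort key. The point is that the block range only has to be approximate, because the filter removes out-of-range rows.

```rust
/// Hypothetical in-memory block; rows are sorted by an i32 sort key,
/// and blocks are sorted by `first_key`.
struct Block {
    first_key: i32,
    rows: Vec<i32>,
}

/// Return all rows with `begin < x && x < end`.
/// The block range [a, b] is chosen conservatively; the filter does the
/// exact work.
fn scan_range(blocks: &[Block], begin: i32, end: i32) -> Vec<i32> {
    if begin >= end {
        return Vec::new();
    }
    // Block A: last block starting strictly before `begin`, else block 0.
    let a = blocks
        .iter()
        .rposition(|blk| blk.first_key < begin)
        .unwrap_or(0);
    // Block B: last block starting at or before `end`. If there is none,
    // every row is greater than `end` and nothing can match.
    let b = match blocks.iter().rposition(|blk| blk.first_key <= end) {
        Some(idx) => idx,
        None => return Vec::new(),
    };
    blocks[a..=b]
        .iter()
        .flat_map(|blk| blk.rows.iter().copied())
        // Filter condition: 10 < x and x < 15 in the example above.
        .filter(|&x| begin < x && x < end)
        .collect()
}
```

With four blocks holding 1..=5, 6..=10, 11..=15, and 16..=20, calling `scan_range(&blocks, 10, 15)` reads only the second and third blocks and lets the filter drop everything outside the open interval.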

skyzh · 2022-05-07T02:09:49Z

src/storage/secondary/rowset/disk_rowset.rs

    ) -> StorageResult<RowSetIterator> {
-        RowSetIterator::new(self.clone(), column_refs, dvs, seek_pos, expr).await
+        RowSetIterator::new(self.clone(), column_refs, dvs, seek_pos, expr, end_sort_key).await


Why do we only pass end_sort_key here? I guess this function should take both begin_sort_key and end_sort_key as parameters?

skyzh · 2022-05-07T02:13:56Z

src/storage/secondary/rowset/rowset_iterator.rs

+
+                // Todo: only suppor range-filter scan by sort key type of int32, support other type
+                // later.
+                if end_key - &array.get(len - 1) < DataValue::Int32(0) {


Directly compare by end_key < &array.get(len - 1)?

Because I found that we have already implemented addition, subtraction, multiplication, and division between DataValues (like int32 and int64), I tried to reuse that logic to avoid many match arms.

This is a little weird. I'll think of something else. 😂
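
To illustrate the direct comparison suggested above, here is a sketch with a hypothetical `Value` enum standing in for DataValue (the real type has more variants and its own trait implementations, which are not shown here). Besides being shorter, a direct comparison avoids the subtraction, which for plain i32 arithmetic can overflow when the keys are near the ends of the range.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for `DataValue`; not the real type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int32(i32),
    Int64(i64),
}

impl PartialOrd for Value {
    /// Values are only comparable when both sides have the same variant.
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        match (self, other) {
            (Value::Int32(a), Value::Int32(b)) => Some(a.cmp(b)),
            (Value::Int64(a), Value::Int64(b)) => Some(a.cmp(b)),
            _ => None,
        }
    }
}

/// True when the last key of the chunk lies past `end_key`.
/// Mismatched variants compare as "not less", so this returns false.
fn chunk_exceeds_end(end_key: &Value, last_in_chunk: &Value) -> bool {
    end_key < last_in_chunk
}
```

For example, with `end_key = Int32(i32::MIN)` and a last key of `Int32(1)`, the comparison is well defined, while `i32::MIN - 1` does not fit in an i32.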

Signed-off-by: Shmiwy <wyf000219@126.com>

shmiwy · 2022-05-07T08:27:46Z

This commit just fixes some bugs. I will make the range filter part of the filter condition in a later PR.

Signed-off-by: Shmiwy <wyf000219@126.com>

skyzh

Rest LGTM!

skyzh · 2022-05-07T08:28:38Z

src/storage/secondary/rowset/disk_rowset.rs

+    /// If `begin_key` is greater than all blocks' `first_key`, we return the `first_key` of the
+    /// last block.
+    /// Todo: support multi sort-keys range filter
+    pub async fn start_rowid(&self, begin_keys: &[DataValue]) -> ColumnSeekPosition {


So we are assuming that the first column of the RowSet is the pk, and that the pk is non-nullable. We should enforce this constraint in the binder in later PRs.

skyzh · 2022-05-07T08:30:32Z

src/storage/secondary/rowset/disk_rowset.rs

+                        break;
+                    }
+                    pre_block_first_key = index.first_rowid;
+                }


One optimization that can be done in future PRs:

Use partition_point function to do binary search, which could be a lot faster.
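
A rough sketch of what that could look like with the standard library's `slice::partition_point`. `BlockIndexEntry` and its fields are hypothetical, not the actual block index type, and the key is assumed to be a single i32. It follows the documented behaviour that a `begin_key` greater than every block's `first_key` maps to the last block.

```rust
/// Hypothetical block index entry; the field names are assumptions.
struct BlockIndexEntry {
    first_key: i32,
    first_rowid: u32,
}

/// Row id of the first row of the block that may contain `begin_key`.
/// `indexes` must be sorted by `first_key`.
fn start_rowid(indexes: &[BlockIndexEntry], begin_key: i32) -> u32 {
    // Number of blocks whose first key is <= begin_key (binary search).
    let n = indexes.partition_point(|e| e.first_key <= begin_key);
    // The last such block is the candidate. If begin_key precedes every
    // block, fall back to the first block.
    let idx = n.saturating_sub(1);
    indexes.get(idx).map_or(0, |e| e.first_rowid)
}
```

This replaces the linear walk over the index with O(log n) comparisons and keeps the same fallbacks at both ends of the index.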

skyzh · 2022-05-07T08:32:57Z

src/storage/secondary/rowset/rowset_iterator.rs

@@ -95,13 +101,22 @@ impl RowSetIterator {
            dvs,
            column_iterators,
            filter_expr,
+            start_keys: start_keys.to_vec(),


I think we can simply pass begin_row_id and end_row_id into RowSetIterator? So that we don’t need complex logic to compare chunk against sort key in next_batch. begin_row_id and end_row_id include the range to scan, and we use filter scan to filter data. For example, the outer function determines that block 2 - 3 contains data for user specified sort key. So we can pass block 2's first row_id as begin_row_id, and block 3’s first row_id + row_count as end_row_id.

May fix in later PRs.
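
As a sketch of the block 2 - 3 example above, the outer function could compute a half-open row id range like this. `BlockMeta` and `row_id_range` are hypothetical names, and the row counts in the usage note are made up for illustration.

```rust
/// Hypothetical per-block metadata; the field names are assumptions.
struct BlockMeta {
    first_rowid: u32,
    row_count: u32,
}

/// Half-open row id range [begin_row_id, end_row_id) covering blocks
/// `first..=last`. Returns None if the indices are invalid.
fn row_id_range(blocks: &[BlockMeta], first: usize, last: usize) -> Option<(u32, u32)> {
    if first > last {
        return None;
    }
    let begin_row_id = blocks.get(first)?.first_rowid;
    let end_block = blocks.get(last)?;
    // End is the first row id of the last block plus its row count.
    let end_row_id = end_block.first_rowid + end_block.row_count;
    Some((begin_row_id, end_row_id))
}
```

If block 2 starts at row id 200 and block 3 starts at row id 300 with 50 rows, `row_id_range(&blocks, 2, 3)` gives `(200, 350)`, and the filter condition then removes the rows inside that range that do not match.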

skyzh · 2022-05-07T08:33:43Z

src/storage/secondary/rowset/rowset_iterator.rs

+                    .into(),
+                    vec![],
+                    ColumnSeekPosition::RowId(168),
+                    None,


Filter condition should include the range info. Here we can construct an expression manually -- 180 < InputRef(0) < 195.

May fix in later PRs.

skyzh · 2022-05-07T08:34:27Z

> This commit just fixes some bugs. I will make the range filter part of the filter condition in a later PR.

Okay, let me re-do some reviews. Now I assume that the filter condition doesn't include the range filter. If you make the range filter part of the filter condition, there is a lot of code that can be removed later :)

skyzh

LGTM if you don't want to include range filter in filter condition for now. Let's improve this in future PRs.

skyzh · 2022-05-07T08:36:26Z

src/storage/secondary/rowset/rowset_iterator.rs

+                    let len = array.len();
+                    let start_key = &self.start_keys[0];
+                    let start_row_id =
+                        (0..len).position(|idx| start_key - &array.get(idx) <= DataValue::Int32(0));


Can do a binary search. As we will eventually remove this part, looks okay to me now.

skyzh · 2022-05-07T08:40:16Z

So to summarize, there is a list of things to do after this PR gets merged:

- Enforce in the binder that the pk is not null and that pks are the first, second, etc. columns. Alternative: just ensure that .iter()'s first column is StorageColumnRef::Idx(x), where x is the pk (can be done in the optimizer).
- Include the range filter in the filter condition, and only pass approximate begin/end row ids to RowSetIterator. The actual filtering will be done by the filter condition, so the begin/end row ids don't need to be accurate.
- Use partition_point (or implement binary search ourselves) when possible.

feat: support range-filter scan by sort key

a63a7ff

Signed-off-by: Shmiwy <wyf000219@126.com>

skyzh changed the title ~~feat: support range-filter scan by sort key~~ feat(storage): support range-filter scan by sort key May 4, 2022

skyzh requested review from likg227 and skyzh May 4, 2022 08:26

skyzh reviewed May 4, 2022

feat: support range-filter scan by sort key

333abc5

Signed-off-by: Shmiwy <wyf000219@126.com>

skyzh reviewed May 7, 2022

skyzh mentioned this pull request May 7, 2022

optimizer: minimal rule of range scan #645

Open

feat: support range-filter scan by sort key

f23d4de

Signed-off-by: Shmiwy <wyf000219@126.com>

feat: support range-filter scan by sort key

2a55d77

Signed-off-by: Shmiwy <wyf000219@126.com>

skyzh reviewed May 7, 2022

skyzh approved these changes May 7, 2022

skyzh merged commit f140187 into risinglightdb:main May 7, 2022

shmiwy deleted the range_feature branch May 7, 2022 08:46
