feat(storage): support scan row handler only #447

Merged 4 commits on Feb 16, 2022

Fedomn
Member

@Fedomn Fedomn commented Feb 12, 2022

This PR tries to close #421, but I'm not sure my solution is right. I split the logic in rowset_iterator into two cases:

  1. column_ref has RowHandler and other user columns
  2. column_ref only has RowHandler

For the first case, I keep the original logic, in which the row count is calculated by common_chunk_range.
For the second case, I add a new branch that returns the total row_count of the first column directly.

Signed-off-by: Fedomn fedomn.ma@gmail.com

@skyzh skyzh requested review from likg227 and skyzh February 12, 2022 14:02
Member

@skyzh skyzh left a comment

The storage part looks generally good (and I have some new ideas for that), but the binder part doesn't look quite right for join statements.

The previous plan (note that it includes a bug in #388):

> create table t1(v1 int not null, v2 int);
created
> create table t2(v3 int, v4 varchar)
created
> explain select count(*) from t1 join t2 on v2=v3;
PhysicalProjection: exprs [InputRef #0]
  PhysicalSimpleAgg: 1 agg calls
    PhysicalHashJoin: op Inner, left_index 0,  right_index 0, predicate: Bool(true) (const) 
      PhysicalTableScan: table #8, columns [1, 0], with_row_handler: false, is_sorted: false, expr: None
      PhysicalTableScan: table #9, columns [0], with_row_handler: false, is_sorted: false, expr: None

The new plan:

> explain select count(*) from t1 join t2 on v2=v3; 
PhysicalProjection: exprs [InputRef #0]
  PhysicalSimpleAgg: 1 agg calls
    PhysicalHashJoin: op Inner, left_index 0,  right_index 0, predicate: Bool(true) (const) 
      PhysicalTableScan: table #8, columns [1], with_row_handler: true, is_sorted: false, expr: None
      PhysicalTableScan: table #9, columns [0], with_row_handler: true, is_sorted: false, expr: None

Indeed, if we are joining two tables, we do not need to scan the row handler -- we will always scan the join key.


From my perspective, there are two ways to support scanning row handlers only.

  • Add a RowHandlerIterator struct, and treat it in the same way as user column iterator.
  • Or we can store the current_row_id variable in RowSetIterator, and use this row id to sequence the row handler iterators.

Thanks for your contribution!

if row_handler_count == column_refs.len() {
    panic!("no user column")
}
let only_scan_row_handler = row_handler_count == column_refs.len();

I prefer to treat row handler column in the same way as user columns...

.indexes()
.iter()
.map(|x| x.row_count)
.sum::<u32>() as usize;

Summing the total row count doesn't look like a trivial operation: there could be thousands of blocks in a single column. I think we can use the original fetch_size for row handlers. Most executors prefer to consume chunks little by little, instead of getting them all at once.

Also, the expected_size parameter of next_batch requires that next_batch never return more than expected_size elements. This change seems to violate that rule.
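The constraint described in this comment can be sketched roughly as follows. This is only an illustration: `batch_len`, `remaining` and `fetch_size` are hypothetical names and are not taken from the RisingLight codebase. It shows how a batch length can be capped by both the per-call fetch size and the caller's `expected_size`.

```rust
/// Hypothetical helper (not part of RisingLight): the number of rows a single
/// `next_batch` call is allowed to return.
///
/// * `remaining`     - rows left in the column
/// * `fetch_size`    - the usual per-call fetch size
/// * `expected_size` - the caller's upper bound, if any
pub fn batch_len(remaining: usize, fetch_size: usize, expected_size: Option<usize>) -> usize {
    // Never return more than what is left or more than the fetch size.
    let len = remaining.min(fetch_size);
    // If the caller gave an upper bound, respect it as well.
    match expected_size {
        Some(limit) => len.min(limit),
        None => len,
    }
}
```

With a helper like this, returning the whole column in one call is ruled out whenever the caller passes an `expected_size`.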

@likg227
Contributor

likg227 commented Feb 13, 2022

Thanks for your contribution! But I can't simply determine the correctness of your code :)

@skyzh
Member

skyzh commented Feb 13, 2022

Thanks for your contribution! But I can't simply determine the correctness of your code :)

The logic in RowSetIterator is correct (from my perspective). Adding some unit tests might also help.

@Fedomn
Member Author

Fedomn commented Feb 15, 2022

Sorry for the late reply. I have added a RowHandlerIterator that is almost the same as ConcreteColumnIterator, but it seems the ConcreteColumnIterator refactor is still in progress, so I will keep following it.

@skyzh
Member

skyzh commented Feb 15, 2022

Sorry for the late reply. I have added a RowHandlerIterator that is almost the same as ConcreteColumnIterator, but it seems the ConcreteColumnIterator refactor is still in progress, so I will keep following it.

Thanks a lot! The ConcreteColumnIterator refactor has been completed, so you may need to merge (or rebase) and continue.

Signed-off-by: Fedomn <fedomn.ma@gmail.com>
@Fedomn
Member Author

Fedomn commented Feb 15, 2022

Some explanations about these changes for easier understanding:

  1. First commit: rewrite the whole logic using row_handler_iterator, which keeps the same behavior as concrete_column_iterator
  2. Second commit: merge row_handler_iterator into concrete_column_iterator to reduce duplicated logic in next_batch_inner
  3. Third commit: fix the unnecessary with_row_handler when joining tables

@skyzh skyzh self-requested a review February 15, 2022 14:50
Member

@skyzh skyzh left a comment

Thanks a lot for your contribution! But IMO, this implementation isn't very clear. The main problem is commit 7fbbf70, which implements the RowHandlerColumnIterator in a complex way. You could apply the following idea to implement the RowHandlerColumnIterator instead.

  • The RowHandlerColumnIterator doesn't need a corresponding block or the filter-scan logic -- it is a very simple iterator that returns i64 values in increasing order.
  • Therefore, there should be no change to concrete_column_iterator... We simply need to implement a new RowHandlerColumnIterator.

An example:

pub struct RowHandlerColumnIterator {
    rowset_id: usize,
    row_count: usize,
    current_row_id: usize,
}

impl RowHandlerColumnIterator {
    fn new(rowset_id: usize, first_row: usize) -> Self { ... }
}

Then, we can implement

impl ColumnIterator<I64Array> for RowHandlerColumnIterator {
    type NextFuture<'a> = impl Future<Output = StorageResult<Option<(u32, I64Array)>>> + 'a;

    fn next_batch(&mut self, expected_size: Option<usize>) -> Self::NextFuture<'_> {
        async move { /* produce an I64Array like the original `sequence` method, and take `expected_size` into account. */ }
    }

    fn fetch_hint(&self) -> usize {
        // simply return the remaining rows of the column
    }

    fn fetch_current_row_id(&self) -> u32 {
        // current_row_id is stored as usize, so convert it to the u32 the trait expects
        self.current_row_id as u32
    }

    fn skip(&mut self, cnt: usize) {
        self.current_row_id += cnt;
    }
}

Then, everything should be set -- we can use the RowHandlerIterator like other user columns. No need to implement a separate block iterator for row handler :)
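To make the idea above more concrete, here is a simplified, synchronous sketch of such an iterator. It is not the code that was merged: `RowHandlerSketch` is a hypothetical name, the async `ColumnIterator` trait is left out, a plain `Vec<i64>` stands in for `I64Array`, and the handler encoding (rowset id in the high 32 bits, row id in the low 32 bits) is an assumption made purely for illustration.

```rust
/// Hypothetical, synchronous stand-in for the row handler iterator
/// discussed above. It needs no block index and performs no I/O.
pub struct RowHandlerSketch {
    rowset_id: usize,
    row_count: usize,
    current_row_id: usize,
}

impl RowHandlerSketch {
    pub fn new(rowset_id: usize, row_count: usize, first_row: usize) -> Self {
        Self { rowset_id, row_count, current_row_id: first_row }
    }

    /// Rows that can still be produced without any I/O.
    pub fn fetch_hint(&self) -> usize {
        self.row_count.saturating_sub(self.current_row_id)
    }

    pub fn fetch_current_row_id(&self) -> u32 {
        self.current_row_id as u32
    }

    pub fn skip(&mut self, cnt: usize) {
        self.current_row_id += cnt;
    }

    /// Returns the first row id of the batch and at most `expected_size`
    /// handlers, or `None` once the rowset is exhausted.
    pub fn next_batch(&mut self, expected_size: Option<usize>) -> Option<(u32, Vec<i64>)> {
        let remaining = self.fetch_hint();
        if remaining == 0 {
            return None;
        }
        let len = expected_size.map_or(remaining, |n| n.min(remaining));
        let first = self.current_row_id;
        // ASSUMPTION: a handler packs the rowset id into the high 32 bits
        // and the row id into the low 32 bits.
        let handlers = (first..first + len)
            .map(|row| ((self.rowset_id as i64) << 32) | row as i64)
            .collect();
        self.current_row_id += len;
        Some((first as u32, handlers))
    }
}
```

Because the iterator only counts upward, `skip` and `fetch_hint` reduce to simple arithmetic on `current_row_id`.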

For the commit with binder ede0762, LGTM, good work!

@skyzh
Member

skyzh commented Feb 15, 2022

... the RowSet handler doesn't even need to know about how other columns are indexed -- it can return all its information without any I/O, so we can set its fetch_hint to the remaining items (instead of the remaining items in columns[0]'s block)

@Fedomn
Member Author

Fedomn commented Feb 15, 2022

Thanks a lot. I will apply this feedback.

@Fedomn
Member Author

Fedomn commented Feb 16, 2022

... the RowSet handler doesn't even need to know about how other columns are indexed -- it can return all its information without any I/O, so we can set its fetch_hint to the remaining items (instead of the remaining items in columns[0]'s block)

Sorry for my clumsiness, I still have a question about constructing RowHandlerColumnIterator:

  1. How to get row_count: currently, I use columns[0].index.row_count, but it's not clear to me how fetch_hint should be used?

@skyzh
Member

skyzh commented Feb 16, 2022

How to get row_count: currently, I use columns[0].index.row_count, but it's not clear to me how fetch_hint should be used?

Oops, I missed this point! I think there are multiple approaches to solve this...

  • Sum up row_count in RowSetIterator::new(), and pass the total_row_count to RowHandlerIterator.
  • ... Or we can store total row count in manifest / rowset / column index, and read it...
  • ... Or we can refactor fetch_hint to return Option<usize>. Note that if there is only one RowHandler column, we still have to calculate the row count in advance.

I think the first approach (which is what you already did a few days ago) looks like a good way. Computing the total row count in new instead of in next_batch can be much more efficient, since new is only called once. Also, the row count should only be computed if there is a RowHandler iterator.
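A rough sketch of the first approach, under stated assumptions: `BlockIndexSketch` and `total_row_count` are hypothetical names and do not come from the RisingLight codebase. The per-block row counts are summed once, at construction time, and only when a row handler column was actually requested.

```rust
/// Hypothetical stand-in for one block's entry in a column index.
pub struct BlockIndexSketch {
    pub row_count: u32,
}

/// Sums the row counts of all blocks of one column, but only if a row
/// handler iterator will be created. Intended to be called once, from the
/// constructor, rather than on every `next_batch` call.
pub fn total_row_count(
    first_column_index: &[BlockIndexSketch],
    has_row_handler: bool,
) -> Option<usize> {
    if !has_row_handler {
        // No row handler requested: skip the scan over the index entirely.
        return None;
    }
    // Accumulate in usize so many large blocks cannot overflow a u32 sum.
    Some(first_column_index.iter().map(|b| b.row_count as usize).sum())
}
```

The returned total can then be handed to the row handler iterator's constructor, so that the cost of walking the index is paid once per rowset scan.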

fetch_hint is used to determine how many rows to scan from the RowSet.

let mut fetch_size = {
    // We find the minimum fetch hints from the column iterators first
    let mut min = None;
    for it in self.column_iterators.iter().flatten() {
        let hint = it.fetch_hint();
        if hint != 0 {
            if min.is_none() {
                min = Some(hint);
            } else {
                min = Some(min.unwrap().min(hint));
            }
        }
    }
    min.unwrap_or(ROWSET_MAX_OUTPUT)
};

Therefore, fetch_hint generally means "how many rows can be fetched from this column without any I/O". For RowHandlerIterator, it can be the total number of rows minus the current row id.
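The selection of the fetch size can also be written as a small standalone function, which makes it easy to see how a row handler's hint interacts with the hints of user columns. This is a sketch only: the function name `fetch_size_from_hints` is hypothetical, and the value given to `ROWSET_MAX_OUTPUT` here is a placeholder, not the constant's real value.

```rust
/// ASSUMPTION: placeholder value; the real constant lives in the storage
/// engine and may differ.
pub const ROWSET_MAX_OUTPUT: usize = 2048;

/// Picks the smallest non-zero hint, falling back to `ROWSET_MAX_OUTPUT`
/// when every iterator reports zero (or there are no iterators).
pub fn fetch_size_from_hints(hints: &[usize]) -> usize {
    hints
        .iter()
        .copied()
        .filter(|&hint| hint != 0)
        .min()
        .unwrap_or(ROWSET_MAX_OUTPUT)
}
```

Since a row handler reports all of its remaining rows, its hint is usually the largest one and does not shrink the fetch size when user columns are present; when it is the only column, its hint alone decides the fetch size.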

If you run into difficulty implementing this, feel free to comment in this PR, or ping me on Slack. Thanks!

Signed-off-by: Fedomn <fedomn.ma@gmail.com>
Signed-off-by: Fedomn <fedomn.ma@gmail.com>
Member

@skyzh skyzh left a comment

LGTM for the current implementation! For the remaining bugs, you may create a new issue (and send a PR if you have time), thanks!

delete from t where v = 7

query I
select count(*) from t where v > 5

This comment was marked as resolved.

PhysicalProjection: exprs [InputRef #0]
  PhysicalSimpleAgg: 1 agg calls
    PhysicalTableScan: table #9, columns [0], with_row_handler: true, is_sorted: false, expr: Gt(InputRef #0, Int32(5) (const))

The plan is not optimal. As we already have column 0 scanned, we don't need the row handler.

We can refine the binder logic later to handle this case, and get this PR merged first. Thanks!

(Maybe this would be a better task for the optimizer? Only the optimizer knows how many columns will actually be scanned...) Let's discuss in #482. cc @st1page

tests/sql/count.slt
@skyzh skyzh enabled auto-merge (squash) February 16, 2022 05:49
@skyzh skyzh merged commit 8464ae4 into risinglightdb:main Feb 16, 2022
Development

Successfully merging this pull request may close these issues.

storage: support scan row handler only
3 participants