feat(batch): parallel table scan #3251
Conversation
Force-pushed from 7b0c556 to 9255aa6.
license-eye has totally checked 851 files.
| Valid | Invalid | Ignored | Fixed |
|---|---|---|---|
| 849 | 1 | 1 | 0 |

Invalid file list:
- src/common/src/consistent_hashing.rs
Force-pushed from 9255aa6 to 5abf456.
Just tried another representation: use vnode_ranges instead of vnode_bitmap, which may be easier for RowSeqScanExecutor to create vnode prefix range iterators.
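As a rough illustration of the trade-off between the two representations, here is a minimal sketch (not RisingWave's actual API) of converting a vnode bitmap into the contiguous inclusive ranges that a scan executor could turn into prefix range iterators:

```rust
/// Collect maximal runs of set bits as inclusive (start, end) vnode ranges.
/// The bitmap layout here is an illustrative assumption.
fn bitmap_to_ranges(bitmap: &[bool]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = None;
    for (i, &set) in bitmap.iter().enumerate() {
        match (set, start) {
            // A run of set bits begins.
            (true, None) => start = Some(i),
            // A run ends just before the current cleared bit.
            (false, Some(s)) => {
                ranges.push((s, i - 1));
                start = None;
            }
            _ => {}
        }
    }
    // Close a run that extends to the end of the bitmap.
    if let Some(s) = start {
        ranges.push((s, bitmap.len() - 1));
    }
    ranges
}

fn main() {
    let mut bitmap = vec![false; 8];
    for i in [0, 1, 2, 5, 6] {
        bitmap[i] = true;
    }
    // Two runs of set bits -> two inclusive ranges.
    assert_eq!(bitmap_to_ranges(&bitmap), vec![(0, 2), (5, 6)]);
}
```

The conversion is cheap and lossless in this direction, which is one reason the PR later settles on passing a bitmap and letting the table do the conversion.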
```diff
@@ -37,6 +37,7 @@ impl GrpcExchangeSource {
     let task_id = task_output_id.get_task_id()?.clone();
     let client = ComputeClient::new(addr).await?;
     let local_execute_plan = exchange_source.local_execute_plan;
+    let vnode_ranges = todo!();
```
The `vnode_range` should be determined by the optimizer/fragmenter.
I think it should be determined by the scheduler? Each scan task scans different ranges.
I think the vnodes of a table scan plan node should be determined by the optimizer. And the compute node where a task should be sent is determined by the fragmenter; this way we can reuse this logic in local execution mode. The scheduler should only care about sending tasks to the right compute node.
The optimizer and fragmenter only have one plan, e.g., scan the full table.
Then the scan is divided into non-overlapping vnode ranges, and each task of the fragment is assigned a range to scan. How can we do that in the optimizer/fragmenter?
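The division into non-overlapping ranges being discussed can be sketched as an even split of the vnode space across parallel tasks. This is a minimal illustration, assuming an even-split policy; it is not the PR's actual partitioning rule:

```rust
/// Split `vnode_count` vnodes into `num_tasks` non-overlapping inclusive
/// ranges, one per parallel scan task.
fn partition_vnodes(vnode_count: usize, num_tasks: usize) -> Vec<(usize, usize)> {
    let base = vnode_count / num_tasks;
    let rem = vnode_count % num_tasks;
    let mut ranges = Vec::with_capacity(num_tasks);
    let mut start = 0;
    for task in 0..num_tasks {
        // The first `rem` tasks take one extra vnode so the split stays balanced.
        let len = base + usize::from(task < rem);
        ranges.push((start, start + len - 1)); // inclusive range
        start += len;
    }
    ranges
}

fn main() {
    // 10 vnodes across 3 tasks: the ranges cover [0, 9] without overlap.
    assert_eq!(partition_vnodes(10, 3), vec![(0, 3), (4, 6), (7, 9)]);
}
```

Whoever runs this split (scheduler, fragmenter, or optimizer) is exactly the open question in this thread, since only a component that knows `num_parallelism` can produce one range per task.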
Oh, we can let the fragmenter store `num_parallelism` different plans if you want...
Here, after the filter has been pushed down to the table scan, it should be able to prune unnecessary vnodes.
The optimizer is responsible for pushing predicates (SARGs) down to the table scan, but is not responsible for deciding the specific vnode number.
And it cannot do so, because it doesn't know whether keys (1, 2) are in the same parallel unit.
For 1, it's not the target of this PR or of the `vnodes_range` introduced here.
This PR cares mainly about: given a range (e.g., a full table scan), divide (schedule) it into small ranges and assign them to tasks.
I think the `ScanRange` introduced earlier (in `BatchSeqScan`) is responsible for this part? And pruning vnodes can be done at the same place where we partition the vnode ranges, instead of in the optimizer.
Per my understanding, the parallel unit is just a partition, which should be maintained by the optimizer. It should not know about vnodes, but should understand parallel units.
> Per my understanding, the parallel unit is just a partition, which should be maintained by the optimizer. It should not know about vnodes, but should understand parallel units.

I don't think of it as a partition 😇 To me, it's exactly the same thing as the routing metadata of KV databases like Cassandra, HBase, or TiKV.
Anyway, we need to map filters in table scan to parallel units, and it can't be done in the scheduler since we need to reuse it in local mode. Maybe the fragmenter is a better place.
PTAL the partitioning logic in the distributed scheduler. And the remaining work is the executor part, which is blocked by the storage side's work (new encoding & vnode range scan API).
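The scheduler-side step mentioned here, assigning partitioned vnode ranges to the compute nodes that own them, could look roughly like the following sketch. The worker-id mapping and the assumption that each range is owner-aligned are illustrative, not the PR's actual scheduler code:

```rust
use std::collections::HashMap;

/// Group inclusive vnode ranges by owning worker, using `vnode_mapping[v]`
/// as the worker id that owns vnode `v`. Each range is assumed to lie
/// entirely within one worker's vnodes (an illustrative simplification).
fn assign_ranges(
    ranges: &[(usize, usize)],
    vnode_mapping: &[u32],
) -> HashMap<u32, Vec<(usize, usize)>> {
    let mut plan: HashMap<u32, Vec<(usize, usize)>> = HashMap::new();
    for &(start, end) in ranges {
        // Look up the owner of the range's first vnode.
        let worker = vnode_mapping[start];
        plan.entry(worker).or_default().push((start, end));
    }
    plan
}

fn main() {
    // Toy mapping: vnodes 0..=3 owned by worker 0, 4..=7 by worker 1.
    let mapping = [0, 0, 0, 0, 1, 1, 1, 1];
    let plan = assign_ranges(&[(0, 3), (4, 7)], &mapping);
    assert_eq!(plan[&0], vec![(0, 3)]);
    assert_eq!(plan[&1], vec![(4, 7)]);
}
```

This is the "send task to the right compute node" responsibility the thread assigns to the scheduler, kept separate from the optimizer's predicate pushdown.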
Force-pushed from 00b1a75 to a16945a.
Force-pushed from 948e8d2 to 7424af8.
https://github.com/singularity-data/risingwave/pull/3251/files/354869f6331d125b5b00c0cb6f65ad9cf6d4b483..d3107963513716c8b88591fe559de6a2ec0e6b2a Am I doing the right things about distinct agg? 🤡 cc @st1page
PTAL the final updates:
I'd like to fix the FIXME in a separate PR. Shall we merge this PR if the other fixes look good? Update: the fix has already been merged in #3599
Force-pushed from 3881b08 to e615829.
Force-pushed from e615829 to aa29d20.
So this does not enable real parallel table scan, according to the discussions in #3583? I've removed "close" in the PR body. 🤣
* add vnode_bitmap in row seq scan
* use vnode_ranges instead of vnode_bitmap
* todo!
* use vnode_mapping in table catalog
* style: change the representation of vnode_ranges
* update local mode
* update local mode workers
* move vnode ranges from tasks into RowSeqScan
* remove build_vnode_mapping
* add some comments
* trivial fix
* use vnode_bitmap instead of vnode_ranges (let table do the conversion instead)
* fix vnodes
* fix local mode (system table)
* buf format
* revert the change in table, ignore in executor if vnodes not set
* revert cell based table
* ignore get row
* fix distinct
* Revert "fix distinct" (this reverts commit 7424af8)
* Revert "Revert "fix distinct"" (this reverts commit 3cdab8f)
* let distinct agg be singleton
* single distribution for BatchTopN
* remove should_ignore
* fmt
What's changed and what's your intention?

Mainly added vnode_bitmap to ExecutorBuilder. Not finished yet... See the `todo!`s.

BTW, should we use a vnode bitmap or other representations, e.g., vnode ranges, maybe like `ParallelUnitMapping`? I used a bitmap here because of the previous storage implementation 😇

Checklist

- `./risedev check` (or alias, `./risedev c`)

Refer to a related PR or issue link (optional)

#3237