
feat(storage): implement read pruning by vnode #2882

Merged: 18 commits into main from cyx/read-by-vnode-api, Jun 9, 2022

Conversation

@xx01cyx (Contributor) commented May 27, 2022

What's changed and what's your intention?

Summarize your change

  • Add a vnode parameter to the read interfaces of keyspace and state store.
  • Use vnode bitmap info to initialize the keyspace.
  • Use the new keyspace (the one with vnode) in certain streaming executors.

After this PR gets merged, read pruning by vnode will work properly for both point gets and range scans.
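The core pruning idea, as a minimal sketch (the helper name and bitmap layout below are illustrative, not this PR's exact code): a read that carries a vnode can skip any SST whose vnode bitmap has no bit set for that vnode.

/// Illustrative helper, not this PR's exact code: an SST can be pruned
/// from a read on `vnode` when its vnode bitmap has that bit unset,
/// i.e. the SST contains no data for that vnode (8 vnodes per byte).
fn can_prune(sst_vnode_bitmap: &[u8], vnode: usize) -> bool {
    (sst_vnode_bitmap[vnode >> 3] & (1u8 << (vnode & 0b111))) == 0
}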

Limitations

Read pruning does NOT work in batch executors yet; this will be implemented in the future.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

@skyzh (Contributor) left a comment

I would prefer to have a new set of interfaces (e.g. scan_with_vnode) instead of changing the existing ones. We should migrate little by little.

@skyzh (Contributor) commented May 28, 2022

And I think the vnode information should be recorded on Keyspace (e.g. keyspace::new_with_vnode) instead of being passed everywhere.
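A minimal sketch of this suggestion (the field names and the Bitmap type here are assumptions for illustration, not the actual code):

// Illustrative sketch of the suggestion above: record the executor's
// vnodes on the Keyspace once, at construction time, instead of
// threading a vnode argument through every read call.
impl<S: StateStore> Keyspace<S> {
    pub fn new_with_vnode(store: S, prefix: Vec<u8>, vnodes: Bitmap) -> Self {
        Self {
            store,
            prefix,
            vnodes: Some(vnodes), // hypothetical field; None means no pruning
        }
    }
}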

@xx01cyx (Contributor, Author) commented May 28, 2022

I would prefer to have a new set of interfaces (e.g. scan_with_vnode) instead of changing the existing ones. We should migrate little by little.

Indeed. I'll fix this.

@xx01cyx (Contributor, Author) commented May 28, 2022

And I think the vnode information should be recorded on Keyspace (e.g. keyspace::new_with_vnode) instead of being passed everywhere.

The vnodes that one executor owns are likely to change when the cluster scales in or out. Then we'll have to maintain the vnode info in a multi-version way in the keyspace.

@skyzh (Contributor) commented May 28, 2022

The vnodes that one executor owns are likely to change when the cluster scales in or out

If there's scale-in and scale-out, the executor will be re-created. 😇🥰

@codecov (bot) commented May 28, 2022

Codecov Report

Merging #2882 (8b83502) into main (9af55da) will decrease coverage by 0.04%.
The diff coverage is 60.09%.

❗ Current head 8b83502 differs from the pull request's most recent head 6d3797d. Consider uploading reports for the commit 6d3797d to get more accurate results.

@@            Coverage Diff             @@
##             main    #2882      +/-   ##
==========================================
- Coverage   73.47%   73.42%   -0.05%     
==========================================
  Files         736      736              
  Lines      100716   101010     +294     
==========================================
+ Hits        73997    74163     +166     
- Misses      26719    26847     +128     
Flag Coverage Δ
rust 73.42% <60.09%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/bench/ss_bench/operations/get.rs 0.00% <0.00%> (ø)
...rc/bench/ss_bench/operations/prefix_scan_random.rs 0.00% <0.00%> (ø)
src/common/src/hash/dispatcher.rs 91.89% <ø> (ø)
src/common/src/hash/key.rs 85.38% <ø> (ø)
src/ctl/src/cmd_impl/hummock/list_kv.rs 0.00% <0.00%> (ø)
src/meta/src/manager/hash_mapping.rs 97.39% <ø> (ø)
src/meta/src/stream/meta.rs 47.04% <ø> (ø)
src/meta/src/stream/scheduler.rs 88.53% <ø> (ø)
src/meta/src/stream/stream_manager.rs 68.82% <ø> (ø)
src/storage/src/hummock/snapshot_tests.rs 94.68% <ø> (ø)
... and 30 more


@xx01cyx (Contributor, Author) commented May 28, 2022

If there's scale-in and scale-out, the executor will be re-created. 😇🥰

An executor can only use the latest version of vnodes to query data. If a re-created executor wants to query data written before scaling, it will use a wrong set of vnodes and thus get incorrect results.

@skyzh (Contributor) commented May 28, 2022

An executor can only use the latest version of vnodes to query data. If a re-created executor wants to query data written before scaling, it will use a wrong set of vnodes and thus get incorrect results.

New executors will always need to include their previous vnodes. We will need a separate barrier to notify compaction completion and update their vnodes.

@xx01cyx (Contributor, Author) commented May 28, 2022

New executors will always need to include their previous vnodes. We will need a separate barrier to notify compaction completion and update their vnodes.

If a fragment scales out from 5 parallel degrees to 10, the number of vnodes owned by one parallel unit will inevitably be halved (since the total number of vnodes is invariant). How can we ensure that new executors always include their previous vnodes?

@xx01cyx (Contributor, Author) commented May 28, 2022

New executors will always need to include their previous vnodes. We will need a separate barrier to notify compaction completion and update their vnodes.

I think I get what you mean: new executor vnode set = UNION of previous vnode set AND current vnode set, until all relevant compactions are done, right?

@skyzh (Contributor) commented May 28, 2022

Well, my fault, please ignore my comments.

An executor can only use the latest version of vnodes to query data. If a re-created executor wants to query data written before scaling, it will use a wrong set of vnodes and thus get incorrect results.

This should never happen. Executors will only read data belonging to their own distribution. During scale-out, executors will operate on a completely different set of keys. Therefore, they will not query data written before.

@skyzh (Contributor) commented May 28, 2022

And we do not need to include previous vnodes.

@xx01cyx requested a review from fuyufjh on May 30, 2022 01:31
&'a self,
key: &'a [u8],
epoch: u64,
_vnode: Option<VirtualNode>,
@fuyufjh (Contributor) May 30, 2022

Considering there might not be such a "PointGet" operator, I think the type of vnodes should also be Vec<VirtualNode>. Never mind, it's not a big problem.

(Contributor)

I'd prefer always using &'a VirtualNode, so that it will function efficiently even when the vnode mapping is large.

(Contributor, Author)

I'd prefer always using &'a VirtualNode, so that it will function efficiently even when the vnode mapping is large.

Will Option<&'a VirtualNode> still cause some overhead due to the construction of Option?

(Contributor)

Nope, it has exactly the same value size as &'a VirtualNode.
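This follows from Rust's niche ("null pointer") optimization, which is guaranteed for Option<&T>; a quick check (VirtualNode below is a stand-in for the real type):

use std::mem::size_of;

// Option<&T> encodes None as the never-valid null reference, so it is
// exactly pointer-sized -- no extra discriminant byte is needed.
fn main() {
    struct VirtualNode; // stand-in for the real type
    assert_eq!(size_of::<Option<&VirtualNode>>(), size_of::<&VirtualNode>());
    assert_eq!(size_of::<&VirtualNode>(), size_of::<usize>());
}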

Comment on lines 149 to 157
pub async fn get_with_vnode(
&self,
key: impl AsRef<[u8]>,
epoch: u64,
vnode: VirtualNode,
) -> StorageResult<Option<Bytes>> {
self.store
.get(&self.prefixed_key(key), epoch, Some(vnode))
.await
(Contributor)

This function uses the given vnode instead of self.vnode. In what scenario should it be used?

(Contributor, Author)

It is used when an executor does a point get with a vnode.

(Contributor)

I think point get can already be optimized by the bloom filter (only 0.01 false-positive rate currently). Maybe we don't need vnode for it. But it would also be okay to use vnode to do some sanity checks -- e.g. executors should not point get keys outside their vnode range.
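Such a sanity check could be as simple as the following sketch (hypothetical; a plain bool slice stands in for the real bitmap type):

/// Hypothetical sketch of the suggested sanity check: in debug builds,
/// reject point gets that target a vnode this executor does not own.
fn check_vnode_owned(owned_vnodes: &[bool], vnode: usize) {
    debug_assert!(
        owned_vnodes[vnode],
        "point get on vnode {} outside this executor's vnode set",
        vnode
    );
}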

@fuyufjh (Contributor) left a comment

LGTM

@@ -230,15 +231,15 @@ mod tests {
vec![DataType::Int64].into(),
);
assert!(!managed_state.is_dirty());
let columns = vec![
(Contributor)

Accidentally reverted the change?

@@ -170,16 +185,34 @@ impl<S: StateStore> CellBasedTable<S> {

pub async fn get_row_by_scan(&self, pk: &Row, epoch: u64) -> StorageResult<Option<Row>> {
// get row by state_store scan
let vnode = self
(Contributor)

Why does CellBasedTable need to compute the vnode? I think this should be provided by the executors creating the CellBasedTable?

(Contributor, Author)

It's just that the cell-based table provides such an interface:

async fn batch_write_rows_inner<const WITH_VALUE_META: bool>

which I think indicates whether to compute the value meta in the cell-based table. 🤔

@skyzh (Contributor) May 30, 2022

Value meta needs to be computed on writes, of course. But for reads, isn't it true that all executors and their state table objects already have value meta assigned to them? For both point get and scan, we should use the vnodes provided by executors to do the filtering, instead of computing them.

@xx01cyx (Contributor, Author) May 30, 2022

So we should use the same set of vnodes to do pruning, regardless of the type of the read operation (point get or range scan). Will this lead to any inefficiency that could be avoided (e.g. fewer SSTs being pruned out) when we do a point get? cc. @fuyufjh

(Collaborator)

For reads performed on a single vnode (e.g. we know the dist key beforehand), I think computing the vnode on the fly makes sense. In other cases, I think we should just use the vnodes of the executor, which should be initialized on CellBasedTable initialization.

vnode: VirtualNode,
) -> StorageResult<Option<Bytes>> {
// Construct vnode bitmap.
let mut bitmap_inner = [0; VNODE_BITMAP_LEN];
(Contributor)

This code seems to appear in multiple places. Is it possible to have a VNodeBitmap::new(vnode, table_id) and let the caller provide a VNodeBitmap?

(Contributor, Author)

VNodeBitmap is actually a proto type. Maybe we should define a non-proto type for it.
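A sketch of such a non-proto type (names are illustrative; VNODE_BITMAP_LEN is the existing constant from the snippet above):

/// Illustrative sketch: a plain Rust wrapper that the caller constructs
/// once, converting to the proto type only at the storage boundary.
pub struct VNodeBitmap {
    table_id: u32,
    bitmap: [u8; VNODE_BITMAP_LEN],
}

impl VNodeBitmap {
    /// Builds a bitmap with only `vnode`'s bit set (one bit per vnode).
    pub fn new(vnode: usize, table_id: u32) -> Self {
        let mut bitmap = [0u8; VNODE_BITMAP_LEN];
        bitmap[vnode >> 3] |= 1 << (vnode & 0b111);
        Self { table_id, bitmap }
    }
}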

@xx01cyx changed the title from "feat(storage): add vnode to read-interface of keyspace and state store" to "feat(storage): implement read pruning by vnode" on May 30, 2022
@hzxa21 self-requested a review on May 31, 2022 05:20

@skyzh (Contributor) commented Jun 1, 2022

For reads performed on a single vnode (e.g. we know the dist key beforehand), I think computing the vnode on the fly makes sense. In other cases, I think we should just use the vnodes of the executor, which should be initialized on CellBasedTable initialization.

I believe the bloom filter can already achieve a relatively low false-positive rate. I would prefer to use the executor-provided vnodes in all cases.

@hzxa21 (Collaborator) commented Jun 9, 2022

For reads performed on a single vnode (e.g. we know the dist key beforehand), I think computing the vnode on the fly makes sense. In other cases, I think we should just use the vnodes of the executor, which should be initialized on CellBasedTable initialization.

I believe the bloom filter can already achieve a relatively low false-positive rate. I would prefer to use the executor-provided vnodes in all cases.

Correct me if I am wrong: on second thought, I think there is no case where we don't know the dist key beforehand. Therefore, we should always compute and provide a single vnode to the read interface.
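In sketch form (the hash function and signature here are illustrative, not the actual scheme; the real code must use the same hash as the write path so reads and writes agree on the vnode):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative only: derive the single vnode for a read from the
/// serialized distribution key, then pass it to the vnode-aware read
/// interface (e.g. get_with_vnode above).
fn vnode_of(dist_key: &[u8], vnode_count: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    dist_key.hash(&mut hasher);
    hasher.finish() % vnode_count
}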

@xx01cyx enabled auto-merge (squash) on June 9, 2022 12:57
@xx01cyx merged commit 49c207d into main on Jun 9, 2022
@xx01cyx deleted the cyx/read-by-vnode-api branch on June 9, 2022 13:09