Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cell_based_table): add sentinel column and cell_based muti_get #1590

Merged
merged 12 commits into from
Apr 11, 2022

Conversation

wcy-fdu
Copy link
Contributor

@wcy-fdu wcy-fdu commented Apr 5, 2022

What's changed and what's your intention?

This PR first adds sentinel column when serialize/deserialize, using None value for sentinel cell. While state_store is getting sentinel column, there are three cases:

  • return None: no exist row
  • return Some(value): exist row

Then cell_based muti_get is implemented by the design above.

BTW, some unit tests have been modified due to corresponding changes.

Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests

Refer to a related PR or issue link (optional)

@wcy-fdu wcy-fdu marked this pull request as draft April 5, 2022 14:47
@codecov
Copy link

codecov bot commented Apr 5, 2022

Codecov Report

Merging #1590 (54f4959) into main (06bcfa9) will increase coverage by 0.00%.
The diff coverage is 95.01%.

@@           Coverage Diff           @@
##             main    #1590   +/-   ##
=======================================
  Coverage   71.37%   71.38%           
=======================================
  Files         600      600           
  Lines       77726    77721    -5     
=======================================
+ Hits        55477    55479    +2     
+ Misses      22249    22242    -7     
Flag Coverage Δ
rust 71.38% <95.01%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/stream/src/executor_v2/mview/materialize.rs 90.67% <ø> (ø)
src/storage/src/cell_based_row_deserializer.rs 98.09% <87.50%> (-0.96%) ⬇️
.../stream/src/executor_v2/mview/table_state_tests.rs 96.38% <94.41%> (+0.08%) ⬆️
src/common/src/util/ordered/serde.rs 92.82% <100.00%> (-0.19%) ⬇️
src/storage/src/table/cell_based_table.rs 77.13% <100.00%> (+1.75%) ⬆️
...am/src/executor/managed_state/top_n/top_n_state.rs 88.45% <100.00%> (ø)
src/stream/src/executor_v2/mview/state.rs 100.00% <100.00%> (ø)
src/connector/src/filesystem/file_common.rs 80.80% <0.00%> (+0.44%) ⬆️

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@wcy-fdu wcy-fdu changed the title feat: add sentinel column feat(cell_based_table): add sentinel column and cell_based muti_get Apr 6, 2022
@wcy-fdu wcy-fdu marked this pull request as ready for review April 6, 2022 08:49
@wcy-fdu wcy-fdu requested a review from skyzh April 6, 2022 08:56
src/common/src/util/ordered/serde.rs Outdated Show resolved Hide resolved
src/storage/src/cell_based_row_deserializer.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
src/common/src/util/ordered/serde.rs Outdated Show resolved Hide resolved
src/storage/src/cell_based_row_deserializer.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Show resolved Hide resolved
.concat();
let state_store_get_res = self.keyspace.get(&key.clone(), epoch).await?;
if let Some(state_store_get_res) = state_store_get_res {
get_res.push((key.clone(), state_store_get_res));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls check all unnecessary clones

src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
@wcy-fdu wcy-fdu requested a review from lmatz April 8, 2022 10:06
@@ -32,7 +32,7 @@ use crate::util::sort_util::{OrderPair, OrderType};
use crate::util::value_encoding::serialize_cell;

/// The special `cell_id` reserved for a whole null row is `i32::MIN`.
pub const NULL_ROW_SPECIAL_CELL_ID: ColumnId = ColumnId::new(i32::MIN);
pub const SENTINEL_CELL_ID: ColumnId = ColumnId::new(-1_i32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, -1 is because we want SENTINEL_CELL_ID to be at the very beginning when we do scan? If so, may add some comments to it.
May give some examples in the comments, e.g. normal row, all-none row, what the storage format would be like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we are encoding the column id now. If simply using be or le encoding, -1 might not be at the beginning when scanning.

result.push((key, None));
}
}
}
if all_null {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May check whether the comments for this whole function is obsolete.

@CLAassistant
Copy link

CLAassistant commented Apr 8, 2022

CLA assistant check
All committers have signed the CLA.

@@ -32,7 +32,7 @@ use crate::util::sort_util::{OrderPair, OrderType};
use crate::util::value_encoding::serialize_cell;

/// The special `cell_id` reserved for a whole null row is `i32::MIN`.
pub const NULL_ROW_SPECIAL_CELL_ID: ColumnId = ColumnId::new(i32::MIN);
pub const SENTINEL_CELL_ID: ColumnId = ColumnId::new(-1_i32);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how we are encoding the column id now. If simply using be or le encoding, -1 might not be at the beginning when scanning.

src/common/src/util/ordered/serde.rs Show resolved Hide resolved
src/storage/src/cell_based_row_deserializer.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
src/storage/src/table/cell_based_table.rs Outdated Show resolved Hide resolved
@@ -298,7 +298,7 @@ impl<S: StateStore, const TOP_N_TYPE: usize> ManagedTopNState<S, TOP_N_TYPE> {
let pk_row_bytes = self
.keyspace
.scan_strip_prefix(
number_rows.map(|top_n_count| top_n_count * self.data_types.len()),
number_rows.map(|top_n_count| top_n_count * (self.data_types.len() + 1)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So top-n will possibly scan more cells from storage? If there're nulls in storage, it will scan more that it needs. Hmm... better to have this refactored in the future. For example, use iterator-based APIs. (cc @lmatz)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also update the comments above

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I talked to @wcy-fdu about this part requiring refactoring in the future.

@@ -235,7 +235,7 @@ mod tests {
.await
.unwrap()
.len(),
6
9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use get_row API to refactor this test case now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also update comments above

src/stream/src/executor_v2/mview/state.rs Show resolved Hide resolved
#[tokio::test]
async fn test_get_row_by_muti_get() {
let state_store = MemoryStateStore::new();
let column_ids = vec![ColumnId::from(0), ColumnId::from(1), ColumnId::from(2)];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also have some test cases like:

  • variable-length type? e.g. string
  • shuffled column id? e.g., column id 3, 2, 1 instead of 1, 2, 3
  • non-continuous ids? e.g., 1, 4, 8 instead of 1, 2, 3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I will add these cases and remove get_for_test in cell_based_table.

Copy link
Contributor

@skyzh skyzh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Better also have a test case like write using 1, 2, 3 column id, and read using 3, 2.

@wcy-fdu
Copy link
Contributor Author

wcy-fdu commented Apr 11, 2022

LGTM. Better also have a test case like write using 1, 2, 3 column id, and read using 3, 2.

Good idea.

@wcy-fdu wcy-fdu merged commit 0bfc42c into main Apr 11, 2022
@wcy-fdu wcy-fdu deleted the wcy_add_sentinel_column branch April 11, 2022 06:07
@@ -32,7 +32,7 @@ use crate::util::sort_util::{OrderPair, OrderType};
use crate::util::value_encoding::serialize_cell;

/// The special `cell_id` reserved for a whole null row is `i32::MIN`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update this doc as well. :)

.map_err(err)?;
assert!(deserialize_res.is_none());
}
let pk_and_row = cell_based_row_deserializer.take();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we can unwrap here?

assert_eq!(
memory_state_store
.scan::<_, Vec<u8>>(.., None, u64::MAX)
.await
.unwrap()
.len(),
6
9
);

// FIXME(Bugen): restore this test by using new `RowTable` interface
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we can restore these tests with get_row.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants