Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(streaming): use iterator directly instead of materializing all kv pairs from iterator #1998

Merged
merged 1 commit into from
Apr 29, 2022

Conversation

lmatz
Copy link
Contributor

@lmatz lmatz commented Apr 20, 2022

What's changed and what's your intention?

In #1969, we replace scan in TopNExecutor with iter. However, we still first materialize all the kv pairs from iter.

This PR adds a PkAndRowIterator so that two TopN's states only need to deal with OrderedRow and Row instead of raw Bytes.

Also eliminate an extra comparison between key_from_storage and key_from_buffer due to encounter_same_key check.

Checklist

  • I have written necessary docs and comments

Refer to a related PR or issue link (optional)

@codecov
Copy link

codecov bot commented Apr 20, 2022

Codecov Report

Merging #1998 (6b7e4e6) into main (8ca6a1e) will increase coverage by 0.07%.
The diff coverage is 84.94%.

@@            Coverage Diff             @@
##             main    #1998      +/-   ##
==========================================
+ Coverage   70.94%   71.01%   +0.07%     
==========================================
  Files         625      625              
  Lines       80597    80538      -59     
==========================================
+ Hits        57177    57194      +17     
+ Misses      23420    23344      -76     
Flag Coverage Δ
rust 71.01% <84.94%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ecutor/managed_state/top_n/top_n_bottom_n_state.rs 76.96% <33.33%> (-2.30%) ⬇️
...am/src/executor/managed_state/top_n/top_n_state.rs 92.77% <91.30%> (+0.12%) ⬆️
src/stream/src/executor/managed_state/top_n/mod.rs 100.00% <100.00%> (ø)
src/meta/src/manager/notification.rs 29.60% <0.00%> (-1.44%) ⬇️
src/stream/src/executor_v2/lookup/impl_.rs 96.77% <0.00%> (-0.69%) ⬇️
src/stream/src/executor_v2/lookup.rs 4.83% <0.00%> (-0.08%) ⬇️
src/common/src/util/mod.rs 0.00% <0.00%> (ø)
src/meta/src/rpc/server.rs 0.00% <0.00%> (ø)
src/meta/src/barrier/mod.rs 70.06% <0.00%> (ø)
src/risedevtool/src/task.rs 0.00% <0.00%> (ø)
... and 45 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

@lmatz lmatz requested a review from BugenZhao April 21, 2022 04:32
@lmatz
Copy link
Contributor Author

lmatz commented Apr 28, 2022

@BugenZhao @st1page @wcy-fdu
I guess this part will be absorbed into the new relation table, right? Together with the logic of merging storage's iter and flush_buffer's iter.

@wcy-fdu
Copy link
Contributor

wcy-fdu commented Apr 28, 2022

Can top_n use table_iter/row_iter instead of keyspace's iter?
If so, StateTableIter will merge flush_buffer's iter and CellBasedRowIter in the future.

Comment on lines +44 to +48
pub struct PkAndRowIterator<'a, I: StateStoreIter<Item = (Bytes, Bytes)>, const TOP_N_TYPE: usize> {
iter: I,
ordered_row_deserializer: &'a mut OrderedRowDeserializer,
cell_based_row_deserializer: &'a mut CellBasedRowDeserializer,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can be part of CellBasedTable and replace the CellBasedTableRowIter? and after that, we can implement the iter of StateTable composited with MemTable(flush_buffer) and CellBasedTable. and all stream operators can use the StateTable to store their states.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that only TopNExecutor requires that the next of the iterator returns both key and value, and the key is an OrderedRow. This is not required in other stateful executors, i.e. HashAgg and HashJoin. It is also not required in MV.

This is because only TopNExecutor needs an ordered key to fill in its cache, and this filling process does not only rely on the key of the single(insert, row) whose state is being updated.

Use a flag to denote whether it should deserialize and return the key? 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering we will do store a column either in key or value, the caller should just make the decision whether it is necessary to return key. This iterator should figure out by itself whether it needs to deserialize key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we merge first and take some time to decide how to merge this with CellBasedTable and replace the CellBasedTableRowIter? Some other fixes are pending in TopN.

Copy link
Contributor

@st1page st1page left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lmatz lmatz merged commit bc05efd into main Apr 29, 2022
@lmatz lmatz deleted the lz/refactor branch April 29, 2022 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants