Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query engine 1/N #1848

Merged
merged 16 commits into from
Apr 25, 2024
Merged

Query engine 1/N #1848

merged 16 commits into from
Apr 25, 2024

Conversation

batiati
Copy link
Contributor

@batiati batiati commented Apr 16, 2024

This PR is the first part of a multi-PR effort to implement the QueryAPI.
Evolving the existing Scans, adding the required building blocks for queries with the minimal amount of work that makes sense to be reviewed together:

1. ScanBuilder sanitization:

Implements support for id and timestamp (previously, only secondary indexes were supported);
Removes the support for range queries over secondary indexes; only equality by the prefix is supported now.

With that, we removed the unsupported use case of merging scans not sorted by timestamp from this API.

2. ScanRange type:

Range queries (e.g. >, <, >=, <=) are still possible without merging scans, just not supported by the ScanBuilder anymore.

The new type ScanRange is non-composable and acts like an adapter between the raw ScanTree (which returns the CompositeKey) and the ScanLookup (which expects a timestamp).

The implementation of the expiry of pending transfers was changed to use a ScanRange, taking advantage of the new API that allows the caller to execute inline filters during the matching.
Instead of two queries, now a single scan will suffice to fulfill the pending transfer logic. See comments for more details. No noticeable performance gain, but the code was greatly simplified.

3. Zig-zag merge join:

Implements the zig-zag merge join algorithm and all the machinery required to probe scans during the "zag" phase.

This refactors the existing K-way merge implementation to accommodate the required API for probing streams.
The most notable change is that the iterator is now initialized only during the first read, allowing "resets" when the underlying streams change due to a "zag".

There are unit and fuzz tests for the Zig-zag merge implementation.

Pending work and next steps:

  • Merge differences (not implemented yet).
  • Fuzzer for ScanBuilder and ScanMerge (most of the code present in this PR is not reachable in production. During development, I did tests and benchmarks using the existing get_account_transfers operation with a provisory modified state machine, which performed more complex AND clauses).
  • Query API wire protocol (payload, results, errors, pagination, batching, etc);
  • Client APIs (for each target language);

src/lsm/zig_zag_merge.zig Outdated Show resolved Hide resolved
assert(key_from_value(&value_other) == key);

if (constants.verify) {
// It's assumed that streams will produce the same value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, why are they all same value type?

(Updated later:) Ok, it is because all of the streams produce only timestamps, not objects or composite keys. Will we later use zig-zag merge iterator with Key != Value?

Copy link
Contributor Author

@batiati batiati Apr 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will we later use zig-zag merge iterator with Key != Value?

Yes, scans produce only timestamps, and since the composite key prefix is always fixed (equality), they also take only the timestamp as the probe key, so Key == Value for merging scans.

I don't think we should enforce Key == Value, the intersection logic can be reused as a primitive.

EDIT: I just realized that this comment could be clarified.

src/lsm/zig_zag_merge.zig Outdated Show resolved Hide resolved
src/lsm/zig_zag_merge.zig Outdated Show resolved Hide resolved
src/lsm/zig_zag_merge.zig Show resolved Hide resolved
src/lsm/zig_zag_merge.zig Outdated Show resolved Hide resolved
src/lsm/zig_zag_merge.zig Outdated Show resolved Hide resolved
src/lsm/zig_zag_merge.zig Outdated Show resolved Hide resolved
src/lsm/k_way_merge.zig Show resolved Hide resolved
src/lsm/scan_tree.zig Show resolved Hide resolved
src/lsm/scan_builder.zig Outdated Show resolved Hide resolved
src/lsm/scan_builder.zig Outdated Show resolved Hide resolved
src/lsm/scan_builder.zig Outdated Show resolved Hide resolved
Comment on lines +67 to +69
pub fn reset(it: *ZigZagMergeIterator) void {
_ = it;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ever called, or could it be unreachable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called, so ScanMerge doesn't need to know which type of merge_iterator needs a reset.

src/state_machine.zig Outdated Show resolved Hide resolved
src/state_machine.zig Outdated Show resolved Hide resolved
src/lsm/scan_range.zig Outdated Show resolved Hide resolved
src/state_machine.zig Outdated Show resolved Hide resolved
src/lsm/scan_tree.zig Outdated Show resolved Hide resolved
src/lsm/scan_tree.zig Outdated Show resolved Hide resolved
// Updates the scan range depending on the direction.
switch (self.direction) {
.ascending => {
if (self.key_min == key) return; // No need to move
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the ZigZagMergeIterator avoid calling probe on streams that already have the right key?
Could we assert that key_min != key here instead of returning? (Likewise for the .descending branch.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We avoid probing a scan already in the right key, but it's possible to probe the same scan twice, first as buffered and then as drained.

This if does not check the current key, it only allows double-probing, so we don't have to track it during the iteration.

// Probing the buffered streams that did not match the key.
var probing_iterator = probing.iterator(.{ .kind = .set });
while (probing_iterator.next()) |stream_index| {
stream_probe(it.context, @intCast(stream_index), probe_key);
const key = stream_peek(it.context, @intCast(stream_index)) catch |err| {
switch (err) {
error.Empty => return null,
error.Drained => {
drained.set(stream_index);
probing.unset(stream_index);
continue;
},
}
};

src/lsm/zig_zag_merge.zig Show resolved Hide resolved
Comment on lines 21 to 24
/// ScanBuilder is a helper to create and combine scans using
/// any of the Groove's secondary indexes.
pub fn ScanBuilderType(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just "ScanBuilder is a helper to create and combine scans using any of the Groove's indexes."?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I removed the comment while I was thinking in the TODO next line, since it won't be single-groove soon.

This is the first step to isolate scan_builder as a proper "query builder", that supports full queriable fields (id, timestamp, and secondary indexes) but also removes the support for range scans.
Instead, range queries have much simpler mechanisms that do not allow merging scans by design.

Since the API for range queries needed to be changed, the expiry of pending transfers was used as the main use case for the change, allowing custom logic to be inserted during the scan.
@sentientwaffle
Copy link
Member

Aside from those last 3 nits, this LGTM!

@batiati batiati added this pull request to the merge queue Apr 25, 2024
Merged via the queue into main with commit 7512b69 Apr 25, 2024
26 checks passed
@batiati batiati deleted the batiati-query-engine branch April 25, 2024 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants