Skip to content

Add support for type-level synopses and a string synopsis #1214

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Dec 16, 2020

Conversation

tobim
Copy link
Member

@tobim tobim commented Dec 4, 2020

📔 Description

This adds support for type-level synopses to the meta index. For each field in a partition, it is possible to decide whether to use a specific synopsis, share one with the other fields of the same type, or not to use a synopsis at all.

A new string synopsis and buffered string synopsis is also introduced.

📝 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

🎯 Review Instructions

By commit.

@tobim tobim requested a review from a team December 4, 2020 16:19
Copy link
Member

@mavam mavam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the class string_synopsis? Can't we just use a buffered_bloom_filter_synopsis or buffered_bloom_filter_synopsis<T> if we want to restrict it to a single type? I would argue the lookup function can have a precondition that it was already type-checked and that it can be fully polymorphic.

A related question: how does that buffered_string_synopsis behave differently from the buffered_address_synopsis?

@tobim tobim force-pushed the topic/string-synopsis branch from 934e5bc to 8e94807 Compare December 8, 2020 10:52
@tobim tobim force-pushed the topic/string-synopsis branch 2 times, most recently from fef1295 to 77c4119 Compare December 8, 2020 15:18
@lava lava force-pushed the topic/string-synopsis branch 4 times, most recently from ec6412a to 3599ba6 Compare December 16, 2020 15:21
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We reviewed this together in a Slack call. Looks and works great!

@lava lava marked this pull request as ready for review December 16, 2020 15:52
@lava lava force-pushed the topic/string-synopsis branch 2 times, most recently from c3f73c5 to 8bc3f4b Compare December 16, 2020 16:23
@lava lava force-pushed the topic/string-synopsis branch from 8bc3f4b to 56ee0c8 Compare December 16, 2020 16:28
lava added 2 commits December 16, 2020 17:29
Previously, when faced with a query like

    zeek.dns.query == "example.org"

and a database that does not contain any zeek.dns records,
the meta index would schedule *all* partitions since no
synopsis existed that could rule out any particular partition.

With this change, we also ignore partitions that have no potential
matches at all.
lava added 3 commits December 16, 2020 17:29
  * Unify `buffered_address_synopsis` and `buffered_string_synopsis`
    into a single `buffered_synopsis<T>`.
  * String attributes in meta index before checking type equality.
  * Rename `fprate` to `fp-rate` globally.
  * Correctly apply past tense of `shrink` where necesary.
@lava lava force-pushed the topic/string-synopsis branch from 56ee0c8 to 67eaf64 Compare December 16, 2020 16:30
@lava lava merged commit 9ce8be2 into master Dec 16, 2020
@lava lava deleted the topic/string-synopsis branch December 16, 2020 17:29
@@ -12,6 +12,10 @@ Every entry has a category for which we use the following visual abbreviations:

## Unreleased

- 🎁 A new type-level synopsis structure in the meta-index now massively
speeds up string queries with very few results.
[#1214](https://github.com/tenzir/vast/pull/1214)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another case where I don't see the need for a "feature" changelog. It's a non-functional change.

It would only make sense to introduce a performance-related category where we document performance-affecting changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants