Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix query pruning in the catalog #2264

Merged
merged 6 commits into from May 9, 2022
Merged

Conversation

lava
Copy link
Member

@lava lava commented May 4, 2022

Updates the catalog to rewrite queries like http.hostname == "foo" || dns.rrname == "foo" to :string == "foo" (as opposed to rewriting them to http.hostname == "foo", as was done previously).

Additionally, introduces a set of field names that have separate configured bloom filters (presumably with a higher precision than the default one), which should not be touched by the pruner.

馃摑 Checklist

  • All user-facing changes have changelog entries.
  • The changes are reflected on docs.tenzir.com/vast, if necessary.
  • The PR description contains instructions for the reviewer, if necessary.

馃幆 Review Instructions

Review this pull request commit-by-commit.

@lava lava added the bug Incorrect behavior label May 4, 2022
This several issues with the query pruner which was aggressively
reducing string lookups in compound statements using the fact that
all strings in a partition share the same bloom filter.

First, the relational operator was ignored leading to invalid
transformations like

    http.hostname == "foo" || http.hostname != "foo"

being "optimized" to just `http.hostname == "foo"` which is clearly
not the same.

Second, a query like

    http.hostname == "foo" || dns.rrname == "foo"

would also be reduced to just `http.hostname == "foo"`, which
returns false for a partition that contains only suricata.dns
events regardless whether or not the second half of the disjunction
would produced a candidate match.

Finally, with the introduction of configurable false-positive rates
for individual fields in VAST 2.0, even when applying this optimization
correctly it may lead to a lower-precision type level synopsis being
used for the candidate check.
@lava lava force-pushed the story/sc-33307/correct-pruning branch 2 times, most recently from 419341c to 03104ad Compare May 6, 2022 08:20
lava added 2 commits May 6, 2022 11:41
This to work around invalid transformations such as

    'Cause == "Stop"'  -> ':string == "Stop"'

where `Cause` is an enum field. While this patch does
not completely fix the issue, it ensures that the
issue is at least no more likely to appear than
before.
This is done to further optimize queries like

    :string == "foo || :string == "foo"

which are likely to occur as a result of optimization
after the first round of pruning.
@lava lava force-pushed the story/sc-33307/correct-pruning branch from f823f2d to 0cca247 Compare May 6, 2022 09:41
Copy link
Member

@dominiklohmann dominiklohmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The testbed metrics dashboard shows a significant reduction in catalog lookup time, so this is working nicely.

I've looked at the code and don't have any complaints either, but please add a changelog entry.

lava added 2 commits May 9, 2022 15:05
The previous version of pruning would still optimize
expressions like

    foo == "asdf" || bar == "asdf"

with `foo` being an unprunable field because the condition
was only checked once a duplicate was detected; ie. for all
string fields except the first.
@lava lava force-pushed the story/sc-33307/correct-pruning branch from e194970 to 4a415b3 Compare May 9, 2022 13:08
@lava lava merged commit 481640c into master May 9, 2022
@lava lava deleted the story/sc-33307/correct-pruning branch May 9, 2022 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior
Projects
None yet
2 participants