feat(eval): batch large |contains lists into Aho-Corasick automaton #99
Merged
When a detection item or keyword group has 8 or more plain `|contains`
needles with the same case sensitivity, collapse them into a single
`AhoCorasickSet` matcher. Replaces N sequential `str::contains` calls
with one linear pass over the haystack.
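To illustrate the "one linear pass" claim, here is a minimal std-only sketch of the technique. `TinyAc` and `any_contains_naive` are hypothetical names for illustration only; per the Changes list, the PR itself relies on the `aho-corasick` crate rather than a hand-rolled automaton.

```rust
use std::collections::{HashMap, VecDeque};

/// Baseline: N sequential substring searches, one per needle.
fn any_contains_naive(needles: &[&str], haystack: &str) -> bool {
    needles.iter().any(|n| haystack.contains(*n))
}

/// Toy Aho-Corasick automaton over bytes (hypothetical, illustration only).
struct TinyAc {
    edges: Vec<HashMap<u8, usize>>, // trie transitions per state
    fail: Vec<usize>,               // failure link per state
    accept: Vec<bool>,              // a needle ends here or at a suffix state
}

impl TinyAc {
    fn new(needles: &[&str]) -> Self {
        // Phase 1: insert every needle into a byte trie rooted at state 0.
        let mut edges: Vec<HashMap<u8, usize>> = vec![HashMap::new()];
        let mut accept = vec![false];
        for needle in needles {
            let mut state = 0;
            for &b in needle.as_bytes() {
                state = match edges[state].get(&b).copied() {
                    Some(next) => next,
                    None => {
                        edges.push(HashMap::new());
                        accept.push(false);
                        let next = edges.len() - 1;
                        edges[state].insert(b, next);
                        next
                    }
                };
            }
            accept[state] = true;
        }
        // Phase 2: breadth-first pass to compute failure links.
        let mut fail = vec![0usize; edges.len()];
        let mut queue: VecDeque<usize> = edges[0].values().copied().collect();
        while let Some(state) = queue.pop_front() {
            let children: Vec<(u8, usize)> =
                edges[state].iter().map(|(&b, &t)| (b, t)).collect();
            for (b, child) in children {
                let mut f = fail[state];
                while f != 0 && !edges[f].contains_key(&b) {
                    f = fail[f];
                }
                fail[child] = edges[f].get(&b).copied().unwrap_or(0);
                // Inherit acceptance from the suffix state.
                accept[child] = accept[child] || accept[fail[child]];
                queue.push_back(child);
            }
        }
        TinyAc { edges, fail, accept }
    }

    /// Single pass over the haystack, regardless of how many needles exist.
    fn is_match(&self, haystack: &str) -> bool {
        let mut state = 0;
        if self.accept[state] {
            return true; // an empty needle matches everything
        }
        for &b in haystack.as_bytes() {
            while state != 0 && !self.edges[state].contains_key(&b) {
                state = self.fail[state];
            }
            state = self.edges[state].get(&b).copied().unwrap_or(0);
            if self.accept[state] {
                return true;
            }
        }
        false
    }
}
```

Both functions answer the same question ("does any needle occur in the haystack?"); the automaton reads each haystack byte once, while the baseline may rescan the haystack once per needle.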
Adds a new `compiler::optimizer` module with `optimize_any_of`, which
partitions an `AnyOf` group into case-insensitive Contains, case-
sensitive Contains, and other matcher buckets. Each Contains bucket
that meets the threshold becomes one `AhoCorasickSet`; the rest stay
as individual matchers. The optimizer is invoked only on `AnyOf`
construction sites — `AllOf` (`|all` modifier) is left untouched, so
AND semantics are preserved.
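The partition-and-bucket step described above can be sketched roughly as follows. The `Matcher` enum, its `StartsWith` variant and the `AC_THRESHOLD` constant are simplified stand-ins assumed for illustration: the real `CompiledMatcher` has more variants, and its `AhoCorasickSet` holds a compiled automaton rather than the raw needle list used here.

```rust
/// Hypothetical, simplified matcher type (not the real `CompiledMatcher`).
#[derive(Debug, Clone, PartialEq)]
enum Matcher {
    Contains { needle: String, case_insensitive: bool },
    StartsWith(String), // stand-in for any non-Contains matcher
    AhoCorasickSet { needles: Vec<String>, case_insensitive: bool },
}

/// Assumed threshold, taken from the "8 or more" wording of this PR.
const AC_THRESHOLD: usize = 8;

/// Sketch: partition the children of an OR group and batch large buckets.
fn optimize_any_of(children: Vec<Matcher>) -> Vec<Matcher> {
    let mut ci = Vec::new(); // case-insensitive Contains needles
    let mut cs = Vec::new(); // case-sensitive Contains needles
    let mut other = Vec::new(); // everything else, left untouched
    for m in children {
        match m {
            Matcher::Contains { needle, case_insensitive: true } => ci.push(needle),
            Matcher::Contains { needle, case_insensitive: false } => cs.push(needle),
            rest => other.push(rest),
        }
    }

    let mut out = Vec::new();
    for (needles, case_insensitive) in [(ci, true), (cs, false)] {
        if needles.len() >= AC_THRESHOLD {
            // Bucket meets the threshold: one batched matcher.
            out.push(Matcher::AhoCorasickSet { needles, case_insensitive });
        } else {
            // Below threshold: keep the individual Contains matchers.
            out.extend(
                needles
                    .into_iter()
                    .map(|needle| Matcher::Contains { needle, case_insensitive }),
            );
        }
    }
    out.extend(other);
    out
}
```

Reordering the children is harmless here because the group is an OR: the result is true if any child matches, in whatever order they are tried. This sketch says nothing about `AllOf`, which the PR deliberately leaves alone.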
For case-insensitive matching, needles are pre-lowered (matching the
existing `Contains` invariant) and the hot path lowers the haystack
once via the new `ascii_lowercase_cow` helper, which has a 3-tier
fast path: borrow when already lowercase ASCII, in-place
`make_ascii_lowercase` for ASCII with uppercase, and full Unicode
`to_lowercase` only for non-ASCII input.
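A minimal sketch of how such a three-tier helper might look, assuming a `&str -> Cow<str>` signature (the signature and body are assumptions; only the name and the three tiers come from the description above):

```rust
use std::borrow::Cow;

/// Sketch of a three-tier lowercasing helper (assumed signature).
fn ascii_lowercase_cow(s: &str) -> Cow<str> {
    if s.is_ascii() {
        if !s.bytes().any(|b| b.is_ascii_uppercase()) {
            // Tier 1: already lowercase ASCII, borrow without allocating.
            Cow::Borrowed(s)
        } else {
            // Tier 2: ASCII with uppercase, copy once and lowercase in place.
            let mut owned = s.to_owned();
            owned.make_ascii_lowercase();
            Cow::Owned(owned)
        }
    } else {
        // Tier 3: non-ASCII input, fall back to full Unicode lowercasing.
        Cow::Owned(s.to_lowercase())
    }
}
```

The first tier is the one that matters on the hot path: a haystack that is already lowercase ASCII is returned as a borrow, so no allocation happens before matching.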
Tests:
- 7 unit tests for `optimize_any_of` covering empty/singleton/below-
threshold/at-threshold/separate buckets/mixed matchers/equivalence.
- Proptest `ac_agrees_with_anyof_contains` asserts the optimized and
unoptimized matchers produce identical output on random haystacks.
- Proptests `ascii_lowercase_cow_correct` and
`ascii_lowercase_cow_borrows_when_already_lower_ascii`.
- Regression harness `regression_eval.rs` with three snapshot tests:
contains-heavy corpus, AllOf|contains semantics, keyword AC path.
Benchmarks:
- New `eval_contains_heavy` group (5–200 patterns, 1000 events).
- New `eval_ac_threshold_sweep` cross-product of pattern count
{1,2,4,8,16,32} × haystack length {100B, 1KB, 8KB, 64KB}.
- General `eval_throughput` improves 3–5% on synthetic 100-rule
corpus; `eval_contains_heavy` reaches 10 Melem/s at 50–100 patterns.
Summary
When a detection item or keyword group has 8+ plain `|contains` needles with the same case sensitivity, the compiler now collapses them into a single `AhoCorasickSet` matcher. This replaces N sequential `str::contains` calls with one linear pass over the haystack.
The optimizer is invoked only on `AnyOf` (OR) construction sites; `AllOf` (`|all` modifier) is left untouched so AND semantics are preserved.
For case-insensitive matching, needles are pre-lowered (matching the existing `Contains` invariant) and the hot path lowers the haystack once via a new `ascii_lowercase_cow` helper with a 3-tier fast path: borrow when already lowercase ASCII, in-place `make_ascii_lowercase` for ASCII with uppercase, full Unicode `to_lowercase` only for non-ASCII input.
Changes
- `aho-corasick = "1"` direct dep (already transitive via `regex`).
- `CompiledMatcher::AhoCorasickSet { automaton, case_insensitive }` variant.
- `compiler/optimizer.rs` module with `optimize_any_of` partition+bucket implementation. Threshold: 8 (placeholder; refined by sweep benchmark).
- `matcher/helpers::ascii_lowercase_cow` helper.
- `tests/regression_eval.rs` with 3 snapshot tests (contains-heavy corpus, AllOf semantics safeguard, keyword AC path).
- `eval_contains_heavy` (5–200 patterns) and `eval_ac_threshold_sweep` (pattern count × haystack length cross-product).
Tests
- `optimize_any_of` (empty, singleton, below/at threshold, separate CI/CS buckets, mixed matchers, equivalence).
- `ac_agrees_with_anyof_contains` asserts the optimized matcher agrees with the unoptimized version on random haystacks.
- `ascii_lowercase_cow` correctness vs std `to_lowercase` and the zero-alloc fast path.
- `|contains|all` rule must NOT fire on partial matches.
All 412 unit tests + 3 regression tests + workspace clippy clean.
Benchmarks (vs `before-p1` baseline)
[Table: `eval_throughput/events/1000`, `eval_throughput/events/10000`, `eval_throughput/events/100000`; measured values not recoverable]
New `eval_contains_heavy` (1 rule with N `|contains` needles, 1000 events):
[Table: values not recoverable]
The 50-100 sweet spot reflects AC's automaton-traversal cost balancing pattern count.
Test plan
- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo test --workspace`
- `eval_throughput` and new `eval_contains_heavy`
Acknowledgements
This PR and the subsequent changes are inspired by this post by Jean-Claude Cote about optimizing pySigma-backend-spark with the Aho-Corasick algorithm for contains modifiers in Spark. I'd also like to thank Vinicius Morais, who told me about it.