feat(eval): batch large |contains lists into Aho-Corasick automaton #99

Merged
mostafa merged 1 commit into main from feat/aho-corasick-matcher on May 13, 2026

@mostafa mostafa commented May 13, 2026

Summary

When a detection item or keyword group has 8+ plain |contains needles with the same case sensitivity, the compiler now collapses them into a single AhoCorasickSet matcher. This replaces N sequential str::contains calls with one linear pass over the haystack.
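To make the contrast between N sequential scans and a single automaton pass concrete, here is a minimal from-scratch sketch. It is not the PR's code: the PR depends on the `aho-corasick` crate, while `Automaton`, `contains_any`, and `any_contains` below are hypothetical names, and the sketch covers only the case-sensitive path.

```rust
use std::collections::{HashMap, VecDeque};

/// Baseline: one `str::contains` scan of the haystack per needle.
pub fn any_contains<S: AsRef<str>>(needles: &[S], haystack: &str) -> bool {
    needles.iter().any(|n| {
        let n: &str = n.as_ref();
        haystack.contains(n)
    })
}

/// Minimal byte-level Aho-Corasick automaton (illustration only).
pub struct Automaton {
    /// goto[state] maps an input byte to the next trie state.
    goto: Vec<HashMap<u8, usize>>,
    /// fail[state] is the state for the longest proper suffix that is also a trie prefix.
    fail: Vec<usize>,
    /// accepting[state] is true if a needle ends at `state` or on its failure chain.
    accepting: Vec<bool>,
}

impl Automaton {
    pub fn new<S: AsRef<str>>(needles: &[S]) -> Self {
        let mut goto: Vec<HashMap<u8, usize>> = vec![HashMap::new()];
        let mut accepting = vec![false];
        // Phase 1: insert every needle into a trie.
        for needle in needles {
            let needle: &str = needle.as_ref();
            let mut state = 0usize;
            for &b in needle.as_bytes() {
                state = match goto[state].get(&b).copied() {
                    Some(next) => next,
                    None => {
                        goto.push(HashMap::new());
                        accepting.push(false);
                        let next = goto.len() - 1;
                        goto[state].insert(b, next);
                        next
                    }
                };
            }
            accepting[state] = true;
        }
        // Phase 2: breadth-first pass to compute failure links.
        let mut fail = vec![0usize; goto.len()];
        let mut queue: VecDeque<usize> = goto[0].values().copied().collect();
        while let Some(state) = queue.pop_front() {
            for (&b, &next) in &goto[state] {
                let mut f = fail[state];
                while f != 0 && !goto[f].contains_key(&b) {
                    f = fail[f];
                }
                let fallback = goto[f].get(&b).copied().unwrap_or(0);
                fail[next] = fallback;
                // A needle that ends at the fallback state also ends here.
                if accepting[fallback] {
                    accepting[next] = true;
                }
                queue.push_back(next);
            }
        }
        Automaton { goto, fail, accepting }
    }

    /// True if any needle occurs in `haystack`; the input is read once, left to right.
    pub fn contains_any(&self, haystack: &str) -> bool {
        let mut state = 0usize;
        if self.accepting[state] {
            return true; // an empty needle matches everything
        }
        for &b in haystack.as_bytes() {
            while state != 0 && !self.goto[state].contains_key(&b) {
                state = self.fail[state];
            }
            state = self.goto[state].get(&b).copied().unwrap_or(0);
            if self.accepting[state] {
                return true;
            }
        }
        false
    }
}
```

Building `Automaton::new(&needles)` once at compile time and calling `contains_any` per event should return the same boolean as `any_contains`, while reading each haystack byte only once instead of once per needle.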

The optimizer is invoked only on AnyOf (OR) construction sites; AllOf (|all modifier) is left untouched so AND semantics are preserved.

For case-insensitive matching, needles are pre-lowered (matching the existing Contains invariant) and the hot path lowers the haystack once via a new ascii_lowercase_cow helper with a 3-tier fast path: borrow when already lowercase ASCII, in-place make_ascii_lowercase for ASCII with uppercase, full Unicode to_lowercase only for non-ASCII input.
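The three tiers can be sketched as follows. This is a reconstruction from the description above, not the helper's actual source: the signature and the order of the checks are assumptions.

```rust
use std::borrow::Cow;

/// Lowercase `s`, allocating only when the input actually needs to change.
/// Hypothetical reconstruction; the real helper in the PR may differ in detail.
pub fn ascii_lowercase_cow(s: &str) -> Cow<'_, str> {
    if !s.is_ascii() {
        // Tier 3: non-ASCII input falls back to full Unicode lowercasing.
        return Cow::Owned(s.to_lowercase());
    }
    if !s.bytes().any(|b| b.is_ascii_uppercase()) {
        // Tier 1: already lowercase ASCII, so borrow without allocating.
        return Cow::Borrowed(s);
    }
    // Tier 2: ASCII with uppercase; copy once, then lowercase in place.
    let mut owned = s.to_owned();
    owned.make_ascii_lowercase();
    Cow::Owned(owned)
}
```

Returning `Cow` lets the caller treat all three tiers uniformly as `&str` while paying for an allocation only in tiers 2 and 3.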

Changes

  • New aho-corasick = "1" direct dep (already transitive via regex).
  • New CompiledMatcher::AhoCorasickSet { automaton, case_insensitive } variant.
  • New compiler/optimizer.rs module with optimize_any_of partition+bucket implementation. Threshold: 8 (placeholder; refined by sweep benchmark).
  • New matcher/helpers::ascii_lowercase_cow helper.
  • New tests/regression_eval.rs with 3 snapshot tests (contains-heavy corpus, AllOf semantics safeguard, keyword AC path).
  • New bench groups: eval_contains_heavy (5–200 patterns) and eval_ac_threshold_sweep (pattern count × haystack length cross-product).
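The partition+bucket step of `optimize_any_of` might look roughly like the sketch below. The `Matcher` enum, the `ContainsSet` variant (standing in for `AhoCorasickSet`, without a real automaton), and the `AC_THRESHOLD` constant are hypothetical simplifications; only the function name and the threshold of 8 come from this PR.

```rust
/// Hypothetical stand-in for the crate's compiled matcher type.
#[derive(Debug, Clone, PartialEq)]
pub enum Matcher {
    /// A single plain `|contains` needle.
    Contains { needle: String, case_insensitive: bool },
    /// Stand-in for `AhoCorasickSet`; a real one would hold a built automaton.
    ContainsSet { needles: Vec<String>, case_insensitive: bool },
    /// Any other matcher kind (regex, startswith, ...), left untouched.
    Other(String),
}

/// Assumed name; the PR states a threshold of 8.
pub const AC_THRESHOLD: usize = 8;

/// Rewrite the children of an OR group, batching large Contains buckets.
pub fn optimize_any_of(children: Vec<Matcher>) -> Vec<Matcher> {
    let mut ci = Vec::new(); // case-insensitive needles
    let mut cs = Vec::new(); // case-sensitive needles
    let mut rest = Vec::new(); // everything that is not a plain Contains
    for m in children {
        match m {
            Matcher::Contains { needle, case_insensitive: true } => ci.push(needle),
            Matcher::Contains { needle, case_insensitive: false } => cs.push(needle),
            other => rest.push(other),
        }
    }
    let mut out = Vec::new();
    for (needles, case_insensitive) in [(ci, true), (cs, false)] {
        if needles.len() >= AC_THRESHOLD {
            // Large bucket: one set matcher replaces all of its needles.
            out.push(Matcher::ContainsSet { needles, case_insensitive });
        } else {
            // Small bucket: keep the individual matchers.
            out.extend(
                needles
                    .into_iter()
                    .map(|needle| Matcher::Contains { needle, case_insensitive }),
            );
        }
    }
    out.extend(rest);
    out
}
```

Reordering the children is safe here only because the group is an OR: the result is true if any child matches, regardless of order.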

Tests

  • 7 unit tests for optimize_any_of (empty, singleton, below/at threshold, separate CI/CS buckets, mixed matchers, equivalence).
  • Proptest ac_agrees_with_anyof_contains asserts the optimized matcher agrees with the unoptimized version on random haystacks.
  • Proptests for ascii_lowercase_cow correctness vs std to_lowercase and the zero-alloc fast path.
  • Regression test for the AllOf safeguard: a |contains|all rule must NOT fire on partial matches.
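The AllOf safeguard comes down to the difference between `any` and `all`. The sketch below uses plain `str::contains` and hypothetical function names to show why a "does any needle match?" query cannot stand in for `|contains|all`.

```rust
/// OR semantics: fires if at least one needle is present. This is the
/// question a multi-pattern set matcher answers in one pass.
pub fn any_of_contains(needles: &[&str], haystack: &str) -> bool {
    needles.iter().any(|n| haystack.contains(*n))
}

/// AND semantics (`|contains|all`): every needle must be present, so a
/// single "any match" answer is not sufficient.
pub fn all_of_contains(needles: &[&str], haystack: &str) -> bool {
    needles.iter().all(|n| haystack.contains(*n))
}
```

For needles `-enc`, `bypass`, and `hidden`, the haystack `powershell -enc AAAA` satisfies the OR form but must not satisfy the AND form, which is the partial-match case the regression test guards against.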

All 412 unit tests and 3 regression tests pass, and workspace clippy is clean.

Benchmarks (vs before-p1 baseline)

| Group | Before | After | Change |
|---|---|---|---|
| eval_throughput/events/1000 | 388 K elem/s | 404 K elem/s | +3.3% |
| eval_throughput/events/10000 | 389 K elem/s | 406 K elem/s | +5.1% |
| eval_throughput/events/100000 | 386 K elem/s | 405 K elem/s | +4.8% |

New eval_contains_heavy (1 rule with N |contains needles, 1000 events):

| Patterns | Throughput |
|---|---|
| 10 | 7.25 Melem/s |
| 20 | 8.68 Melem/s |
| 50 | 10.32 Melem/s |
| 100 | 10.40 Melem/s |
| 200 | 7.25 Melem/s |

Throughput peaks at 50–100 patterns and falls back at 200. This sweet spot reflects the balance between AC's automaton-traversal cost and the number of patterns it replaces.

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace
  • Criterion benchmarks before/after on eval_throughput and new eval_contains_heavy
  • SigmaHQ corpus regression (run after merge)

Acknowledgements

This PR and the subsequent changes are inspired by a post by Jean-Claude Cote about optimizing pySigma-backend-spark with the Aho-Corasick algorithm for contains modifiers in Spark. I'd also like to thank Vinicius Morais, who told me about it.

When a detection item or keyword group has 8 or more plain `|contains`
needles with the same case sensitivity, collapse them into a single
`AhoCorasickSet` matcher. Replaces N sequential `str::contains` calls
with one linear pass over the haystack.

Adds a new `compiler::optimizer` module with `optimize_any_of`, which
partitions an `AnyOf` group into case-insensitive Contains, case-
sensitive Contains, and other matcher buckets. Each Contains bucket
that meets the threshold becomes one `AhoCorasickSet`; the rest stay
as individual matchers. The optimizer is invoked only on `AnyOf`
construction sites — `AllOf` (`|all` modifier) is left untouched, so
AND semantics are preserved.

For case-insensitive matching, needles are pre-lowered (matching the
existing `Contains` invariant) and the hot path lowers the haystack
once via the new `ascii_lowercase_cow` helper, which has a 3-tier
fast path: borrow when already lowercase ASCII, in-place
`make_ascii_lowercase` for ASCII with uppercase, and full Unicode
`to_lowercase` only for non-ASCII input.

Tests:
- 7 unit tests for `optimize_any_of` covering empty/singleton/below-
  threshold/at-threshold/separate buckets/mixed matchers/equivalence.
- Proptest `ac_agrees_with_anyof_contains` asserts the optimized and
  unoptimized matchers produce identical output on random haystacks.
- Proptests `ascii_lowercase_cow_correct` and
  `ascii_lowercase_cow_borrows_when_already_lower_ascii`.
- Regression harness `regression_eval.rs` with three snapshot tests:
  contains-heavy corpus, AllOf|contains semantics, keyword AC path.

Benchmarks:
- New `eval_contains_heavy` group (5–200 patterns, 1000 events).
- New `eval_ac_threshold_sweep` cross-product of pattern count
  {1,2,4,8,16,32} × haystack length {100B, 1KB, 8KB, 64KB}.
- General `eval_throughput` improves 3–5% on synthetic 100-rule
  corpus; `eval_contains_heavy` reaches 10 Melem/s at 50–100 patterns.
@mostafa mostafa merged commit b0d1820 into main May 13, 2026
12 checks passed
@mostafa mostafa deleted the feat/aho-corasick-matcher branch May 13, 2026 12:21