feat(eval): lower haystack once for case-insensitive matcher groups #100

Merged: mostafa merged 1 commit into main from feat/case-fold-optimization on May 13, 2026

@mostafa mostafa commented May 13, 2026

Summary

When a group of matchers is composed entirely of case-insensitive string predicates, this PR makes the eval engine lower the haystack once and reuse it across all children. Previously, each Contains, StartsWith, EndsWith, or Exact matcher with case_insensitive: true allocated its own lowered String per call, so allocations scaled with the child count.

A new CompiledMatcher::CaseInsensitiveGroup { children, mode } variant encodes the optimization in the type system. The optimizer wraps an AnyOf into a CaseInsensitiveGroup when every child is pre-lowerable: CI Contains / StartsWith / EndsWith / Exact / AhoCorasickSet, regexes carrying the (?i) flag, and recursively pre-lowerable composites.
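The lower-once idea can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `CiPredicate` and the standalone `CaseInsensitiveGroup` struct are hypothetical stand-ins for the real `CompiledMatcher` variants, and needles are assumed to be stored already lowercased.

```rust
/// Mirrors the `GroupMode { Any, All }` enum named in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupMode {
    Any,
    All,
}

/// Hypothetical stand-in for the case-insensitive string matchers.
/// Assumption: each needle was lowercased at compile time.
#[derive(Debug, Clone)]
pub enum CiPredicate {
    Contains(String),
    StartsWith(String),
    EndsWith(String),
    Exact(String),
}

impl CiPredicate {
    /// Old shape: each child lowers the haystack itself,
    /// costing one `String` allocation per child per call.
    pub fn matches(&self, haystack: &str) -> bool {
        self.matches_pre_lowered(&haystack.to_lowercase())
    }

    /// New shape: the caller guarantees `lowered` is already lowercased,
    /// so the child is a pure string predicate with no allocation.
    pub fn matches_pre_lowered(&self, lowered: &str) -> bool {
        match self {
            CiPredicate::Contains(n) => lowered.contains(n.as_str()),
            CiPredicate::StartsWith(n) => lowered.starts_with(n.as_str()),
            CiPredicate::EndsWith(n) => lowered.ends_with(n.as_str()),
            CiPredicate::Exact(n) => lowered == n.as_str(),
        }
    }
}

/// Hypothetical struct form of `CaseInsensitiveGroup { children, mode }`.
pub struct CaseInsensitiveGroup {
    pub children: Vec<CiPredicate>,
    pub mode: GroupMode,
}

impl CaseInsensitiveGroup {
    pub fn matches(&self, haystack: &str) -> bool {
        // Single allocation, shared by every child.
        let lowered = haystack.to_lowercase();
        let mut hits = self
            .children
            .iter()
            .map(|child| child.matches_pre_lowered(&lowered));
        match self.mode {
            GroupMode::Any => hits.any(|hit| hit),
            GroupMode::All => hits.all(|hit| hit),
        }
    }
}
```

With N children, the per-child path performs N `to_lowercase()` calls per haystack, while the grouped path performs one regardless of N.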

Hot path

The new matches_pre_lowered(&str) method assumes the haystack is already Unicode-lowercased and dispatches to children as pure string predicates. Optimizer bugs that feed it a non-pre-lowerable matcher trip a debug_assert!; in release the conservative fallback (false) avoids producing a false positive.
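The debug-assert-plus-conservative-fallback pattern described above might look something like the sketch below. The `Matcher` enum here is a hypothetical, cut-down stand-in for `CompiledMatcher`; only the shape of the fallback arm is the point.

```rust
/// Hypothetical, reduced matcher type for illustration only.
#[derive(Debug)]
pub enum Matcher {
    Contains { needle: String, case_insensitive: bool },
    NumericEq(i64),
}

impl Matcher {
    /// Assumes `lowered` is already lowercased by the enclosing group.
    pub fn matches_pre_lowered(&self, lowered: &str) -> bool {
        match self {
            // Pre-lowerable: a pure string predicate on the lowered haystack.
            Matcher::Contains { needle, case_insensitive: true } => {
                lowered.contains(needle.as_str())
            }
            // Anything else indicates an optimizer bug. Debug builds panic
            // loudly; release builds report "no match" so that a rule can
            // never fire because of a matcher evaluated against the wrong
            // (lowered) input.
            other => {
                debug_assert!(
                    false,
                    "non-pre-lowerable matcher reached matches_pre_lowered: {other:?}"
                );
                false
            }
        }
    }
}
```

Returning `false` trades a possible missed detection for the guarantee of no spurious one, which matches the PR's stated goal of avoiding false positives.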

Changes

  • New GroupMode { Any, All } enum.
  • New CompiledMatcher::CaseInsensitiveGroup variant.
  • New is_pre_lowerable validator + regex_is_case_insensitive flag inspector in compiler/optimizer.rs.
  • optimize_any_of extended with wrap_ci_group_or_anyof post-pass.
  • rule_index.rs learned to descend into CaseInsensitiveGroup so detection items with multiple values keep their exact-match indexing.
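As a rough illustration of the validator and flag inspector listed above, the sketch below shows one conservative way to write them. The function names come from the PR, but the bodies, the reduced `Matcher` enum, and the exact flag-parsing rules are assumptions; the real implementation in compiler/optimizer.rs may differ.

```rust
/// Returns true only when the pattern begins with a global flag group
/// (e.g. `(?i)`, `(?im)`, `(?si)`) that enables `i` without negating it.
/// Conservative by design: anything it cannot parse is treated as
/// case-sensitive.
pub fn regex_is_case_insensitive(pattern: &str) -> bool {
    let Some(rest) = pattern.strip_prefix("(?") else {
        return false; // no leading flag group
    };
    let Some(end) = rest.find(')') else {
        return false; // malformed: flag group never closes
    };
    let flags = &rest[..end];
    // Rejects scoped groups like `(?i:foo)`, named groups, and lookarounds,
    // whose contents include characters other than flag letters and `-`.
    if flags.is_empty() || !flags.chars().all(|c| c.is_ascii_alphabetic() || c == '-') {
        return false;
    }
    // Flags before `-` are switched on, flags after it are switched off.
    let mut parts = flags.splitn(2, '-');
    let enabled = parts.next().unwrap_or("");
    let disabled = parts.next().unwrap_or("");
    enabled.contains('i') && !disabled.contains('i')
}

/// Hypothetical, reduced matcher type for illustration only.
pub enum Matcher {
    Contains { needle: String, case_insensitive: bool },
    Regex { pattern: String },
    Numeric(i64),
    AnyOf(Vec<Matcher>),
}

/// A matcher is pre-lowerable when evaluating it against a lowercased
/// haystack gives the same answer as evaluating it against the original.
pub fn is_pre_lowerable(m: &Matcher) -> bool {
    match m {
        Matcher::Contains { case_insensitive, .. } => *case_insensitive,
        Matcher::Regex { pattern } => regex_is_case_insensitive(pattern),
        Matcher::Numeric(_) => false,
        // Composites qualify only if every child does, so a single
        // case-sensitive child "poisons" the wrap.
        Matcher::AnyOf(children) => children.iter().all(is_pre_lowerable),
    }
}
```

A post-pass in the spirit of `wrap_ci_group_or_anyof` would then wrap the children in a `CaseInsensitiveGroup` when `is_pre_lowerable` holds for all of them, and otherwise leave the plain `AnyOf` in place.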

Tests

  • 4 new unit tests for CI grouping: below-threshold wrap, mixed-case bypass, mixed pre-lowerable children wrapping, CS child poisoning the wrap.
  • is_pre_lowerable_classifies_correctly covering all relevant variants.
  • regex_is_case_insensitive_recognizer covering (?i), (?im), (?si), no-flags, other-flags-only, malformed, negated, and subpattern groups.
  • ci_group_matches_same_haystacks_as_anyof equivalence test on six hand-picked haystacks.

All 471 workspace tests pass; cargo clippy --workspace --all-targets --all-features -- -D warnings is clean.

Benchmarks (eval_contains_heavy, 1 rule × 1000 events)

| Pattern count | Phase 1 baseline | After CI grouping | Δ throughput |
|---|---|---|---|
| 5 | ~232 µs (4.3 Melem/s) | 121 µs (8.2 Melem/s) | +90% |
| 20 | ~125 µs (8.0 Melem/s) | 98 µs (10.2 Melem/s) | +28% |
| 50 | 99 µs | 100 µs | no change |
| 100 | 99 µs | 99 µs | no change |

The biggest wins land below the Aho-Corasick threshold where multiple individual matchers stay live in the group. Once AC consumes the bucket, only one matcher remains so the CI wrapper provides no further benefit.
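For readers checking the benchmark figures, the throughput numbers follow from the per-iteration times and the 1000-event batch size. The helpers below are hypothetical and exist only to show the arithmetic; results agree with the reported figures to within rounding of the approximate baselines.

```rust
/// Throughput in millions of elements per second.
/// Elements per microsecond is numerically equal to Melem/s.
pub fn melem_per_sec(events: u64, micros: f64) -> f64 {
    events as f64 / micros
}

/// Relative throughput change in percent when the time for the same
/// batch drops from `before_us` to `after_us`.
pub fn throughput_gain_pct(before_us: f64, after_us: f64) -> f64 {
    (before_us / after_us - 1.0) * 100.0
}
```

For example, `melem_per_sec(1000, 121.0)` is about 8.3 and `throughput_gain_pct(125.0, 98.0)` is about 28, in line with the 5-pattern and 20-pattern rows respectively.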

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace (471 tests pass, 0 failures)
  • Criterion benchmarks before/after on eval_contains_heavy
  • SigmaHQ corpus regression (run after merge)

When a group of matchers is composed entirely of case-insensitive string
predicates, lowering the haystack once and reusing it across all children
eliminates the redundant `to_lowercase()` allocation that each child would
otherwise do. The savings scale linearly with the child count and are
substantial below the Aho-Corasick threshold where multiple individual
matchers stay live.

The new variant `CompiledMatcher::CaseInsensitiveGroup { children, mode }`
encodes this in the type system. The optimizer wraps an `AnyOf` into a
`CaseInsensitiveGroup` when every child is pre-lowerable: CI Contains /
StartsWith / EndsWith / Exact / AhoCorasickSet, regexes carrying the
`(?i)` flag, and recursively pre-lowerable composites. A new
`is_pre_lowerable` validator gates this and a new `regex_is_case_insensitive`
helper inspects flag prefixes on regex patterns.

The hot path uses a new `matches_pre_lowered(&str)` method that assumes
the haystack is already Unicode-lowercased and dispatches to children as
pure string predicates. The dispatch trips a `debug_assert!` if the
optimizer ever feeds it a non-pre-lowerable matcher; in release builds
the conservative fallback (`false`) avoids producing a false positive.

`rule_index.rs` learned to descend into `CaseInsensitiveGroup` so rules
whose detection items get wrapped (e.g. `EventType: [login, logout]`)
keep their exact-match indexing.
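To illustrate why the index has to look through the new wrapper, here is a minimal sketch of an exact-value collector. Everything in it is assumed for illustration: the reduced `Matcher` enum, the `exact_values` helper, and the idea that the index keys rules by the set of exact values are not taken from `rule_index.rs`.

```rust
/// Hypothetical, reduced matcher type for illustration only.
pub enum Matcher {
    Exact(String),
    Contains(String),
    AnyOf(Vec<Matcher>),
    /// Group mode omitted here; only the `Any` case is modelled.
    CaseInsensitiveGroup { children: Vec<Matcher> },
}

/// Returns the exact-match values a rule can be indexed under, or `None`
/// when any leaf is not an exact match.
pub fn exact_values(m: &Matcher) -> Option<Vec<String>> {
    match m {
        Matcher::Exact(value) => Some(vec![value.clone()]),
        // Treat the CI wrapper like a plain `AnyOf`. Without this arm, a
        // wrapped `EventType: [login, logout]` would look opaque and the
        // rule would lose its exact-match index entry.
        Matcher::AnyOf(children) | Matcher::CaseInsensitiveGroup { children } => {
            let mut out = Vec::new();
            for child in children {
                out.extend(exact_values(child)?);
            }
            Some(out)
        }
        Matcher::Contains(_) => None,
    }
}
```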

Tests:
- 4 new unit tests in `optimizer.rs`: CI group wrapping below AC
  threshold, mixed-case bypass, mixed pre-lowerable children getting
  wrapped, single CS child poisoning the wrap.
- `is_pre_lowerable_classifies_correctly` covering Contains/StartsWith/
  EndsWith/Exact (CI and CS), regex (with and without the `i` flag),
  numeric, CIDR.
- `regex_is_case_insensitive_recognizer` covering `(?i)`, `(?im)`,
  `(?si)`, no-flags, other-flags-only, malformed, negated, and
  subpattern groups.
- `ci_group_matches_same_haystacks_as_anyof` equivalence test.

Benchmarks (`eval_contains_heavy` vs the pre-grouping baseline):
- patterns/5: -47% time, +90% throughput (now ~8.2 Melem/s).
- patterns/20: -22% time, +28% throughput (now ~10.2 Melem/s).
- patterns/{50,100}: no significant change (already AC-dominant).
@mostafa mostafa merged commit 34cc311 into main May 13, 2026
10 checks passed
@mostafa mostafa deleted the feat/case-fold-optimization branch May 13, 2026 14:21