feat(eval): lower haystack once for case-insensitive matcher groups #100
Merged
When a group of matchers is composed entirely of case-insensitive string
predicates, lowering the haystack once and reusing it across all children
eliminates the redundant `to_lowercase()` allocation that each child would
otherwise perform. The savings scale linearly with the child count and are
largest below the Aho-Corasick threshold, where multiple individual
matchers stay live.
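The difference between the two evaluation shapes can be sketched as follows. This is a minimal illustration, not the engine's code: `CiContains`, `any_of_naive`, and `any_of_grouped` are hypothetical names, and the needle is assumed to be stored already lowercased.

```rust
/// Hypothetical stand-in for a case-insensitive `Contains` matcher.
/// `needle` is assumed to have been lowercased when the rule was compiled.
pub struct CiContains {
    pub needle: String,
}

impl CiContains {
    /// Pre-grouping shape: the child lowers the haystack itself,
    /// allocating a fresh `String` on every call.
    pub fn matches(&self, haystack: &str) -> bool {
        haystack.to_lowercase().contains(self.needle.as_str())
    }

    /// Grouped shape: the caller has already lowered the haystack,
    /// so this is a plain substring check with no allocation.
    pub fn matches_pre_lowered(&self, lowered: &str) -> bool {
        lowered.contains(self.needle.as_str())
    }
}

/// Up to one allocation per child (fewer if `any` short-circuits).
pub fn any_of_naive(children: &[CiContains], haystack: &str) -> bool {
    children.iter().any(|c| c.matches(haystack))
}

/// Exactly one allocation, regardless of the child count.
pub fn any_of_grouped(children: &[CiContains], haystack: &str) -> bool {
    let lowered = haystack.to_lowercase();
    children.iter().any(|c| c.matches_pre_lowered(&lowered))
}
```

Both functions return the same result for the same inputs; only the number of `to_lowercase()` calls differs.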
The new variant `CompiledMatcher::CaseInsensitiveGroup { children, mode }`
encodes this in the type system. The optimizer wraps an `AnyOf` into a
`CaseInsensitiveGroup` when every child is pre-lowerable: case-insensitive
(CI) Contains / StartsWith / EndsWith / Exact / AhoCorasickSet, regexes
carrying the `(?i)` flag, and recursively pre-lowerable composites. A new
`is_pre_lowerable` validator gates the wrap, and a new
`regex_is_case_insensitive` helper inspects flag prefixes on regex patterns.
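A rough sketch of how the gate might look is given here. The variant and function names come from this PR, but every field layout and signature is an assumption made for illustration, `AhoCorasickSet` and CIDR are omitted to keep the block self-contained, and the flag inspector only examines a leading flag group (it does not look for later toggles such as `(?-i)` further into the pattern).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupMode {
    Any,
    All,
}

// Field layouts below are assumptions, not the crate's actual definitions.
#[derive(Debug, Clone)]
pub enum CompiledMatcher {
    Contains { value: String, case_insensitive: bool },
    StartsWith { value: String, case_insensitive: bool },
    EndsWith { value: String, case_insensitive: bool },
    Exact { value: String, case_insensitive: bool },
    Regex { pattern: String },
    Numeric(f64),
    AnyOf(Vec<CompiledMatcher>),
    CaseInsensitiveGroup { children: Vec<CompiledMatcher>, mode: GroupMode },
}

/// True only when the pattern starts with a flag group such as `(?i)`,
/// `(?im)`, or `(?si)` that enables `i` and never negates it.
pub fn regex_is_case_insensitive(pattern: &str) -> bool {
    let Some(rest) = pattern.strip_prefix("(?") else { return false };
    let Some(end) = rest.find(')') else { return false }; // malformed
    let (mut negated, mut saw_i) = (false, false);
    for ch in rest[..end].chars() {
        match ch {
            '-' => negated = true,
            'i' if negated => return false, // `(?-i)` turns the flag off
            'i' => saw_i = true,
            'm' | 's' | 'U' | 'u' | 'x' | 'R' => {}
            _ => return false, // `(?i:...)`, named groups, anything else
        }
    }
    saw_i
}

/// A matcher is pre-lowerable when it can be evaluated against an
/// already-lowercased haystack without changing its result.
pub fn is_pre_lowerable(m: &CompiledMatcher) -> bool {
    use CompiledMatcher as M;
    match m {
        M::Contains { case_insensitive, .. }
        | M::StartsWith { case_insensitive, .. }
        | M::EndsWith { case_insensitive, .. }
        | M::Exact { case_insensitive, .. } => *case_insensitive,
        M::Regex { pattern } => regex_is_case_insensitive(pattern),
        M::AnyOf(children) | M::CaseInsensitiveGroup { children, .. } => {
            children.iter().all(is_pre_lowerable)
        }
        M::Numeric(_) => false,
    }
}

/// Post-pass: wrap only when every child passes the gate.
pub fn wrap_ci_group_or_anyof(children: Vec<CompiledMatcher>) -> CompiledMatcher {
    if children.iter().all(is_pre_lowerable) {
        CompiledMatcher::CaseInsensitiveGroup { children, mode: GroupMode::Any }
    } else {
        CompiledMatcher::AnyOf(children)
    }
}
```

A single case-sensitive child makes `wrap_ci_group_or_anyof` fall back to a plain `AnyOf`, which mirrors the "single CS child poisoning the wrap" case in the unit tests.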
The hot path uses a new `matches_pre_lowered(&str)` method that assumes
the haystack is already Unicode-lowercased and dispatches to children as
pure string predicates. The dispatch trips a `debug_assert!` if the
optimizer ever feeds it a non-pre-lowerable matcher; in release builds
the conservative fallback (`false`) avoids producing a false positive.
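The dispatch described above could look roughly like this. It is a sketch under assumptions: the enum is reduced to a few variants with tuple fields chosen for brevity, needles are assumed to be lowercased at compile time, and `eval_lowering_once` is a hypothetical entry point rather than the engine's real one.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupMode {
    Any,
    All,
}

// Reduced, assumed shape of the matcher enum for illustration only.
#[derive(Debug, Clone)]
pub enum CompiledMatcher {
    // CI string predicates; needles assumed already lowercased.
    Contains(String),
    StartsWith(String),
    EndsWith(String),
    Exact(String),
    // Example of a matcher that is not pre-lowerable.
    Numeric(f64),
    CaseInsensitiveGroup { children: Vec<CompiledMatcher>, mode: GroupMode },
}

impl CompiledMatcher {
    /// Assumes `lowered` is already Unicode-lowercased; children are
    /// evaluated as pure string predicates with no further allocation.
    pub fn matches_pre_lowered(&self, lowered: &str) -> bool {
        match self {
            Self::Contains(n) => lowered.contains(n.as_str()),
            Self::StartsWith(n) => lowered.starts_with(n.as_str()),
            Self::EndsWith(n) => lowered.ends_with(n.as_str()),
            Self::Exact(n) => lowered == n.as_str(),
            Self::CaseInsensitiveGroup { children, mode } => match mode {
                GroupMode::Any => children.iter().any(|c| c.matches_pre_lowered(lowered)),
                GroupMode::All => children.iter().all(|c| c.matches_pre_lowered(lowered)),
            },
            Self::Numeric(_) => {
                // Reaching this arm means the optimizer wrapped something it
                // should not have: panic in debug builds, and answer `false`
                // in release so no false positive is produced.
                debug_assert!(false, "non-pre-lowerable matcher in CI group");
                false
            }
        }
    }
}

/// Hypothetical entry point: lower the haystack once, then dispatch.
pub fn eval_lowering_once(matcher: &CompiledMatcher, haystack: &str) -> bool {
    let lowered = haystack.to_lowercase();
    matcher.matches_pre_lowered(&lowered)
}
```

Returning `false` on the unexpected arm trades a possible missed match for the guarantee that a miscompiled group never fires spuriously.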
`rule_index.rs` learned to descend into `CaseInsensitiveGroup` so rules
whose detection items get wrapped (e.g. `EventType: [login, logout]`)
keep their exact-match indexing.
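One way an indexer might descend into the wrapper is sketched here. The actual logic in `rule_index.rs` is not shown in this PR text, so `exact_values`, the enum shape, and the decision to descend only into `Any`-mode groups are all assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum GroupMode {
    Any,
    All,
}

// Reduced, assumed shape of the matcher enum for illustration only.
#[derive(Debug, Clone)]
pub enum CompiledMatcher {
    Exact(String),
    Contains(String),
    AnyOf(Vec<CompiledMatcher>),
    CaseInsensitiveGroup { children: Vec<CompiledMatcher>, mode: GroupMode },
}

/// Hypothetical helper: returns the exact-match values an index could key
/// on, or `None` if any part of the matcher is not an exact match.
pub fn exact_values(m: &CompiledMatcher) -> Option<Vec<&str>> {
    match m {
        CompiledMatcher::Exact(v) => Some(vec![v.as_str()]),
        // Treat an `Any`-mode CI group exactly like the `AnyOf` it replaced.
        CompiledMatcher::AnyOf(children)
        | CompiledMatcher::CaseInsensitiveGroup { children, mode: GroupMode::Any } => {
            let mut out = Vec::new();
            for child in children {
                out.extend(exact_values(child)?);
            }
            Some(out)
        }
        _ => None,
    }
}
```

With this shape, a wrapped `EventType: [login, logout]` item still yields both values, just as the unwrapped `AnyOf` would.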
Tests:
- 4 new unit tests in `optimizer.rs`: CI group wrapping below AC
threshold, mixed-case bypass, mixed pre-lowerable children getting
wrapped, single CS child poisoning the wrap.
- `is_pre_lowerable_classifies_correctly` covering Contains/StartsWith/
EndsWith/Exact (CI and CS), regex (with and without the `i` flag),
numeric, CIDR.
- `regex_is_case_insensitive_recognizer` covering `(?i)`, `(?im)`,
`(?si)`, no-flags, other-flags-only, malformed, negated, and
subpattern groups.
- `ci_group_matches_same_haystacks_as_anyof` equivalence test.
Benchmarks (`eval_contains_heavy` vs the pre-grouping baseline):
- patterns/5: -47% time, +90% throughput (now ~8.2 Melem/s).
- patterns/20: -22% time, +28% throughput (now ~10.2 Melem/s).
- patterns/{50,100}: no significant change (already AC-dominant).
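The time and throughput figures are two views of the same measurement, since throughput is the reciprocal of time per element. A small helper (hypothetical, not part of the benchmark harness) makes the relationship explicit:

```rust
/// Converts a fractional time reduction (0.47 for "-47% time") into the
/// corresponding fractional throughput gain, assuming throughput = 1 / time.
pub fn throughput_gain(time_reduction: f64) -> f64 {
    1.0 / (1.0 - time_reduction) - 1.0
}
```

For a 47% time reduction this gives roughly 0.89, and for 22% roughly 0.28, in line with the reported +90% and +28% once rounding of the time figures is taken into account.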
Summary
When a group of matchers is composed entirely of case-insensitive string predicates, this PR makes the eval engine lower the haystack once and reuse it across all children. Previously, each `Contains`, `StartsWith`, `EndsWith`, `Exact` matcher with `case_insensitive: true` allocated its own lowered `String` per call, scaling allocations with the child count.
A new `CompiledMatcher::CaseInsensitiveGroup { children, mode }` variant encodes the optimization in the type system. The optimizer wraps an `AnyOf` into a `CaseInsensitiveGroup` when every child is pre-lowerable: CI `Contains` / `StartsWith` / `EndsWith` / `Exact` / `AhoCorasickSet`, regexes carrying the `(?i)` flag, and recursively pre-lowerable composites.
Hot path
The new `matches_pre_lowered(&str)` method assumes the haystack is already Unicode-lowercased and dispatches to children as pure string predicates. Optimizer bugs that feed it a non-pre-lowerable matcher trip a `debug_assert!`; in release the conservative fallback (`false`) avoids producing a false positive.
Changes
- `GroupMode { Any, All }` enum.
- `CompiledMatcher::CaseInsensitiveGroup` variant.
- `is_pre_lowerable` validator + `regex_is_case_insensitive` flag inspector in `compiler/optimizer.rs`.
- `optimize_any_of` extended with `wrap_ci_group_or_anyof` post-pass.
- `rule_index.rs` learned to descend into `CaseInsensitiveGroup` so detection items with multiple values keep their exact-match indexing.
Tests
- `is_pre_lowerable_classifies_correctly` covering all relevant variants.
- `regex_is_case_insensitive_recognizer` covering `(?i)`, `(?im)`, `(?si)`, no-flags, other-flags-only, malformed, negated, and subpattern groups.
- `ci_group_matches_same_haystacks_as_anyof` equivalence test on six hand-picked haystacks.
All 471 workspace tests pass; `cargo clippy --workspace --all-targets --all-features -- -D warnings` is clean.
Benchmarks (`eval_contains_heavy`, 1 rule × 1000 events)
The biggest wins land below the Aho-Corasick threshold where multiple individual matchers stay live in the group. Once AC consumes the bucket, only one matcher remains so the CI wrapper provides no further benefit.
Test plan
- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo test --workspace` (471 tests pass, 0 failures)
- `eval_contains_heavy`