feat(eval): add opt-in bloom filter pre-filtering for substring matchers #102
Merged
Adds a per-field bloom filter built at engine load time over the union of
positive substring needles (Contains / StartsWith / EndsWith /
AhoCorasickSet) across all rules. When enabled, `Engine::evaluate` probes
the bloom for each event field and short-circuits any positive substring
detection item whose field cannot possibly contain a needle trigram.
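A minimal sketch of such a probe for a single field, assuming byte-level trigrams. `TrigramBloom`, `NUM_HASHES = 4`, and the handling of needles shorter than three bytes are illustrative assumptions, not taken from this PR, and std's `DefaultHasher` stands in for `AHasher` so the snippet has no dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

const NGRAM_SIZE: usize = 3;
const BITS_PER_TRIGRAM: usize = 16;
const NUM_HASHES: usize = 4; // assumption: the PR does not state the hash count

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Verdict {
    MaybeMatch,
    DefinitelyNoMatch,
}

struct TrigramBloom {
    bits: Vec<u64>,
    m: u64, // total number of bits
    has_short_needle: bool,
}

/// Two hashes of the same trigram, separated by a prefix byte.
fn hash_pair(gram: &[u8]) -> (u64, u64) {
    let hash_with = |prefix: u8| {
        let mut h = DefaultHasher::new();
        h.write_u8(prefix);
        h.write(gram);
        h.finish()
    };
    // Force h2 odd so the stride is never zero.
    (hash_with(0x01), hash_with(0x02) | 1)
}

/// Double hashing: position_i = (h1 + i * h2) mod m.
fn positions(gram: &[u8], m: u64) -> [u64; NUM_HASHES] {
    let (h1, h2) = hash_pair(gram);
    let mut out = [0u64; NUM_HASHES];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = h1.wrapping_add((i as u64).wrapping_mul(h2)) % m;
    }
    out
}

impl TrigramBloom {
    /// Inserts every trigram of every lowercased needle.
    fn build(needles: &[&str]) -> Self {
        let lowered: Vec<Vec<u8>> = needles
            .iter()
            .map(|n| n.to_lowercase().into_bytes())
            .collect();
        let grams: usize = lowered
            .iter()
            .map(|n| n.len().saturating_sub(NGRAM_SIZE - 1))
            .sum();
        // Round up to whole 64-bit words, with at least one word.
        let words = (grams.max(1) * BITS_PER_TRIGRAM + 63) / 64;
        let mut bloom = TrigramBloom {
            bits: vec![0; words],
            m: (words * 64) as u64,
            has_short_needle: false,
        };
        for needle in &lowered {
            if needle.len() < NGRAM_SIZE {
                // Assumption: a needle with no trigram disables rejection.
                bloom.has_short_needle = true;
                continue;
            }
            for gram in needle.windows(NGRAM_SIZE) {
                for p in positions(gram, bloom.m) {
                    bloom.bits[(p / 64) as usize] |= 1u64 << (p % 64);
                }
            }
        }
        bloom
    }

    fn contains(&self, gram: &[u8]) -> bool {
        positions(gram, self.m)
            .iter()
            .all(|&p| (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0)
    }

    /// Rejects only when no haystack trigram is present in the filter.
    fn probe(&self, haystack: &str) -> Verdict {
        let hay = haystack.to_lowercase().into_bytes();
        if self.has_short_needle || hay.len() < NGRAM_SIZE {
            return Verdict::MaybeMatch;
        }
        if hay.windows(NGRAM_SIZE).any(|g| self.contains(g)) {
            Verdict::MaybeMatch
        } else {
            Verdict::DefinitelyNoMatch
        }
    }
}
```

With needles such as `powershell` and `mimikatz`, any value containing one of them always yields `MaybeMatch`, because every trigram of the needle was inserted. A digit-only value is rejected unless one of its trigrams collides with set bits, so false positives cost only a wasted matcher call and false negatives cannot occur.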
Pre-filtering is **opt-in**, off by default. Toggling is per-engine via
`Engine::set_bloom_prefilter(true)`. Default behavior keeps the Phase 3
fast path so existing workloads see no regression. The trade-off is
documented on the toggle:
- The per-event probe (trigram extraction + double hashing) costs ~1 µs
on a typical CommandLine field.
- On rule sets where most events overlap with at least one needle, the
probe is pure overhead because every detection item still falls through
to the matcher.
- On substring-heavy rule sets paired with mostly-non-matching events
(e.g. high-volume telemetry vs an active threat-intel ruleset), the
bloom skips the matcher entirely on the rejection path.
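The trade-off can be made concrete with a deliberately simplified cost model. Everything here is an assumption for illustration: the function names, the idea that one probe is shared by all eligible items on a field, and the matcher cost, which this PR does not report. Only the ~1 µs probe cost comes from the text above.

```rust
/// Per-event cost with the bloom off: every eligible item runs its matcher.
fn cost_off_ns(matcher_ns: f64, eligible_items: u32) -> f64 {
    matcher_ns * eligible_items as f64
}

/// Per-event cost with the bloom on: one probe, then the matchers run only
/// for the fraction of events the bloom does not reject.
fn cost_on_ns(probe_ns: f64, matcher_ns: f64, eligible_items: u32, rejection_rate: f64) -> f64 {
    probe_ns + (1.0 - rejection_rate) * cost_off_ns(matcher_ns, eligible_items)
}

/// The bloom pays off when the matcher work it saves exceeds the probe cost.
fn bloom_pays_off(probe_ns: f64, matcher_ns: f64, eligible_items: u32, rejection_rate: f64) -> bool {
    cost_on_ns(probe_ns, matcher_ns, eligible_items, rejection_rate)
        < cost_off_ns(matcher_ns, eligible_items)
}
```

Under this model, a rejection rate of zero means the probe is pure overhead, while a high rejection rate across many eligible items amortizes the probe quickly. Real workloads should be benchmarked instead of relying on the model.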
Implementation:
- New `engine::bloom_index` module:
  - `FieldBloom` is a custom bit-array with AHash-based double hashing
    (`h_i = h1 + i*h2 mod m`, `h1`/`h2` derived from the same trigram
    fed to `AHasher` with two different prefix bytes). 16 bits per
    trigram floor so per-probe FPR stays well under 1% even at small N.
  - `FieldBloomIndex` collects positive substring needles per field
    via a tree walk that excludes anything inside `Not(...)` and skips
    `Exact` (already covered by the rule index). Per-field 64 KB cap,
    1 MB total budget; lowest-density fields are dropped first when the
    budget is exceeded.
  - `BloomLookup` trait abstracts the lookup so the eval functions
    stay generic. `NoBloom` is a zero-sized stub that always answers
    `MaybeMatch` and lets the compiler elide the bloom branch entirely
    in the public `evaluate_rule` path.
  - `BloomCache` memoizes the per-event verdict per field so each
    candidate rule pays at most one probe per shared field.
- `AhoCorasickSet` gained a `needles: Vec<String>` field so the bloom
builder can recover the pre-lowered patterns the optimizer collapsed
into the automaton.
- `CompiledDetectionItem` gained a `bloom_eligible: bool` precomputed
at compile time. The recursive `is_positive_substring_matcher` walk
used to be evaluated per event per item; precomputing keeps the
bloom-on hot path competitive on rule sets with many items.
- `Engine` exposes `set_bloom_prefilter` / `bloom_prefilter_enabled`.
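The `BloomLookup` / `NoBloom` / `BloomCache` split might look roughly like the sketch below. Only the type names and the `MaybeMatch` / `DefinitelyNoMatch` verdicts come from this PR. The trait signature, the `&mut self` receiver, the closure-based probe, the `probes` counter, and `eval_contains` are hypothetical.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Verdict {
    MaybeMatch,
    DefinitelyNoMatch,
}

/// Abstracts the lookup so evaluation code can stay generic.
trait BloomLookup {
    fn verdict(&mut self, field: &str, value: &str) -> Verdict;
}

/// Zero-sized stub: always `MaybeMatch`, so the rejection branch in a
/// monomorphized caller is dead code the optimizer can remove.
struct NoBloom;

impl BloomLookup for NoBloom {
    #[inline(always)]
    fn verdict(&mut self, _field: &str, _value: &str) -> Verdict {
        Verdict::MaybeMatch
    }
}

/// Memoizes the verdict per field for one event, so repeated lookups on
/// the same field cost a map hit instead of another probe.
struct BloomCache<F> {
    probe: F,
    seen: HashMap<String, Verdict>,
    probes: usize, // counts real probes, for illustration only
}

impl<F: Fn(&str, &str) -> Verdict> BloomCache<F> {
    fn new(probe: F) -> Self {
        BloomCache { probe, seen: HashMap::new(), probes: 0 }
    }
}

impl<F: Fn(&str, &str) -> Verdict> BloomLookup for BloomCache<F> {
    fn verdict(&mut self, field: &str, value: &str) -> Verdict {
        if let Some(&v) = self.seen.get(field) {
            return v;
        }
        self.probes += 1;
        let v = (self.probe)(field, value);
        self.seen.insert(field.to_owned(), v);
        v
    }
}

/// Generic over the lookup: a positive `Contains` item is skipped when the
/// bloom rules the field out, otherwise the real matcher runs.
fn eval_contains<B: BloomLookup>(bloom: &mut B, field: &str, value: &str, needle: &str) -> bool {
    if bloom.verdict(field, value) == Verdict::DefinitelyNoMatch {
        return false;
    }
    value.to_lowercase().contains(&needle.to_lowercase())
}
```

A fresh cache per event is assumed here, since the memoized verdict is keyed by field name only. Calling `eval_contains` twice on the same field with a `BloomCache` runs the underlying probe once.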
Tests:
- 10 new unit tests in `engine::bloom_index`: empty filter when no
positive substring rules, populates filter for Contains, negated
Contains contributes nothing, unrelated field falls through, skips
haystacks below `NGRAM_SIZE` and above `MAX_BLOOM_SCAN_BYTES`,
Aho-Corasick needles contribute, case-insensitive probe, memory
budget eviction.
- Two new regression tests:
  - `bloom_prefilter_preserves_match_results` asserts identical output
    with bloom on vs off across a 7-event corpus.
  - `bloom_prefilter_handles_condition_negation` covers
    `selection and not other` with a substring detection in `other` and
    a digit-only event that hits the `DefinitelyNoMatch` path.
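The negation case can be illustrated with a toy model. This is not the engine's code: `Event`, `toy_probe`, the `4688` event ID, and the hard-coded `powershell` needle are all made up for the example. The point it shows is that the bloom only ever turns a positive item into `false`, which is what the matcher would have returned anyway, so `not other` sees the same input either way.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Verdict {
    MaybeMatch,
    DefinitelyNoMatch,
}

/// Hypothetical event with one exact-match field and one substring field.
struct Event<'a> {
    event_id: &'a str,
    command_line: &'a str,
}

/// Stand-in for the bloom probe: rejects digit-only values. This is sound
/// only because the needle below is purely alphabetical.
fn toy_probe(value: &str) -> Verdict {
    if !value.is_empty() && value.bytes().all(|b| b.is_ascii_digit()) {
        Verdict::DefinitelyNoMatch
    } else {
        Verdict::MaybeMatch
    }
}

/// Evaluates `selection and not other`, where `other` is a positive
/// case-insensitive Contains item on the command line.
fn eval_rule(ev: &Event, bloom_on: bool) -> bool {
    let selection = ev.event_id == "4688";
    let other = if bloom_on && toy_probe(ev.command_line) == Verdict::DefinitelyNoMatch {
        false // short-circuit: the matcher could not have matched
    } else {
        ev.command_line.to_lowercase().contains("powershell")
    };
    selection && !other
}
```

For a digit-only command line, `other` is `false` on both paths, so the rule fires whenever `selection` holds, with the bloom on or off.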
Benchmarks:
- New `eval_bloom_rejection` group with both `default` (off) and
`bloom_prefilter` (on) variants, scaled across 100 / 500 / 1000 /
5000 substring-only rules vs 1000 digit-only events. Bloom on shows
+12% to +19% throughput at >=500 rules, flat at 100 rules where the
rule index alone is fast enough.
- General `eval_throughput` (default path) is within 4% of the
pre-Phase-4 baseline; bloom-eligible benchmarks
(`eval_contains_heavy`, `eval_regex_set_heavy`) are unchanged.
Dependency: adds `ahash = "0.8"` (already transitive via
`regex-automata`).
Summary
Adds a per-field bloom filter built at engine load time over the union of positive substring needles (`Contains` / `StartsWith` / `EndsWith` / `AhoCorasickSet`) across all rules. When enabled, `Engine::evaluate` probes the bloom for each event field and short-circuits any positive substring detection item whose field cannot possibly contain a needle trigram.

Pre-filtering is opt-in, off by default. Toggle per-engine via `Engine::set_bloom_prefilter(true)`.

Why opt-in
The per-event probe (trigram extraction + double hashing) costs ~1 µs on a typical CommandLine field. On rule sets where most events overlap with at least one needle, the probe is pure overhead because every detection item still falls through to the matcher. On substring-heavy rule sets paired with mostly-non-matching events (high-volume telemetry vs an active threat-intel ruleset), the bloom skips the matcher entirely on the rejection path. Ship the knob, document the trade-off, let users benchmark their corpus.
Implementation highlights
- `engine::bloom_index` module: custom bit-array bloom with AHash-based double hashing (`h_i = h1 + i*h2 mod m`). 16 bits per trigram floor keeps per-probe FPR well under 1% even at small N.
- Needle collection excludes `Not(...)` subtrees and skips `Exact` (rule index already covers it).
- `BloomLookup` trait abstracts the lookup so eval functions stay generic. `NoBloom` is a zero-sized stub the compiler elides. Default `evaluate_rule` doesn't reference any bloom types.
- `BloomCache` memoizes per-event verdicts per field so each candidate rule pays at most one probe per shared field.
- `AhoCorasickSet` gained `needles: Vec<String>` so the bloom builder can recover patterns the optimizer collapsed into the automaton.
- `CompiledDetectionItem.bloom_eligible: bool` is precomputed at compile time so the per-event hot path doesn't walk the matcher tree to decide eligibility.

Tests
- Unit tests in `engine::bloom_index` cover: empty filter when no positive substring rules, populates filter for `Contains`, negated contains contributes nothing, unrelated field falls through to `MaybeMatch`, skips haystacks below `NGRAM_SIZE` and above `MAX_BLOOM_SCAN_BYTES`, Aho-Corasick needles contribute, case-insensitive probe, memory-budget eviction.
- `bloom_prefilter_preserves_match_results` asserts identical output with bloom on vs off across a 7-event corpus including a digit-only event the bloom would reject.
- `bloom_prefilter_handles_condition_negation` covers `selection and not other` where `other` is a substring detection. The bloom rejects `other` for digit-only events and the negation flips the result to true; both paths agree.

All 444 workspace tests pass; clippy + fmt clean.
Benchmarks
New `eval_bloom_rejection` group exercises substring-only rules vs digit-only events that share zero trigrams with the alphabetical patterns.

Default-path benchmarks (bloom off) vs the pre-Phase-4 baseline:
- `eval_throughput`: within ±4% (no statistically significant change at 1k events; -3.6% at 10k, -3.9% at 100k events, within criterion's noise threshold).
- `eval_contains_heavy` and `eval_regex_set_heavy`: unchanged.

The bloom-on path is faster than bloom-off only when most candidate rules use substring matching AND most events don't share trigrams with the needles. Outside that regime, leaving bloom off is the right choice.
Dependency
Adds `ahash = "0.8"` (already transitive via `regex-automata`).

Test plan
- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo test --workspace`
- `eval_bloom_rejection` benchmark with both default and bloom_prefilter variants
- `eval_throughput` benchmark vs pre-Phase-4 baseline