
feat(eval): add opt-in bloom filter pre-filtering for substring matchers #102

Merged
mostafa merged 1 commit into main from feat/bloom-filter-prefilter
May 13, 2026

mostafa commented May 13, 2026

Summary

Adds a per-field bloom filter built at engine load time over the union of positive substring needles (Contains / StartsWith / EndsWith / AhoCorasickSet) across all rules. When enabled, Engine::evaluate probes the bloom for each event field and short-circuits any positive substring detection item whose field cannot possibly contain a needle trigram.

Pre-filtering is opt-in, off by default. Toggle per-engine via Engine::set_bloom_prefilter(true).
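The rejection idea can be sketched as follows. This is a minimal illustration, not the engine's code: `TrigramSet`, `Verdict`, and `probe` are hypothetical names, an exact `HashSet` stands in for the bloom filter, and the handling of short needles and short haystacks (always answering `MaybeMatch`) is an assumption.

```rust
use std::collections::HashSet;

/// Window size; the PR mentions trigrams and an `NGRAM_SIZE` constant.
const NGRAM_SIZE: usize = 3;

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Verdict {
    /// No needle can occur in the haystack, so the matcher can be skipped.
    DefinitelyNoMatch,
    /// The filter cannot rule a match out; run the real matcher.
    MaybeMatch,
}

/// Hypothetical stand-in for a per-field filter: an exact set of needle trigrams.
struct TrigramSet {
    grams: HashSet<[u8; NGRAM_SIZE]>,
    /// False when some needle is shorter than a trigram. Rejecting would then
    /// be unsound, so every probe answers `MaybeMatch` (an assumption).
    usable: bool,
}

/// ASCII-lowercased byte trigrams of `s`.
fn trigrams(s: &str) -> impl Iterator<Item = [u8; NGRAM_SIZE]> + '_ {
    s.as_bytes()
        .windows(NGRAM_SIZE)
        .map(|w| [w[0], w[1], w[2]].map(|b| b.to_ascii_lowercase()))
}

impl TrigramSet {
    fn build(needles: &[&str]) -> Self {
        let usable = needles.iter().all(|n| n.len() >= NGRAM_SIZE);
        let grams = needles.iter().flat_map(|n| trigrams(n)).collect();
        Self { grams, usable }
    }

    fn probe(&self, haystack: &str) -> Verdict {
        if !self.usable || haystack.len() < NGRAM_SIZE {
            return Verdict::MaybeMatch;
        }
        // A haystack that contains a needle contains all of that needle's
        // trigrams, so sharing no trigram at all rules out every needle.
        if trigrams(haystack).any(|g| self.grams.contains(&g)) {
            Verdict::MaybeMatch
        } else {
            Verdict::DefinitelyNoMatch
        }
    }
}
```

With needles `powershell` and `mimikatz`, a digit-only value such as `1234567890` is rejected without running any matcher, while a value containing `PowerShell` falls through. A real bloom filter can additionally answer `MaybeMatch` on a false positive, which costs time but never changes results.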

Why opt-in

The per-event probe (trigram extraction + double hashing) costs ~1 µs on a typical CommandLine field. On rule sets where most events overlap with at least one needle, the probe is pure overhead because every detection item still falls through to the matcher. On substring-heavy rule sets paired with mostly-non-matching events (high-volume telemetry vs an active threat-intel ruleset), the bloom skips the matcher entirely on the rejection path. Ship the knob, document the trade-off, let users benchmark their corpus.

Implementation highlights

  • engine::bloom_index module: custom bit-array bloom with AHash-based double hashing (h_i = (h1 + i*h2) mod m). A floor of 16 bits per trigram keeps the per-probe FPR well under 1% even at small N.
  • Pattern collection walks the compiled matcher tree, excludes Not(...) subtrees, and skips Exact (rule index already covers it).
  • BloomLookup trait abstracts the lookup so eval functions stay generic. NoBloom is a zero-sized stub the compiler elides. Default evaluate_rule doesn't reference any bloom types.
  • BloomCache memoizes per-event verdicts per field so each candidate rule pays at most one probe per shared field.
  • AhoCorasickSet gained needles: Vec<String> so the bloom builder can recover patterns the optimizer collapsed into the automaton.
  • CompiledDetectionItem.bloom_eligible: bool precomputed at compile time so the per-event hot path doesn't walk the matcher tree to decide eligibility.
  • Memory budget: per-field 64 KB cap, 1 MB total; lowest-density fields drop first when budget exceeded.
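For readers unfamiliar with double hashing, here is a minimal sketch of a bit-array bloom filter in that style. It is not the `engine::bloom_index` code: `BitBloom` and `estimated_fpr` are hypothetical names, the standard library's `DefaultHasher` stands in for AHash so the example has no dependencies, and the probe count `k` is a free parameter because the PR does not state it.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Minimal bit-array bloom filter (illustrative only).
struct BitBloom {
    bits: Vec<u64>,
    m: u64, // number of bits in the array
    k: u32, // probes per item (assumed; not stated in the PR)
}

impl BitBloom {
    /// Sizes the array at 16 bits per expected item, rounded up to whole words.
    fn with_capacity(expected_items: usize, k: u32) -> Self {
        let words = (expected_items.max(1) * 16 + 63) / 64;
        Self { bits: vec![0; words], m: words as u64 * 64, k }
    }

    /// Hashes `item` twice, separating the two streams with a prefix byte.
    fn hash_pair(item: &[u8]) -> (u64, u64) {
        let hash_with = |prefix: u8| {
            let mut h = DefaultHasher::new();
            h.write_u8(prefix);
            h.write(item);
            h.finish()
        };
        // Forcing h2 odd keeps the stride non-zero.
        (hash_with(0), hash_with(1) | 1)
    }

    /// Yields the k bit positions h_i = (h1 + i*h2) mod m.
    fn positions(&self, item: &[u8]) -> impl Iterator<Item = u64> + '_ {
        let (h1, h2) = Self::hash_pair(item);
        (0..self.k as u64).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % self.m)
    }

    fn insert(&mut self, item: &[u8]) {
        let positions: Vec<u64> = self.positions(item).collect();
        for p in positions {
            self.bits[(p / 64) as usize] |= 1u64 << (p % 64);
        }
    }

    /// True if every probed bit is set; false positives are possible,
    /// false negatives are not.
    fn contains(&self, item: &[u8]) -> bool {
        self.positions(item)
            .all(|p| (self.bits[(p / 64) as usize] & (1u64 << (p % 64))) != 0)
    }
}

/// Textbook estimate (1 - e^(-k/b))^k for b bits per item and k probes.
fn estimated_fpr(bits_per_item: f64, k: u32) -> f64 {
    (1.0 - (-(k as f64) / bits_per_item).exp()).powi(k as i32)
}
```

Under the textbook estimate, 16 bits per item with `k = 4` gives a false-positive rate of roughly 0.24%, which is consistent with the "well under 1%" figure above. The actual rate depends on the `k` the module really uses.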

Tests

  • 10 unit tests in engine::bloom_index cover: empty filter when no positive substring rules, populates filter for Contains, negated contains contributes nothing, unrelated field falls through to MaybeMatch, skips haystacks below NGRAM_SIZE and above MAX_BLOOM_SCAN_BYTES, Aho-Corasick needles contribute, case-insensitive probe, memory-budget eviction.
  • Two new regression tests:
    • bloom_prefilter_preserves_match_results asserts identical output with bloom on vs off across a 7-event corpus including a digit-only event the bloom would reject.
    • bloom_prefilter_handles_condition_negation covers selection and not other where other is a substring detection. The bloom rejects other for digit-only events, the negation flips the result to true; both paths agree.
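The property these regression tests check can be sketched in a few lines. Everything here is hypothetical (`Event`, `prefilter_rejects`, `contains_item`, `rule_matches`, and the sample rule are made up for illustration), and an exact trigram comparison stands in for the bloom probe. The point is that the filter only replaces a matcher result that would have been `false` anyway, so a negation applied afterwards at the condition level sees the same input either way.

```rust
/// Hypothetical two-field event.
struct Event<'a> {
    event_id: u32,
    command_line: &'a str,
}

const NGRAM_SIZE: usize = 3;

/// Stand-in for the bloom verdict: true when no trigram of `needle`
/// occurs in `haystack`, which means `needle` cannot occur either.
fn prefilter_rejects(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.to_ascii_lowercase(), needle.to_ascii_lowercase());
    if h.len() < NGRAM_SIZE || n.len() < NGRAM_SIZE {
        return false; // too short to judge; defer to the matcher
    }
    !n.as_bytes()
        .windows(NGRAM_SIZE)
        .any(|gram| h.as_bytes().windows(NGRAM_SIZE).any(|w| w == gram))
}

/// Case-insensitive `Contains` item with an optional prefilter.
fn contains_item(haystack: &str, needle: &str, prefilter: bool) -> bool {
    if prefilter && prefilter_rejects(haystack, needle) {
        return false; // same answer the matcher would give, reached sooner
    }
    haystack
        .to_ascii_lowercase()
        .contains(&needle.to_ascii_lowercase())
}

/// Condition `selection and not other`; the negation is applied to the
/// item result, after any prefilter short-circuit.
fn rule_matches(event: &Event, prefilter: bool) -> bool {
    let selection = event.event_id == 1;
    let other = contains_item(event.command_line, "powershell", prefilter);
    selection && !other
}
```

For a digit-only command line with `event_id == 1`, the prefilter rejects `other`, the negation flips it to `true`, and the rule matches with the prefilter on or off.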

All 444 workspace tests pass; clippy + fmt clean.

Benchmarks

New eval_bloom_rejection group exercises substring-only rules vs digit-only events that share zero trigrams with the alphabetical patterns.

| Rules | Default (bloom off) | Bloom on | Δ throughput |
|---|---|---|---|
| 100 | 170 Kelem/s | 169 Kelem/s | ~0% (rule index is fast enough) |
| 500 | 25.9 Kelem/s | 29.0 Kelem/s | +12% |
| 1000 | 12.5 Kelem/s | 14.5 Kelem/s | +16% |
| 5000 | 2.5 Kelem/s | 2.97 Kelem/s | +19% |

Default-path benchmarks (bloom off) vs the pre-Phase-4 baseline:

  • eval_throughput: within ±4% (no statistically significant change at 1k events; -3.6% at 10k and -3.9% at 100k events, both within criterion's noise threshold).
  • eval_contains_heavy and eval_regex_set_heavy: unchanged.

The bloom-on path is faster than bloom-off only when most candidate rules use substring matching AND most events don't share trigrams with the needles. Outside that regime, leaving bloom off is the right choice.

Dependency

Adds ahash = "0.8" (already transitive via regex-automata).

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets --all-features -- -D warnings
  • cargo test --workspace
  • eval_bloom_rejection benchmark with both default and bloom_prefilter variants
  • eval_throughput benchmark vs pre-Phase-4 baseline
  • SigmaHQ corpus regression (run after merge)

Adds a per-field bloom filter built at engine load time over the union of
positive substring needles (Contains / StartsWith / EndsWith /
AhoCorasickSet) across all rules. When enabled, `Engine::evaluate` probes
the bloom for each event field and short-circuits any positive substring
detection item whose field cannot possibly contain a needle trigram.

Pre-filtering is **opt-in**, off by default. Toggling is per-engine via
`Engine::set_bloom_prefilter(true)`. Default behavior keeps the Phase 3
fast path so existing workloads see no regression. The trade-off is
documented on the toggle:

- The per-event probe (trigram extraction + double hashing) costs ~1 µs
  on a typical CommandLine field.
- On rule sets where most events overlap with at least one needle, the
  probe is pure overhead because every detection item still falls through
  to the matcher.
- On substring-heavy rule sets paired with mostly-non-matching events
  (e.g. high-volume telemetry vs an active threat-intel ruleset), the
  bloom skips the matcher entirely on the rejection path.

Implementation:

- New `engine::bloom_index` module:
  - `FieldBloom` is a custom bit-array with AHash-based double hashing
    (`h_i = h1 + i*h2 mod m`, `h1`/`h2` derived from the same trigram
    fed to `AHasher` with two different prefix bytes). 16 bits per
    trigram floor so per-probe FPR stays well under 1% even at small N.
  - `FieldBloomIndex` collects positive substring needles per field
    via a tree walk that excludes anything inside `Not(...)` and skips
    `Exact` (already covered by the rule index). Per-field 64 KB cap,
    1 MB total budget; lowest-density fields are dropped first when the
    budget is exceeded.
  - `BloomLookup` trait abstracts the lookup so the eval functions
    stay generic. `NoBloom` is a zero-sized stub that always answers
    `MaybeMatch` and lets the compiler elide the bloom branch entirely
    in the public `evaluate_rule` path.
  - `BloomCache` memoizes the per-event verdict per field so each
    candidate rule pays at most one probe per shared field.
- `AhoCorasickSet` gained a `needles: Vec<String>` field so the bloom
  builder can recover the pre-lowered patterns the optimizer collapsed
  into the automaton.
- `CompiledDetectionItem` gained a `bloom_eligible: bool` precomputed
  at compile time. The recursive `is_positive_substring_matcher` walk
  used to be evaluated per event per item; precomputing keeps the
  bloom-on hot path competitive on rule sets with many items.
- `Engine` exposes `set_bloom_prefilter` / `bloom_prefilter_enabled`.
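
The lookup abstraction and the per-event cache described above might look roughly like the sketch below. The names `BloomLookup`, `NoBloom`, `BloomCache`, and `MaybeMatch` come from this commit, but the signatures, the `Verdict` enum, the closure-backed cache, and `eval_contains` are assumptions made for illustration; the real definitions may differ.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Verdict {
    DefinitelyNoMatch,
    MaybeMatch,
}

/// Assumed shape of the lookup abstraction.
trait BloomLookup {
    fn probe(&mut self, field: &str, value: &str) -> Verdict;
}

/// Zero-sized stub: always defers to the matcher.
struct NoBloom;

impl BloomLookup for NoBloom {
    #[inline]
    fn probe(&mut self, _field: &str, _value: &str) -> Verdict {
        Verdict::MaybeMatch
    }
}

/// Memoizes one verdict per field for the duration of a single event.
/// `inner` is a hypothetical stand-in for the real per-field bloom probe.
struct BloomCache<F: FnMut(&str, &str) -> Verdict> {
    inner: F,
    verdicts: HashMap<String, Verdict>,
    probes: usize, // counts real probes, for demonstration only
}

impl<F: FnMut(&str, &str) -> Verdict> BloomCache<F> {
    fn new(inner: F) -> Self {
        Self { inner, verdicts: HashMap::new(), probes: 0 }
    }
}

impl<F: FnMut(&str, &str) -> Verdict> BloomLookup for BloomCache<F> {
    fn probe(&mut self, field: &str, value: &str) -> Verdict {
        if let Some(v) = self.verdicts.get(field) {
            return *v; // later rules sharing this field reuse the verdict
        }
        self.probes += 1;
        let v = (self.inner)(field, value);
        self.verdicts.insert(field.to_owned(), v);
        v
    }
}

/// Generic over the lookup; with `NoBloom` the match arm is statically
/// known, so the compiler can reduce this to the plain matcher call.
fn eval_contains<B: BloomLookup>(bloom: &mut B, field: &str, value: &str, needle: &str) -> bool {
    match bloom.probe(field, value) {
        Verdict::DefinitelyNoMatch => false,
        Verdict::MaybeMatch => value.contains(needle),
    }
}
```

Two detection items that read the same field trigger a single probe through the cache, and the `NoBloom` instantiation carries no state at all (`size_of::<NoBloom>()` is 0).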

Tests:
- 10 new unit tests in `engine::bloom_index`: empty filter when no
  positive substring rules, populates filter for Contains, negated
  Contains contributes nothing, unrelated field falls through, skips
  haystacks below `NGRAM_SIZE` and above `MAX_BLOOM_SCAN_BYTES`,
  Aho-Corasick needles contribute, case-insensitive probe, memory
  budget eviction.
- Two new regression tests:
  `bloom_prefilter_preserves_match_results` asserts identical output
  with bloom on vs off across a 7-event corpus.
  `bloom_prefilter_handles_condition_negation` covers
  `selection and not other` with a substring detection in `other` and
  a digit-only event that hits the `DefinitelyNoMatch` path.

Benchmarks:

- New `eval_bloom_rejection` group with both `default` (off) and
  `bloom_prefilter` (on) variants, scaled across 100 / 500 / 1000 /
  5000 substring-only rules vs 1000 digit-only events. Bloom on shows
  +12% to +19% throughput at >=500 rules, flat at 100 rules where the
  rule index alone is fast enough.
- General `eval_throughput` (default path) is within 4% of the
  pre-Phase-4 baseline; bloom-eligible benchmarks
  (`eval_contains_heavy`, `eval_regex_set_heavy`) are unchanged.

Dependency: adds `ahash = "0.8"` (already transitive via
`regex-automata`).
@mostafa mostafa merged commit 6901a42 into main May 13, 2026
11 checks passed
@mostafa mostafa deleted the feat/bloom-filter-prefilter branch May 13, 2026 17:46