perf(eval): incremental indexes for single-rule add paths #121
Merged
Adds `RuleIndex::append_rule(rule_idx, rule)` so the inverted index can be grown one rule at a time in time proportional to the rule's detection tree size, independent of the total rule count. This is the building block for an upcoming per-rule O(1) `Engine::add_rule` path; this commit ships only the data-structure change, no engine wiring or behaviour change.

`RuleIndex::build` becomes a thin wrapper that calls `append_rule` in order over the slice and then trusts `rules.len()` as the authoritative `rule_count` (this covers the edge case where the tail of the slice contributes no `(field, exact_value)` pairs and so does not bump the counter via the `max` path).

Tests:
- `test_append_rule_matches_build` builds two indexes (batched and one-rule-at-a-time) across a mix of indexable, multi-field, OR, keyword, and regex rules, then asserts identical candidate sets across five representative events.
- `test_append_rule_grows_rule_count` pins the monotonic `rule_count` contract and per-rule candidate retrieval after each append.
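To make the append/build relationship concrete, here is a minimal sketch of an inverted index grown one rule at a time. It is not the rsigma implementation: the `Rule` shape, the `by_pair` and `unindexed` fields, the `candidates` method, and the `build_incrementally` helper are all hypothetical names chosen for illustration. Only `append_rule`, `build`, and `rule_count` mirror names from the commit message.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a compiled rule: only its exact-match
/// (field, value) pairs matter for this sketch.
struct Rule {
    exact_pairs: Vec<(String, String)>,
}

#[derive(Debug, Default, PartialEq)]
struct RuleIndex {
    /// (field, exact value) -> indices of rules that mention that pair.
    by_pair: HashMap<(String, String), Vec<usize>>,
    /// Rules with no indexable pair (assumption: always candidates).
    unindexed: Vec<usize>,
    rule_count: usize,
}

impl RuleIndex {
    /// Work is proportional to the size of `rule`, never to the rule total.
    fn append_rule(&mut self, rule_idx: usize, rule: &Rule) {
        if rule.exact_pairs.is_empty() {
            self.unindexed.push(rule_idx);
        }
        for (field, value) in &rule.exact_pairs {
            self.by_pair
                .entry((field.clone(), value.clone()))
                .or_default()
                .push(rule_idx);
        }
        // This sketch bumps the counter unconditionally via `max`.
        self.rule_count = self.rule_count.max(rule_idx + 1);
    }

    /// Thin wrapper: append in order, then trust the slice length.
    fn build(rules: &[Rule]) -> Self {
        let mut index = Self::default();
        for (i, rule) in rules.iter().enumerate() {
            index.append_rule(i, rule);
        }
        index.rule_count = rules.len();
        index
    }

    /// Rules worth evaluating for an event given as (field, value) pairs.
    fn candidates(&self, event: &[(&str, &str)]) -> BTreeSet<usize> {
        let mut out: BTreeSet<usize> = self.unindexed.iter().copied().collect();
        for (field, value) in event {
            let key = (field.to_string(), value.to_string());
            if let Some(ids) = self.by_pair.get(&key) {
                out.extend(ids.iter().copied());
            }
        }
        out
    }
}

/// One-rule-at-a-time construction, as a caller adding rules singly would do.
fn build_incrementally(rules: &[Rule]) -> RuleIndex {
    let mut index = RuleIndex::default();
    for (i, rule) in rules.iter().enumerate() {
        index.append_rule(i, rule);
    }
    index
}
```

Under these assumptions, `RuleIndex::build(&rules)` and `build_incrementally(&rules)` produce equal indexes, which is the property the equivalence test above pins for the real data structure.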
Makes `Engine::add_rule` and `Engine::add_compiled_rule` amortized O(1) per call by folding new rules into the inverted and bloom indexes incrementally instead of rebuilding both indexes from scratch after every push. Bulk loads through `Engine::add_rules`, `Engine::extend_compiled_rules`, and `Engine::add_collection` still take the single-rebuild fast path they already had.

`FieldBloomIndex` gains:
- `append_rule(rule)` that inserts the rule's positive substring trigrams into existing per-field blooms and creates a tight new bloom on the fly for fields that did not have one yet. No false negatives. False positive rate drifts upward between rebuilds, capped by the doubling watermark below.
- `should_rebuild(rule_count)` returning whether the rule count has at least doubled past the last rebuild (with a 64-rule floor), a schedule that keeps the amortized per-rule cost O(1) while preventing FPR drift from running away.
- A `rebuild_baseline` field tracking the rule count at the most recent full rebuild; reset on every `build` / `build_with_budget`.

`Engine::index_append_last_rule` is the new wiring point. It calls `RuleIndex::append_rule` and `FieldBloomIndex::append_rule` on the just-pushed rule, then triggers a full `FieldBloomIndex` rebuild via the existing path when the watermark trips. Cross-rule AC (daachorse) has no incremental update story, so when `cross_rule_ac_enabled` is on this falls back to `Engine::rebuild_index`.

Trade-off: between bloom rebuilds, probes may answer `MaybeMatch` where the batched-rebuild path would answer `DefinitelyNoMatch`. Both verdicts are correct (`MaybeMatch` is always safe); the engine just evaluates the rule directly instead of short-circuiting.

Tests:
- `append_rule_matches_build_verdicts`: positive verdicts match batched for needles from the appended rules; disjoint haystacks remain rejected under batched and at worst conservative (`MaybeMatch`) under incremental.
- `append_rule_creates_filter_for_new_field`: rules that introduce a brand-new indexed field get a fresh bloom immediately, with the new field's trigrams probable on the first event.
- `should_rebuild_follows_doubling_watermark`: pins the 64-rule floor and the 2x watermark schedule.
- `test_add_rule_loop_scales_linearly_on_large_corpus` mirrors the existing batched-load scaling guard for the per-rule entry point, asserting 2000 consecutive `add_rule` calls complete inside a coarse 30s ceiling.

`crates/rsigma-eval/README.md` API table updated to reflect the new amortized O(1) characteristic of the single-rule add methods.
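The doubling-watermark schedule described in this commit can be sketched as follows. This is a toy model under stated assumptions, not the rsigma code: the `Bloom` type, its sizing and hashing, the `Rule` shape, the `trigrams` helper, the `probe` method, and the free `add_rule` function are hypothetical. The threshold `max(64, 2 * rebuild_baseline)` is one plausible reading of "doubled, with a 64-rule floor". Needles are assumed to be at least three bytes long.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

const REBUILD_FLOOR: usize = 64; // the 64-rule floor from the description

#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict { MaybeMatch, DefinitelyNoMatch }

/// Hypothetical rule shape: (field, positive substring needle) pairs.
struct Rule { needles: Vec<(String, String)> }

/// Toy bloom filter: three probes per item over a fixed bit vector.
struct Bloom { bits: Vec<bool> }

impl Bloom {
    fn sized_for(items: usize) -> Self {
        Bloom { bits: vec![false; (items * 16).max(64)] }
    }
    fn slot(&self, item: &[u8], probe: u64) -> usize {
        let mut hasher = DefaultHasher::new();
        (probe, item).hash(&mut hasher);
        (hasher.finish() % self.bits.len() as u64) as usize
    }
    fn insert(&mut self, item: &[u8]) {
        for probe in 0..3 {
            let slot = self.slot(item, probe);
            self.bits[slot] = true;
        }
    }
    fn contains(&self, item: &[u8]) -> bool {
        (0..3).all(|probe| self.bits[self.slot(item, probe)])
    }
}

fn trigrams(s: &str) -> std::slice::Windows<'_, u8> {
    s.as_bytes().windows(3)
}

struct FieldBloomIndex {
    per_field: HashMap<String, Bloom>,
    rebuild_baseline: usize,
}

impl FieldBloomIndex {
    /// Full rebuild: size each field's bloom for all its trigrams, reset baseline.
    fn build(rules: &[Rule]) -> Self {
        let mut sizes: HashMap<String, usize> = HashMap::new();
        for (field, needle) in rules.iter().flat_map(|r| &r.needles) {
            *sizes.entry(field.clone()).or_default() += trigrams(needle).count();
        }
        let per_field = sizes.into_iter().map(|(f, n)| (f, Bloom::sized_for(n))).collect();
        let mut index = FieldBloomIndex { per_field, rebuild_baseline: rules.len() };
        for rule in rules {
            index.append_rule(rule);
        }
        index
    }
    /// Incremental path: reuse an existing bloom or create a tight one for a new field.
    fn append_rule(&mut self, rule: &Rule) {
        for (field, needle) in &rule.needles {
            let bloom = self
                .per_field
                .entry(field.clone())
                .or_insert_with(|| Bloom::sized_for(trigrams(needle).count()));
            for t in trigrams(needle) {
                bloom.insert(t);
            }
        }
    }
    /// Assumed threshold: at least double the baseline, never below the floor.
    fn should_rebuild(&self, rule_count: usize) -> bool {
        rule_count >= (self.rebuild_baseline * 2).max(REBUILD_FLOOR)
    }
    /// No haystack trigram in the bloom means no needle can be a substring.
    fn probe(&self, field: &str, haystack: &str) -> Verdict {
        match self.per_field.get(field) {
            Some(bloom) if trigrams(haystack).any(|t| bloom.contains(t)) => Verdict::MaybeMatch,
            _ => Verdict::DefinitelyNoMatch,
        }
    }
}

/// Hypothetical wiring: append incrementally, rebuild only when the
/// watermark trips. Returns true when a full rebuild happened.
fn add_rule(rules: &mut Vec<Rule>, bloom: &mut FieldBloomIndex, rule: Rule) -> bool {
    bloom.append_rule(&rule);
    rules.push(rule);
    let rebuild = bloom.should_rebuild(rules.len());
    if rebuild {
        *bloom = FieldBloomIndex::build(rules.as_slice());
    }
    rebuild
}
```

With this threshold, 2000 consecutive `add_rule` calls trigger full rebuilds at 64, 128, 256, 512, and 1024 rules. Those five rebuilds touch 1984 rules in total, fewer than twice the number of rules added, which is where the amortized O(1) per-call cost comes from. Because appended trigrams are only ever added to a bloom, a haystack containing an appended needle always yields `MaybeMatch`; only the false positive rate can differ from a freshly built index.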
This was referenced May 16, 2026
SecurityEnthusiast pushed a commit to SecurityEnthusiast/rsigma that referenced this pull request on May 17, 2026
Replaces the placeholder Unreleased section with a full release-notes draft following the format of the v0.11.0 / v0.10.0 / v0.9.0 entries.

Covers every PR merged to main since v0.11.0:
- Daemon and CLI observability (PR timescale#107) - tower-http access logs, per-request OTLP tracing, batch spans, source resolution spans, DLQ visibility, NATS/sink lifecycle, correlation eviction warnings, rule load diagnostics, daemon lifecycle, global `--log-format` flag.
- Eval rule loading performance (PRs timescale#119, timescale#121, timescale#122, timescale#123) - batched loaders rebuild indexes once per batch via `Engine::add_rules` / `extend_compiled_rules` / `add_collection`; single-rule path amortized O(1) via `RuleIndex::append_rule` and a doubling-watermark `FieldBloomIndex`. SigmaHQ corpus (~3,120 rules) now loads in ~120 ms.
- CLI command groups (PR timescale#124) - the noun-led `engine` / `rule` / `backend` / `pipeline` / `attack` grouping with the existing migration table preserved verbatim.
- Test reliability (PRs timescale#115, timescale#123) - cli_daemon_http and cli_daemon_otlp E2E suites de-flaked on macOS under load; eval bloom test made deterministic against random AHash seeds.
- Dependency and CI bumps.

All command-name references within the draft already use the new noun-led paths (`engine eval`, `rule validate`, etc.) so the next release ships with consistent terminology throughout the notes.
SecurityEnthusiast pushed a commit to SecurityEnthusiast/rsigma that referenced this pull request on May 20, 2026
The "operability, performance, and documentation" release.

* Workspace bumped 0.11.0 -> 0.12.0; all 10 inter-crate dep pins refreshed; Cargo.lock regenerated under --locked.
* CHANGELOG.md [Unreleased] section flipped to [0.12.0] - 2026-05-19; comparison link updated to v0.11.0...v0.12.0; tag reference added to the bottom-of-file link block.
* CHANGELOG also gained a Documentation site (PR timescale#129) section under the existing observability / eval-perf / CLI-groups / test-reliability / dependencies headings, and the TL;DR theme moved from "operations and load performance" to "operability, performance, and documentation" to reflect the new docs site as a top-line deliverable.

Covers all 13 PRs merged since v0.11.0: timescale#107 (observability), timescale#111/timescale#113/timescale#114/timescale#120 (dependency batches), timescale#115/timescale#123 (test reliability), timescale#119/timescale#121/timescale#122/timescale#123 (eval rule loading perf), timescale#124 (CLI command groups), timescale#127 (CLI docs followup), timescale#129 (documentation site).
Summary
`Engine::add_rule` and `Engine::add_compiled_rule` were rebuilding the inverted index and per-field bloom filter from scratch on every call, which made a tight loop of single-rule adds O(N²) in the total rule count. This PR makes both paths amortized O(1) by folding new rules into the indexes incrementally and only re-running the full bloom rebuild when a doubling watermark trips.

Batched loads (`add_rules`, `extend_compiled_rules`, `add_collection`) keep their existing one-rebuild-at-end fast path: for batches, a single full rebuild over the aggregate is still cheaper than N incremental appends.

Changes

`rsigma-eval/src/rule_index.rs`
- `RuleIndex::append_rule(rule_idx, rule)` that folds a single rule's `(field, exact_value)` pairs into the index in time proportional to the rule's detection tree size, independent of the total rule count.
- `RuleIndex::build` becomes a thin wrapper that calls `append_rule` in order, then trusts `rules.len()` as the authoritative `rule_count` for the edge case where the tail of the slice contributes no pairs.

`rsigma-eval/src/engine/bloom_index.rs`
- `FieldBloomIndex::append_rule(rule)` that inserts the rule's positive substring trigrams into existing per-field blooms and creates a tightly sized fresh bloom for fields that did not have one yet. No false negatives. FPR can drift upward between rebuilds, capped by the watermark below.
- `FieldBloomIndex::should_rebuild(rule_count)` returning whether the rule count has at least doubled since the last full rebuild (with a 64-rule floor). This schedule keeps per-rule cost amortized O(1) while preventing FPR drift from running away.
- `rebuild_baseline` field tracking the rule count at the most recent full rebuild; reset on `build` / `build_with_budget`.
- `count_unique_trigrams` helper shared by the batched and incremental paths.

`rsigma-eval/src/engine/mod.rs`
- `Engine::index_append_last_rule` that folds the just-pushed rule into both indexes incrementally, then triggers a full bloom rebuild via the existing `build` / `build_with_budget` path when the watermark trips. Cross-rule AC (daachorse) has no incremental update story, so when `cross_rule_ac_enabled` is on this method falls back to `Engine::rebuild_index`.
- `Engine::add_rule` and `Engine::add_compiled_rule` now call `index_append_last_rule` instead of `rebuild_index`. Behaviour preserved across all engine entry points.

`rsigma-eval/README.md`
- API table updated to reflect the amortized O(1) cost of `add_rule` and `add_compiled_rule`.

Correctness

Probe semantics are unchanged. Between bloom rebuilds the incremental index can answer `MaybeMatch` where the batched-rebuild path would answer `DefinitelyNoMatch`, but both verdicts are safe: `MaybeMatch` just means the engine evaluates the rule directly instead of short-circuiting. There are no false negatives.

Tests

- `RuleIndex::tests::test_append_rule_matches_build` and `test_append_rule_grows_rule_count` pin equivalence between batched and incremental paths plus the monotonic `rule_count` contract.
- `bloom_index::tests::append_rule_matches_build_verdicts` confirms positive verdicts match across both paths and disjoint haystacks remain rejected under batched and at worst conservative (`MaybeMatch`) under incremental.
- `bloom_index::tests::append_rule_creates_filter_for_new_field` verifies a new field's trigrams are probable immediately after the first append.
- `bloom_index::tests::should_rebuild_follows_doubling_watermark` pins the 64-rule floor and 2x doubling schedule.
- `engine::tests::test_add_rule_loop_scales_linearly_on_large_corpus` mirrors the existing batched-load scaling guard for the single-rule entry point: 2000 consecutive `add_rule` calls must complete inside a 30s ceiling.

Test plan

- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets --all-features -- -D warnings`
- `cargo test --workspace --all-features` (442 in rsigma-eval lib, full workspace green)
- `rsigma validate ../sigma/rules/` against the full SigmaHQ corpus (3120 rules) in release: 1.0s wall.

References