feat: field observability for engine daemon and engine eval (--observe-fields) #149
Merged
Move FieldCollector out of the rsigma rule fields CLI subcommand into a new rsigma_eval::fields module with a public RuleFieldSet API. Both the CLI command and the upcoming daemon field-observability endpoints will read from the same extractor so the offline view matches what the engine references at runtime. No behaviour change for rsigma rule fields.
Refresh an Arc<RuleFieldSet> on every successful load_rules() and expose it on both RuntimeEngine and LogProcessor. The daemon's upcoming /api/v1/fields/* endpoints will snapshot the set without taking the engine lock; ArcSwap keeps readers wait-free across hot reloads.
Add a non-breaking field_keys method to the Event trait, defaulted to walking to_json() so external Event implementations keep compiling. JsonEvent overrides with a zero-copy recursive walk that yields dot-joined paths (depth-capped at MAX_NESTING_DEPTH) and emits intermediate object names alongside their leaves so callers can inspect coverage at any nesting level. KvEvent and MapEvent forward their flat keys; PlainEvent returns an empty Vec because the synthetic _raw envelope is not a real field. Powers the upcoming FieldObserver without changing the detection hot path: only the opt-in daemon mode iterates field_keys.
Capped, lock-light observer that tallies per-field counts on the daemon's hot path. Wraps a parking_lot::Mutex<HashMap<String, u64>>; the mutex is held only long enough to bump a counter or insert a new key. Once max_keys is reached new fields are dropped and the overflow_dropped counter advances, while existing counters keep updating so high-frequency fields stay visible on a saturated observer. Exposes snapshot() (sorted by descending count) and reset(); both will be called by the upcoming /api/v1/fields/* HTTP endpoints.
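A minimal sketch of the capped tally described above, not the shipped FieldObserver. It assumes std::sync::Mutex as a stand-in for parking_lot::Mutex so the block compiles without external crates; `CappedObserver`, `Inner`, and `observe` are hypothetical names, and the tie-break by field name in `snapshot` is an assumption added for deterministic output.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical stand-in for the observer; names are illustrative only.
pub struct CappedObserver {
    max_keys: usize,
    inner: Mutex<Inner>,
}

struct Inner {
    counts: HashMap<String, u64>,
    overflow_dropped: u64,
}

impl CappedObserver {
    pub fn new(max_keys: usize) -> Self {
        Self {
            max_keys,
            inner: Mutex::new(Inner { counts: HashMap::new(), overflow_dropped: 0 }),
        }
    }

    /// Bump an existing counter, insert a new key while under the cap,
    /// or record an overflow once the cap is reached.
    pub fn observe(&self, field: &str) {
        let mut guard = self.inner.lock().expect("observer lock poisoned");
        let inner = &mut *guard;
        if let Some(count) = inner.counts.get_mut(field) {
            // Existing keys keep counting even on a saturated observer.
            *count = count.saturating_add(1);
        } else if inner.counts.len() < self.max_keys {
            inner.counts.insert(field.to_owned(), 1);
        } else {
            inner.overflow_dropped = inner.overflow_dropped.saturating_add(1);
        }
    }

    /// Entries sorted by descending count (ties by name), plus the overflow tally.
    pub fn snapshot(&self) -> (Vec<(String, u64)>, u64) {
        let guard = self.inner.lock().expect("observer lock poisoned");
        let mut entries: Vec<(String, u64)> =
            guard.counts.iter().map(|(k, v)| (k.clone(), *v)).collect();
        entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        (entries, guard.overflow_dropped)
    }
}
```

With a cap of 2, a third distinct field is dropped and counted as overflow while the two tracked fields keep accumulating.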
…task Add --observe-fields (off by default) and --observe-fields-max-keys (default 10000) to rsigma engine daemon. When enabled, the daemon constructs an Arc<FieldObserver>, attaches it to the LogProcessor via set_field_observer, and stashes a handle on AppState for the upcoming /api/v1/fields/* routes. LogProcessor.process_batch_with_format checks the ArcSwap once per batch and skips iteration entirely when the observer is None, so the hot path stays untouched in the default configuration. Three new Prometheus surfaces refreshed on every /metrics scrape: rsigma_fields_observed_total, rsigma_fields_observer_unique_keys, and rsigma_fields_observer_overflow_dropped_total. EventInputDecoded forwards field_keys to its inner variants so JsonEvent's zero-copy walk is preserved through the runtime adapter.
… reset

Add four field-observability HTTP endpoints to the daemon:

GET    /api/v1/fields           snapshot with summary + unknown + missing
GET    /api/v1/fields/unknown   event keys not referenced by any rule
GET    /api/v1/fields/missing   rule keys never observed in events
DELETE /api/v1/fields/observer  clear counters and report what was cleared

Each list endpoint takes ?limit (default 100, cap 1000) and ?offset for deterministic pagination; the response includes total and next_offset. Missing entries surface up to 10 rule titles with a truncated flag so a field touched by hundreds of rules doesn't blow up the payload. All four return 503 with a clear hint when the daemon was started without --observe-fields. The /metrics scrape and every successful endpoint call invoke update_field_observer_metrics so the Prometheus gauges stay in sync with snapshots regardless of which surface the operator polls.
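The limit/offset contract above can be sketched as follows. This is an illustration under stated assumptions, not the daemon's handler: `Page`, `paginate`, and the two constants are hypothetical names, and representing an exhausted list as `next_offset = None` is an assumption about the response shape.

```rust
/// Hypothetical page type; field names mirror the response keys named above.
pub struct Page<T> {
    pub items: Vec<T>,
    pub total: usize,
    pub next_offset: Option<usize>,
}

pub const DEFAULT_LIMIT: usize = 100; // default from the commit message
pub const MAX_LIMIT: usize = 1000; // cap from the commit message

/// Clamp `limit` to the cap, clamp `offset` to the list length, and
/// move the requested window out of `all`.
pub fn paginate<T>(mut all: Vec<T>, limit: Option<usize>, offset: Option<usize>) -> Page<T> {
    let total = all.len();
    let limit = limit.unwrap_or(DEFAULT_LIMIT).min(MAX_LIMIT);
    let start = offset.unwrap_or(0).min(total);
    let end = start.saturating_add(limit).min(total);
    let items: Vec<T> = all.drain(start..end).collect();
    // Assumption: None signals that no further page exists.
    let next_offset = if end < total { Some(end) } else { None };
    Page { items, total, next_offset }
}
```

Because the input is sorted before it is paged, repeating a request with the returned next_offset walks the list deterministically.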
Add cli_daemon_fields_observer.rs covering eight scenarios on the field-observability surface:
- 503 returns when --observe-fields is not set
- /unknown surfaces event fields no rule references
- /missing surfaces rule fields never observed in events
- / (full) reports summary, unknown, missing, and intersection
- DELETE clears counters and reports previous_keys + previous_events
- --observe-fields-max-keys 2 saturates and records overflow_dropped
- /unknown?limit=&offset= paginates with next_offset
- /metrics exposes the three new rsigma_fields_* metrics

Refactor http_get and add http_delete in tests/common to return the status + body on 4xx/5xx instead of panicking (using ureq's http_status_as_error(false)), so the 503 path can be asserted by its JSON error body. Add DaemonProcess::spawn_http_with_args so tests can opt into --observe-fields without copying the scaffolding flags.
- CHANGELOG: new top-of-Unreleased section bundling the two flags, the four HTTP endpoints, the three Prometheus surfaces, and the design notes (default-off, ArcSwap load skip when disabled, parking_lot::Mutex<HashMap> backing).
- docs/cli/engine/daemon.md: new "Field observability (advanced)" subsection with both flags and cross-references.
- docs/reference/http-api.md: four rows added to the endpoint summary table and a full "Field observability" section with snapshot, gap, broken-coverage, and reset payload examples.
- docs/reference/metrics.md: new "Field observability (3 metrics)" section and updated catalogue count from 27 to 30.
- docs/guide/observability.md: new "Detection coverage with --observe-fields" section walking the operator through the gap/broken-coverage workflow; metrics-catalog cross-reference updated to 30.

mkdocs build --strict clean; cargo clippy + fmt + test gates all green (1478 tests across 45 suites).
- Root README: gains a "Field observability" feature bullet alongside the existing TLS and streaming items.
- rsigma-cli README: two new daemon flag rows (--observe-fields and --observe-fields-max-keys) sit next to the bloom flags they share the "advanced, opt-in" framing with; four new endpoint rows (/api/v1/fields, /unknown, /missing, /observer) added to the daemon HTTP endpoint table with the gap/broken-coverage descriptions and the "requires --observe-fields" note.
- rsigma-runtime README: FieldObserver added to the Features list with the LogProcessor::set_field_observer wiring path called out.
- rsigma-eval README: new "Rule field extraction (fields module)" subsection under Public API listing RuleFieldSet, FieldOrigin, and FieldSource; Event Model section gains an Event::field_keys bullet covering the default impl and JsonEvent override.
Pre-fix, update_field_observer_metrics computed the Prometheus counter
delta from snapshot.events_observed minus the prom counter's current
value. After DELETE /api/v1/fields/observer, snapshot.events_observed
resets to 0 while the Prometheus counter sits at its previous lifetime
value, so every subsequent observation was silently dropped from the
counter until the post-reset count climbed past the old lifetime value.
Same problem on overflow_dropped.
Fix by tracking two pairs of atomics on FieldObserver:
- events_observed / overflow_dropped: reset to 0 on reset(); drive
the "since-last-reset" view exposed by the HTTP API.
- lifetime_events_observed / lifetime_overflow_dropped: monotonic,
never reset; drive the Prometheus counters via the bridge.
FieldObservation grows two fields for the lifetime totals. The metrics
bridge in rsigma-cli now reads from those. Add a
lifetime_counters_survive_reset regression test that asserts a reset
between observations does not desync the counters.
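The two-counter fix can be sketched as below. This is a simplified illustration, assuming a single events counter pair and a hypothetical `counter_delta` helper standing in for the metrics bridge; the names `Counters`, `record_event`, and `counter_delta` are not taken from the code base.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical pair of counters: one resettable, one monotonic.
#[derive(Default)]
pub struct Counters {
    events_observed: AtomicU64,
    lifetime_events_observed: AtomicU64,
}

impl Counters {
    /// Every observation advances both views.
    pub fn record_event(&self) {
        self.events_observed.fetch_add(1, Ordering::Relaxed);
        self.lifetime_events_observed.fetch_add(1, Ordering::Relaxed);
    }

    /// Only the since-last-reset view is cleared.
    pub fn reset(&self) {
        self.events_observed.store(0, Ordering::Relaxed);
    }

    pub fn since_reset(&self) -> u64 {
        self.events_observed.load(Ordering::Relaxed)
    }

    pub fn lifetime(&self) -> u64 {
        self.lifetime_events_observed.load(Ordering::Relaxed)
    }
}

/// Amount to add to an exporter-side monotonic counter that currently
/// reads `exported`, given the observer-side total `source_total`.
pub fn counter_delta(source_total: u64, exported: u64) -> u64 {
    source_total.saturating_sub(exported)
}
```

Feeding `since_reset()` into `counter_delta` reproduces the pre-fix bug: after a reset the delta stays at zero until the new count passes the old exported value. Feeding `lifetime()` yields the correct increment.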
While here, three small cleanups surfaced by the same review pass:
- Switch the field-observer hot-path read in
LogProcessor::process_batch_with_format from
ArcSwap::load_full (clones an Arc on every batch) to
ArcSwap::load (returns a Guard, no Arc clone). Saves one
Arc clone (an atomic refcount round-trip, not a heap allocation)
per batch in the default --observe-fields=off configuration.
- Collapse the two-pass walk over snapshot.entries in the
/api/v1/fields handler into a single pass that partitions
unknown vs intersection while building the `seen` set.
- Fix the docstrings that claimed "zero-allocation" / "zero-copy"
for the field_keys walk; in practice the dot-joined leaf paths
are not substrings of the source value and require one String
allocation per leaf. KvEvent's override returns Cow::Borrowed
(truly zero-copy); JsonEvent's allocates per leaf.
Tests: 1479/1479 across 45 suites, including the new regression test
and the existing 8-test cli_daemon_fields_observer integration suite.
Clippy + fmt clean.
Address every item from the second-pass review:

1. paginate move semantics. Take Vec<T> by value and drain the page out instead of cloning the slice. Saves up to `total - limit` clones of serde_json::Value per /api/v1/fields/* call (worst case ~900 clones per request at the default limit of 100 against a saturated 1000-element list). The total is preserved by returning it alongside the page so callers don't need to capture it before the move.

2. JsonEvent::field_keys emits leaves only. Previously a nested object contributed both its own path (`actor`) and its leaves (`actor.id`), which caused operators to see `actor` in the gap signal even when `actor.id` was rule-referenced. Sigma rules nearly always reference leaves via dot-notation, so emitting the intermediate was a net source of false positives. The default Event::field_keys impl (which walks to_json()) is updated in lockstep so the two paths stay consistent. Tests rewritten to match.

3. FieldObserver keys are Arc<str> instead of String. Snapshotting 10 000 keys now costs 10 000 atomic refcount increments plus one Vec allocation, not 10 000 String clones. FieldObservationEntry.field changed from String to Arc<str> (technically a public API change, but the only consumer is the daemon HTTP handler which reads via &*field). The trade-off is one extra allocation per first-time insertion (Arc header) in exchange for cheap repeated snapshots, which is the right side of the trade because /metrics is scraped every 15-30 s while distinct field insertion happens at most once per observation window per unique key.

4. Default Event::field_keys impl cleaned up. The helper now threads Vec<String> internally instead of Vec<Cow<'static, str>>, then wraps into Cow::Owned at the end. Same semantics, far less cognitive overhead at the trait boundary.

Tests: 1480/1480 across 45 suites (one new regression test from this commit covering deeply-nested leaves-only behaviour). Clippy + fmt clean.
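The leaves-only walk from item 2 can be sketched with a toy value type. This is not JsonEvent's implementation: `Value`, `leaf_paths`, and `walk` are hypothetical, arrays are omitted, and treating an object at the depth cap as a leaf is an assumption. The cap of 64 follows the figure quoted for the depth limit elsewhere in this PR.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a JSON value (scalars and objects only).
pub enum Value {
    Scalar(String),
    Object(BTreeMap<String, Value>),
}

pub const MAX_NESTING_DEPTH: usize = 64;

/// Collect dot-joined paths for leaves only; intermediate object names
/// such as `actor` are never emitted on their own.
pub fn leaf_paths(root: &Value) -> Vec<String> {
    let mut out = Vec::new();
    if let Value::Object(map) = root {
        for (key, child) in map {
            walk(key.clone(), child, 1, &mut out);
        }
    }
    out
}

fn walk(path: String, value: &Value, depth: usize, out: &mut Vec<String>) {
    match value {
        Value::Object(map) if depth < MAX_NESTING_DEPTH => {
            for (key, child) in map {
                // One String allocation per joined path segment.
                walk(format!("{path}.{key}"), child, depth + 1, out);
            }
        }
        // Scalars are leaves. Assumption: an object at the depth cap is
        // reported as a leaf instead of being descended into.
        _ => out.push(path),
    }
}
```

For an event with a top-level `action` and a nested `actor` object holding `id` and `name`, the walk yields `action`, `actor.id`, and `actor.name`, and never `actor` alone.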
The field-coverage surface that ships behind --observe-fields on the
daemon is now available on the one-shot evaluator too. That uncovered
accidental coupling: FieldObserver lived in rsigma-runtime (daemon
feature) but only needs the Event trait (rsigma-eval). Move it.
Relocation:
- FieldObserver, FieldObservation, FieldObservationEntry move from
rsigma-runtime/src/field_observer.rs into
rsigma-eval/src/field_observer.rs (no API changes).
- Mutex backing switched from parking_lot::Mutex to std::sync::Mutex
to keep rsigma-eval dependency-light per the workspace constraint.
Lock-poisoning is treated as a programmer bug (the locked region
only does HashMap operations and saturating arithmetic, neither of
which can panic), so the lock() calls .expect(...) consistently.
- rsigma-runtime re-exports the three types via `pub use rsigma_eval::{...}`
so existing consumers (LogProcessor::set_field_observer, the
daemon HTTP handlers, downstream library users) keep compiling
against rsigma_runtime::FieldObserver unchanged.
engine eval mirror (no longer behind the daemon feature):
--observe-fields                 (off; enables observation)
--observe-fields-max-keys <N>    (default 10000; cap)
--observe-fields-report <PATH>   (writes JSON; defaults to stderr when omitted so stdout NDJSON detections stay machine-consumable)
The report has the same shape as GET /api/v1/fields so the same jq
queries work against either runtime. observe_event(ctx, &event) is
threaded through cmd_eval_with_correlations, cmd_eval_detection_only,
eval_stream_corr/detect, eval_line_corr/corr_json/detect/detect_json,
and eval_evtx_corr/detect; it inlines to a single null-check when the
context is None, so the disabled path costs essentially nothing.
Docs: new "Field observability (offline coverage report)" section in
docs/cli/engine/eval.md; an eval-side walkthrough added to
docs/guide/observability.md alongside the daemon one; CHANGELOG note
covering both the offline report and the relocation/re-export;
rsigma-cli README gets three new flag rows; rsigma-eval README gains a
field_observer module subsection; rsigma-runtime README rewrites its
FieldObserver bullet to point at the new home via the re-export.
Tests: 6 new cli_eval integration tests (full report shape, unknown
fields surfaced, missing fields surfaced, --observe-fields-max-keys
overflow accounting, stderr default when no path, off-by-default
silence). 1486/1486 across 45 suites. Clippy + fmt + mkdocs --strict
clean.
Post-commit review surfaced ~40 lines of duplicated coverage-join
logic between the daemon's GET /api/v1/fields* handlers
(crates/rsigma-cli/src/daemon/server.rs) and the engine eval
render_field_report (crates/rsigma-cli/src/commands/eval.rs): same
single-pass partition over snapshot.entries into unknown vs
intersection, same missing-fields computation against the rule field
set, same `seen` HashSet. The two surfaces could drift on field
semantics with no compile-time check.
Centralize via FieldObservation::coverage(&RuleFieldSet) -> FieldCoverage
in rsigma-eval. Returns borrowed views (Vec<&FieldObservationEntry> for
unknown, Vec<(&str, &FieldOrigin)> for missing) so consumers still own
JSON rendering and pagination but the join itself lives in one place.
Re-exported from rsigma-runtime alongside the other field-observer
types.
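A rough sketch of the shared join, assuming simplified stand-ins: `Entry` plays the role of FieldObservationEntry, and a map from field name to rule titles plays the role of RuleFieldSet and FieldOrigin. The free function `coverage` and the `Coverage` struct are hypothetical and only illustrate the single-pass partition with borrowed views.

```rust
use std::collections::{BTreeMap, HashSet};

/// Stand-in for one observed field and its count.
pub struct Entry {
    pub field: String,
    pub count: u64,
}

/// Borrowed views over the inputs; callers keep rendering and pagination.
pub struct Coverage<'a> {
    pub unknown: Vec<&'a Entry>,
    pub intersection: Vec<&'a Entry>,
    pub missing: Vec<(&'a str, &'a Vec<String>)>,
}

/// Partition observed entries into unknown vs intersection in one pass,
/// then list rule fields that were never observed.
pub fn coverage<'a>(
    entries: &'a [Entry],
    rule_fields: &'a BTreeMap<String, Vec<String>>,
) -> Coverage<'a> {
    let mut seen: HashSet<&str> = HashSet::with_capacity(entries.len());
    let mut unknown = Vec::new();
    let mut intersection = Vec::new();
    for entry in entries {
        seen.insert(entry.field.as_str());
        if rule_fields.contains_key(&entry.field) {
            intersection.push(entry);
        } else {
            unknown.push(entry);
        }
    }
    let missing = rule_fields
        .iter()
        .filter(|(field, _)| !seen.contains(field.as_str()))
        .map(|(field, titles)| (field.as_str(), titles))
        .collect();
    Coverage { unknown, intersection, missing }
}
```

Both partitions preserve the order of `entries`, so a snapshot sorted by descending count stays sorted in the unknown and intersection views.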
Daemon handlers (fields_full, fields_unknown, fields_missing) now call
.coverage(); the now-unused missing_fields free function is removed.
eval's render_field_report does the same.
Two clap polish items the review also flagged:
- --observe-fields-report PATH now `requires = "observe_fields"`, so
supplying the report path without the enabling flag fails with a
clear clap error instead of silently producing no report.
- --observe-fields-max-keys is now NonZeroUsize, so 0 is rejected at
parse time (it would have made every observation overflow with no
useful tracking).
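The zero rejection rests on standard-library behaviour: parsing "0" into NonZeroUsize fails. The sketch below shows only that std behaviour; `parse_max_keys` is a hypothetical helper, the error text is invented for illustration, and how clap wires the type into the flag is not shown here.

```rust
use std::num::NonZeroUsize;

/// Hypothetical parser for the cap value; rejects 0 and non-numeric input.
pub fn parse_max_keys(raw: &str) -> Result<NonZeroUsize, String> {
    raw.parse::<NonZeroUsize>()
        .map_err(|e| format!("invalid value '{raw}' for --observe-fields-max-keys: {e}"))
}
```

A value such as "10000" parses to a non-zero cap, while "0" and "abc" both return an error before any observer is constructed.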
Three new unit tests in field_observer cover the coverage helper
(partition correctness, empty-observer-yields-only-missing,
ordering-preserved). Two new cli_eval integration tests cover the
clap rejections. 1491/1491 across 45 suites. Clippy + fmt clean.
Post-refactor sweep over the docs/readmes/changelog that had drifted
from the actual code:
CHANGELOG.md:
- "parking_lot::Mutex" → "std::sync::Mutex" (the move to rsigma-eval
swapped the backing to keep that crate dependency-light)
- "zero-allocation for JSON" → accurate "one String per leaf path,
depth-capped at 64" wording; KvEvent's Cow::Borrowed override
called out explicitly
- "a few hundred KB of String keys" → Arc<str> keys (the
snapshot-cost optimization that already shipped)
- new "Shared join primitive" paragraph for FieldObservation::coverage
- NonZeroUsize validation and clap-requires behavior on the eval
flags surfaced explicitly
- FieldCoverage added to the rsigma-runtime re-export list
Root README:
- Field observability bullet now mentions both runtimes (daemon
HTTP + eval one-shot JSON report) rather than only the daemon
endpoints, matching the shipped feature set
rsigma-eval README:
- FieldObservation::coverage(&RuleFieldSet) row added to the
field_observer table so contributors find the shared join
primitive when reading the crate API
rsigma-runtime README:
- FieldCoverage added to the re-export list
mkdocs build --strict clean.
Summary
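Adds an opt-in field-observability surface that exposes two halves of detection coverage:
- Unknown fields (the gap signal): event keys not referenced by any rule.
- Missing fields (the broken-coverage signal): rule keys never observed in events.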
RSigma owns both rule parsing and event ingestion, so this join is uniquely cheap to deliver. The same surface is now available on both runtimes:
- rsigma engine daemon: live, exposed over HTTP. Off by default; zero overhead unless an operator passes --observe-fields.
- rsigma engine eval: one-shot, emitted as a JSON report at end-of-run. Off by default; same JSON shape as the daemon endpoint so a single jq query works against either runtime (ideal for CI gap analysis).

Engine daemon: new flags
- --observe-fields
- --observe-fields-max-keys <N> (default 10000)

Engine daemon: new HTTP endpoints
- GET /api/v1/fields
- GET /api/v1/fields/unknown
- GET /api/v1/fields/missing
- DELETE /api/v1/fields/observer

Each list endpoint supports ?limit=&offset= with total and next_offset. All four return 503 with a hint when the daemon was started without --observe-fields.

Engine daemon: new Prometheus metrics
- rsigma_fields_observed_total (counter)
- rsigma_fields_observer_unique_keys (gauge)
- rsigma_fields_observer_overflow_dropped_total (counter)

Refreshed on every /metrics scrape and after every successful /api/v1/fields/* call. Bridged from the observer's lifetime counters (not the resettable ones) so DELETE /api/v1/fields/observer cannot desync the monotonic counters.

Engine eval: new flags
- --observe-fields
- --observe-fields-max-keys <N> (default 10000): NonZeroUsize; clap rejects 0 at parse time.
- --observe-fields-report <PATH>: requires = observe-fields so the typo case fails fast. Defaults to stderr when omitted so detections on stdout stay machine-consumable.
Architecture
To keep engine eval decoupled from the daemon Cargo feature, FieldObserver and the rule-field extraction primitives live in rsigma-eval (the dependency-light core crate that every consumer already links):

- rsigma_eval::field_observer::{FieldObserver, FieldObservation, FieldObservationEntry, FieldCoverage}: the observer and its snapshot type.
- rsigma_eval::fields::{RuleFieldSet, FieldOrigin, FieldSource}: extracted from the CLI's rsigma rule fields so the offline view matches what the engine references at runtime.
- FieldObservation::coverage(&RuleFieldSet) -> FieldCoverage: the gap/intersection/missing join. Both runtimes consume this so the partition semantics cannot drift.

rsigma-eval stays dependency-light: the observer's Mutex<HashMap<Arc<str>, u64>> uses std::sync::Mutex rather than adding parking_lot. Lock-poisoning is treated as a programmer bug because the locked region only does HashMap operations and saturating arithmetic, neither of which can panic. rsigma-runtime keeps pub use rsigma_eval::{FieldObserver, …} re-exports so existing imports compile unchanged.

Test plan
- cargo fmt --all -- --check clean
- cargo clippy --workspace --all-targets --all-features -- -D warnings clean
- cargo test --workspace --all-features 1491/1491 across 45 suites
- mkdocs build --strict clean
- cli_daemon_fields_observer.rs covers 8 daemon scenarios: 503 when disabled, gap signal, broken-coverage signal, full snapshot, DELETE reset, cap overflow, pagination, /metrics surface
- cli_eval adds 8 eval scenarios: full report shape, unknown surfaced, missing surfaced, --observe-fields-max-keys overflow accounting, stderr default when no path, off-by-default silence, clap requires rejection, NonZeroUsize rejection
- FieldObserver unit tests cover cap enforcement, lifetime-vs-resettable counters, deterministic ordering, plain-event no-op
- FieldObservation::coverage unit tests cover partition correctness, empty-observer-only-missing, and snapshot ordering preservation
- cli_fields tests still pass after the FieldCollector move