
feat: field observability for engine daemon and engine eval (--observe-fields) #149

Merged
mostafa merged 15 commits into main from feat/field-observer on May 25, 2026

Conversation

@mostafa mostafa commented May 25, 2026

Summary

Adds an opt-in field-observability surface that exposes two halves of detection coverage:

  • Gap signal: event fields no loaded rule references.
  • Broken-coverage signal: rule fields that have never appeared in an event.

RSigma owns both rule parsing and event ingestion, so this join is uniquely cheap to deliver. The same surface is now available on both runtimes:

  • rsigma engine daemon — live, exposed over HTTP. Off by default; zero overhead unless an operator passes --observe-fields.
  • rsigma engine eval — one-shot, emitted as a JSON report at end-of-run. Off by default; same JSON shape as the daemon endpoint so a single jq query works against either runtime (ideal for CI gap analysis).

Engine daemon: new flags

| Flag | Default | Purpose |
| --- | --- | --- |
| `--observe-fields` | off | Enable the observer. |
| `--observe-fields-max-keys <N>` | 10000 | Hard ceiling on tracked field names. Overflow drops are counted. |

Engine daemon: new HTTP endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | `/api/v1/fields` | One-shot summary + unknown + missing. |
| GET | `/api/v1/fields/unknown` | Event fields no rule references. |
| GET | `/api/v1/fields/missing` | Rule fields never observed in events. |
| DELETE | `/api/v1/fields/observer` | Reset counters. |

Each list endpoint supports ?limit=&offset= with total and next_offset. All four return 503 with a hint when the daemon was started without --observe-fields.

Engine daemon: new Prometheus metrics

  • rsigma_fields_observed_total (counter)
  • rsigma_fields_observer_unique_keys (gauge)
  • rsigma_fields_observer_overflow_dropped_total (counter)

All three are refreshed on every /metrics scrape and after every successful /api/v1/fields/* call. They are bridged from the observer's lifetime counters (not the resettable ones), so DELETE /api/v1/fields/observer cannot desync the monotonic counters.

Engine eval: new flags

| Flag | Default | Purpose |
| --- | --- | --- |
| `--observe-fields` | off | Enable observation and emit a JSON report at end-of-run. |
| `--observe-fields-max-keys <N>` | 10000 | NonZeroUsize; clap rejects 0 at parse time. |
| `--observe-fields-report <PATH>` | unset (stderr) | Where to write the report. requires = observe-fields so the typo case fails fast. Defaults to stderr when omitted so detections on stdout stay machine-consumable. |

Example:

# CI: NDJSON detections to stdout, logs to stderr, report in its own file
rsigma engine eval -r rules/ -e @events.ndjson \
    --observe-fields \
    --observe-fields-report coverage.json

jq '.summary | {events_observed, unknown_count, missing_count}' coverage.json

Architecture

To keep engine eval decoupled from the daemon Cargo feature, FieldObserver and the rule-field extraction primitives live in rsigma-eval (the dependency-light core crate that every consumer already links):

  • rsigma_eval::field_observer::{FieldObserver, FieldObservation, FieldObservationEntry, FieldCoverage} — the observer and its snapshot type.
  • rsigma_eval::fields::{RuleFieldSet, FieldOrigin, FieldSource} — extracted from the CLI's rsigma rule fields so the offline view matches what the engine references at runtime.
  • FieldObservation::coverage(&RuleFieldSet) -> FieldCoverage — the gap/intersection/missing join. Both runtimes consume this so the partition semantics cannot drift.
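
The join that FieldObservation::coverage performs can be pictured with a minimal, std-only sketch. The `Entry` and `Coverage` types and the `BTreeMap` of rule titles below are hypothetical stand-ins, not the actual rsigma_eval types; only the three-way partition (unknown, intersection, missing) and the borrowed return shape follow the description in this PR.

```rust
use std::collections::{BTreeMap, HashSet};

/// One observed event field and how many events carried it (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub field: String,
    pub count: u64,
}

/// Borrowed result of joining observed fields against rule fields.
#[derive(Debug, PartialEq)]
pub struct Coverage<'a> {
    /// Seen in events, referenced by no rule (gap signal).
    pub unknown: Vec<&'a Entry>,
    /// Seen in events and referenced by at least one rule.
    pub intersection: Vec<&'a Entry>,
    /// Referenced by rules, never seen in an event (broken-coverage signal).
    pub missing: Vec<(&'a str, &'a [String])>,
}

/// One pass over the snapshot, then one pass over the rule fields.
/// `rule_fields` maps a field name to the titles of the rules that use it.
pub fn coverage<'a>(
    entries: &'a [Entry],
    rule_fields: &'a BTreeMap<String, Vec<String>>,
) -> Coverage<'a> {
    let mut seen: HashSet<&str> = HashSet::with_capacity(entries.len());
    let mut unknown = Vec::new();
    let mut intersection = Vec::new();
    for entry in entries {
        seen.insert(entry.field.as_str());
        if rule_fields.contains_key(&entry.field) {
            intersection.push(entry);
        } else {
            unknown.push(entry);
        }
    }
    // Anything the rules mention that never showed up in `seen` is missing.
    let missing = rule_fields
        .iter()
        .filter(|(name, _)| !seen.contains(name.as_str()))
        .map(|(name, titles)| (name.as_str(), titles.as_slice()))
        .collect();
    Coverage { unknown, intersection, missing }
}
```

Because the result only borrows from its inputs, a caller can still paginate and render JSON however it likes without the join itself being duplicated.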

rsigma-eval stays dependency-light: the observer's Mutex<HashMap<Arc<str>, u64>> uses std::sync::Mutex rather than adding parking_lot. Lock-poisoning is treated as a programmer bug because the locked region only does HashMap operations and saturating arithmetic, neither of which can panic.
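
As a rough illustration of the design described above, here is a std-only sketch of a capped counter backed by Mutex<HashMap<Arc<str>, u64>>. `CappedObserver` and its method names are hypothetical and do not mirror the real FieldObserver API; the cap, the overflow counter, the saturating arithmetic, and the `.expect(...)` on the lock are the parts taken from this PR.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical capped counter; names are illustrative, not rsigma's API.
pub struct CappedObserver {
    counts: Mutex<HashMap<Arc<str>, u64>>,
    max_keys: usize,
    overflow_dropped: AtomicU64,
}

impl CappedObserver {
    pub fn new(max_keys: usize) -> Self {
        Self {
            counts: Mutex::new(HashMap::new()),
            max_keys,
            overflow_dropped: AtomicU64::new(0),
        }
    }

    /// Bump a known key, insert a new one while below the cap, else count a drop.
    pub fn observe(&self, field: &str) {
        let mut counts = self
            .counts
            .lock()
            .expect("observer mutex poisoned: the locked region must not panic");
        if let Some(count) = counts.get_mut(field) {
            // Existing keys keep counting even when the map is saturated.
            *count = count.saturating_add(1);
            return;
        }
        if counts.len() < self.max_keys {
            counts.insert(Arc::from(field), 1);
        } else {
            self.overflow_dropped.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Snapshot sorted by descending count, ties broken by field name.
    /// Cloning an `Arc<str>` bumps a refcount instead of copying the string.
    pub fn snapshot(&self) -> Vec<(Arc<str>, u64)> {
        let counts = self.counts.lock().expect("observer mutex poisoned");
        let mut entries: Vec<(Arc<str>, u64)> =
            counts.iter().map(|(k, v)| (Arc::clone(k), *v)).collect();
        drop(counts); // release the lock before sorting
        entries.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        entries
    }

    pub fn overflow_dropped(&self) -> u64 {
        self.overflow_dropped.load(Ordering::Relaxed)
    }
}
```

The tie-break on field name is an assumption made here to keep the ordering deterministic; the PR only states that ordering is deterministic and by descending count.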

rsigma-runtime keeps pub use rsigma_eval::{FieldObserver, …} re-exports so existing imports compile unchanged.

Test plan

  • cargo fmt --all -- --check clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings clean
  • cargo test --workspace --all-features 1491/1491 across 45 suites
  • mkdocs build --strict clean
  • cli_daemon_fields_observer.rs covers 8 daemon scenarios: 503 when disabled, gap signal, broken-coverage signal, full snapshot, DELETE reset, cap overflow, pagination, /metrics surface
  • cli_eval adds 8 eval scenarios: full report shape, unknown surfaced, missing surfaced, --observe-fields-max-keys overflow accounting, stderr default when no path, off-by-default silence, clap requires rejection, NonZeroUsize rejection
  • FieldObserver unit tests cover cap enforcement, lifetime-vs-resettable counters, deterministic ordering, plain-event no-op
  • FieldObservation::coverage unit tests cover partition correctness, empty-observer-only-missing, and snapshot ordering preservation
  • Existing 16 cli_fields tests still pass after the FieldCollector move

mostafa added 10 commits May 25, 2026 21:50
Move FieldCollector out of the rsigma rule fields CLI subcommand into a
new rsigma_eval::fields module with a public RuleFieldSet API. Both the
CLI command and the upcoming daemon field-observability endpoints will
read from the same extractor so the offline view matches what the
engine references at runtime. No behaviour change for rsigma rule fields.

Refresh an Arc<RuleFieldSet> on every successful load_rules() and expose
it on both RuntimeEngine and LogProcessor. The daemon's upcoming
/api/v1/fields/* endpoints will snapshot the set without taking
the engine lock; ArcSwap keeps readers wait-free across hot reloads.

Add a non-breaking field_keys method to the Event trait, defaulted to
walking to_json() so external Event implementations keep compiling.
JsonEvent overrides with a zero-copy recursive walk that yields
dot-joined paths (depth-capped at MAX_NESTING_DEPTH) and emits
intermediate object names alongside their leaves so callers can
inspect coverage at any nesting level. KvEvent and MapEvent forward
their flat keys; PlainEvent returns an empty Vec because the synthetic
_raw envelope is not a real field.

Powers the upcoming FieldObserver without changing the detection
hot path: only the opt-in daemon mode iterates field_keys.
Capped, lock-light observer that tallies per-field counts on the
daemon's hot path. Wraps a parking_lot::Mutex<HashMap<String, u64>>;
the mutex is held only long enough to bump a counter or insert a new
key. Once max_keys is reached new fields are dropped and the
overflow_dropped counter advances, while existing counters keep
updating so high-frequency fields stay visible on a saturated
observer.

Exposes snapshot() (sorted by descending count) and reset(); both will
be called by the upcoming /api/v1/fields/* HTTP endpoints.
…task

Add --observe-fields (off by default) and --observe-fields-max-keys
(default 10000) to rsigma engine daemon. When enabled, the daemon
constructs an Arc<FieldObserver>, attaches it to the LogProcessor via
set_field_observer, and stashes a handle on AppState for the upcoming
/api/v1/fields/* routes. LogProcessor.process_batch_with_format
checks the ArcSwap once per batch and skips iteration entirely when the
observer is None, so the hot path stays untouched in the default
configuration.

Three new Prometheus surfaces refreshed on every /metrics scrape:
rsigma_fields_observed_total, rsigma_fields_observer_unique_keys, and
rsigma_fields_observer_overflow_dropped_total. EventInputDecoded
forwards field_keys to its inner variants so JsonEvent's zero-copy
walk is preserved through the runtime adapter.
… reset

Add four field-observability HTTP endpoints to the daemon:

  GET    /api/v1/fields           snapshot with summary + unknown + missing
  GET    /api/v1/fields/unknown   event keys not referenced by any rule
  GET    /api/v1/fields/missing   rule keys never observed in events
  DELETE /api/v1/fields/observer  clear counters and report what was cleared

Each list endpoint takes ?limit (default 100, cap 1000) and ?offset for
deterministic pagination; the response includes total and next_offset.
Missing entries surface up to 10 rule titles with a truncated flag so a
field touched by hundreds of rules doesn't blow up the payload.
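
A minimal sketch of the pagination and title-capping behaviour described above, under stated assumptions: `Page`, `paginate`, and `cap_titles` are hypothetical names, `next_offset` is modelled as an `Option`, and raising a limit of 0 to 1 is a choice made here rather than something the PR specifies. The default of 100, the cap of 1000, and the 10-title ceiling with a truncated flag come from the text.

```rust
/// One page plus the bookkeeping the response carries.
#[derive(Debug, PartialEq)]
pub struct Page<T> {
    pub items: Vec<T>,
    pub total: usize,
    pub next_offset: Option<usize>,
}

pub const DEFAULT_LIMIT: usize = 100;
pub const MAX_LIMIT: usize = 1000;

/// Takes the list by value and moves the requested window out of it.
pub fn paginate<T>(mut items: Vec<T>, limit: Option<usize>, offset: usize) -> Page<T> {
    let total = items.len();
    // Assumption: a limit of 0 is raised to 1 so every page makes progress.
    let limit = limit.unwrap_or(DEFAULT_LIMIT).clamp(1, MAX_LIMIT);
    let start = offset.min(total);
    let end = start.saturating_add(limit).min(total);
    let page: Vec<T> = items.drain(start..end).collect();
    let next_offset = if end < total { Some(end) } else { None };
    Page { items: page, total, next_offset }
}

/// Keeps at most `max` rule titles and reports whether any were cut.
pub fn cap_titles(mut titles: Vec<String>, max: usize) -> (Vec<String>, bool) {
    let truncated = titles.len() > max;
    titles.truncate(max);
    (titles, truncated)
}
```

A client would keep requesting with `offset = next_offset` until `next_offset` is absent.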

All four return 503 with a clear hint when the daemon was started
without --observe-fields. The /metrics scrape and every successful
endpoint call both invoke update_field_observer_metrics, so the
Prometheus metrics stay in sync with snapshots regardless of which
surface the operator polls.
Add cli_daemon_fields_observer.rs covering eight scenarios on the field-observability surface:

  - 503 returns when --observe-fields is not set
  - /unknown surfaces event fields no rule references
  - /missing surfaces rule fields never observed in events
  - / (full) reports summary, unknown, missing, and intersection
  - DELETE clears counters and reports previous_keys + previous_events
  - --observe-fields-max-keys 2 saturates and records overflow_dropped
  - /unknown?limit=&offset= paginates with next_offset
  - /metrics exposes the three new rsigma_fields_* metrics (two counters, one gauge)

Refactor http_get and add http_delete in tests/common to return the
status + body on 4xx/5xx instead of panicking (using ureq's
http_status_as_error(false)), so the 503 path can be asserted by its
JSON error body. Add DaemonProcess::spawn_http_with_args so tests can
opt into --observe-fields without copying the scaffolding flags.
CHANGELOG: new top-of-Unreleased section bundling the two flags, the
four HTTP endpoints, the three Prometheus surfaces, and the design
notes (default-off, ArcSwap load skip when disabled,
parking_lot::Mutex<HashMap> backing).

docs/cli/engine/daemon.md: new "Field observability (advanced)"
subsection with both flags and cross-references.

docs/reference/http-api.md: four rows added to the endpoint summary
table and a full "Field observability" section with snapshot, gap,
broken-coverage, and reset payload examples.

docs/reference/metrics.md: new "Field observability (3 metrics)"
section and updated catalogue count from 27 to 30.

docs/guide/observability.md: new "Detection coverage with
--observe-fields" section walking the operator through the
gap/broken-coverage workflow; metrics-catalog cross-reference updated
to 30.

mkdocs build --strict clean; cargo clippy + fmt + test gates all
green (1478 tests across 45 suites).
Root README gains a "Field observability" feature bullet alongside the
existing TLS and streaming items.

rsigma-cli README: two new daemon flag rows (--observe-fields and
--observe-fields-max-keys) sit next to the bloom flags they share the
"advanced, opt-in" framing with; four new endpoint rows
(/api/v1/fields, /unknown, /missing, /observer) added to the daemon
HTTP endpoint table with the gap/broken-coverage descriptions and the
"requires --observe-fields" note.

rsigma-runtime README: FieldObserver added to the Features list with
the LogProcessor::set_field_observer wiring path called out.

rsigma-eval README: new "Rule field extraction (fields module)"
subsection under Public API listing RuleFieldSet, FieldOrigin, and
FieldSource; Event Model section gains an Event::field_keys bullet
covering the default impl and JsonEvent override.
The CHANGELOG follows the existing convention of pointing at the
GitHub PR/issue that ships the change (e.g. TLS PR #146, detached
sources PR #140). Roadmap item numbers stay in the plan files; the
release-facing surface uses the PR ID so readers can click through to
the diff and review thread.
mostafa added a commit that referenced this pull request May 25, 2026
@mostafa mostafa force-pushed the feat/field-observer branch from 70ca2ad to b691e53 May 25, 2026 20:34
mostafa added 4 commits May 25, 2026 22:46
Pre-fix, update_field_observer_metrics computed the Prometheus counter
delta from snapshot.events_observed minus the prom counter's current
value. After DELETE /api/v1/fields/observer, snapshot.events_observed
resets to 0 while the Prometheus counter sits at its previous lifetime
value, so every subsequent observation was silently dropped from the
counter until the post-reset count climbed past the old lifetime value.
Same problem on overflow_dropped.

Fix by tracking two pairs of atomics on FieldObserver:

  - events_observed / overflow_dropped: reset to 0 on reset(); drive
    the "since-last-reset" view exposed by the HTTP API.
  - lifetime_events_observed / lifetime_overflow_dropped: monotonic,
    never reset; drive the Prometheus counters via the bridge.

FieldObservation grows two fields for the lifetime totals. The metrics
bridge in rsigma-cli now reads from those. Add a
lifetime_counters_survive_reset regression test that asserts a reset
between observations does not desync the counters.
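
To make the failure mode and the fix concrete, here is a small std-only sketch. `EventCounters`, `MonotonicCounter`, and `bridge` are hypothetical stand-ins (the second one models a Prometheus counter that can only move forward); none of them are rsigma's actual types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical counter pair: one resettable view, one monotonic view.
#[derive(Default)]
pub struct EventCounters {
    since_reset: AtomicU64,
    lifetime: AtomicU64,
}

impl EventCounters {
    pub fn record(&self) {
        self.since_reset.fetch_add(1, Ordering::Relaxed);
        self.lifetime.fetch_add(1, Ordering::Relaxed);
    }

    /// Clears only the since-last-reset view and returns what was cleared.
    pub fn reset(&self) -> u64 {
        self.since_reset.swap(0, Ordering::Relaxed)
    }

    pub fn since_reset(&self) -> u64 {
        self.since_reset.load(Ordering::Relaxed)
    }

    pub fn lifetime(&self) -> u64 {
        self.lifetime.load(Ordering::Relaxed)
    }
}

/// Stand-in for an exported counter: it can only move forward.
#[derive(Default)]
pub struct MonotonicCounter(u64);

impl MonotonicCounter {
    pub fn inc_by(&mut self, delta: u64) {
        self.0 += delta;
    }

    pub fn get(&self) -> u64 {
        self.0
    }
}

/// Advance the exported counter by whatever the source total gained.
/// Fed a resettable total, the delta collapses to 0 after a reset.
pub fn bridge(source_total: u64, exported: &mut MonotonicCounter) {
    let delta = source_total.saturating_sub(exported.get());
    exported.inc_by(delta);
}
```

Bridging from `lifetime()` keeps the exported counter moving after a reset; bridging from `since_reset()` reproduces the pre-fix behaviour, where observations are dropped until the post-reset count passes the old total.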

While here, three small cleanups surfaced by the same review pass:

  - Switch the field-observer hot-path read in
    LogProcessor::process_batch_with_format from
    ArcSwap::load_full (clones an Arc on every batch) to
    ArcSwap::load (returns a Guard, no Arc clone). Saves one
    Arc clone (an atomic refcount increment, not a heap allocation)
    per batch in the default --observe-fields=off configuration.
  - Collapse the two-pass walk over snapshot.entries in the
    /api/v1/fields handler into a single pass that partitions
    unknown vs intersection while building the `seen` set.
  - Fix the docstrings that claimed "zero-allocation" / "zero-copy"
    for the field_keys walk; in practice the dot-joined leaf paths
    are not substrings of the source value and require one String
    allocation per leaf. KvEvent's override returns Cow::Borrowed
    (truly zero-copy); JsonEvent's allocates per leaf.
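
The borrowed-versus-owned distinction in the last bullet can be shown with a short std-only sketch. `flat_keys` and `joined_key` are hypothetical helpers, and a `BTreeMap<String, String>` stands in for a flat key/value event; the real KvEvent and JsonEvent types are not reproduced here.

```rust
use std::borrow::Cow;
use std::collections::BTreeMap;

/// Flat key/value event: the keys already exist as strings, so they can be lent out.
pub fn flat_keys(event: &BTreeMap<String, String>) -> Vec<Cow<'_, str>> {
    event.keys().map(|k| Cow::Borrowed(k.as_str())).collect()
}

/// Nested path: "actor.id" is not a substring of the source document,
/// so the joined form has to be built, which costs one String allocation.
pub fn joined_key(parent: &str, leaf: &str) -> Cow<'static, str> {
    Cow::Owned(format!("{parent}.{leaf}"))
}
```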

Tests: 1479/1479 across 45 suites, including the new regression test
and the existing 8-test cli_daemon_fields_observer integration suite.
Clippy + fmt clean.
Address every item from the second-pass review:

1. paginate move semantics. Take Vec<T> by value and drain the page out
   instead of cloning the slice. Saves up to `total - limit` clones of
   serde_json::Value per /api/v1/fields/* call (worst case ~900 clones
   per request at the default limit of 100 against a saturated 1000-
   element list). The total is preserved by returning it alongside the
   page so callers don't need to capture it before the move.

2. JsonEvent::field_keys emits leaves only. Previously a nested object
   contributed both its own path (`actor`) and its leaves
   (`actor.id`), which caused operators to see `actor` in the gap
   signal even when `actor.id` was rule-referenced. Sigma rules
   nearly always reference leaves via dot-notation, so emitting the
   intermediate was a net source of false positives. The default
   Event::field_keys impl (which walks to_json()) is updated in
   lockstep so the two paths stay consistent. Tests rewritten to
   match.

3. FieldObserver keys are Arc<str> instead of String. Snapshotting
   10 000 keys now costs 10 000 atomic refcount increments plus one
   Vec allocation, not 10 000 String clones. FieldObservationEntry.field
   changed from String to Arc<str> (technically a public API change,
   but the only consumer is the daemon HTTP handler which reads via
   &*field). The trade-off is one extra allocation per first-time
   insertion (Arc header) in exchange for cheap repeated snapshots,
   which is the right side of the trade because /metrics is scraped
   every 15-30 s while distinct field insertion happens at most once
   per observation window per unique key.

4. Default Event::field_keys impl cleaned up. The helper now threads
   Vec<String> internally instead of Vec<Cow<'static, str>>, then wraps
   into Cow::Owned at the end. Same semantics, far less cognitive
   overhead at the trait boundary.
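
The leaves-only walk from item 2 can be sketched without external crates. The `Value` enum below is a deliberately tiny stand-in for a JSON value (it has no array variant, since the PR does not say how arrays are handled), `leaf_keys` is a hypothetical name, and emitting an object that sits at the depth cap as if it were a leaf is an assumption made for this sketch.

```rust
use std::collections::BTreeMap;

/// Tiny stand-in for a JSON value so the sketch needs no external crates.
pub enum Value {
    Leaf(String),
    Object(BTreeMap<String, Value>),
}

/// Depth cap; 64 is the figure quoted in this PR's changelog notes.
pub const MAX_NESTING_DEPTH: usize = 64;

/// Returns dot-joined paths for leaves only: `actor.id`, never `actor`.
pub fn leaf_keys(root: &Value) -> Vec<String> {
    let mut out = Vec::new();
    if let Value::Object(map) = root {
        walk(map, "", 0, &mut out);
    }
    out
}

fn walk(map: &BTreeMap<String, Value>, prefix: &str, depth: usize, out: &mut Vec<String>) {
    for (key, value) in map {
        let path = if prefix.is_empty() {
            key.clone()
        } else {
            format!("{prefix}.{key}")
        };
        match value {
            // Descend into nested objects without recording the object itself.
            Value::Object(child) if depth + 1 < MAX_NESTING_DEPTH => {
                walk(child, &path, depth + 1, out);
            }
            // Leaves (and, by assumption, objects at the depth cap) are recorded.
            _ => out.push(path),
        }
    }
}
```

With this shape, a rule that references `actor.id` no longer leaves a stray `actor` entry in the gap signal.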

Tests: 1480/1480 across 45 suites (one new regression test from this
commit covering deeply-nested leaves-only behaviour). Clippy + fmt
clean.
The field-coverage surface that ships behind --observe-fields on the
daemon is now available on the one-shot evaluator too. That uncovered
accidental coupling: FieldObserver lived in rsigma-runtime (daemon
feature) but only needs the Event trait (rsigma-eval). Move it.

Relocation:

  - FieldObserver, FieldObservation, FieldObservationEntry move from
    rsigma-runtime/src/field_observer.rs into
    rsigma-eval/src/field_observer.rs (no API changes).
  - Mutex backing switched from parking_lot::Mutex to std::sync::Mutex
    to keep rsigma-eval dependency-light per the workspace constraint.
    Lock-poisoning is treated as a programmer bug (the locked region
    only does HashMap operations and saturating arithmetic, neither of
    which can panic), so the lock() calls .expect(...) consistently.
  - rsigma-runtime re-exports the three types via `pub use rsigma_eval::{...}`
    so existing consumers (LogProcessor::set_field_observer, the
    daemon HTTP handlers, downstream library users) keep compiling
    against rsigma_runtime::FieldObserver unchanged.

engine eval mirror (no longer behind the daemon feature):

  --observe-fields                (off; enables observation)
  --observe-fields-max-keys <N>   (default 10000; cap)
  --observe-fields-report <PATH>  (writes JSON; defaults to stderr when
                                   omitted so stdout NDJSON detections
                                   stay machine-consumable)

The report has the same shape as GET /api/v1/fields so the same jq
queries work against either runtime. observe_event(ctx, &event) is
threaded through cmd_eval_with_correlations, cmd_eval_detection_only,
eval_stream_corr/detect, eval_line_corr/corr_json/detect/detect_json,
and eval_evtx_corr/detect; it inlines to a single null-check when the
context is None, so the disabled path costs essentially nothing.
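
The "single null-check when disabled" idea can be sketched as follows. This is not the real observe_event signature: `ObserveCtx` is a hypothetical context type, and passing the key walk as a closure is a choice made here to show that the disabled path never touches the event.

```rust
/// Hypothetical observation context; present only when the flag is set.
pub struct ObserveCtx {
    pub seen: Vec<String>,
}

/// Returns immediately when no context is attached, so the disabled path
/// costs one `Option` check and never runs the field walk.
#[inline]
pub fn observe_event(ctx: Option<&mut ObserveCtx>, field_keys: impl FnOnce() -> Vec<String>) {
    let Some(ctx) = ctx else { return };
    ctx.seen.extend(field_keys());
}
```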

Docs: new "Field observability (offline coverage report)" section in
docs/cli/engine/eval.md; an eval-side walkthrough added to
docs/guide/observability.md alongside the daemon one; CHANGELOG note
covering both the offline report and the relocation/re-export;
rsigma-cli README gets three new flag rows; rsigma-eval README gains a
field_observer module subsection; rsigma-runtime README rewrites its
FieldObserver bullet to point at the new home via the re-export.

Tests: 6 new cli_eval integration tests (full report shape, unknown
fields surfaced, missing fields surfaced, --observe-fields-max-keys
overflow accounting, stderr default when no path, off-by-default
silence). 1486/1486 across 45 suites. Clippy + fmt + mkdocs --strict
clean.
Post-commit review surfaced ~40 lines of duplicated coverage-join
logic between the daemon's GET /api/v1/fields* handlers
(crates/rsigma-cli/src/daemon/server.rs) and the engine eval
render_field_report (crates/rsigma-cli/src/commands/eval.rs): same
single-pass partition over snapshot.entries into unknown vs
intersection, same missing-fields computation against the rule field
set, same `seen` HashSet. The two surfaces could drift on field
semantics with no compile-time check.

Centralize via FieldObservation::coverage(&RuleFieldSet) -> FieldCoverage
in rsigma-eval. Returns borrowed views (Vec<&FieldObservationEntry> for
unknown, Vec<(&str, &FieldOrigin)> for missing) so consumers still own
JSON rendering and pagination but the join itself lives in one place.
Re-exported from rsigma-runtime alongside the other field-observer
types.

Daemon handlers (fields_full, fields_unknown, fields_missing) now call
.coverage(); the now-unused missing_fields free function is removed.
eval's render_field_report does the same.

Two clap polish items the review also flagged:

  - --observe-fields-report PATH now `requires = "observe_fields"`, so
    supplying the report path without the enabling flag fails with a
    clear clap error instead of silently producing no report.
  - --observe-fields-max-keys is now NonZeroUsize, so 0 is rejected at
    parse time (it would have made every observation overflow with no
    useful tracking).
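
Why a NonZeroUsize flag rejects 0 at parse time can be seen with the standard library alone. The helper below is a hypothetical sketch of a FromStr-based value parser; the clap wiring itself (including `requires`) is not shown and the error text is illustrative.

```rust
use std::num::NonZeroUsize;

/// Parses a max-keys value the way a `FromStr`-based flag parser would.
/// `NonZeroUsize::from_str` fails on "0", so the bad value never reaches the observer.
pub fn parse_max_keys(raw: &str) -> Result<NonZeroUsize, String> {
    raw.parse::<NonZeroUsize>()
        .map_err(|err| format!("invalid value '{raw}' for max keys: {err}"))
}
```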

Three new unit tests in field_observer cover the coverage helper
(partition correctness, empty-observer-yields-only-missing,
ordering-preserved). Two new cli_eval integration tests cover the
clap rejections. 1491/1491 across 45 suites. Clippy + fmt clean.
@mostafa mostafa changed the title from "feat(daemon): unknown-field discovery API (--observe-fields)" to "feat: field observability for engine daemon and engine eval (--observe-fields)" May 25, 2026
Post-refactor sweep over the docs/readmes/changelog that had drifted
from the actual code:

CHANGELOG.md:
  - "parking_lot::Mutex" → "std::sync::Mutex" (the move to rsigma-eval
    swapped the backing to keep that crate dependency-light)
  - "zero-allocation for JSON" → accurate "one String per leaf path,
    depth-capped at 64" wording; KvEvent's Cow::Borrowed override
    called out explicitly
  - "a few hundred KB of String keys" → Arc<str> keys (the
    snapshot-cost optimization that already shipped)
  - new "Shared join primitive" paragraph for FieldObservation::coverage
  - NonZeroUsize validation and clap-requires behavior on the eval
    flags surfaced explicitly
  - FieldCoverage added to the rsigma-runtime re-export list

Root README:
  - Field observability bullet now mentions both runtimes (daemon
    HTTP + eval one-shot JSON report) rather than only the daemon
    endpoints, matching the shipped feature set

rsigma-eval README:
  - FieldObservation::coverage(&RuleFieldSet) row added to the
    field_observer table so contributors find the shared join
    primitive when reading the crate API

rsigma-runtime README:
  - FieldCoverage added to the re-export list

mkdocs build --strict clean.
@mostafa mostafa merged commit 07e78a9 into main May 25, 2026
12 checks passed
@mostafa mostafa deleted the feat/field-observer branch May 25, 2026 21:45