feat(daemon): smart correlation state restoration during replay#61

Merged
mostafa merged 5 commits into main from feat/smart-state-restore on May 1, 2026

@mostafa mostafa commented May 1, 2026

Summary

  • Replace the blanket "skip state on any replay" heuristic with position-aware auto-restore. The daemon now tracks the NATS JetStream high-water mark (stream sequence + published timestamp) in SQLite alongside the correlation snapshot and compares the replay start point against it on restart: forward replay restores state (preserving cross-boundary correlation windows), backward replay skips state (avoiding double-counting).
  • Add --keep-state / --clear-state as explicit operator overrides for cases where the automatic decision is not what you want.
  • Wire --timestamp-fallback wallclock|skip to the CLI so forensic replay over logs without standard timestamp fields can opt into "detections only, no correlation updates."
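
As a rough illustration of the `--timestamp-fallback wallclock|skip` flag, the two accepted values could map onto an enum as sketched below. Only `TimestampFallback::Skip` is named in this PR; the `WallClock` variant name, the `FromStr` approach, and the error text are assumptions made for the example, not the daemon's actual CLI wiring.

```rust
use std::str::FromStr;

/// What to do with an event that carries no recognized timestamp field.
/// `Skip` is named in the PR; `WallClock` is an assumed variant name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimestampFallback {
    /// Fall back to the local wall-clock time.
    WallClock,
    /// Leave correlation state untouched for this event.
    Skip,
}

impl FromStr for TimestampFallback {
    type Err = String;

    /// Accepts exactly the two spellings shown in the flag description.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "wallclock" => Ok(Self::WallClock),
            "skip" => Ok(Self::Skip),
            other => Err(format!(
                "invalid --timestamp-fallback value '{other}' (expected wallclock or skip)"
            )),
        }
    }
}
```

With this shape, `"skip".parse::<TimestampFallback>()` yields `Ok(TimestampFallback::Skip)` and any other spelling is rejected with a message naming the accepted values.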

Behavior matrix

| Scenario | Old behavior | New behavior |
| --- | --- | --- |
| Resume (default) | Restore | Restore (unchanged) |
| --replay-from-sequence N where N > stored | Skip | Restore (forward catch-up) |
| --replay-from-sequence N where N <= stored | Skip | Skip (unchanged) |
| --replay-from-time T where T > stored | Skip | Restore (forward catch-up) |
| --replay-from-time T where T <= stored | Skip | Skip (unchanged) |
| --replay-from-latest | Skip | Skip (unchanged) |
| --clear-state | Skip | Skip (unchanged) |
| --keep-state (new) | N/A | Restore (explicit override) |
| No stored position + any replay | Skip | Skip (unchanged, safe default) |
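
The matrix above can be read as a small pure function. The following is a minimal sketch of that decision tree, not the code merged in this PR: `StateRestoreMode`, `SourcePosition`, and `decide_state_restore` are names taken from the description, while the `ReplayStart` enum, the field names, the argument order, and the `bool` return value (`true` means restore) are assumptions made for illustration.

```rust
/// Operator intent, as named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateRestoreMode {
    Auto,
    ForceClear,
    ForceKeep,
}

/// Hypothetical representation of where the consumer starts reading.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReplayStart {
    Resume,
    Latest,
    FromSequence(u64),
    FromTime(i64), // Unix timestamp (assumed unit: seconds)
}

/// High-water mark stored next to the snapshot (field names are assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourcePosition {
    pub sequence: u64,
    pub timestamp: i64,
}

/// Returns `true` when the stored correlation state should be restored.
pub fn decide_state_restore(
    mode: StateRestoreMode,
    start: ReplayStart,
    stored: Option<SourcePosition>,
) -> bool {
    match mode {
        // Explicit overrides win over any position comparison.
        StateRestoreMode::ForceClear => false,
        StateRestoreMode::ForceKeep => true,
        StateRestoreMode::Auto => match (start, stored) {
            (ReplayStart::Resume, _) => true,
            (ReplayStart::Latest, _) => false,
            // Any replay without a stored position: skip (safe default).
            (_, None) => false,
            // Strictly forward replay restores; equal or backward skips.
            (ReplayStart::FromSequence(n), Some(p)) => n > p.sequence,
            (ReplayStart::FromTime(t), Some(p)) => t > p.timestamp,
        },
    }
}
```

Keeping the function free of I/O is what lets each row of the matrix be checked in isolation, for example by asserting that `FromSequence(10)` against a stored sequence of 10 skips state.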

Implementation

  • AckToken::nats_stream_position() extracts (stream_sequence, published_unix_timestamp) from JetStream message metadata before acking.
  • SourcePosition struct stored as source_sequence / source_timestamp INTEGER columns in SQLite (auto-migrated from old schema via PRAGMA table_info check).
  • StateRestoreMode enum (Auto / ForceClear / ForceKeep) replaces the old clear_correlation_state: bool on DaemonConfig.
  • decide_state_restore() encapsulates the full decision tree, testable in isolation.
  • High-water mark tracked via AtomicU64 / AtomicI64 in the ack task, read by the periodic and shutdown state savers.
  • --timestamp-fallback wired through build_correlation_config to set TimestampFallback::Skip on the CorrelationConfig.
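
To make the high-water-mark bullet concrete, here is one way the ack task and the state savers could share the position through atomics. This is a sketch under assumptions: the `HighWaterMark` type, its method names, the use of `fetch_max`, and treating sequence 0 as "nothing acked yet" are all illustrative choices, not details confirmed by the PR.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical holder for the highest acked JetStream position.
#[derive(Debug, Default)]
pub struct HighWaterMark {
    sequence: AtomicU64,
    timestamp: AtomicI64,
}

impl HighWaterMark {
    /// Called from the ack task; out-of-order acks never move the mark back.
    pub fn record(&self, sequence: u64, timestamp: i64) {
        self.sequence.fetch_max(sequence, Ordering::Relaxed);
        self.timestamp.fetch_max(timestamp, Ordering::Relaxed);
    }

    /// Called from the periodic and shutdown savers.
    /// Assumes sequence 0 means no message has been acked yet.
    pub fn snapshot(&self) -> Option<(u64, i64)> {
        let seq = self.sequence.load(Ordering::Relaxed);
        if seq == 0 {
            None
        } else {
            Some((seq, self.timestamp.load(Ordering::Relaxed)))
        }
    }
}
```

The two fields are read separately, so a snapshot taken while an ack is in flight may pair a sequence with a slightly older or newer timestamp. That is probably acceptable for a restore heuristic, but a mutex around both values would be the safer choice if an exact pair is required.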

Test plan

  • 5 unit tests for SqliteStateStore (round-trip with/without position, position updates, old schema migration, empty DB)
  • 10 unit tests for decide_state_restore (ForceClear, ForceKeep, Resume, Latest, forward/backward/equal sequence, forward/backward time, no stored position)
  • cargo clippy --workspace --all-targets --all-features -- -D warnings clean
  • cargo clippy --workspace --all-targets -- -D warnings clean (without NATS feature)
  • Existing NATS integration + E2E tests still pass
  • Manual: daemon with --state-db, shut down, restart with --replay-from-sequence above stored position, verify state restored
  • Manual: same with sequence below stored position, verify state cleared

mostafa added 5 commits May 1, 2026 18:19
Replace the blanket "skip state on any replay" heuristic with a
position-aware decision.  The daemon now tracks the NATS JetStream
high-water mark (stream sequence + published timestamp) in SQLite
alongside the correlation snapshot, and on restart compares the
replay start point against the stored position:

- Forward replay (start > stored): restore state, preserving
  cross-boundary correlation windows.
- Backward replay (start <= stored): skip state to avoid
  double-counting.
- --replay-from-latest: always skip (starting fresh).
- --keep-state / --clear-state: explicit operator overrides.

Also wires --timestamp-fallback (wallclock|skip) to the CLI so
forensic replay over logs without standard timestamp fields can
opt into "detections only, no correlation updates."

Replaces the old boolean `clear_correlation_state` config field
with the richer `StateRestoreMode` enum (Auto / ForceClear /
ForceKeep).  The SQLite schema is auto-migrated on open to add
`source_sequence` and `source_timestamp` columns.

15 tests covering:

- SqliteStateStore: round-trip with/without SourcePosition, position
  updates, schema migration from old format, empty database.
- decide_state_restore: ForceClear skips, ForceKeep restores, Auto
  with Resume/Latest/forward sequence/backward sequence/equal
  sequence/forward time/backward time/no stored position.

Cover --clear-state, --keep-state, --timestamp-fallback, schema
migration, and NATS sequence-aware forward/backward replay logic.

On macOS, FSEvents fires multiple events on tempfile creation, filling
the bounded reload channel before the test calls POST /api/v1/reload.
Retry with 500ms backoff so the debounce loop drains the channel first.
Also handle non-2xx responses in http_post instead of panicking.
@mostafa mostafa merged commit 5f73e9f into main May 1, 2026
9 checks passed
@mostafa mostafa deleted the feat/smart-state-restore branch May 1, 2026 16:55
mostafa added a commit that referenced this pull request May 1, 2026
…ADMEs

Both READMEs were missing documentation for the v0.9.0 NATS production
hardening (PR #59) and smart state restoration (PR #61). This adds all
missing CLI flags, usage examples, and explanatory sections for auth/TLS,
DLQ, replay, consumer groups, state restore, and timestamp fallback.