Skip to content

feat(daemon): Phase 5 — parallel re-promote + USN replay on warm load + background refresh timer (closes #93, #94, #95, #96)#98

Merged
githubrobbi merged 3 commits into
mainfrom
after-phase-4-fixes
Apr 28, 2026
Merged

feat(daemon): Phase 5 — parallel re-promote + USN replay on warm load + background refresh timer (closes #93, #94, #95, #96)#98
githubrobbi merged 3 commits into
mainfrom
after-phase-4-fixes

Conversation

@githubrobbi
Copy link
Copy Markdown
Collaborator

Closes #93. Closes #94. Closes #95. Closes #96.

Fixes four post-Phase-4 issues identified during the v0.5.80 reference-host validation: a 2.6× perf regression in the Phase 4 re-promote path (#93), two correctness bugs that served stale data on warm cache load (#94), the absence of a background USN refresh timer (#95), and silent-failure modes in the compact-cache loader that hid root causes from production logs (#96).

What's in this PR

Two commits on after-phase-4-fixes:

Commit Issues Files
89848e27d #93 + env-var idle TTLs index/mod.rs, index/tests.rs, cache/policy.rs
0c05791c7 #94, #95, #96 9 files in uffs-core and uffs-daemon

#93 — Parallelize Phase 4 re-promote

Replaced the serial for letter in needs_promote { spawn_blocking; await; swap; } loop in IndexManager::ensure_warm_for_dispatch with a tokio::task::JoinSet fan-out so per-letter body loads run concurrently in the blocking pool. Each closure is std::panic::catch_unwind-wrapped to preserve the letter identifier through panics.

Real-world impact. v0.5.80 reference host: 6-drive Phase 4 re-promote dropped from 15.1 s sequential to ~max(per-drive) parallel — matching the existing 5.7 s cold-boot WARM baseline that was already parallel.

Bonus: added UFFS_HOT_TO_WARM_IDLE_SECS / UFFS_WARM_TO_PARKED_IDLE_SECS / UFFS_PARKED_TO_COLD_IDLE_SECS env-var overrides cached via OnceLock so the 30 min Warm→Parked idle can be compressed for fast iteration:

UFFS_WARM_TO_PARKED_IDLE_SECS=360 \
UFFS_HOT_TO_WARM_IDLE_SECS=60 \
    uffs daemon start --drive C,D,E,F,G,M,S

#94 — USN replay on warm cache load + Phase 4 re-promote

The user's conceptual model:

Tier USN replay needed?
COLD (full MFT scan) No — IS the truth
WARM (cached snapshot) Yes — stale by definition
Phase 4 re-promote Yes — Parked snapshot from boot

Two bugs in v0.5.80 violated this:

  1. compact_loader::try_compact_cache_hit returned the on-disk compact cache directly without USN replay. 5/7 drives at the v0.5.80 reference run served stale data.
  2. DiskBodyLoader::load (Phase 4 re-promote) bypassed TTL entirely with load_compact_cache(letter, u64::MAX, 0, true). A daemon up for a week served week-old indices on every re-promote.

Fix

  • New pub fn load_drive_with_usn_refresh(letter) in uffs-core/src/compact_loader.rs — Windows-conditional helper that goes through MftReader::read_index_cachedapply_usn_updates_to_fresh_index (existing infra), rebuilds compact, and submits a background compact-cache save. Non-Windows stub returns an anyhow error so callers fall back transparently.
  • New RefreshTiming struct (mft / compact / trigram / total ms).
  • compact_loader::load_driveremoved the try_compact_cache_hit fast-path bypass. Cold-boot WARM now always goes through MftIndex (cache + USN replay on Windows; offline-file is a no-op).
  • DiskBodyLoader::load — primary path is now load_drive_with_usn_refresh; falls back to bare load_compact_cache if the helper errors (drive G error 1179, journal unavailable, non-Windows).

Cost. Cold-boot WARM at N=7 drives parallel moves from ~5.7 s to ~7–10 s (one-time per daemon start) per the issue's predicted budget.

#95 — Background USN refresh timer

A daemon up for an hour previously served 1-hour-stale data; up for a week, 1-week-stale. No background USN apply existed (uffs-daemon had zero USN references).

Fix

  • New spawn_usn_refresh_controller task in lib.rs — ticks every UFFS_USN_REFRESH_INTERVAL_SECS (default 300 s = 5 min, env-var overridable). Mirrors spawn_idle_demote_controller's structure.
  • New IndexManager::refresh_usn_for_warm_shards — three-phase read-lock detectparallel JoinSet USN refreshper-letter replace_warm_body Arc-swap. Mirrors the Phase 4 perf: parallelize re-promote in ensure_warm_for_dispatch (15.1s sequential → ~5s parallel) #93 parallelism pattern; catch_unwind-wrapped per closure. Aggregate refreshed/failed/total_ms emitted at info-level on target: "shard.refresh".
  • New ShardRegistry::replace_warm_body(letter, body) — swaps the body of a Warm/Hot shard while preserving stats; returns rebuilt registry; emits reason = "usn-refresh" on target: "shard.transition".
  • New cache::policy::USN_REFRESH_INTERVAL_SECS constant + UFFS_USN_REFRESH_INTERVAL_SECS env-var override cached via OnceLock.

#96 — Structured LoadCacheError

load_compact_cache previously returned Option<DriveCompactIndex> and silently collapsed eight failure modes into a single None. Production logs showed "compact cache miss" with no way to distinguish "file missing on first boot" from "decryption failed (key rotated)" from "stale by epoch" from "runtime tempfile collision".

Fix

  • New pub enum LoadCacheError with variants for each path: Missing, StaleByTtl, StaleVsMft, StaleByEpoch, Io, KeyUnavailable, DecryptFailed, DecompressFailed, ParseError, RuntimeTempfile, Deserialize.
  • load_compact_cache now returns Result<DriveCompactIndex, LoadCacheError>.
  • is_compact_cache_fresh (bool) replaced by check_compact_cache_freshness (Result<(), LoadCacheError>).
  • All callers updated to log the structured error at appropriate levels.

Tests

cargo test -p uffs-daemon -p uffs-core --lib157 passed, 0 failed (was 153 before; net +4 new tests).

Test Coverage
ensure_warm_for_dispatch_promotes_in_parallel #93SlowBodyLoader records peak in-flight; pins peak ≥ 2 and wall ≤ 1.5 × delay
read_env_secs_falls_back_when_env_unset env-var override fallback contract
refresh_usn_for_warm_shards_no_op_when_empty #95 — fast-path early-return
refresh_usn_for_warm_shards_no_op_when_no_warm_or_hot #95 — Parked-only registry skips JoinSet
refresh_usn_for_warm_shards_handles_helper_errors_gracefully #95 — cross-platform graceful failure

Lint gates

Gate Status
cargo test -p uffs-daemon -p uffs-core --lib 157 passed
just lint-prod clean
just lint-tests clean
just check-windows clean
lint-pre-push (rustdoc + doc-tests + smoke + xwin) 167 s ✅

Risk

Medium. The cold-boot path change (#94) adds ~2 s per drive of compact rebuild work, taking N=7-drive parallel cold-boot from ~5.7 s to ~7–10 s. This is a one-time cost per daemon start and is acceptable per the issue spec. The Phase 4 re-promote change is pure improvement (parallel + USN-refreshed). The USN refresh timer (#95) runs entirely in the background with catch_unwind per-drive isolation; per-drive failures don't cascade.

Manual Windows verify checklist

After merge, suggested steps on the v0.5.80 reference host:

  1. Cold-boot WARM regression. daemon kill && daemon start --drive C,D,E,F,G,M,S. Look for target=shard.transition reason=promote lines and the new target=shard.transition "♻️ USN-refreshed body ready" lines (one per drive). Total parallel time should be ≤ 10 s.
  2. Phase 4 re-promote regression. UFFS_WARM_TO_PARKED_IDLE_SECS=360 daemon start ..., wait ~6.5 min, run * query — observe parallel re-promote with USN deltas applied. Total should be < 5 s for N=6 drives (vs 15.1 s sequential pre-Phase 4 perf: parallelize re-promote in ensure_warm_for_dispatch (15.1s sequential → ~5s parallel) #93).
  3. Background refresh tick. UFFS_USN_REFRESH_INTERVAL_SECS=60 daemon start ..., watch for target=shard.refresh "USN refresh tick complete" lines every minute with refreshed=N matching the Warm/Hot drive count.
  4. Stale-cache detection. Touch a file; within one refresh interval the daemon search should return it.

Closes #93.  Refs #94, #95.

* `ensure_warm_for_dispatch` now fans out per-letter body loads via
  `tokio::task::JoinSet` instead of the serial loop.  Each closure
  is wrapped in `std::panic::catch_unwind` so a panicking
  `BodyLoader::load` keeps its letter identifier in the JoinSet
  result; without that, the JoinError arm would lose the letter.
  Per-letter write-lock swap is unchanged (sub-µs pointer swap, the
  cumulative contention at N=7 is < 10 µs).

  Real-world impact: 6-drive Phase 4 re-promote on the v0.5.80
  reference host should drop from 15.1 s sequential to ~max(per-drive)
  parallel, matching the 5.7 s cold-boot WARM baseline.

* `cache::policy` constants (HOT/WARM/PARKED idle TTLs) become
  defaults, with new `UFFS_HOT_TO_WARM_IDLE_SECS` /
  `UFFS_WARM_TO_PARKED_IDLE_SECS` / `UFFS_PARKED_TO_COLD_IDLE_SECS`
  env-var overrides cached via `OnceLock`.  `next_state_for_idle`
  reads via the new getters (`hot_to_warm_idle_secs` etc.).
  Compresses the Phase 4 testing flow's 30 min Warm \u2192 Parked idle
  to a 6 min cycle without shipping a non-default policy.

  Effective override is logged on first read at `shard.policy`
  target so production has no silent policy drift.

Tests:
* New `ensure_warm_for_dispatch_promotes_in_parallel` — uses a
  `SlowBodyLoader` that records peak in-flight; pins peak \u2265 2 and
  wall \u2264 1.5 \u00d7 delay (vs. 3 \u00d7 delay baseline).
* New `read_env_secs_falls_back_when_env_unset` — pins fallback
  contract against a deliberately-unique env name.

Verified: `cargo test -p uffs-daemon --lib` 154 passed, 0 failed.
`just lint-prod`, `just lint-tests`, `just check-windows` all clean.
…r + structured load errors

Closes #94.  Closes #95.  Closes #96.

## #96 \u2014 LoadCacheError taxonomy

`compact_cache::load_compact_cache` previously returned `Option<DriveCompactIndex>`
and silently collapsed eight failure modes into a single `None`.  Production logs
showed "compact cache miss" with no way to distinguish "file missing on
first boot" from "decryption failed (key rotated)" from "stale by epoch".

New `pub enum LoadCacheError` with variants for each path: `Missing`,
`StaleByTtl`, `StaleVsMft`, `StaleByEpoch`, `Io`, `KeyUnavailable`,
`DecryptFailed`, `DecompressFailed`, `ParseError`, `RuntimeTempfile`,
`Deserialize`.  All callers updated to log the structured error.

## #94 \u2014 USN replay on warm cache load

Conceptual model: WARM = cached snapshot \u2192 USN replay REQUIRED.  Two
v0.5.80 bugs violated this:
1. `try_compact_cache_hit` returned compact cache directly without USN
   replay (5/7 drives at the v0.5.80 reference run served stale data).
2. `DiskBodyLoader::load` (Phase 4 re-promote) called
   `load_compact_cache(letter, u64::MAX, 0, true)` \u2014 TTL bypassed.

Fix: new `load_drive_with_usn_refresh(letter)` helper in
`compact_loader.rs` (Windows-conditional; non-Windows stub errors so
callers fall back).  Removed `try_compact_cache_hit` from `load_drive`.
`DiskBodyLoader::load` now uses the helper as primary path with
`load_compact_cache` as fallback for unavailable journals.

## #95 \u2014 background USN refresh timer

New `spawn_usn_refresh_controller` ticks every
`UFFS_USN_REFRESH_INTERVAL_SECS` (default 5 min).  New
`IndexManager::refresh_usn_for_warm_shards` mirrors the #93 three-phase
parallel-JoinSet pattern: detect \u2192 parallel refresh \u2192 per-letter
`replace_warm_body` Arc-swap.  New `ShardRegistry::replace_warm_body`
preserves stats and emits `reason = "usn-refresh"` on `shard.transition`.

## Tests

3 new tests in `index/tests.rs`:
* refresh_usn_no_op_when_empty
* refresh_usn_no_op_when_no_warm_or_hot
* refresh_usn_handles_helper_errors_gracefully (cross-platform)

Verified: `cargo test -p uffs-daemon -p uffs-core --lib` 157 passed,
0 failed.  `just lint-prod`, `just lint-tests`, `just check-windows`
all clean.
@githubrobbi githubrobbi changed the title feat(daemon,core): Phase 5 — parallel re-promote + USN replay on warm load + background refresh timer (closes #93, #94, #95, #96) feat(daemon): Phase 5 — parallel re-promote + USN replay on warm load + background refresh timer (closes #93, #94, #95, #96) Apr 28, 2026
@githubrobbi githubrobbi enabled auto-merge (squash) April 28, 2026 17:02
@githubrobbi githubrobbi merged commit f79a94c into main Apr 28, 2026
22 checks passed
@githubrobbi githubrobbi deleted the after-phase-4-fixes branch April 28, 2026 17:17
githubrobbi added a commit that referenced this pull request Apr 28, 2026
Drop the in-source defaults so dev iteration can exercise the
Phase 4 re-promote path (#93 parallelised) without waiting 30 min
or remembering to set `UFFS_*_IDLE_SECS` env vars on every
launch.

| Constant                       | Before    | After  |
|--------------------------------|----------:|-------:|
| `HOT_TO_WARM_IDLE_SECS`        | 300 s     |   60 s |
| `WARM_TO_PARKED_IDLE_SECS`     | 1800 s    |  300 s |
| `PARKED_TO_COLD_IDLE_SECS`     | 86 400 s  |  86 400 s (unchanged) |

Total Hot \u2192 Parked test cycle drops from 35 min to ~6 min, which
is the iteration time the Phase 5 (Windows pressure) work needs.

The bloom + path-trie pre-check from Phase 4 Commit F means a
Parked re-promote only re-decrypts the compact body when the
query actually hits the drive, so the extra demote/promote churn
at this cadence is bounded by query pattern, not by the timer
alone.

The compile-time invariant
`HOT_TO_WARM_IDLE_SECS < WARM_TO_PARKED_IDLE_SECS < PARKED_TO_COLD_IDLE_SECS`
still holds (60 < 300 < 86 400).

Production users who want longer retention can still set the
`UFFS_HOT_TO_WARM_IDLE_SECS` / `UFFS_WARM_TO_PARKED_IDLE_SECS` /
`UFFS_PARKED_TO_COLD_IDLE_SECS` env vars; the override pathway
(introduced in PR #98 alongside #93) is unchanged.

Verified: `cargo test -p uffs-daemon --lib` 157 passed.
`just lint-prod`, `just lint-tests`, `just check-windows` clean.
githubrobbi added a commit that referenced this pull request May 15, 2026
#255)

Release pipeline #98 (v0.5.99) burned ~32 min building binaries on all three platforms, then died in <1 s at the 'Create GitHub Release' step with the cryptic:

    HTTP 403 — Resource not accessible by integration

Root cause was a repo-level setting flip: 'Settings → Actions → General → Workflow permissions' had been switched from 'Read and write' to 'Read-only', which clamps every job's GITHUB_TOKEN to read scope at runtime regardless of what the workflow file declares.  v0.5.96 (the previous successful release ~16 h earlier) ran with the same workflow file and succeeded, confirming the file was never the problem.  The audit log is not retrievable (free-tier org), so the actor/timestamp of the flip is unrecoverable.

This commit prevents the next ~30 min of silent build time:

  1. Pre-flight permissions probe in 'release-preparation'.  Creates a draft release with a throwaway tag (run_id+attempt) and immediately deletes it on success.  Exercises the EXACT REST path that softprops/action-gh-release uses 30+ min later, so a permissions clamp surfaces in ~1 s with a precise error message pointing at the toggle that needs flipping (repo level, with an org-level fallback note).  Draft releases do not create git tags, so failed cleanup leaks at worst an invisible draft row — never an orphan tag.

  2. Explicit per-job 'permissions:' block on 'create-github-release' pinning 'contents: write', 'id-token: write', 'attestations: write'.  Documents the scope needs at the point of use rather than relying on inheritance from the top-level block ~520 lines up.  Does NOT change runtime behaviour by itself — the repo-level clamp still wins — but pairs with the pre-flight to make the failure mode self-documenting from the YAML alone.

Why a synthetic-release probe rather than a direct '/repos/.../actions/permissions/workflow' API call: that endpoint requires the 'administration: read' scope, which neither this workflow nor the default GITHUB_TOKEN carries.  Widening to add it would expand the permission surface; the create-draft probe stays inside 'contents: write' which the job already declares.

Local validation: actionlint clean on both touched workflow files; lint-fast + lint-pre-push will gate the push.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment