feat(daemon): Phase 5 — parallel re-promote + USN replay on warm load + background refresh timer (closes #93, #94, #95, #96)#98
Merged
Conversation
Closes #93. Refs #94, #95. * `ensure_warm_for_dispatch` now fans out per-letter body loads via `tokio::task::JoinSet` instead of the serial loop. Each closure is wrapped in `std::panic::catch_unwind` so a panicking `BodyLoader::load` keeps its letter identifier in the JoinSet result; without that, the JoinError arm would lose the letter. Per-letter write-lock swap is unchanged (sub-µs pointer swap, the cumulative contention at N=7 is < 10 µs). Real-world impact: 6-drive Phase 4 re-promote on the v0.5.80 reference host should drop from 15.1 s sequential to ~max(per-drive) parallel, matching the 5.7 s cold-boot WARM baseline. * `cache::policy` constants (HOT/WARM/PARKED idle TTLs) become defaults, with new `UFFS_HOT_TO_WARM_IDLE_SECS` / `UFFS_WARM_TO_PARKED_IDLE_SECS` / `UFFS_PARKED_TO_COLD_IDLE_SECS` env-var overrides cached via `OnceLock`. `next_state_for_idle` reads via the new getters (`hot_to_warm_idle_secs` etc.). Compresses the Phase 4 testing flow's 30 min Warm \u2192 Parked idle to a 6 min cycle without shipping a non-default policy. Effective override is logged on first read at `shard.policy` target so production has no silent policy drift. Tests: * New `ensure_warm_for_dispatch_promotes_in_parallel` — uses a `SlowBodyLoader` that records peak in-flight; pins peak \u2265 2 and wall \u2264 1.5 \u00d7 delay (vs. 3 \u00d7 delay baseline). * New `read_env_secs_falls_back_when_env_unset` — pins fallback contract against a deliberately-unique env name. Verified: `cargo test -p uffs-daemon --lib` 154 passed, 0 failed. `just lint-prod`, `just lint-tests`, `just check-windows` all clean.
…r + structured load errors Closes #94. Closes #95. Closes #96. ## #96 \u2014 LoadCacheError taxonomy `compact_cache::load_compact_cache` previously returned `Option<DriveCompactIndex>` and silently collapsed eight failure modes into a single `None`. Production logs showed "compact cache miss" with no way to distinguish "file missing on first boot" from "decryption failed (key rotated)" from "stale by epoch". New `pub enum LoadCacheError` with variants for each path: `Missing`, `StaleByTtl`, `StaleVsMft`, `StaleByEpoch`, `Io`, `KeyUnavailable`, `DecryptFailed`, `DecompressFailed`, `ParseError`, `RuntimeTempfile`, `Deserialize`. All callers updated to log the structured error. ## #94 \u2014 USN replay on warm cache load Conceptual model: WARM = cached snapshot \u2192 USN replay REQUIRED. Two v0.5.80 bugs violated this: 1. `try_compact_cache_hit` returned compact cache directly without USN replay (5/7 drives at the v0.5.80 reference run served stale data). 2. `DiskBodyLoader::load` (Phase 4 re-promote) called `load_compact_cache(letter, u64::MAX, 0, true)` \u2014 TTL bypassed. Fix: new `load_drive_with_usn_refresh(letter)` helper in `compact_loader.rs` (Windows-conditional; non-Windows stub errors so callers fall back). Removed `try_compact_cache_hit` from `load_drive`. `DiskBodyLoader::load` now uses the helper as primary path with `load_compact_cache` as fallback for unavailable journals. ## #95 \u2014 background USN refresh timer New `spawn_usn_refresh_controller` ticks every `UFFS_USN_REFRESH_INTERVAL_SECS` (default 5 min). New `IndexManager::refresh_usn_for_warm_shards` mirrors the #93 three-phase parallel-JoinSet pattern: detect \u2192 parallel refresh \u2192 per-letter `replace_warm_body` Arc-swap. New `ShardRegistry::replace_warm_body` preserves stats and emits `reason = "usn-refresh"` on `shard.transition`. ## Tests 3 new tests in `index/tests.rs`: * refresh_usn_no_op_when_empty * refresh_usn_no_op_when_no_warm_or_hot * refresh_usn_handles_helper_errors_gracefully (cross-platform) Verified: `cargo test -p uffs-daemon -p uffs-core --lib` 157 passed, 0 failed. `just lint-prod`, `just lint-tests`, `just check-windows` all clean.
githubrobbi
added a commit
that referenced
this pull request
Apr 28, 2026
Drop the in-source defaults so dev iteration can exercise the Phase 4 re-promote path (#93 parallelised) without waiting 30 min or remembering to set `UFFS_*_IDLE_SECS` env vars on every launch. | Constant | Before | After | |--------------------------------|----------:|-------:| | `HOT_TO_WARM_IDLE_SECS` | 300 s | 60 s | | `WARM_TO_PARKED_IDLE_SECS` | 1800 s | 300 s | | `PARKED_TO_COLD_IDLE_SECS` | 86 400 s | 86 400 s (unchanged) | Total Hot \u2192 Parked test cycle drops from 35 min to ~6 min, which is the iteration time the Phase 5 (Windows pressure) work needs. The bloom + path-trie pre-check from Phase 4 Commit F means a Parked re-promote only re-decrypts the compact body when the query actually hits the drive, so the extra demote/promote churn at this cadence is bounded by query pattern, not by the timer alone. The compile-time invariant `HOT_TO_WARM_IDLE_SECS < WARM_TO_PARKED_IDLE_SECS < PARKED_TO_COLD_IDLE_SECS` still holds (60 < 300 < 86 400). Production users who want longer retention can still set the `UFFS_HOT_TO_WARM_IDLE_SECS` / `UFFS_WARM_TO_PARKED_IDLE_SECS` / `UFFS_PARKED_TO_COLD_IDLE_SECS` env vars; the override pathway (introduced in PR #98 alongside #93) is unchanged. Verified: `cargo test -p uffs-daemon --lib` 157 passed. `just lint-prod`, `just lint-tests`, `just check-windows` clean.
githubrobbi
added a commit
that referenced
this pull request
May 15, 2026
#255) Release pipeline #98 (v0.5.99) burned ~32 min building binaries on all three platforms, then died in <1 s at the 'Create GitHub Release' step with the cryptic: HTTP 403 — Resource not accessible by integration Root cause was a repo-level setting flip: 'Settings → Actions → General → Workflow permissions' had been switched from 'Read and write' to 'Read-only', which clamps every job's GITHUB_TOKEN to read scope at runtime regardless of what the workflow file declares. v0.5.96 (the previous successful release ~16 h earlier) ran with the same workflow file and succeeded, confirming the file was never the problem. The audit log is not retrievable (free-tier org), so the actor/timestamp of the flip is unrecoverable. This commit prevents the next ~30 min of silent build time: 1. Pre-flight permissions probe in 'release-preparation'. Creates a draft release with a throwaway tag (run_id+attempt) and immediately deletes it on success. Exercises the EXACT REST path that softprops/action-gh-release uses 30+ min later, so a permissions clamp surfaces in ~1 s with a precise error message pointing at the toggle that needs flipping (repo level, with an org-level fallback note). Draft releases do not create git tags, so failed cleanup leaks at worst an invisible draft row — never an orphan tag. 2. Explicit per-job 'permissions:' block on 'create-github-release' pinning 'contents: write', 'id-token: write', 'attestations: write'. Documents the scope needs at the point of use rather than relying on inheritance from the top-level block ~520 lines up. Does NOT change runtime behaviour by itself — the repo-level clamp still wins — but pairs with the pre-flight to make the failure mode self-documenting from the YAML alone. Why a synthetic-release probe rather than a direct '/repos/.../actions/permissions/workflow' API call: that endpoint requires the 'administration: read' scope, which neither this workflow nor the default GITHUB_TOKEN carries. Widening to add it would expand the permission surface; the create-draft probe stays inside 'contents: write' which the job already declares. Local validation: actionlint clean on both touched workflow files; lint-fast + lint-pre-push will gate the push.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #93. Closes #94. Closes #95. Closes #96.
Fixes four post-Phase-4 issues identified during the v0.5.80 reference-host validation: a 2.6× perf regression in the Phase 4 re-promote path (#93), two correctness bugs that served stale data on warm cache load (#94), the absence of a background USN refresh timer (#95), and silent-failure modes in the compact-cache loader that hid root causes from production logs (#96).
What's in this PR
Two commits on
after-phase-4-fixes:89848e27dindex/mod.rs,index/tests.rs,cache/policy.rs0c05791c7uffs-coreanduffs-daemon#93 — Parallelize Phase 4 re-promote
Replaced the serial
for letter in needs_promote { spawn_blocking; await; swap; }loop inIndexManager::ensure_warm_for_dispatchwith atokio::task::JoinSetfan-out so per-letter body loads run concurrently in the blocking pool. Each closure isstd::panic::catch_unwind-wrapped to preserve the letter identifier through panics.Real-world impact. v0.5.80 reference host: 6-drive Phase 4 re-promote dropped from 15.1 s sequential to ~max(per-drive) parallel — matching the existing 5.7 s cold-boot WARM baseline that was already parallel.
Bonus: added
UFFS_HOT_TO_WARM_IDLE_SECS/UFFS_WARM_TO_PARKED_IDLE_SECS/UFFS_PARKED_TO_COLD_IDLE_SECSenv-var overrides cached viaOnceLockso the 30 min Warm→Parked idle can be compressed for fast iteration:UFFS_WARM_TO_PARKED_IDLE_SECS=360 \ UFFS_HOT_TO_WARM_IDLE_SECS=60 \ uffs daemon start --drive C,D,E,F,G,M,S#94 — USN replay on warm cache load + Phase 4 re-promote
The user's conceptual model:
Two bugs in v0.5.80 violated this:
compact_loader::try_compact_cache_hitreturned the on-disk compact cache directly without USN replay. 5/7 drives at the v0.5.80 reference run served stale data.DiskBodyLoader::load(Phase 4 re-promote) bypassed TTL entirely withload_compact_cache(letter, u64::MAX, 0, true). A daemon up for a week served week-old indices on every re-promote.Fix
pub fn load_drive_with_usn_refresh(letter)inuffs-core/src/compact_loader.rs— Windows-conditional helper that goes throughMftReader::read_index_cached→apply_usn_updates_to_fresh_index(existing infra), rebuilds compact, and submits a background compact-cache save. Non-Windows stub returns ananyhowerror so callers fall back transparently.RefreshTimingstruct (mft/compact/trigram/totalms).compact_loader::load_drive— removed thetry_compact_cache_hitfast-path bypass. Cold-boot WARM now always goes through MftIndex (cache + USN replay on Windows; offline-file is a no-op).DiskBodyLoader::load— primary path is nowload_drive_with_usn_refresh; falls back to bareload_compact_cacheif the helper errors (drive Gerror 1179, journal unavailable, non-Windows).Cost. Cold-boot WARM at N=7 drives parallel moves from ~5.7 s to ~7–10 s (one-time per daemon start) per the issue's predicted budget.
#95 — Background USN refresh timer
A daemon up for an hour previously served 1-hour-stale data; up for a week, 1-week-stale. No background USN apply existed (
uffs-daemonhad zero USN references).Fix
spawn_usn_refresh_controllertask inlib.rs— ticks everyUFFS_USN_REFRESH_INTERVAL_SECS(default 300 s = 5 min, env-var overridable). Mirrorsspawn_idle_demote_controller's structure.IndexManager::refresh_usn_for_warm_shards— three-phase read-lock detect → parallelJoinSetUSN refresh → per-letterreplace_warm_bodyArc-swap. Mirrors the Phase 4 perf: parallelize re-promote in ensure_warm_for_dispatch (15.1s sequential → ~5s parallel) #93 parallelism pattern;catch_unwind-wrapped per closure. Aggregaterefreshed/failed/total_msemitted at info-level ontarget: "shard.refresh".ShardRegistry::replace_warm_body(letter, body)— swaps the body of a Warm/Hot shard while preserving stats; returns rebuilt registry; emitsreason = "usn-refresh"ontarget: "shard.transition".cache::policy::USN_REFRESH_INTERVAL_SECSconstant +UFFS_USN_REFRESH_INTERVAL_SECSenv-var override cached viaOnceLock.#96 — Structured
LoadCacheErrorload_compact_cachepreviously returnedOption<DriveCompactIndex>and silently collapsed eight failure modes into a singleNone. Production logs showed "compact cache miss" with no way to distinguish "file missing on first boot" from "decryption failed (key rotated)" from "stale by epoch" from "runtime tempfile collision".Fix
pub enum LoadCacheErrorwith variants for each path:Missing,StaleByTtl,StaleVsMft,StaleByEpoch,Io,KeyUnavailable,DecryptFailed,DecompressFailed,ParseError,RuntimeTempfile,Deserialize.load_compact_cachenow returnsResult<DriveCompactIndex, LoadCacheError>.is_compact_cache_fresh(bool) replaced bycheck_compact_cache_freshness(Result<(), LoadCacheError>).Tests
cargo test -p uffs-daemon -p uffs-core --lib→ 157 passed, 0 failed (was 153 before; net +4 new tests).ensure_warm_for_dispatch_promotes_in_parallelSlowBodyLoaderrecords peak in-flight; pins peak ≥ 2 and wall ≤ 1.5 × delayread_env_secs_falls_back_when_env_unsetrefresh_usn_for_warm_shards_no_op_when_emptyrefresh_usn_for_warm_shards_no_op_when_no_warm_or_hotrefresh_usn_for_warm_shards_handles_helper_errors_gracefullyLint gates
cargo test -p uffs-daemon -p uffs-core --libjust lint-prodjust lint-testsjust check-windowslint-pre-push(rustdoc + doc-tests + smoke + xwin)Risk
Medium. The cold-boot path change (#94) adds ~2 s per drive of compact rebuild work, taking N=7-drive parallel cold-boot from ~5.7 s to ~7–10 s. This is a one-time cost per daemon start and is acceptable per the issue spec. The Phase 4 re-promote change is pure improvement (parallel + USN-refreshed). The USN refresh timer (#95) runs entirely in the background with
catch_unwindper-drive isolation; per-drive failures don't cascade.Manual Windows verify checklist
After merge, suggested steps on the v0.5.80 reference host:
daemon kill && daemon start --drive C,D,E,F,G,M,S. Look fortarget=shard.transition reason=promotelines and the newtarget=shard.transition "♻️ USN-refreshed body ready"lines (one per drive). Total parallel time should be ≤ 10 s.UFFS_WARM_TO_PARKED_IDLE_SECS=360 daemon start ..., wait ~6.5 min, run*query — observe parallel re-promote with USN deltas applied. Total should be < 5 s for N=6 drives (vs 15.1 s sequential pre-Phase 4 perf: parallelize re-promote in ensure_warm_for_dispatch (15.1s sequential → ~5s parallel) #93).UFFS_USN_REFRESH_INTERVAL_SECS=60 daemon start ..., watch fortarget=shard.refresh "USN refresh tick complete"lines every minute withrefreshed=Nmatching the Warm/Hot drive count.