Skip to content

fix(soak): close Phase 6 / Phase 7 24-h gates + pin daemon log strings#218

Merged
githubrobbi merged 2 commits into
mainfrom
fix/phase6-phase7-soak-validator-and-log-string-pins
May 13, 2026
Merged

fix(soak): close Phase 6 / Phase 7 24-h gates + pin daemon log strings#218
githubrobbi merged 2 commits into
mainfrom
fix/phase6-phase7-soak-validator-and-log-string-pins

Conversation

@githubrobbi
Copy link
Copy Markdown
Collaborator

Why

The 2026-05-11 Phase 6 + Phase 7 24-h soak runs (LOG/uffs_soak/phase6-20260509-213122 and LOG/uffs_soak/phase7-20260510-214412) each landed 1 failure in the validator — and in both cases the daemon was actually doing the right thing. The soak harness's log-scrape just couldn't see it.

Phase 6 — Drive C warm_ttl_sec exceeds peers failed at 300s vs 300s

The post-EMA-fix daemon was computing warm_ttl_sec ≈ 3687s for drive C every tick during the synthetic-load window — the Phase 6 adaptive-bonus formula engaged correctly. The validator's parse_max_ttl_field just couldn't see it because crate::index::transitions::evaluate_idle_demote only emits a DEBUG-level shard.ttl event when it actually proposes a demote. During the synthetic load drive C sits in Warm/Hot with idle_secs ≈ 0, so the demote-eval ladder never reaches the DEBUG-level idle-demote / min-tier-clamp arms. The catch-all (None, _) arm fires at TRACE — and the harness's RUST_LOG=shard.ttl=debug filtered it out.

Fix (scripts/dev/long-soak.rs:746): shard.ttl=debugshard.ttl=trace. ~23 k extra trace events over 24 h (~3.5 MB), marginal against the existing 75 MB log volume.

Phase 7 — Encrypted-cache refresh fired during soak failed at found 0 refresh-class events

The validator hunted for USN refresh tick|trigger_save|threshold.*save|encrypted cache refreshnone of those alternatives match the daemon's actual INFO line:

```
INFO Journal poll: triggered background compact-cache save drive=F reason=AgeElapsed cursor=151008
```

Fix (scripts/dev/long-soak.rs:1244): regex re-anchored on compact-cache save.

Retroactive verification: the existing 24-h Phase 7 log contains 11 such events:
```
$ grep -c 'compact-cache save' LOG/uffs_soak/phase7-20260510-214412/daemon.log
11
```

Daemon-side regression pins

Two new tests so a future log-message rename fails CI before reaching a 24-h soak gate:

  • crate::cache::journal_loop::tests::save_log_message::compact_cache_save_log_message_pins_string_target_and_level — pins target = uffs_daemon::cache::journal_loop, level = INFO, and the literal compact-cache save substring on the journal-loop save event.
  • crate::index::tests::shard_ttl_events::below_ttl_event_pins_target_level_message_and_reason — pins target = shard.ttl, level = TRACE, message = Adaptive idle-demote evaluation: not yet idle past TTL, reason = below-ttl, and the four TTL fields the harness's parse_max_ttl_field depends on.

Both tests reuse the existing crate::index::tests::tracing_capture EventLog scaffold. Visibility hoisted from pub(super)pub(crate) on the scaffold items + pub(crate) on mod tests; in index/mod.rs and mod tracing_capture; in index/tests/mod.rs. No production-code visibility changes — only sharing of an already-test-only scaffold across crate-internal modules.

Verification

```
cargo test --package uffs-daemon --lib # 283 + 2 = 285 pass
cargo clippy --package uffs-daemon --tests # clean
cargo fmt --package uffs-daemon -- --check # clean
just lint-fast # green (ran via pre-commit hook)
just lint-pre-push # green (ran via pre-push hook)
```

What's NOT in this PR

  • No production-code behavior change. The daemon's log messages, levels, and event fields are unchanged — only newly pinned by tests.
  • No new dependencies.
  • No CI workflow / just-recipe changes.

Operator next steps after merge

  1. Pull main on the Windows soak host.
  2. --dev smoke (5 min each):
    ```
    rust-script scripts/dev/long-soak.rs phase6 --dev
    rust-script scripts/dev/long-soak.rs phase7 --dev
    ```
  3. Re-run the 24-h gates:
    ```
    just soak phase6
    just soak phase7
    ```

Closes the Phase 6 / Phase 7 follow-up gates documented in docs/architecture/memory-tiering-windows-host-validation.md.

@githubrobbi githubrobbi enabled auto-merge (squash) May 13, 2026 11:15
githubrobbi added a commit that referenced this pull request May 13, 2026
…st-PR-218 contract pins

Companion to PR #218 (harness fixes + daemon-side log-string regression tests).  Separates the harness-bug noise from what the soak actually proved:

Phase 6 (LOG/uffs_soak/phase6-20260509-213122)

  * 8 of 9 contract assertions PASS end-to-end:

    1. C never demotes below Warm (0 to=Parked events; 2871 min-tier-clamp events)

    2. D / E / F / G / M / S each fired \u2265 2 Warm\u2192Parked transitions

    + the original 6 sub-assertions pinning the demote-controller / floor-clamp surface

  * 1 assertion deferred: adaptive bonus (warm_ttl_sec) was computed every tick but emitted only at TRACE; the pre-PR-218 RUST_LOG was shard.ttl=debug.  Code + unit-test contract intact (PR #146 + new shard_ttl_events test in PR #218); direct end-to-end evidence requires one more 24-h run with the trace-level fix.

Phase 7 (LOG/uffs_soak/phase7-20260510-214412)

  * 6 of 7 contract assertions PASS at run time

  * 7 of 7 with the PR #218 regex fix \u2014 the save pipeline emitted 11 'compact-cache save' events during the soak; the pre-fix regex 'USN refresh tick|trigger_save|threshold.*save|encrypted cache refresh' matched none of them.  Retroactively closes Phase 7 with no new soak required.

Doc updates:

  * docs/architecture/memory-tiering-windows-host-validation.md \u00a72:

      \u2013 RUST_LOG bumped to shard.ttl=trace with inline rationale

      \u2013 criterion 3 grep pivoted from chosen_ttl_sec to warm_ttl_sec (the rate-sensitive Warm\u2192Parked edge present on every drive regardless of tier; pre-fix compare was structurally impossible across tiers)

  * docs/architecture/memory-tiering-windows-host-validation.md \u00a73:

      \u2013 criterion 2 grep re-anchored on 'compact-cache save' with the literal log-line shape + unit-test pin reference

  * docs/architecture/memory-tiering-windows-host-validation.md \u00a76:

      \u2013 new \u00a74.5b 'Phase 6 24-h capture \u2014 adaptive-bonus visibility deferred' sub-section

      \u2013 new \u00a74.5c 'Phase 7 24-h capture \u2014 retroactively ALL GREEN' sub-section (with WS-ratio footnote noting the vacuous-pass observation for future tightening)

  * PR-attachment checklist breadcrumb filename updated chosen_ttl_sec \u2192 warm_ttl_sec with back-compat note

  * docs/architecture/memory-tiering-bake-criteria.md:

      \u2013 Phase 7 box ticked [x] retroactively with full rationale

      \u2013 Phase 6 box stays [ ] with status note describing what's pending

  * CHANGELOG Unreleased: new 'Fixed \u2014 Phase 6 + Phase 7 24-h soak harness (PR #218)' section listing both harness fixes, both new regression tests, the visibility hoist, and the doc updates.
Phase 6 — scripts/dev/long-soak.rs:746 (RUST_LOG bump)

  shard.ttl=debug filtered the daemon's catch-all below-ttl event (TRACE-level), so drive C's adaptive warm_ttl_sec bonus was computed every tick but never written to the log.  The post-EMA-fix daemon was correct; the synthetic-load assertion just couldn't see the engagement.  Bumping shard.ttl to TRACE adds ~3.5 MB to the 75 MB log — marginal cost for full observability.

Phase 7 — scripts/dev/long-soak.rs:1244 (regex anchor)

  The pre-fix validator hunted for 'USN refresh tick|trigger_save|threshold.*save|encrypted cache refresh' — none of those alternatives match the daemon's actual INFO message 'Journal poll: triggered background compact-cache save'.  Re-anchored on 'compact-cache save'; the 2026-05-11 24-h log retroactively passes with 11 matched events.

Daemon-side regression pins:

* crates/uffs-daemon/src/cache/journal_loop/tests/save_log_message.rs — pins target = 'uffs_daemon::cache::journal_loop', level = INFO, and the literal 'compact-cache save' substring on the journal-loop save event.

* crates/uffs-daemon/src/index/tests/shard_ttl_events.rs — adds below_ttl_event_pins_target_level_message_and_reason, pinning target = 'shard.ttl', level = TRACE, message = 'Adaptive idle-demote evaluation: not yet idle past TTL', reason = 'below-ttl', and the four TTL fields.

Both tests reuse the existing crate::index::tests::tracing_capture EventLog scaffold, which is hoisted from pub(super) to pub(crate) and exposed via 'pub(crate) mod tests' in src/index/mod.rs + 'pub(crate) mod tracing_capture' in src/index/tests/mod.rs.  No production-code visibility changes.

Verification:

  cargo test --package uffs-daemon --lib   — 283 + 2 = 285 tests pass

  cargo clippy --package uffs-daemon --tests   — clean

  cargo fmt --package uffs-daemon -- --check   — clean

  grep -c 'compact-cache save' LOG/uffs_soak/phase7-20260510-214412/daemon.log = 11   — Phase 7 retroactively passes.
…st-PR-218 contract pins

Companion to PR #218 (harness fixes + daemon-side log-string regression tests).  Separates the harness-bug noise from what the soak actually proved:

Phase 6 (LOG/uffs_soak/phase6-20260509-213122)

  * 8 of 9 contract assertions PASS end-to-end:

    1. C never demotes below Warm (0 to=Parked events; 2871 min-tier-clamp events)

    2. D / E / F / G / M / S each fired \u2265 2 Warm\u2192Parked transitions

    + the original 6 sub-assertions pinning the demote-controller / floor-clamp surface

  * 1 assertion deferred: adaptive bonus (warm_ttl_sec) was computed every tick but emitted only at TRACE; the pre-PR-218 RUST_LOG was shard.ttl=debug.  Code + unit-test contract intact (PR #146 + new shard_ttl_events test in PR #218); direct end-to-end evidence requires one more 24-h run with the trace-level fix.

Phase 7 (LOG/uffs_soak/phase7-20260510-214412)

  * 6 of 7 contract assertions PASS at run time

  * 7 of 7 with the PR #218 regex fix \u2014 the save pipeline emitted 11 'compact-cache save' events during the soak; the pre-fix regex 'USN refresh tick|trigger_save|threshold.*save|encrypted cache refresh' matched none of them.  Retroactively closes Phase 7 with no new soak required.

Doc updates:

  * docs/architecture/memory-tiering-windows-host-validation.md \u00a72:

      \u2013 RUST_LOG bumped to shard.ttl=trace with inline rationale

      \u2013 criterion 3 grep pivoted from chosen_ttl_sec to warm_ttl_sec (the rate-sensitive Warm\u2192Parked edge present on every drive regardless of tier; pre-fix compare was structurally impossible across tiers)

  * docs/architecture/memory-tiering-windows-host-validation.md \u00a73:

      \u2013 criterion 2 grep re-anchored on 'compact-cache save' with the literal log-line shape + unit-test pin reference

  * docs/architecture/memory-tiering-windows-host-validation.md \u00a76:

      \u2013 new \u00a74.5b 'Phase 6 24-h capture \u2014 adaptive-bonus visibility deferred' sub-section

      \u2013 new \u00a74.5c 'Phase 7 24-h capture \u2014 retroactively ALL GREEN' sub-section (with WS-ratio footnote noting the vacuous-pass observation for future tightening)

  * PR-attachment checklist breadcrumb filename updated chosen_ttl_sec \u2192 warm_ttl_sec with back-compat note

  * docs/architecture/memory-tiering-bake-criteria.md:

      \u2013 Phase 7 box ticked [x] retroactively with full rationale

      \u2013 Phase 6 box stays [ ] with status note describing what's pending

  * CHANGELOG Unreleased: new 'Fixed \u2014 Phase 6 + Phase 7 24-h soak harness (PR #218)' section listing both harness fixes, both new regression tests, the visibility hoist, and the doc updates.
@githubrobbi githubrobbi force-pushed the fix/phase6-phase7-soak-validator-and-log-string-pins branch from 47814d1 to b08bbd9 Compare May 13, 2026 11:25
@githubrobbi githubrobbi merged commit f484cfb into main May 13, 2026
26 checks passed
@githubrobbi githubrobbi deleted the fix/phase6-phase7-soak-validator-and-log-string-pins branch May 13, 2026 11:39
githubrobbi added a commit that referenced this pull request May 15, 2026
…capture trail (#244)

Companion doc pass to PR #218 (which fixed the Phase 6 + Phase 7 soak harness).  Records the 2026-05-13/14 capture findings: Phase 7 retroactively closed 7-of-7 against the existing log via the PR #218 regex fix; ws-trace 4-of-4 closed against the fresh 24-h capture in LOG/uffs_soak/wstrace-20260513-113344.

Findings landed:

  * memory-tiering-windows-host-validation.md

      \u2013 new \u00a74.5d capturing the 2026-05-13 ws-trace 24-h run.  Documents the apparent 30x WS drop (5.37 GB \u2192 184 MB) as the benign EmptyWorkingSet page-trim from the Phase 5 G2 wiring, NOT a leak: pm_bytes decreased only 3% (6.53 \u2192 6.36 GB) and the daemon's own per-drive RESIDENT accounting stayed at ~5.0 GiB across all 7 drives for the full 24h while serving 289/289 keep-warm probes.

      \u2013 \u00a74.5c Phase 7 footnote softened: the 'vacuous pass' concern is resolved by the ws-trace evidence (the same WS-trim pattern was observed under sustained Warm-tier load).

      \u2013 \u00a73 criterion 3 (WS \u2264 1.5x) now carries an inline note pointing operators at the PM-bytes cross-check + \u00a74.5d for the full WS-vs-PM breakdown.

  * memory-tiering-bake-criteria.md

      \u2013 \u00a71.7 'Working-Set trajectory' carries an inline note on the WS-bound semantics (EmptyWorkingSet trim is the right reading, not a leak).

      \u2013 The \u00a73 'WS trajectory captured' exit-criteria checkbox ticks [x] with status note + post-v0.6.0 refinement recommendation (re-anchor the validator on pm_bytes).

      \u2013 Two pre-existing dangling \u00a76.2 / \u00a76.3 refs corrected to '\u00a76 sub-section \u00a74.5b / \u00a74.5c' (the target doc's sub-sections are numbered \u00a74.5x, not \u00a76.x).

  * memory-tiering-readiness-validation-2026-05-05.md

      \u2013 'Remaining items for the v0.6.0 cut' list updated: Phase 7 + ws-trace removed (closed retroactively); only Phase 6 re-run + one-week bake remain.

      \u2013 Closing note updated: 'Phase 7 and ws-trace are closed; Phase 8 has been closed since 2026-05-05.'

  * CHANGELOG.md Unreleased

      \u2013 New 'Verified \u2014 Phase 7 + ws-trace 24-h Windows-host soaks (2026-05-13/14)' section capturing both retroactive closures with cross-refs to the docs landing in this same PR.

      \u2013 Fix pre-existing dangling \u00a76.2 / \u00a76.3 cross-refs in the prior 'Fixed \u2014 Phase 6 + Phase 7 24-h soak harness (PR #218)' entry.

Conclusion: 2 of 3 v0.6.0 24-h Windows-host soak gates are closed.  Only the Phase 6 24-h re-run (with the post-PR-218 shard.ttl=trace harness fix) plus the one-week 'main' bake remain before the v0.6.0 cut.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant