fix(credentials): retry transient Windows FS errors when persisting auth-profiles.json (#3355) #3364

Merged
senamakel merged 2 commits into tinyhumansai:main from oxoxDev:fix/3355-auth-profiles-rename-retry
Jun 4, 2026
Conversation

@oxoxDev
Contributor

@oxoxDev oxoxDev commented Jun 4, 2026

Summary

Problem

AuthProfilesStore::write_persisted_locked (src/openhuman/credentials/profiles.rs:911-946) serialised the persisted JSON to a tmp file and renamed it onto auth-profiles.json via raw std::fs::write + std::fs::rename. On Windows, AV / Search-Indexer / Defender briefly holding the destination handle returns ERROR_SHARING_VIOLATION (32), ERROR_ACCESS_DENIED (5), or ERROR_DELETE_PENDING (303) — exactly the family crate::openhuman::util::retry_with_backoff is built to absorb. The sibling .lock-create at profiles.rs:987 was already routed through this helper (PR #1641 / #2085 / #2180); the .json write+rename path was the missing partial.

Result: any single transient AV hold flipped a normally-retry-safe operation into a permanent error. Compounding it, load_locked (profiles.rs:744) calls write_persisted_locked whenever its in-memory purge set is non-empty (#3125-class drops, legacy kind values per #2439, decrypt-failure recovery), and the frontend polls openhuman.app_state_snapshot every ~2s — so one persistent AV hold amplified into 10k+ identical Sentry events in 24h on TAURI-RUST-92J.

Solution

write_persisted_locked now wraps both the tmp write and the rename in retry_with_backoff("…", PERSIST_RETRY_ATTEMPTS, PERSIST_RETRY_BASE_MS, …). Constants are set to 6 attempts at base 100ms (worst case ≈ 3.1 s of backoff per stage, ≈ 6.2 s across write + rename), matching the proven .lock-create budget and staying well inside LOCK_TIMEOUT_MS so concurrent acquire_lock callers never starve behind a single retry-loop owner. Outer with_context is preserved on both calls so the Sentry fingerprint shape stays stable across releases.

This is the real fix to the underlying file-system race — not noise suppression. Genuinely unrecoverable failures still propagate via Err and reach the RPC error path, so a user whose AV is permanently hostile remains visible in Sentry as honest signal.
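A minimal sketch of the write-then-rename pattern described above, for readers who want to see the shape of the retry. It is not the project's code: the retry_with_backoff signature, the classifier parameter, and persist_atomically are assumptions made for illustration, and it uses plain std::io errors rather than the with_context wrapping the PR preserves. Only the two constant names and their values (6 attempts, 100 ms base) come from the description.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::Duration;

// Names and values taken from the PR description.
const PERSIST_RETRY_ATTEMPTS: u32 = 6;
const PERSIST_RETRY_BASE_MS: u64 = 100;

/// Hypothetical retry helper: runs `op` up to `attempts` times, sleeping with a
/// doubling delay between attempts, and only retries errors the classifier
/// marks as transient. The final failure is returned without sleeping.
fn retry_with_backoff<T>(
    attempts: u32,
    base_ms: u64,
    is_transient: impl Fn(&io::Error) -> bool,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut delay_ms = base_ms;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if attempt < attempts && is_transient(&err) => {
                thread::sleep(Duration::from_millis(delay_ms));
                delay_ms = delay_ms.saturating_mul(2);
                attempt += 1;
            }
            // Non-transient error, or retry budget exhausted.
            Err(err) => return Err(err),
        }
    }
}

/// Hypothetical stand-in for the persist step: each stage gets its own retry loop.
fn persist_atomically(
    tmp: &Path,
    dest: &Path,
    bytes: &[u8],
    is_transient: impl Fn(&io::Error) -> bool,
) -> io::Result<()> {
    retry_with_backoff(PERSIST_RETRY_ATTEMPTS, PERSIST_RETRY_BASE_MS, &is_transient, || {
        fs::write(tmp, bytes)
    })?;
    retry_with_backoff(PERSIST_RETRY_ATTEMPTS, PERSIST_RETRY_BASE_MS, &is_transient, || {
        fs::rename(tmp, dest)
    })
}
```

Because each stage owns a separate loop, a failure in the rename stage does not consume attempts from the write stage, and a non-transient error in either stage returns on the first attempt.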

Regression coverage uses a #[cfg(test)]-only injection counter on AuthProfilesStore (force_next_transient_failures + consume_test_transient_failure) that returns an error chain containing __TEST_TRANSIENT__ — the test sentinel that is_transient_fs_error already accepts. Three tests exercise the retry semantics:

  • one-shot transient is absorbed and the profile lands on disk,
  • a burst of transients within the retry budget still resolves to a successful write,
  • sustained failures beyond the budget still surface as Err with the outer with_context preserved.

Production binaries carry zero new surface area from the injection helpers (gated behind #[cfg(test)]).
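The injection mechanism described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual AuthProfilesStore: StoreSketch and looks_transient are hypothetical names, the error type is std::io::Error rather than the project's error chain, and the #[cfg(test)] gating is omitted so the block compiles on its own. The two method names and the __TEST_TRANSIENT__ sentinel come from the description.

```rust
use std::io;
use std::sync::atomic::{AtomicU32, Ordering};

/// Sentinel string named in the PR description.
const TEST_TRANSIENT: &str = "__TEST_TRANSIENT__";

/// Hypothetical stand-in for the store; the real struct carries more state,
/// and in the PR the counter and helpers are gated behind #[cfg(test)].
#[derive(Default)]
struct StoreSketch {
    pending_failures: AtomicU32,
}

impl StoreSketch {
    /// Queue `n` failures for the next fallible operations.
    fn force_next_transient_failures(&self, n: u32) {
        self.pending_failures.store(n, Ordering::SeqCst);
    }

    /// Return a sentinel error while queued failures remain, then Ok.
    fn consume_test_transient_failure(&self) -> io::Result<()> {
        let took = self.pending_failures.fetch_update(
            Ordering::SeqCst,
            Ordering::SeqCst,
            |n| n.checked_sub(1), // None at zero, so the counter never underflows
        );
        match took {
            Ok(_) => Err(io::Error::new(io::ErrorKind::Other, TEST_TRANSIENT)),
            Err(_) => Ok(()),
        }
    }
}

/// Assumed classifier hook: treat any error whose message carries the sentinel
/// as transient, so non-Windows hosts can drive the retry path.
fn looks_transient(err: &io::Error) -> bool {
    err.to_string().contains(TEST_TRANSIENT)
}
```

Calling consume_test_transient_failure at the top of each retried closure makes the closure fail exactly as many times as were queued, after which the real file operation runs.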

Submission Checklist

  • N/A: no UI changes (Rust-only fix; no i18n keys touched, no React surface)
  • Tests added — three regression tests in src/openhuman/credentials/profiles_tests.rs exercising one-shot retry, retry-within-budget burst, and retry-exhaustion behaviour. cargo test --lib openhuman::credentials::profiles is green locally (38 passed, 0 failed).
  • cargo check --manifest-path Cargo.toml clean on the changed crate.
  • cargo fmt clean on the changed files.
  • N/A: cargo clippy -D warnings — pre-existing lints elsewhere in the crate (useless_vec in tokenjuice/text/process.rs, map_or simplifications in this file at lines 1106 / 1205 / 1211, assertion-on-constant in profiles_tests.rs at 600 / 619 / 623) are unchanged by this PR and untouched on main.
  • N/A: rustdoc / gitbooks — no architecture or contributor-facing behaviour change (internal retry semantics on an existing private method).
  • N/A: feature flag / capability catalog — no user-visible feature added, removed, or renamed.

Impact

  • Windows users whose AV / Search-Indexer / Defender intermittently held a handle on auth-profiles.json no longer see app_state_snapshot fail on every poll once a profile gets dropped — the write retries and lands on disk as soon as the handle releases.
  • Sentry TAURI-RUST-92J event volume drops sharply once the next release ships; the issue stays open until clean so any genuinely sustained AV interference still surfaces.
  • No behaviour change on macOS / Linux: is_transient_fs_error only matches Windows raw OS codes (5 / 32 / 33 / 303 / 1224), so non-Windows callers see identical first-attempt semantics.
  • No schema change, no migration, no RPC surface change, no public API surface change.
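To make the platform claim above concrete, here is a hedged sketch of a classifier keyed on raw OS codes. The function name and the cfg!(windows) gate are assumptions; the real is_transient_fs_error lives in src/openhuman/util.rs and, per the description, also accepts a test sentinel that this sketch leaves out. The five codes are the ones listed in the Impact section.

```rust
use std::io;

/// Windows raw OS codes listed in the PR description:
/// 5 ERROR_ACCESS_DENIED, 32 ERROR_SHARING_VIOLATION, 33 ERROR_LOCK_VIOLATION,
/// 303 ERROR_DELETE_PENDING, 1224 ERROR_USER_MAPPED_FILE.
const TRANSIENT_WINDOWS_CODES: [i32; 5] = [5, 32, 33, 303, 1224];

/// Hypothetical classifier: only Windows builds ever report an error as
/// transient, so other platforms keep first-attempt semantics.
fn is_transient_fs_error_sketch(err: &io::Error) -> bool {
    cfg!(windows)
        && err
            .raw_os_error()
            .map_or(false, |code| TRANSIENT_WINDOWS_CODES.contains(&code))
}
```

The gate matters because the same integers mean different things elsewhere: on Linux, raw code 5 is EIO and 32 is EPIPE, neither of which should be retried as an antivirus hold.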

Related

oxoxDev added 2 commits June 4, 2026 17:38
…uth-profiles.json (tinyhumansai#3355)

Wrap fs::write(tmp) + fs::rename(tmp -> auth-profiles.json) in
retry_with_backoff so transient Windows AV / Search-Indexer / Defender
handles on the destination (ERROR_SHARING_VIOLATION 32,
ERROR_ACCESS_DENIED 5, ERROR_DELETE_PENDING 303) absorb the same way
the sibling auth-profiles.lock create at acquire_lock already does
(tinyhumansai#1641 / tinyhumansai#2085). Same 6-attempt / 100ms-base budget so we stay inside
LOCK_TIMEOUT_MS and concurrent acquire_lock callers never starve. Outer
with_context preserved so the Sentry fingerprint shape stays stable.

cfg(test)-only failure-injection counter + consume_test_transient_failure
helper expose the retry path to regression coverage on non-Windows hosts
via the __TEST_TRANSIENT__ sentinel that is_transient_fs_error already
recognises (src/openhuman/util.rs:618). Zero production surface.

Closes tinyhumansai#3355
Sentry-Issue: TAURI-RUST-92J
…nyhumansai#3355)

Three new regression tests pinned to the Sentry TAURI-RUST-92J root
cause:

  * write_persisted_locked_retries_one_shot_transient — single injected
    transient is absorbed; profile lands on disk.
  * write_persisted_locked_absorbs_burst_of_transients — 5 injected
    failures (one below the 6-attempt budget) still resolve to a
    successful write, covering the common multi-tick AV hold.
  * write_persisted_locked_exhausts_retries_on_persistent_transient —
    sustained failures beyond the budget still surface as Err with the
    outer with_context preserved, so genuinely-unrecoverable failures
    remain honest signal in Sentry instead of being silently swallowed.
@oxoxDev oxoxDev requested a review from a team June 4, 2026 12:09
Contributor

@M3gA-Mind M3gA-Mind left a comment

Solid, well-reasoned root-cause fix with correct retry semantics (verified it returns immediately on non-transient errors and only retries the Windows transient family) and good Sentry-fingerprint discipline. One substantive item to resolve before merge — the rename retry path, which is the headline of the PR, is never actually exercised by the tests — plus a small comment-accuracy fix. Inline below.

PERSIST_RETRY_BASE_MS,
|| {
self.consume_test_transient_failure()?;
fs::rename(&tmp_path, &self.path).context("rename auth profile tmp -> store")
Contributor

The rename retry path — the headline of this PR — is never exercised by any test. force_transient_failures is a single counter shared by both retry stages, and the write stage above always runs first and drains it. In all three tests the queue hits 0 before fs::rename is reached, so this closure's consume_test_transient_failure() always returns Ok and the rename retry loop never iterates:

  • one-shot (queue 1): consumed by write attempt 1, write succeeds attempt 2, rename runs clean.
  • burst (queue 5): consumed by write attempts 1–5, write succeeds attempt 6, rename runs clean.
  • exhaust (queue 6): write stage exhausts and returns Err before rename is reached (its assertion is "Failed to write temporary auth profile file" — write-stage only).

So the rename wrapper has line coverage (it runs once, successfully) but no test proves it actually retries — which is exactly the "missing partial" the PR is about. A future refactor could delete this retry_with_backoff and every test would stay green. Recommend per-stage injection (separate counters or a stage tag) plus two tests: a rename-only transient that's absorbed, and a rename-stage exhaustion asserting the "Failed to replace auth profile store" outer context.

Contributor Author

Addressed in 3c549a7 (tests) + 6b0f998 (impl) via #3398. Split the failure-injection counter into per-stage force_next_write_failures / force_next_rename_failures, plus two new tests: rename_stage_retries_one_shot_transient and rename_stage_exhausts_retries_and_cleans_up_tmp — the second now asserts the rename outer with_context("Failed to replace auth profile store …") is preserved, which is exactly the unreachable line you flagged.
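The per-stage split discussed in this thread can be illustrated with a small model. Everything here is hypothetical (Stage, Injector, and run_stage are not names from the PR, and there is no sleeping or file I/O); it only shows why separate counters let a test reach the rename loop, which a single shared counter drained by the write stage cannot.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Stage {
    Write,
    Rename,
}

/// Hypothetical per-stage failure queue, one counter per retried stage.
#[derive(Default)]
struct Injector {
    write_failures: Cell<u32>,
    rename_failures: Cell<u32>,
}

impl Injector {
    /// Fail while the counter for this stage is non-zero, then succeed.
    fn consume(&self, stage: Stage) -> Result<(), String> {
        let slot = match stage {
            Stage::Write => &self.write_failures,
            Stage::Rename => &self.rename_failures,
        };
        if slot.get() == 0 {
            return Ok(());
        }
        slot.set(slot.get() - 1);
        Err(format!("__TEST_TRANSIENT__ ({stage:?})"))
    }
}

/// Model of one retried stage: returns the attempt number that succeeded,
/// or the last error once `attempts` is exhausted.
fn run_stage(inj: &Injector, stage: Stage, attempts: u32) -> Result<u32, String> {
    let mut last = String::new();
    for attempt in 1..=attempts {
        match inj.consume(stage) {
            Ok(()) => return Ok(attempt),
            Err(e) => last = e,
        }
    }
    Err(last)
}
```

With one failure queued for Stage::Rename, the write stage succeeds on its first attempt and the rename stage on its second, so a test can observe the rename loop iterating. Queuing six rename failures against a six-attempt budget exhausts that stage alone.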

/// Retry budget for the JSON write + rename in `write_persisted_locked`.
/// Same shape as the lock-create call at the bottom of `acquire_lock` (which
/// is what closed Sentry OPENHUMAN-TAURI-H1 / H8 in #1641 / #2085). 6 attempts
/// at base 100ms doubles up to ~6.3s worst-case before surfacing. Sized to
Contributor

Backoff math is off. With attempts = 6, retry_with_backoff sleeps only 5 times (the last failure breaks without sleeping): 100+200+400+800+1600 ≈ 3.1s per stage, not ~6.3s. The ~6.3s figure is actually the combined worst case of the two sequential stages (write ≈3.1s + rename ≈3.1s). The conclusion still holds — combined ≈6.2s is far inside LOCK_TIMEOUT_MS (30_000 + 5_000 = 35s) — but since this reasoning is load-bearing, suggest rewording to "≈3.1s per stage, ≈6.2s across both write+rename."
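The arithmetic in this comment can be checked with a few lines. The helper below is a sketch that assumes the schedule the reviewer describes: the delay starts at the base, doubles after each failed attempt, and the final failed attempt returns without sleeping. The function name is hypothetical.

```rust
/// Total sleep in milliseconds for one retry stage, assuming a doubling delay
/// and no sleep after the last failed attempt.
fn worst_case_sleep_ms(attempts: u32, base_ms: u64) -> u64 {
    // attempts - 1 sleeps: base, 2*base, 4*base, ...
    (0..attempts.saturating_sub(1)).map(|i| base_ms << i).sum()
}
```

For 6 attempts at a 100 ms base this gives 100 + 200 + 400 + 800 + 1600 = 3100 ms per stage, so 6200 ms across the write and rename stages, well under the 35 000 ms LOCK_TIMEOUT_MS cited here.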

Contributor Author

Addressed in 6b0f998 via #3398. Reworded to "≈3.1s per stage, ≈6.2s across both write+rename" and explicit cite of LOCK_TIMEOUT_MS = 35_000 as the safety margin.

})?;

fs::rename(&tmp_path, &self.path).with_context(|| {
retry_with_backoff(
Contributor

Nit (pre-existing, optional): tmp_name is …tmp.{pid}.{nanos} — unique per call — so a permanently-failing fs::rename orphans a distinct tmp file every call. With the ~2s app_state_snapshot poll, sustained failures accumulate orphan tmps. Not introduced here (the original also left a tmp on rename failure), but the retry lengthens each failing call and the failure window, so the accumulation is a bit more pronounced. A best-effort let _ = fs::remove_file(&tmp_path); on rename-exhaustion would keep the directory clean. Low priority.

Contributor Author

Addressed in 6b0f998 via #3398. Added best-effort fs::remove_file(&tmp_path) after the rename retry exhausts; rename_stage_exhausts_retries_and_cleans_up_tmp asserts no orphan auth-profiles.json.tmp.* remains. Result deliberately dropped — same AV that blocked the rename may block the unlink.
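A rough sketch of the best-effort cleanup described in this thread. The function name and the closure parameter are assumptions made so the example can be exercised without a real antivirus hold; in the PR the rename runs inside retry_with_backoff and the error carries an outer with_context.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical wrapper: run the (already retried) rename, and if it still
/// fails, try to remove the uniquely named tmp file so failed calls do not
/// accumulate orphans.
fn replace_or_cleanup(
    tmp: &Path,
    dest: &Path,
    rename_with_retry: impl FnOnce(&Path, &Path) -> io::Result<()>,
) -> io::Result<()> {
    let result = rename_with_retry(tmp, dest);
    if result.is_err() {
        // Best effort only: whatever blocked the rename may also block the
        // unlink, so the outcome of the removal is deliberately ignored.
        let _ = fs::remove_file(tmp);
    }
    // The original rename error is what the caller sees.
    result
}
```

Returning the rename error rather than any removal error keeps the reported failure tied to the operation that actually mattered.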

@senamakel senamakel merged commit 81e527e into tinyhumansai:main Jun 4, 2026
19 checks passed
Development

Successfully merging this pull request may close these issues.

fix(credentials): retry transient Windows FS errors when replacing auth-profiles.json (TAURI-RUST-92J)

3 participants