fix(credentials): retry transient Windows FS errors when persisting auth-profiles.json (#3355) by oxoxDev · Pull Request #3364 · tinyhumansai/openhuman

oxoxDev · 2026-06-04T12:09:04Z

Summary

Wrap the JSON fs::write(tmp) + fs::rename(tmp → auth-profiles.json) calls in write_persisted_locked with retry_with_backoff, matching the auth-profiles .lock-create retry budget that already closed the sibling Windows transient-FS bugs OPENHUMAN-TAURI-H1 / H8 in #1641 (fix(windows): retry-with-backoff for transient FS errors on auth-profiles.lock + .openhuman wipe (#9E, #9C, #4Y, #61, #5Q, #9F, #4M)) and #2085 (fix(credentials): recover from leaked auth-profile lock on Windows (Sentry OPENHUMAN-TAURI-H1)).
Add 3 regression tests pinned to the failure mode via a #[cfg(test)]-only __TEST_TRANSIENT__ injection point that the existing is_transient_fs_error classifier already recognises (src/openhuman/util.rs:618).
Closes #3355 (fix(credentials): retry transient Windows FS errors when replacing auth-profiles.json (TAURI-RUST-92J)). Sentry TAURI-RUST-92J: 10,158 events / 24h, single Windows 11 24H2 user, release openhuman@0.56.0+e8968077aeb5.

Problem

AuthProfilesStore::write_persisted_locked (src/openhuman/credentials/profiles.rs:911-946) serialised the persisted JSON to a tmp file and renamed it onto auth-profiles.json via raw std::fs::write + std::fs::rename. On Windows, AV / Search-Indexer / Defender briefly holding the destination handle returns ERROR_SHARING_VIOLATION (32), ERROR_ACCESS_DENIED (5), or ERROR_DELETE_PENDING (303) — exactly the family crate::openhuman::util::retry_with_backoff is built to absorb. The sibling .lock-create at profiles.rs:987 was already routed through this helper (PR #1641 / #2085 / #2180); the .json write+rename path was the missing partial.

Result: any single transient AV hold flipped a normally-retry-safe operation into a permanent error. Compounding it, load_locked (profiles.rs:744) calls write_persisted_locked whenever its in-memory purge set is non-empty (#3125-class drops, legacy kind values per #2439, decrypt-failure recovery), and the frontend polls openhuman.app_state_snapshot every ~2s — so one persistent AV hold amplified into 10k+ identical Sentry events in 24h on TAURI-RUST-92J.

Solution

write_persisted_locked now wraps both the tmp write and the rename in retry_with_backoff("…", PERSIST_RETRY_ATTEMPTS, PERSIST_RETRY_BASE_MS, …). Constants are set to 6 attempts at base 100ms (worst case ≈ 3.1 s of backoff per stage, ≈ 6.2 s across write + rename), matching the proven .lock-create budget and staying well inside LOCK_TIMEOUT_MS so concurrent acquire_lock callers never starve behind a single retry-loop owner. Outer with_context is preserved on both calls so the Sentry fingerprint shape stays stable across releases.
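For readers without the helper open, here is a minimal sketch of the retry shape described above. It is not the code in src/openhuman/util.rs: the name retry_with_backoff_sketch, the parameter order, the use of std::io::Error instead of the crate's error type, and the caller-supplied classifier are all assumptions made for illustration.

```rust
use std::io;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for the crate's retry helper (signature assumed).
/// Runs `op` up to `attempts` times. A transient failure sleeps, doubles the
/// delay, and tries again; a non-transient failure returns immediately; the
/// final failed attempt returns without sleeping.
fn retry_with_backoff_sketch<T>(
    attempts: u32,
    base_ms: u64,
    is_transient: impl Fn(&io::Error) -> bool,
    mut op: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut delay_ms = base_ms;
    let mut attempt = 1u32;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Retry only while budget remains and the error is classified transient.
            Err(err) if attempt < attempts && is_transient(&err) => {
                thread::sleep(Duration::from_millis(delay_ms));
                delay_ms = delay_ms.saturating_mul(2);
                attempt += 1;
            }
            // Budget exhausted or a non-transient error: surface it to the caller.
            Err(err) => return Err(err),
        }
    }
}
```

Under these assumptions, each of the two stages (tmp write, then rename) would be its own call to the helper, with the outer context attached to whatever the call returns.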

This is the real fix to the underlying file-system race — not noise suppression. Genuinely unrecoverable failures still propagate via Err and reach the RPC error path, so a user whose AV is permanently hostile remains visible in Sentry as honest signal.

Regression coverage uses a #[cfg(test)]-only injection counter on AuthProfilesStore (force_next_transient_failures + consume_test_transient_failure) that returns an error chain containing __TEST_TRANSIENT__ — the test sentinel that is_transient_fs_error already accepts. Three tests exercise the retry semantics:

one-shot transient is absorbed and the profile lands on disk,
a burst of transients within the retry budget still resolves to a successful write,
sustained failures beyond the budget still surface as Err with the outer with_context preserved.

Production binaries carry zero new surface area from the injection helpers (gated behind #[cfg(test)]).
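The injection counter described above can be pictured roughly as follows. This is a sketch only: FailureInjector, force_next, and consume are hypothetical names standing in for force_next_transient_failures and consume_test_transient_failure, std::io::Error stands in for the crate's error chain, and the #[cfg(test)] gate from the PR is omitted so the snippet compiles on its own.

```rust
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Sentinel string the PR says the classifier already accepts.
const TEST_SENTINEL: &str = "__TEST_TRANSIENT__";

/// Hypothetical stand-in for the injection state kept on the store.
struct FailureInjector {
    remaining: AtomicUsize,
}

impl FailureInjector {
    fn new() -> Self {
        Self { remaining: AtomicUsize::new(0) }
    }

    /// Queue `n` injected failures for the next calls to `consume`.
    fn force_next(&self, n: usize) {
        self.remaining.store(n, Ordering::SeqCst);
    }

    /// Returns Err (carrying the sentinel) while queued failures remain,
    /// taking one off the queue each time; returns Ok once the queue is empty.
    fn consume(&self) -> io::Result<()> {
        let took_one = self
            .remaining
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |n| n.checked_sub(1))
            .is_ok();
        if took_one {
            Err(io::Error::new(io::ErrorKind::Other, TEST_SENTINEL))
        } else {
            Ok(())
        }
    }
}
```

Calling consume at the top of each retried closure lets a test on a non-Windows host drive the retry loop without a real sharing violation.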

Submission Checklist

N/A: no UI changes (Rust-only fix; no i18n keys touched, no React surface)
Tests added — three regression tests in src/openhuman/credentials/profiles_tests.rs exercising one-shot retry, retry-within-budget burst, and retry-exhaustion behaviour. cargo test --lib openhuman::credentials::profiles is green locally (38 passed, 0 failed).
cargo check --manifest-path Cargo.toml clean on the changed crate.
cargo fmt clean on the changed files.
N/A: cargo clippy -D warnings — pre-existing lints elsewhere in the crate (useless_vec in tokenjuice/text/process.rs, map_or simplifications in this file at lines 1106 / 1205 / 1211, assertion-on-constant in profiles_tests.rs at 600 / 619 / 623) are unchanged by this PR and untouched on main.
N/A: rustdoc / gitbooks — no architecture or contributor-facing behaviour change (internal retry semantics on an existing private method).
N/A: feature flag / capability catalog — no user-visible feature added, removed, or renamed.

Impact

Windows users whose AV / Search-Indexer / Defender intermittently held a handle on auth-profiles.json no longer see app_state_snapshot fail on every poll once a profile gets dropped — the write retries and lands on disk as soon as the handle releases.
Sentry TAURI-RUST-92J event volume drops sharply once the next release ships; the issue stays open until clean so any genuinely sustained AV interference still surfaces.
No behaviour change on macOS / Linux: is_transient_fs_error only matches Windows raw OS codes (5 / 32 / 33 / 303 / 1224), so non-Windows callers see identical first-attempt semantics.
No schema change, no migration, no RPC surface change, no public API surface change.
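A sketch of what a raw-OS-code classifier for the codes listed above could look like. It is not the is_transient_fs_error implementation: the function name, the cfg!(windows) gate, and the use of std::io::Error are assumptions, and the __TEST_TRANSIENT__ sentinel branch is left out.

```rust
use std::io;

/// Raw Windows error codes listed in the PR description:
/// 5 ERROR_ACCESS_DENIED, 32 ERROR_SHARING_VIOLATION, 33 ERROR_LOCK_VIOLATION,
/// 303 ERROR_DELETE_PENDING, 1224 ERROR_USER_MAPPED_FILE.
const TRANSIENT_WINDOWS_CODES: [i32; 5] = [5, 32, 33, 303, 1224];

/// Hypothetical classifier. The cfg!(windows) gate is an assumption about how
/// the match is limited to Windows; on other targets the same numbers mean
/// unrelated errno values, so this sketch never reports them as transient.
fn is_transient_code(err: &io::Error) -> bool {
    cfg!(windows)
        && err
            .raw_os_error()
            .map_or(false, |code| TRANSIENT_WINDOWS_CODES.contains(&code))
}
```

With a gate like this, a non-Windows caller fails on the first attempt exactly as before, which is the behaviour the Impact notes describe.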

Related

Closes #3355 (fix(credentials): retry transient Windows FS errors when replacing auth-profiles.json (TAURI-RUST-92J))
Sentry-Issue: TAURI-RUST-92J
Sibling retries on the same store (.lock-create transient-FS retry; same helper, same Windows error family): #1641 (fix(windows): retry-with-backoff for transient FS errors on auth-profiles.lock + .openhuman wipe (#9E, #9C, #4Y, #61, #5Q, #9F, #4M)), #2085 (fix(credentials): recover from leaked auth-profile lock on Windows (Sentry OPENHUMAN-TAURI-H1)), #2180 (fix(credentials): diagnose + recover from H8 auth-profile-lock create failures).
Related load-driven drop paths whose persist this fix now covers: #2439 (fix(auth-profiles): tolerate legacy kind values on load), #3125 (fix(auth): gracefully drop OAuth profiles with missing access_token) / #3180 (fix(oauth): reject persisted profiles without access tokens).

…uth-profiles.json (tinyhumansai#3355)

Wrap fs::write(tmp) + fs::rename(tmp -> auth-profiles.json) in retry_with_backoff so transient Windows AV / Search-Indexer / Defender handles on the destination (ERROR_SHARING_VIOLATION 32, ERROR_ACCESS_DENIED 5, ERROR_DELETE_PENDING 303) absorb the same way the sibling auth-profiles.lock create at acquire_lock already does (tinyhumansai#1641 / tinyhumansai#2085). Same 6-attempt / 100ms-base budget so we stay inside LOCK_TIMEOUT_MS and concurrent acquire_lock callers never starve. Outer with_context preserved so the Sentry fingerprint shape stays stable.

cfg(test)-only failure-injection counter + consume_test_transient_failure helper expose the retry path to regression coverage on non-Windows hosts via the __TEST_TRANSIENT__ sentinel that is_transient_fs_error already recognises (src/openhuman/util.rs:618). Zero production surface.

Closes tinyhumansai#3355
Sentry-Issue: TAURI-RUST-92J

…nyhumansai#3355)

Three new regression tests pinned to the Sentry TAURI-RUST-92J root cause:

* write_persisted_locked_retries_one_shot_transient: single injected transient is absorbed; profile lands on disk.
* write_persisted_locked_absorbs_burst_of_transients: 5 injected failures (one below the 6-attempt budget) still resolve to a successful write, covering the common multi-tick AV hold.
* write_persisted_locked_exhausts_retries_on_persistent_transient: sustained failures beyond the budget still surface as Err with the outer with_context preserved, so genuinely-unrecoverable failures remain honest signal in Sentry instead of being silently swallowed.

coderabbitai · 2026-06-04T12:09:13Z

Warning

Review limit reached

@oxoxDev, we couldn't start this review because you've reached your PR review rate limit.


M3gA-Mind

Solid, well-reasoned root-cause fix with correct retry semantics (verified it returns immediately on non-transient errors and only retries the Windows transient family) and good Sentry-fingerprint discipline. One substantive item to resolve before merge — the rename retry path, which is the headline of the PR, is never actually exercised by the tests — plus a small comment-accuracy fix. Inline below.

M3gA-Mind · 2026-06-04T13:30:41Z

+            PERSIST_RETRY_BASE_MS,
+            || {
+                self.consume_test_transient_failure()?;
+                fs::rename(&tmp_path, &self.path).context("rename auth profile tmp -> store")


The rename retry path — the headline of this PR — is never exercised by any test. force_transient_failures is a single counter shared by both retry stages, and the write stage above always runs first and drains it. In all three tests the queue hits 0 before fs::rename is reached, so this closure's consume_test_transient_failure() always returns Ok and the rename retry loop never iterates:

one-shot (queue 1): consumed by write attempt 1, write succeeds attempt 2, rename runs clean.

burst (queue 5): consumed by write attempts 1–5, write succeeds attempt 6, rename runs clean.

exhaust (queue 6): write stage exhausts and returns Err before rename is reached (its assertion is "Failed to write temporary auth profile file" — write-stage only).

So the rename wrapper has line coverage (it runs once, successfully) but no test proves it actually retries — which is exactly the "missing partial" the PR is about. A future refactor could delete this retry_with_backoff and every test would stay green. Recommend per-stage injection (separate counters or a stage tag) plus two tests: a rename-only transient that's absorbed, and a rename-stage exhaustion asserting the "Failed to replace auth profile store" outer context.
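The draining argument above can be checked with a small model. This is an illustrative simulation, not the store's code: run_stage and simulate are hypothetical names, and each retry stage is reduced to "consume one queued failure per attempt, succeed once the queue is empty".

```rust
/// Runs one retry stage against the shared failure queue.
/// Returns (attempts made, whether the stage succeeded).
fn run_stage(queue: &mut u32, budget: u32) -> (u32, bool) {
    for attempt in 1..=budget {
        if *queue == 0 {
            // No injected failure left, so the real operation runs and succeeds.
            return (attempt, true);
        }
        // This attempt consumed one injected failure and will be retried.
        *queue -= 1;
    }
    (budget, false)
}

/// Models write-then-rename sharing ONE injection counter.
/// Returns (write attempts, rename attempts).
fn simulate(queued_failures: u32, budget: u32) -> (u32, u32) {
    let mut queue = queued_failures;
    let (write_attempts, write_ok) = run_stage(&mut queue, budget);
    if !write_ok {
        // Write stage exhausted its budget: rename is never reached.
        return (write_attempts, 0);
    }
    let (rename_attempts, _rename_ok) = run_stage(&mut queue, budget);
    (write_attempts, rename_attempts)
}
```

With a budget of 6, simulate(1, 6) gives (2, 1), simulate(5, 6) gives (6, 1), and simulate(6, 6) gives (6, 0): in every case the rename stage makes at most one attempt, so its retry loop never iterates.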

oxoxDev · 2026-06-05

Addressed in 3c549a7 (tests) + 6b0f998 (impl) via #3398. Split the failure-injection counter into per-stage force_next_write_failures / force_next_rename_failures, plus two new tests: rename_stage_retries_one_shot_transient and rename_stage_exhausts_retries_and_cleans_up_tmp. The second now asserts that the rename outer with_context("Failed to replace auth profile store …") is preserved, which is exactly the unreachable line you flagged.
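A rough picture of the per-stage split, for comparison with the single shared counter. The type and method names here (Stage, StageInjector, consume) are hypothetical; only the idea of separate write and rename queues comes from the follow-up in #3398.

```rust
use std::cell::Cell;

/// Which retry stage is asking for an injected failure.
#[derive(Clone, Copy)]
enum Stage {
    Write,
    Rename,
}

/// Hypothetical per-stage injector: each stage drains only its own queue.
struct StageInjector {
    write: Cell<u32>,
    rename: Cell<u32>,
}

impl StageInjector {
    fn new(write_failures: u32, rename_failures: u32) -> Self {
        Self {
            write: Cell::new(write_failures),
            rename: Cell::new(rename_failures),
        }
    }

    /// Fails with the test sentinel while the given stage has queued failures.
    fn consume(&self, stage: Stage) -> Result<(), &'static str> {
        let slot = match stage {
            Stage::Write => &self.write,
            Stage::Rename => &self.rename,
        };
        if slot.get() == 0 {
            return Ok(());
        }
        slot.set(slot.get() - 1);
        Err("__TEST_TRANSIENT__")
    }
}
```

Because the write stage can no longer drain failures meant for the rename stage, a test can queue a rename-only transient and observe the rename loop retry.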

M3gA-Mind · 2026-06-04T13:30:41Z

+/// Retry budget for the JSON write + rename in `write_persisted_locked`.
+/// Same shape as the lock-create call at the bottom of `acquire_lock` (which
+/// is what closed Sentry OPENHUMAN-TAURI-H1 / H8 in #1641 / #2085). 6 attempts
+/// at base 100ms doubles up to ~6.3s worst-case before surfacing. Sized to


Backoff math is off. With attempts = 6, retry_with_backoff sleeps only 5 times (the last failure breaks without sleeping): 100+200+400+800+1600 ≈ 3.1s per stage, not ~6.3s. The ~6.3s figure is actually the combined worst case of the two sequential stages (write ≈3.1s + rename ≈3.1s). The conclusion still holds — combined ≈6.2s is far inside LOCK_TIMEOUT_MS (30_000 + 5_000 = 35s) — but since this reasoning is load-bearing, suggest rewording to "≈3.1s per stage, ≈6.2s across both write+rename."
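The arithmetic can be written out as a short function. It assumes what the comment above states: the delay doubles after each failed attempt, there is no jitter, and the last attempt returns without sleeping. The function name is hypothetical.

```rust
/// Total backoff sleep in milliseconds for one retry stage, assuming the
/// delay doubles after every failed attempt and the final attempt does not
/// sleep (so `attempts` attempts produce `attempts - 1` sleeps).
fn worst_case_sleep_ms(attempts: u32, base_ms: u64) -> u64 {
    (0..attempts.saturating_sub(1)).map(|i| base_ms << i).sum()
}
```

For 6 attempts at a 100 ms base this is 100 + 200 + 400 + 800 + 1600 = 3100 ms per stage, or 6200 ms across the write and rename stages, against a LOCK_TIMEOUT_MS of 35 000 ms.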

oxoxDev · 2026-06-05

Addressed in 6b0f998 via #3398. Reworded to "≈3.1s per stage, ≈6.2s across both write+rename" and added an explicit citation of LOCK_TIMEOUT_MS = 35_000 as the safety margin.

M3gA-Mind · 2026-06-04T13:30:41Z

        })?;

-        fs::rename(&tmp_path, &self.path).with_context(|| {
+        retry_with_backoff(


Nit (pre-existing, optional): tmp_name is …tmp.{pid}.{nanos} — unique per call — so a permanently-failing fs::rename orphans a distinct tmp file every call. With the ~2s app_state_snapshot poll, sustained failures accumulate orphan tmps. Not introduced here (the original also left a tmp on rename failure), but the retry lengthens each failing call and the failure window, so the accumulation is a bit more pronounced. A best-effort let _ = fs::remove_file(&tmp_path); on rename-exhaustion would keep the directory clean. Low priority.

oxoxDev · 2026-06-05

Addressed in 6b0f998 via #3398. Added best-effort fs::remove_file(&tmp_path) after the rename retry exhausts; rename_stage_exhausts_retries_and_cleans_up_tmp asserts no orphan auth-profiles.json.tmp.* remains. The result is deliberately dropped: the same AV that blocked the rename may block the unlink.
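A minimal sketch of the best-effort cleanup, with the retry loop left out so that only the cleanup step is visible. The function name replace_or_cleanup is hypothetical and std::io::Error stands in for the crate's error type; the actual change is in 6b0f998.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical final step of the persist path: rename the tmp file onto the
/// store, and if that fails, try to remove the tmp file before returning the
/// original rename error.
fn replace_or_cleanup(tmp_path: &Path, dest: &Path) -> io::Result<()> {
    match fs::rename(tmp_path, dest) {
        Ok(()) => Ok(()),
        Err(rename_err) => {
            // Best-effort only: whatever blocked the rename may also block the
            // unlink, so the removal result is intentionally discarded.
            let _ = fs::remove_file(tmp_path);
            Err(rename_err)
        }
    }
}
```

Returning the rename error rather than any removal error keeps the failure the caller sees tied to the operation that actually mattered.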

oxoxDev added 2 commits June 4, 2026 17:38

oxoxDev requested a review from a team June 4, 2026 12:09

M3gA-Mind reviewed Jun 4, 2026

senamakel merged commit 81e527e into tinyhumansai:main Jun 4, 2026
19 checks passed

oxoxDev mentioned this pull request Jun 5, 2026

test(credentials): cover rename-stage retry + best-effort tmp cleanup (#3364 follow-up) #3398

Merged

3 tasks

senamakel pushed a commit that referenced this pull request Jun 5, 2026

test(credentials): cover rename-stage retry + best-effort tmp cleanup (#3364 follow-up) (#3398) · 6056ed9

senamakel merged 2 commits into tinyhumansai:main from oxoxDev:fix/3355-auth-profiles-rename-retry
