fix(credentials): retry transient Windows FS errors when persisting auth-profiles.json (#3355)#3364
Conversation
…uth-profiles.json (tinyhumansai#3355) Wrap fs::write(tmp) + fs::rename(tmp -> auth-profiles.json) in retry_with_backoff so transient Windows AV / Search-Indexer / Defender handles on the destination (ERROR_SHARING_VIOLATION 32, ERROR_ACCESS_DENIED 5, ERROR_DELETE_PENDING 303) absorb the same way the sibling auth-profiles.lock create at acquire_lock already does (tinyhumansai#1641 / tinyhumansai#2085). Same 6-attempt / 100ms-base budget so we stay inside LOCK_TIMEOUT_MS and concurrent acquire_lock callers never starve. Outer with_context preserved so the Sentry fingerprint shape stays stable. cfg(test)-only failure-injection counter + consume_test_transient_failure helper expose the retry path to regression coverage on non-Windows hosts via the __TEST_TRANSIENT__ sentinel that is_transient_fs_error already recognises (src/openhuman/util.rs:618). Zero production surface. Closes tinyhumansai#3355 Sentry-Issue: TAURI-RUST-92J
…nyhumansai#3355) Three new regression tests pinned to the Sentry TAURI-RUST-92J root cause: * write_persisted_locked_retries_one_shot_transient — single injected transient is absorbed; profile lands on disk. * write_persisted_locked_absorbs_burst_of_transients — 5 injected failures (one below the 6-attempt budget) still resolve to a successful write, covering the common multi-tick AV hold. * write_persisted_locked_exhausts_retries_on_persistent_transient — sustained failures beyond the budget still surface as Err with the outer with_context preserved, so genuinely-unrecoverable failures remain honest signal in Sentry instead of being silently swallowed.
|
Warning Review limit reached
More reviews will be available in 42 minutes and 12 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
Comment |
M3gA-Mind
left a comment
There was a problem hiding this comment.
Solid, well-reasoned root-cause fix with correct retry semantics (verified it returns immediately on non-transient errors and only retries the Windows transient family) and good Sentry-fingerprint discipline. One substantive item to resolve before merge — the rename retry path, which is the headline of the PR, is never actually exercised by the tests — plus a small comment-accuracy fix. Inline below.
| PERSIST_RETRY_BASE_MS, | ||
| || { | ||
| self.consume_test_transient_failure()?; | ||
| fs::rename(&tmp_path, &self.path).context("rename auth profile tmp -> store") |
There was a problem hiding this comment.
The rename retry path — the headline of this PR — is never exercised by any test. force_transient_failures is a single counter shared by both retry stages, and the write stage above always runs first and drains it. In all three tests the queue hits 0 before fs::rename is reached, so this closure's consume_test_transient_failure() always returns Ok and the rename retry loop never iterates:
- one-shot (queue 1): consumed by write attempt 1, write succeeds attempt 2, rename runs clean.
- burst (queue 5): consumed by write attempts 1–5, write succeeds attempt 6, rename runs clean.
- exhaust (queue 6): write stage exhausts and returns
Errbefore rename is reached (its assertion is"Failed to write temporary auth profile file"— write-stage only).
So the rename wrapper has line coverage (it runs once, successfully) but no test proves it actually retries — which is exactly the "missing partial" the PR is about. A future refactor could delete this retry_with_backoff and every test would stay green. Recommend per-stage injection (separate counters or a stage tag) plus two tests: a rename-only transient that's absorbed, and a rename-stage exhaustion asserting the "Failed to replace auth profile store" outer context.
There was a problem hiding this comment.
Addressed in 3c549a7 (tests) + 6b0f998 (impl) via #3398. Split the failure-injection counter into per-stage force_next_write_failures / force_next_rename_failures, plus two new tests: rename_stage_retries_one_shot_transient and rename_stage_exhausts_retries_and_cleans_up_tmp — the second now asserts the rename outer with_context("Failed to replace auth profile store …") is preserved, which is exactly the unreachable line you flagged.
| /// Retry budget for the JSON write + rename in `write_persisted_locked`. | ||
| /// Same shape as the lock-create call at the bottom of `acquire_lock` (which | ||
| /// is what closed Sentry OPENHUMAN-TAURI-H1 / H8 in #1641 / #2085). 6 attempts | ||
| /// at base 100ms doubles up to ~6.3s worst-case before surfacing. Sized to |
There was a problem hiding this comment.
Backoff math is off. With attempts = 6, retry_with_backoff sleeps only 5 times (the last failure breaks without sleeping): 100+200+400+800+1600 ≈ 3.1s per stage, not ~6.3s. The ~6.3s figure is actually the combined worst case of the two sequential stages (write ≈3.1s + rename ≈3.1s). The conclusion still holds — combined ≈6.2s is far inside LOCK_TIMEOUT_MS (30_000 + 5_000 = 35s) — but since this reasoning is load-bearing, suggest rewording to "≈3.1s per stage, ≈6.2s across both write+rename."
| })?; | ||
|
|
||
| fs::rename(&tmp_path, &self.path).with_context(|| { | ||
| retry_with_backoff( |
There was a problem hiding this comment.
Nit (pre-existing, optional): tmp_name is …tmp.{pid}.{nanos} — unique per call — so a permanently-failing fs::rename orphans a distinct tmp file every call. With the ~2s app_state_snapshot poll, sustained failures accumulate orphan tmps. Not introduced here (the original also left a tmp on rename failure), but the retry lengthens each failing call and the failure window, so the accumulation is a bit more pronounced. A best-effort let _ = fs::remove_file(&tmp_path); on rename-exhaustion would keep the directory clean. Low priority.
There was a problem hiding this comment.
Summary
fs::write(tmp)+fs::rename(tmp → auth-profiles.json)calls inwrite_persisted_lockedwithretry_with_backoff, matching the auth-profiles.lock-create retry budget that already closed the sibling Windows transient-FS bugsOPENHUMAN-TAURI-H1/H8in fix(windows): retry-with-backoff for transient FS errors on auth-profiles.lock + .openhuman wipe (#9E, #9C, #4Y, #61, #5Q, #9F, #4M) #1641 / fix(credentials): recover from leaked auth-profile lock on Windows (Sentry OPENHUMAN-TAURI-H1) #2085.#[cfg(test)]-only__TEST_TRANSIENT__injection point that the existingis_transient_fs_errorclassifier already recognises (src/openhuman/util.rs:618).TAURI-RUST-92J, 10,158 events / 24h, single Windows 11 24H2 user, releaseopenhuman@0.56.0+e8968077aeb5.Problem
AuthProfilesStore::write_persisted_locked(src/openhuman/credentials/profiles.rs:911-946) serialised the persisted JSON to a tmp file and renamed it ontoauth-profiles.jsonvia rawstd::fs::write+std::fs::rename. On Windows, AV / Search-Indexer / Defender briefly holding the destination handle returnsERROR_SHARING_VIOLATION (32),ERROR_ACCESS_DENIED (5), orERROR_DELETE_PENDING (303)— exactly the familycrate::openhuman::util::retry_with_backoffis built to absorb. The sibling.lock-create atprofiles.rs:987was already routed through this helper (PR #1641 / #2085 / #2180); the.jsonwrite+rename path was the missing partial.Result: any single transient AV hold flipped a normally-retry-safe operation into a permanent error. Compounding it,
load_locked(profiles.rs:744) callswrite_persisted_lockedwhenever its in-memory purge set is non-empty (#3125-class drops, legacykindvalues per#2439, decrypt-failure recovery), and the frontend pollsopenhuman.app_state_snapshotevery ~2s — so one persistent AV hold amplified into 10k+ identical Sentry events in 24h onTAURI-RUST-92J.Solution
write_persisted_lockednow wraps both the tmp write and the rename inretry_with_backoff("…", PERSIST_RETRY_ATTEMPTS, PERSIST_RETRY_BASE_MS, …). Constants are set to 6 attempts at base 100ms (worst-case ≈ 6.3 s), matching the proven.lock-create budget and staying well insideLOCK_TIMEOUT_MSso concurrentacquire_lockcallers never starve behind a single retry-loop owner. Outerwith_contextis preserved on both calls so the Sentry fingerprint shape stays stable across releases.This is the real fix to the underlying file-system race — not noise suppression. Genuinely unrecoverable failures still propagate via
Errand reach the RPC error path, so a user whose AV is permanently hostile remains visible in Sentry as honest signal.Regression coverage uses a
#[cfg(test)]-only injection counter onAuthProfilesStore(force_next_transient_failures+consume_test_transient_failure) that returns an error chain containing__TEST_TRANSIENT__— the test sentinel thatis_transient_fs_erroralready accepts. Three tests exercise the retry semantics:Errwith the outerwith_contextpreserved.Production binaries carry zero new surface area from the injection helpers (gated behind
#[cfg(test)]).Submission Checklist
src/openhuman/credentials/profiles_tests.rsexercising one-shot retry, retry-within-budget burst, and retry-exhaustion behaviour.cargo test --lib openhuman::credentials::profilesis green locally (38 passed, 0 failed).cargo check --manifest-path Cargo.tomlclean on the changed crate.cargo fmtclean on the changed files.cargo clippy -D warnings— pre-existing lints elsewhere in the crate (useless_vecintokenjuice/text/process.rs,map_orsimplifications in this file at lines 1106 / 1205 / 1211, assertion-on-constant inprofiles_tests.rsat 600 / 619 / 623) are unchanged by this PR and untouched onmain.gitbooks— no architecture or contributor-facing behaviour change (internal retry semantics on an existing private method).Impact
auth-profiles.jsonno longer seeapp_state_snapshotfail on every poll once a profile gets dropped — the write retries and lands on disk as soon as the handle releases.TAURI-RUST-92Jevent volume drops sharply once the next release ships; the issue stays open until clean so any genuinely sustained AV interference still surfaces.is_transient_fs_erroronly matches Windows raw OS codes (5 / 32 / 33 / 303 / 1224), so non-Windows callers see identical first-attempt semantics.Related
.lock-create transient-FS retry — same helper, same Windows error family).kindvalue tolerance), fix(auth): gracefully drop OAuth profiles with missing access_token #3125 / fix(oauth): reject persisted profiles without access tokens #3180 (OAuth missingaccess_tokendrop).