fix(windows): retry-with-backoff for transient FS errors on auth-profiles.lock + .openhuman wipe (#9E, #9C, #4Y, #61, #5Q, #9F, #4M)#1641
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (17)
📝 WalkthroughWalkthroughAdds synchronous and asynchronous exponential-backoff retry helpers with transient filesystem error classification; applies them to lock-file creation and moves RPC directory deletions into async context with retries for transient filesystem sharing/locking failures. ChangesRetry infrastructure and resilient filesystem operations
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/openhuman/util.rs (1)
321-355: ⚡ Quick winDemote retry-churn logs to
debug/trace.These failures are expected transient noise in the exact path this PR is trying to quiet. Logging every retry at
warn!and every recovery atinfo!will still spam normal logs; keep the final exhausted failure surfaced by callers, but log the per-retry churn here atdebug/traceinstead. As per coding guidelines, "Uselog/tracingatdebug/tracelevel; include stable grep-friendly prefixes ([domain],[rpc]) and correlation fields; never log secrets or full PII".Also applies to: 383-417
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/util.rs` around lines 321 - 355, The retry churn logs inside the retry loop should be lowered from warn/info to debug/trace: change the tracing::warn! that logs each failed attempt (includes fields op = op_name, attempt, max_attempts, error = %e, retry_in_ms) to tracing::debug! or tracing::trace!, and change the tracing::info! on successful recovery (which logs op_name and retries = i) to tracing::debug!/trace! as well; keep the same structured fields (op = op_name, retries, attempt, max_attempts, error = %e, retry_in_ms) and the "[util]" grep-friendly prefix, and leave terminal/exhausted failures returned to callers unchanged (code paths that set last_err/return Err(e) remain the same).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/util.rs`:
- Around line 307-365: Reject zero attempts before entering the loop: in
retry_with_backoff check if attempts == 0 and immediately return an
anyhow::Error (e.g. anyhow::anyhow!("{} requires attempts > 0", op_name) or
similar) instead of letting the loop skip and hit expect; apply the same
precondition check to the other retry helper in this file (the second retry
function around lines 368-427) so both public utilities return a proper error
for attempts == 0 rather than panicking.
---
Nitpick comments:
In `@src/openhuman/util.rs`:
- Around line 321-355: The retry churn logs inside the retry loop should be
lowered from warn/info to debug/trace: change the tracing::warn! that logs each
failed attempt (includes fields op = op_name, attempt, max_attempts, error = %e,
retry_in_ms) to tracing::debug! or tracing::trace!, and change the
tracing::info! on successful recovery (which logs op_name and retries = i) to
tracing::debug!/trace! as well; keep the same structured fields (op = op_name,
retries, attempt, max_attempts, error = %e, retry_in_ms) and the "[util]"
grep-friendly prefix, and leave terminal/exhausted failures returned to callers
unchanged (code paths that set last_err/return Err(e) remain the same).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2f8b6a5c-af34-4f5d-89af-f3061bb77bc3
📒 Files selected for processing (3)
src/openhuman/credentials/profiles.rssrc/openhuman/memory/tree/read_rpc.rssrc/openhuman/util.rs
…i#1641 CR) CodeRabbit major: both retry_with_backoff and retry_with_backoff_async skipped the loop body when attempts == 0 and panicked at `last_err.expect("attempts > 0")`. As public utilities, they should fail gracefully — surface anyhow::ensure! at the entry point so the caller gets a normal Err. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed in 378e633: both helpers now reject |
graycyrus
left a comment
There was a problem hiding this comment.
PR #1641 Review — fix(windows): retry-with-backoff for transient FS errors
Walkthrough
This PR surgically targets a class of Windows-specific cold-start failures where mandatory file locking by CEF subprocesses, AV scanners, or the Ollama binary causes one-shot filesystem operations to fail with ERROR_SHARING_VIOLATION (32), ERROR_ACCESS_DENIED (5), ERROR_LOCK_VIOLATION (33), or ERROR_USER_MAPPED_FILE (1224). The fix adds two generic retry helpers — a sync version (using std::thread::sleep) and an async version (using tokio::time::sleep) — to src/openhuman/util.rs, then wires them into the two affected code paths: lock-file creation in AuthProfilesStore::acquire_lock and filesystem tree removal in wipe_all_rpc / reset_tree_rpc.
The approach is well-scoped, the test coverage is solid, and the helper design is reusable for future FS-touching code.
Changes
| File | Summary |
|---|---|
src/openhuman/util.rs |
Adds retry_with_backoff (sync), retry_with_backoff_async (tokio), and is_transient_fs_error (Windows OS error codes 5/32/33/1224), plus 7 unit tests |
src/openhuman/credentials/profiles.rs |
acquire_lock wraps OpenOptions::create_new with retry_with_backoff; AlreadyExists detection via anyhow chain inspection |
src/openhuman/memory/tree/read_rpc.rs |
wipe_all_rpc and reset_tree_rpc split SQLite work from FS cleanup, use retry_with_backoff_async + tokio::fs::remove_dir_all |
Actionable comments posted inline (4)
- [minor]
util.rs— tracing macros duplicate every field in the message string - [minor]
read_rpc.rs:1359— unnecessary.clone()ondirs_removed - [minor]
util.rs— missing async non-transient short-circuit test - [minor]
read_rpc.rs— doc comment step numbering no longer matches code order
Nitpicks (2)
util.rs— Public functions defined after#[cfg(test)] mod testsblock. Convention is#[cfg(test)]at the bottom. Not a correctness issue but reads oddly top-to-bottom.util.rs:382—2u64.pow(i)is safe at current hardcoded params (attempts=6), but a saturating cap (e.g.sleep_ms.min(30_000)) would make the helper unconditionally safe for future callers.
Verified / looks good ✓
AlreadyExistschain walk inprofiles.rscorrectly pierces anyhow wrappingis_transient_fs_erroris#[cfg(windows)]-gated — POSIX behavior unchangedcfg!(test)gate for__TEST_TRANSIENT__is dead code in release builds- Sync
retry_with_backoffonly called fromacquire_lock(already inspawn_blockingcontext) NotFoundhandling inwipe_all_rpccorrectly treated as no-op- 7 unit tests cover happy path, retries, exhaustion, non-transient bail, and
attempts == 0guard
| i + 1, | ||
| attempts, | ||
| e, | ||
| sleep_ms |
There was a problem hiding this comment.
[minor] Both tracing::warn! and tracing::info! pass structured key-value fields (op, attempt, error, retry_in_ms) and then re-format all those same values into the message string. Structured log consumers (JSON formatter, Sentry, DataDog) will emit each field twice.
// before — fields duplicated in message
tracing::warn!(
op = op_name,
attempt = i + 1,
max_attempts = attempts,
error = %e,
retry_in_ms = sleep_ms,
"[util] {} failed (attempt {}/{}): {}. Retrying in {}ms...",
op_name, i + 1, attempts, e, sleep_ms
);
// after — message is a static label; readers query structured fields
tracing::warn!(
op = op_name,
attempt = i + 1,
max_attempts = attempts,
error = %e,
retry_in_ms = sleep_ms,
"[util] transient fs error, will retry"
);Same fix applies to the tracing::info! at the success-after-retry path, and to both copies in retry_with_backoff_async.
|
|
||
| let resp = WipeAllResponse { | ||
| rows_deleted, | ||
| dirs_removed: dirs_removed.clone(), |
There was a problem hiding this comment.
[minor] dirs_removed is a locally-owned Vec<String> that's moved into WipeAllResponse and never used again. The .clone() is unnecessary.
// before
dirs_removed: dirs_removed.clone(),
// after
dirs_removed,| let mut calls = 0; | ||
| let result = retry_with_backoff("zero_sync", 0, 1, || { | ||
| calls += 1; | ||
| Ok::<i32, anyhow::Error>(42) |
There was a problem hiding this comment.
[minor] There's test_retry_with_backoff_bail_on_non_transient for the sync helper but no equivalent for the async path. The async helper has the same if !is_transient_fs_error(&e) { return Err(e); } branch and it should be covered.
#[tokio::test]
async fn test_retry_with_backoff_async_bail_on_non_transient() {
use std::sync::atomic::{AtomicU32, Ordering};
let calls = AtomicU32::new(0);
let result = retry_with_backoff_async("test_async_non_transient", 3, 1, || async {
calls.fetch_add(1, Ordering::SeqCst);
anyhow::bail!("permanent error");
#[allow(unreachable_code)]
Ok::<i32, anyhow::Error>(0)
})
.await;
let err = result.unwrap_err();
assert_eq!(err.to_string(), "permanent error");
assert_eq!(calls.load(Ordering::SeqCst), 1);
}| Ok((chunks_requeued, jobs_enqueued)) | ||
| Ok(total) | ||
| })?; | ||
|
|
There was a problem hiding this comment.
[minor] The inline doc comments still describe the steps as 1 → 2 → 3 (truncate → remove wiki/summaries → re-enqueue chunks), but the refactored code runs them as 1 → 3 → 2 — the wiki/summaries removal now happens after chunk re-enqueue. The comment says "Step 3 — flip every chunk back…" but this block is now step 2 in execution order. Update the step numbering to reflect the actual new order, or future readers will be confused.
Add retry_with_backoff (sync) and retry_with_backoff_async (async) plus is_transient_fs_error classifier. Used by Windows-only paths where mandatory file locking trips ERROR_SHARING_VIOLATION (32), ERROR_ACCESS_DENIED (5), ERROR_LOCK_VIOLATION (33), and ERROR_USER_MAPPED_FILE (1224) on transient sub-second races between CEF subprocess, antivirus scanner, and our own profile cleanup. Sleep base_ms * 2^i between attempts; warn! on retry, info! on success-after-retry. Final-failure path surfaces the original error unwrapped so the caller's report_error funnel still sees a real Sentry event. Cross-platform: helper compiles on all targets but is_transient_fs_error only matches Windows-specific os_error codes; POSIX errors fall through and are surfaced immediately (no retry on UNIX EACCES — those are genuinely permanent). Co-Authored-By: oxoxDev <164490987+oxoxDev@users.noreply.github.com>
… sharing violations Wrap OpenOptions::create_new on `auth-profiles.lock` in retry_with_backoff (6 attempts, 100ms base ~3.1s total). On Windows, sharing violations are transient — the holding process (often a stale core sidecar PID or AV scanner) typically releases within seconds. Hard-failing on the first attempt produced flood of OPENHUMAN-TAURI-9E, -9C, -4Y, -61, and the read-path variant -5Q. The existing AlreadyExists busy-wait loop is preserved — that's the intentional contention case (another openhuman PID legitimately holds the lock). Only the unexpected error path now retries. Closes OPENHUMAN-TAURI-9E, OPENHUMAN-TAURI-9C, OPENHUMAN-TAURI-4Y, OPENHUMAN-TAURI-61, OPENHUMAN-TAURI-5Q. Co-Authored-By: oxoxDev <164490987+oxoxDev@users.noreply.github.com>
Wrap fs::remove_dir_all for memory_tree wipe + reset paths in retry_with_backoff_async (6 attempts, 200ms base ~12.4s total). The blocking SQL truncation stays in spawn_blocking; only the async filesystem cleanup uses the retry helper. Windows file-busy (os error 32) on user-data tree removal is invariably caused by a transient handle in a CEF subprocess or AV scan — releases within seconds in practice. The previous hard-fail produced OPENHUMAN-TAURI-9F and -4M. Closes OPENHUMAN-TAURI-9F, OPENHUMAN-TAURI-4M. Co-Authored-By: oxoxDev <164490987+oxoxDev@users.noreply.github.com>
…i#1641 CR) CodeRabbit major: both retry_with_backoff and retry_with_backoff_async skipped the loop body when attempts == 0 and panicked at `last_err.expect("attempts > 0")`. As public utilities, they should fail gracefully — surface anyhow::ensure! at the entry point so the caller gets a normal Err. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…#1641) - util.rs retry helpers: drop duplicate KV-vs-message-string interpolation in tracing::warn!/info!; structured fields are the canonical source. - util.rs: add test_retry_with_backoff_async_bail_on_non_transient covering the missing async bail path (sync test already existed). - memory/tree/read_rpc.rs: drop unnecessary dirs_removed.clone() — moved into WipeAllResponse directly. - memory/tree/read_rpc.rs: update inline step-order doc comments to match the refactored 1→3→2 sequence. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
378e633 to
fc16fd5
Compare
|
graycyrus 4 minors addressed in fc16fd5:
cargo test --lib openhuman::util — passes (23 tests now). fmt + check clean. |
|
graycyrus 4 minors addressed in
|
…erflow & comment fixes (tinyhumansai#1641) - credentials/profiles.rs: resolved merge conflict combining PR's retry_with_backoff wrapper with main's pid-write guard and clear_lock_if_stale stale-PID recovery (Issue tinyhumansai#1612) - util.rs: replace base_ms * 2u64.pow(i) with saturating_mul/saturating_pow capped at 30 000 ms to prevent overflow on large attempt counts in both sync and async retry helpers - memory/tree/read_rpc.rs: clarify wake_workers() comment to reflect post-cleanup ordering (prevents reader confusion about race window) - cargo fmt: reformat saturating chain onto single line
|
Deferred review notes from pr-manager (post-merge-conflict pass) These were flagged during the reviewer agent's pass but deferred for human consideration — they are non-blocking nitpicks and one question for the author. Nitpicks
Question for author
Should the hook be narrowed with |
…tinyhumansai#1335) (tinyhumansai#1488) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
Summary
retry_with_backoff/retry_with_backoff_asyncinsrc/openhuman/util.rsthat retry on Windows transient FS error codes (5, 32, 33, 1224).AuthProfilesStore::acquire_locknow retries lock creation 6× with exponential backoff (100ms base).memory::tree::wipe_all_rpc+reset_tree_rpcnow retry.openhumantree removal 6× with 200ms base.report_error— only the transient-retry phase is quieted.Problem
Failed to create auth profile lock at C:\Users\…\auth-profiles.lock(OPENHUMAN-TAURI-9E ×1, -9C ×1, -4Y ×1, -61 ×1; read-path -5Q ×1) — CEF subprocess / AV scanner holds a handle on the parent dir for a sub-second window when openhuman tries to create the lock file. Windows file locking is mandatory, socreate_newfails withERROR_SHARING_VIOLATION(32) orERROR_ACCESS_DENIED(5) instead of blocking.Failed to remove C:\Users\…\.openhuman: The process cannot access the file because it is being used by another process. (os error 32)(OPENHUMAN-TAURI-9F ×1, -4M ×1) — same shape on tree-wipe; CEF profile / Ollama binary handle releases within seconds.Solution
src/openhuman/util.rs: newretry_with_backoff(sync,std::thread::sleep) andretry_with_backoff_async(tokio). Both honor a sharedis_transient_fs_errorclassifier that matches Windowsos_errorcodes 5, 32, 33, 1224. On non-Windows, transient classification returns false (POSIXEACCESis not transient — surface immediately).src/openhuman/credentials/profiles.rs:acquire_lockwrapsOpenOptions::create_newwithretry_with_backoff("create auth profile lock", 6, 100, …). The existingAlreadyExistsbusy-wait loop is preserved — that's the legitimate "another openhuman PID is signing in" case. Only the unexpected error path retries.src/openhuman/memory/tree/read_rpc.rs:wipe_all_rpc+reset_tree_rpcseparate the SQLite truncation (stillspawn_blocking) from the filesystem cleanup (nowtokio::fswrapped inretry_with_backoff_async). Avoids blocking the executor while retries are sleeping.tracing::warn!with structured tags (op,attempt,retry_in_ms,error) keeps the diagnostic locally visible; final-failure surfaces unwrapped soreport_errorstill emits the Sentry event when retries are exhausted. First-attempt success is silent.src/openhuman/util.rs— immediate success, success-after-retries (sync + async), failure-after-all-attempts, non-transient bails immediately. Cross-platform tests use a__TEST_TRANSIENT__error-message marker so the retry logic exercises on POSIX CI.Submission Checklist
src/openhuman/util.rscovering happy path + failure modes + non-transient short-circuit.cargo test --lib openhuman::credentialspasses 112/0).Closesin## Related.Impact
is_transient_fs_errorreturns false on non-Windows and the retry short-circuits.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
2289277133504848069)Failedat the publish stage, but the produced patch was complete and clean. Pulled viajules remote pull --session, applied to a fresh branch offupstream/mainwith a single conflict insrc/openhuman/util.rs(resolved by keeping HEAD's existing truncate tests + appending Jules's new retry tests).Summary by CodeRabbit
Bug Fixes
Refactor
Chores / Tests