fix(credentials): diagnose + recover from H8 auth-profile-lock create failures by YellowSnnowmann · Pull Request #2180 · tinyhumansai/openhuman

YellowSnnowmann · 2026-05-19T07:19:22Z

Summary

Embed io::ErrorKind and raw_os_error() in the surfaced context for "Failed to create auth profile lock" so Sentry fingerprints split by root-cause OS code instead of collapsing into one ticket.
Classify Windows ERROR_DELETE_PENDING (OS code 303) as a transient FS error so retry_with_backoff rides out the AV / Search-indexer race that left the lock file in delete-pending limbo.
Add Unix coverage that locks in the new diagnostic context format and Windows coverage that asserts 303 is classified transient.

Problem

Sentry OPENHUMAN-TAURI-H8 ("Failed to create auth profile lock") was unresolved with 822 events, priority high, substatus escalating, last seen 2026-05-17. Tags showed:

os.name: macOS 659 / windows 180
domain = rpc, operation = invoke_method
method: openhuman.app_state_snapshot (710), openhuman.billing_get_current_plan (68), openhuman.team_get_usage (57)
elapsed_ms: 1–4 ms for ~95% of events

Two issues:

AuthProfilesStore::acquire_lock wraps every non-AlreadyExists OpenOptions::create_new failure with a single static string. Every distinct OS errno collapses into one Sentry fingerprint, so we cannot tell which underlying error is hot — the 659 macOS events are opaque.
On Windows, create_new immediately after the prior owner's fs::remove_file can return ERROR_DELETE_PENDING (303) when AV / Search-indexer still holds a handle. Until now is_transient_fs_error only matched 5/32/33/1224, so 303 returned a kind = Other io::Error that retry_with_backoff treated as fatal and bailed at attempt 1 (~2 ms — matching the elapsed_ms = 2 on the latest H8 event for team_get_usage on Windows 10.0.26200).

The branch is companion to OPENHUMAN-TAURI-H1 ("Timed out waiting for auth profile lock") which was already resolved by stale-PID recovery + retry-with-backoff in #1641 / #1636.

Solution

src/openhuman/credentials/profiles.rs — in the non-transient Err branch of the create_new match, walk the anyhow chain for an io::Error, pull its kind() and raw_os_error(), and embed both in the with_context format string:

let io = e.chain().find_map(|c| c.downcast_ref::<std::io::Error>());
let kind = io.map(|ioe| ioe.kind());
let os_code = io.and_then(|ioe| ioe.raw_os_error());
return Err(e).with_context(|| {
    format!("Failed to create auth profile lock (kind={:?}, os_code={:?})", kind, os_code)
});

The top-level message stays stable ("Failed to create auth profile lock …") so the existing Sentry rule keeps matching, but the trailing (kind=…, os_code=…) makes the fingerprint split per OS code. Once this ships, future events will tell us exactly which macOS errno dominates the 659-event cluster, unlocking a one-line follow-up to is_transient_fs_error on the macOS branch.

src/openhuman/util.rs — add 303 to the Windows match in is_transient_fs_error alongside 5/32/33/1224, with an inline comment documenting the AV/indexer delete-pending race and the H8 evidence that justifies it. retry_with_backoff (6 attempts, 100 ms base, exponential up to 30 s) now absorbs the race for callers of acquire_lock.

src/openhuman/credentials/profiles_tests.rs — #[cfg(unix)] test forces a non-AlreadyExists create_new failure by stripping write permission off the state dir (chmod 0500), then asserts the surfaced error string contains the stable top-level message, kind=, and os_code=. Doesn't assert on the kind itself (Windows would surface a different one for the same scenario), only that the diagnostic markers are present.

src/openhuman/util.rs — #[cfg(windows)] test builds io::Error::from_raw_os_error(303), wraps in anyhow::Error::new, and asserts is_transient_fs_error returns true.

Design notes / tradeoffs

Kept the top-level "Failed to create auth profile lock" prefix verbatim to preserve the existing Sentry issue grouping; only the suffix changes. The fingerprint splits by OS code via the new suffix without invalidating the existing alert / rule.
Did not pre-emptively add macOS errnos to is_transient_fs_error — speculation on the 659 macOS events without data would either over-retry (e.g. swallowing genuine PermissionDenied) or under-retry. Diagnostics first, fix after.
Did not change RPC-level retry semantics; the lock retry remains internal to AuthProfilesStore.

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — acquire_lock_error_message_includes_io_kind_and_os_code (Unix) covers the non-AlreadyExists failure path; is_transient_fs_error_classifies_windows_delete_pending (Windows) covers the new transient classification.
Diff coverage ≥ 80% — cargo test --lib credentials::profiles 29/29 green and cargo test --lib openhuman::util:: 29/29 green; both new units exercise every changed Rust line in profiles.rs (lines 595–614) and util.rs (line 630).
Coverage matrix updated — N/A: behaviour-only change (no new feature row; H8 fix is internal to the auth-profile lock path already tracked under the credentials domain).
All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: behaviour-only change (see above).
No new external network dependencies introduced (mock backend used per Testing Strategy) — both new tests are pure Rust unit tests; no network, no mock backend needed.
Manual smoke checklist updated if this touches release-cut surfaces (../docs/RELEASE-MANUAL-SMOKE.md) — N/A: no user-visible surface change; only error-message text and retry policy on an internal lock path.
Linked issue closed via Closes #NNN in the ## Related section.

Impact

Platforms: desktop only (Windows + macOS + Linux). The transient-classification change is Windows-specific (#[cfg(windows)] match arm); the diagnostic context change runs on every platform.
Performance: on Windows, callers that previously failed in ~2 ms now retry up to 6 times with exponential backoff (100 ms base, capped at 30 s) — worst case is multi-second latency on a contended lock under AV interference, vs. an immediate RPC error today. On macOS / Linux, retry behaviour is unchanged; only the error string surfaced to Sentry differs.
Security: no change to lock semantics; the file path is still redacted from error messages (preserved from Redact auth profile lock paths from errors #1515).
Migration / compatibility: none. The top-level Sentry message prefix is preserved so existing alerts continue to match; the change is purely additive in the error context suffix.

Summary by CodeRabbit

Bug Fixes
- Improved error messages for authentication profile lock failures with richer diagnostic context for easier troubleshooting.
- Enhanced detection and handling of transient Windows filesystem errors to enable automatic retries and improved resilience.
Tests
- Added unit tests covering lock-creation error reporting and Windows transient filesystem error detection.

…enhance error context in auth profile lock creation

…tion failure in read-only state directory

coderabbitai · 2026-05-19T07:19:29Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b6f09f85-fd8c-42eb-8e05-d504eb96264c

📥 Commits

Reviewing files that changed from the base of the PR and between 16c83a4 and ec020fd.

📒 Files selected for processing (1)

src/openhuman/credentials/profiles_tests.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/openhuman/credentials/profiles_tests.rs

📝 Walkthrough

Walkthrough

Adds richer io::ErrorKind/raw_os_error diagnostics when lock-file creation fails and treats Windows ERROR_DELETE_PENDING (303) as a transient filesystem error; includes unit tests for the annotation helper and the new transient classification.

Changes

Auth Profile Lock Resilience

Layer / File(s)	Summary
Lock creation failure annotation and diagnostics `src/openhuman/credentials/profiles.rs`, `src/openhuman/credentials/profiles_tests.rs`	`AuthProfilesStore::acquire_lock` now uses `annotate_lock_create_failure` to wrap non-`AlreadyExists` lock creation errors with extracted `io::ErrorKind` and `raw_os_error` details. Two unit tests validate annotation for both `io::Error` and non-`io::Error` chains.
Windows transient filesystem error detection `src/openhuman/util.rs`	`is_transient_fs_error` on Windows now treats error code 303 (`ERROR_DELETE_PENDING`) as transient. A Windows-only test asserts an `io::Error` constructed from raw OS error 303 is classified as transient.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

fix(auth): auth profile lock timeout/creation failure blocks all RPC calls (regression) #2072: Addresses the same auth-profile lock creation failure modes by improving diagnostics and transient-error handling (ERROR_DELETE_PENDING) relevant to retry behavior.

Suggested reviewers

graycyrus

Poem

🐰 I found a lock that told me why,

a kind and code that caught my eye,
On Windows winds the retry sings,
Small errors bloom into spring,
Hop, hop — diagnostics fly!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The pull request title accurately describes the main change: enhancing diagnostics and recovery for auth profile lock creation failures, which is the primary focus across all modified files.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…e lock creation into a separate function for improved testability and clarity

…creation failures, ensuring platform independence and clarity

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/credentials/profiles_tests.rs`:
- Around line 454-468: The test constructs a platform-specific raw OS error code
incorrectly by hardcoding 13; update the test in profiles_tests.rs so io_err is
set with cfg attributes (e.g., #[cfg(windows)] use 5 / ERROR_ACCESS_DENIED and
#[cfg(not(windows))] use 13 / EACCES) before calling
annotate_lock_create_failure, and change the subsequent os_code assertion to
expect the platform-specific value (use the same cfg to set the expected_os_code
variable or two separate assertions) so the test validates the real OS error
code on Windows and non-Windows platforms while keeping the checks for the
top-level message and ErrorKind.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9aeb69de-2b61-4f3b-a03d-6e7130829740

📥 Commits

Reviewing files that changed from the base of the PR and between c25fc8e and 16c83a4.

📒 Files selected for processing (3)

src/openhuman/credentials/profiles.rs
src/openhuman/credentials/profiles_tests.rs
src/openhuman/util.rs

…reation failure tests for better accuracy and consistency

YellowSnnowmann added 2 commits May 19, 2026 11:56

fix(util): classify ERROR_DELETE_PENDING as transient on Windows and …

6b57188

…enhance error context in auth profile lock creation

test(credentials): add test for error message details on lock acquisi…

1957a6b

…tion failure in read-only state directory

YellowSnnowmann added 2 commits May 19, 2026 13:16

refactor(credentials): extract error context handling for auth profil…

7d10533

…e lock creation into a separate function for improved testability and clarity

test(credentials): enhance error message tests for auth profile lock …

16c83a4

…creation failures, ensuring platform independence and clarity

YellowSnnowmann marked this pull request as ready for review May 19, 2026 08:08

YellowSnnowmann requested a review from a team May 19, 2026 08:08

coderabbitai Bot requested changes May 19, 2026

View reviewed changes

Comment thread src/openhuman/credentials/profiles_tests.rs Outdated

test(credentials): improve platform-specific error handling in lock c…

ec020fd

…reation failure tests for better accuracy and consistency

coderabbitai Bot approved these changes May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(credentials): diagnose + recover from H8 auth-profile-lock create failures#2180

fix(credentials): diagnose + recover from H8 auth-profile-lock create failures#2180
YellowSnnowmann wants to merge 5 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/sentry-h8-auth-profile-lock-diagnostics

YellowSnnowmann commented May 19, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

YellowSnnowmann commented May 19, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Design notes / tradeoffs

Submission Checklist

Impact

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

YellowSnnowmann commented May 19, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading