Skip to content

fix(credentials): diagnose + recover from H8 auth-profile-lock create failures#2180

Open
YellowSnnowmann wants to merge 5 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/sentry-h8-auth-profile-lock-diagnostics
Open

fix(credentials): diagnose + recover from H8 auth-profile-lock create failures#2180
YellowSnnowmann wants to merge 5 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/sentry-h8-auth-profile-lock-diagnostics

Conversation

@YellowSnnowmann
Copy link
Copy Markdown
Contributor

@YellowSnnowmann YellowSnnowmann commented May 19, 2026

Summary

  • Embed io::ErrorKind and raw_os_error() in the surfaced context for "Failed to create auth profile lock" so Sentry fingerprints split by root-cause OS code instead of collapsing into one ticket.
  • Classify Windows ERROR_DELETE_PENDING (OS code 303) as a transient FS error so retry_with_backoff rides out the AV / Search-indexer race that left the lock file in delete-pending limbo.
  • Add Unix coverage that locks in the new diagnostic context format and Windows coverage that asserts 303 is classified transient.

Problem

Sentry OPENHUMAN-TAURI-H8 ("Failed to create auth profile lock") was unresolved with 822 events, priority high, substatus escalating, last seen 2026-05-17. Tags showed:

  • os.name: macOS 659 / windows 180
  • domain = rpc, operation = invoke_method
  • method: openhuman.app_state_snapshot (710), openhuman.billing_get_current_plan (68), openhuman.team_get_usage (57)
  • elapsed_ms: 1–4 ms for ~95% of events

Two issues:

  1. AuthProfilesStore::acquire_lock wraps every non-AlreadyExists OpenOptions::create_new failure with a single static string. Every distinct OS errno collapses into one Sentry fingerprint, so we cannot tell which underlying error is hot — the 659 macOS events are opaque.
  2. On Windows, create_new immediately after the prior owner's fs::remove_file can return ERROR_DELETE_PENDING (303) when AV / Search-indexer still holds a handle. Until now is_transient_fs_error only matched 5/32/33/1224, so 303 returned a kind = Other io::Error that retry_with_backoff treated as fatal and bailed at attempt 1 (~2 ms — matching the elapsed_ms = 2 on the latest H8 event for team_get_usage on Windows 10.0.26200).

The branch is companion to OPENHUMAN-TAURI-H1 ("Timed out waiting for auth profile lock") which was already resolved by stale-PID recovery + retry-with-backoff in #1641 / #1636.

Solution

src/openhuman/credentials/profiles.rs — in the non-transient Err branch of the create_new match, walk the anyhow chain for an io::Error, pull its kind() and raw_os_error(), and embed both in the with_context format string:

let io = e.chain().find_map(|c| c.downcast_ref::<std::io::Error>());
let kind = io.map(|ioe| ioe.kind());
let os_code = io.and_then(|ioe| ioe.raw_os_error());
return Err(e).with_context(|| {
    format!("Failed to create auth profile lock (kind={:?}, os_code={:?})", kind, os_code)
});

The top-level message stays stable ("Failed to create auth profile lock …") so the existing Sentry rule keeps matching, but the trailing (kind=…, os_code=…) makes the fingerprint split per OS code. Once this ships, future events will tell us exactly which macOS errno dominates the 659-event cluster, unlocking a one-line follow-up to is_transient_fs_error on the macOS branch.

src/openhuman/util.rs — add 303 to the Windows match in is_transient_fs_error alongside 5/32/33/1224, with an inline comment documenting the AV/indexer delete-pending race and the H8 evidence that justifies it. retry_with_backoff (6 attempts, 100 ms base, exponential up to 30 s) now absorbs the race for callers of acquire_lock.

src/openhuman/credentials/profiles_tests.rs — #[cfg(unix)] test forces a non-AlreadyExists create_new failure by stripping write permission off the state dir (chmod 0500), then asserts the surfaced error string contains the stable top-level message, kind=, and os_code=. Doesn't assert on the kind itself (Windows would surface a different one for the same scenario), only that the diagnostic markers are present.

src/openhuman/util.rs — #[cfg(windows)] test builds io::Error::from_raw_os_error(303), wraps in anyhow::Error::new, and asserts is_transient_fs_error returns true.

Design notes / tradeoffs

  • Kept the top-level "Failed to create auth profile lock" prefix verbatim to preserve the existing Sentry issue grouping; only the suffix changes. The fingerprint splits by OS code via the new suffix without invalidating the existing alert / rule.
  • Did not pre-emptively add macOS errnos to is_transient_fs_error — speculation on the 659 macOS events without data would either over-retry (e.g. swallowing genuine PermissionDenied) or under-retry. Diagnostics first, fix after.
  • Did not change RPC-level retry semantics; the lock retry remains internal to AuthProfilesStore.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — acquire_lock_error_message_includes_io_kind_and_os_code (Unix) covers the non-AlreadyExists failure path; is_transient_fs_error_classifies_windows_delete_pending (Windows) covers the new transient classification.
  • Diff coverage ≥ 80% — cargo test --lib credentials::profiles 29/29 green and cargo test --lib openhuman::util:: 29/29 green; both new units exercise every changed Rust line in profiles.rs (lines 595–614) and util.rs (line 630).
  • Coverage matrix updated — N/A: behaviour-only change (no new feature row; H8 fix is internal to the auth-profile lock path already tracked under the credentials domain).
  • All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: behaviour-only change (see above).
  • No new external network dependencies introduced (mock backend used per Testing Strategy) — both new tests are pure Rust unit tests; no network, no mock backend needed.
  • Manual smoke checklist updated if this touches release-cut surfaces (../docs/RELEASE-MANUAL-SMOKE.md) — N/A: no user-visible surface change; only error-message text and retry policy on an internal lock path.
  • Linked issue closed via Closes #NNN in the ## Related section.

Impact

  • Platforms: desktop only (Windows + macOS + Linux). The transient-classification change is Windows-specific (#[cfg(windows)] match arm); the diagnostic context change runs on every platform.
  • Performance: on Windows, callers that previously failed in ~2 ms now retry up to 6 times with exponential backoff (100 ms base, capped at 30 s) — worst case is multi-second latency on a contended lock under AV interference, vs. an immediate RPC error today. On macOS / Linux, retry behaviour is unchanged; only the error string surfaced to Sentry differs.
  • Security: no change to lock semantics; the file path is still redacted from error messages (preserved from Redact auth profile lock paths from errors #1515).
  • Migration / compatibility: none. The top-level Sentry message prefix is preserved so existing alerts continue to match; the change is purely additive in the error context suffix.

Related

Summary by CodeRabbit

  • Bug Fixes

    • Improved error messages for authentication profile lock failures with richer diagnostic context for easier troubleshooting.
    • Enhanced detection and handling of transient Windows filesystem errors to enable automatic retries and improved resilience.
  • Tests

    • Added unit tests covering lock-creation error reporting and Windows transient filesystem error detection.

Review Change Stack

…enhance error context in auth profile lock creation
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b6f09f85-fd8c-42eb-8e05-d504eb96264c

📥 Commits

Reviewing files that changed from the base of the PR and between 16c83a4 and ec020fd.

📒 Files selected for processing (1)
  • src/openhuman/credentials/profiles_tests.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/openhuman/credentials/profiles_tests.rs

📝 Walkthrough

Walkthrough

Adds richer io::ErrorKind/raw_os_error diagnostics when lock-file creation fails and treats Windows ERROR_DELETE_PENDING (303) as a transient filesystem error; includes unit tests for the annotation helper and the new transient classification.

Changes

Auth Profile Lock Resilience

Layer / File(s) Summary
Lock creation failure annotation and diagnostics
src/openhuman/credentials/profiles.rs, src/openhuman/credentials/profiles_tests.rs
AuthProfilesStore::acquire_lock now uses annotate_lock_create_failure to wrap non-AlreadyExists lock creation errors with extracted io::ErrorKind and raw_os_error details. Two unit tests validate annotation for both io::Error and non-io::Error chains.
Windows transient filesystem error detection
src/openhuman/util.rs
is_transient_fs_error on Windows now treats error code 303 (ERROR_DELETE_PENDING) as transient. A Windows-only test asserts an io::Error constructed from raw OS error 303 is classified as transient.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Suggested reviewers

  • graycyrus

Poem

🐰 I found a lock that told me why,

a kind and code that caught my eye,
On Windows winds the retry sings,
Small errors bloom into spring,
Hop, hop — diagnostics fly!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title accurately describes the main change: enhancing diagnostics and recovery for auth profile lock creation failures, which is the primary focus across all modified files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

…e lock creation into a separate function for improved testability and clarity
…creation failures, ensuring platform independence and clarity
@YellowSnnowmann YellowSnnowmann marked this pull request as ready for review May 19, 2026 08:08
@YellowSnnowmann YellowSnnowmann requested a review from a team May 19, 2026 08:08
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/openhuman/credentials/profiles_tests.rs`:
- Around line 454-468: The test constructs a platform-specific raw OS error code
incorrectly by hardcoding 13; update the test in profiles_tests.rs so io_err is
set with cfg attributes (e.g., #[cfg(windows)] use 5 / ERROR_ACCESS_DENIED and
#[cfg(not(windows))] use 13 / EACCES) before calling
annotate_lock_create_failure, and change the subsequent os_code assertion to
expect the platform-specific value (use the same cfg to set the expected_os_code
variable or two separate assertions) so the test validates the real OS error
code on Windows and non-Windows platforms while keeping the checks for the
top-level message and ErrorKind.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9aeb69de-2b61-4f3b-a03d-6e7130829740

📥 Commits

Reviewing files that changed from the base of the PR and between c25fc8e and 16c83a4.

📒 Files selected for processing (3)
  • src/openhuman/credentials/profiles.rs
  • src/openhuman/credentials/profiles_tests.rs
  • src/openhuman/util.rs

Comment thread src/openhuman/credentials/profiles_tests.rs Outdated
…reation failure tests for better accuracy and consistency
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant