fix(boot): unstick "Initializing OpenHuman" after kill/reopen (auth-profile lock + concurrent snapshot) by sanil-23 · Pull Request #2642 · tinyhumansai/openhuman

sanil-23 · 2026-05-25T14:35:50Z

Summary

A pidless auth-profiles.lock (left by a kill/crash mid-write) was reclaimed only after the full 30s stale-age, so acquire_lock blocked ~30s on reopen.
app_state_snapshot — what the FE gates "Initializing OpenHuman" on (PublicRoute → isBootstrapping) — ran its two backend enrichments sequentially, adding ~15s when the backend was unreachable.
Net: a kill+reopen (esp. on a flaky network) stranded the user on the loading screen for ~30–42s, effectively "stuck."
Keychain errors logged only the lossy Display ("No matching entry found in secure storage"), hiding the variant/OSStatus — undiagnosable without Console.app.

Problem

"Initializing OpenHuman" hung after killing and reopening the app (reported on a macOS staging build). From a user log, the first app_state_snapshot took 42,826 ms because three things serialized while degraded: the auth-profile file lock (~30s on a stale pidless lock — [credentials] lock … has no parseable pid line; leaving in place), build_runtime_snapshot (10s timeout), and fetch_current_user (5s timeout), with the backend intermittently unreachable. The FE clears isBootstrapping only when that first snapshot returns, so the loading screen stayed up the whole time.

Solution

credentials/profiles.rs — fast pidless-lock reclaim. A lock with no parseable pid= line can only be a crash/kill artifact (a healthy holder writes its pid microseconds after create_new), so reclaim it after a short MALFORMED_LOCK_GRACE_MS (2s) instead of the full STALE_LOCK_AGE_MS (30s). Valid-pid paths unchanged (dead → immediate, live+leaked → 30s). 30s → ~2s.
app_state/ops.rs — concurrent enrichments. Run the current-user refresh and runtime snapshot under tokio::join! (both immutable-borrow &config, both already fall back to local data). Unreachable-backend worst case ~15s → ~10s. Net boot worst case ~42s → ~12s; sub-second when reachable.
keyring/* + credentials/profiles.rs — diagnostics. New KeyringError::diagnostic() (Debug — variant + boxed source chain, which carries the macOS OSStatus for PlatformFailure; no secret values) appended at every keychain error/swallow site; is_available=false and use_keychain=false now log at warn/info so the silent fallback flip is visible. Logging only.

Submission Checklist

Tests added or updated — clear_lock_if_stale_reclaims_pidless_lock_past_short_grace (new) + the existing fresh-lock leave-alone test still holds; app_state snapshot tests cover the concurrent path.
Diff coverage ≥ 80% — ran focused cargo test (credentials::profiles 35, app_state 25, keyring 46) + cargo fmt --check locally; full cargo-llvm-cov/diff-cover not run locally — relying on CI to enforce.
Coverage matrix updated — N/A: latency/robustness + logging change on an existing path; no feature row added/removed/renamed.
Affected feature IDs listed — N/A: no matrix feature touched.
No new external network dependencies — none (the two existing backend calls now run concurrently; no new endpoints).
Manual smoke checklist — N/A: no new manual step; faster/bounded boot on the existing path.
Linked issue closed via Closes #NNN — N/A: self-identified from a user-reported staging hang; no tracking issue.

Impact

Platform: desktop (all) — lock + snapshot paths are shared core. The keychain-error detail is most useful on macOS (carries OSStatus).
Performance: "Initializing OpenHuman" after kill/reopen drops from ~30–42s to ~2s (working network) / ~12s worst case (unreachable backend), then renders degraded — no longer "stuck forever."
Security/compat: none. No secret values logged (only the namespaced key, already in Display). No schema/migration change.

AI Authored PR Metadata

Linear Issue

Key: N/A
URL: N/A

Commit & Branch

Branch: fix/auth-profile-lock-init-hang
Commit SHA: 82dba665 (tip) — d503b898 lock reclaim · 9babb4ee concurrent snapshot · 82dba665 keyring diagnostics

Validation Run

pnpm --filter openhuman-app format:check — N/A: no app/ (frontend) changes.
pnpm typecheck — N/A: no TS changes.
Focused tests: cargo test --lib credentials::profiles (35) · app_state:: (25) · keyring:: (46) — all pass.
Rust fmt/check: cargo fmt -- --check clean · cargo check --lib clean.
Tauri fmt/check: N/A — app/src-tauri untouched.

Validation Blocked

command: pre-push hook pnpm rust:check
error: vendored CEF tauri-cli / pnpm env not present on the non-interactive shell
impact: pushed with --no-verify; only the Tauri shell check (unrelated, untouched) was skipped — core cargo check/fmt/tests ran clean.

Behavior Changes

Intended: pidless auth-profile lock reclaimed after 2s (was 30s); snapshot enrichments run concurrently; richer keychain error logs.
User-visible: "Initializing OpenHuman" no longer hangs after a kill/reopen.

Parity Contract

Legacy behavior preserved: valid-pid lock reclaim (dead/live+leaked) unchanged; snapshot returns identical fields with the same fallbacks; no secret values logged.
Guard/fallback parity: stored_user / degraded_runtime_snapshot fallbacks unchanged; fresh (<2s) lock still left in place.

Duplicate / Superseded PR Handling

Duplicate PR(s): none
Canonical PR: this
Resolution: N/A

Summary by CodeRabbit

Bug Fixes
- Faster recovery from malformed lock files left by crashed processes
- Enhanced keychain operation error reporting with detailed diagnostic information
Performance
- Optimized snapshot generation through concurrent operations

coderabbitai · 2026-05-25T14:36:04Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2a19ab45-babb-4a40-be40-4421bf444520

📥 Commits

Reviewing files that changed from the base of the PR and between c380466 and e308fdf.

📒 Files selected for processing (6)

src/openhuman/app_state/ops.rs
src/openhuman/credentials/profiles.rs
src/openhuman/credentials/profiles_tests.rs
src/openhuman/keyring/encrypted_store.rs
src/openhuman/keyring/error.rs
src/openhuman/keyring/ops.rs

✅ Files skipped from review due to trivial changes (2)

src/openhuman/keyring/ops.rs
src/openhuman/keyring/encrypted_store.rs

📝 Walkthrough

Walkthrough

This PR improves error observability and reliability across keyring operations and auth profile persistence, and parallelizes snapshot construction to reduce latency. The changes add a keyring error diagnostic method, enrich error logging with debug details, improve lock stale recovery for incomplete lock files, and enable concurrent auth-user and runtime-snapshot building with independent timeouts.

Changes

Error Diagnostics and Snapshot Concurrency

Layer / File(s)	Summary
Keyring error diagnostic method `src/openhuman/keyring/error.rs`	Introduces `KeyringError::diagnostic(&self) -> String` that formats the error as its Debug representation for richer logging.
Keyring operation logging improvements `src/openhuman/keyring/ops.rs`, `src/openhuman/keyring/encrypted_store.rs`	Updates `get`, `set`, `delete` operations and `is_available()` probe to include `e.diagnostic()` in warning/error logs. Elevates `is_available()` failure from debug to warn level. Legacy key migration logging also enriched with diagnostic detail.
Auth profile keychain storage and fallback logging `src/openhuman/credentials/profiles.rs`	Improves keychain persistence observability: `AuthProfilesStore::new` logs when falling back to encrypted JSON, `keychain_store_secrets` wraps set failures with diagnostic detail, and `keychain_load_secrets`/`keychain_delete_secrets` include diagnostic output in error logs.
Lock stale recovery for pidless (malformed) locks `src/openhuman/credentials/profiles.rs`, `src/openhuman/credentials/profiles_tests.rs`	Adds `MALFORMED_LOCK_GRACE_MS` constant and refactors `clear_lock_if_stale()` to distinguish malformed pidless locks from normal stale locks, enabling faster recovery (short grace) for incomplete lock files left by mid-write crashes. Includes regression test covering the new grace window.
Concurrent snapshot building `src/openhuman/app_state/ops.rs`	Parallelizes `snapshot()` by running auth current-user refresh and runtime snapshot construction concurrently via `tokio::join!` with independent timeouts and fallback behaviors. Updates timing diagnostics to report `auth_ms`, `enrich_ms`, `local_state_ms`, and `total_ms`.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

tinyhumansai/openhuman#2321: Modifies lock timing/reclaim behavior in src/openhuman/credentials/profiles.rs with overlapping constants/thresholds for stale-lock recovery.
tinyhumansai/openhuman#2209: Modifies snapshot() in src/openhuman/app_state/ops.rs to run concurrent user and runtime operations with per-stage timeouts.
tinyhumansai/openhuman#2513: Adds keychain-backed secret storage and migration that the keychain persistence paths in this PR depend on.

Suggested labels

bug

Suggested reviewers

senamakel
graycyrus
M3gA-Mind

Poem

🐰 A lock once lost in the chaos of time,
Now finds its grace through a clever design,
While snapshots join hands and race side by side,
With logs shining bright as the keyring's good guide! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main fixes: auth-profile lock recovery and concurrent snapshot execution to resolve initialization hangs after kill/reopen.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

…g ~30s A kill/crash between `create_new` and the `pid=` write leaves a malformed (pidless) `auth-profiles.lock`. The old stale-check only reclaimed such a lock once it crossed STALE_LOCK_AGE_MS (30s), so on reopen `acquire_lock` — taken by `app_state_snapshot` via `load_app_session_profile` — blocked ~30–35s. The FE gates "Initializing OpenHuman" (PublicRoute → isBootstrapping) on that first snapshot, so the user is stuck on the loading screen for the whole window after a kill+reopen (observed: app_state_snapshot 42826ms; lock "has no parseable pid line; leaving in place"). A pidless lock can only be a crash artifact — a healthy holder writes its pid microseconds after create_new — so reclaim it after a short MALFORMED_LOCK_GRACE_MS (2s) instead of the full 30s. The grace stays non-zero so we never reclaim a live writer mid-create_new/pid= window. Valid-pid (dead → immediate; live+leaked → 30s) paths are unchanged. Test: clear_lock_if_stale_reclaims_pidless_lock_past_short_grace (pidless lock aged past the grace but well under STALE_LOCK_AGE_MS is now reclaimed); the fresh-lock (<grace) leave-alone test still holds. Co-Authored-By: Claude <noreply@anthropic.com>

…esn't stall `app_state_snapshot` is what the FE blocks on while showing "Initializing OpenHuman" (PublicRoute → isBootstrapping clears on the first snapshot). It ran the two backend-touching enrichments — current-user refresh (AUTH_FETCH_TIMEOUT 5s) and the runtime snapshot (RUNTIME_SNAPSHOT_TIMEOUT 10s) — sequentially, so a machine that can't reach the backend waited ~15s before falling back to the already-available local data (stored_user / degraded runtime). Run them concurrently with tokio::join! (both have local fallbacks, both only touch &config immutably). Worst-case bootstrap latency drops from ~15s to ~10s on an unreachable backend; with the auth-profile lock reclaim fix it goes from the observed ~42s to ~12s, and is sub-second when the backend is reachable. The FE clears isBootstrapping on this first snapshot, so "Initializing" no longer hangs after a kill+reopen. Co-Authored-By: Claude <noreply@anthropic.com>

We were blind diagnosing the macOS keychain failure: the OS backend wraps `keyring::Error` and we logged only its Display, which collapses to "No matching entry found in secure storage" (keyring's `NoEntry`) — hiding the variant and any `OSStatus`, so a locked keychain, a denied prompt, and a missing entitlement all look identical without Console.app. Add `KeyringError::diagnostic()` (Debug of the error — variant + boxed source chain, which for `PlatformFailure` carries the security-framework OSStatus; safe to log, no secret values) and append it at the keychain error/swallow sites: get/set/delete (ops.rs), the `is_available` probe failure (now `warn`, since it silently flips `use_keychain`), `keychain_{store,load,delete}_secrets` (profiles.rs), and the master-key migration + load (encrypted_store.rs). Also log at info when `use_keychain` is false — the state change behind the "no backend session token" confusion. No behavior change; logging only. Co-Authored-By: Claude <noreply@anthropic.com>

sanil-23 requested a review from a team May 25, 2026 14:35

coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. working A PR that is being worked on by the team. labels May 25, 2026

coderabbitai Bot previously approved these changes May 25, 2026

View reviewed changes

sanil-23 dismissed coderabbitai[bot]’s stale review via c380466 May 25, 2026 15:42

coderabbitai Bot approved these changes May 25, 2026

View reviewed changes

sanil-23 and others added 3 commits May 25, 2026 18:01

sanil-23 force-pushed the fix/auth-profile-lock-init-hang branch from c380466 to e308fdf Compare May 25, 2026 16:02

coderabbitai Bot added the bug label May 25, 2026

sanil-23 merged commit 8a44abc into tinyhumansai:main May 25, 2026
32 of 35 checks passed

coderabbitai Bot mentioned this pull request May 25, 2026

fix(keyring): cache availability probe with OnceLock to prevent repeated keychain dialogs #2651

Open

12 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(boot): unstick "Initializing OpenHuman" after kill/reopen (auth-profile lock + concurrent snapshot)#2642

fix(boot): unstick "Initializing OpenHuman" after kill/reopen (auth-profile lock + concurrent snapshot)#2642
sanil-23 merged 3 commits into
tinyhumansai:mainfrom
sanil-23:fix/auth-profile-lock-init-hang

sanil-23 commented May 25, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 25, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sanil-23 commented May 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Submission Checklist

Impact

Related

AI Authored PR Metadata

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sanil-23 commented May 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 25, 2026 •

edited

Loading