fix: runtime snapshot timeouts, config perms, stale lock recovery, summarization-v1 tier, loopback OAuth UX#2690
Conversation
…e config perms + clearer socket ping-timeout log - Wrap screen_intelligence, local_ai, and autocomplete sub-ops in 5 s tokio::time::timeout inside build_runtime_snapshot so a hung engine can't stall the entire snapshot; falls back to degraded values - Auto-chmod config files detected as world-readable to 0o600 instead of just warning - Rename timeout_ms local in ws_loop for clarity; expand ping-timeout log message to include interval/timeout breakdown
…t-perms-socket-timeout
… lock + M1 loopback timeout B4 (providers): add MODEL_SUMMARIZATION_V1 constant and register it in is_known_openhuman_tier so summarization-v1 / hint:summarization no longer silently falls back to reasoning-v1; wire hint:summarization translation in make_openhuman_backend; update fallback warning message. B5 (credentials): change malformed_too_old from map_or(false) to map_or(true) so a pidless auth-profiles.lock whose mtime is unreadable (clock skew, platform limitation) is treated as stale rather than leaving callers stuck for the full LOCK_TIMEOUT_MS window. M1 (loopback-oauth): reduce DEFAULT_TIMEOUT_SECS from 300 → 60; surface a user-visible startupError in OAuthProviderButton when the loopback listener times out instead of silently logging.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughShortens the OAuth loopback timeout and surfaces a localized timeout error in the sign-in button; adds the translation entries. Separately, backend fixes add a summarization model tier, per-sub-operation snapshot timeouts with fallbacks, auto-fix for world-readable config permissions, stricter lock reclamation, and clearer ping-timeout logs. ChangesOAuth Loopback Timeout User Experience
Backend Robustness and Feature Expansion
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/openhuman/inference/provider/factory.rs (1)
47-60: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winRemove local constant duplication in favor of the imported constant.
The
is_abstract_tier_modelfunction defines a localMODEL_SUMMARIZATION_V1constant with a comment stating "No dedicated constant for the summarization tier yet". However, this PR adds exactly that constant totypes.rsand imports it at line 82. The function should use the imported constant instead of defining its own to avoid duplication and maintain consistency.♻️ Proposed fix to use the imported constant
fn is_abstract_tier_model(model: &str) -> bool { use crate::openhuman::config::{ - MODEL_AGENTIC_V1, MODEL_CODING_V1, MODEL_REASONING_QUICK_V1, MODEL_REASONING_V1, + MODEL_AGENTIC_V1, MODEL_CODING_V1, MODEL_REASONING_QUICK_V1, MODEL_REASONING_V1, + MODEL_SUMMARIZATION_V1, }; - // No dedicated constant for the summarization tier yet; keep the literal - // in sync with the tier name used by the summarizer sub-agent. - const MODEL_SUMMARIZATION_V1: &str = "summarization-v1"; let trimmed = model.trim(); trimmed == MODEL_REASONING_V1 || trimmed == MODEL_REASONING_QUICK_V1🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/inference/provider/factory.rs` around lines 47 - 60, The function is_abstract_tier_model defines a local const MODEL_SUMMARIZATION_V1 while an identical constant has already been added and imported; remove the local MODEL_SUMMARIZATION_V1 declaration and replace its usage with the imported MODEL_SUMMARIZATION_V1 (keep the trimmed variable and existing comparisons), ensuring the import of MODEL_SUMMARIZATION_V1 remains in scope so the function compares trimmed == MODEL_SUMMARIZATION_V1 instead of a duplicated literal.
🧹 Nitpick comments (3)
src/openhuman/app_state/ops.rs (1)
510-531: 💤 Low valueConsider adding a timeout wrapper for consistency (optional).
The
service::statuscall lacks a sub-operation timeout unlike the other three. Since it runs viaspawn_blocking, it won't block the async runtime, and the outerRUNTIME_SNAPSHOT_TIMEOUTprovides a safety net. However, ifservice::statushangs, the entire snapshot build will wait the full 10s before falling back.If the service status check is known to be fast and reliable, this is fine. Otherwise, wrapping it similarly would make all four sub-operations consistent and allow faster degraded fallback.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/app_state/ops.rs` around lines 510 - 531, Wrap the spawn_blocking call that invokes crate::openhuman::service::status(&config_for_service) in a tokio::time::timeout (same RUNTIME_SNAPSHOT_TIMEOUT used by the other sub-operations) so the sub-operation fails fast on hang; on timeout map the error into the same fallback ServiceStatus variant (ServiceState::Unknown, details/message populated) and log a warning similar to the existing Err branch, keeping the rest of the async block (t timing, elapsed return) unchanged.src/openhuman/credentials/profiles.rs (1)
1096-1101: ⚡ Quick winClarify reclaim log message when age is unreadable.
The new default (
map_or(true, ...)) correctly treats pidless locks with unreadable mtime as stale, handling clock-skew and filesystem anomalies. However, the reclaim log message at lines 1125-1128 claims the lock is "older than {MALFORMED_LOCK_GRACE_MS}ms" even whenageisNone(unreadable), which is misleading for troubleshooting.Consider updating the log format to distinguish actual measured age from unreadable age:
📝 Suggested refinement
let reclaim_reason: Option<String> = match pid { Some(pid) if !is_pid_alive(pid) => Some(format!("pid {pid} not alive")), Some(pid) if too_old => Some(format!( "lock file older than {STALE_LOCK_AGE_MS}ms (recorded pid {pid}, presumed leaked)" )), - None if malformed_too_old => Some(format!( - "no parseable pid and older than {MALFORMED_LOCK_GRACE_MS}ms \ - (abandoned in-flight lock, reclaiming)" - )), + None if malformed_too_old => { + let age_desc = age.map_or( + "age unreadable (clock skew or fs issue)".to_string(), + |a| format!("older than {MALFORMED_LOCK_GRACE_MS}ms (age={}ms)", a.as_millis()) + ); + Some(format!( + "no parseable pid and {age_desc} (abandoned in-flight lock, reclaiming)" + )) + }, Some(_) => return false,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/credentials/profiles.rs` around lines 1096 - 1101, The reclaim log is misleading because malformed_too_old uses age.map_or(true, ...) and treats unreadable age (age == None) as stale, but the log message (around the reclaim path) always claims the lock is "older than {MALFORMED_LOCK_GRACE_MS}ms"; update the log in the lock-reclaiming code to branch on the age variable: when age is Some(a) log the actual age in ms (e.g., "age {a_ms}ms, threshold {MALFORMED_LOCK_GRACE_MS}ms"), and when age is None log a clear message like "mtime unreadable; treating as stale (threshold {MALFORMED_LOCK_GRACE_MS}ms)". Ensure this change references the same variables/consts used now (age, malformed_too_old, MALFORMED_LOCK_GRACE_MS) so behavior is unchanged but messaging is accurate.src/openhuman/socket/ws_loop.rs (1)
342-345: ⚡ Quick winConsider including the socket ID for log correlation.
The expanded breakdown is a clear improvement. To fully align with the coding guideline requirement for correlation fields in timeout logs, consider including the socket ID to help correlate this timeout event with other connection activity.
📊 Proposed enhancement with socket_id correlation
+ let socket_id = shared.socket_id.read().clone().unwrap_or_else(|| "?".to_string()); log::warn!( - "[socket] No server ping received within {}ms (interval={}ms + timeout={}ms + 5s grace); reconnecting", + "[socket][sid={}] No server ping received within {}ms (interval={}ms + timeout={}ms + 5s grace); reconnecting", + socket_id, timeout_ms, ping_interval, ping_timeout_ms, );As per coding guidelines: "Log entry/exit, branches, external calls, retries/timeouts, state transitions, and errors with verbose diagnostics using stable grep-friendly prefixes and correlation fields".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/socket/ws_loop.rs` around lines 342 - 345, The timeout log for the server ping currently prints timing details but lacks a correlation field; update the log call in ws_loop.rs (the ping timeout branch where the message with timeout_ms, ping_interval, ping_timeout_ms is emitted) to include the socket ID (e.g., socket_id or self.socket_id) as a correlation field so timeouts can be traced to the connection; add the socket ID to the formatted message or, if using a structured logger, include it as a named field alongside the existing timing args.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/config/schema/load.rs`:
- Around line 1128-1141: The code currently ignores fs::set_permissions errors
after inserting the path into the warned cache, which can hide persistent chmod
failures; modify the flow around warned, config_path, fs::set_permissions, and
Permissions::from_mode so you attempt the chmod, check the Result, and only mark
the path as "warned" (or remove it on failure) when the permission change
succeeds; on Err, emit a tracing::error with a stable prefix (e.g.,
"[config][chmod]"), include config_path, the attempted mode (0o600) and the
underlying error, and ensure the once-per-path cache isn't advanced on failure
so subsequent loads will re-attempt and re-log the problem.
---
Outside diff comments:
In `@src/openhuman/inference/provider/factory.rs`:
- Around line 47-60: The function is_abstract_tier_model defines a local const
MODEL_SUMMARIZATION_V1 while an identical constant has already been added and
imported; remove the local MODEL_SUMMARIZATION_V1 declaration and replace its
usage with the imported MODEL_SUMMARIZATION_V1 (keep the trimmed variable and
existing comparisons), ensuring the import of MODEL_SUMMARIZATION_V1 remains in
scope so the function compares trimmed == MODEL_SUMMARIZATION_V1 instead of a
duplicated literal.
---
Nitpick comments:
In `@src/openhuman/app_state/ops.rs`:
- Around line 510-531: Wrap the spawn_blocking call that invokes
crate::openhuman::service::status(&config_for_service) in a tokio::time::timeout
(same RUNTIME_SNAPSHOT_TIMEOUT used by the other sub-operations) so the
sub-operation fails fast on hang; on timeout map the error into the same
fallback ServiceStatus variant (ServiceState::Unknown, details/message
populated) and log a warning similar to the existing Err branch, keeping the
rest of the async block (t timing, elapsed return) unchanged.
In `@src/openhuman/credentials/profiles.rs`:
- Around line 1096-1101: The reclaim log is misleading because malformed_too_old
uses age.map_or(true, ...) and treats unreadable age (age == None) as stale, but
the log message (around the reclaim path) always claims the lock is "older than
{MALFORMED_LOCK_GRACE_MS}ms"; update the log in the lock-reclaiming code to
branch on the age variable: when age is Some(a) log the actual age in ms (e.g.,
"age {a_ms}ms, threshold {MALFORMED_LOCK_GRACE_MS}ms"), and when age is None log
a clear message like "mtime unreadable; treating as stale (threshold
{MALFORMED_LOCK_GRACE_MS}ms)". Ensure this change references the same
variables/consts used now (age, malformed_too_old, MALFORMED_LOCK_GRACE_MS) so
behavior is unchanged but messaging is accurate.
In `@src/openhuman/socket/ws_loop.rs`:
- Around line 342-345: The timeout log for the server ping currently prints
timing details but lacks a correlation field; update the log call in ws_loop.rs
(the ping timeout branch where the message with timeout_ms, ping_interval,
ping_timeout_ms is emitted) to include the socket ID (e.g., socket_id or
self.socket_id) as a correlation field so timeouts can be traced to the
connection; add the socket ID to the formatted message or, if using a structured
logger, include it as a named field alongside the existing timing args.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1cc7f704-9d4a-4e83-a879-ec9e2a061d50
📒 Files selected for processing (23)
app/src/components/oauth/OAuthProviderButton.tsxapp/src/lib/i18n/chunks/ar-4.tsapp/src/lib/i18n/chunks/bn-4.tsapp/src/lib/i18n/chunks/de-4.tsapp/src/lib/i18n/chunks/en-4.tsapp/src/lib/i18n/chunks/es-4.tsapp/src/lib/i18n/chunks/fr-4.tsapp/src/lib/i18n/chunks/hi-4.tsapp/src/lib/i18n/chunks/id-4.tsapp/src/lib/i18n/chunks/it-4.tsapp/src/lib/i18n/chunks/ko-4.tsapp/src/lib/i18n/chunks/pt-4.tsapp/src/lib/i18n/chunks/ru-4.tsapp/src/lib/i18n/chunks/zh-CN-4.tsapp/src/lib/i18n/en.tsapp/src/utils/loopbackOauthListener.tssrc/openhuman/app_state/ops.rssrc/openhuman/config/mod.rssrc/openhuman/config/schema/load.rssrc/openhuman/config/schema/types.rssrc/openhuman/credentials/profiles.rssrc/openhuman/inference/provider/factory.rssrc/openhuman/socket/ws_loop.rs
DEFAULT_TIMEOUT_SECS was lowered from 300 to 60 in tinyhumansai#2690 (loopback OAuth UX) but the unit test still asserted 300, so main has been red on every commit since c42e249. Update the assertion to match the production default.
DEFAULT_TIMEOUT_SECS was lowered from 300 to 60 in tinyhumansai#2690 (loopback OAuth UX) but the unit test still asserted 300, so main has been red on every commit since c42e249. Update the assertion to match the production default.
The factory now translates hint:summarization → summarization-v1 (PR tinyhumansai#2690 added the dedicated summarization tier with the explicit Some("summarization") arm in make_openhuman_backend). The verbatim-forward assertion in make_openhuman_backend_forwards_unknown_hint_verbatim still listed it, so the test panics on main. Replace it with a generic placeholder (hint:lightweight) that exercises the Some(_) catch-all and update the doc comment to reflect the five canonical hints.
build_chat_runtime resolves the memory workload through provider_for_role, which now returns the summarization-v1 tier by default (PR tinyhumansai#2690 added the dedicated tier). The two build_chat_runtime tests still asserted reasoning-v1, so they fail on main. Update both expectations to summarization-v1 and add inline comments explaining the routing.
…g-v1 with cloud_llm_model override build_chat_runtime resolves the memory workload through provider_for_role, which now returns summarization-v1 by default (PR tinyhumansai#2690 added the dedicated tier). When cloud_llm_model is set, the legacy cloud-memory path takes over and falls back to reasoning-v1. Update the two tests in src/openhuman/memory/chat.rs to match the resolved model id on each path.
…i#2704) main HEAD (87f8ef4) is red on checks unrelated to this feature; bundling the fixes here per request so tinyhumansai#2704's CI can go green. Each is a logical merge conflict (both source PRs green alone, broken in combination): - config/ops_tests.rs: tinyhumansai#2631 grew AutonomySettingsPatch to 7 fields, but apply_autonomy_settings_updates_action_budget still used a bare 1-field literal (E0063). Add ..Default::default(). - loopbackOauthListener.test.ts: tinyhumansai#2690 changed DEFAULT_TIMEOUT_SECS 300->60, but the bind-fail test still asserted 300. Update to 60. Verified: cargo check --tests compiles; loopbackOauthListener 9/9 pass. Co-Authored-By: Claude <noreply@anthropic.com>
After tinyhumansai#2690 added the summarization-v1 tier, hint:summarization became a canonical hint (mapped to summarization-v1 by the factory). The test was iterating over it as an unknown passthrough case, which then failed on `cargo test --lib`. Drop the now-canonical entry and update the comment.
build_chat_runtime resolves the "summarization" workload role, which post-tinyhumansai#2690 routes to summarization-v1 by default. With memory_tree.cloud_llm_model overridden, routing falls back to reasoning-v1. Update both assertions to match current behavior.
Summary
Fixes five bugs from tester feedback, grouped into three commits:
B1 —
build_runtime_snapshot10s full-timeout (🔴 HIGH)screen_intelligence,local_ai, andautocompletesub-ops in individual 5stokio::time::timeoutcalls insidebuild_runtime_snapshot; a hung engine falls back to degraded values instead of blocking the entire snapshot.B2 — Config file never chmod'd to 600 (🔴 HIGH)
load.rsnow auto-appliesfs::set_permissions(..., 0o600)when a world-readable config is detected, instead of only logging a warning. DropsMutexGuardbefore theawaitto satisfy theSendbound.B4 —
summarization-v1silently falls back toreasoning-v1(🟠 LOW)MODEL_SUMMARIZATION_V1constant and registers it inis_known_openhuman_tier; addshint:summarization→summarization-v1translation inmake_openhuman_backend; updates the fallback warning message.B5 — Stale
auth-profiles.lockwith no pid not cleaned up reliably (🟠 LOW)malformed_too_olddefault frommap_or(false)tomap_or(true)so a pidless lock whose mtime is unreadable (clock skew, platform limitation) is immediately reclaimed rather than leaving callers stuck for the full 35sLOCK_TIMEOUT_MS.M3 — Ping timeout log message unclear (🟠 LOW)
timeout_mslocal inws_loop.rsfor clarity; expands the ping-timeout log to includeinterval/timeoutbreakdown.M1 — Loopback OAuth stalls 5 min before failing (🟡 MED)
DEFAULT_TIMEOUT_SECSfrom 300 → 60 inloopbackOauthListener.ts.startupErrorinOAuthProviderButtonon timeout instead of silently logging.Test plan
summarization-v1/hint:summarizationworkloads route correctly without fallback warning (B4)Summary by CodeRabbit
New Features
Bug Fixes
Improvements