fix(core): port-bind fallback + retry when 7788 is busy (#1613)#2116
Conversation
…nyhumansai#1613) Adds a sentry breadcrumb alongside the existing log.warn when apply_embedded_ready_signal detects the embedded core bound to a fallback port. Tauri shell's sentry crate ships without the tracing feature, so log macros do not auto-propagate as breadcrumbs — explicit add_breadcrumb is required for the fallback event to surface alongside future error reports. Refs tinyhumansai#1613
📝 WalkthroughWalkthroughAdds preferred-first port selection with retry/probing and fallback, sends EmbeddedReadySignal with chosen port, updates CoreProcessHandle to track active/preferred ports plus a consumable fallback notice, shows a toast on fallback, and adds tests covering these flows. ChangesPort Fallback and Dynamic Selection
Sequence DiagramsequenceDiagram
participant User
participant App as TauriApp
participant Cmd as start_core_process
participant Handle as CoreProcessHandle
participant Server as EmbeddedServer
participant Picker as pick_listen_port_for_host
User->>App: Launch
App->>Cmd: invoke(state, app)
Cmd->>Handle: ensure_running()
Handle->>Server: spawn with ready_tx
Server->>Picker: pick_listen_port_for_host(127.0.0.1, preferred)
Picker-->>Server: (listener, port, fallback_from)
Server->>Server: bind to returned listener
Server->>Handle: send EmbeddedReadySignal(port, fallback_from)
Handle->>Handle: update active_port and store PortFallbackNotice
Handle-->>Cmd: Ok
Cmd->>Handle: take_last_port_fallback_notice()
Handle-->>Cmd: Some(notice)
Cmd->>App: show toast("Using port X, fell back from Y")
App->>User: notification
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/openhuman/connectivity/rpc.rs (1)
629-642:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winIsolate these env-var tests from
OPENHUMAN_CORE_RPC_URL.
resolve_listen_port()now prefersOPENHUMAN_CORE_RPC_URL, but the default/invalid-env tests only save and clearOPENHUMAN_CORE_PORT. An inherited URL from the runner will make these assertions nondeterministic.Also applies to: 665-695
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/openhuman/connectivity/rpc.rs` around lines 629 - 642, The tests around resolve_listen_port() only save/clear OPENHUMAN_CORE_PORT but resolve_listen_port() now also reads OPENHUMAN_CORE_RPC_URL, causing flaky behavior; update the tests (the functions using ENV_LOCK and resolve_listen_port()) to also save the current OPENHUMAN_CORE_RPC_URL (e.g., let prev_url = std::env::var("OPENHUMAN_CORE_RPC_URL").ok()), remove or clear that env (unsafe { std::env::remove_var("OPENHUMAN_CORE_RPC_URL"); }) for the test duration, and restore it at the end alongside OPENHUMAN_CORE_PORT; apply the same change to the other test block mentioned (around lines covering 665-695) so both tests isolate OPENHUMAN_CORE_RPC_URL as well as OPENHUMAN_CORE_PORT.src/core/jsonrpc.rs (1)
913-938:⚠️ Potential issue | 🟠 Major | ⚡ Quick winSynchronize the chosen port into process env on the standalone path too.
run_server_inner()can now fall back forrun_server(), but only the embedded ready callback updatesOPENHUMAN_CORE_RPC_URL. If the standalone CLI lands on 7789+,openhuman.connectivity_diagwill still report the old configured port becauseresolve_listen_port()reads env state, not the bound listener.Proposed fix
let listen_port = pick.port; let bind_addr = format!("{host}:{listen_port}"); + std::env::set_var("OPENHUMAN_CORE_RPC_URL", format!("http://{bind_addr}/rpc")); let listener = pick.listener;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/core/jsonrpc.rs` around lines 913 - 938, The standalone startup path doesn't update OPENHUMAN_CORE_RPC_URL, so resolve_listen_port() may read the old env; after selecting the final listen_port/bind_addr in run_server_inner (the same place you send the EmbeddedReadySignal), set the OPENHUMAN_CORE_RPC_URL env var to the full URL (http://{bind_addr}) so both embedded and standalone paths synchronize; modify the block that currently sends EmbeddedReadySignal (or the branch where ready_tx is None) to call std::env::set_var("OPENHUMAN_CORE_RPC_URL", format!("http://{bind_addr}")) using the same bind_addr and listen_port values so resolve_listen_port() sees the updated port.
🧹 Nitpick comments (1)
app/src-tauri/src/core_process_tests.rs (1)
140-149: 💤 Low valueHardcoded port 7789 expectation may cause environmental flakiness.
If another process happens to occupy port 7789 in the test environment (e.g., parallel test run, CI agent state), this test will fail with a misleading assertion error even though the fallback mechanism works correctly (by choosing 7790+). The skip logic for 7788 already acknowledges environmental constraints.
Consider either binding 7789 as well to guarantee the scenario, or relaxing the assertion to verify the fallback port is in the valid range (7789–7798) rather than exactly 7789:
let chosen = handle.port(); assert!( (7789..=7798).contains(&chosen), "fallback should choose a port in 7789–7798, got {chosen}" );Not a blocker since 7789 conflicts are rare, but could prevent sporadic CI failures.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src-tauri/src/core_process_tests.rs` around lines 140 - 149, The test currently asserts handle.port() == 7789 and notice.chosen_port == 7789 which can flake if 7789 is already occupied; update the assertions to allow any valid fallback in the configured range instead of a single hardcoded value: assert that handle.port() (from handle.port()) is within 7789..=7798 and that notice.chosen_port (from take_last_port_fallback_notice().chosen_port) is also within that same range, or alternatively bind 7789 before starting the core to guarantee it; adjust the failure message to include the actual chosen port for easier debugging.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/connectivity/rpc.rs`:
- Around line 216-238: The fallback loop that calls TcpListener::bind on each
entry of fallback_ports (inside the function that returns PickListenPortResult)
currently tries each fallback only once and treats AddrInUse as final, causing
NoAvailablePort when transient conflicts exist; change the logic to retry binds
for each fallback similar to how the preferred port is retried (i.e., on
Err(err) if err.kind() == ErrorKind::AddrInUse loop with a short backoff and a
bounded number of attempts per fallback), preserving the existing warn/debug
messages and returning the same PickListenPortResult when a bind succeeds;
ensure you reference and update the loop using fallback_ports,
TcpListener::bind, preferred, fingerprint.as_human_readable(), and the returned
PickListenPortResult so retries are applied to every fallback candidate.
- Around line 248-255: identify_listener builds URLs like
"http://{probe_host}:{port}/" which breaks for IPv6 literals (e.g., "::1")
because they must be bracketed; fix by detecting when probe_host contains ':'
(an IPv6 literal) and, if not already wrapped, wrap it in square brackets before
constructing url (update the code that sets probe_host or the url construction
using identify_listener and the url variable to produce "http://[::1]:7788/" for
IPv6 addresses, avoiding double-bracketing).
---
Outside diff comments:
In `@src/core/jsonrpc.rs`:
- Around line 913-938: The standalone startup path doesn't update
OPENHUMAN_CORE_RPC_URL, so resolve_listen_port() may read the old env; after
selecting the final listen_port/bind_addr in run_server_inner (the same place
you send the EmbeddedReadySignal), set the OPENHUMAN_CORE_RPC_URL env var to the
full URL (http://{bind_addr}) so both embedded and standalone paths synchronize;
modify the block that currently sends EmbeddedReadySignal (or the branch where
ready_tx is None) to call std::env::set_var("OPENHUMAN_CORE_RPC_URL",
format!("http://{bind_addr}")) using the same bind_addr and listen_port values
so resolve_listen_port() sees the updated port.
In `@src/openhuman/connectivity/rpc.rs`:
- Around line 629-642: The tests around resolve_listen_port() only save/clear
OPENHUMAN_CORE_PORT but resolve_listen_port() now also reads
OPENHUMAN_CORE_RPC_URL, causing flaky behavior; update the tests (the functions
using ENV_LOCK and resolve_listen_port()) to also save the current
OPENHUMAN_CORE_RPC_URL (e.g., let prev_url =
std::env::var("OPENHUMAN_CORE_RPC_URL").ok()), remove or clear that env (unsafe
{ std::env::remove_var("OPENHUMAN_CORE_RPC_URL"); }) for the test duration, and
restore it at the end alongside OPENHUMAN_CORE_PORT; apply the same change to
the other test block mentioned (around lines covering 665-695) so both tests
isolate OPENHUMAN_CORE_RPC_URL as well as OPENHUMAN_CORE_PORT.
---
Nitpick comments:
In `@app/src-tauri/src/core_process_tests.rs`:
- Around line 140-149: The test currently asserts handle.port() == 7789 and
notice.chosen_port == 7789 which can flake if 7789 is already occupied; update
the assertions to allow any valid fallback in the configured range instead of a
single hardcoded value: assert that handle.port() (from handle.port()) is within
7789..=7798 and that notice.chosen_port (from
take_last_port_fallback_notice().chosen_port) is also within that same range, or
alternatively bind 7789 before starting the core to guarantee it; adjust the
failure message to include the actual chosen port for easier debugging.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5edabb2a-01d1-4fda-bfd2-04e98e372d1e
📒 Files selected for processing (5)
app/src-tauri/src/core_process.rsapp/src-tauri/src/core_process_tests.rsapp/src-tauri/src/lib.rssrc/core/jsonrpc.rssrc/openhuman/connectivity/rpc.rs
…Use (tinyhumansai#1613) Previously each fallback port (7789–7798) was single-shot bound; a transient race (AV scanner, prior-instance teardown) on any fallback would surface as NoAvailablePort even though the retry contract added in 3ef9271 applied only to the preferred port. Apply the same retry_policy (3 attempts × 500ms backoff) per fallback candidate, mirroring the preferred-port path. Addresses CodeRabbit review on PR tinyhumansai#2116 (rpc.rs:216-238).
…inyhumansai#1613) `http://::1:7788/` violates RFC 3986 and fails URL parsing — colons in the IPv6 literal are mis-read as host:port. This made identify_listener mis-classify live OpenHuman cores bound on IPv6 addresses as `Other`, breaking the stale-listener takeover path (tinyhumansai#1130) for any IPv6 deployment. Detect IPv6 literals (colon in host, not already bracketed) and wrap the authority in `[]` per RFC 3986 / RFC 2732 before format!ing the probe URL. Addresses CodeRabbit review on PR tinyhumansai#2116 (rpc.rs:248-255).
…CORE_RPC_URL (tinyhumansai#1613) resolve_listen_port() reads OPENHUMAN_CORE_RPC_URL ahead of OPENHUMAN_CORE_PORT (higher priority), so the default and invalid-env tests need to save and clear both env vars — otherwise an inherited URL from the test runner makes the assertions nondeterministic. The honours_env_override test also now clears OPENHUMAN_CORE_RPC_URL so the OPENHUMAN_CORE_PORT override is the resolved value rather than being shadowed by an inherited URL. Addresses CodeRabbit review on PR tinyhumansai#2116 (rpc.rs:629-695, outside-diff).
…ack (tinyhumansai#1613) Only the embedded ready callback (apply_embedded_ready_signal) was updating OPENHUMAN_CORE_RPC_URL when fallback engaged. The standalone openhuman-core CLI never refreshed the env var, so openhuman.connectivity_diag (which reads via resolve_listen_port) would report the originally-requested port instead of the actually-bound listener after fallback to 7789+. Set the env var unconditionally right after the bind succeeds so both paths keep diag honest. Embedded path still re-sets the same value through the ready signal — idempotent, no behavior change for the embedded case. Addresses CodeRabbit review on PR tinyhumansai#2116 (jsonrpc.rs:913-938, outside-diff).
…yhumansai#1613) The fixed-port fallback test asserted handle.port() == 7789 and notice.chosen_port == 7789. Under parallel test runs or environmental state where 7789 is occupied (CI agent residue, AV scanner), the actual fallback would correctly pick 7790+ but the assertion would fail with a misleading message even though the recovery contract was upheld. Accept any port in the configured fallback range 7789..=7798 instead, and include the chosen port in the failure message for easier debugging. The skip-on-bind-failure logic for 7788 already acknowledges the same environmental flakiness. Addresses CodeRabbit review on PR tinyhumansai#2116 (core_process_tests.rs:140-149).
|
@coderabbitai — addressing the outside-diff + nitpick items from your review:
|
|
Tip For best results, initiate chat on the files or code changes. 🧩 Analysis chain🏁 Script executed: Unrecoverable error: Failed to clone repository
All the outside-diff and nitpick items from the review are addressed. The implementation looks solid. |
# Conflicts: # src/core/jsonrpc.rs
Subtracting small offsets (-30/-60/-90s) from `Date.now() - 86400000` could cross the previous-day boundary when CI happens to run within 90 seconds of local midnight. The "Peak: 4" assertion then dropped to 3 because one of the four timestamps landed on the day before yesterday. Anchor the helper at noon yesterday so subtractions stay safely inside the same day.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
app/src/components/intelligence/MemoryHeatmap.test.tsx (1)
69-69: ⚡ Quick winConsider using the noon-anchored helper for consistency.
The millisecond test still uses the old
Date.now() - 24hpattern, which could encounter the same midnight-boundary flakiness that theyesterdaySeconds()refactor addressed. For consistency and improved stability, consider:- const yMs = Date.now() - 24 * 60 * 60 * 1000; + const yMs = yesterdaySeconds() * 1000;This reuses the deterministic helper and still produces a millisecond timestamp that validates the component's handling of ms-precision values.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@app/src/components/intelligence/MemoryHeatmap.test.tsx` at line 69, Replace the fragile midnight-dependent millisecond timestamp with the deterministic helper used elsewhere: instead of computing yMs via Date.now() - 24*60*60*1000, derive it from the existing yesterdaySeconds() helper and convert to milliseconds (e.g., use yesterdaySeconds() * 1000) so MemoryHeatmap.test.tsx's yMs uses the same noon-anchored deterministic value as the other tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@app/src/components/intelligence/MemoryHeatmap.test.tsx`:
- Line 69: Replace the fragile midnight-dependent millisecond timestamp with the
deterministic helper used elsewhere: instead of computing yMs via Date.now() -
24*60*60*1000, derive it from the existing yesterdaySeconds() helper and convert
to milliseconds (e.g., use yesterdaySeconds() * 1000) so
MemoryHeatmap.test.tsx's yMs uses the same noon-anchored deterministic value as
the other tests.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 98bb2cc6-f329-4629-8a43-c73636c88850
📒 Files selected for processing (1)
app/src/components/intelligence/MemoryHeatmap.test.tsx
cargo-llvm-cov instrumentation roughly doubles cold-start time for the embedded core. The 10s budget added in tinyhumansai#2116 still flakes the ensure_running_falls_back_to_7789_when_7788_is_busy test on the Tauri coverage runner. 20s gives the readiness probe enough slack without affecting user-visible startup (ready signal lands in <1s on normal runs and the loop exits immediately).
…#1613) (tinyhumansai#2116) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
…#1613) (tinyhumansai#2116) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
…#1613) (tinyhumansai#2116) Co-authored-by: Steven Enamakel <enamakel@tinyhumans.ai>
Summary
pick_listen_porthelper that probes preferred port 7788 first, then iterates fallback range 7789–7798 when occupied, with 3 × 500ms retry-with-backoff to absorb transientEADDRINUSEraces (AV scanners, prior-instance teardown).WouldTakeOverso the Tauri shell can run the existing kill-and-revalidate flow.EmbeddedReadySignalsoCoreState.active_portreflects the actual bind, andOPENHUMAN_CORE_RPC_URLupdates before the frontend reads it. Toast surfacesOpenHuman is using port 778X because 7788 was busy.core.port: port_fallback_engaged: 7788 -> 778X) attaches to subsequent error reports so post-mortem context shows the fallback engaged.Problem
On Windows when port 7788 is already bound (zombie OpenHuman, AV scanner, firewall),
TcpListener::bind(("0.0.0.0", 7788))insrc/core/jsonrpc.rsfails withos error 10013 (WSAEACCES), the embedded task exits, andensure_runningsurfacesin-process core server exited before becoming ready. The app becomes unstartable with no fallback, no retry, and no user-facing guidance — Sentry has captured 10 events from a single affected user attempting repeated relaunches.Solution
pick_listen_port(preferred) -> Result<(TcpListener, u16), PickListenPortError>insrc/openhuman/connectivity/rpc.rs:bind(preferred). Success → returns immediately.EADDRINUSE: calls existingidentify_listener(preferred). If the squatter is an OpenHuman core, returnsWouldTakeOver { preferred, .. }. Otherwise iterates 7789..=7798 and returns the first successful bind, logging[core] preferred port 7788 in use by <fingerprint>; bound to <port>at warn.RetryPolicy::DEFAULT(3 attempts, 500ms backoff) wraps the bind to absorb transient races before theWouldTakeOverclassification kicks in.run_server_inner(src/core/jsonrpc.rs) callspick_listen_port_for_host; the ready signal carriesportplusfallback_from: Option<u16>so the Tauri shell knows when fallback engaged.apply_embedded_ready_signal(app/src-tauri/src/core_process.rs) updatesactive_port, refreshesOPENHUMAN_CORE_RPC_URL, records aPortFallbackNotice, emits a Sentry breadcrumb, and the Tauri shell raises a notification toast.Submission Checklist
pick_listen_port_preferred_free,pick_listen_port_openhuman_listener_requests_takeover,pick_listen_port_other_listener_falls_back,pick_listen_port_all_candidates_busy_errors,pick_listen_port_retries_transient_addr_in_use) plus 3 newcore_process_tests(ready_signal_updates_runtime_port_and_fallback_notice,ensure_running_falls_back_for_unknown_listener_on_port,ensure_running_falls_back_to_7789_when_7788_is_busy) cover the bind path, takeover preservation, fallback range, and retry semantics.docs/TEST-COVERAGE-MATRIX.mdhas no row for port-bind recovery (infrastructure-only, not a user-facing feature surface)## RelatedCloses #NNNin the## RelatedsectionImpact
WSAEACCES. No behavioural change when 7788 is free (still the preferred bind).GET /probe + 3 × 500ms retry on the cold path (only when 7788 is occupied). Hot path unchanged.port+fallback_from) — internal to embedded-server boundary, no public RPC surface change.Related
AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/1613-port-bind-fallbackValidation Run
cargo check --manifest-path app/src-tauri/Cargo.tomlcargo fmt --check(touched files)cargo test --manifest-path app/src-tauri/Cargo.toml --lib core_process— 19 passedcargo test --lib openhuman::connectivity— all new tests compile + runValidation Blocked
command:cargo clippy --all-targets -- -D warningserror:36 pre-existing clippy errors onapp/src-tauri(needless_return,needless_splitn, etc.) unrelated to this changeimpact:strict-clippy gate is broken on upstream/main; project-wide cleanup tracked separatelyBehavior Changes
EADDRINUSE; toast + breadcrumb when fallback engagesParity Contract
ensure_running(own task + own port) skips probe;OPENHUMAN_CORE_REUSE_EXISTING=1legacy attach behavior intactSummary by CodeRabbit
New Features
Bug Fixes
Tests
Tests (UI)