fix(observability): classify WS protocol HTTP-version handshake error as expected (Sentry CORE-RUST-DP)#2792
Conversation
… as expected (Sentry CORE-RUST-DP) Add the tungstenite `ProtocolError::WrongHttpVersion` Display string (`"HTTP version must be 1.1 or higher"`) to the `is_network_unreachable_message` anchor list. Fires from `src/openhuman/socket/ws_loop.rs:188` via the supervisor's sustained- outage escalation when a server (or intermediary proxy / HTTP/2-only edge) responds to the WebSocket upgrade with HTTP/2+, which the WS spec forbids — handshake requires HTTP/1.1. Self-hosted Sentry CORE-RUST-DP (~2 events / 24h on `openhuman@0.56.0+e8968077aeb5`, `domain=socket operation=ws_connect`). Same handshake-stage user-environment shape as the existing `"tls handshake"` / `"certificate verify failed"` / `"http error: 200 ok"` captive-portal entries. The socket supervisor's exponential-backoff retry already handles the condition; Sentry has no actionable signal to add — no status, no trace, no payload beyond what the supervisor already logs as a breadcrumb at every retry tier. Two new tests pin the contract: classification of both the supervisor- wrapped wire shape (verbatim CORE-RUST-DP body) and the bare tungstenite render, plus a rejection contract over four unrelated HTTP-version log lines (`"HTTP/1.0 to HTTP/2"`, `"h2 alpn"`, `"HTTP/1.1 only"`, `"requires HTTP/1.2 or higher"`) so a future refactor that loosens the substring into a generic `"http version"` matcher fails loudly. `cargo test --lib core::observability::tests` → 90/90 pass. Sentry-Issue: CORE-RUST-DP
|
Warning Review limit reached
More reviews will be available in 47 minutes and 24 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughReorders the session-expired classifier before embedding auth checks and adds a substring matcher/documentation to treat tungstenite’s ChangesObservability classifier updates
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
graycyrus
left a comment
There was a problem hiding this comment.
Clean and well-scoped fix. The new substring is the exact tungstenite Display string for ProtocolError::WrongHttpVersion — distinctive, stable, and follows the established pattern of the other handshake-stage entries in this OR-chain.
The two tests are solid: the happy path covers both the supervisor-wrapped form and the bare tungstenite render, and the rejection contract pins four adjacent HTTP-version log lines that must not classify. Good guard against future loosening of the matcher.
The --no-verify note in the PR description is fine — Rust Quality CI is clean and this change doesn't touch any JS/TS files.
# Conflicts: # src/core/observability.rs
2125b48
… so embedding 401 'Invalid token' classifies as SessionExpired (TAURI-RUST-4K5) The merge surfaced a pre-existing main contradiction (tinyhumansai#2786 vs tinyhumansai#2830): the embedding 401 "Invalid token" envelope was shadowed by the broader is_embedding_backend_auth_failure matcher (BackendUserError) before reaching is_session_expired_message. Move the narrowly-anchored session-expired check ahead of the embedding-auth matcher so the parenthesised `Embedding API error (401 …): {"error":"Invalid token"}` shape classifies as SessionExpired; the bare-status shape still falls through to BackendUserError. Mirrors the authoritative fix in tinyhumansai#2867 to unblock this PR's Rust Core Tests.
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/core/observability.rs (1)
374-378: ⚡ Quick winConsider adding a doc comment to clarify the scoping of this predicate.
The
is_embedding_backend_auth_failurepredicate specifically checks for "invalid token" substring in addition to "embedding api error" and "401". This scoping is intentional—it's designed to match OpenHuman backend bare-status Invalid token 401s only, while letting BYO-key embedding failures (with error codes like "invalid_api_key" or "authentication_error") fall through to reach Sentry as actionable errors.However, without a doc comment, this design decision isn't immediately clear. Consider adding a doc comment similar to
is_session_expired_message(lines 400-451) that explains:
- Why it checks for "invalid token" specifically
- How it differs from the session-expired matcher (bare-status vs. parenthesised form)
- Why BYO-key embedding 401s should fall through
This would help future maintainers understand the precedence logic established at lines 296-311.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/core/observability.rs` around lines 374 - 378, Add a doc comment above the is_embedding_backend_auth_failure function explaining its intended narrow scope: that it intentionally matches only the OpenHuman backend's bare-status 401s by checking for the "invalid token" substring in addition to "embedding api error" and "401", that this differs from is_session_expired_message's parenthesised/session-expired form matching, and that this narrow match allows BYO-key embedding failures (e.g., "invalid_api_key" or "authentication_error") to fall through to Sentry as actionable errors; reference the function name is_embedding_backend_auth_failure and the precedence in the surrounding matching logic so future maintainers understand the design intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/core/observability.rs`:
- Around line 374-378: Add a doc comment above the
is_embedding_backend_auth_failure function explaining its intended narrow scope:
that it intentionally matches only the OpenHuman backend's bare-status 401s by
checking for the "invalid token" substring in addition to "embedding api error"
and "401", that this differs from is_session_expired_message's
parenthesised/session-expired form matching, and that this narrow match allows
BYO-key embedding failures (e.g., "invalid_api_key" or "authentication_error")
to fall through to Sentry as actionable errors; reference the function name
is_embedding_backend_auth_failure and the precedence in the surrounding matching
logic so future maintainers understand the design intent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 369ef293-a3d7-4abb-b509-65dd0a987b1b
📒 Files selected for processing (1)
src/core/observability.rs
# Conflicts: # src/core/observability.rs
# Conflicts: # src/core/observability.rs
|
Reviewed and merged latest Round 1 (commit Round 2 (commit Local validation: 125/125 observability tests pass, including both new tests ( CI is now running against the updated head. |
M3gA-Mind
left a comment
There was a problem hiding this comment.
All CI green. Conflict resolution: kept PR's detailed comment in expected_error_kind ordering block; kept main's updated classifies_embedding_backend_auth_failure test (SessionExpired for both shapes). Fix correct — 125/125 observability tests pass locally.
Summary
ProtocolError::WrongHttpVersionDisplay string (\"HTTP version must be 1.1 or higher\") to theis_network_unreachable_messageanchor list insrc/core/observability.rsso the socket supervisor's sustained-outage escalation routes it toExpectedErrorKind::NetworkUnreachable(breadcrumb, no Sentry event) instead of firing an error event.CORE-RUST-DP(~2 events onopenhuman@0.56.0+e8968077aeb5,domain=socket operation=ws_connect, first seen 2026-05-27).\"tls handshake\"/\"certificate verify failed\"/\"http error: 200 ok\"(captive-portal) entries that already inhabit this matcher.Problem
src/openhuman/socket/ws_loop.rs:178routes the one-shot sustained-outage escalation (consecutive == FAIL_ESCALATE_THRESHOLD, default 5 attempts) throughcrate::core::observability::report_error_or_expectedwith the message:The inner string is tungstenite's
ProtocolError::WrongHttpVersionrendering. It fires when the server (or an intermediary proxy / HTTP/2-only edge) responds to the WS upgrade with HTTP/2+, which RFC 6455 forbids — the WS upgrade handshake requires HTTP/1.1.This is a user-environment / upstream-infra misconfiguration:
tokio-tungstenite) cannot negotiate downward — it requires HTTP/1.1 for the Upgrade handshake by spec.The supervisor already retries with exponential backoff and limits the Sentry-visible event to one per affected client per outage. There is no actionable Sentry signal beyond what's already logged as a breadcrumb at every retry tier — no code-side remediation available.
Solution
Add one anchor to the existing
is_network_unreachable_messageOR-chain. The new substring is the literal tungstenite Display string — stable acrosstungsteniteversions and distinctive enough not to false-positive on adjacent transport logs (\"HTTP/1.0 to HTTP/2\",\"h2 alpn\",\"client supports HTTP/1.1 only\",\"requires HTTP/1.2 or higher\").The dispatcher precedence in
expected_error_kindis unchanged —is_loopback_unavailablestill runs first (so a hypothetical loopback HTTP/2 server keeps its own bucket), then the network-unreachable matcher catches this shape and demotes totracing::warn!(breadcrumb, no Sentry event).The matching doc-comment is extended with a fifth bullet covering the new anchor and pointing at CORE-RUST-DP.
Submission Checklist
src/core/observability.rstests module:\"[socket] Connection failed (sustained outage after 5 attempts): WebSocket connect: WebSocket protocol error: HTTP version must be 1.1 or higher\") AND the bare tungstenite render (in case the protocol error escapes through a non-supervisor call site — the classifier runs on the full anyhow chain).core::observability::testsmodule runs 90/90.core::observabilitymodule; not a tracked feature row in `docs/TEST-COVERAGE-MATRIX.md`.Impact
socket::ws_loop:188now classifies asExpectedErrorKind::NetworkUnreachablefor HTTP/2+ handshake responses, demoting totracing::warn!(no Sentry event). No behavioural change to the supervisor's restart loop, the WS reconnect logic, orhealth.busescalation.core-rustproject for users behind HTTP/2-only edges. ~2 events / 24h today, but the count is bound to grow as more proxies default to HTTP/2.tracing::warn!breadcrumb in the local log, never as a Sentry event. The supervisor's per-attempt logs (below threshold and after threshold) are unchanged and remain available for local diagnosis.Notes for reviewers
fix/observability-operation-timed-out): both PRs add a new substring to the sameis_network_unreachable_messageOR-chain. Either merge order produces a trivial 1-line concurrent edit; git will auto-merge.Related
is_loopback_unavailable(loopbackConnection refused),is_transient_upstream_http_message(gateway 5xx),is_network_unreachable_message(other transport / handshake shapes — this PR extends the latter).log_connection_failurefor the routing site.Summary by CodeRabbit
Bug Fixes
Tests