You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite. Learn more
Code Review: refactor(gateway): clean up and fix lost tunnel msgs (#4771)
This is a solid refactor. The core idea of splitting InFlightRequest into a three-state machine (Active, PendingHibernation, Hibernating) and extracting an InFlightRequestHandle that encapsulates per-request operations is a meaningful improvement over the old pattern of passing SharedState + request_id through every task. The new hibernation_task.rs also correctly handles forwarding tunnel messages during hibernation, which was the stated bug fix.
Below are issues ranging from bugs to style nits, organized by severity.
Correctness / Logic Bugs
1. has_pending_websocket_messages comment direction is inverted
engine/packages/pegboard-gateway2/src/lib.rs, line ~692:
pending_ws_msgs holds messages sent to the envoy (server-bound, buffered pending ack) — not messages from the client. Messages from the client buffered during hibernation live in pending_tunnel_msgs. The comment says "pending ws messages from the client" but checks the wrong field. Either the check should be on pending_tunnel_msgs, or the comment needs to clarify these are "pending envoy-bound ws messages (not yet acked)". The inversion may mean the early-wake behavior is wrong.
2. PendingHibernation entries are silently GC'd with no signal to waiters
engine/packages/pegboard-gateway2/src/shared_state.rs, gc_in_flight_requests, line ~253:
When a request gets stuck in PendingHibernation (e.g. the actor wakes up between start_hibernation() being called and get_hibernating_in_flight_request() being called), the GC silently removes it with no drop_tx signal. Any code waiting on drop_rx or msg_rx for that request will hang indefinitely. Consider adding a drop_tx to PendingHibernation or preventing GC of that transient state entirely.
3. Race between get_hibernating_in_flight_request and GC
In handle_websocket_hibernation, get_hibernating_in_flight_request transitions the state to Hibernating, and then has_pending_websocket_messages is called on the handle. Between these two calls, gc_in_flight_requests could observe the Hibernating entry and remove it if pending_ws_msgs have timed out. The subsequent has_pending_websocket_messages call would then return an error instead of Continue. The window is tiny but the error handling path may not be correct for this case.
Missing Functionality / Behavioral Regressions
4. Removed reply_to/opened logic has no documented replacement
The old code included a reply_to field on the first tunnel message so the envoy learned where to send responses. This logic was removed entirely without explanation. If nothing has replaced it in the protocol layer, the envoy will not know where to route replies for new connections, which is a silent functional regression. Please document whether this was intentionally replaced (and where) or whether it was accidentally dropped.
5. pending_tunnel_msgs not checked in early-wake condition
The early-wake check only checks pending_ws_msgs. If there are only buffered pending_tunnel_msgs (messages from the envoy during hibernation), the early-wake is skipped. Those messages will still be forwarded when wake() is eventually called, but the hibernation task runs unnecessarily. Adding a check for !pending_tunnel_msgs.is_empty() here would allow an early exit and clarify intent.
Code Quality Issues
6. replace_with crate added without justification for abort semantics
Cargo.toml adds the replace_with crate for two replace_with_or_abort call sites. The _or_abort variant aborts the entire process on any panic inside the closure with no unwinding or clean shutdown. This is a heavy semantic choice that should be justified in a comment, and the closures should clearly be panic-free. If a standard std::mem::replace with a sentinel/default state achieves the same result, prefer that to avoid the abort semantics.
7. Commented-out dead code should be removed
engine/packages/pegboard-gateway2/src/shared_state.rs, line ~891 contains a commented-out wrapping_lt function. Remove this or restore it. Commented-out code adds noise without historical value (git blame serves that purpose).
8. Hibernating branch in ping_task uses bail! for a non-recoverable case
ping_task is only spawned for active websocket connections, so hitting the PendingHibernation/Hibernating branch indicates a task lifecycle ordering bug, not a recoverable error. Use unreachable! with a descriptive message so it surfaces clearly in logs rather than propagating as an unstructured anyhow error.
Several new comments do not follow the project convention of complete sentences:
shared_state.rs: // TODO: Kinda ugly but avoids clones and whatnot is colloquial. Prefer: // Using replace_with avoids cloning the state and its nested channels.
hibernation_task.rs doc comment: /// Peeks client ws until a message is received. is misleading. The task also forwards tunnel messages and handles actor-ready events. Update to reflect full responsibility.
10. anyhow! macro used directly (CLAUDE.md preference)
Per CLAUDE.md: "Prefer .context() over the anyhow! macro." The HibernationLifecycleResult::Aborted error in lib.rs should use a structured RivetError derive or .context() on an existing error.
11. #[tracing::instrument(skip_all)] on a hot-path inner function
Adding #[tracing::instrument] to recv_message creates a new span per message with no parent context (called from inside retain_async closures and tight receive loops). This adds tracing overhead on a hot path without useful context. Consider removing the attribute and keeping only the debug log line.
12. InFlightRequestCtx naming is misleading
InFlightRequestCtx is a short-lived struct that is destructured immediately at call sites into channels + a handle. The Ctx suffix implies a long-lived configuration type. Consider InFlightRequestChannels or InFlightRequestInit to match its actual one-time-destructure use pattern.
Summary
The PR correctly fixes the lost-tunnel-messages bug and the InFlightRequestHandle abstraction is a real improvement. Key items to address before merging:
Correctness: Document or restore the reply_to/opened behavior removed from the tunnel message path (issue 4 above).
Correctness: PendingHibernation entries are silently removed by GC with no signal to waiters (issue 2).
Comment accuracy: The "pending ws messages from client" comment direction appears inverted (issue 1).
replace_with crate: Add a comment justifying the abort-on-panic semantics vs. a simpler mem::replace approach (issue 6).
Dead code: Remove the commented-out wrapping_lt function (issue 7).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Please include a summary of the changes and the related issue. Please also include relevant motivation and context.
Type of change
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
Checklist: