refactor(gateway): clean up and fix lost tunnel msgs #4771

Open
MasterPtato wants to merge 1 commit into 04-24-refactor_pb_fix_actor_overdue_alarm_handling from 04-24-refactor_gateway_clean_up_and_fix_lost_tunnel_msgs

Conversation

@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@claude

claude Bot commented Apr 24, 2026

Code Review: refactor(gateway): clean up and fix lost tunnel msgs (#4771)

This is a solid refactor. The core idea of splitting InFlightRequest into a three-state machine (Active, PendingHibernation, Hibernating) and extracting an InFlightRequestHandle that encapsulates per-request operations is a meaningful improvement over the old pattern of passing SharedState + request_id through every task. The new hibernation_task.rs also correctly handles forwarding tunnel messages during hibernation, which was the stated bug fix.
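For readers unfamiliar with the pattern, the three-state lifecycle can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `RequestState`, `InvalidTransition`, and `begin_hibernating` are hypothetical names, the real `InFlightRequest` variants presumably carry channels and buffers, and the allowed transitions (in particular, waking only from `Hibernating`) are an assumption.

```rust
/// Hypothetical, payload-free stand-in for the three states named in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestState {
    Active,
    PendingHibernation,
    Hibernating,
}

/// Returned when an operation is attempted from a state that does not allow it.
#[derive(Debug, PartialEq, Eq)]
struct InvalidTransition {
    from: RequestState,
    op: &'static str,
}

impl RequestState {
    /// Active -> PendingHibernation (assumed to mirror `start_hibernation()`).
    fn start_hibernation(self) -> Result<RequestState, InvalidTransition> {
        match self {
            RequestState::Active => Ok(RequestState::PendingHibernation),
            from => Err(InvalidTransition { from, op: "start_hibernation" }),
        }
    }

    /// PendingHibernation -> Hibernating (assumed to mirror
    /// `get_hibernating_in_flight_request()`).
    fn begin_hibernating(self) -> Result<RequestState, InvalidTransition> {
        match self {
            RequestState::PendingHibernation => Ok(RequestState::Hibernating),
            from => Err(InvalidTransition { from, op: "begin_hibernating" }),
        }
    }

    /// Hibernating -> Active (assumed to mirror `wake()`).
    fn wake(self) -> Result<RequestState, InvalidTransition> {
        match self {
            RequestState::Hibernating => Ok(RequestState::Active),
            from => Err(InvalidTransition { from, op: "wake" }),
        }
    }
}
```

Making each transition a function that rejects unexpected source states keeps ordering bugs visible at the call site instead of letting them silently corrupt the request map.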

Below are issues ranging from bugs to style nits, organized by severity.


Correctness / Logic Bugs

1. has_pending_websocket_messages comment direction is inverted

engine/packages/pegboard-gateway2/src/lib.rs, line ~692:

pending_ws_msgs holds messages sent to the envoy (server-bound, buffered pending ack), not messages from the client. Messages from the client buffered during hibernation live in pending_tunnel_msgs. The comment says "pending ws messages from the client", but the code checks pending_ws_msgs. Either the check should be on pending_tunnel_msgs, or the comment should clarify that these are "pending envoy-bound ws messages (not yet acked)". If the check rather than the comment is what is inverted, the early-wake behavior may be wrong.

2. PendingHibernation entries are silently GC'd with no signal to waiters

engine/packages/pegboard-gateway2/src/shared_state.rs, gc_in_flight_requests, line ~253:

InFlightRequestState::PendingHibernation { .. } => {}

When a request gets stuck in PendingHibernation (e.g. the actor wakes up between start_hibernation() being called and get_hibernating_in_flight_request() being called), the GC silently removes it with no drop_tx signal. Any code waiting on drop_rx or msg_rx for that request will hang indefinitely. Consider adding a drop_tx to PendingHibernation or preventing GC of that transient state entirely.
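One possible shape for the first suggestion is sketched below. It assumes a hypothetical `PendingEntry` struct with an `expired` flag and a `drop_tx` sender, and it uses `std::sync::mpsc` as a stand-in for whatever async channel the gateway actually uses. The point is only that the GC notifies the waiter before discarding the entry.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for a request parked in `PendingHibernation`.
struct PendingEntry {
    /// Set by whatever timeout logic decides the entry is stale (assumed).
    expired: bool,
    /// Signals the waiting task that the entry was dropped by GC.
    drop_tx: Sender<()>,
}

/// Removes expired entries and notifies their waiters.
/// Returns the number of entries removed.
fn gc_pending(requests: &mut HashMap<u64, PendingEntry>) -> usize {
    let before = requests.len();
    requests.retain(|_, entry| {
        if entry.expired {
            // A send error only means the waiter is already gone, which is fine.
            let _ = entry.drop_tx.send(());
            false
        } else {
            true
        }
    });
    before - requests.len()
}
```

With this shape a task blocked on the receiving end observes either the explicit signal or a disconnected channel, so it cannot wait forever on an entry that no longer exists.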

3. Race between get_hibernating_in_flight_request and GC

In handle_websocket_hibernation, get_hibernating_in_flight_request transitions the state to Hibernating, and then has_pending_websocket_messages is called on the handle. Between these two calls, gc_in_flight_requests could observe the Hibernating entry and remove it if pending_ws_msgs have timed out. The subsequent has_pending_websocket_messages call would then return an error instead of Continue. The window is tiny but the error handling path may not be correct for this case.


Missing Functionality / Behavioral Regressions

4. Removed reply_to/opened logic has no documented replacement

The old code included a reply_to field on the first tunnel message so the envoy learned where to send responses. This logic was removed entirely without explanation. If nothing has replaced it in the protocol layer, the envoy will not know where to route replies for new connections, which is a silent functional regression. Please document whether this was intentionally replaced (and where) or whether it was accidentally dropped.

5. pending_tunnel_msgs not checked in early-wake condition

The early-wake check only checks pending_ws_msgs. If there are only buffered pending_tunnel_msgs (messages from the envoy during hibernation), the early-wake is skipped. Those messages will still be forwarded when wake() is eventually called, but the hibernation task runs unnecessarily. Adding a check for !pending_tunnel_msgs.is_empty() here would allow an early exit and clarify intent.
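The suggested condition could look roughly like this. `HibernatingRequest` and `should_wake_early` are hypothetical names, and the two queues are modeled as plain byte-vector lists rather than the gateway's real message types.

```rust
/// Hypothetical view of the buffers held by a hibernating request.
struct HibernatingRequest {
    pending_ws_msgs: Vec<Vec<u8>>,
    pending_tunnel_msgs: Vec<Vec<u8>>,
}

impl HibernatingRequest {
    /// Returns true if either buffer already holds work, so the
    /// hibernation task can exit early instead of waiting for a wake.
    fn should_wake_early(&self) -> bool {
        !self.pending_ws_msgs.is_empty() || !self.pending_tunnel_msgs.is_empty()
    }
}
```

Naming the combined condition also documents which buffers count as pending work, which is the intent the review asks to make explicit.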


Code Quality Issues

6. replace_with crate added without justification for abort semantics

Cargo.toml adds the replace_with crate for two replace_with_or_abort call sites. The _or_abort variant aborts the entire process on any panic inside the closure with no unwinding or clean shutdown. This is a heavy semantic choice that should be justified in a comment, and the closures should clearly be panic-free. If a standard std::mem::replace with a sentinel/default state achieves the same result, prefer that to avoid the abort semantics.
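The std-only alternative mentioned above might be sketched as follows, assuming a hypothetical `State` enum with a `Poisoned` sentinel variant. The real state type carries channels, so whether a cheap sentinel is practical there is an open question.

```rust
use std::mem;

/// Hypothetical state enum; `Poisoned` exists only as a transition sentinel.
#[derive(Debug, PartialEq)]
enum State {
    Active { buffered: Vec<String> },
    Hibernating { buffered: Vec<String> },
    /// Observed only if a transition panicked midway.
    Poisoned,
}

/// Moves the buffer from `Active` into `Hibernating` without cloning it.
/// Other states are put back unchanged.
fn hibernate(state: &mut State) {
    // Take ownership of the old state, leaving the sentinel in its place.
    let old = mem::replace(state, State::Poisoned);
    *state = match old {
        State::Active { buffered } => State::Hibernating { buffered },
        other => other,
    };
}
```

The trade-off is that a panic between the `mem::replace` and the assignment leaves `Poisoned` behind for later code to detect, rather than aborting the process as `replace_with_or_abort` does.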

7. Commented-out dead code should be removed

engine/packages/pegboard-gateway2/src/shared_state.rs, line ~891 contains a commented-out wrapping_lt function. Remove this or restore it. Commented-out code adds noise without historical value (git blame serves that purpose).

8. Hibernating branch in ping_task uses bail! for a non-recoverable case

ping_task is only spawned for active websocket connections, so hitting the PendingHibernation/Hibernating branch indicates a task lifecycle ordering bug, not a recoverable error. Use unreachable! with a descriptive message so it surfaces clearly in logs rather than propagating as an unstructured anyhow error.

9. Comment style violations (CLAUDE.md convention)

Several new comments do not follow the project convention of complete sentences:

  • shared_state.rs: // TODO: Kinda ugly but avoids clones and whatnot is colloquial. Prefer: // Using replace_with avoids cloning the state and its nested channels.
  • hibernation_task.rs doc comment: /// Peeks client ws until a message is received. is misleading. The task also forwards tunnel messages and handles actor-ready events. Update to reflect full responsibility.

10. anyhow! macro used directly (CLAUDE.md preference)

Per CLAUDE.md: "Prefer .context() over the anyhow! macro." The HibernationLifecycleResult::Aborted error in lib.rs should use a structured RivetError derive or .context() on an existing error.

11. #[tracing::instrument(skip_all)] on a hot-path inner function

Adding #[tracing::instrument] to recv_message creates a new span per message with no parent context (called from inside retain_async closures and tight receive loops). This adds tracing overhead on a hot path without useful context. Consider removing the attribute and keeping only the debug log line.

12. InFlightRequestCtx naming is misleading

InFlightRequestCtx is a short-lived struct that is destructured immediately at call sites into channels + a handle. The Ctx suffix implies a long-lived configuration type. Consider InFlightRequestChannels or InFlightRequestInit to match its actual one-time-destructure use pattern.


Summary

The PR correctly fixes the lost-tunnel-messages bug and the InFlightRequestHandle abstraction is a real improvement. Key items to address before merging:

  1. Correctness: Document or restore the reply_to/opened behavior removed from the tunnel message path (issue 4 above).
  2. Correctness: PendingHibernation entries are silently removed by GC with no signal to waiters (issue 2).
  3. Comment accuracy: The "pending ws messages from client" comment direction appears inverted (issue 1).
  4. replace_with crate: Add a comment justifying the abort-on-panic semantics vs. a simpler mem::replace approach (issue 6).
  5. Dead code: Remove the commented-out wrapping_lt function (issue 7).

MasterPtato force-pushed the 04-24-refactor_gateway_clean_up_and_fix_lost_tunnel_msgs branch from 9681ff0 to b08bae3 on April 25, 2026 01:14
MasterPtato force-pushed the 04-24-refactor_pb_fix_actor_overdue_alarm_handling branch from 3cfcb16 to 31253f7 on April 25, 2026 01:14