Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Nov 19, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 19, 2025 9:20pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 19, 2025 9:20pm
rivet-inspector Ignored Ignored Preview Nov 19, 2025 9:20pm
rivet-site Ignored Ignored Preview Nov 19, 2025 9:20pm

Copy link
Member Author

NathanFlurry commented Nov 19, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 19, 2025

Pull Request Review: Remove TunnelAck

This PR removes the acknowledgement mechanism (TunnelAck) from the tunnel protocol and simplifies the message flow between pegboard-gateway and pegboard-runner. Here's my analysis:

✅ Strengths

1. Protocol Simplification

  • Removes bidirectional ack complexity, making the system easier to understand and maintain
  • Eliminates state tracking for active requests on the runner side (tunnel_active_requests HashMap removed from conn.rs)
  • Reduces message overhead by removing ack round trips

2. Architectural Improvement

  • Introduces gateway_id field in tunnel messages, allowing runners to route responses directly to the correct gateway without maintaining state
  • Gateway-to-runner routing becomes stateless from the runner's perspective
  • Cleaner separation: gateway handles in-flight state, runner just forwards messages

3. Code Quality

  • Removes ~100+ lines of complex ack handling logic
  • Eliminates the need to track request lifecycle states
  • Better protocol versioning with new ToRunner and ToGateway union types

⚠️ Concerns and Recommendations

1. TODO Comments (Shared State - lines 228-249)
The ToGatewayKeepAlive handler is stubbed out with a large TODO block containing old TunnelAck logic. This is confusing:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Recommendation: Either implement the keep-alive mechanism or remove the TODO if it's not needed. The commented code references TunnelAck which no longer exists.

2. Message Acknowledgement Timeout Logic (shared_state.rs:447-453)
The GC logic checks for message timeouts but the condition appears inverted:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This triggers a timeout when the message is within the timeout window, not beyond it. Should be >= instead of <=.

Same issue on line 458-462 for WebSocket messages.

3. Missing Request Cleanup on Runner Disconnect
Previously, when the runner disconnected, active requests received abort messages (removed from lib.rs:246-276). Now there's no cleanup mechanism when runners disconnect unexpectedly.

Question: How does the gateway handle runner disconnections? Does the workflow system handle this, or could requests hang indefinitely?

4. Protocol Version Conversion Edge Cases (versioned.rs:1231-1236 and 1331-1338)
The v2→v3 conversion maps deprecated TunnelAck to RequestAbort or ResponseAbort:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

While the comment says this shouldn't happen, if a v2 client sends TunnelAck, it will be interpreted as an abort, potentially breaking active connections.

Recommendation: Add logging or explicit error handling for this case to aid debugging during the migration period.

5. Hibernation State Management
The pending_msgs vector (line 179-182) is still populated but never cleared since TunnelAck removal. This could cause:

  • Memory leaks for long-lived connections
  • False positive timeouts in GC

Question: What's the intended lifecycle for pending_msgs now? Should it be removed entirely?

6. TypeScript Protocol Missing Index Field
In the v2→v3 conversion (versioned.rs:1219), the TypeScript protocol defaulted index to 0, but now uses msg.index. Verify the TypeScript SDK properly handles this field.

🔍 Testing Recommendations

  1. Load testing - Verify the system handles high message volumes without the ack backpressure mechanism
  2. Failure scenarios - Test runner disconnections during active requests
  3. Protocol migration - Ensure v2 clients gracefully upgrade to v3
  4. Memory profiling - Verify pending_msgs doesn't cause leaks
  5. Hibernation flow - Test WebSocket hibernation/resume with pending messages

📝 Minor Issues

  • Line 165: Gateway ID should use self.gateway_id.as_bytes() without dereferencing (though this works, it's cleaner)
  • Documentation: Consider updating protocol docs to explain the new stateless routing model

Summary

This is a solid architectural improvement that simplifies the protocol. The main concerns are around the incomplete keep-alive implementation, the inverted timeout logic, and potential memory leaks from pending_msgs. Once these are addressed, this should be good to merge.

Priority Issues:

  1. Fix timeout condition logic (lines 449 and 458)
  2. Clarify pending_msgs lifecycle and GC behavior
  3. Implement or remove ToGatewayKeepAlive TODO

@MasterPtato MasterPtato force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 972242d to 556b074 Compare November 19, 2025 21:19
@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e Compare November 19, 2025 21:19
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 556b074 to 972242d Compare November 19, 2025 21:22
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 759df3e to 4adef6f Compare November 19, 2025 21:22
@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the pegboard gateway/runner protocol and simplifies the tunnel message flow by embedding the gateway ID directly in messages, eliminating the need for stateful request tracking.

Architecture Changes

Key Improvements ✅

  1. Simplified Message Flow: The removal of TunnelAck eliminates bidirectional acknowledgment complexity, reducing network overhead and latency.

  2. Stateless Gateway Routing: By embedding gateway_id in ToServerTunnelMessage, the runner can now route messages without maintaining active request state (tunnel_active_requests HashMap removed).

  3. Protocol Evolution: Introduces ToRunner union type to support future gateway-to-runner specific messages like ToRunnerKeepAlive (currently stubbed with TODO).

Code Quality & Best Practices

Strengths ✅

  • Follows CLAUDE.md conventions: Structured logging with tracing, proper error handling with anyhow
  • Backward compatibility: Version conversion functions handle the deprecated TunnelAck gracefully for v2→v3 migration
  • Reduced complexity: Removed 150+ lines of stateful tracking logic
  • Type safety: Protocol changes properly reflected across Rust and TypeScript SDKs

Concerns ⚠️

1. Incomplete Feature Implementation

// pegboard-gateway/src/shared_state.rs:227-249
Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issue: The ToGatewayKeepAlive handler is completely commented out, leaving pending message cleanup unimplemented. This could lead to unbounded memory growth in in_flight.pending_msgs.

Impact: Without acknowledgment cleanup, the gateway will accumulate pending messages indefinitely until requests complete, potentially exhausting memory under high load.

Recommendation: Either implement the keepalive mechanism or remove the pending message tracking entirely if it's no longer needed.

2. Hardcoded Default Values in Backward Compatibility

// runner-protocol/src/versioned.rs:165-169
v2::ToClient::ToClientTunnelMessage(msg) => {
    v3::ToClient::ToClientTunnelMessage(v3::ToClientTunnelMessage {
        gateway_id: [0; 16],  // ⚠️ Zero UUID for v2 messages
        // ...

Issue: V2 messages get a zero UUID for gateway_id, which could cause routing failures if v2 runners are still active.

Recommendation: Either:

  • Document that v2 protocol is fully deprecated and incompatible with this change
  • Store the actual gateway ID in a way that's accessible during deserialization
  • Add validation to reject v2 messages if they can't be properly routed

3. Silent Failure Mode

// runner-protocol/src/versioned.rs:1232-1236
v2::ToClientTunnelMessageKind::TunnelAck => {
    // TunnelAck is deprecated and should not be used
    // For backwards compatibility, we skip it
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issue: Converting deprecated TunnelAck to ToClientRequestAbort silently changes semantics, potentially causing unexpected request termination.

Recommendation: Log a warning when this conversion occurs, or consider returning an error to surface the protocol mismatch.

Performance Considerations

Positive Impact ✅

  • Reduced message overhead: No more ack roundtrips for every tunnel message
  • Memory efficiency: Removed per-request TunnelActiveRequest HashMap in runner
  • Simpler code paths: Fewer branches in message handling

Potential Issues ⚠️

  • Pending message accumulation: Without proper ack cleanup (see concern [SVC-2555] Set up issue templates #1), memory usage could grow unbounded
  • Missing reliability mechanism: If the previous TunnelAck was used for flow control or reliability, its removal could impact message delivery guarantees

Security Concerns

Low Risk ✅

No immediate security issues identified. The gateway_id is a fixed-size 16-byte array, preventing injection attacks.

Consideration 💭

  • Ensure that gateway_id validation happens at the protocol boundary to prevent routing to unintended gateways
  • The zero UUID fallback for v2 messages (concern [SVC-2479] Send cluster events to PostHog #2) could theoretically allow routing confusion if not handled carefully

Test Coverage

Missing ⚠️

No test files found for pegboard-gateway or pegboard-runner packages. This is a significant protocol change that should have:

  1. Unit tests for message serialization/deserialization
  2. Integration tests verifying the new routing mechanism
  3. Tests for backward compatibility with v2 protocol
  4. Tests for the keepalive mechanism (once implemented)

Recommendation: Add test coverage before merging, especially for:

  • Gateway ID routing correctness
  • Pending message cleanup behavior
  • Protocol version conversion edge cases

Recommendations

Before Merging 🔴

  1. Implement or remove keepalive handling: Address the TODO in shared_state.rs:227
  2. Clarify v2 compatibility: Document or fix the zero gateway ID issue
  3. Add test coverage: At minimum, test the happy path and version conversion

Post-Merge Considerations 🟡

  1. Monitor memory usage in production for pending message accumulation
  2. Consider adding metrics for message routing failures
  3. Update any external documentation referencing TunnelAck

Overall Assessment

This is a well-structured refactoring that simplifies the codebase and removes unnecessary complexity. However, the incomplete keepalive implementation and backward compatibility concerns should be addressed before merging to production.

Status: ⚠️ Approve with changes requested


Review generated with Claude Code - following rivet-dev/rivet conventions from CLAUDE.md

@claude
Copy link

claude bot commented Nov 19, 2025

Code Review - PR #3488: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the tunnel protocol and simplifies the message flow between pegboard-gateway and pegboard-runner. The change introduces a new ToRunner message type and adds a gateway_id field to tunnel messages to enable direct routing without maintaining stateful request tracking.


Positive Aspects

Protocol Simplification: Removing TunnelAck eliminates a complex acknowledgment protocol, reducing code complexity significantly.

Reduced State Management: The PR removes tunnel_active_requests HashMap and associated lifecycle tracking logic (~80 lines removed from conn.rs and lib.rs).

Cleaner Architecture: Using gateway_id for direct message routing is more straightforward than maintaining request-to-subject mappings.

Backwards Compatibility: The versioning logic properly handles the deprecated TunnelAck in v2→v3 conversions (lines 1232-1237, 1320-1327 in versioned.rs).


Issues & Concerns

🔴 Critical: Commented-Out TODO Code

Location: pegboard-gateway/src/shared_state.rs:227-249

There's a large block of commented-out code handling ToGatewayKeepAlive:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issues:

  1. This suggests incomplete implementation - the keep-alive mechanism appears to be planned but not implemented
  2. The commented code references removing messages from pending_msgs, but this functionality is now lost
  3. No explanation of when/how this will be completed
  4. The pending_msgs vector at line 179-182 is still being populated but never cleaned up

Recommendation: Either:

  • Implement the keep-alive mechanism properly in this PR, or
  • File a follow-up issue and add a TODO comment with the issue number, or
  • Remove the commented code if it's truly not needed

🟡 Medium: Potential Memory Leak

Location: pegboard-gateway/src/shared_state.rs:179-182

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

Issue: Messages are added to pending_msgs but the commented-out ToGatewayKeepAlive handler that would remove them is not implemented. The GC logic at lines 447-453 checks for timeouts but the condition at line 448 appears inverted:

if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This should likely be > not <= - it triggers GC when the message is not timed out, which seems backwards.

Recommendation: Fix the GC logic condition or clarify the intended behavior with comments.

🟡 Medium: Protocol Version Handling

Location: versioned.rs:1232-1237, 1320-1327

The conversion of deprecated TunnelAck to ToClientRequestAbort / ToServerResponseAbort could cause issues:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // For backwards compatibility, we skip it
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issues:

  1. Comment says "we skip it" but actually converts to RequestAbort - this will abort the request
  2. If v2 clients send TunnelAck, it will incorrectly terminate their requests
  3. No testing for mixed-version scenarios

Recommendation: Either truly skip the message (return early) or add integration tests for protocol version mixing.

🟡 Medium: Missing Error Handling

Location: ws_to_tunnel_task.rs:368-369

let gateway_reply_to = GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

Issue: The gateway_id is assumed to be valid, but there's no validation. If it's zero/invalid (as seen in the default value at line 165 of shared_state.rs and versioned.rs lines 178, 493), messages could be routed incorrectly.

Recommendation: Add validation or document why [0; 16] is safe for backwards compatibility.

🟢 Minor: TypeScript Schema Generation

Location: runner-protocol/src/index.ts

The generated TypeScript looks correct, but the large diff (184 additions, 49 deletions) suggests significant schema changes.

Recommendation: Ensure TypeScript clients are updated and tested. Consider documenting the breaking changes in a migration guide.

🟢 Minor: Cleanup Opportunity

Location: pegboard-runner/src/lib.rs:244-276

Removed code that sent close messages to active requests. This cleanup is good, but verify that request cleanup now happens correctly through other mechanisms.


Security Concerns

No major security issues identified. The removal of TunnelAck doesn't appear to introduce vulnerabilities. However:

  • Ensure the gateway_id routing cannot be exploited to intercept messages
  • The zero-value gateway_id in backward compatibility code should be validated

Performance Considerations

Positive: Removing tunnel_active_requests HashMap reduces memory overhead and lock contention.

⚠️ Concern: The unbounded pending_msgs vector (if not properly GC'd) could grow indefinitely.


Testing

Missing: No test coverage visible in this PR for:

  • Mixed protocol versions (v2 ↔ v3)
  • Keep-alive mechanism
  • Gateway ID routing
  • Pending message GC

Recommendation: Add integration tests before merging.


Recommendations Summary

Before merging:

  1. ✅ Implement or remove the commented ToGatewayKeepAlive handler
  2. ✅ Fix the GC logic condition at line 448 in shared_state.rs
  3. ✅ Add validation for gateway_id
  4. ✅ Add tests for protocol version compatibility
  5. ⚠️ Document the breaking changes for TypeScript/runner clients

Nice to have:

  • Add metrics for pending message queue sizes
  • Document the new message flow in architecture docs

Conclusion

This is a solid refactoring that simplifies the protocol, but has some implementation gaps that need addressing before merging. The main concern is the incomplete keep-alive mechanism and potential memory leak from unbounded pending_msgs. Once those are resolved, this will be a good improvement to the codebase.

Overall Assessment: ⚠️ Needs Work - Address critical TODOs and GC logic before merging.

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Remove TunnelAck

This PR removes the TunnelAck acknowledgment mechanism from the runner protocol and replaces it with a gateway-ID-based message routing system. Overall, the changes look good and simplify the protocol, but I've identified several important issues to address.


🔴 Critical Issues

1. Missing gateway_id in TypeScript SDK message sending

Location: engine/sdks/typescript/runner/src/tunnel.ts:430-437

The TypeScript #sendMessage method creates a ToServerTunnelMessage but does not include the required gateway_id field. This will cause protocol deserialization failures on the Rust side.

// Current code - MISSING gatewayId
const message: protocol.ToServer = {
    tag: "ToServerTunnelMessage",
    val: {
        requestId,
        messageId,
        messageKind,  // gateway_id is missing here!
    },
};

Fix needed: Add gatewayId field to the message. The gateway ID should be extracted from the incoming ToClientTunnelMessage and stored in the tunnel or actor state.

2. Gateway ID not tracked in TypeScript SDK

Location: engine/sdks/typescript/runner/src/tunnel.ts

The TypeScript runner needs to:

  1. Extract gatewayId from incoming ToClientTunnelMessage messages
  2. Store it per-request (likely in the actor's request tracking)
  3. Include it when sending ToServerTunnelMessage back

Currently, there's no mechanism to capture or store the gateway ID from incoming messages.


⚠️ Warnings & Potential Issues

3. Incomplete TODO comments

Location: engine/packages/pegboard-gateway/src/shared_state.rs:227-249

Large commented-out block for ToGatewayKeepAlive handling with a TODO: comment. This suggests incomplete implementation:

  • What's the plan for keep-alive messages?
  • Should pending message tracking be removed entirely, or is this temporary?
  • The commented code references message_id but ToGatewayKeepAlive is defined as void in the schema

Recommendation: Either implement the keep-alive mechanism or remove the TODO comment and clarify the design decision.

4. Pending message cleanup may not work correctly

Location: engine/packages/pegboard-gateway/src/shared_state.rs:179-182

The pending_msgs vector is still being populated but never cleared since TunnelAck removal. This means:

  • The pending_msgs vector will grow indefinitely for long-lived requests
  • The GC logic at line 447-453 checks pending message timeouts but uses comparison operators that will timeout messages (condition on line 448-450 is backwards)
// Line 447-453 - Bug in condition
if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    if now.duration_since(earliest_pending_msg.send_instant)
        <= MESSAGE_ACK_TIMEOUT  // Should be >= not <=
    {
        break 'reason Some(MsgGcReason::MessageNotAcked{...});
    }
}

This will cause the GC to trigger timeouts for messages that are still fresh!

5. Protocol versioning compatibility concerns

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:1231-1237, 1305-1312

The backwards compatibility converters map removed TunnelAck to ToClientRequestAbort / ToServerResponseAbort. This is a questionable fallback:

  • Acks are not semantically equivalent to aborts
  • May cause unexpected behavior if v2 clients/gateways try to communicate with v3 endpoints
// Lines 1231-1237
v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Recommendation: Consider making this a hard error or at minimum adding metrics/logging to detect if this fallback is ever hit in production.


✅ Good Practices

  1. Clean state management simplification - Removing the complex TunnelActiveRequest tracking from conn.rs simplifies the runner state management significantly.

  2. Gateway ID approach - Using the gateway ID for routing is more scalable than maintaining per-request state on both sides.

  3. Protocol schema changes - The BARE schema updates correctly define the new GatewayId type and update both message structures.

  4. Rust error handling - Continues to use anyhow::Result and proper context as per CLAUDE.md guidelines.


📋 Minor Suggestions

  1. Logging consistency: The log messages follow the structured logging pattern correctly (lowercase, using ? for debug). Good adherence to CLAUDE.md.

  2. Consider adding metrics: This is a significant protocol change. Consider adding metrics for:

    • Gateway ID mismatches
    • Unknown gateway IDs
    • Message routing failures
  3. WebSocket hibernation: The change impacts how WebSocket messages are tracked during hibernation. Ensure the hibernation flow is thoroughly tested with the new gateway-id-based routing.


🧪 Testing Recommendations

Before merging, please ensure:

  1. Integration tests covering gateway ↔ runner message flow
  2. WebSocket hibernation tests with the new routing
  3. Protocol version compatibility tests (v2 ↔ v3)
  4. Load tests to verify the GC properly cleans up pending messages

Summary

Overall assessment: The architectural change is sound, but there are critical bugs that will cause runtime failures:

  • ❌ Missing gateway_id in TypeScript SDK will break all tunnel messages from TypeScript runners
  • ❌ GC timeout logic is inverted and will incorrectly timeout fresh messages
  • ⚠️ Incomplete keep-alive implementation needs clarification

Recommendation: Address the critical issues before merging. The TypeScript SDK bug is a blocker.

@claude
Copy link

claude bot commented Nov 19, 2025

Pull Request Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgement mechanism from the pegboard tunnel protocol, simplifying the communication flow between gateway and runner. The change eliminates the bidirectional acknowledgement system in favor of a unidirectional message flow with a gateway_id field for routing.

Code Quality & Best Practices

✅ Strengths

  1. Good Protocol Evolution: The versioned protocol conversion handles backward compatibility properly by mapping deprecated TunnelAck variants to appropriate fallback messages in versioned.rs:1231-1238 and 1331-1338.

  2. Consistent Naming: The new ToRunner and ToGateway union types follow existing conventions and improve clarity about message direction.

  3. Proper Logging: Structured logging patterns are correctly used throughout (e.g., tracing::debug!(?x, "message") instead of string formatting).

  4. Simplified State Management: Removing tunnel_active_requests HashMap from pegboard-runner/src/conn.rs:32-33 eliminates complex lifecycle tracking and potential memory leaks.

⚠️ Areas for Improvement

  1. Incomplete TODO Comments (shared_state.rs:227-249):

    Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
        // TODO:
        // let prev_len = in_flight.pending_msgs.len();
        // ...

    Issue: The ToGatewayKeepAlive handler is stubbed out with commented code. This suggests the keep-alive acknowledgement mechanism is unfinished.

    Recommendation: Either:

    • Implement the keep-alive logic before merging, or
    • Add a detailed TODO with a tracking issue reference explaining what needs to be done
    • Document whether this is safe to leave incomplete and what the runtime implications are
  2. Gateway ID Initialization (versioned.rs:493 and 681):

    gateway_id: [0; 16],  // Default to all zeros for v2 messages

    Issue: When converting v2 messages to v3, gateway_id defaults to zeros. This could cause routing issues if older protocol versions are still in use.

    Recommendation: Document whether v2 protocol is still supported in production. If yes, add a comment explaining how zero gateway_id is handled in routing logic.

  3. Error Handling in GC Logic (shared_state.rs:447-453):

    if let Some(earliest_pending_msg) = req.pending_msgs.first() {
        if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT {
            break 'reason Some(MsgGcReason::MessageNotAcked{message_id:Uuid::from_bytes(earliest_pending_msg.message_id)});
        }
    }

    Issue: The comparison operator <= appears incorrect - this would trigger timeout immediately. Should be >=.

    Critical Bug: This will cause requests to be garbage collected prematurely!

  4. Same Issue in WebSocket GC (shared_state.rs:458-460):

    if now.duration_since(earliest_pending_ws_msg.send_instant) <= HWS_MESSAGE_ACK_TIMEOUT {
        break 'reason Some(MsgGcReason::WebSocketMessageNotAcked{last_ws_msg_index: hs.last_ws_msg_index});
    }

    Critical Bug: Same logic error with <= instead of >=.

  5. Removed Cleanup Logic (pegboard-runner/src/lib.rs:246-277):
    The PR removes cleanup code that sent ToServerResponseAbort messages for active HTTP requests when the WebSocket closes.

    Concern: How are in-flight HTTP requests now cleaned up on runner disconnect? The comment says WebSockets are handled differently, but what about HTTP requests?

    Recommendation: Verify that gateway-side timeout handling properly cleans up orphaned HTTP requests.

Potential Bugs

🔴 Critical

  1. GC Timeout Comparison Bug (as noted above in points 3 & 4):

    // WRONG: Triggers immediately
    if now.duration_since(send_instant) <= TIMEOUT
    
    // CORRECT: Should be
    if now.duration_since(send_instant) >= TIMEOUT

    Impact: All pending messages will be incorrectly flagged for garbage collection immediately.

    Fix: Change both occurrences from <= to >=.

⚠️ Medium

  1. Incomplete Keep-Alive Implementation:
    The ToGatewayKeepAlive and ToRunnerKeepAlive types are defined but not fully implemented. Without keep-alive acknowledgements, how is connection health monitored?

  2. Migration Path Unclear:
    The PR changes protocol semantics significantly. Is there a deployment plan to ensure gateways and runners are upgraded atomically? What happens if v3 gateway talks to v2 runner?

Performance Considerations

✅ Improvements

  1. Reduced Message Overhead: Eliminating TunnelAck messages cuts bidirectional traffic in half for tunnel communication.

  2. Simplified Memory Management: Removing tunnel_active_requests HashMap eliminates per-request state tracking overhead.

  3. Direct Routing: Using gateway_id for direct subject construction (GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id))) is more efficient than dynamic reply-to tracking.

💡 Suggestions

  1. Consider whether pending_msgs in shared_state.rs:179-182 still serves a purpose now that acknowledgements are removed. If it's only for GC timeout, document this clearly.

Security Concerns

✅ No Critical Issues

No obvious security vulnerabilities introduced.

💡 Considerations

  1. Gateway ID Validation: Should ws_to_tunnel_task.rs:368-369 validate that the gateway_id exists before constructing the receiver subject? Otherwise, messages could be sent to arbitrary NATS subjects.

  2. Zero Gateway ID: The v2 compatibility code uses zero gateway_id. Ensure this doesn't allow message spoofing or routing to unintended gateways.

Test Coverage

❌ Concerns

  1. No Test Updates: The PR modifies core protocol behavior but doesn't include test updates or new tests.

  2. No Integration Tests: Given this changes gateway-runner communication, integration tests verifying the new flow would be valuable.

  3. GC Logic Untested: The timeout comparison bug suggests the GC logic may not have adequate test coverage.

Recommendation: Add tests covering:

  • Gateway-to-runner message flow without acknowledgements
  • GC timeout logic for pending messages
  • Protocol version conversion for v2→v3 migration
  • Keep-alive message handling (once implemented)

Additional Observations

  1. TypeScript SDK Changes: The TypeScript protocol updates in runner-protocol/src/index.ts properly remove TunnelAck from union tag ordering, maintaining serialization compatibility.

  2. Clean Deletion: The removal of is_to_server_tunnel_message_kind_request_close() helper function is appropriate since terminal state tracking is no longer needed.

  3. Documentation: The BARE schema (v3.bare) is well-documented with comments explaining the purpose of new types.

Recommendations Summary

Must Fix Before Merge

  • Fix GC timeout comparison operators (lines 449 and 459 in shared_state.rs)
  • Implement or remove ToGatewayKeepAlive handling (remove TODO or implement functionality)
  • Document v2 protocol compatibility (explain zero gateway_id handling)

Should Address

  • Add integration tests for new message flow
  • Verify HTTP request cleanup on runner disconnect
  • Document migration/deployment plan for protocol change
  • Add validation for gateway_id before routing

Nice to Have

  • Add tests for GC logic
  • Document whether pending_msgs is still needed
  • Consider adding metrics for message flow monitoring

Conclusion

This is a solid simplification of the tunnel protocol that removes unnecessary complexity. However, there are critical bugs in the GC timeout logic that must be fixed before merging. The incomplete keep-alive implementation also needs resolution.

Overall assessment: Needs work before merge due to timeout comparison bugs, but the architectural direction is sound.


Review generated with Claude Code

@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e Compare November 19, 2025 21:54
@MasterPtato MasterPtato force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 972242d to 556b074 Compare November 19, 2025 21:54
@claude
Copy link

claude bot commented Nov 19, 2025

Code Review: Remove TunnelAck

Summary

This PR removes the TunnelAck message type from the runner protocol and simplifies the communication flow between the pegboard gateway and runner. The change eliminates the bidirectional acknowledgment mechanism for tunnel messages.


✅ Positive Observations

  1. Significant Code Simplification: The PR successfully removes ~405 lines while adding 374, resulting in a net reduction and cleaner architecture. The removal of tunnel_active_requests tracking state machine simplifies the codebase considerably.

  2. Protocol Versioning Handled Correctly: The changes properly handle backward compatibility in versioned.rs by converting deprecated TunnelAck messages to appropriate fallback types (ToClientRequestAbort/ToServerResponseAbort) for older protocol versions.

  3. Gateway ID Introduction: The addition of gateway_id field to tunnel messages (lines 165, 283 in v3.bare) allows the runner to determine the correct gateway to send responses to without maintaining state. This is a cleaner design pattern.

  4. Consistent Error Handling: The code properly uses structured logging with tracing's field syntax (e.g., tracing::debug!(?x, "foo")) as per CLAUDE.md guidelines.


⚠️ Issues & Concerns

1. Incomplete TODO Comments (High Priority)

Location: pegboard-gateway/src/shared_state.rs:227-249 and tunnel_to_ws_task.rs:56-58

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ... commented out ack handling logic
}

Issue: The ToGatewayKeepAlive handler is completely stubbed out with commented code. This suggests the feature is incomplete.

Questions:

  • What is the purpose of ToGatewayKeepAlive? Is this meant to replace TunnelAck functionality?
  • Why is the pending message cleanup commented out?
  • Should this be implemented before merging, or is this intentionally deferred?

Recommendation: Either implement the keepalive handling or document why it's deferred in the PR description.


2. Potential Message Ordering Issues (Medium Priority)

Location: pegboard-gateway/src/shared_state.rs:179-182

The code maintains a pending_msgs queue that tracks unacknowledged messages:

req.pending_msgs.push(PendingMessage {
    message_id,
    send_instant: now,
});

However, with TunnelAck removed, these pending messages are never removed from the queue except during GC timeouts (line 447-452).

Concerns:

  • The pending_msgs vector will continuously grow for long-lived requests
  • Memory leak potential for high-throughput connections
  • The GC logic checks if duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT which seems backwards (should be > not <=)

Line 448-450:

if now.duration_since(earliest_pending_msg.send_instant)
    <= MESSAGE_ACK_TIMEOUT
{
    break 'reason Some(MsgGcReason::MessageNotAcked{...});
}

This condition triggers GC when the message is within the timeout window, which is the opposite of the intended behavior.

Recommendation:

  • Fix the GC timeout logic (change <= to >=)
  • Consider removing pending_msgs entirely if acks are no longer needed, or implement the keepalive mechanism to actually clear them

3. Removed Cleanup Logic Not Replaced (Medium Priority)

Location: pegboard-runner/src/lib.rs:246-276 (deleted code)

The PR removes cleanup code that sends close messages to active requests when the runner websocket disconnects:

// Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
    // ... send ToServerResponseAbort
}

Issue: With this removal, when a runner disconnects, the gateway may not be notified that in-flight HTTP requests were terminated.

Recommendation: Verify that the gateway properly handles runner disconnections and cleans up in-flight requests without relying on explicit close messages. Document this behavior if intentional.


4. TypeScript Client Impact (Low-Medium Priority)

Location: engine/sdks/typescript/runner/src/tunnel.ts

The TypeScript changes are extensive (87 deletions, 36 additions). The diff shows removal of pendingMessages tracking but I cannot see the full implementation.

Questions:

  • Does the TypeScript runner have equivalent issues with message tracking?
  • Are there TypeScript tests covering the tunnel message flow?

Recommendation: Ensure TypeScript runner has test coverage for the new protocol behavior.


5. Protocol Version Migration (Low Priority)

Location: runner-protocol/src/versioned.rs:1232-1238 and 1331-1338

The fallback behavior for deprecated TunnelAck converts it to Abort messages:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issue: The comment "This shouldn't happen in practice" suggests this is dead code, but it's still reachable if v2 clients send TunnelAck.

Recommendation: Consider whether this should log a warning or if the conversion to Abort is the correct behavior.


🔒 Security Considerations

No security vulnerabilities identified. The changes don't introduce:

  • Command injection risks
  • SQL injection (no database queries modified)
  • XSS vulnerabilities
  • Improper authentication/authorization changes

The gateway_id is properly typed as a 16-byte array (UUIDv4) which is appropriate.


🚀 Performance Considerations

Positive:

  • Removing the ack mechanism reduces message overhead (2 messages per tunnel message → 1 message)
  • Less state tracking should improve memory usage

Concern:

  • The pending message memory leak issue mentioned above could degrade performance over time

🧪 Test Coverage

Missing Information: The PR doesn't show test changes.

Recommendations:

  1. Add tests for ToGatewayKeepAlive / ToRunnerKeepAlive behavior
  2. Test that gateway_id is correctly populated and used for routing
  3. Test backward compatibility with v2 protocol clients
  4. Test runner disconnection cleanup scenarios

📝 Documentation & Code Style

Adherence to CLAUDE.md: ✅ Generally good

  • Imports are at the top of files ✅
  • Uses structured logging correctly ✅
  • Uses workspace dependencies ✅
  • Hard tabs are preserved (rustfmt.toml) ✅

Issues:

  • TODO comments should either be implemented or have tracking issues
  • The PR description is empty - should explain the rationale for removing TunnelAck

Recommendations Summary

Must Fix Before Merge (Breaking Issues):

  1. ❗ Fix the GC timeout logic at shared_state.rs:448-450 (inverted condition)
  2. ❗ Implement or document the ToGatewayKeepAlive handler
  3. ❗ Verify runner disconnection cleanup works correctly without explicit close messages

Should Fix Before Merge:

  1. Add test coverage for the new protocol behavior
  2. Add PR description explaining the rationale and design decisions
  3. Consider removing pending_msgs tracking if acks are truly no longer needed

Nice to Have:

  1. Document the migration path for existing deployments
  2. Add warning logs for deprecated TunnelAck conversions

Overall, this is a good simplification of the protocol, but there are a few critical issues that should be addressed before merging. The inverted timeout condition is a blocking bug, and the incomplete keepalive implementation needs clarification.

@claude
Copy link

claude bot commented Nov 19, 2025

PR Review: Remove TunnelAck

Overview

This PR removes the TunnelAck message type from the runner protocol, simplifying the acknowledgment mechanism between the pegboard gateway and runner. This is a significant protocol change that affects bidirectional message flow.


✅ Strengths

  1. Good Protocol Versioning: The change properly maintains backwards compatibility through the versioned protocol system. The conversion functions in versioned.rs handle v2 → v3 upgrades gracefully by mapping the removed TunnelAck to terminal states.

  2. Cleaner Architecture: Removing the explicit ack mechanism simplifies the state machine:

    • Removed tunnel_active_requests HashMap tracking
    • Removed complex ack send/receive logic
    • Cleaner separation between gateway and runner responsibilities
  3. New ToRunner Message Type: The introduction of ToRunner as a separate type from ToClient provides better type safety and future extensibility with ToRunnerKeepAlive.

  4. Gateway ID Addition: Adding gateway_id to tunnel messages enables better routing without maintaining stateful reply-to subjects.


🔍 Issues & Concerns

1. Incomplete Keep-Alive Implementation

Location: pegboard-gateway/src/shared_state.rs:227-250

The ToGatewayKeepAlive handling is stubbed out with TODO comments:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Concern: The commented code suggests this was previously handling ack logic for pending messages. Without this:

  • How are pending messages being garbage collected now?
  • Could this lead to memory leaks in in_flight.pending_msgs?
  • The MESSAGE_ACK_TIMEOUT and pending_msgs fields still exist but may not be used correctly

Recommendation: Either complete the keep-alive implementation or remove the dead code (pending_msgs, timeouts, etc.) if the ack mechanism is truly unnecessary.


2. Stale TypeScript Code

Location: engine/sdks/typescript/runner/src/stringify.ts:35-36, 77-78

The stringify functions still handle TunnelAck:

case "TunnelAck":
    return "TunnelAck";

Issue: This creates a mismatch between the protocol definition (v3 removes TunnelAck) and the client SDK code. While the versioning layer handles conversion, the TypeScript code should be updated to:

  1. Remove TunnelAck cases from v3 stringify functions
  2. Add handling for new ToRunnerKeepAlive messages

Recommendation: Update TypeScript SDK to match protocol v3 exactly. Consider having separate stringify functions per protocol version if needed.


3. Backwards Compatibility Edge Case

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:1231-1237, 1331-1338

The conversion functions map deprecated TunnelAck to abort states:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // TunnelAck is deprecated and should not be used
    // For backwards compatibility, we skip it
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Concern:

  • The comment says "we skip it" but the code actually converts to Abort - this is inconsistent
  • If an old v2 runner sends a TunnelAck, it will be misinterpreted as a request abort
  • Could cause active requests to be terminated incorrectly during a rolling upgrade

Recommendation:

  • Either truly "skip" the message (return an Option or filter it out)
  • Or document clearly that v2 runners are no longer supported
  • Add integration tests for mixed-version scenarios

4. Missing Error Handling

Location: pegboard-runner/src/tunnel_to_ws_task.rs:56-59

protocol::ToRunner::ToRunnerKeepAlive(_) => {
    // TODO:
    continue;
}

Issue: Keep-alive messages are silently ignored. If keep-alives are meant to maintain connection health, there should be:

  • Logging at trace/debug level
  • Potential response mechanism
  • Timeout tracking

Recommendation: At minimum, add a trace log. Better yet, implement proper keep-alive handling or remove the message type if unused.


5. Protocol Version Not Bumped in Migration Code

Location: engine/sdks/rust/runner-protocol/src/versioned.rs:175-183

When converting v2 → v3, the code sets gateway_id: [0; 16]:

v3::ToClient::ToClientTunnelMessage(v3::ToClientTunnelMessage {
    gateway_id: [0; 16],  // ← Zero gateway ID for v2 messages
    request_id: msg.request_id,
    // ...
})

Concern: A zero UUID could cause routing issues if the gateway ID is used for message routing. The code in ws_to_tunnel_task.rs:368 uses this gateway_id to construct the reply subject:

let gateway_reply_to = GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

Recommendation:

  • Document that v2 protocol is unsupported, OR
  • Find a way to populate the correct gateway_id during conversion, OR
  • Add validation to reject zero gateway IDs

6. Removed Connection Cleanup Logic

Location: pegboard-runner/src/lib.rs:246-278 (deleted code)

The PR removes the cleanup logic that sent abort messages to active requests when the runner websocket closes:

// Send close messages to all remaining active requests
// ... [deleted 37 lines]

Concern: Without this cleanup:

  • How does the gateway know when a runner disconnects?
  • Could lead to dangling requests waiting indefinitely
  • The comment mentions "Websockets are not ephemeral like requests" but what about HTTP requests?

Recommendation: Verify that the gateway has alternative mechanisms to detect runner disconnection and clean up in-flight requests. This might be handled elsewhere, but it should be documented or tested.


🔒 Security Considerations

  1. Gateway ID Validation: Ensure gateway_id is validated before use to prevent potential routing to unintended gateways
  2. No obvious injection vulnerabilities - the changes don't introduce user-controlled data into critical paths

⚡ Performance Considerations

Positive impacts:

  • Removed HashMap lock contention on tunnel_active_requests
  • Eliminated ack message round-trips (lower latency)
  • Simplified message processing pipeline

Potential concerns:

  • If keep-alive mechanism is not properly implemented, could lead to connection timeouts
  • Pending message tracking (pending_msgs) still exists but may not be pruned correctly

🧪 Test Coverage

Missing:

  • No visible tests for protocol v2 → v3 conversion
  • No integration tests for mixed-version gateway/runner scenarios
  • No tests verifying proper handling of dropped connections without active request cleanup

Recommendation: Add tests for:

  1. Protocol conversion with TunnelAck messages
  2. Gateway ID routing correctness
  3. Connection cleanup behavior
  4. Keep-alive message handling

📋 Summary

This is a well-intentioned simplification of the protocol, but it has several incomplete areas that need addressing before merge:

Must Fix:

  1. ✅ Complete or remove the keep-alive TODO sections
  2. ✅ Update TypeScript SDK to match protocol v3
  3. ✅ Clarify backwards compatibility story for v2 TunnelAck conversion
  4. ✅ Verify connection cleanup still works without active request tracking

Should Fix:
5. Add test coverage for protocol conversion
6. Document the zero gateway_id behavior for v2 messages
7. Add logging for keep-alive handling

Nice to Have:
8. Integration tests for mixed-version scenarios
9. Performance testing to verify the removed ack mechanism doesn't impact reliability


💡 Recommendations

  1. Short term: Complete the TODOs or remove dead code to make the intent clear
  2. Medium term: Add comprehensive tests for the protocol changes
  3. Long term: Consider documenting the protocol evolution and migration strategy for operators

Overall, the architectural direction is sound, but the implementation feels incomplete. The TODOs suggest this might have been a partial revert or work-in-progress that needs finishing touches.

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 556b074 to 972242d Compare November 19, 2025 23:25
@NathanFlurry NathanFlurry force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 759df3e to 4adef6f Compare November 19, 2025 23:25
@claude
Copy link

claude bot commented Nov 19, 2025

Pull Request Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the pegboard gateway/runner protocol and replaces it with a simpler approach using gateway_id for routing. The change simplifies the protocol by eliminating bidirectional acknowledgments while maintaining message delivery tracking.


Code Quality & Best Practices

Strengths

  1. Clean Protocol Simplification: The removal of TunnelAck reduces protocol complexity and removes a bidirectional dependency that was error-prone.

  2. Good Backward Compatibility Handling: The versioned protocol conversion in versioned.rs properly handles the removed TunnelAck for backward compatibility:

    // TunnelAck was removed in v3
    v2::ToClientTunnelMessageKind::TunnelAck => {
        v3::ToClientTunnelMessageKind::ToClientRequestAbort
    }
  3. Proper Gateway ID Introduction: Adding gateway_id to messages allows for direct routing without maintaining stateful gateway_reply_to mappings.

  4. Consistent Logging: Follows the codebase's logging patterns with structured logging using tracing.

⚠️ Areas of Concern

1. Incomplete Keep-Alive Implementation (engine/packages/pegboard-gateway/src/shared_state.rs:227-250)

There's commented-out code for handling ToGatewayKeepAlive:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Recommendation: Either:

  • Implement the keep-alive mechanism fully
  • Remove this variant if it's not needed yet
  • Add a GitHub issue tracking its implementation

2. TypeScript Gateway ID Not Populated (engine/sdks/typescript/runner-protocol/src/index.ts)

In the message sending logic (engine/sdks/typescript/runner/src/tunnel.ts:430-438), the gatewayId field is not being populated:

const message: protocol.ToServer = {
    tag: "ToServerTunnelMessage",
    val: {
        requestId,
        messageId,
        messageKind,
        // Missing: gatewayId!
    },
};

Critical Issue: The TypeScript runner is missing the gatewayId field when constructing ToServerTunnelMessage. This will cause messages from TypeScript runners to fail routing.

Fix Required: Add gatewayId extraction and population in the TypeScript tunnel implementation.

3. Removed Cleanup Logic (engine/packages/pegboard-runner/src/lib.rs:243-276)

The PR removes the cleanup code that sent close messages to active requests when the runner disconnects. While the comment suggests this is intentional ("All teardown logic is handled by pegboard-runner"), this creates a potential resource leak.

Concern: If the runner crashes or disconnects ungracefully, the gateway may not know to clean up in-flight requests immediately. The GC will eventually clean them up via timeout, but this could delay resource cleanup.

Recommendation: Verify that the gateway's GC properly handles this scenario and document the expected behavior.


Potential Bugs

🐛 Critical: TypeScript Gateway ID Missing

As mentioned above, the TypeScript implementation doesn't populate gatewayId. This is a breaking bug for TypeScript runners.

Location: engine/sdks/typescript/runner/src/tunnel.ts:430-438

🐛 Moderate: Inconsistent Error Handling in Version Conversion

In engine/sdks/rust/runner-protocol/src/versioned.rs:1231-1236, the conversion from v2 to v3 converts TunnelAck to ToClientRequestAbort, but this may have unintended side effects:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issue: If a v2 client somehow sends a TunnelAck, it will be misinterpreted as a request abort, which could terminate an active request.

Recommendation: Consider using a no-op or logging a warning instead of converting to ToClientRequestAbort.

🐛 Minor: WebSocket Index Not Preserved in v2 Conversion

In versioned.rs:1219-1223, there's a comment about defaulting index to 0 for v2, but the code actually uses msg.index:

v2::ToClientTunnelMessageKind::ToClientWebSocketMessage(msg) => {
    v3::ToClientTunnelMessageKind::ToClientWebSocketMessage(v3::ToClientWebSocketMessage {
        index: msg.index,  // Good - preserves index
        data: msg.data,
        binary: msg.binary,
    })
}

The comment is misleading. This should be removed or updated.


Performance Considerations

Improvements

  1. Reduced Message Overhead: Removing TunnelAck messages reduces network traffic by ~50% for tunnel operations (no ack messages needed).

  2. Simplified State Management: Removing the active request tracking in tunnel_active_requests reduces memory overhead and mutex contention.

  3. Direct Routing: Using gateway_id for routing is more efficient than looking up gateway_reply_to from state.

⚠️ Potential Regressions

  1. No Explicit Delivery Confirmation: Without TunnelAck, there's no explicit confirmation that messages were received. The GC-based timeout approach is less responsive to failures.

  2. Increased GC Pressure: The gateway's GC (shared_state.rs:379-504) now needs to handle timeout-based cleanup instead of explicit acknowledgments. This may increase latency in detecting failed messages.

Recommendation: Monitor GC performance and message timeout rates after deployment.


Security Concerns

No New Security Issues Identified

The changes don't introduce obvious security vulnerabilities. The protocol changes are internal and don't expose new attack surfaces.

🔒 Existing Security Considerations

  1. Gateway ID Routing: The gateway_id is used for routing but isn't validated beyond basic deserialization. Consider adding validation that the gateway_id matches expected gateways.

  2. Message Timeout Cleanup: The GC relies on timeouts (MESSAGE_ACK_TIMEOUT = 30s) for cleanup. This is appropriate but ensure it can't be exploited for resource exhaustion.


Test Coverage

⚠️ Missing Test Coverage

Major Concern: I don't see any test updates or new tests for this significant protocol change.

Recommendations:

  1. Add integration tests for the new gateway_id routing mechanism
  2. Add tests for protocol version conversion (v2 ↔ v3)
  3. Add tests for the GC timeout behavior without TunnelAck
  4. Add tests for TypeScript runner tunnel message construction (especially gatewayId)
  5. Add tests for the ToRunnerKeepAlive / ToGatewayKeepAlive messages once implemented

Testing Checklist:

  • Protocol version conversion (v2 to v3)
  • Gateway routing with gateway_id
  • GC timeout behavior
  • TypeScript runner compatibility
  • Hibernating WebSocket restoration
  • Message delivery without TunnelAck

Additional Recommendations

  1. Documentation: Update protocol documentation to explain the removal of TunnelAck and the new routing mechanism.

  2. Migration Guide: Consider adding migration notes for anyone running older runners/gateways during deployment.

  3. Metrics: Add metrics to track:

    • Message timeout rates (to detect delivery issues)
    • GC cleanup frequency
    • Messages dropped due to missing gateway
  4. Dead Code Cleanup: Remove the TunnelActiveRequest struct definition from conn.rs since it's no longer used.

  5. TypeScript Fix Priority: The missing gatewayId in TypeScript should be fixed before merging this PR.


Summary

Overall Assessment: This is a good refactoring that simplifies the protocol, but it has a critical bug in the TypeScript implementation and incomplete keep-alive functionality.

Blocker Issues:

  • ❌ TypeScript gatewayId not populated
  • ToRunnerKeepAlive / ToGatewayKeepAlive incomplete

Required Before Merge:

  1. Fix TypeScript gatewayId population
  2. Complete or remove keep-alive implementation
  3. Add test coverage for protocol changes
  4. Verify GC cleanup handles runner disconnection properly

Nice to Have:

  • Update protocol documentation
  • Add metrics for monitoring
  • Clean up misleading comments in version conversion code

Great work on simplifying the protocol! With the fixes mentioned above, this will be a solid improvement. 🚀

@claude
Copy link

claude bot commented Nov 19, 2025

Code Review: Remove TunnelAck from Protocol

This PR removes the TunnelAck mechanism from the runner protocol. I've identified several critical issues that need to be addressed:

CRITICAL ISSUE 1: Memory Leak - Pending Messages Never Removed
Location: engine/packages/pegboard-gateway/src/shared_state.rs:179, 447-451

The pending_msgs vector is still being populated (line 179) and checked in GC logic (line 447-451), but the TunnelAck handler that would remove messages is now just a TODO comment (line 228-249). This means:

  • Memory leak: pending_msgs grows unbounded for long-lived requests
  • GC timeout logic is broken since messages are never removed
  • The condition at line 437 checking pending_msgs.is_empty() will never be true

Recommendation: Either remove pending_msgs entirely OR implement the ToGatewayKeepAlive handler.

CRITICAL ISSUE 2: Gateway ID Migration Problem
Location: engine/packages/pegboard-runner/src/ws_to_tunnel_task.rs:368-369, versioned.rs:493-494

The v2→v3 conversion sets gateway_id to [0; 16] (all zeros), but the code uses this to route messages via GatewayReceiverSubject. A zero UUID likely won't route to a valid gateway.

Recommendation: Add validation to reject zero UUIDs, or clarify if v2 clients are still supported.

HIGH PRIORITY: Inverted GC Logic?
Location: engine/packages/pegboard-gateway/src/shared_state.rs:448-450

The GC uses <= instead of > for timeout comparison, which seems backwards. It triggers GC when messages are NOT expired rather than when they ARE expired.

MEDIUM PRIORITY: TODO Comments
Multiple TODO comments suggest incomplete work (shared_state.rs:228-249, tunnel_to_ws_task.rs:57). Should these be completed before merging?

POSITIVE:

  • Good simplification of protocol complexity
  • Proper error handling with anyhow
  • Consistent structured logging

RECOMMENDATION: Fix critical issues 1 and 2 before merging. The incomplete removal of the ack mechanism will cause memory leaks and incorrect timeout behavior in production.

@MasterPtato MasterPtato force-pushed the 11-19-chore_engine-runner_remove_echo_close_on_toclientwebsocketclose branch from 972242d to 556b074 Compare November 20, 2025 03:07
@MasterPtato MasterPtato force-pushed the 11-19-chore_pegboard-gateway_remove_tunnelack_ branch from 4adef6f to 759df3e Compare November 20, 2025 03:07
@claude
Copy link

claude bot commented Nov 20, 2025

Pull Request Review: Remove TunnelAck

Summary

This PR removes the TunnelAck acknowledgment mechanism from the tunnel protocol, simplifying the communication flow between gateway and runner. The change removes bidirectional acknowledgment tracking in favor of a simpler unidirectional message flow with gateway_id routing.


Strengths

  1. Simplification: Removing the TunnelAck mechanism significantly reduces complexity:

    • Eliminates stateful tunnel_active_requests tracking in pegboard-runner (conn.rs:28-30)
    • Removes complex ack timeout logic and retry mechanisms
    • Reduces message round-trips for each tunnel operation
  2. Cleaner Gateway Routing: The new gateway_id field enables direct routing:

    • Gateway creates a unique ID on initialization (shared_state.rs:72)
    • Messages include gateway_id for return path routing (ws_to_tunnel_task.rs:368-369)
    • No need to track gateway_reply_to per request
  3. Protocol Version Compatibility: Good backward compatibility handling:

    • versioned.rs properly converts v2 TunnelAck to safe fallbacks (ToClientRequestAbort/ToServerResponseAbort)
    • New ToRunner type for gateway-to-runner messages distinct from ToClient
  4. Proper Cleanup: Runner shutdown no longer needs to iterate through active requests to send close messages (lib.rs:243-276 removed)


⚠️ Concerns & Issues

1. Critical: Incomplete Keep-Alive Implementation

The code shows commented-out ToGatewayKeepAlive handling in shared_state.rs:227-250:

Ok(protocol::ToGateway::ToGatewayKeepAlive) => {
    // TODO:
    // let prev_len = in_flight.pending_msgs.len();
    // ...
}

Issues:

  • The ToGatewayKeepAlive and ToRunnerKeepAlive messages are defined in the schema but not implemented
  • This leaves no mechanism to clear pending_msgs from the gateway's in-flight requests
  • Without acknowledgment, pending_msgs will grow unbounded until the GC timeout (30s)
  • The GC will incorrectly timeout requests if the first pending message exceeds MESSAGE_ACK_TIMEOUT (shared_state.rs:447-453)

Impact: Memory leaks and false request timeouts.

Recommendation: Either:

  1. Complete the keep-alive implementation to clear pending_msgs, OR
  2. Remove pending_msgs tracking entirely if acknowledgments are truly not needed

2. Potential Race Condition: Gateway ID Mismatch

In ws_to_tunnel_task.rs:368-369:

let gateway_reply_to =
    GatewayReceiverSubject::new(Uuid::from_bytes(msg.gateway_id)).to_string();

Issue: If a runner sends a message with a stale gateway_id (from before gateway restart), the message will be routed to a non-existent subject.

Scenarios:

  • Gateway crashes and restarts with new gateway_id
  • Runner still has old gateway_id from initial connection
  • Messages get lost in UPS with no error feedback

Recommendation: Add validation that messages are only accepted if gateway_id matches the current gateway, or add TTL/cleanup for stale gateway subjects.

3. Message Loss Without Acknowledgments

The PR removes all acknowledgment logic but doesn't address reliability:

Concerns:

  • No guarantee that messages sent over UPS actually arrive
  • PublishOpts::one() has no delivery confirmation
  • Network partitions or UPS failures could silently drop messages
  • No retry mechanism for failed sends

Recommendation: Document this as a known limitation, or consider adding at-least-once delivery semantics if reliability is critical.

4. Garbage Collection Logic May Be Broken

In shared_state.rs:447-453, the GC checks if the earliest pending message exceeded timeout:

if let Some(earliest_pending_msg) = req.pending_msgs.first() {
    if now.duration_since(earliest_pending_msg.send_instant)
        <= MESSAGE_ACK_TIMEOUT  // ⚠️ Should be >= not <=
    {
        break 'reason Some(MsgGcReason::MessageNotAcked{...});
    }
}

Issue: The condition uses <= when it should likely be >=. This will immediately mark all messages as not-acked regardless of how long they've been pending.

Impact: All requests will be prematurely garbage collected after the first GC tick.

5. TypeScript SDK Index Type Mismatch

In the TypeScript changes (tunnel.ts and generated protocol), the WebSocket message index handling changed but may have subtle issues:

Previous behavior:

  • v2 defaulted to index: 0 for messages without hibernation
  • v3 uses actual index tracking

Concern: Existing TypeScript code expecting index 0 for non-hibernating WebSockets might break.


🔍 Additional Observations

  1. Good: Error Handling - The v2→v3 conversion properly handles deprecated TunnelAck by converting to abort messages

  2. Good: Protocol Versioning - New ToRunner type properly separates concerns between runner-gateway and client messaging

  3. Minor: Dead Code - The commented-out wrapping_lt function in shared_state.rs:519-521 should be removed

  4. Testing Gap - No test files modified. This is a significant protocol change that would benefit from integration tests covering:

    • Gateway restart scenarios with stale gateway_id
    • Message delivery failures
    • GC behavior with pending messages

📋 Recommendations

High Priority:

  1. Fix or complete the keep-alive implementation - Either implement it fully or remove pending_msgs tracking
  2. Fix the GC timeout condition - Change <= to >= in line 449
  3. Add gateway_id validation - Prevent messages from being routed to non-existent gateways

Medium Priority:
4. Document the reliability guarantees (or lack thereof) without acks
5. Add integration tests for the new protocol flow
6. Remove commented-out dead code (wrapping_lt, keep-alive TODO)

Low Priority:
7. Consider metrics/observability for message routing failures
8. Document migration path for any external protocol consumers


🎯 Verdict

This is a good simplification that removes unnecessary complexity, but there are critical bugs that need to be addressed before merging:

  • The incomplete keep-alive implementation will cause memory leaks
  • The GC timeout logic appears inverted
  • Gateway ID routing needs validation to prevent message loss

Recommendation: Request changes to address the high-priority issues above.


Review generated using repository conventions from CLAUDE.md

@claude
Copy link

claude bot commented Nov 20, 2025

Code Review: Remove TunnelAck

This PR removes the TunnelAck mechanism from the tunnel protocol and replaces it with a simpler approach where the runner looks up the gateway by ID. I've reviewed the changes and have the following feedback:

Positive Changes

  1. Simplified Protocol: Removing the bidirectional acknowledgment system reduces protocol complexity and removes ~100 lines of tracking code
  2. Gateway ID Addition: Adding gateway_id to tunnel messages enables direct gateway routing without maintaining active request state
  3. Cleaner Separation: The new ToRunner and ToGateway message types provide better clarity about message flow direction
  4. Backward Compatibility: The version conversion logic properly handles the deprecated TunnelAck for v2→v3 upgrades

⚠️ Potential Issues

1. Missing Gateway ID in TypeScript (Critical)

In engine/sdks/typescript/runner/src/tunnel.ts:430-438, the gatewayId field is not being populated:

const message: protocol.ToServer = {
    tag: "ToServerTunnelMessage",
    val: {
        requestId,
        messageId,
        messageKind,
    },
};

The ToServerTunnelMessage type now requires a gatewayId field (as seen in the schema at line 282-287 of v3.bare), but it's not being set. This will likely cause serialization errors or undefined behavior.

Fix: Extract and store the gatewayId from incoming ToClientTunnelMessage messages and include it when sending responses.

2. Incomplete KeepAlive Implementation

In engine/packages/pegboard-gateway/src/shared_state.rs:227-250, the ToGatewayKeepAlive handler is stubbed out with TODO comments. The commented-out code references the old TunnelAck mechanism for cleaning up pending messages.

Similarly, in engine/packages/pegboard-runner/src/tunnel_to_ws_task.rs:56-62, the ToRunnerKeepAlive handler is also stubbed.

Questions:

  • What is the intended behavior of these KeepAlive messages?
  • Should they be removed entirely if not needed, or implemented properly?
  • The TODO comments suggest this was meant to replace some of the TunnelAck functionality

3. Removed Cleanup Logic May Cause Memory Leaks

In engine/packages/pegboard-runner/src/lib.rs:243-276, the PR removes code that sent close messages to all active requests when the runner WebSocket disconnected. The comment suggests this cleanup is now handled by pegboard-runner, but I don't see corresponding changes that ensure all active requests are properly terminated.

Concern: Without explicit cleanup, there may be:

  • Gateway-side requests waiting indefinitely for responses
  • Memory leaks from unreleased request state
  • Orphaned connections if the runner crashes

Recommendation: Verify that the gateway properly handles runner disconnection and cleans up all in-flight requests.

4. Message Acknowledgment Still Needed

Looking at the garbage collection logic in shared_state.rs:379-504 and tunnel.ts:441-519, both sides still track pending messages and time them out after 5-30 seconds. However, without acknowledgments, messages will always time out, potentially causing:

  • False positive timeouts for slow but valid requests
  • Unnecessary connection closures
  • Poor debugging experience (timeouts instead of actual errors)

Question: If acknowledgments are removed, should the pending message tracking also be removed? Or is there a new acknowledgment mechanism planned?

5. Inconsistent Error Handling in Version Conversion

In engine/sdks/rust/runner-protocol/src/versioned.rs:1231-1237 and 1331-1338, when converting deprecated TunnelAck from v2 to v3, it's being converted to ToClientRequestAbort and ToServerResponseAbort respectively:

v2::ToClientTunnelMessageKind::TunnelAck => {
    // TunnelAck is deprecated and should not be used
    // For backwards compatibility, we skip it
    // This shouldn't happen in practice as TunnelAck was removed
    v3::ToClientTunnelMessageKind::ToClientRequestAbort
}

Issues:

  • The comment says "skip it" but the code converts to Abort
  • Converting to Abort could cause legitimate v2 acks to terminate connections
  • Should probably log a warning if this case is hit

Recommendation: Either properly skip/ignore the message or document why Abort is the correct fallback.

6. Default Gateway ID in Version Conversion

In versioned.rs:492 and 638, when converting v2 messages to v3, gateway_id is hardcoded to [0; 16]:

v3::ToClientTunnelMessage {
    gateway_id: [0; 16],
    // ...
}

This zero-filled UUID will fail to route to the correct gateway.

Questions:

  • Is v2→v3 conversion only used during initial upgrade?
  • Should this return an error instead of using a placeholder?

🔍 Minor Issues

  1. Typo in shared_state.rs:449: <= should probably be > for timeout comparison:

    if now.duration_since(earliest_pending_msg.send_instant) <= MESSAGE_ACK_TIMEOUT

    This condition is checking if we should GC the message, but <= would keep messages that haven't timed out yet.

  2. Inverted Logic in tunnel.ts:776: Comment says "actor not loaded" but code checks hasActor:

    if (this.#runner.hasActor(actorId, generation)) {
        this.log?.warn({
            msg: "actor not loaded to send response...",

    Should be !this.#runner.hasActor(...)

  3. Unused Field: In shared_state.rs, the receiver_subject field is stored in InFlightRequest but could potentially be derived from the gateway_id if needed, reducing memory footprint slightly.

📋 Testing Recommendations

  1. Test gateway lookup with multiple concurrent gateways
  2. Verify runner reconnection doesn't break in-flight requests
  3. Test v2→v3 protocol version upgrades
  4. Verify WebSocket hibernation still works correctly
  5. Load test to ensure no memory leaks from unreleased request state

🎯 Summary

The core simplification is sound, but there are several implementation gaps that need to be addressed:

Must Fix:

  • Add gatewayId field to TypeScript tunnel messages
  • Clarify KeepAlive implementation or remove TODOs
  • Fix timeout comparison logic

Should Investigate:

  • Active request cleanup on runner disconnect
  • Purpose of pending message tracking without acks
  • Version conversion edge cases

Overall, this is a good architectural improvement, but needs some implementation details resolved before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants