Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivet-site Error Error Nov 19, 2025 1:29am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 19, 2025 1:29am
rivet-inspector Ignored Ignored Preview Nov 19, 2025 1:29am
rivetkit-serverless Skipped Skipped Nov 19, 2025 1:29am

Contributor Author

MasterPtato commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 18, 2025

Pull Request Review

Summary

This PR fixes a race condition where actor stops during WebSocket connections or HTTP requests weren't being properly handled, leading to requests hanging indefinitely. The fix adds subscriptions to the actor::Stopped workflow event and properly terminates pending operations when actors stop.


✅ Positive Aspects

  1. Critical Bug Fix: Addresses a real race condition where requests/WebSockets could hang if an actor stops during operation
  2. Consistent Error Handling: Uses appropriate error types (ServiceUnavailable, WebSocketServiceUnavailable) that match existing patterns
  3. Hibernation Awareness: Correctly handles the hibernation case for WebSockets - hibernating if allowed, otherwise returning unavailable
  4. Proper Logging: Good use of structured logging with tracing::debug! to track when actors stop during operations

🔍 Code Quality Observations

1. Duplicate Subscription Pattern

In lib.rs, the stopped_sub subscription is created in three places. Consider creating the subscription once at the beginning of handle_request and handle_websocket methods and reusing it throughout the function to reduce redundant subscription overhead.

2. Runner WebSocket Behavior Change (Important)

In pegboard-runner/src/lib.rs:246-252, the behavior was changed so WebSockets are NOT notified when the runner WS closes. This means WebSocket connections will remain open from the gateway perspective even when the runner disconnects, waiting only for the actor to stop. Recommendation: Add integration tests to verify this behavior works correctly.

3. Comment Accuracy

Line 385 comment changed from "Send reclaimed messages" to "Send pending messages" - good improvement in clarity! ✅


🐛 Potential Issues

1. Subscription Timing in Hibernation Flow

The stopped_sub is created at line 290 for WebSocket open handling, but if after_hibernation is true (line 310), the WebSocket open logic is skipped. Verify that the handle_websocket_hibernation_inner method subscribes to the Stopped event, or document why this race is acceptable.

2. Error Propagation During Cleanup

When sending the WebSocket close message fails, only an error log is produced. Should this error be propagated? If the close message fails to send, the runner might not clean up properly.


🔒 Security Considerations

No security issues identified. The changes properly handle error cases and don't introduce new attack vectors. ✅


⚡ Performance Considerations

  1. Subscription Overhead: Multiple subscriptions to the same event stream could add overhead. Consider optimizing by reusing subscriptions.
  2. HashMap Lookups: The has_pending_websocket_messages method performs a HashMap lookup on the hot path during hibernation.

🧪 Test Coverage

Major Gap: No tests are included with this PR for critical race condition fixes.

Recommendations:

  • Unit tests: Mock the stopped_sub to fire during various stages
  • Integration tests: Simulate actor stopping during active request/WebSocket
  • Edge cases: Actor stops immediately after subscription, multiple concurrent requests

🎯 Recommendations Summary

Must Address:

  1. Verify hibernation subscription coverage for stop events
  2. Add tests for the actor stop scenarios

Should Consider:

  1. Optimize subscription creation (reuse where possible)
  2. Clarify intended behavior for WebSocket when runner disconnects but actor continues
  3. Add integration tests for the new behavior

Final Verdict

This is a solid bug fix that addresses a real race condition. The implementation is clean and follows existing patterns. However, the lack of tests for such critical race condition fixes is concerning.

Recommended Action: Approve with the requirement that tests be added before merging, especially for the WebSocket-specific behavior changes.

Great work on identifying and fixing this race condition! 🚀

@MasterPtato MasterPtato force-pushed the 11-18-fix_guard_handle_actor_stopped_during_ws_req branch from 6c5e080 to dc9084a Compare November 18, 2025 23:50
@MasterPtato MasterPtato force-pushed the 11-13-chore_rename_retry_-_hibernate_in_runner_protocol branch from e83ac9b to 00cd5f2 Compare November 18, 2025 23:50
@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 18, 2025 23:50 Inactive
@claude

claude bot commented Nov 18, 2025

PR Review: fix(guard): handle actor stopped during ws/req

Summary

This PR adds handling for actor stopped events during WebSocket and HTTP request processing in the pegboard gateway, ensuring graceful cleanup when actors stop mid-request. The changes also improve WebSocket lifecycle management in the runner.

Code Quality ✅

Strengths:

  • Clear separation of concerns with stopped_sub subscription in both request and WebSocket handlers
  • Consistent error handling using appropriate error types (ServiceUnavailable, WebSocketServiceUnavailable, WebSocketServiceHibernate)
  • Follows existing patterns and conventions in the codebase
  • Good use of tokio::select! for concurrent event handling

Detailed Analysis

1. pegboard-gateway/src/lib.rs - Actor Stopped Handling

Lines 155-158, 290-293: Subscription creation

let mut stopped_sub = self
    .ctx
    .subscribe::<pegboard::workflows::actor::Stopped>(("actor_id", self.actor_id))
    .await?;

Good: Subscribes to actor stopped events before processing requests
⚠️ Consideration: The subscription happens after reading the request body. If the actor stops during body reading, it won't be detected. Consider moving the subscription earlier if this is a concern.

Lines 220-223, 356-359, 436-444: Select arms for stopped events

_ = stopped_sub.next() => {
    tracing::debug!("actor stopped while waiting for request response");
    return Err(ServiceUnavailable.build());
}

Good: Appropriate error types returned for each context
Good: Debug logging helps with troubleshooting
Good: Hibernation-aware logic in WebSocket handler (lines 436-444)

2. pegboard-gateway/src/shared_state.rs - Pending Messages Check

Lines 347-357: New has_pending_websocket_messages method

pub async fn has_pending_websocket_messages(&self, request_id: RequestId) -> Result<bool> {
    let Some(req) = self.in_flight_requests.get_async(&request_id).await else {
        bail!("request not in flight");
    };

    if let Some(hs) = &req.hibernation_state {
        Ok(!hs.pending_ws_msgs.is_empty())
    } else {
        Ok(false)
    }
}

Good: Simple, focused method
Good: Proper error handling for missing requests
Good: Safe handling of optional hibernation state

Lines 609-616: Early wake optimization in handle_websocket_hibernation

// Immediately rewake if we have pending messages
if self
    .shared_state
    .has_pending_websocket_messages(unique_request_id.into_bytes())
    .await?
{
    return Ok(HibernationResult::Continue);
}

Excellent: Prevents unnecessary hibernation when messages are pending
Performance: Avoids spawning keepalive task unnecessarily
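The early-wake decision can be modelled without the gateway's async machinery. This is a minimal sketch, assuming a plain `HashMap` in place of the shared in-flight request map; `InFlightRequest`, `HibernationState`, `Decision`, and `decide` are hypothetical stand-ins and not the gateway's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-request hibernation buffer.
struct HibernationState {
    pending_ws_msgs: Vec<Vec<u8>>,
}

/// Hypothetical stand-in for an in-flight request entry.
struct InFlightRequest {
    hibernation_state: Option<HibernationState>,
}

/// Hypothetical outcome of the check performed before hibernating.
#[derive(Debug, PartialEq)]
enum Decision {
    RewakeImmediately,
    EnterHibernation,
}

/// Mirrors the shape of the reviewed method: error if the request is unknown,
/// false if it has no hibernation state, otherwise "is the buffer non-empty?".
fn has_pending(requests: &HashMap<u64, InFlightRequest>, id: u64) -> Result<bool, String> {
    let req = requests
        .get(&id)
        .ok_or_else(|| "request not in flight".to_string())?;
    Ok(req
        .hibernation_state
        .as_ref()
        .map_or(false, |hs| !hs.pending_ws_msgs.is_empty()))
}

/// Skip hibernation entirely when messages are already waiting.
fn decide(requests: &HashMap<u64, InFlightRequest>, id: u64) -> Result<Decision, String> {
    if has_pending(requests, id)? {
        Ok(Decision::RewakeImmediately)
    } else {
        Ok(Decision::EnterHibernation)
    }
}
```

A request with buffered messages yields `RewakeImmediately`, one with an empty buffer or no hibernation state yields `EnterHibernation`, and an unknown request id is an error.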

3. pegboard-runner/src/lib.rs - WebSocket Lifecycle Changes

Lines 246-279: Simplified close message handling

// Send close messages to all remaining active requests
let active_requests = conn.tunnel_active_requests.lock().await;
for (request_id, req) in &*active_requests {
    // Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
    // instead they wait for the actor itself to stop.
    if req.is_ws {
        continue;
    }
    // ... send ToServerResponseAbort
}

Good: Clear distinction between WebSocket and HTTP request lifecycle
Good: Comprehensive comment explains the design decision
Good: Removes complex WebSocket close logic that was previously trying to send close frames
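To make the HTTP/WebSocket split easy to unit test, the selection step could be pulled into a pure function. The sketch below assumes integer request ids and a hypothetical `ActiveRequest` struct carrying only the `is_ws` flag; the real runner types are not shown in this PR excerpt.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a tunnelled request tracked by the runner connection.
struct ActiveRequest {
    is_ws: bool,
}

/// Returns the ids that should receive an abort when the runner WebSocket closes.
/// WebSocket requests are skipped: they wait for the actor itself to stop.
fn requests_to_abort(active: &HashMap<u32, ActiveRequest>) -> Vec<u32> {
    let mut ids: Vec<u32> = active
        .iter()
        .filter(|(_, req)| !req.is_ws)
        .map(|(id, _)| *id)
        .collect();
    // HashMap iteration order is unspecified, so sort for a stable result.
    ids.sort_unstable();
    ids
}
```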

Removed Code Analysis:

  • Removed lines 13, 246-262: Eliminated CloseFrame and CloseCode imports and complex close logic
  • Simplification: The new approach is cleaner - WebSockets wait for actor stop signal rather than runner disconnect

Potential Issues & Suggestions

Minor Issues

  1. Comment Typo (line 385):
// Send pending messages

Previous comment said "Send reclaimed messages" - good fix! ✅

  2. Subscription Timing (lines 155-158):
    The stopped_sub is created after reading the request body. If the actor stops during body reading (lines 148-153), it won't be detected immediately. Consider:
// Subscribe before reading body
let mut stopped_sub = self.ctx.subscribe::<...>(...).await?;

let body_bytes = req.into_body().collect().await
    .context("failed to read body")?
    .to_bytes();
  3. Error Propagation Consistency:
    All three stopped event handlers return errors appropriately, but consider if there are cases where you'd want to distinguish between "actor stopped" vs other ServiceUnavailable causes for metrics/logging.

Questions for Discussion

  1. Race Condition: Is there a race between the actor stopping and the subscription being established? If an actor stops between handler start and subscription creation, the event might be missed.

  2. Subscription Cleanup: The stopped_sub is dropped at the end of each function. Is this the desired behavior, or should subscriptions be maintained longer?

  3. Testing: Are there integration tests that verify:

    • Actor stops during request body reading?
    • Actor stops while waiting for response?
    • Actor stops during WebSocket message handling?
    • Actor stops during hibernation?

Performance Considerations

Good: The early wake check (lines 609-616) prevents unnecessary hibernation overhead
Good: Using tokio::select! for concurrent event handling is efficient
⚠️ Minor: Creating a new subscription for each request/WebSocket connection could add overhead at scale. Consider connection pooling or subscription reuse if this becomes a bottleneck.

Security Concerns

No security issues identified

  • Proper error handling prevents information leakage
  • No authentication/authorization changes
  • No unsafe code introduced

Test Coverage

⚠️ Missing Test Evidence: No test files were modified in this PR. Consider adding tests for:

  • Actor stopping during request handling
  • Actor stopping during WebSocket connection
  • Actor stopping during hibernation
  • Early wake when pending messages exist

Recommendations

  1. Add Integration Tests: Cover the new actor stopped scenarios
  2. Consider Subscription Timing: Move subscription creation before body reading if actor stop during body read is a concern
  3. Add Metrics: Consider adding metrics to track how often actors stop mid-request for operational visibility
  4. Documentation: Add a code comment explaining the actor stopped subscription pattern for future maintainers

Verdict

Approval: ✅ Approved with Suggestions

This is a solid PR that improves reliability by handling actor stopped events gracefully. The code follows existing patterns, has appropriate error handling, and includes a nice optimization for hibernation. The suggestions above are minor improvements that could be addressed in follow-up work.

Strengths:

  • Proper error handling
  • Good logging
  • Performance optimization (early wake)
  • Simplified WebSocket lifecycle

Action Items:

  • Consider adding integration tests
  • Consider moving subscription earlier to catch stops during body reading
  • Consider adding metrics for operational visibility

@claude

claude bot commented Nov 18, 2025

Code Review: PR #3481 - fix(guard): handle actor stopped during ws/req

Summary

This PR adds proper handling for actor stop events during WebSocket and HTTP request processing in the pegboard gateway. The changes ensure graceful cleanup when actors stop while requests are in-flight.

Code Quality ✅

Strengths:

  • Consistent error handling: Properly returns ServiceUnavailable for HTTP requests and WebSocketServiceUnavailable/WebSocketServiceHibernate for WebSocket connections based on hibernation capability
  • Structured logging: Uses structured logging correctly with tracing::debug!(?x, "message") pattern, following CLAUDE.md conventions
  • Lifecycle management: The optimization in handle_websocket_hibernation (lib.rs:609-616) to immediately rewake when pending messages exist is a good performance improvement
  • Code organization: Subscribes to the Stopped signal early in both handle_request and handle_websocket methods, ensuring proper cleanup

Potential Issues ⚠️

1. Subscription Created After Request Start (Medium Priority)

In handle_request (lib.rs:155-158), the stopped_sub subscription is created after reading the request body but before starting the in-flight request. If the actor stops between reading the body and subscribing, there's a small race condition window.

Recommendation: Move the subscription to the earliest point possible.

2. WebSocket: No Stopped Handling During Hibernation (Medium Priority)

In handle_websocket_hibernation_inner (lib.rs:668-699), the method only listens for Ready events and WebSocket messages, but doesn't handle the case where the actor stops during hibernation.

Recommendation: Add a stopped_sub to the hibernation select to handle actor stops during hibernation.

3. Runner-side: Changed Semantics (Low Priority)

In pegboard-runner/src/lib.rs:249-253, WebSockets no longer receive close messages when the runner WebSocket closes. Consider documenting this behavior more clearly.

Performance Considerations ✅

Positive:

  • Early check for pending messages in hibernation avoids unnecessary keepalive task spawn
  • Subscriptions are cleaned up properly via tokio::select patterns

Security Concerns ✅

No security issues identified. Error messages don't leak sensitive information.

Test Coverage ⚠️

Missing:

  • No test files found in pegboard-gateway package
  • This PR adds critical error handling logic that should be tested

Recommendations:

  1. Add integration tests for actor stop scenarios
  2. Test the race condition scenarios
  3. Verify that pending messages are properly handled when actor stops

Overall Assessment

This is a solid improvement that addresses a real gap in actor lifecycle handling. The code quality is good and follows repository conventions. The main concern is the lack of test coverage and a couple of edge cases in the subscription timing.


Approval Status: ⚠️ Request Changes (add tests + fix race conditions)

cc: @MasterPtato

🤖 Generated with Claude Code

@claude

claude bot commented Nov 19, 2025

PR Review: fix(guard): handle actor stopped during ws/req

Overview

This PR addresses an important issue where the gateway doesn't properly handle actor stop events during WebSocket and HTTP request processing. The implementation subscribes to actor Stopped events and gracefully terminates connections when actors are no longer available.

Code Quality & Best Practices

✅ Strengths

  1. Consistent Error Handling: The PR correctly uses appropriate error types:

    • ServiceUnavailable for HTTP requests
    • WebSocketServiceUnavailable for WebSocket connections (when hibernation not enabled)
    • WebSocketServiceHibernate for WebSocket connections (when hibernation enabled)
  2. Follows Repository Patterns:

    • Uses structured logging correctly (e.g., tracing::debug!("actor stopped...") without formatting into the message)
    • Adheres to the error handling patterns defined in the codebase
    • Properly uses tokio::select! for concurrent event handling
  3. Good Code Organization: The changes are well-scoped and localized to the affected modules

🔍 Potential Issues

1. Logic Error in is_ws_hibernate check (Critical)

Location: engine/packages/pegboard-gateway/src/lib.rs:562-566

if lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Issue: The logic appears inverted. On a Result, map_or_else takes two closures:

  • First closure (the default) runs on Err (the error case)
  • Second closure (the mapper) runs on Ok (the success case)

This means:

  • If lifecycle_res is Err(...), it checks is_ws_hibernate(err)
  • If lifecycle_res is Ok(...), it returns false

But the condition sends a close frame when the result is false, which means:

  • Close frames are sent for successful completions
  • Close frames are NOT sent for hibernation errors

Expected behavior: Close frames should NOT be sent when hibernating, but SHOULD be sent otherwise.

Suggested fix:

// Send close frame to runner if NOT hibernating
if !lifecycle_res
    .as_ref()
    .map_or_else(is_ws_hibernate, |_| false)
{

Or more clearly:

let should_send_close = match &lifecycle_res {
    Err(err) => !is_ws_hibernate(err),
    Ok(_) => true,
};

if should_send_close {
    // ... send close message
}
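The behaviour of `Result::map_or_else` and the equivalence of the two suggested fixes can be checked in isolation. This is a minimal sketch using only the standard library; `LifecycleError` and this `is_ws_hibernate` are hypothetical stand-ins for the gateway's real error type and helper, which are not shown in the review.

```rust
/// Hypothetical stand-in for the gateway's lifecycle error type.
#[derive(Debug)]
enum LifecycleError {
    Hibernate,
    Other,
}

/// Hypothetical stand-in for the `is_ws_hibernate` helper.
fn is_ws_hibernate(err: &LifecycleError) -> bool {
    matches!(err, LifecycleError::Hibernate)
}

/// The expression quoted above: the first closure handles Err, the second handles Ok.
fn hibernate_check(res: &Result<(), LifecycleError>) -> bool {
    res.as_ref().map_or_else(is_ws_hibernate, |_| false)
}

/// The explicit `match` form suggested above.
fn should_send_close(res: &Result<(), LifecycleError>) -> bool {
    match res {
        Err(err) => !is_ws_hibernate(err),
        Ok(_) => true,
    }
}
```

For every input, `should_send_close` is the negation of `hibernate_check`, so the `!`-prefixed condition and the `match` form are interchangeable. Whether the original condition is actually inverted depends on what the guarded branch does, which this sketch does not model.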

2. Subscription Timing

Location: Multiple locations in lib.rs

The PR creates stopped_sub subscriptions after reading the request body (for HTTP) or after header extraction (for WebSocket). This creates a potential race condition:

let body_bytes = req.into_body().collect().await?; // Actor could stop here
let mut stopped_sub = self.ctx.subscribe::<...>().await?; // We'd miss it

Impact: If the actor stops between request parsing and subscription, the stop event would be missed and the request would time out instead of failing immediately.

Suggested improvement: Subscribe to actor stopped events as early as possible, before any async operations.
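The ordering problem can be reproduced with a toy broadcast bus built on `std::sync::mpsc`. This sketch assumes the pub/sub layer delivers events only to subscribers that exist at publish time and does not replay past messages, which this review has not verified for the real subscription API; `Bus` and `observes_stop` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Toy broadcast bus: an event reaches only the subscribers registered when it is published.
struct Bus {
    subscribers: Vec<Sender<&'static str>>,
}

impl Bus {
    fn new() -> Self {
        Bus { subscribers: Vec::new() }
    }

    fn subscribe(&mut self) -> Receiver<&'static str> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    fn publish(&self, event: &'static str) {
        for tx in &self.subscribers {
            // A send error only means that subscriber was dropped; ignore it.
            let _ = tx.send(event);
        }
    }
}

/// Simulates a handler where the actor stops while the request body is being read.
/// Returns true if the handler's subscription observed the stop.
fn observes_stop(subscribe_before_body: bool) -> bool {
    let mut bus = Bus::new();
    let early = if subscribe_before_body {
        Some(bus.subscribe())
    } else {
        None
    };
    // The actor stops during "body reading".
    bus.publish("stopped");
    // A late subscriber registers only after the event has already been published.
    let rx = match early {
        Some(rx) => rx,
        None => bus.subscribe(),
    };
    rx.try_recv().is_ok()
}
```

Under that assumption, `observes_stop(true)` sees the stop and `observes_stop(false)` misses it, which is the window described above.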

3. Immediate Wake on Pending Messages

Location: engine/packages/pegboard-gateway/src/lib.rs:614-621

if self
    .shared_state
    .has_pending_websocket_messages(unique_request_id.into_bytes())
    .await?
{
    return Ok(HibernationResult::Continue);
}

Observation: This is good defensive programming that prevents unnecessary hibernation when messages are already pending. However, there's no explanation in comments about why this is needed.

Suggestion: Add a comment explaining this check prevents a race condition where messages arrive just as hibernation begins.

4. Missing Stopped Subscription in Hibernation Handler

Location: engine/packages/pegboard-gateway/src/lib.rs:673-705

The handle_websocket_hibernation_inner function subscribes to actor::Ready events but NOT to actor::Stopped events. This means if an actor stops during hibernation, the WebSocket will remain hibernated until either:

  • The actor becomes ready again (unlikely if stopped)
  • The client sends a message or closes

Suggested improvement: Also subscribe to Stopped events during hibernation and close the connection if received.
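The effect of the missing subscription can be illustrated with a synchronous model of the hibernation wait. This is a sketch only: `Event`, `Outcome`, and `hibernate` are hypothetical, and a non-blocking drain of a `std::sync::mpsc` channel stands in for the real `tokio::select!` loop.

```rust
use std::sync::mpsc::Receiver;

/// Hypothetical events that can arrive while a WebSocket is hibernating.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Event {
    Ready,
    Stopped,
    ClientMessage,
}

/// Hypothetical result of one pass over the queued events.
#[derive(Debug, PartialEq)]
enum Outcome {
    Wake,
    Close,
    StillWaiting,
}

/// Drains the currently queued events without blocking.
/// `listen_for_stopped` models whether the handler subscribed to Stopped events.
fn hibernate(events: &Receiver<Event>, listen_for_stopped: bool) -> Outcome {
    for event in events.try_iter() {
        match event {
            Event::Ready | Event::ClientMessage => return Outcome::Wake,
            Event::Stopped if listen_for_stopped => return Outcome::Close,
            // Without a Stopped subscription the stop has no effect on the wait.
            Event::Stopped => continue,
        }
    }
    Outcome::StillWaiting
}
```

With `listen_for_stopped` set to false, a stop leaves the connection waiting until a `Ready` event or client activity arrives, matching the behaviour described above.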

Performance Considerations

✅ Good Practices

  1. Efficient Subscription Management: The PR properly uses tokio::select! which is efficient for handling multiple async events concurrently.

  2. No Blocking Operations: All operations are properly async, preventing thread blocking.

⚠️ Minor Concerns

  1. Multiple Subscriptions: Each request/WebSocket creates a new subscription to actor stopped events. For actors with high request volume, this could create significant subscription overhead.
    • Impact: Likely minimal in practice since subscriptions are lightweight
    • Consideration: Monitor UPS subscription count if this becomes a bottleneck

Security Concerns

✅ No Major Issues

  • Error messages don't leak sensitive information
  • Proper error propagation without exposing internal state
  • No authentication/authorization bypasses introduced

Test Coverage

⚠️ Missing Tests

The PR doesn't include any tests for the new actor stopped handling. Consider adding integration tests for:

  1. HTTP Request: Actor stops while waiting for response
  2. WebSocket Open: Actor stops during WebSocket handshake
  3. WebSocket Active: Actor stops during active WebSocket connection
  4. WebSocket Hibernation: Actor stops while WebSocket is hibernated
  5. Hibernation with Pending Messages: Verify immediate wake behavior

Example test scenario:

#[tokio::test]
async fn test_actor_stopped_during_request() {
    // Setup: Create actor and start request
    // Action: Stop actor while request is in flight
    // Assert: Request returns ServiceUnavailable error
}

Runner Changes

✅ Improved Clarity

Location: engine/packages/pegboard-runner/src/lib.rs:246-279

The change to NOT send WebSocket close messages when the runner connection closes is correct:

// Websockets are not ephemeral like requests. If the runner ws closes they are not informed;
// instead they wait for the actor itself to stop.
if req.is_ws {
    continue;
}

This properly distinguishes between:

  • HTTP requests: Ephemeral, should be aborted when runner disconnects
  • WebSockets: Persistent, rely on actor lifecycle events

The comment clearly explains this distinction. Good!

Documentation

Website Changes

The changes to website/public/llms*.txt appear to be unrelated documentation cleanup. This is fine but might warrant a separate commit for clarity.

Overall Assessment

Status: ⚠️ Needs Changes

The PR addresses a real issue and the overall approach is sound, but there's a critical logic error in the is_ws_hibernate check that needs to be fixed before merging. Additionally, the race condition with subscription timing and missing test coverage should be addressed.

Required Changes:

  1. ✅ Fix the inverted logic in the is_ws_hibernate check (lines 562-566)

Recommended Changes:

  1. Subscribe to actor stopped events earlier to prevent race conditions
  2. Add stopped event handling during hibernation
  3. Add test coverage for the new functionality
  4. Add explanatory comments for the immediate wake optimization

Nice to Have:

  1. Consider separating the documentation cleanup into a separate commit
  2. Add performance monitoring for subscription overhead if deployed at scale

Great work identifying and fixing this issue! The core logic is solid, just needs the critical bug fix and some defensive improvements.

2 participants