
Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 21, 2025 0:48am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 21, 2025 0:48am
rivet-inspector Ignored Ignored Preview Nov 21, 2025 0:48am
rivet-site Ignored Ignored Preview Nov 21, 2025 0:48am

Contributor Author

MasterPtato commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review

I've reviewed PR #3505 and here's my feedback:

Summary

This PR fixes issues with hibernating WebSocket connections by:

  1. Preventing GC from removing hibernating in-flight requests
  2. Checking if actor started after subscription when hibernating

Positive Changes

Race condition fix (lib.rs:639-652): The addition of fetching actor info after subscribing prevents a race condition where an actor could become ready between the subscription and the select. This is good defensive programming.
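
A minimal sketch of the subscribe-then-re-check shape, using tokio::sync::watch as a stand-in for the actual ready-event subscription and actor-info fetch (both of which are project-specific):

use tokio::sync::watch;

// Stand-in: `ready_rx` plays the role of the pubsub subscription, and
// `borrow_and_update()` plays the role of fetching the actor's current state.
async fn wait_until_ready(mut ready_rx: watch::Receiver<bool>) {
    // Re-check the current state only after the receiver exists, so a
    // transition that happened just before this point is still observed.
    if *ready_rx.borrow_and_update() {
        return;
    }
    // Otherwise wait for the next change; nothing can slip through the gap
    // between the check and the wait.
    while ready_rx.changed().await.is_ok() {
        if *ready_rx.borrow_and_update() {
            return;
        }
    }
}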

Keepalive mechanism (keepalive_task.rs:49-57): Using tokio::try_join! to update both the database and in-memory state atomically is correct and ensures consistency.
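
A rough illustration of that shape; both helper functions below are placeholders standing in for the actual UDB upsert and in-memory keepalive:

use anyhow::Result;

// Placeholder for the database keepalive write.
async fn upsert_keepalive_row() -> Result<()> {
    Ok(())
}

// Placeholder for bumping the in-memory last_ping.
async fn keepalive_in_memory() -> Result<()> {
    Ok(())
}

async fn keepalive_tick() -> Result<()> {
    // Run both updates concurrently; if either fails, the whole tick fails,
    // so the persistent and in-memory views don't drift apart.
    tokio::try_join!(upsert_keepalive_row(), keepalive_in_memory())?;
    Ok(())
}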

GC timeout logic (shared_state.rs:503-505): Adding HibernationTimeout as a separate GC reason provides better observability and prevents hibernating connections from being incorrectly cleaned up.

Issues Found

🔴 Critical: Debug logging left in production code

shared_state.rs:178, 373

tracing::info!(?message_id, ?payload, "---------------");
tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

These debug logs should be removed before merging. They appear to be temporary debugging statements.

Recommendation: Remove these or convert to tracing::debug! with meaningful messages.
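
For example, a cleaned-up variant might read roughly like the following (field choice and wording here are illustrative, not the project's actual message):

tracing::debug!(?message_id, "forwarding tunnel message");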


🟡 Minor: Inconsistent logging format

shared_state.rs:178
The log message uses ?message_id but should use %message_id for consistency with the tracing instrument directive on line 141 which uses % for request_id formatting.

Recommendation: Use % for display formatting consistently across the codebase.


🟡 Code quality: Parameter renaming

lib.rs:270, 277, 568, 572
The parameter rename from unique_request_id to request_id is good for clarity, but there's a typo in the log message at lib.rs:590, which says "exiting hibernating" (should be "hibernation").

Recommendation: Fix the log message: "exiting hibernation due to pending messages"


🟢 Minor: Comment typo

keepalive_task.rs:11
Comment says "Periodically pings writes keepalive" - should be "Periodically writes keepalive pings" or "Periodically pings and writes keepalive".


Performance & Security

✅ No performance concerns identified
✅ No security vulnerabilities introduced
✅ Proper error handling maintained

Test Coverage

⚠️ No tests found - Consider adding integration tests for:

  • Race condition scenario where actor becomes ready during hibernation setup
  • GC behavior with hibernating connections
  • Keepalive task interaction with in-flight request state

Architecture Considerations

✅ The two-phase approach (subscribe then check) correctly handles the race condition
✅ The keepalive mechanism properly updates both persistent (UDB) and in-memory state
✅ GC logic correctly distinguishes between hibernating and non-hibernating connections

Recommendations

Must fix before merge:

  1. Remove or improve debug logging statements (lines 178, 373 in shared_state.rs)

Should fix:
2. Fix typo: "exiting hibernating" → "exiting hibernation" (lib.rs:590)
3. Fix comment: "pings writes" → "writes" (keepalive_task.rs:11)
4. Consider consistent use of % vs ? for tracing formatting

Nice to have:
5. Add integration tests for the race condition scenarios
6. Document the race condition fix in code comments

Overall, this is a solid fix for important race conditions in WebSocket hibernation. The logic is sound, but the debug logging needs to be cleaned up before merging.

@NathanFlurry NathanFlurry changed the base branch from 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state to graphite-base/3505 November 20, 2025 22:32
@MasterPtato MasterPtato force-pushed the 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating branch from 2cb75a1 to a1a7bc8 November 20, 2025 22:38
@MasterPtato MasterPtato changed the base branch from graphite-base/3505 to 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state November 20, 2025 22:39
@MasterPtato MasterPtato marked this pull request as ready for review November 20, 2025 22:41
@claude

claude bot commented Nov 20, 2025

PR Review: fix(gateway): prevent gc from removing hibernating in flight req, check actor started after sub when hibernating

Summary

This PR addresses two critical race conditions in the pegboard gateway's hibernation handling:

  1. Prevents garbage collection from removing in-flight requests that are actively hibernating
  2. Ensures actor state is checked after subscribing to prevent missing ready events

✅ Strengths

Architecture & Design

  • Good race condition fix: The check at lib.rs:646-659 properly handles the TOCTOU (time-of-check-time-of-use) race by fetching actor state after subscribing to the ready event
  • Proper keepalive mechanism: The keepalive_hws function correctly updates the last_ping timestamp to prevent premature GC of hibernating requests
  • Clean separation of concerns: The new field in InFlightRequestHandle properly distinguishes between new and resumed connections

Code Quality

  • Improved logging: Consistent use of % formatting for request IDs instead of ? improves readability (e.g., request_id=%tunnel_id::request_id_to_string(&request_id))
  • Better error handling: The ensure! macro at lib.rs:311-314 provides clear assertion that we shouldn't create new in-flight entries after hibernation

⚠️ Issues & Concerns

🔴 Critical: Debug Logging Left in Production Code

Location: shared_state.rs:188 and shared_state.rs:383

tracing::info!(?message_id, ?payload, "---------------");
// ...
tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

Issue: These appear to be debug logging statements that should be removed before merging. They:

  • Use tracing::info! for what appears to be debugging noise
  • Have non-descriptive messages ("---------------", "------2---------")
  • Log potentially large payloads at info level which could impact performance
  • Violate CLAUDE.md logging conventions (should be lowercase, descriptive messages)

Recommendation: Remove these lines or convert to proper tracing::debug! with descriptive messages.


🟡 Potential Logic Issue: GC Timeout Fallback

Location: shared_state.rs:489-490

let hibernation_timeout =
    Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000));

Issue: The unwrap_or(90_000) fallback seems problematic:

  • If hibernation_timeout (i64) cannot convert to the target type, it falls back to 90 seconds
  • This could silently mask configuration errors
  • The fallback value doesn't match common timeout patterns in the codebase

Recommendation: Either:

  1. Use expect() with a clear error message if the conversion fails
  2. Validate the config value at initialization time
  3. Document why 90 seconds is the correct fallback

🟡 Code Style: Inconsistent Struct Field Destructuring

Location: lib.rs:167-170

let InFlightRequestHandle {
    mut msg_rx,
    mut drop_rx,
    ..  // Ignoring the new field
} = self.shared_state.start_in_flight_request(...).await;

vs. lib.rs:302-309 where all fields are explicitly destructured.

Issue: Using .. to ignore fields makes it harder to track when struct fields are used. The explicit destructuring in the websocket handler is clearer.

Recommendation: For consistency and clarity, explicitly list new: _ instead of using ...
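
A minimal illustration with a stand-in struct (the field names come from the review; the field types here are placeholders, not the gateway's real types):

struct InFlightRequestHandle {
    msg_rx: tokio::sync::mpsc::Receiver<Vec<u8>>,
    drop_rx: tokio::sync::oneshot::Receiver<()>,
    new: bool,
}

fn handle_request(handle: InFlightRequestHandle) {
    // Naming the ignored field makes it visible that `new` is deliberately
    // unused here, and future fields added to the struct won't be silently
    // swallowed by a `..` pattern.
    let InFlightRequestHandle {
        msg_rx: _msg_rx,
        drop_rx: _drop_rx,
        new: _,
    } = handle;
}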


🟢 Minor: Typo in Comment

Location: keepalive_task.rs:11

/// Periodically pings writes keepalive in UDB.

Issue: Grammar error - "pings writes" should be "writes" or "pings and writes"

Recommendation: Fix to: "Periodically writes keepalive in UDB."


🟢 Minor: Configuration Dependency Added

Locations: guard/src/lib.rs:28, guard/src/shared_state.rs:9

Observation: The SharedState now requires config to be passed through, adding rivet-config as a dependency to pegboard-gateway. This is fine but adds coupling.

Recommendation: Consider if hibernation_timeout could be passed as a simple parameter rather than requiring the entire config object. This would reduce coupling, though the current approach is acceptable.


🧪 Testing Considerations

Missing Test Coverage:

  1. Race condition between actor ready and subscription (the main fix at lib.rs:646-659)
  2. GC behavior when last_ping is updated via keepalive_hws
  3. The ensure! assertion when attempting to create a new entry after hibernation

Recommendation: Add integration tests for these scenarios, particularly:

  • Actor becomes ready between check and subscribe
  • Hibernating request survives GC when keepalive is running
  • Attempting to reconnect when entry should exist

📋 Minor Observations

  1. Duplicate tokio::try_join! pattern: keepalive_task.rs:49-57 - Good use of concurrent operations for keepalive
  2. Proper error propagation: All error cases properly return anyhow::Result
  3. Following CLAUDE.md conventions: Generally good adherence to structured logging patterns (except for the debug logs mentioned)

🎯 Action Items

Before Merge:

  1. MUST: Remove debug logging at shared_state.rs:188 and shared_state.rs:383
  2. SHOULD: Fix the unwrap_or(90_000) fallback or document why it's correct
  3. SHOULD: Fix typo in keepalive_task.rs:11
  4. CONSIDER: Use explicit field ignoring instead of ..

After Merge:

  • Add integration tests for the race condition fixes
  • Consider refactoring config dependency

✨ Overall Assessment

Rating: ⭐⭐⭐⭐ (4/5)

This PR fixes legitimate race conditions in the hibernation logic with a sound approach. The core fixes are solid and well-reasoned. The main concern is the debug logging that should not be merged to production. Once those logging statements are addressed, this is good to merge.

The fix properly prevents the GC from incorrectly removing hibernating connections and handles the actor ready race condition correctly. Good work on identifying and fixing these subtle timing issues!

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating branch from a1a7bc8 to b1dae1e November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state branch from 8b84eb1 to 4f58c50 November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating branch from b1dae1e to b6dfb0d November 20, 2025 23:35
@claude

claude bot commented Nov 20, 2025

Code Review

This PR addresses important issues with WebSocket hibernation handling in the pegboard gateway. Overall, the changes look solid and address real race conditions and lifecycle management issues. Here's my detailed feedback:

Strengths

  1. Critical Race Condition Fix: Lines 647-659 in lib.rs properly handle the race condition by:

    • Creating the subscription first
    • Then checking actor state after subscription
    • This prevents missing the Ready event if the actor becomes ready between the check and subscription
  2. Hibernation Timeout Logic: The new HibernationTimeout GC reason (lines 515-516 in shared_state.rs) correctly prevents indefinite hibernation when keepalive fails.

  3. Keepalive Integration: The dual keepalive approach (lines 49-57 in keepalive_task.rs) ensures both UDB and in-flight state are updated atomically using tokio::try_join!.

  4. Validation Logic: The check at lines 311-315 in lib.rs ensures invariant correctness - we should never create a new in-flight entry after hibernation.

  5. Logging Improvements: Changed from ? to % for request_id logging, which is more appropriate for display formatting.

🔍 Issues & Recommendations

Critical - Debug Logging Left in Code

Line 188 in shared_state.rs:

tracing::info!(?message_id, ?payload, "---------------");

Line 383 in shared_state.rs:

tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

These appear to be debug logs that should be removed before merging. They:

  • Use info level for debugging
  • Have unclear messages ("---------------", "------2---------")
  • May log sensitive payload data at production log levels

Recommendation: Remove these or change to tracing::debug! with descriptive messages.

Minor - Inconsistent Debug Log (Line 191 in shared_state.rs)

tracing::debug!(?message_id, ?payload, "shared state send message");

This debug log was added but duplicates information from the info logs above. Consider consolidating.

Code Quality - Configuration Error Handling

Line 490 in shared_state.rs:

Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000))

The unwrap_or(90_000) silently falls back to 90 seconds if the conversion fails. Consider:

  • Logging a warning when fallback occurs
  • Making this a named constant (DEFAULT_HIBERNATION_TIMEOUT_MS)
  • Validating the config value at startup instead

Performance - Redundant Variable Assignment

Lines 422-436 in lib.rs:

let keepalive = if can_hibernate {
    Some(tokio::spawn(keepalive_task::task(
        self.shared_state.clone(),
        // ...
    )))
} else {
    None
};

Then later checked with:

let Some(keepalive) = keepalive else {
    return Ok(LifecycleResult::Aborted);
};

This pattern is correct but could be simplified. The task spawn could be deferred until needed in the join.

Documentation - Missing Comment Context

The new last_ping field (line 53 in shared_state.rs) has a comment but the keepalive mechanism isn't fully documented. Consider adding a module-level comment explaining the hibernation keepalive protocol.

🔒 Security Considerations

  • No obvious security issues
  • Payload logging (if kept) should be reviewed for sensitive data exposure
  • GC timeout handling prevents resource exhaustion

🧪 Testing Recommendations

The PR should include tests for:

  1. Race condition: Actor becomes ready between subscription and state check
  2. GC timeout: Hibernating request expires when keepalive stops
  3. Edge case: Creating in-flight request after hibernation (should fail the ensure!)
  4. Keepalive failure: Both UDB and in-flight state remain consistent

📋 Other Notes

  • The new field addition to InFlightRequestHandle is a clean way to track entry creation
  • Threading config through to SharedState is the right approach for accessing hibernation timeout
  • GC logic improvements make the state machine clearer

Summary

This is a solid fix for critical issues. The main blocker is the debug logging left in production code. Once that's addressed, this should be good to merge.

Action Items:

  • Remove or fix debug logging (lines 188, 383 in shared_state.rs)
  • Consider making hibernation timeout fallback a named constant
  • Add tests for the race condition fix
  • Verify no sensitive data in payload logging

@claude

claude bot commented Nov 20, 2025

Code Review - PR #3505

Summary

This PR fixes two critical issues with hibernating WebSocket connections in the pegboard gateway:

  1. Prevents garbage collection from removing hibernating in-flight requests
  2. Adds a race condition check to ensure actors have actually started before processing hibernating requests

Positive Aspects ✅

Architecture & Design

  • Good use of the new flag in InFlightRequestHandle to track whether an entry is new or reused - this provides valuable context for hibernation logic
  • The keepalive_hws() function provides a clean separation of concerns for keeping hibernating requests alive
  • Race condition fix in handle_websocket_hibernation_inner() (lines 646-659) is well-thought-out: subscribing first, then checking state prevents missing events

Code Quality

  • Consistent use of structured logging with % formatting for request/gateway IDs (e.g., request_id=%tunnel_id::request_id_to_string(&request_id))
  • Good error handling with the ensure! macro to validate hibernation state invariants
  • Proper cleanup via tokio::try_join! to run both DB upsert and in-memory keepalive atomically

Issues & Concerns ⚠️

1. Debug Logging Left in Production Code 🔴 HIGH PRIORITY

Lines shared_state.rs:188 and shared_state.rs:383 contain debug logs that should be removed:

// Line 188
tracing::info!(?message_id, ?payload, "---------------");

// Line 383  
tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

These appear to be debugging artifacts with placeholder messages. They should either be:

  • Removed entirely if not needed
  • Converted to proper tracing::debug! with descriptive messages
  • Made conditional on a debug feature flag

Logging payloads at info level could have performance implications and may expose sensitive data.

2. Inconsistent Error Handling 🟡

In keepalive_hws() (shared_state.rs:265-280):

if let Some(hs) = &mut req.hibernation_state {
    hs.last_ping = Instant::now();
} else {
    tracing::warn!("should not call keepalive_hws for non-hibernating ws");
}

This logs a warning but returns Ok(()). Consider returning an error or using ensure! for consistency with the codebase's error handling patterns:

let Some(hs) = &mut req.hibernation_state else {
    bail!("cannot keepalive non-hibernating websocket");
};
hs.last_ping = Instant::now();

3. Type Safety Concern 🟡

Line shared_state.rs:490:

Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000))

The fallback of 90 seconds is hardcoded and may not match the actual config default. Consider:

  • Using a named constant: const DEFAULT_HIBERNATION_TIMEOUT_MS: u64 = 90_000;
  • Or better, ensure the config value is always valid and use .expect() with a clear message
  • The fallback silently masks configuration errors which could lead to unexpected behavior

4. Race Condition Documentation 📝

The fix at lines 646-659 in lib.rs is excellent but deserves a code comment explaining the race condition it prevents:

// Fetch actor info after sub to prevent race condition
// Race: Actor becomes ready between initial check and subscription
if let Some(actor) = self.ctx.op(...).await? {
    if actor.runner_id.is_some() {
        // ...
    }
}

Suggested comment:

// Check if actor became ready before we subscribed. This prevents a race where:
// 1. Actor becomes ready
// 2. We subscribe (missing the Ready event)
// 3. We'd wait forever on the subscription
// By checking after subscribing, we ensure we either see the current state or catch the event

5. Potential Memory Leak 🟡

The HibernationState now tracks last_ping to prevent GC, but the GC logic (lines 503-517) only checks:

  1. Pending message ack timeout
  2. Hibernation timeout (based on last_ping)
  3. Gateway channel closed (if not hibernating)

Question: What happens if a hibernating WebSocket has no pending messages but the keepalive task fails/crashes? The last_ping would never update, eventually triggering HibernationTimeout, which is correct. However, this could take up to 90 seconds (or configured timeout). Consider adding metrics or logging when this timeout is triggered to help diagnose keepalive failures.
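
For readers following along, a rough sketch of how such a GC decision could be structured; all names and parameter shapes below are assumptions based on this review, not the actual implementation:

use std::time::{Duration, Instant};

struct HibernationState {
    last_ping: Instant,
}

enum GcReason {
    MessageAckTimeout,
    HibernationTimeout,
    GatewayChannelClosed,
}

fn gc_reason(
    oldest_unacked: Option<Instant>,
    ack_timeout: Duration,
    hibernation: Option<&HibernationState>,
    hibernation_timeout: Duration,
    gateway_closed: bool,
) -> Option<GcReason> {
    if oldest_unacked.is_some_and(|sent_at| sent_at.elapsed() > ack_timeout) {
        // A pending message has gone unacknowledged for too long.
        Some(GcReason::MessageAckTimeout)
    } else if let Some(hs) = hibernation {
        // Hibernating request: only collect it once the keepalive has stopped
        // refreshing last_ping for longer than the configured timeout.
        (hs.last_ping.elapsed() > hibernation_timeout).then_some(GcReason::HibernationTimeout)
    } else if gateway_closed {
        // Non-hibernating request whose gateway channel has gone away.
        Some(GcReason::GatewayChannelClosed)
    } else {
        None
    }
}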

6. Configuration Dependency 📝

The PR adds a rivet-config dependency to pegboard-gateway. While necessary, this increases coupling. The hibernation_timeout is read once during initialization but used in a hot path (GC loop every 15 seconds). This is fine, but document that config changes require restart.

Testing Recommendations 🧪

This PR would benefit from tests covering:

  1. Hibernation keepalive: Verify that hibernating requests with active keepalive aren't GC'd
  2. GC timeout: Verify that hibernating requests without keepalive are eventually removed
  3. Race condition: Test the actor-becoming-ready race with concurrent subscriptions
  4. New flag behavior: Verify new is true for first request, false for reconnections
  5. Error scenarios: What happens if keepalive_hws is called on non-hibernating request

Consider adding driver tests (see rivetkit-typescript/packages/rivetkit/src/driver-test-suite/tests/) for the hibernation scenarios.

Performance Considerations ⚡

Positive:

  • Using Instant::now() instead of util::timestamp::now() for last_ping is correct since it's only used for elapsed time calculations (avoiding potential clock skew issues)
  • The GC runs every 15 seconds, so the overhead of checking last_ping is minimal

Consideration:

  • Every keepalive interval (hibernation_timeout / 2), the system now updates both UDB and in-memory state via tokio::try_join!. This is fine, but ensure UDB can handle the write load with many hibernating connections.

Security Considerations 🔒

Logging Sensitivity:

  • The debug logs at lines 188 and 383 log ?payload which may contain sensitive data. Ensure payloads don't contain tokens, credentials, or PII before merging, or remove these logs.

DoS Protection:

  • The hibernation timeout mechanism is good for preventing resource exhaustion from abandoned connections
  • The existing HWS_MAX_PENDING_MSGS_SIZE_PER_REQ (1 MiB) limit remains in place, which is good

Code Style Compliance 📐

Follows CLAUDE.md conventions:

  • ✅ Uses structured logging correctly (mostly)
  • ✅ Error handling with anyhow::Result
  • ✅ Workspace dependencies properly configured
  • ✅ Lowercase log messages
  • ⚠️ The debug logs violate the structured logging pattern

Minor style notes:

  • Line 383: The log message "------2---------" should be descriptive
  • Consider using named constants for magic numbers (e.g., the 128ms jitter in keepalive_task.rs:46)

Recommendations

Before merging:

  1. MUST FIX: Remove or fix debug logging at lines 188 and 383
  2. SHOULD FIX: Make error handling in keepalive_hws() consistent
  3. SHOULD FIX: Use a named constant for the hibernation timeout fallback
  4. SHOULD ADD: Code comment explaining the race condition fix

Follow-up work:

  1. Add metrics for hibernation timeout events
  2. Add integration/driver tests for hibernation scenarios
  3. Consider making the jitter configurable or using a constant

Verdict

This PR addresses real bugs and the core logic is sound. The race condition fix is particularly well-designed. However, the debug logging artifacts must be addressed before merging. Once the high-priority issues are resolved, this will be a solid improvement to the hibernation system.

Estimated risk: Medium (due to changes in critical path) → Low (after debug logs removed)
Recommend: Request changes for debug logging, approve after fixes


Generated with Claude Code

@claude

claude bot commented Nov 20, 2025

PR Review: Fix Gateway Hibernation Issues

Summary

This PR addresses two critical issues with hibernating WebSocket requests in the pegboard gateway:

  1. Prevents the GC from removing hibernating in-flight requests that have pending messages
  2. Adds a race condition check to verify the actor hasn't already started during hibernation setup

Code Quality ✅

Positive Changes:

  • Good use of structured logging with % formatter for better performance (changed from ? in multiple places)
  • Proper separation of concerns with the new keepalive_hws() method
  • Added new flag to InFlightRequestHandle to track entry creation/reuse - clever solution

Code Style:

  • Follows Rust conventions and workspace patterns
  • Consistent with existing codebase style
  • Uses ensure! macro appropriately for validation (line 311-314 in lib.rs)

Potential Issues & Concerns

1. Debug Logging Left in Production Code ⚠️

Location: shared_state.rs:383

tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

This appears to be temporary debug logging that should be removed before merging. The log message "------2---------" suggests it was added during development.

Recommendation: Remove or convert to proper debug! logging with a meaningful message.

2. Error Handling: Try_into with Silent Fallback

Location: shared_state.rs:490

Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000))

The try_into() can fail if hibernation_timeout (an i64) is negative or too large. The fallback value of 90,000ms matches the config default, but this silent failure could mask configuration issues.

Recommendation: Consider logging a warning if the conversion fails, or validate the config value at initialization time.
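
One possible shape for that recommendation, sketched with a hypothetical constant name:

use std::time::Duration;

// Hypothetical constant; 90 seconds mirrors the fallback used in the PR.
const DEFAULT_HIBERNATION_TIMEOUT_MS: u64 = 90_000;

fn hibernation_timeout(config_ms: i64) -> Duration {
    let ms = u64::try_from(config_ms).unwrap_or_else(|_| {
        // Surface the bad config value instead of silently falling back.
        tracing::warn!(config_ms, "invalid hibernation timeout, using default");
        DEFAULT_HIBERNATION_TIMEOUT_MS
    });
    Duration::from_millis(ms)
}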

3. Race Condition Check Enhancement

Location: lib.rs:643-658
The new check fetches actor info after subscribing to prevent a race condition where the actor becomes ready during hibernation. However, there's a potential gap:

if let Some(actor) = self.ctx.op(pegboard::ops::actor::get_for_gateway::Input {
    actor_id: self.actor_id,
}).await? {
    if actor.runner_id.is_some() {
        tracing::debug!("actor became ready during hibernation");
        return Ok(HibernationResult::Continue);
    }
}

Questions:

  • What happens if the actor is None? Should this be an error case?
  • Is checking runner_id.is_some() sufficient to determine if an actor is "started"? This logic may warrant a comment explaining the actor state model.

4. Thread Safety: Hibernation State Updates

The keepalive_hws() method updates last_ping while gc_in_flight_requests() reads it. While scc::HashMap provides concurrent access, both methods use mutable access patterns:

Location: shared_state.rs:266-280 and shared_state.rs:515-516

The code appears correct (both use async locks via get_async), but it's worth verifying that the interleaving of:

  1. GC reading last_ping.elapsed()
  2. Keepalive updating last_ping

cannot cause a request to be incorrectly GC'd. The timing window is small but exists between the GC's check at line 515 and when it marks the request as stopping at line 537.

Recommendation: Add a comment explaining the locking guarantees, or consider if stopping flag prevents this issue already.

Performance Considerations

Positive:

  • Parallel operations with tokio::try_join! in keepalive task (line 49-57) is good for reducing latency
  • GC interval of 15s is reasonable and uses MissedTickBehavior::Skip to prevent pileup
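
For reference, that tick-skipping behavior looks roughly like this in tokio (the 15-second period is taken from the description above):

use std::time::Duration;
use tokio::time::{interval, MissedTickBehavior};

async fn gc_loop() {
    let mut ticker = interval(Duration::from_secs(15));
    // If one GC pass runs long, skip the missed ticks instead of firing them
    // back-to-back, so slow passes don't pile up.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);
    loop {
        ticker.tick().await;
        // ... run one GC pass over in-flight requests ...
    }
}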

Consideration:

  • The new HibernationTimeout GC reason adds another check to the GC loop. For systems with many hibernating connections, this could add overhead. Current implementation looks efficient, but worth monitoring.

Security Concerns

Low Risk Identified:
The PR adds proper validation to prevent creating new in-flight entries after hibernation (line 311-314), which is a good defensive programming practice. No obvious security vulnerabilities introduced.

Test Coverage ⚠️

Missing: The PR doesn't include test changes. Given the complexity of the race conditions being fixed, consider adding:

  1. Test for hibernation timeout GC behavior
  2. Test for the race condition when actor becomes ready during hibernation setup
  3. Test that keepalive_hws() properly prevents GC during hibernation

Recommendation: Add integration tests to verify the fixes work as intended, especially for the race condition scenarios.

Architecture & Design

Good:

  • Passing SharedState to keepalive_task is cleaner than individual parameters
  • Threading config through to SharedState::new() is proper dependency management
  • The two-phase GC commit pattern is well-documented and appropriate

Configuration Threading:
The PR properly threads the config through:

  • guard/lib.rs → SharedState::new()
  • SharedState stores hibernation_timeout from config

This follows the established patterns in the codebase.

Summary & Recommendations

Must Fix Before Merge:

  1. ❌ Remove debug logging at shared_state.rs:383

Should Address:
2. ⚠️ Add tests for hibernation timeout and race condition scenarios
3. ⚠️ Consider logging when try_into() fallback is triggered (line 490)
4. ⚠️ Add comment explaining actor state check logic (line 651)

Nice to Have:
5. 💡 Document the locking guarantees for keepalive_hws() vs GC
6. 💡 Verify what happens when actor is None in the race condition check

Conclusion

The core logic of this PR is sound and addresses real issues with hibernating WebSocket GC and race conditions. The implementation follows good practices and is consistent with the codebase. The main concerns are:

  • Debug logging that should be removed
  • Missing test coverage for complex race conditions
  • Minor documentation gaps

With the debug logging removed and ideally some test coverage added, this PR should be good to merge.

@MasterPtato MasterPtato force-pushed the 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state branch from 4f58c50 to 8b84eb1 November 21, 2025 00:48
@MasterPtato MasterPtato force-pushed the 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating branch from b6dfb0d to 31af31a November 21, 2025 00:48
@claude

claude bot commented Nov 21, 2025

Code Review

I've reviewed this PR and identified several issues that should be addressed before merging.

Summary

This PR fixes two critical race conditions in the pegboard gateway's WebSocket hibernation handling:

  1. Prevents GC from prematurely removing hibernating in-flight requests by adding keepalive mechanism
  2. Fixes race condition where actor could become ready between subscription and state check

🔴 Critical Issues

1. Debug Logging Left in Production Code

shared_state.rs:188 and shared_state.rs:383

tracing::info!(?message_id, ?payload, "shared state send message");
// ...
tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

These appear to be debug logs that need to be removed/fixed before merging:

  • Line 383 has a non-descriptive message ("------2---------")
  • Logging full payloads at info level could impact performance
  • May expose sensitive data in production logs
  • Violates CLAUDE.md logging conventions (should be lowercase, descriptive)

Recommendation: Remove line 383 entirely. For line 188, either remove or change to tracing::debug! with a proper message.


🟡 Issues to Address

2. Silent Error Handling in Type Conversion

shared_state.rs:490

Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000))

The unwrap_or(90_000) silently falls back to 90 seconds if conversion fails. This could mask configuration errors.

Recommendations:

  • Extract to a named constant: const DEFAULT_HIBERNATION_TIMEOUT_MS: u64 = 90_000;
  • Use .expect("hibernation_timeout must fit in target type") instead to fail fast on invalid config
  • Or validate at config initialization time

3. Inconsistent Error Handling Pattern

shared_state.rs:273-277

if let Some(hs) = &mut req.hibernation_state {
    hs.last_ping = Instant::now();
} else {
    tracing::warn!("should not call keepalive_hws for non-hibernating ws");
}

This logs a warning but returns Ok(()). For consistency with the codebase's error handling patterns, consider:

let Some(hs) = &mut req.hibernation_state else {
    bail!("cannot keepalive non-hibernating websocket");
};
hs.last_ping = Instant::now();

This would match the pattern used elsewhere (e.g., lines 414-416).

4. Minor Typo in Log Message

lib.rs:597

tracing::debug!("exiting hibernating due to pending messages");

Should be "exiting hibernation" (not "hibernating").


✅ Positive Aspects

  1. Good race condition fix (lib.rs:646-659): Subscribing to events before checking actor state correctly prevents missing the Ready event. This is the right approach.

  2. Clean keepalive design (keepalive_task.rs:49-57): Using tokio::try_join! to atomically update both UDB and in-memory state ensures consistency.

  3. Proper hibernation GC logic (shared_state.rs:515-516): The new HibernationTimeout GC reason correctly prevents indefinite hibernation when keepalive fails.

  4. Better type safety (InFlightRequestHandle.new): The new new field cleanly distinguishes between initial connections and reconnections after hibernation.

  5. Improved logging: Consistent use of % formatting for display types (request_id=%tunnel_id::request_id_to_string(&request_id)) improves readability.

  6. Good validation (lib.rs:311-315): The ensure! check properly validates that we don't create new in-flight entries after hibernation.


📋 Recommendations

Before merge:

  • Fix/remove debug logging (lines 188, 383 in shared_state.rs)
  • Fix typo: "exiting hibernating" → "exiting hibernation" (lib.rs:597)
  • Consider using .expect() instead of .unwrap_or() for hibernation timeout conversion
  • Consider making keepalive_hws return an error for non-hibernating websockets

Nice to have:

  • Add integration tests for the race condition scenarios
  • Document the race condition fix with an inline comment
  • Extract magic number 90_000 to a named constant

🧪 Testing Suggestions

Consider adding tests for:

  1. Race condition where actor becomes ready between subscription and state check
  2. GC behavior when last_ping is updated via keepalive_hws
  3. The ensure! assertion when attempting to create a new entry after hibernation
  4. Keepalive task failure scenarios

Summary

The core fixes are architecturally sound and address real race conditions in the hibernation logic. The main blocker is the debug logging that shouldn't be in production code. Once the critical issues are addressed, this is good to merge.

Overall: Approve with changes ⭐⭐⭐⭐ (4/5)

@claude

claude bot commented Nov 21, 2025

PR Review: Fix Gateway Hibernating Request Handling

Summary

This PR fixes critical issues with hibernating WebSocket request handling in the pegboard gateway, specifically preventing garbage collection from removing hibernating in-flight requests and adding proper actor state checks during hibernation. The changes are well-structured and address real race conditions.


Code Quality & Best Practices

✅ Strengths

  1. Good separation of concerns: The keepalive_hws function cleanly separates the keepalive logic for hibernating websockets
  2. Proper error handling: Uses ensure! macro appropriately for validation (line 312-315 in lib.rs)
  3. Consistent logging: Changed from debug format ? to display format % for request IDs throughout, which is more appropriate for UUIDs
  4. Configuration-driven timeout: Properly uses config for hibernation_timeout instead of hardcoding

⚠️ Issues to Address

  1. Debug logging left in production code (shared_state.rs:383):

    tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

    This appears to be leftover debug code with a non-descriptive message. Should either be removed or converted to a proper debug statement with meaningful context.

  2. Inconsistent return on early check: In handle_websocket_hibernation (lib.rs:646-659), after checking if the actor is already ready, the function checks again in the tokio::select. This is correct for race condition prevention, but the early return at line 657 should probably include cleanup similar to the select branch.

  3. Missing field in destructuring (lib.rs:169):

    let InFlightRequestHandle {
        mut msg_rx,
        mut drop_rx,
        ..  // <- using .. pattern
    }

    While not incorrect, explicitly naming the new field and ignoring it would be clearer: new: _.


Potential Bugs

❌ Critical Issue

Race condition in actor state check (lib.rs:646-659):

The code fetches actor info after subscribing to prevent race conditions, but there's still a TOCTOU (Time-of-Check-Time-of-Use) issue:

// Fetch actor info after sub to prevent race condition
if let Some(actor) = self.ctx.op(...).await? {
    if actor.runner_id.is_some() {
        tracing::debug!("actor became ready during hibernation");
        return Ok(HibernationResult::Continue);
    }
}

Problem: Between the subscription (line 641-644) and the actor info fetch (line 647-652), the actor could become ready. The subscription would miss this event, and the fetch would also show the old state.

Recommendation: Consider either:

  1. Fetching actor state first, then subscribing, then checking again
  2. Or ensure the subscription includes historical events if the actor is already ready

⚠️ Minor Issues

  1. Keepalive task spawn timing (lib.rs:423-435): The keepalive task is now only spawned if can_hibernate is true. The previous version spawned it always but it would check internally. Make sure this doesn't introduce issues if hibernation state changes mid-request.

  2. Error message could be more specific (shared_state.rs:276):

    tracing::warn!("should not call keepalive_hws for non-hibernating ws");

    This should probably be a debug assertion or return an error, as it indicates programmer error rather than a runtime condition.


Performance Considerations

✅ Good

  1. Parallel operations: Using tokio::try_join! in keepalive_task.rs:49-57 is excellent - it runs both the database upsert and the in-memory keepalive in parallel
  2. GC optimization: The new HibernationTimeout GC reason prevents premature cleanup of hibernating requests, reducing unnecessary reconnection overhead

💡 Suggestions

  1. GC timeout fallback (shared_state.rs:490):

    Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000));

    The fallback value (90 seconds) should probably be a named constant and documented as to why this specific value was chosen.

  2. Pending message iteration (shared_state.rs:382-388): When resending pending messages, consider whether you need to batch these publishes for very large message counts to avoid blocking the task.


Security Concerns

✅ No major security issues identified

  1. The hibernation timeout mechanism properly prevents resource exhaustion from abandoned connections
  2. Message acknowledgment prevents unbounded memory growth
  3. No obvious injection or validation issues

💡 Minor Suggestion

Consider adding metrics/monitoring for:

  • Number of requests GC'd due to HibernationTimeout
  • Distribution of hibernation durations
  • Frequency of the race condition being hit (line 312-315)

This would help detect potential DoS scenarios where actors are repeatedly hibernating/waking.


Test Coverage

⚠️ Recommendations

The PR doesn't include test changes. Consider adding tests for:

  1. Race condition test: Verify that if an actor becomes ready between subscription and state check, the handler correctly continues
  2. GC timeout test: Verify that hibernating requests are not GC'd while keepalive is running
  3. Keepalive failure test: Verify behavior when keepalive_hws fails during hibernation
  4. New in-flight entry check: Test the ensure! at line 312-315 triggers correctly when reconnecting after hibernation with a new entry

Action Items

Must Fix

  • Remove or fix the debug log at shared_state.rs:383
  • Review and address the TOCTOU race condition in handle_websocket_hibernation

Should Fix

  • Make GC timeout fallback a named constant
  • Change keepalive_hws warning to return an error
  • Explicitly name the new field when destructuring InFlightRequestHandle

Nice to Have

  • Add test coverage for new race condition handling
  • Add metrics for hibernation timeout GC events
  • Document why can_hibernate determines keepalive task spawning

Overall Assessment

Verdict: Approve with minor changes required

This PR addresses real issues with hibernating WebSocket handling and shows good understanding of the codebase patterns. The main concern is the potential TOCTOU race condition that should be addressed before merge. The debug logging should definitely be cleaned up. Otherwise, the changes follow the codebase conventions and improve system reliability.

The core idea of adding a keepalive mechanism during hibernation and checking actor state after subscription is sound and will prevent the issues described in the PR title.

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating branch from 31af31a to b6dfb0d November 21, 2025 01:27
@claude

claude bot commented Nov 21, 2025

Pull Request Review

Summary

This PR fixes two important race conditions in the pegboard gateway's hibernation logic:

  1. Prevents garbage collection from removing hibernating in-flight requests
  2. Ensures actor readiness is checked after subscription to prevent race conditions

Code Quality ✅

Strengths:

  • Well-structured fix addressing clear race conditions
  • Good use of structured logging improvements (? to % for request IDs)
  • Proper error handling with ensure! macro
  • Clean separation of concerns

Areas for Improvement:

1. Debug Log Cleanup (shared_state.rs:383)

tracing::info!(?pending_msg.payload, ?pending_msg.message_index, "------2---------");

This appears to be leftover debug logging with a placeholder message. Should either be removed or converted to proper debug logging with a meaningful message.

2. Type Conversion Safety (shared_state.rs:490)

Duration::from_millis(self.hibernation_timeout.try_into().unwrap_or(90_000))

The fallback value (90,000) is a bare literal that doesn't match the i64 field type, and a failed conversion is swallowed silently. Since hibernation_timeout is i64, consider:

  • Using a proper constant like const DEFAULT_HIBERNATION_TIMEOUT_MS: i64 = 90_000
  • Adding a tracing warning when the conversion fails
  • Validating the config value at initialization time

Potential Bugs 🐛

1. Inconsistent Naming (keepalive_task.rs:11-12)

Comment says "pings writes keepalive" - should be "periodically writes keepalive". Minor typo.

2. Race Condition Check Logic (lib.rs:311-314)

ensure!(
    !after_hibernation || !new,
    "should not be creating a new in flight entry after hibernation"
);

Good defensive check! However, the error message could be more actionable. Consider logging additional context (actor_id, request_id) before the ensure! to help debug if this ever triggers.
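
Something along these lines would do it; parameter types here are placeholders for the project's actual id types:

use anyhow::{ensure, Result};

fn check_in_flight_invariant(
    actor_id: &str,
    request_id: &str,
    after_hibernation: bool,
    is_new_entry: bool,
) -> Result<()> {
    if after_hibernation && is_new_entry {
        // Emit the identifying context before ensure! bails, so the failure
        // can be correlated with a specific actor and request in the logs.
        tracing::error!(%actor_id, %request_id, "new in-flight entry created after hibernation");
    }
    ensure!(
        !after_hibernation || !is_new_entry,
        "should not be creating a new in flight entry after hibernation"
    );
    Ok(())
}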

Performance Considerations ⚡

Good Practices:

  • Proper use of tokio::try_join! for parallel operations (keepalive_task.rs:49)
  • Efficient jitter implementation to prevent thundering herd
  • Smart interval timing (hibernation_timeout / 2)

Potential Optimization:
The keepalive task now does two operations: database upsert + in-memory keepalive. The database operation is likely more expensive. Consider if the in-memory keepalive frequency could be different from the database upsert frequency if performance becomes an issue.
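
If that ever becomes necessary, one option is to keep the cheap in-memory bump on every tick and only issue the database write on every Nth tick; a sketch with placeholder functions:

use std::time::Duration;
use anyhow::Result;

async fn bump_in_memory_keepalive() -> Result<()> {
    Ok(()) // placeholder for the in-memory last_ping update
}

async fn write_db_keepalive() -> Result<()> {
    Ok(()) // placeholder for the UDB upsert
}

async fn keepalive_loop(tick: Duration, db_every_n_ticks: u32) -> Result<()> {
    let every = db_every_n_ticks.max(1);
    let mut ticker = tokio::time::interval(tick);
    let mut count = 0u32;
    loop {
        ticker.tick().await;
        // The cheap update runs on every tick...
        bump_in_memory_keepalive().await?;
        count += 1;
        if count % every == 0 {
            // ...while the heavier database write only runs on every Nth tick.
            write_db_keepalive().await?;
        }
    }
}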

Security Concerns 🔒

Positive:

  • No security vulnerabilities introduced
  • Proper timeout handling prevents resource exhaustion
  • Good validation of hibernation state transitions

Note:
The hibernation timeout fallback (90 seconds) is reasonable for preventing DoS through resource holding.

Test Coverage 🧪

Missing Test Scenarios:

  1. Race condition test: Actor becomes ready between subscription and actor info fetch
  2. GC timeout test: Verify hibernating requests are GC'd when keepalive stops
  3. New in-flight entry check: Verify the ensure! at lib.rs:311-314 works correctly

Recommended test additions:

// Test that hibernating requests survive GC when keepalive is active
// Test that hibernating requests are GC'd after hibernation_timeout elapses
// Test the race condition fix: actor ready signal arrives during subscription
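
As a self-contained illustration of the first two test ideas above, using plain tokio primitives as stand-ins for the gateway's actual types (requires tokio's test-util feature for start_paused):

use std::sync::{Arc, Mutex};
use std::time::Duration;
use tokio::time::Instant;

// Stand-in for the hibernation bookkeeping the GC inspects.
#[derive(Clone)]
struct FakeHibernation {
    last_ping: Arc<Mutex<Instant>>,
}

fn should_gc(h: &FakeHibernation, timeout: Duration) -> bool {
    h.last_ping.lock().unwrap().elapsed() > timeout
}

#[tokio::test(start_paused = true)]
async fn hibernating_request_survives_gc_while_keepalive_runs() {
    let timeout = Duration::from_millis(100);
    let h = FakeHibernation {
        last_ping: Arc::new(Mutex::new(Instant::now())),
    };

    // Keepalive stand-in: bump last_ping at half the timeout interval.
    let keepalive = {
        let h = h.clone();
        tokio::spawn(async move {
            for _ in 0..10 {
                tokio::time::sleep(timeout / 2).await;
                *h.last_ping.lock().unwrap() = Instant::now();
            }
        })
    };

    // While the keepalive is running, the request never crosses the timeout.
    tokio::time::sleep(timeout * 2).await;
    assert!(!should_gc(&h, timeout));

    // Once the keepalive stops, it eventually does.
    keepalive.await.unwrap();
    tokio::time::sleep(timeout * 2).await;
    assert!(should_gc(&h, timeout));
}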

Architecture & Design 🏗️

Strengths:

  • Clean addition of last_ping to HibernationState for GC tracking
  • Good use of the new field in InFlightRequestHandle to detect race conditions
  • Proper layering: config passed through initialization chain

Suggestions:

  1. The new field in InFlightRequestHandle is a good pattern. Consider documenting this pattern more explicitly as it's used for detecting race conditions.

  2. Consider extracting the actor readiness check to a helper method:

async fn check_actor_ready(&self) -> Result<bool> {
    if let Some(actor) = self.ctx.op(/*...*/).await? {
        Ok(actor.runner_id.is_some())
    } else {
        Ok(false)
    }
}

Logging Improvements ✨

Excellent changes:

  • Migration from ? to % for tunnel IDs (more efficient, better formatting)
  • Consistent use of structured logging throughout

Consistency Note:
Most request_id logging now uses %, which is great. Ensure all similar IDs use the same format.

Recommendations

High Priority:

  1. Remove or fix the debug log at line 383 in shared_state.rs
  2. Add proper error handling/logging for the .try_into().unwrap_or(90_000) conversion
  3. Add test coverage for the race conditions being fixed

Medium Priority:

  1. Document the new field pattern in InFlightRequestHandle
  2. Fix the typo in keepalive_task.rs comment
  3. Consider extracting the actor readiness check to a helper

Low Priority:

  1. Consider if database and in-memory keepalive frequencies should differ
  2. Add metrics for hibernation GC events (by reason)

Verdict

Approve with minor changes requested

This is a solid fix for important race conditions. The core logic is sound and well-implemented. The main issues are:

  • Debug logging cleanup needed
  • Better error handling for config conversion
  • Test coverage should be added

The changes follow the repository's conventions (conventional commits, structured logging, proper error handling) and integrate well with the existing architecture.


Generated with Claude Code

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

graphite-app bot pushed a commit that referenced this pull request Nov 21, 2025
@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-20-fix_gateway_prevent_gc_from_removing_hibernating_in_flight_req_check_actor_started_after_sub_when_hibernating branch November 21, 2025 02:43