
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project              Deployment  Preview  Comments  Updated (UTC)
rivetkit-serverless  Error       Error              Nov 20, 2025 8:52pm

3 Skipped Deployments

Project          Deployment  Preview  Comments  Updated (UTC)
rivet-cloud      Ignored     Ignored            Nov 20, 2025 8:52pm
rivet-inspector  Ignored     Ignored  Preview   Nov 20, 2025 8:52pm
rivet-site       Ignored     Ignored  Preview   Nov 20, 2025 8:52pm

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 20, 2025 07:59 Inactive
Member Author

NathanFlurry commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

PR Review: fix(pegboard-gateway): write hws immediately on socket hibernate

Summary

This PR fixes a race condition in WebSocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a scenario where an actor could be immediately rewoken without having a hibernating request entry in the database.


Code Quality ✅

Strengths:

  • Race condition fix: The core fix correctly addresses a timing issue by moving the hibernating_request::upsert operation before the has_pending_websocket_messages check (lib.rs:542-550)
  • Clear comments: Good inline documentation explaining the rationale for the change
  • Proper cleanup: The deletion logic at lib.rs:606-616 correctly handles cleanup for non-Continue results
  • First tick handling: Smart addition at lib.rs:581-583 to discard the immediate tick from the interval, preventing a duplicate upsert

Code structure follows Rivet conventions:

  • Uses structured logging with tracing macros
  • Follows error handling patterns with Result<> returns
  • Proper use of workspace operations

Logic Analysis ✅

The fix addresses this sequence:

Before (buggy):

  1. Check for pending messages
  2. If messages exist → return Continue immediately
  3. Problem: No hibernating request entry was created

After (fixed):

  1. Write hibernating request entry
  2. Check for pending messages
  3. If messages exist → return Continue (entry already exists)
  4. Otherwise → start keepalive loop

This ensures the entry exists for the cleanup logic at lib.rs:609-615, which deletes entries when returning Close or on error.
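
To make the reordering concrete, here is a minimal self-contained sketch of the fixed control flow. Every type below is a stand-in invented for illustration, not the actual rivet/pegboard API:

#[derive(Debug, PartialEq)]
enum HibernationResult {
    Continue,
    Close,
}

struct Gateway {
    pending_messages: bool, // stand-in for has_pending_websocket_messages
}

impl Gateway {
    // stand-in for the pegboard::ops::actor::hibernating_request::upsert op
    async fn upsert_hibernating_request(&self) {
        println!("hibernating request entry written");
    }

    async fn handle_websocket_hibernation(&self) -> HibernationResult {
        // 1. write the entry first so it exists even on immediate rewake
        self.upsert_hibernating_request().await;

        // 2. only then check for pending messages
        if self.pending_messages {
            return HibernationResult::Continue; // entry already persisted
        }

        // 3. the keepalive loop would start here in the real code
        HibernationResult::Close
    }
}

#[tokio::main]
async fn main() {
    let gw = Gateway { pending_messages: true };
    assert_eq!(
        gw.handle_websocket_hibernation().await,
        HibernationResult::Continue
    );
}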


Potential Issues 🔍

Minor: Unnecessary database write in early-return path

When has_pending_websocket_messages returns true and we immediately return Continue (lib.rs:563), we've written a hibernating request entry that will never be cleaned up in this code path since the cleanup only happens for Close or Err cases (lib.rs:607).

Question: Does the entry get cleaned up elsewhere when returning Continue with pending messages? If not, this could lead to stale entries in the database.

Suggestion: Consider whether the early-return path at lib.rs:563 should delete the entry it just created, or document why it's safe to leave it:

if self.shared_state.has_pending_websocket_messages(request_id).await? {
    tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
    
    // Delete the entry since we're not actually hibernating
    self.ctx.op(pegboard::ops::actor::hibernating_request::delete::Input {
        actor_id: self.actor_id,
        gateway_id: self.shared_state.gateway_id(),
        request_id,
    }).await?;
    
    return Ok(HibernationResult::Continue);
}

Performance Considerations ⚡

Database write overhead: The fix adds an unconditional database write at the start of hibernation. This is the correct tradeoff for correctness, but worth noting:

  • Every hibernation attempt now performs at least one upsert operation
  • For actors that frequently have pending messages, this adds overhead for entries that are immediately discarded

Mitigation already in place: The keepalive loop has jitter (lib.rs:589-590) to prevent stampeding herds, which is good.
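
For readers unfamiliar with the pattern, a generic interval-plus-jitter loop looks roughly like this (plain tokio and rand, not the gateway's actual code; the 30s period and 1s jitter bound are made-up values):

use rand::Rng;
use std::time::Duration;

#[tokio::main]
async fn main() {
    let mut interval = tokio::time::interval(Duration::from_secs(30));
    interval.tick().await; // discard the immediate first tick
    for _ in 0..3 {
        interval.tick().await;
        // sleep up to 1s so many gateways don't hit the database at once
        let jitter: u64 = rand::thread_rng().gen_range(0..1000);
        tokio::time::sleep(Duration::from_millis(jitter)).await;
        // the hibernating request upsert would go here
    }
}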


Security Considerations 🔒

No security concerns identified. The change maintains existing access patterns and doesn't introduce new attack surfaces.


Test Coverage 📋

Observation: No test files were modified in this PR.

Recommendation: Consider adding integration tests for:

  1. Race condition scenario: Test that hibernating request entries exist when has_pending_websocket_messages returns true during hibernation
  2. Cleanup verification: Verify entries are properly deleted on Close and error paths
  3. Early-return path: Test the cleanup behavior when immediately returning Continue

Example test scenarios:

  • Actor hibernates with pending messages → verify entry exists briefly then gets cleaned up
  • Actor hibernates without pending messages → verify entry exists during hibernation and is deleted on close
  • Actor hibernates and receives ready signal → verify entry is deleted

Questions for Author

  1. Is the hibernating request entry intentionally left in the database when returning Continue due to pending messages? If so, where is it cleaned up?
  2. Have you manually tested the race condition scenario that this fixes?
  3. Should we add telemetry/metrics around hibernation entry lifecycle to catch similar issues in the future?

Verdict: ✅ Approve with Minor Suggestion

The core fix is correct and addresses a real race condition. My only concern is potential cleanup of the entry in the early-return path, which needs clarification from the author. Otherwise, the implementation is solid and follows project conventions well.

@NathanFlurry NathanFlurry marked this pull request as ready for review November 20, 2025 18:24
@MasterPtato MasterPtato force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from 34ee3a4 to e1152dc on November 20, 2025 19:38
@MasterPtato MasterPtato force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from b960df1 to 8145fc3 on November 20, 2025 19:38

@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a race condition in websocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents a scenario where an actor could be rewoken immediately but the hibernating request entry was never created.

Positive Aspects

Race Condition Fix: The primary fix correctly addresses a race condition where:

  • Previously: Check for pending messages → write hibernating request entry
  • Now: Write hibernating request entry → check for pending messages

This ensures the entry exists in the database even if the actor is immediately rewoken, which is crucial for proper lifecycle management.

Interval Tick Fix: Discarding the first tick of ping_interval is correct since tokio::time::interval fires immediately on creation. Since the hibernating request was just upserted above, the immediate tick would be redundant.
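
The immediate first tick is easy to verify in isolation; a minimal sketch using plain tokio (the 5s period is arbitrary):

use std::time::Duration;
use tokio::time::{interval, Instant};

#[tokio::main]
async fn main() {
    let mut ping = interval(Duration::from_secs(5));
    let start = Instant::now();
    ping.tick().await; // completes immediately on creation
    assert!(start.elapsed() < Duration::from_millis(100));
    ping.tick().await; // this one waits the full period
    assert!(start.elapsed() >= Duration::from_secs(5));
}

An alternative worth knowing about is tokio::time::interval_at(Instant::now() + period, period), which delays the first tick by a full period and would make the manual discard unnecessary.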

Code Documentation: Good inline comments explaining the reasoning behind both changes.

Code Quality

Strong Points:

  • Follows existing error handling patterns with the ? operator
  • Maintains tracing/logging practices (structured logging at line 558-561)
  • Preserves existing transaction safety

Potential Considerations

  1. Error Handling Path: If the initial upsert operation at lines 544-550 fails, the function will return early with an error. This is correct behavior, but it means the hibernation will fail entirely. Consider whether this aligns with your error handling strategy - should hibernation failure be fatal, or should it fall back to keeping the actor awake? (A fallback sketch follows this list.)

  2. Database Write Before State Check: The new order means we're doing a database write operation before checking in-memory state (has_pending_websocket_messages). This is necessary for correctness but does mean an extra database write in the rewake path. This is likely acceptable given:

    • The race condition fix is critical
    • The rewake scenario should be relatively uncommon
    • Database writes are already part of the keepalive loop anyway
  3. Cleanup on Early Exit: When HibernationResult::Continue is returned at line 563 (immediate rewake), the hibernating request entry remains in the database. Looking at the code, this appears intentional since the entry will be cleaned up later when the request truly completes (lines 607-616). However, verify this doesn't cause accumulation of stale entries if rewakes are frequent.

  4. Test Coverage: No test files found for pegboard-gateway. Consider adding integration tests for:

    • Hibernation with immediate rewake scenario
    • Hibernation keepalive interval behavior
    • Cleanup of hibernating request entries
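
On point 1 above, a hedged sketch of the non-fatal alternative; the op below is a placeholder that simulates a failure, and the PR as written simply propagates the error instead:

async fn upsert_hibernating_request() -> Result<(), String> {
    Err("simulated db failure".into()) // placeholder for the real op
}

#[tokio::main]
async fn main() {
    match upsert_hibernating_request().await {
        Ok(()) => { /* proceed with hibernation */ }
        Err(err) => {
            // fall back: keep the actor awake rather than failing hibernation
            eprintln!("hibernation upsert failed, keeping actor awake: {err}");
        }
    }
}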

Minor Observations

  • Line 562: Empty line added after the tracing debug statement - maintains consistent spacing ✓
  • The unique_request_id to request_id assignment at line 540 appears redundant but is preserved for consistency

Performance & Security

  • Performance: Minimal impact. One additional database upsert in the hibernation path, which is acceptable for correctness.
  • Security: No security concerns identified. The change maintains proper isolation and doesn't introduce any new attack vectors.

Recommendation

LGTM with minor suggestion

The fix is correct and addresses a real race condition. The implementation is clean and well-commented.

Optional Enhancement: Consider adding a metric or trace to track how often immediate rewakes occur (after the upsert but before hibernation), which would help validate this fix in production and monitor the frequency of this edge case.


Conventional Commit Format: ✅ Correctly follows the pattern: fix(pegboard-gateway): write hws immediately on socket hibernate

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 8145fc3 to 193f67a on November 20, 2025 20:51
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b on November 20, 2025 20:51
@claude

claude bot commented Nov 20, 2025

Code Review

This PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. Overall, the changes look good and address an important timing issue. Here are my findings:

✅ Positive Observations

  1. Race Condition Fix: The core fix is correct - moving the hibernating_request::upsert call to occur before the has_pending_websocket_messages check ensures the database entry exists even if the actor is immediately rewoken. This prevents a window where the system might try to look up a hibernating request that hasn't been written yet.

  2. Interval Tick Fix: The addition of the initial ping_interval.tick().await (lines 581-583) is correct. Since tokio::time::interval fires immediately on the first tick, and the hibernating request was already upserted at line 545, this prevents a redundant immediate write in the keepalive task.

  3. Code Quality:

    • Clear, explanatory comments
    • Follows Rust conventions
    • Consistent with the existing codebase style
    • Uses structured logging correctly (line 558-560)
  4. Error Handling: Proper error propagation with ? operator throughout

🔍 Considerations & Questions

  1. Cleanup on Early Return: When the function returns early at line 563 (HibernationResult::Continue for pending messages), the hibernating_request entry is left in the database. Is this intentional?

    • The cleanup logic at lines 605-617 only runs if we reach the end of the function
    • If this is intended behavior (perhaps the entry will be cleaned up elsewhere), it might be worth a comment explaining why
  2. Testing:

    • Are there integration tests that cover this race condition scenario?
    • The test file at engine/packages/guard-core/tests/custom_serve.rs exists but may not cover this specific edge case
    • Consider adding a test that simulates: hibernate → immediate pending message → rewake
  3. Keepalive Task Error Handling: The keepalive task (lines 571-599) can fail with ? but the handle is simply aborted at line 603. Failed keepalive tasks might leave stale entries. Consider:

    • Logging when the keepalive task fails
    • Whether task failures should trigger cleanup
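
On the logging point, a minimal sketch of inspecting the task's result instead of only aborting it; the error value is simulated, and JoinError::is_cancelled distinguishes a deliberate abort from a panic:

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async {
        Err::<(), &str>("simulated keepalive failure")
    });

    match handle.await {
        Ok(Ok(())) => {}
        Ok(Err(err)) => eprintln!("keepalive task failed: {err}"),
        Err(join_err) if join_err.is_cancelled() => {} // expected after abort()
        Err(join_err) => eprintln!("keepalive task panicked: {join_err}"),
    }
}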

🎯 Suggested Improvements (Optional)

// Consider adding cleanup on early return:
if self.shared_state.has_pending_websocket_messages(request_id).await? {
    tracing::debug!(
        ?unique_request_id,
        "detected pending requests on websocket hibernation, rewaking actor"
    );
    
    // Clean up the entry since we're not actually hibernating
    self.ctx
        .op(pegboard::ops::actor::hibernating_request::delete::Input {
            actor_id: self.actor_id,
            gateway_id: self.shared_state.gateway_id(),
            request_id,
        })
        .await?;
    
    return Ok(HibernationResult::Continue);
}

📋 Summary

Approval Status: ✅ Approve with minor suggestions

The PR correctly fixes the race condition and improves the hibernation flow. The main question is whether the early-return path (line 563) should clean up the database entry. If that's intended behavior, this is ready to merge as-is. If not, consider the suggested cleanup addition.

Security: ✅ No security concerns
Performance: ✅ Minimal impact, actually reduces redundant DB writes
Style: ✅ Follows repository conventions per CLAUDE.md

@claude

claude bot commented Nov 20, 2025

PR Review: WebSocket Hibernation Race Condition Fix

Summary

This PR fixes a race condition in the WebSocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages that might immediately wake the actor.

Code Quality ✅

Positive aspects:

  • Excellent fix for a subtle race condition: The reordering prevents a scenario where an actor might wake before its hibernating request is recorded, which could cause the system to lose track of the hibernating connection.
  • Clear, descriptive comments: The inline comment // Insert hibernating request entry before checking for pending messages clearly explains the rationale.
  • Proper handling of interval timing: Discarding the first tick from ping_interval is correct since tokio::time::interval fires immediately on creation, and the upsert was already called above.

Code organization:

  • The flow is now logically correct:
    1. Write hibernating request entry (lines 544-550)
    2. Check for pending messages (lines 552-564)
    3. If no pending messages, start keepalive task (lines 566-599)
  • Follows structured logging conventions from CLAUDE.md ✅

Potential Issues & Considerations

1. Error Handling Gap 🔍

The initial upsert operation (line 544-550) can fail with await?, but if it does, the function returns early without any cleanup. This is likely correct behavior, but worth noting:

  • If the upsert fails, the websocket will close (due to the error propagation)
  • No stale state is left behind since the write never succeeded

Verdict: Current behavior appears correct, but consider adding a debug/info log before the operation for observability:

tracing::debug!(?request_id, "writing initial hibernating request entry");
self.ctx.op(/* ... */).await?;

2. Duplicate Write Pattern 📝

The same upsert operation is called in three places now:

  • Initial write (line 544-550) - NEW
  • Keepalive loop (line 592-597)
  • In the main loop (assumed, not visible in diff)

Consideration: This repetition is acceptable given the different contexts, but ensure all three locations use identical parameters. The code does this correctly.

3. Race Condition with has_pending_websocket_messages ⚠️

The check at line 553-556 happens after the upsert, which is correct. However, there's a potential TOCTOU (time-of-check-time-of-use) issue:

  • Between checking for pending messages and returning Continue, new messages could arrive
  • If this happens, the websocket won't hibernate but the hibernating request entry exists in the DB

Impact Analysis:

  • The entry will be cleaned up by the keepalive task abort + cleanup logic (lines 603-617)
  • Actually, looking at lines 605-607: if result is Ok(HibernationResult::Continue), the entry is NOT deleted
  • This could leave orphaned entries in the database when immediate wake occurs
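
The TOCTOU shape described above, reduced to its essentials (a generic illustration with an atomic flag, not the gateway code):

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

#[tokio::main]
async fn main() {
    let pending = Arc::new(AtomicBool::new(false));
    let p = pending.clone();

    let checker = tokio::spawn(async move {
        // time of check: no pending messages yet
        if !p.load(Ordering::SeqCst) {
            tokio::task::yield_now().await; // window where a message can land
            // time of use: hibernation proceeds on a stale answer
        }
    });

    pending.store(true, Ordering::SeqCst); // message arrives in the window
    checker.await.unwrap();
}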

Recommendation: Consider deleting the hibernating request entry on immediate wake:

if self.shared_state.has_pending_websocket_messages(request_id).await? {
    tracing::debug!(
        ?unique_request_id,
        "detected pending requests on websocket hibernation, rewaking actor"
    );
    
    // Clean up hibernating request entry since we're not actually hibernating
    self.ctx
        .op(pegboard::ops::actor::hibernating_request::delete::Input {
            actor_id: self.actor_id,
            gateway_id: self.shared_state.gateway_id(),
            request_id,
        })
        .await?;
    
    return Ok(HibernationResult::Continue);
}

Performance Considerations

Database Operations

  • Initial upsert: Adds one additional DB operation on every hibernation attempt
  • Impact: Minimal - this only happens when a WebSocket goes idle, not on every message
  • Trade-off: Correctness > performance in this case ✅

Interval Tick Handling

  • Skipping the first tick (line 583) is efficient and correct
  • Prevents an unnecessary immediate ping after already writing the entry

Security Concerns

No security issues identified. The change is purely about timing and state management.

Test Coverage 🚨

Current state: No unit tests found for pegboard-gateway package.

Recommendations:

  1. Add integration test for the race condition scenario:
    • Mock a scenario where messages arrive between hibernation check and actual sleep
    • Verify the hibernating request entry is properly managed
  2. Add test for the keepalive mechanism:
    • Verify pings occur at correct intervals
    • Verify first tick is properly skipped
  3. Add test for cleanup on early wake:
    • Verify no orphaned DB entries when HibernationResult::Continue is returned

Example test case (pseudocode):

#[tokio::test]
async fn test_hibernation_immediate_wake_cleanup() {
    // Setup: Create gateway with pending messages
    // Act: Trigger hibernation
    // Assert: 
    //   1. Hibernating request entry is created
    //   2. Immediate wake occurs (Continue returned)
    //   3. Hibernating request entry is cleaned up
}

Adherence to CLAUDE.md ✅

  • Error handling uses Result and ? operator ✅
  • Logging uses structured format with ?request_id ✅
  • Comments are lowercase ✅
  • Follows existing code patterns ✅

Summary & Recommendations

Approve with minor suggestions:

  1. ⚠️ Important: Add cleanup of hibernating request entry on immediate wake (lines 558-564) to prevent potential orphaned DB records
  2. 💡 Optional: Add debug logging before the initial upsert for better observability
  3. 🧪 Strongly recommended: Add test coverage for this critical path, especially the race condition scenario

Overall assessment: This is a well-thought-out fix that addresses a real race condition. The main concern is ensuring proper cleanup when immediate wake occurs. With the suggested cleanup addition, this will be a solid, production-ready change.

Severity of issues:

  • Orphaned DB entries on immediate wake: Medium (functional correctness issue, but likely cleaned up eventually by other mechanisms)
  • Missing test coverage: Medium (this is a critical path that should be tested)

Great work identifying and fixing this subtle race condition! 🎯

@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a race condition in WebSocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents data loss in edge cases.

Positive Aspects

Race Condition Fix: The main fix correctly addresses a timing issue where:

  • Previously, if pending messages existed, the function would return early without ever writing the hibernation entry
  • Now the entry is always written first, ensuring it exists even if we immediately rewake

Proper Interval Handling: The addition of ping_interval.tick().await to discard the first immediate tick is correct, since intervals fire immediately on creation and the entry was just written above.

Good Code Comments: The inline comments clearly explain the "why" behind both changes, making the intent obvious to future maintainers.

Code Quality

Follows Repository Standards:

  • ✅ Structured logging with tracing (line 558-560)
  • ✅ Proper error handling with Result types
  • ✅ Uses workspace operations pattern
  • ✅ Follows existing code style

Potential Concerns & Questions

1. Error Handling After First Upsert (Minor)

Lines 544-550: If the initial upsert fails, the function returns an error. However, there's no corresponding delete operation since we haven't entered the hibernation state yet. This seems correct, but worth confirming:

  • Is it safe to have orphaned hibernation entries if the upsert succeeds but we fail before reaching the cleanup logic?
  • Should there be a timeout/TTL on these entries to handle edge cases?
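
If a TTL were desired, a hypothetical staleness check over last_ping_ts could look like this; all names and the 60s threshold are assumptions, and nothing in the PR confirms such a sweep exists:

use std::time::{Duration, SystemTime, UNIX_EPOCH};

struct HibernatingRequest {
    last_ping_ts: u64, // hypothetical: milliseconds since the unix epoch
}

// an entry is stale if its keepalive ping is older than the ttl
fn is_stale(entry: &HibernatingRequest, ttl: Duration) -> bool {
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before unix epoch")
        .as_millis() as u64;
    now_ms.saturating_sub(entry.last_ping_ts) > ttl.as_millis() as u64
}

fn main() {
    let entry = HibernatingRequest { last_ping_ts: 0 };
    assert!(is_stale(&entry, Duration::from_secs(60)));
}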

2. Test Coverage (Important)

I don't see test changes in this PR. Given that this fixes a race condition, consider adding a test case that:

  • Triggers hibernation with pending messages
  • Verifies the hibernating request entry exists in the database
  • Confirms immediate rewake behavior

Example test pattern:

#[tokio::test]
async fn test_hibernation_with_pending_messages() {
    // Setup: Create actor with pending message
    // Trigger: Send hibernate command
    // Assert: Verify hibernating_request entry was created
    // Assert: Verify immediate rewake (HibernationResult::Continue)
}

3. Duplicate Work Consideration (Performance - Minor)

The upsert operation is now called twice in quick succession when there are no pending messages:

  1. Line 544-550: Initial upsert
  2. Line 592-597: First keepalive ping (after interval tick)

While this ensures correctness, it means we're writing the same timestamp twice within a short period. This is likely acceptable given the importance of correctness over this minor performance cost, but worth documenting.

Security & Performance

🔒 Security: No security concerns identified.

Performance:

  • Minor: One additional database write on hibernation entry
  • This is a worthwhile trade-off for correctness

Recommendations

  1. Add test coverage for the race condition scenario
  2. Consider documenting the hibernating request cleanup/TTL strategy if not already documented
  3. ✅ No changes needed to the fix itself - it looks correct

Overall Assessment

LGTM with suggestions - The core fix is sound and addresses a real race condition. The main recommendation is to add test coverage to prevent regression.

Great catch on this race condition! The fix is clean and well-documented.


🤖 Generated with Claude Code

@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a race condition in the WebSocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents data inconsistency when actors are immediately rewoken.


Analysis

What the code does:
The hibernating request system tracks which actors are currently hibernating (suspended) while waiting for messages. When a CommandStartActor is sent to a runner, it queries these hibernating requests to know which connections to restore. The periodic "keepalive" upserts ensure the entries don't become stale.

The bug this fixes:
Previously, the code checked for pending messages before writing the hibernating request entry. If pending messages existed, the actor would immediately rewake (return HibernationResult::Continue), but the hibernating request entry was never written. This could cause:

  1. The actor to be restarted without knowledge of hibernating connections
  2. Lost WebSocket connections when the actor is restarted
  3. Inconsistent state between the gateway and runner

The fix:
By moving the initial hibernating_request::upsert call before the pending message check, we ensure:

  1. The entry always exists in the database when hibernation is attempted
  2. Even if we immediately rewake, the entry was written and can be cleaned up properly
  3. The keepalive task's first tick is now discarded (since it fires immediately) to avoid a redundant database write

Positive Observations

  1. Correct fix for race condition: Moving the upsert before the check ensures atomicity of the hibernation process
  2. Proper cleanup logic preserved: The delete operation in lines 607-616 still handles cleanup for Close/Error cases
  3. Performance optimization: Discarding the first tick of the interval (lines 581-583) prevents an unnecessary immediate database write since we just wrote on line 544-550
  4. Good code comments: The inline comments clearly explain the rationale
  5. Follows existing patterns: Uses the same operation structure as the keepalive loop

Considerations & Questions

1. Cleanup on immediate rewake

When HibernationResult::Continue is returned at line 563, the cleanup logic at lines 606-616 doesn't execute, leaving the hibernating request entry in the database.

Question: Is this intentional? If the actor immediately rewakes and continues processing, should we delete the hibernating request entry since the actor never truly hibernated?

Recommendation: Consider whether this case needs explicit cleanup:

if self
    .shared_state
    .has_pending_websocket_messages(request_id)
    .await?
{
    tracing::debug!(
        ?unique_request_id,
        "detected pending requests on websocket hibernation, rewaking actor"
    );
    
    // Delete the entry we just created since we're not actually hibernating
    self.ctx
        .op(pegboard::ops::actor::hibernating_request::delete::Input {
            actor_id: self.actor_id,
            gateway_id: self.shared_state.gateway_id(),
            request_id,
        })
        .await?;

    return Ok(HibernationResult::Continue);
}

2. Error handling gap

If the initial upsert fails (line 544-550), the function returns early with an error. However, there's no entry to clean up in this case, which is fine. But if the has_pending_websocket_messages check fails after the upsert succeeds, we'd have an orphaned entry.

Question: Should there be error handling that cleans up the entry if subsequent operations fail?

3. Test coverage

There are no tests for this hibernation logic, which makes regression testing difficult for this critical path.

Recommendation: Consider adding integration tests for:

  • Normal hibernation flow
  • Immediate rewake due to pending messages
  • Cleanup on hibernation close/error
  • Keepalive task behavior

4. Database transaction considerations

The upsert operation uses a database transaction (udb), but the subsequent has_pending_websocket_messages check happens outside this transaction.

Question: Could there be a TOCTOU (time-of-check-time-of-use) issue where messages arrive between the upsert and the pending check?


Code Quality

Strengths:

  • Clear, well-commented code
  • Follows Rivet's error handling patterns
  • Proper use of structured logging with tracing
  • Consistent naming conventions

Minor style notes:

  • Line 562: Extra blank line added (intentional formatting change?)
  • Code follows the hard-tab indentation from rustfmt.toml

Security & Performance

Security: No security concerns identified. The operations use proper authorization through the operation context.

Performance:

  • ✅ The fix adds one database write that would have happened anyway in the keepalive loop
  • ✅ Skipping the first tick prevents a redundant write
  • ✅ Overall performance impact is negligible

Verdict

✅ Approve with suggestions

This is a solid fix for a race condition. The core logic is correct and the implementation is clean. The main question is whether immediate rewake scenarios need explicit cleanup of the hibernating request entry. I'd recommend:

  1. Clarifying the intended behavior for immediate rewake (keep or delete entry?)
  2. Adding test coverage for hibernation scenarios
  3. Considering error handling for the gap between upsert and pending-message check

The PR is safe to merge as-is if the orphaned entry on immediate rewake is acceptable behavior (perhaps it's cleaned up elsewhere or times out naturally).

@NathanFlurry NathanFlurry force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 8145fc3 to 193f67a on November 20, 2025 22:32
@claude

claude bot commented Nov 20, 2025

PR Review: fix(pegboard-gateway): write hws immediately on socket hibernate

Summary

This PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. Additionally, it discards the first tick of the ping interval to avoid an immediate duplicate write.


Code Quality ✅

Positive aspects:

  • Clear, descriptive comments explaining the "why" behind each change
  • Follows existing code patterns and style conventions
  • Proper error handling maintained with ? operator
  • Structured logging patterns match CLAUDE.md guidelines (uses ?unique_request_id for structured logging)

Observations:

  • The variable assignment let request_id = unique_request_id; at line 540 appears redundant - both variables are used interchangeably throughout the function
  • Consider using request_id consistently instead of switching between request_id and unique_request_id (e.g., line 559 uses ?unique_request_id in logging while the rest uses request_id)

Bug Fixes & Race Conditions ✅

Primary fix - Race condition:
The core issue being addressed is excellent:

  1. Before: Check pending messages → insert DB entry
  2. After: Insert DB entry → check pending messages

This ensures that if the actor is immediately rewoken (due to pending messages), the hibernating request entry already exists in the database. This prevents a window where the system might try to rewake an actor whose hibernation state hasn't been persisted yet.

Secondary fix - Duplicate ping:
The addition of ping_interval.tick().await; at line 583 correctly handles the fact that tokio::time::interval fires immediately on the first tick. Since the upsert operation was just called above (lines 544-550), this prevents an unnecessary duplicate write.


Performance Considerations ⚠️

Potential concern:
Moving the database write (lines 544-550) to happen before the pending message check (lines 552-564) means:

  • Every hibernation attempt now performs a DB write, even for requests that immediately rewake
  • Previously, requests with pending messages would skip the DB write entirely via early return

Impact assessment:

  • For requests with pending messages: Adds one extra DB operation (upsert followed by early return)
  • However, this is likely acceptable because:
    1. The correctness gain (eliminating race condition) outweighs the performance cost
    2. The upsert operation in engine/packages/pegboard/src/ops/actor/hibernating_request/upsert.rs:15-51 appears relatively lightweight
    3. If immediate rewake is common, this could be optimized later with a check-then-write pattern using proper locking

Recommendation: This trade-off seems reasonable for correctness, but consider monitoring metrics on hibernation frequency and immediate rewake rates to assess if optimization is needed.


Security Concerns ✅

No security issues identified:

  • No new attack surfaces introduced
  • Proper error propagation maintained
  • No credential handling or injection risks
  • Database operations follow existing patterns with proper transaction handling

Test Coverage ⚠️

Missing test coverage:
According to CLAUDE.md guidelines, complex race conditions and timing-sensitive code should have test coverage. I couldn't find tests specifically for:

  • The race condition scenario (pending messages arriving during hibernation setup)
  • The immediate rewake path
  • The ping interval behavior with the discarded first tick

Recommendation:
Consider adding integration tests that:

  1. Verify hibernating request entry exists before the actor checks for pending messages
  2. Test the immediate rewake scenario when has_pending_websocket_messages() returns true
  3. Verify the ping interval doesn't fire duplicate pings immediately after hibernation starts

Test suggestion:

#[tokio::test]
async fn test_hibernation_race_condition() {
    // Setup: Create request with pending messages
    // Action: Trigger hibernation
    // Assert: Hibernating request entry exists in DB even though rewake is immediate
}

Additional Notes

Cleanup opportunity:
The deletion of the blank line at line 562 and addition of a blank line at line 563 appears to be a minor formatting change. While it improves readability by separating the log from the return, ensure this aligns with the project's formatting standards (though I see the PR correctly avoids running cargo fmt per CLAUDE.md).

Documentation:
The inline comments are excellent and explain the reasoning well. No additional documentation needed.


Verdict: ✅ Approve with Suggestions

This is a solid bug fix that addresses a real race condition. The changes are minimal, well-commented, and correct. My suggestions:

  1. Optional: Clean up the request_id/unique_request_id variable usage for consistency
  2. Recommended: Add test coverage for the race condition scenario
  3. Optional: Monitor performance impact of the additional DB write on immediate rewakes

The core fix is sound and should be merged. The test coverage can be addressed in a follow-up if needed.

@MasterPtato MasterPtato force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from 968a92b to e1152dc on November 20, 2025 22:38
@MasterPtato MasterPtato force-pushed the 11-19-chore_rivetkit_fix_hibernation_implementation branch from 193f67a to 8145fc3 on November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

Code Review

I've reviewed the changes in this PR. Overall, the fix looks good and addresses a race condition in the websocket hibernation flow. Here's my detailed feedback:

✅ Positives

  1. Race Condition Fix: Moving the hibernating request upsert before checking for pending messages is the correct approach. This ensures the entry exists in the database before any potential early return, preventing the race condition where the keepalive task would run but the initial entry was never written.

  2. Interval Behavior Fix: Adding ping_interval.tick().await to discard the first immediate tick is correct. Since tokio::time::interval fires immediately on the first tick, and you've already written the hibernating request entry above, this prevents a redundant immediate write.

  3. Clear Comments: The inline comments effectively explain the reasoning behind both changes, which will help future maintainers understand the intent.

🔍 Minor Observations

  1. Error Handling Consistency: The upsert operation at line 544-550 uses .await? which will propagate errors. This is appropriate, but it means hibernation will fail if the database operation fails. Consider whether there should be any additional logging or metrics for this case (though current behavior is likely fine).

  2. Database Write Ordering: The fix ensures the write happens before the has_pending_websocket_messages check. This is good, but worth noting that there's a brief window where the hibernating request entry exists even if we immediately return HibernationResult::Continue. The entry will persist until the cleanup at lines 609-615. This seems acceptable as it's a short-lived inconsistency.

  3. Keepalive Task Behavior: After the fix, the keepalive task will:

    • Wait for the interval (not immediate due to the discarded tick)
    • Add jitter
    • Upsert the hibernating request

    This ensures the timestamp stays fresh. The logic looks correct.

🎯 Suggestions

Optional Enhancement: Consider adding a trace-level log after the initial upsert to aid in debugging:

self.ctx
    .op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id: self.actor_id,
        gateway_id: self.shared_state.gateway_id(),
        request_id,
    })
    .await?;

tracing::trace!(?request_id, "inserted initial hibernating request entry");

However, this is truly optional and may be unnecessary noise.

📋 Summary

The changes correctly fix the race condition by ensuring the hibernating request entry is written immediately upon hibernation, before any early returns. The interval tick discard is a nice touch that prevents redundant writes. The code follows the project's conventions and error handling patterns.

Recommendation: ✅ Approve - This is a solid fix that addresses the root cause of the issue.

@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents a scenario where the actor could be rewoken before the hibernation state is properly recorded.

Changes Analysis

The PR makes two key changes to handle_websocket_hibernation:

  1. Moves the hibernating_request::upsert operation earlier (lines 542-550)

    • Now executes before the has_pending_websocket_messages check
    • Previously was only called inside the keepalive task loop
    • Ensures the hibernation state is persisted even if we immediately return HibernationResult::Continue
  2. Discards the first interval tick (lines 581-583)

    • Prevents duplicate upsert operation since interval fires immediately on creation
    • The initial upsert now happens before the keepalive task starts

Positive Aspects ✅

  1. Race Condition Fix: This correctly addresses a timing issue where:

    • Actor could be rewoken (via has_pending_websocket_messages returning true)
    • But hibernation state was never persisted to the database
    • External systems checking for hibernating requests would have stale/missing data
  2. Code Comments: Excellent inline documentation explaining the rationale:

    • "This ensures the entry exists even if we immediately rewake the actor"
    • "Discard the first tick since it fires immediately"
  3. Consistent State: By writing the hibernation record upfront, the system maintains consistency even during edge cases

  4. Idempotent Operation: The upsert operation is idempotent (based on upsert.rs), so calling it early is safe
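
A toy model of that idempotence, with a plain map standing in for the database (not the UniversalDB API):

use std::collections::HashMap;

// writing the same key twice leaves identical state, which is why the
// extra early write introduced by this PR is safe to repeat
fn upsert(db: &mut HashMap<String, u64>, request_id: &str, last_ping_ts: u64) {
    db.insert(request_id.to_string(), last_ping_ts);
}

fn main() {
    let mut db = HashMap::new();
    upsert(&mut db, "req-1", 100);
    upsert(&mut db, "req-1", 100); // no effect the second time
    assert_eq!(db.len(), 1);
    assert_eq!(db["req-1"], 100);
}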

Potential Concerns & Questions 🤔

  1. Error Handling Flow (Minor)

    • If the initial upsert operation (lines 544-550) fails, the function returns early with the ? operator
    • This means no hibernation entry AND no cleanup occurs
    • Question: Is this the desired behavior? Should there be any logging or cleanup?
    • Current behavior seems correct since we never entered hibernation state, but worth confirming
  2. Cleanup on Early Return (Design Question)

    • When has_pending_websocket_messages returns true (line 555-563), we return Continue immediately
    • The hibernation entry stays in the database despite never actually hibernating
    • The cleanup only happens in lines 607-616 for Close or Err results
    • Question: Should we delete the entry when returning early at line 563? Or is this intentional for tracking purposes?
    • Looking at the cleanup logic (lines 605-617), it seems like Continue results intentionally don't delete the entry
  3. Transaction Semantics

    • The upsert operation writes to UniversalDB
    • If the subsequent has_pending_websocket_messages check also reads from the DB, are there any transaction isolation concerns?
    • Based on upsert.rs, it uses Serializable isolation, which should be safe

Performance Considerations

  • Minimal Impact: The additional upfront upsert call adds one database write per hibernation attempt
  • Trade-off: Slight overhead vs correctness - this is the right trade-off
  • Optimization: The interval tick discard (line 583) prevents an unnecessary duplicate write, which is good

Security & Best Practices

  • ✅ Follows Rivet error handling patterns (Result with ?)
  • ✅ Uses structured logging with tracing::debug! and field syntax
  • ✅ Maintains proper async/await patterns
  • ✅ No security concerns identified

Test Coverage Recommendations

Consider adding test cases for:

  1. Hibernation with immediate rewake due to pending messages
  2. Verify hibernation entry exists even when Continue is returned early
  3. Cleanup behavior for different HibernationResult variants

Verdict

LGTM with minor questions

This is a solid fix for a subtle race condition. The code quality is good with clear comments. The only questions are around the intended behavior of leaving the hibernation entry in the DB when returning Continue early - but this appears to be intentional based on the cleanup logic.

The fix is minimal, focused, and addresses a real correctness issue without introducing unnecessary complexity.


Review generated by Claude Code

@claude

claude bot commented Nov 20, 2025

Code Review

Overview

This PR fixes a race condition in the websocket hibernation logic by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix for the hibernation workflow.

Positive Aspects ✅

  1. Race Condition Fix: The main fix addresses a real race condition. Previously, if pending messages existed, the code would return early without ever writing the hibernating request entry. This could cause issues if external systems query for hibernating requests (via pegboard_actor_hibernating_request_list) before the entry is created.

  2. Proper Sequencing: Moving the upsert call before the pending message check ensures the entry exists in the database regardless of the execution path (early return or hibernation).

  3. Interval Tick Fix: The addition of discarding the first tick (ping_interval.tick().await) is correct, since tokio::time::interval fires immediately on creation. Without this, you'd get a duplicate upsert call right away.

  4. Good Comments: The inline comments clearly explain the "why" behind both changes.

Concerns & Suggestions

1. Error Handling on Early Return ⚠️

// Insert hibernating request entry before checking for pending messages
self.ctx
    .op(pegboard::ops::actor::hibernating_request::upsert::Input {
        // ...
    })
    .await?;

// Immediately rewake if we have pending messages
if self.shared_state.has_pending_websocket_messages(request_id).await? {
    tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
    return Ok(HibernationResult::Continue);
}

Issue: If we write the hibernating request entry and then immediately return HibernationResult::Continue, the entry remains in the database even though the socket is not actually hibernating.

Looking at the cleanup logic (lines 607-616), the entry is only deleted when:

  • HibernationResult::Close is returned
  • An error occurs

But when we return early with HibernationResult::Continue on line 563, the cleanup code never runs, leaving a stale entry.

Suggestion: Consider deleting the entry if returning early:

if self.shared_state.has_pending_websocket_messages(request_id).await? {
    tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
    
    // Clean up the entry since we're not actually hibernating
    self.ctx
        .op(pegboard::ops::actor::hibernating_request::delete::Input {
            actor_id: self.actor_id,
            gateway_id: self.shared_state.gateway_id(),
            request_id,
        })
        .await?;
    
    return Ok(HibernationResult::Continue);
}

2. Consider Transaction Semantics 💭

The upsert operation involves a transaction that:

  1. Reads the old last_ping_ts
  2. Deletes the old HibernatingRequestKey
  3. Writes the new timestamp and key

If this upsert is immediately followed by the pending-message check, there's still a tiny window where the entry exists but the actor isn't truly hibernating. Depending on your requirements, this might be acceptable or might need further refinement.

3. Test Coverage 📝

There don't appear to be any unit tests for the pegboard-gateway package. While integration tests might exist elsewhere, consider adding tests for:

  • The race condition scenario (pending messages during hibernation attempt)
  • The keepalive interval behavior
  • Cleanup of hibernating request entries in different exit paths

Performance Considerations

No performance issues identified. The additional database write is minimal and necessary for correctness.

Security Considerations

No security concerns identified in this change.

Summary

This is a good fix for a race condition, but the early return path (line 563) may leave a stale entry in the database. Please verify this doesn't cause issues with your hibernating request tracking system, or add cleanup logic as suggested above.

Recommendation: Request changes to handle cleanup on early return, unless there's a specific reason the entry should remain (in which case, please document that behavior).

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b on November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR fixes a race condition in websocket hibernation by ensuring the hibernating request entry is written to the database before checking for pending messages. This is a critical fix that prevents data loss when an actor immediately rewakes.


✅ Strengths

  1. Correct Race Condition Fix: Moving the upsert call before the has_pending_websocket_messages check is the right solution. Previously, if pending messages existed, the function would return early without writing the hibernation state, causing the entry to not exist in the database.

  2. Good Comments: The inline comment clearly explains why the upsert happens first: "This ensures the entry exists even if we immediately rewake the actor"

  3. Interval Tick Fix: Discarding the first tick from the interval is correct since tokio::time::interval fires immediately on creation, and the initial upsert already happened above.


🔍 Observations & Questions

1. Cleanup on Early Return (Minor)

When has_pending_websocket_messages returns true and we return HibernationResult::Continue early at line 563, we do not clean up the hibernating request entry we just created.

Question: Is this intentional? I see the cleanup logic at lines 607-616 only runs if we get past the early return. This means:

  • The entry remains in the database until it expires or is cleaned up later
  • The keepalive task never runs to maintain the timestamp

Suggestion: Consider whether we need to delete the entry when immediately rewaking:

if self
    .shared_state
    .has_pending_websocket_messages(request_id)
    .await?
{
    tracing::debug!(
        ?unique_request_id,
        "detected pending requests on websocket hibernation, rewaking actor"
    );
    
    // Clean up the entry we just created since we're not hibernating
    self.ctx
        .op(pegboard::ops::actor::hibernating_request::delete::Input {
            actor_id: self.actor_id,
            gateway_id: self.shared_state.gateway_id(),
            request_id,
        })
        .await?;

    return Ok(HibernationResult::Continue);
}

Or document if the stale entry is handled elsewhere (e.g., TTL-based cleanup).

2. Test Coverage (Minor)

I do not see tests covering this specific race condition scenario. The existing test at engine/packages/guard-core/tests/custom_serve.rs::test_custom_serve_websocket_hibernation tests basic hibernation but not the immediate rewake case.

Suggestion: Consider adding a test that:

  1. Triggers hibernation with pending messages already queued
  2. Verifies the request entry exists in the database despite immediate rewake
  3. Verifies proper cleanup of the entry

📊 Code Quality

  • Error Handling: ✅ Proper propagation with ?
  • Logging: ✅ Good use of structured logging with ?unique_request_id
  • Code Style: ✅ Follows repo conventions (hard tabs, lowercase log messages)
  • Dependencies: ✅ No new dependencies added

🔒 Security & Performance

  • Security: ✅ No security concerns identified
  • Performance: ✅ Minimal performance impact - adds one database write before the check, which is necessary for correctness

🎯 Recommendation

Approve with minor suggestions. The core fix is solid and addresses a real race condition. The only question is about cleanup of the database entry when immediately rewaking - this may already be handled by other mechanisms (TTL, periodic cleanup, etc.), but it would be good to confirm or add a cleanup call.

Great catch on this subtle race condition! 🎉

@claude

claude bot commented Nov 21, 2025

PR Review: fix(pegboard-gateway): write hws immediately on socket hibernate

Summary

This PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a scenario where an actor could be woken immediately but the hibernating request entry doesn't exist yet.


Code Quality & Best Practices ✅

Strengths:

  1. Clear intent: The inline comment clearly explains why the upsert is moved earlier
  2. Follows existing patterns: Uses the same hibernating_request::upsert operation that's called in the keepalive loop
  3. Proper error handling: Maintains the await? pattern for error propagation
  4. Good logging: The existing debug log message is preserved and remains helpful

Code style compliance:

  • ✅ Follows structured logging patterns (already in place)
  • ✅ Proper async/await usage
  • ✅ Consistent with codebase conventions

Race Condition Fix 🔧

The Problem:
Previously, the flow was:

  1. Check for pending messages
  2. If no pending messages, upsert hibernating request (in keepalive loop)
  3. Start hibernation

This created a race window where:

  • If a message arrives and triggers a wake signal immediately
  • But the hibernating request entry hasn't been written yet
  • The wake mechanism might not find the entry to wake

The Solution:
The new flow:

  1. Upsert hibernating request immediately
  2. Check for pending messages (and return early if needed)
  3. Start keepalive loop

This ensures the entry exists before any potential wake signals.


Potential Issues & Questions 🤔

1. Interval First Tick Handling

// Discard the first tick since it fires immediately and we've already called this
// above
ping_interval.tick().await;

Good catch! This prevents a duplicate immediate upsert since:

  • The initial upsert happens at lines 544-550
  • tokio::time::interval fires immediately on the first tick
  • Without this discard, we'd upsert twice in rapid succession

Minor note: the discard only matters once the keepalive task actually starts. If has_pending_websocket_messages returns true at line 555, the function returns early and the keepalive task is never spawned, so the discarded tick never executes; this path is safe.
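For reference, a self-contained demo of the interval behavior being relied on here (standalone tokio example, not gateway code; the 100ms period is arbitrary):

use std::time::{Duration, Instant};
use tokio::time::{interval, MissedTickBehavior};

#[tokio::main]
async fn main() {
    let start = Instant::now();
    let mut ticker = interval(Duration::from_millis(100));
    ticker.set_missed_tick_behavior(MissedTickBehavior::Skip);

    // The first tick completes immediately (~0ms after creation).
    ticker.tick().await;
    println!("first tick at {:?}", start.elapsed());

    // Subsequent ticks honor the full period (~100ms).
    ticker.tick().await;
    println!("second tick at {:?}", start.elapsed());
}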

2. Database Transaction Timing

The hibernating_request::upsert operation uses a transaction that:

  • Reads the existing last_ping_ts with Serializable isolation
  • Deletes the old entry (if exists)
  • Writes a new entry with current timestamp
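A toy in-memory model of those semantics may help (illustration only: the real operation runs inside a UniversalDB transaction, and every name below is hypothetical):

use std::collections::HashMap;
use std::sync::Mutex;

struct Entry {
    last_ping_ts: i64, // unix ms
}

struct Db {
    rows: Mutex<HashMap<String, Entry>>,
}

impl Db {
    fn upsert_hibernating_request(&self, key: &str, now_ms: i64) {
        // Holding the lock across the whole read-delete-write mimics a
        // single serializable transaction: no other writer can interleave.
        let mut rows = self.rows.lock().unwrap();
        let _prev_ping = rows.get(key).map(|e| e.last_ping_ts); // read
        rows.remove(key); // delete the old entry, if one exists
        rows.insert(key.to_string(), Entry { last_ping_ts: now_ms }); // write
    }
}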

Question: What happens if this transaction takes longer than expected? Could there be a race between:

  1. This initial upsert completing
  2. The has_pending_websocket_messages check
  3. A message arriving that tries to wake the actor

Recommendation: The current implementation seems safe because the upsert completes fully before the pending check, but consider adding a comment about the transaction boundaries if this becomes relevant in the future.

3. Cleanup on Early Return

When returning early due to pending messages (line 563), the hibernating request entry is not cleaned up. This means:

  • Entry remains in the database
  • But the keepalive loop never starts
  • The entry won't be refreshed

Is this intentional? Looking at the code:

  • The cleanup only happens at lines 608-616 for errors or HibernationResult::Close
  • For HibernationResult::Continue, no cleanup occurs

This could leave stale entries if the actor is immediately woken. However, looking at lines 605-607, the Continue result is explicitly listed as a case where we don't delete the entry, suggesting this is intentional behavior.

Recommendation: Add a comment explaining that the entry is left in place for Continue results since the actor is resuming active handling.


Performance Considerations ⚡

  1. Additional database write: The upsert now happens unconditionally, even if we return early

    • Impact: Minimal - this is a single key-value write to UniversalDB
    • Benefit: Eliminates the race condition, worth the cost
  2. No blocking operations: The upsert is async and doesn't block the hibernation check

    • ✅ Good design
  3. Keepalive interval optimization: The jitter in the keepalive loop (line 589) prevents thundering herd

    • ✅ Already well optimized

Security Concerns 🔒

No security issues identified. The change:

  • Uses existing authenticated operations
  • Doesn't expose new attack surfaces
  • Maintains proper error handling
  • Follows principle of least privilege

Test Coverage 🧪

Existing tests: The guard-core/tests/custom_serve.rs file has:

  • test_custom_serve_websocket_hibernation - Tests basic hibernation flow
  • ✅ Tests for normal websocket operations

Recommendation: Consider adding a test that specifically exercises the race condition:

#[tokio::test]
async fn test_hibernation_with_immediate_wake() {
    // Test scenario:
    // 1. Trigger hibernation
    // 2. Immediately send a message (while hibernation is starting)
    // 3. Verify the hibernating request entry exists
    // 4. Verify the actor wakes correctly
}

This would validate that the fix actually prevents the race condition. However, this may be difficult to test reliably due to timing sensitivity.


Recommendations Summary

  1. Approve the core fix - This is a solid solution to a real race condition
  2. 📝 Add comment - Clarify why the entry isn't deleted on early return (line 563)
  3. 🧪 Consider integration test - If feasible, add a test for the immediate-wake scenario
  4. 📚 Documentation - If there's design documentation for the hibernation system, update it to reflect this ordering requirement

Final Verdict: ✅ LGTM with minor suggestions

This is a well-thought-out fix for a subtle race condition. The code is clean, follows project conventions, and the solution is elegant. The minor suggestions above are for future hardening but shouldn't block merging.

Changes are:

  • Correct
  • Safe
  • Well-commented
  • Performance-neutral

Great work catching this race condition! 🎉

@claude

claude bot commented Nov 21, 2025

PR Review: fix(pegboard-gateway): write hws immediately on socket hibernate

Summary

This PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages, rather than relying solely on the keepalive task.

Code Quality ✅

Positive aspects:

  • Clean, focused fix that addresses a specific race condition
  • Good inline comments explaining the reasoning (lines 542-543, 581-582)
  • Follows existing code patterns and style conventions
  • Proper error propagation with ? operator

Minor observations:

  • The comment on lines 581-582 could be slightly clearer about why the first tick fires immediately (tokio intervals tick immediately on creation)

Correctness ✅

The fix is correct and addresses a real race condition:

Problem being solved:
Previously, if an actor hibernated and immediately needed to rewake (detected by has_pending_websocket_messages), the function would return early with HibernationResult::Continue at line 563 without ever writing the hibernating request entry to the database. This entry would only get written by the keepalive task starting at line 571, but that task wouldn't run if we returned early.

Why this matters:
The pegboard::ops::actor::hibernating_request::list operation (lines 25-28 in list.rs) looks for hibernating requests that haven't been pinged recently. If the entry was never written, the actor could be incorrectly identified as non-hibernating even though it had intended to hibernate.
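In other words, the check reduces to a timestamp comparison along these lines (a hedged sketch of the semantics described here; the struct, field, and threshold names are assumptions, not the real list.rs):

// Toy model of the eligibility predicate; names are hypothetical.
struct HibernatingRequest {
    last_ping_ts: i64, // unix ms, refreshed by the keepalive upsert
}

fn is_eligible(entry: &HibernatingRequest, now_ms: i64, threshold_ms: i64) -> bool {
    // "Hasn't been pinged recently": the keepalive has lapsed.
    now_ms - entry.last_ping_ts > threshold_ms
}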

The solution:

  1. Write the entry upfront (lines 544-550) so it exists regardless of the code path taken
  2. Skip the first tick (line 583) to avoid a duplicate write since we already wrote it above
  3. The keepalive task continues to update the timestamp periodically as before

Performance Considerations ✅

Positive:

  • The extra database write is negligible overhead (single upsert operation)
  • The early-return case still avoids spawning the keepalive task; the only added cost there is the single upfront upsert
  • The MissedTickBehavior::Skip is appropriate for keepalive semantics

Note:

  • The keepalive interval is hibernating_request_eligible_threshold / 2 (default: 45 seconds), which is reasonable for preventing the actor from being considered stale
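As a back-of-the-envelope check (assuming a 90-second default threshold, inferred from the quoted 45-second half-interval):

use std::time::Duration;

fn main() {
    // Assumed default; inferred from the 45s half-interval quoted above.
    let eligible_threshold = Duration::from_secs(90);
    let keepalive_period = eligible_threshold / 2; // 45s

    // Worst case between pings: a healthy entry is never more than about
    // half the threshold stale, leaving comfortable headroom.
    assert!(keepalive_period < eligible_threshold);
    println!("ping every {keepalive_period:?}, threshold {eligible_threshold:?}");
}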

Security Considerations ✅

No security concerns identified:

  • No user input is processed in this code path
  • Database operations use proper error handling
  • No potential for injection or data leakage

Potential Issues & Suggestions

1. Error Handling in Early Return Path

Currently at lines 542-550, if the upsert operation fails, the function will return an error before checking for pending messages. This could prevent an immediate rewake even when one is needed.

Suggestion: Consider whether it's acceptable to fail-fast here, or if you should:

// Insert hibernating request entry before checking for pending messages
// This ensures the entry exists even if we immediately rewake the actor
if let Err(e) = self.ctx
    .op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id: self.actor_id,
        gateway_id: self.shared_state.gateway_id(),
        request_id,
    })
    .await
{
    tracing::warn!(?e, "failed to upsert hibernating request, continuing anyway");
}

Though the current strict error handling may be intentional for correctness guarantees.

2. Cleanup on Early Return

When returning HibernationResult::Continue at line 563 (due to pending messages), we never reach the cleanup logic at lines 605-617 that deletes the hibernating request entry. The entry will remain in the database until the next actual hibernation or until it expires.

Is this intentional? If the actor immediately continues without actually hibernating, should we delete the entry we just created? Currently:

  • Entry gets created (lines 544-550)
  • Early return if pending messages (line 563)
  • No cleanup since we returned early
  • Entry remains in DB

Suggestion: Consider adding cleanup before the early return:

if self
    .shared_state
    .has_pending_websocket_messages(request_id)
    .await?
{
    tracing::debug!(
        ?unique_request_id,
        "detected pending requests on websocket hibernation, rewaking actor"
    );
    
    // Clean up the entry we just created since we're not actually hibernating
    self.ctx
        .op(pegboard::ops::actor::hibernating_request::delete::Input {
            actor_id: self.actor_id,
            gateway_id: self.shared_state.gateway_id(),
            request_id,
        })
        .await?;

    return Ok(HibernationResult::Continue);
}

3. Test Coverage ⚠️

Missing: No tests found for the pegboard-gateway package. This race condition fix would benefit from:

  • Integration test simulating the hibernation flow with immediate pending messages
  • Test verifying the hibernating request entry exists in both code paths
  • Test for the keepalive task updating timestamps correctly

Architecture & Design ✅

The fix aligns well with the existing architecture:

  • Maintains the upsert-based keepalive pattern
  • Preserves transaction boundaries (lines 19-50 in upsert.rs)
  • Follows the operation pattern used throughout the codebase
  • Properly uses custom_instrument for observability

Verdict: APPROVE with suggestions

This is a solid fix for a real race condition. The core logic is correct and addresses the issue effectively. The suggestions above are minor enhancements around error handling and cleanup that could make the code more robust, but they don't block approval.

Recommended next steps:

  1. Consider the cleanup suggestion for the early-return path (most important)
  2. Consider adding integration tests for hibernation scenarios
  3. Consider whether the error handling strategy at lines 544-550 is appropriate for this use case

Great work identifying and fixing this race condition! 🎯

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch from e1152dc to 968a92b on November 21, 2025 at 01:27
@claude

claude bot commented Nov 21, 2025

Code Review

Summary

This PR fixes a race condition in the websocket hibernation flow by ensuring the hibernating request entry is written to the database before checking for pending messages. This prevents a bug where the entry would never be created if the actor immediately rewakes.

What Changed

  1. Moved upsert operation earlier (lines 542-550): The hibernating_request::upsert call now happens before the has_pending_websocket_messages check
  2. Added interval tick discard (lines 581-583): Discards the first immediate tick from tokio::time::interval to avoid duplicate upsert calls

Code Quality ✅

Strengths:

  • Clear, well-commented code explaining the rationale for each change
  • Follows proper error handling patterns with Result types
  • Maintains consistency with existing code style and patterns
  • The fix is minimal and surgical - only changes what's necessary

Observations:

  • The comment on lines 581-582 is clear and explains the tokio interval behavior well
  • Structured logging is properly used (tracing::debug!(?unique_request_id, ...))
  • The early return pattern (line 563) is appropriate for the rewake case

Logic & Correctness ✅

Analysis of the fix:

Before this PR:

  1. Check for pending messages
  2. If pending → return early (Continue)
  3. If not pending → upsert hibernating request entry
  4. Bug: Early return skips upsert, leaving no database entry

After this PR:

  1. ✅ Upsert hibernating request entry (ensures it exists)
  2. Check for pending messages
  3. If pending → return early (Continue) - entry already exists ✅
  4. If not pending → start keepalive task

The fix ensures the database entry exists regardless of which code path is taken.

Interval behavior:

  • tokio::time::interval fires immediately on creation
  • The keepalive loop now correctly discards this first tick since we've already called upsert above (lines 544-550)
  • This prevents a duplicate upsert call immediately after spawning the task

Potential Issues 🟡

1. Cleanup in early-return case

When HibernationResult::Continue is returned early (line 563), the hibernating request entry is not cleaned up because:

  • The keepalive task never starts
  • The cleanup logic (lines 605-617) only runs for Close or Err cases

Question: Is this intentional? When we immediately rewake due to pending messages, should we:

  • Option A: Leave the entry (current behavior) - assumes it will be cleaned up later when the actor truly hibernates
  • Option B: Clean it up before returning

If Option B is desired, consider adding:

if self.shared_state.has_pending_websocket_messages(request_id).await? {
    tracing::debug!(?unique_request_id, "detected pending requests on websocket hibernation, rewaking actor");
    
    // Clean up since we're not actually hibernating
    self.ctx.op(pegboard::ops::actor::hibernating_request::delete::Input {
        actor_id: self.actor_id,
        gateway_id: self.shared_state.gateway_id(),
        request_id,
    }).await?;
    
    return Ok(HibernationResult::Continue);
}

However, this might be correct as-is if the entry serves as a "this request is eligible for hibernation" marker that persists across rewake cycles.

Performance ✅

  • The additional database upsert on the fast path (early return) is minimal overhead
  • Preventing duplicate interval ticks is a good optimization
  • No unnecessary allocations or blocking operations introduced

Security ✅

  • No security concerns identified
  • Proper use of existing security patterns
  • No exposure of sensitive data
  • Request IDs and actor IDs are properly scoped

Testing Recommendations 📋

Consider adding tests for:

  1. Early return with pending messages - Verify that:
    • Hibernating request entry is created before the early return
    • Entry cleanup behavior is correct (or verify intentional persistence)
  2. Normal hibernation flow - Verify keepalive task properly updates timestamps
  3. Interval tick behavior - Verify no duplicate upserts happen immediately after spawning

Since there are no test files found for hibernating_request, this would be a good opportunity to add coverage for this critical path.

Verdict: ✅ Approve with Minor Question

The fix is correct and well-implemented. The only question is about the cleanup behavior in the early-return case - this might be intentional, but worth confirming the expected behavior.

Great work on catching and fixing this race condition! The comments explaining the fix are particularly helpful for future maintainers.


Recommendation: Approve and merge once the cleanup behavior question is addressed (either confirming it's intentional or adding cleanup if needed).

@graphite-app
Contributor

graphite-app bot commented Nov 21, 2025

Merge activity

  • Nov 21, 2:40 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 21, 2:42 AM UTC: CI is running for this pull request on a draft pull request (#3515) due to your merge queue CI optimization settings.
  • Nov 21, 2:43 AM UTC: Merged by the Graphite merge queue via draft PR: #3515.

@graphite-app graphite-app bot closed this Nov 21, 2025
@graphite-app graphite-app bot deleted the 11-19-fix_pegboard-gateway_write_hws_immediately_on_socket_hibernate branch November 21, 2025 02:43