
Conversation

@NathanFlurry
Member

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivetkit-serverless | Error | Error | | Nov 20, 2025 8:52pm |

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | | Nov 20, 2025 8:52pm |
| rivet-inspector | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm |
| rivet-site | Ignored | Ignored | Preview | Nov 20, 2025 8:52pm |

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 20, 2025 07:59 Inactive
Member Author

NathanFlurry commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review

I've reviewed this PR and have the following feedback:

Overview

This PR extracts the keepalive task into a separate module and ensures it runs both during active WebSocket connections and during hibernation. The goal is to prevent hibernating requests from being garbage collected.

Positive Changes

  • Good refactoring: Extracting the keepalive logic into a dedicated module improves code organization and maintainability
  • Proper task lifecycle management: Using watch channels for graceful shutdown is the correct pattern
  • Consistent error handling: The new task follows the same pattern as other lifecycle tasks

Issues and Concerns

1. Documentation typo (Minor)

Line 13 in keepalive_task.rs - Missing space in comment. Should be "/// Only ran for hibernating requests." instead of "///Only ran for hibernating requests."

2. Potential unnecessary work during active connections (Performance)

Lines 480-507 in lib.rs - The keepalive task now runs during active WebSocket connections when can_hibernate is true. Previously, it only ran during actual hibernation (handle_websocket_hibernation).

Question: Is it necessary to ping the hibernating_request UDB entry while the connection is still active? This adds database writes during normal operation. The original implementation only ran the keepalive task during actual hibernation (lines 603-616), which seems more efficient.

Recommendation: Consider whether the keepalive task should only run during actual hibernation, not during active connections. If it must run during active connections, please document why this change is necessary in the PR description or code comments.

3. Task spawning pattern (Code Quality)

Lines 485-491 in lib.rs - The keepalive task is spawned inside a tokio::join! future, which is different from how the other three tasks are spawned (lines 398-418). For consistency, consider spawning it outside the join block conditionally and then awaiting it inside, similar to how tunnel_to_ws, ws_to_tunnel, and ping tasks are handled.
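
For illustration, that shape might look roughly like this (a sketch only; the handle names and the keepalive task signature are assumed from context and may not match the actual code):

let keepalive_handle = can_hibernate.then(|| {
    tokio::spawn(keepalive_task::task(
        ctx.clone(),
        actor_id,
        gateway_id,
        request_id,
        keepalive_abort_rx,
    ))
});

let (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) = tokio::join!(
    tunnel_to_ws_handle,
    ws_to_tunnel_handle,
    ping_handle,
    async {
        match keepalive_handle {
            // Surface a panic in the spawned task instead of silently dropping it
            Some(handle) => match handle.await {
                Ok(res) => res,
                Err(err) => Err(anyhow::anyhow!("keepalive task panicked: {err}")),
            },
            // Hibernation not supported for this request: treat like an aborted task
            None => Ok(LifecycleResult::Aborted),
        }
    },
);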

4. Result matching could be more robust (Correctness)

Lines 511-523 in lib.rs - The result matching logic prefers the first non-aborted result, but doesn't explicitly handle cases where multiple tasks return non-aborted results. Consider being more explicit about the precedence for different LifecycleResult variants (ServerClose vs ClientClose vs Aborted).

5. Clone variables optimization (Minor Performance)

Lines 424-428 in lib.rs - Several clones are performed that may not be necessary since most of these types are cheap to clone (Ids are Copy types, ctx is an Arc), but consider capturing them directly in the async block if possible.
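
A possible shape (purely illustrative; the variable names and task signature are assumed from this review, not taken from the actual code):

let ctx = ctx.clone();
let keepalive = async move {
    // actor_id, gateway_id, and request_id are Copy, so the async move block
    // captures them by value; only ctx needs an explicit Arc clone.
    keepalive_task::task(ctx, actor_id, gateway_id, request_id, keepalive_abort_rx).await
};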

Testing Recommendations

  • Add integration tests to verify that hibernating requests don't get garbage collected
  • Test that the keepalive interval fires correctly (at threshold / 2)
  • Verify that jitter prevents thundering herd behavior in high-load scenarios

Security/Safety

No security concerns identified. The changes maintain proper error handling and resource cleanup.

Verdict

The core fix looks sound, but there are some architectural questions (particularly issue 2 about running during active connections) that should be addressed before merging. The code quality could be improved with more consistent patterns and better documentation of the design decisions.

@NathanFlurry NathanFlurry marked this pull request as ready for review November 20, 2025 18:24
@MasterPtato MasterPtato force-pushed the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch from 4f192f3 to a11d203 on November 20, 2025 19:38
@MasterPtato MasterPtato force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 5552b74 to 815c918 on November 20, 2025 19:38
@claude

claude bot commented Nov 20, 2025

Code Review - PR #3498

Summary

This PR refactors the keepalive task for hibernating requests by extracting it into a separate module and ensuring it runs both during active WebSocket connections and during hibernation. Overall, the changes look good and follow the repository's patterns well.


Positive Observations

Good refactoring: Extracting the keepalive logic into keepalive_task.rs improves code organization and reusability

Consistent patterns: The new task follows the same pattern as ping_task.rs with proper abort handling via watch channels

Proper lifecycle management: The keepalive task is correctly spawned and cleaned up in both handle_websocket and handle_websocket_hibernation

Jitter implementation: Good use of jitter (0-128ms) to prevent thundering herd problems

Import organization: Follows the repository convention of keeping imports at the top of the file


Issues & Suggestions

1. Typo in documentation (Minor)

Location: keepalive_task.rs:10

/// Periodically pings writes keepalive in UDB. This is used to restore hibernating request IDs on

Should be:

/// Periodically writes keepalive pings in UDB. This is used to restore hibernating request IDs on

2. Formatting issue in documentation (Minor)

Location: keepalive_task.rs:13

There's a missing space:

///Only ran for hibernating requests.

Should be:

/// Only ran for hibernating requests.

3. Inconsistent lifecycle result handling (Moderate)

Location: lib.rs:511-522

The lifecycle result matching logic shows asymmetry. When multiple tasks complete successfully but with different results, the code only checks the first two positions:

(Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
(Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
// Unlikely case
(res, _, _, _) => res,

Potential issue: If tunnel_to_ws_res and ws_to_tunnel_res are both Aborted, but ping_res or keepalive_res contains a meaningful result (like ServerClose or ClientClose), that result will be lost.

Suggestion: Consider a more comprehensive pattern that checks all positions for non-Aborted results:

match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
    // Prefer error
    (Err(err), _, _, _) => Err(err),
    (_, Err(err), _, _) => Err(err),
    (_, _, Err(err), _) => Err(err),
    (_, _, _, Err(err)) => Err(err),
    // Find first non-aborted result
    (Ok(res), _, _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, Ok(res), _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, _, Ok(res), _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    (_, _, _, Ok(res)) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
    // All aborted (unlikely)
    (res, _, _, _) => res,
}

4. Error handling observation (Low)

Location: keepalive_task.rs:21-26

The .try_into()? conversion could fail if hibernating_request_eligible_threshold / 2 does not fit in the u64 expected by Duration::from_millis (for example, a negative threshold), though this is extremely unlikely in practice. The current error handling is appropriate, but you might consider adding a more descriptive error context:

let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
    .try_into()
    .context("hibernating_request_eligible_threshold too large for interval")?,
));

However, this is optional since the threshold would have to be badly misconfigured to cause issues.


5. Missing context on abort handling (Low)

Location: keepalive_task.rs:37-39

The abort handling via keepalive_abort_rx.changed() is correct, but it might be worth adding a debug log when the task is aborted, similar to how other tasks in lib.rs log their completion states. This would help with debugging:

_ = keepalive_abort_rx.changed() => {
    tracing::debug!("keepalive task aborted");
    return Ok(LifecycleResult::Aborted);
}

Performance Considerations

✅ The keepalive interval calculation (threshold / 2) is appropriate to ensure pings happen well before the threshold expires

✅ Using MissedTickBehavior::Skip prevents backlog buildup if the task falls behind

✅ Jitter range (0-128ms) is reasonable for preventing synchronization
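
Putting those three points together, the keepalive loop has roughly the following shape (a sketch reconstructed from the snippets quoted in these reviews; details such as exactly where the jitter sleep sits may differ from the real module):

// Ping at half the eligibility threshold so the UDB entry is refreshed well before expiry
let mut ping_interval = tokio::time::interval(Duration::from_millis(
    (ctx.config()
        .pegboard()
        .hibernating_request_eligible_threshold()
        / 2)
    .try_into()?,
));
// Skip missed ticks instead of bursting to catch up
ping_interval.set_missed_tick_behavior(tokio::time::MissedTickBehavior::Skip);

loop {
    tokio::select! {
        _ = ping_interval.tick() => {}
        // Graceful shutdown via the watch channel
        _ = keepalive_abort_rx.changed() => return Ok(LifecycleResult::Aborted),
    }

    // Small random jitter so many connections do not hit UDB at the same instant
    // (`use rand::Rng;` is needed for gen_range)
    tokio::time::sleep(Duration::from_millis(rand::thread_rng().gen_range(0..128))).await;

    ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
        actor_id,
        gateway_id,
        request_id,
    })
    .await?;
}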


Security Considerations

✅ No security concerns identified
✅ Proper use of watch channels for cancellation
✅ No credentials or sensitive data handling issues


Test Coverage

⚠️ Missing tests: This package doesn't appear to have any tests. While not a blocker for this PR, consider adding integration tests for:

  • Keepalive task behavior during normal operation
  • Keepalive task abort handling
  • Proper cleanup when websockets close during hibernation

Verdict

Recommendation: Approve with minor fixes

The changes are well-structured and solve the stated problem effectively. The only critical item is the documentation typo, which is trivial to fix. The lifecycle result handling observation is worth considering but not blocking since the current logic works for the common case.

Great work on maintaining consistency with existing patterns! 🎯

@claude

claude bot commented Nov 20, 2025

PR Review: Keepalive Task Refactoring

Summary

This PR extracts keepalive logic into a dedicated task module and ensures hibernating requests are pinged both during active WebSocket connections and during hibernation. The refactoring improves code organization and fixes a potential bug where keepalive pings weren't being sent during active connections.


Code Quality & Best Practices ✅

Strengths:

  • Good separation of concerns: Extracting keepalive_task.rs follows the existing pattern of separate task modules (ping_task.rs, tunnel_to_ws_task.rs, etc.)
  • Consistent error handling: Uses anyhow::Result appropriately
  • Proper abort handling: Uses watch::Receiver for graceful task cancellation, consistent with other tasks
  • Code reuse: DRY principle applied - same keepalive logic used in both contexts

Minor Issues:

  1. Comment formatting (keepalive_task.rs:13)

    ///Only ran for hibernating requests.
    

    Missing space after ///. Should be:

    /// Only ran for hibernating requests.
    
  2. Import ordering (lib.rs:10-16)
    The imports are reordered in the diff, which looks like automatic formatting. Since CLAUDE.md states "DO NOT RUN CARGO FMT AUTOMATICALLY", confirm the reordering was intentional rather than the result of running cargo fmt.


Potential Bugs 🐛

Medium Priority:

  1. Result handling in keepalive task abort (lib.rs:615-616)

    let _ = keepalive_abort_tx.send(());
    let _ = keepalive_handle.await;

    The await result is ignored. If the keepalive task panicked, we should probably log it:

    let _ = keepalive_abort_tx.send(());
    if let Err(e) = keepalive_handle.await {
        tracing::warn!(?e, "keepalive task failed during cleanup");
    }

    Same issue exists in lines 440-441, 457-458, 473-474, and others where abort signals are sent.

  2. Lifecycle result priority logic (lib.rs:511-523)
    The logic for determining the single result from all tasks seems incomplete:

    // Prefer non aborted result if all succeed
    (Ok(res), Ok(LifecycleResult::Aborted), _, _) => Ok(res),
    (Ok(LifecycleResult::Aborted), Ok(res), _, _) => Ok(res),
    // Unlikely case
    (res, _, _, _) => res,

    This only checks the first two tasks. What if:

    • (Aborted, Aborted, ServerClose, Aborted) → Would return Aborted instead of ServerClose
    • (Aborted, Aborted, Aborted, ClientClose) → Would return Aborted instead of ClientClose

    Consider a more comprehensive approach:

    let mut lifecycle_res = match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res, keepalive_res) {
        // Prefer error
        (Err(err), _, _, _) => Err(err),
        (_, Err(err), _, _) => Err(err),
        (_, _, Err(err), _) => Err(err),
        (_, _, _, Err(err)) => Err(err),
        // Find first non-aborted result
        (Ok(res), _, _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
        (_, Ok(res), _, _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
        (_, _, Ok(res), _) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
        (_, _, _, Ok(res)) if !matches!(res, LifecycleResult::Aborted) => Ok(res),
        // All aborted
        (res, _, _, _) => res,
    };

Performance Considerations ⚡

Good:

  • Jitter implementation (keepalive_task.rs:42-44): Prevents thundering herd with 0-128ms jitter
  • Missed tick behavior: MissedTickBehavior::Skip prevents tick accumulation if system is slow
  • Efficient abort signaling: Uses watch::channel which is lightweight

Potential Improvement:

  • The keepalive task spawns unconditionally in handle_websocket (lib.rs:485) even when can_hibernate is false, then immediately returns Aborted. Consider spawning conditionally:
    let keepalive = if can_hibernate {
        Some(tokio::spawn(keepalive_task::task(...)))
    } else {
        None
    };
    Then handle the None case in the join, as shown in the sketch below. This avoids spawning unnecessary tasks.
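
    One way to handle the None case inside the join (a sketch using the same hypothetical names as the snippet above):

    async {
        match keepalive {
            // Convert a panicked task (JoinError) into an error result
            Some(handle) => handle.await.map_err(anyhow::Error::from)?,
            // Nothing was spawned, so report the same result an aborted task would
            None => Ok(LifecycleResult::Aborted),
        }
    }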

Security Concerns 🔒

No major security issues identified.

Minor observations:

  • The jitter range (0-128ms) is small and predictable, but this is acceptable for load distribution
  • No input validation needed as all parameters are internal IDs

Test Coverage 🧪

Critical Gap:

  • No tests found for the pegboard-gateway package
  • This PR adds significant functionality (keepalive during active connections) but has no automated test coverage

Recommendations:

  1. Add integration tests for:
    • Keepalive task sending periodic updates during active WebSocket connections
    • Keepalive task sending periodic updates during hibernation
    • Proper cleanup when connection closes
    • Proper cleanup when tasks are aborted
  2. Add unit tests for:
    • Lifecycle result priority logic (especially the edge cases mentioned above)
    • Jitter behavior
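
For the priority logic specifically, a unit test only becomes practical if the match is extracted into a helper; assuming a hypothetical free function resolve_lifecycle_result that takes the same four results, a sketch could look like:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn prefers_non_aborted_result_from_any_task() {
        use LifecycleResult::*;

        // The third task (ping) carries the meaningful result; the others were aborted
        let res = resolve_lifecycle_result(Ok(Aborted), Ok(Aborted), Ok(ServerClose), Ok(Aborted));
        assert!(matches!(res, Ok(ServerClose)));
    }
}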

Architecture & Design 🏗️

Excellent:

  • Follows existing patterns in the codebase (separate task modules with abort channels)
  • The dual-use of keepalive (during connection and hibernation) makes sense architecturally
  • Clean separation between the two contexts where keepalive is needed

Adherence to CLAUDE.md Standards ✅

  • ✅ Uses workspace dependencies correctly
  • ✅ Follows structured logging patterns with tracing
  • ✅ Error handling follows the custom error system
  • ✅ Uses hard tabs (consistent with rustfmt.toml)
  • ⚠️ Comment formatting needs minor fix

Summary & Recommendations

Approve with minor changes recommended:

  1. Fix comment spacing in keepalive_task.rs:13
  2. Improve error logging when awaiting task handles (don't ignore panics)
  3. Fix lifecycle result priority logic to handle all 4 tasks correctly
  4. Consider conditional spawning of keepalive task when can_hibernate is false
  5. Add test coverage for this functionality

The core logic is sound and this is a good refactoring. The issues identified are relatively minor but should be addressed to improve robustness.

Priority order:

  1. Fix lifecycle result priority logic (Medium - could cause incorrect behavior)
  2. Add error logging for task cleanup (Low - helps debugging)
  3. Fix comment formatting (Low - code quality)
  4. Add tests (Medium - prevents regressions)
  5. Conditional task spawning (Low - micro-optimization)

@NathanFlurry NathanFlurry force-pushed the 11-19-fix_pegboard-gateway_ping_hibernating_requests_both_during_open_hws_connections_and_during_hibernation branch from 815c918 to 986289f on November 20, 2025 20:51
@NathanFlurry NathanFlurry force-pushed the 11-19-fix_rivetkit_fix_graceful_runner_shutdown branch from a11d203 to 0ed702c on November 20, 2025 20:51
@claude

claude bot commented Nov 20, 2025

Code Review

Overview

This PR refactors the keepalive logic for hibernating WebSocket requests by extracting it into a dedicated task module and running it alongside other lifecycle tasks. The change improves architecture and ensures hibernating requests are properly maintained both during active connections and during hibernation.

Positive Aspects ✅

  1. Good separation of concerns: Extracting keepalive logic into keepalive_task.rs follows the same pattern as ping_task.rs and ws_to_tunnel_task.rs, improving code organization.

  2. Consistent task lifecycle management: The new task integrates well with the existing abort channel pattern using watch::Receiver<()>.

  3. Proper jitter implementation: The rand::thread_rng().gen_range(0..128) prevents thundering herd issues when multiple requests ping simultaneously.

  4. Correct interval calculation: Using hibernating_request_eligible_threshold / 2 ensures pings happen well before the eligibility threshold expires.

Issues and Suggestions 🔍

1. Typo in Comment (Minor)

Location: engine/packages/pegboard-gateway/src/keepalive_task.rs:13

///Only ran for hibernating requests.

Issue: Missing space after ///.

Fix: Should be /// Only ran for hibernating requests.


2. Import Organization (Style)

Location: engine/packages/pegboard-gateway/src/lib.rs:10-17

The diff shows imports were reordered:

use rivet_guard_core::{
	WebSocketHandle,
	custom_serve::{CustomServeTrait, HibernationResult},
	errors::{ServiceUnavailable, WebSocketServiceUnavailable},
	proxy_service::{ResponseBody, is_ws_hibernate},
	request_context::RequestContext,
	websocket_handle::WebSocketReceiver,
};

Issue: While this appears to be an auto-formatting change, the CLAUDE.md explicitly states:

DO NOT RUN CARGO FMT AUTOMATICALLY

Recommendation: Verify if these import reorderings were intentional or accidental. If accidental, they should be reverted to avoid conflicts with the team's formatting workflow.


3. Potential Integer Overflow (Low Risk)

Location: engine/packages/pegboard-gateway/src/keepalive_task.rs:21-26

let mut ping_interval = tokio::time::interval(Duration::from_millis(
	(ctx.config()
		.pegboard()
		.hibernating_request_eligible_threshold()
		/ 2)
	.try_into()?,
));

Issue: The code divides i64 by 2 then converts to u64 with try_into()?. While hibernating_request_eligible_threshold() returns a positive value (defaults to 90,000ms), there's no guarantee at the type level.

Consideration:

  • If hibernating_request_eligible_threshold is ever negative, try_into() will fail
  • This is likely acceptable since negative thresholds don't make sense
  • The default value is 90,000ms (90 seconds), so the ping interval would be 45 seconds

Recommendation: Current implementation is acceptable, but consider adding a debug assertion or comment explaining the assumption.
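
If you go the assertion route, it can be as small as the following (sketch, reusing the config accessor shown above):

let threshold = ctx
	.config()
	.pegboard()
	.hibernating_request_eligible_threshold();
// A non-positive threshold would make the keepalive interval meaningless
debug_assert!(threshold > 0, "hibernating_request_eligible_threshold must be positive");
let mut ping_interval = tokio::time::interval(Duration::from_millis((threshold / 2).try_into()?));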


4. Task Lifecycle Abort Pattern Inconsistency (Minor)

Location: engine/packages/pegboard-gateway/src/lib.rs:601-616

In handle_websocket_hibernation, the new keepalive task uses:

let _ = keepalive_abort_tx.send(());
let _ = keepalive_handle.await;

Compare to the old implementation in the same function:

keepalive_handle.abort();

Observation: The new pattern is cleaner and more consistent with other tasks in handle_websocket. However, note that let _ = keepalive_handle.await; ignores the join result.

Recommendation: Consider whether you want to log if the keepalive task panicked:

if let Err(e) = keepalive_handle.await {
    if e.is_panic() {
        tracing::error!(error = ?e, "keepalive task panicked");
    }
}

5. Keepalive Task Only Runs for can_hibernate Requests

Location: engine/packages/pegboard-gateway/src/lib.rs:480-482

if !can_hibernate {
	return Ok(LifecycleResult::Aborted);
}

Question: Is this the intended behavior? The PR title mentions "ping hibernating requests both during open hws connections and during hibernation", but the task immediately aborts if can_hibernate is false.

Clarification Needed:

  • Should the keepalive task run for all requests during the active WebSocket connection phase?
  • Or is it correct that only hibernatable requests need keepalive pings?

Based on the code, it appears correct that only hibernatable requests need keepalive tracking in UDB, but this should be confirmed.


6. Error Handling in Infinite Loop (Design Question)

Location: engine/packages/pegboard-gateway/src/keepalive_task.rs:43-51

ctx.op(pegboard::ops::actor::hibernating_request::upsert::Input {
	actor_id,
	gateway_id,
	request_id,
})
.await?;

Question: If the UDB upsert operation fails, the task will exit with an error. This will cascade to abort all other tasks.

Consideration: Is this the desired behavior? Alternatives could include:

  • Retry logic with exponential backoff
  • Logging the error and continuing the loop
  • Metric emission for failed keepalive attempts

Recommendation: The current "fail fast" approach is reasonable, but consider if transient database errors should be more resilient.
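
For example, the "log and continue" alternative is a small change (sketch based on the snippet above):

// A transient UDB failure is logged and retried on the next tick instead of
// tearing down the connection's other lifecycle tasks.
if let Err(err) = ctx
	.op(pegboard::ops::actor::hibernating_request::upsert::Input {
		actor_id,
		gateway_id,
		request_id,
	})
	.await
{
	tracing::warn!(?err, "failed to write hibernating request keepalive, retrying next tick");
}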


Security Considerations 🔒

No security concerns identified. The changes don't introduce new attack vectors or data exposure risks.


Performance Considerations ⚡

  1. Additional task overhead: The keepalive task adds one more concurrent task during WebSocket connections, but the overhead is minimal.

  2. Jitter range: The 0-128ms jitter is reasonable but fairly small. For large-scale deployments, consider if a larger range would better distribute database load.

  3. Database writes: Each hibernating request now writes to UDB every ~45 seconds. Ensure this scales appropriately with your expected number of concurrent hibernating connections.
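
As a rough illustration, 10,000 concurrent hibernating connections each writing every ~45 seconds works out to roughly 220 UDB writes per second (illustrative numbers only).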


Testing Recommendations 🧪

Consider adding tests for:

  1. Keepalive task properly aborts when signaled
  2. Keepalive pings occur at the expected interval
  3. Task cleanup when other tasks fail
  4. Behavior when hibernating_request_eligible_threshold has edge case values

Adherence to Project Conventions ✓

  • ✅ Uses anyhow::Result correctly
  • ✅ Structured logging with tracing (lowercase messages)
  • ✅ Follows existing task pattern architecture
  • ✅ Uses workspace dependencies appropriately
  • ⚠️ Potential auto-formatting changes to imports (see Issue 2 above)

Summary

This is a solid refactoring that improves code organization and maintainability. The main concerns are:

  1. A minor typo in the comment
  2. Potential unintended import reformatting
  3. Questions about error handling strategy for database failures

The core logic appears sound and follows established patterns in the codebase. Once the minor issues are addressed, this should be good to merge.


Overall Assessment: ✅ Approve with minor suggested improvements

@claude

claude bot commented Nov 20, 2025

Code Review

I've reviewed this PR and overall the changes look good! The refactoring properly extracts the keepalive logic into a dedicated module and correctly ensures it runs both during active WebSocket connections and hibernation.

Strengths

  1. Good separation of concerns: Extracting the keepalive logic improves code organization
  2. Proper task lifecycle management: Correctly integrated with the abort signal pattern
  3. Correct hibernation check: Only runs when can_hibernate is true
  4. Consistent error handling: Uses the same LifecycleResult pattern as other tasks

Issues Found

1. Typo in documentation (keepalive_task.rs:10)

  • "pings writes keepalive" should be "writes keepalive pings"

2. Missing space in documentation (keepalive_task.rs:13)

  • Should be: /// Only ran for hibernating requests.

3. Comment reference outdated (keepalive_task.rs:30-31)

  • The comment says "we've already called this above" but there's no code above in this new module that calls the upsert operation. This was copied from the original inline implementation and should be updated.

Recommendations

  1. Fix the typos in the documentation comments
  2. Update the misleading comment to accurately reflect the code flow
  3. Consider adding error context to .try_into()? on line 26 using .context() for easier debugging

Performance & Security

  • The jitter (0-128ms) helps prevent thundering herd issues ✅
  • The ping interval at half the threshold ensures timely updates ✅
  • MissedTickBehavior::Skip is appropriate ✅
  • No security concerns identified ✅

Overall Assessment: Solid refactoring with minor documentation fixes needed. Core logic is correct and properly integrated.

Great work! 🎉
