Conversation

@MasterPtato
Contributor

No description provided.

@vercel

vercel bot commented Nov 20, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Error Error Nov 20, 2025 8:56pm
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 20, 2025 8:56pm
rivet-inspector Ignored Ignored Preview Nov 20, 2025 8:56pm
rivet-site Ignored Ignored Preview Nov 20, 2025 8:56pm

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless November 20, 2025 20:16 Inactive
Contributor Author

MasterPtato commented Nov 20, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Nov 20, 2025

Code Review: Timeout for Force Rescheduling Pending State

Summary

This PR adds the ability to time out when force-rescheduling actors in a pending state, preventing actors from waiting indefinitely for allocation. The changes introduce an AllocationOverride enum and integrate timeout logic into the actor wake and allocation flow.


✅ Strengths

  1. Well-structured design: The AllocationOverride enum provides a clean abstraction for controlling allocation behavior, making the intent explicit at call sites.

  2. Proper race condition handling: The timeout handler (lines 634-690 in runtime.rs) correctly handles the race condition where an actor may have been allocated between the timeout and clearing the queue.

  3. Code cleanup: Removal of the wake_for_alarm field simplifies state management and reduces cognitive overhead.

  4. Consistent pattern application: The timeout logic mirrors the existing Destroy signal handling, maintaining consistency in the codebase.


🔍 Issues & Concerns

1. Critical: Hard-coded timeout constant in guard package

// engine/packages/guard/src/routing/pegboard_gateway.rs:11
const ACTOR_FORCE_WAKE_PENDING_TIMEOUT: i64 = util::duration::seconds(60);

Issue: This 60-second timeout is defined in the guard package rather than being configurable or centralized.

Recommendations:

  • Move this to a configuration file or make it part of the Pegboard config
  • Consider if 60 seconds is appropriate for all use cases (production vs development)
  • Add documentation explaining why this specific duration was chosen
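
One way the first recommendation could look is sketched below. This is a minimal illustration only: `PegboardConfig` and its field are hypothetical names rather than the project's actual config types, and the value is assumed to be in milliseconds.

```rust
/// Default pending timeout: 60 seconds, matching the constant in the PR.
/// Assumption: the value is expressed in milliseconds.
const DEFAULT_ACTOR_FORCE_WAKE_PENDING_TIMEOUT_MS: i64 = 60 * 1000;

/// Hypothetical config section; the real Pegboard config may be shaped differently.
#[derive(Debug, Default, Clone)]
pub struct PegboardConfig {
    /// Optional override in milliseconds. `None` means "use the default".
    pub actor_force_wake_pending_timeout_ms: Option<i64>,
}

impl PegboardConfig {
    /// Returns the configured timeout, falling back to the 60-second default.
    pub fn actor_force_wake_pending_timeout_ms(&self) -> i64 {
        self.actor_force_wake_pending_timeout_ms
            .unwrap_or(DEFAULT_ACTOR_FORCE_WAKE_PENDING_TIMEOUT_MS)
    }
}
```

With this shape, production keeps the 60-second default while a development or test environment can set a shorter value without a code change.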

2. Potential issue: Change in handle_stopped match logic

// Before:
StoppedVariant::Normal { code: protocol::StopCode::Ok } => {
    state.reschedule_state = Default::default();
    false
}

// After:
StoppedVariant::Normal { code } => {
    if let protocol::StopCode::Ok = code {
        state.reschedule_state = Default::default();
    }
    false
}

Issue: The match arm changed from matching only StopCode::Ok to matching every stop code. This preserves the existing behavior for Ok, but Error codes are now handled by this arm as well.

Previous behavior: StopCode::Error would fall through to force_reschedule = false
New behavior: Same as above, BUT reschedule_state is NOT reset on Error

Question: Is this intentional? Should error codes also reset the reschedule state, or is preserving retry count on errors the desired behavior? This deserves clarification or a comment explaining the intent.
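
To make the question concrete, here is a self-contained model of the "After" arm. The types are simplified stand-ins: `retry_count` and the `Other` variant are hypothetical and only exist so the sketch compiles on its own.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum StopCode {
    Ok,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum StoppedVariant {
    Normal { code: StopCode },
    /// Hypothetical stand-in for the remaining variants in the real enum.
    Other,
}

/// Hypothetical reschedule bookkeeping; the real struct has different fields.
#[derive(Debug, Default, PartialEq)]
struct RescheduleState {
    retry_count: u32,
}

#[derive(Debug, Default)]
struct State {
    reschedule_state: RescheduleState,
}

/// Mirrors the "After" snippet and returns `force_reschedule`.
fn handle_stopped_after(state: &mut State, variant: StoppedVariant) -> bool {
    match variant {
        StoppedVariant::Normal { code } => {
            // Only a clean stop clears the reschedule bookkeeping.
            // A non-Ok code leaves it untouched.
            if let StopCode::Ok = code {
                state.reschedule_state = RescheduleState::default();
            }
            false
        }
        // Placeholder arm, not the real handling of other variants.
        StoppedVariant::Other => true,
    }
}
```

In this model an Error stop leaves `retry_count` as it was and an Ok stop resets it to zero. A one-line comment in the real arm stating that this is deliberate would settle the question.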

3. Missing test coverage

No tests were added for the new timeout functionality. Consider adding tests for:

  • Actor pending allocation timeout (successful path)
  • Race condition where actor is allocated just as timeout fires
  • Multiple wake attempts with different AllocationOverride values
  • Timeout behavior with DontSleep policy

4. Documentation gaps

The new AllocationOverride enum and Wake signal field lack documentation:

#[derive(Debug, Serialize, Deserialize, Default)]
#[serde(rename_all = "snake_case")]
pub enum AllocationOverride {
    #[default]
    None,
    /// Forces actors with CrashPolicy::Sleep to pend instead of sleep.
    DontSleep { pending_timeout: Option<i64> },
}

Recommendations:

  • Add doc comments explaining when to use each variant
  • Document the units for pending_timeout (milliseconds, based on timestamp usage)
  • Explain what happens when pending_timeout is None vs Some(...)
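
A possible shape for those doc comments is sketched below. The serde derives from the original are left out so the snippet compiles with the standard library alone, the millisecond unit is an assumption inferred from the review, and the `pending_timeout()` accessor is a hypothetical helper that is not part of the PR.

```rust
/// Controls how a wake request interacts with allocation.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub enum AllocationOverride {
    /// No override: an actor that cannot be allocated is allowed to sleep.
    #[default]
    None,
    /// Forces actors with CrashPolicy::Sleep to pend instead of sleep.
    ///
    /// `pending_timeout` is a duration in milliseconds (assumed unit).
    /// - `None`: keep pending until the actor is allocated or destroyed.
    /// - `Some(ms)`: stop pending after `ms` and fall back to sleep.
    DontSleep { pending_timeout: Option<i64> },
}

impl AllocationOverride {
    /// Hypothetical accessor: the pending timeout, if this override carries one.
    pub fn pending_timeout(&self) -> Option<i64> {
        match self {
            AllocationOverride::DontSleep { pending_timeout } => *pending_timeout,
            AllocationOverride::None => None,
        }
    }
}
```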

5. Code duplication in timeout handler

Lines 634-690 in runtime.rs duplicate significant code from the Allocate signal handler (lines 571-605). The duplication includes:

  • UpdateRunnerInput activity
  • CommandStartActor signal construction
  • Error handling for the race condition

Recommendation: Consider extracting a helper function to reduce duplication and improve maintainability.
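
The extraction could look roughly like the following model. Everything here is a stand-in: `MockCtx` records side effects in place of the workflow context, and `start_allocated_actor`, `Outcome`, and the function signatures are hypothetical rather than taken from runtime.rs.

```rust
/// Records side effects so the sketch can run without a workflow engine.
#[derive(Debug, Default)]
struct MockCtx {
    calls: Vec<String>,
}

/// Simplified stand-in for the Allocate signal.
#[derive(Debug, Clone, Copy)]
struct Allocate {
    runner_id: u64,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Allocated,
    Sleep,
}

/// Hypothetical shared helper: the steps both branches currently repeat.
fn start_allocated_actor(ctx: &mut MockCtx, actor_id: u64, sig: Allocate) {
    ctx.calls.push(format!("update_runner actor={} runner={}", actor_id, sig.runner_id));
    ctx.calls.push(format!("command_start_actor actor={}", actor_id));
}

/// Normal path: the Allocate signal arrived while pending.
fn on_allocate(ctx: &mut MockCtx, actor_id: u64, sig: Allocate) -> Outcome {
    start_allocated_actor(ctx, actor_id, sig);
    Outcome::Allocated
}

/// Timeout path: `late` is `Some` when the actor was allocated in the race window.
fn on_timeout(ctx: &mut MockCtx, actor_id: u64, late: Option<Allocate>) -> Outcome {
    match late {
        Some(sig) => {
            start_allocated_actor(ctx, actor_id, sig);
            Outcome::Allocated
        }
        None => Outcome::Sleep,
    }
}
```

Both paths then go through one function, so a future change to the start sequence cannot be applied to one branch and missed in the other.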


🔒 Security Considerations

No security concerns identified. The timeout mechanism properly prevents resource exhaustion from indefinitely pending actors.


⚡ Performance Considerations

The timeout adds minimal overhead:

  • listen_with_timeout is an efficient operation
  • The 60-second default seems reasonable for preventing indefinite waits
  • No blocking operations or busy-waiting introduced

Minor concern: The hard-coded 60-second timeout may be too long for latency-sensitive applications. Consider making this configurable per namespace or actor.


📋 Best Practices (per CLAUDE.md)

Follows conventions:

  • Proper use of structured logging with tracing
  • Lowercase log messages: "actor sleeping, waking"
  • Timestamp field naming: N/A (no new timestamp fields)
  • Proper use of anyhow::Result

Code style:

  • Consistent with existing patterns
  • Proper use of #[serde(default)] for backward compatibility

⚠️ Could improve:

  • Add ?actor_id to more log statements for consistency (e.g., line 635)
  • Consider adding a log entry when timeout is set vs not set

🎯 Recommendations Summary

High Priority:

  1. Clarify the intended behavior change in handle_stopped for StopCode::Error
  2. Make the 60-second timeout configurable or document the rationale
  3. Add documentation to AllocationOverride explaining units and behavior

Medium Priority:

  4. Add test coverage for timeout scenarios
  5. Consider extracting duplicated actor allocation code into a helper function

Low Priority:

  6. Add more structured logging fields for better observability


Overall, this is a solid improvement that addresses a real issue with actors waiting indefinitely. The main concerns are around configurability, documentation, and the subtle behavior change in error handling. With the recommended changes, this would be ready to merge.

Approval status: ⚠️ Approve with comments (please address the handle_stopped behavior clarification)

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state branch from 8b84eb1 to 4f58c50 Compare November 20, 2025 20:55
@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd Compare November 20, 2025 20:55
@claude

claude bot commented Nov 20, 2025

Code Review

Thanks for this PR! I've reviewed the changes and have the following feedback:

Overview

This PR adds the ability to time out force rescheduling when actors are in the pending state. The implementation introduces an AllocationOverride enum and an optional timeout for pending allocations.


✅ Strengths

  1. Clean abstraction: The AllocationOverride enum is well-designed and makes the intent clear at call sites
  2. Good defaults: Using #[serde(default)] ensures backward compatibility for existing workflows
  3. Proper timeout handling: The timeout logic handles all three cases (Allocate signal, Destroy signal, timeout) correctly
  4. Race condition handling: Lines 648-686 in runtime.rs properly handle the race where an actor was allocated after the timeout but before clearing from the queue

🔍 Issues and Suggestions

1. Typo in PR Title (Minor)

The PR title has a typo: "resceduling" should be "rescheduling"

2. Code Duplication in Timeout Handler (Moderate)

In runtime.rs:634-690, there is significant code duplication between the None (timeout) case and the Some(PendingAllocation::Allocate) case. Consider extracting the actor start logic into a helper function to reduce duplication.

3. Missing Error Context (Minor)

When timeout occurs (runtime.rs:635), the debug log says "timed out before actor allocated" but doesn't include how long the timeout was. Consider adding the timeout value to help with debugging.

4. Magic Number Should Be Configurable (Moderate)

In guard/src/routing/pegboard_gateway.rs:11, the 60-second timeout is hardcoded. Should this be configurable via the pegboard config? For different workloads, 60 seconds might be too short or too long.

5. Incomplete Pattern Match Simplification (Minor)

In mod.rs:568-578, the pattern match was simplified. The StoppedVariant::Normal now matches all stop codes, not just Ok. While this is likely the intended behavior, it's a behavioral change that should be explicitly called out in the PR description.

6. Removed Field Not Documented (Minor)

The wake_for_alarm field was removed from LifecycleState and all related logic. The PR description doesn't explain why this field became unnecessary. A brief explanation would help reviewers understand the refactor.


🔒 Security Considerations

✅ No security issues identified. The timeout mechanism properly prevents indefinite waits.


⚡ Performance Considerations

✅ The changes are performance-neutral. Adding a timeout doesn't add overhead when not used.


🧪 Test Coverage

⚠️ No tests found for the new timeout functionality. Consider adding:

  1. Integration test for timeout expiring before allocation
  2. Test for race condition where allocation happens just before timeout
  3. Test verifying the 60-second timeout in guard works correctly

📝 Documentation

The new AllocationOverride enum is documented with a clear comment. The pending_timeout field could use a doc comment explaining the units (milliseconds).


Recommendations

Must Fix:

  • None - the code is functional

Should Fix:

  • Add test coverage for timeout behavior
  • Consider making ACTOR_FORCE_WAKE_PENDING_TIMEOUT configurable
  • Document the behavioral change in StoppedVariant::Normal handling

Nice to Have:

  • Extract duplicated actor start logic into helper function
  • Add timeout value to debug log
  • Add doc comments to pending_timeout field

Overall, this is a solid implementation that properly handles a tricky edge case. The main gaps are in testing and documentation rather than the code itself.

@claude

claude bot commented Nov 20, 2025

PR Review: Add ability to timeout force rescheduling pending state

Thanks for this PR! This is a solid improvement that addresses an important edge case in the actor lifecycle management.

Summary

This PR adds timeout handling for actors stuck in the pending allocation state, preventing indefinite waiting when waking sleeping actors. The implementation introduces a new AllocationOverride enum to control allocation behavior and adds a 60-second timeout for force-wake scenarios.

Strengths

  1. Well-structured refactoring: The introduction of AllocationOverride enum cleanly replaces the boolean force_allocate parameter, making the code more maintainable and extensible.

  2. Proper race condition handling: The timeout implementation correctly handles the race condition where an actor might be allocated while attempting to clear it from the queue (lines 646-686 in runtime.rs).

  3. Good use of structured logging: Follows the project's logging conventions with ?actor_id and lowercase messages.

  4. Proper serde defaults: Using #[serde(default)] on the Wake signal ensures backward compatibility.

@claude

claude bot commented Nov 20, 2025

Code Quality Observations

1. Removed wake_for_alarm field - The removal of this field simplifies state management. However, verify that alarm-based wake behavior still works correctly. The fake signal now always uses AllocationOverride::DontSleep { pending_timeout: None } instead of distinguishing alarm-based wakes.

2. Simplified StoppedVariant handling - The refactor at lines 568-577 in mod.rs is cleaner and more explicit than the previous wildcard pattern.

Potential Issues

1. Missing documentation on timeout value - The ACTOR_FORCE_WAKE_PENDING_TIMEOUT constant (60 seconds) lacks justification. Add a comment explaining why 60 seconds was chosen and under what conditions this timeout triggers.

2. Code duplication in timeout handler - Lines 658-681 in runtime.rs duplicate the actor start command logic that appears earlier at lines 520-543. Consider extracting this into a helper function to reduce duplication.

@claude

claude bot commented Nov 20, 2025

3. Inconsistent pending_timeout handling - The timeout is hardcoded in guard but AllocationOverride allows None. In guard/pegboard_gateway.rs it always sets Some(ACTOR_FORCE_WAKE_PENDING_TIMEOUT), while in mod.rs it uses None for alarm-based wakes. Question: Should alarm-based wakes also have a timeout? If actors can get stuck in pending state during regular wakes, they could also get stuck during alarm wakes.

4. Behavior change in alarm wakes - The fake signal for alarms (line 279 in mod.rs) now uses AllocationOverride::DontSleep { pending_timeout: None } instead of tracking wake_for_alarm. This means alarm-based wakes will now force actors to pend instead of potentially sleeping. Verify: Is this the intended behavior? Should alarms also allow actors to sleep if no capacity exists?

@claude

claude bot commented Nov 20, 2025

Testing Concerns

Critical: No test files found in the pegboard package. Consider adding integration tests for:

  1. Timeout scenarios:

    • Actor pending for >60s during force wake
    • Race condition where allocation happens just before timeout
    • Multiple concurrent wake requests
  2. Allocation override behavior:

    • AllocationOverride::None allows sleeping
    • AllocationOverride::DontSleep with Some(timeout) times out correctly
    • AllocationOverride::DontSleep with None waits indefinitely
  3. State transitions:

    • Sleeping -> Pending -> Allocated (with timeout)
    • Sleeping -> Pending -> Timeout -> Sleep
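
The scenarios above can be pinned down as a pure decision table before any integration test exists. The model below is a sketch under stated assumptions: `resolve_wake`, `Outcome`, and the `capacity_after_ms` parameter are hypothetical, times are in milliseconds, and the race window between timeout and allocation is ignored.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AllocationOverride {
    None,
    DontSleep { pending_timeout: Option<i64> },
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Allocated,
    Sleep,
    StillPending,
}

/// `capacity_after_ms` is when a runner becomes available:
/// `Some(0)` means immediately, `None` means never.
pub fn resolve_wake(
    allocation_override: AllocationOverride,
    capacity_after_ms: Option<i64>,
) -> Outcome {
    match (allocation_override, capacity_after_ms) {
        // Capacity is available right away: no pending phase at all.
        (_, Some(0)) => Outcome::Allocated,
        // Without an override the actor is allowed to sleep instead of pending.
        (AllocationOverride::None, _) => Outcome::Sleep,
        // No timeout: pend until capacity shows up, however long that takes.
        (AllocationOverride::DontSleep { pending_timeout: None }, Some(_)) => Outcome::Allocated,
        (AllocationOverride::DontSleep { pending_timeout: None }, None) => Outcome::StillPending,
        // With a timeout: allocated only if capacity arrives before it expires.
        (AllocationOverride::DontSleep { pending_timeout: Some(t) }, Some(ms)) if ms < t => {
            Outcome::Allocated
        }
        (AllocationOverride::DontSleep { pending_timeout: Some(_) }, _) => Outcome::Sleep,
    }
}
```

Each row of the table corresponds to one of the listed test cases, which makes it a convenient checklist when writing the real integration tests.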

@claude

claude bot commented Nov 20, 2025

Security and Performance

Security: No concerns identified. The timeout prevents resource exhaustion by not allowing actors to wait indefinitely.

Performance:

  • Positive: The timeout prevents indefinite waiting, improving responsiveness
  • Watch: Multiple wake retries (line 198 in pegboard_gateway.rs) could amplify load if many actors are stuck pending

Style

All code follows the project conventions from CLAUDE.md (hard tabs, structured logging with ? prefix, lowercase log messages, proper imports).

Recommendations

  1. High priority: Add documentation for ACTOR_FORCE_WAKE_PENDING_TIMEOUT explaining the timeout rationale
  2. Medium priority: Refactor duplicated actor start command logic
  3. Medium priority: Consider adding timeout to alarm-based wakes for consistency
  4. Low priority: Add integration tests for timeout scenarios
  5. Clarification needed: Confirm the behavior change for alarm wakes is intentional

Overall Assessment

This is a well-implemented fix for an important edge case. The core logic is sound and properly handles race conditions. Main concerns are code duplication, missing documentation on timeout values, and potential behavior change in alarm wake handling. Once these clarifications are addressed, this will be a solid improvement to actor lifecycle reliability.

Approved with minor suggestions for improvement

@claude

claude bot commented Nov 20, 2025

PR Review: Fix Pegboard Actor Force Rescheduling Pending Timeout

Summary

This PR adds the ability to timeout while waiting for actor allocation during force wake/rescheduling operations. The main change introduces an AllocationOverride enum and timeout handling for pending actor allocations.


Code Quality & Best Practices

Positive Changes:

  1. Good refactoring: Replacing the boolean force_allocate parameter with the AllocationOverride enum is more expressive and maintainable
  2. Proper error handling: The timeout case (line 634 in runtime.rs) correctly handles both scenarios - when the actor was allocated during the race and when it wasn't
  3. Consistent pattern: The timeout handling mirrors the existing Destroy signal handling pattern
  4. Cleanup: Removal of wake_for_alarm field simplifies state management

Areas for Improvement:

  1. Typo in PR title: "resceduling" should be "rescheduling"

  2. Missing documentation: The AllocationOverride::DontSleep variant has a comment, but it would be helpful to document:

    • When pending_timeout: None vs Some(timeout) should be used
    • What happens when timeout is reached vs allocation succeeds

    Consider adding:

    /// Forces actors with CrashPolicy::Sleep to pend instead of sleep.
    /// - `pending_timeout: None` - Wait indefinitely for allocation
    /// - `pending_timeout: Some(duration)` - Timeout after duration, falling back to sleep
    DontSleep { pending_timeout: Option<i64> },
  3. Magic number: ACTOR_FORCE_WAKE_PENDING_TIMEOUT is hardcoded to 60 seconds in pegboard_gateway.rs:11. Consider:

    • Moving this to config or making it configurable
    • At minimum, adding a comment that documents why 60 seconds was chosen
  4. Inconsistent timeout handling: In runtime.rs lines 558-566, the timeout is only set when pending_timeout: Some(timeout). However, there's no logging when the timeout path is taken vs normal allocation. Consider adding structured logging:

    tracing::debug!(?actor_id, timeout_ms=?timeout, "waiting for actor allocation with timeout");

Potential Bugs & Issues

  1. Race condition handling looks correct (lines 647-689): The code properly handles the race where an actor might be allocated while we're trying to clear it from the queue. Good defensive programming.

  2. Question about behavior: When timeout occurs and cleared == true (line 687-689), the function returns SpawnActorOutput::Sleep. Is this the intended behavior? This means:

    • User sends request → actor wakes → can't allocate within 60s → goes back to sleep
    • User's request might fail/timeout on the guard side

    Should there be metrics or logging to track how often this timeout path is hit? It could indicate capacity issues.

  3. Potential consistency issue (mod.rs:279): The fake alarm signal now includes:

    Main::Wake(Wake { allocation_override: AllocationOverride::DontSleep { pending_timeout: None } })

    But there's no comment explaining why the alarm wake uses DontSleep with no timeout. Is this intentional? Should alarm wakes behave differently?


Performance Considerations

  1. Timeout adds latency awareness: The 60-second timeout is good for preventing indefinite hangs, but:

    • Consider if 60s is appropriate for all use cases (might be too long for interactive requests)
    • Guard-side timeout (ACTOR_READY_TIMEOUT = 10s) may fire before this 60s timeout
  2. Additional signal handling: The None case in the match adds a new code path. Under heavy load, how often will this timeout fire vs normal allocation? Consider adding metrics.
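
The interaction between the two timeouts in item 1 comes down to simple arithmetic, sketched below. The model assumes both timers start at the same moment and ignores guard-side wake retries; the function names are hypothetical and the values are in milliseconds.

```rust
/// Guard-side readiness timeout cited in the review (10 s).
pub const ACTOR_READY_TIMEOUT_MS: i64 = 10_000;
/// Pending-allocation timeout introduced by the PR (60 s).
pub const ACTOR_FORCE_WAKE_PENDING_TIMEOUT_MS: i64 = 60_000;

#[derive(Debug, PartialEq)]
pub enum FirstToFire {
    GuardReady,
    PendingAllocation,
}

/// Which timeout expires first if both start together.
pub fn first_to_fire(ready_ms: i64, pending_ms: i64) -> FirstToFire {
    if ready_ms <= pending_ms {
        FirstToFire::GuardReady
    } else {
        FirstToFire::PendingAllocation
    }
}

/// How long the actor may keep pending after the guard has already given up.
pub fn pending_after_guard_gave_up(ready_ms: i64, pending_ms: i64) -> i64 {
    (pending_ms - ready_ms).max(0)
}
```

Under these assumptions the guard gives up after 10 seconds while the actor can stay pending for up to 50 more, which is worth confirming as intended.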


Security Concerns

No security issues identified. The changes are internal state management and don't affect authentication, authorization, or data validation.


Test Coverage

Missing tests:

  • No test coverage for the new timeout behavior
  • Consider adding integration tests for:
    • Actor allocation timeout with DontSleep { pending_timeout: Some(_) }
    • Race condition between timeout and allocation
    • Behavior when timeout occurs and actor falls back to sleep

Suggestion: Add a test case in the pegboard test suite that simulates no available runners and verifies the timeout behavior.


Additional Observations

  1. State field cleanup: Good removal of wake_for_alarm field (lines 37-40 in old runtime.rs). Less state = less complexity.

  2. Code duplication (runtime.rs:658-681): The CommandStartActor signal is duplicated in the timeout handling. Consider extracting to a helper function:

    async fn send_start_actor_command(
        ctx: &mut WorkflowCtx,
        input: &Input,
        generation: u32,
        runner_workflow_id: Id,
    ) -> Result<()>
  3. Logging consistency: Consider using consistent actor_id logging format throughout. Some places use ?actor_id, others use actor_id=?input.actor_id.


Recommendations

High Priority:

  1. Add metrics to track timeout frequency
  2. Document the timeout behavior and rationale
  3. Add test coverage for timeout scenarios

Medium Priority:

  1. Consider making timeout configurable or at least document why 60s
  2. Extract duplicated CommandStartActor logic
  3. Add debug logging when timeout path is taken

Low Priority:

  1. Fix typo in PR title
  2. Improve inline documentation for AllocationOverride

Verdict

This is a solid improvement that adds necessary timeout handling to prevent indefinite waiting. The implementation correctly handles race conditions and maintains state consistency. The main concerns are around observability (metrics/logging) and test coverage.

Approval Status: Approve with suggestions for follow-up improvements (particularly testing and metrics).

@claude

claude bot commented Nov 20, 2025

Pull Request Review: Add Timeout for Force Rescheduling Pending State

Summary

This PR adds the ability to timeout force rescheduling when actors are in a pending allocation state. The implementation introduces an AllocationOverride enum to control allocation behavior and adds timeout functionality for pending allocations.

Code Quality ✅

Strengths:

  • Clean refactoring with well-defined enum pattern (AllocationOverride)
  • Consistent use throughout the codebase
  • Good separation of concerns between allocation logic and timeout handling
  • Proper use of #[serde(default)] for backwards compatibility

Improvements Needed:

  1. Typo in PR Title (minor): "resceduling" should be "rescheduling"

  2. Inconsistent Pattern Matching (runtime.rs:568-578):
    The code checks for DontSleep with timeout in a specific pattern, but there's potential for cleaner handling:

    // Current approach works but could be more explicit
    let signal = if let AllocationOverride::DontSleep {
        pending_timeout: Some(timeout),
    } = allocation_override
    {
        ctx.listen_with_timeout::<PendingAllocation>(timeout).await?
    } else {
        Some(ctx.listen::<PendingAllocation>().await?)
    };

    Consider extracting the timeout value to a variable earlier for clarity.

  3. Code Duplication (runtime.rs:634-690):
    The timeout handling branch (None => { ... }) duplicates significant logic from the Some(PendingAllocation::Allocate(sig)) branch (sending CommandStartActor, etc.). This could be refactored to reduce duplication and improve maintainability.
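
For item 2, the "extract the timeout first" idea can be shown with a standard-library channel standing in for the workflow signal listener. This is only an analogy: `wait_for_signal` is a hypothetical helper, `std::sync::mpsc` replaces `ctx.listen` / `ctx.listen_with_timeout`, and the signal payloads are dropped for brevity.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Simplified stand-in for the PendingAllocation signal.
#[derive(Debug, PartialEq)]
pub enum PendingAllocation {
    Allocate,
    Destroy,
}

/// Waits for a signal. `Ok(None)` means the timeout expired first.
/// With `pending_timeout_ms == None` the call blocks until a signal arrives.
pub fn wait_for_signal(
    rx: &Receiver<PendingAllocation>,
    pending_timeout_ms: Option<u64>,
) -> Result<Option<PendingAllocation>, String> {
    match pending_timeout_ms {
        Some(ms) => match rx.recv_timeout(Duration::from_millis(ms)) {
            Ok(sig) => Ok(Some(sig)),
            Err(RecvTimeoutError::Timeout) => Ok(None),
            Err(RecvTimeoutError::Disconnected) => Err("sender dropped".to_string()),
        },
        // No timeout configured: wait indefinitely, as the non-timeout branch does.
        None => rx.recv().map(Some).map_err(|e| e.to_string()),
    }
}
```

Matching on the extracted `Option` keeps the timeout decision in one place, and the `None` return value is the single entry point into the timeout branch.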

Potential Bugs 🔍

  1. Race Condition Handling (runtime.rs:648-656):
    When timeout occurs and cleared is false (actor was allocated), the code waits for the Allocate signal and sends the start command. However, this doesn't update the state's runner info like the normal allocation path does. Should call:

    ctx.activity(UpdateRunnerInput { ... }).await?;

    Correction after a closer read: the code does update the runner info on this path, so no change is needed here.

  2. Missing State Management (mod.rs:278-280):
    When creating a fake Wake signal for alarm timeouts, it always uses pending_timeout: None. Should this respect any previous timeout configuration?

  3. Removed State Field (runtime.rs:36-38):
    The wake_for_alarm field was removed from LifecycleState. Need to verify this doesn't break any existing workflows that may have this field persisted. The #[serde(default)] would help, but consider migration strategy.

Performance Considerations ⚡

  1. Constant Value (guard/src/routing/pegboard_gateway.rs:11):

    const ACTOR_FORCE_WAKE_PENDING_TIMEOUT: i64 = util::duration::seconds(60);

    Good use of const. Consider if this should be configurable via ctx.config() for different environments/testing.

  2. Timeout Duration: 60 seconds seems reasonable for production, but might be too long for tests. Consider making this configurable.

Security Concerns 🔒

No significant security concerns identified. The timeout mechanism properly handles:

  • Race conditions with cleared flag
  • Generation mismatches
  • Proper cleanup on timeout

Test Coverage 🧪

Missing Test Scenarios:

  1. Test case for timeout triggering during pending allocation
  2. Test case for race condition where allocation happens just before timeout
  3. Test case for AllocationOverride::DontSleep with and without timeout
  4. Test case ensuring wake_for_alarm removal doesn't break existing state deserialization

Recommendation: Add integration tests covering the timeout scenarios, especially:

  • Actor pending allocation that times out and falls back to sleep
  • Actor pending allocation that times out but gets allocated in the race window
  • Multiple wake attempts with different AllocationOverride values

Architecture & Design 📐

Positive:

  • The AllocationOverride enum is a good design pattern that's extensible
  • Proper handling of timeout vs. signal with Option return type
  • Good use of Rust's type system to encode behavior

Suggestions:

  1. Consider adding documentation comments for AllocationOverride::DontSleep.pending_timeout explaining:

    • What the timeout represents (milliseconds)
    • What happens when timeout is None vs Some
    • When to use each variant
  2. The constant in guard should reference the same config system used elsewhere for consistency

Style & Conventions ✨

Follows Repository Standards:

  • ✅ Proper use of structured logging with tracing
  • ✅ Lowercase log messages (e.g., "timed out before actor allocated")
  • ✅ Consistent error handling with Result<T>
  • ✅ No glob imports from anyhow

Minor Issues:

  1. Use the ? prefix on actor_id consistently in new log statements. The statement at runtime.rs:608 already does; confirm that runtime.rs:635 does as well.

Recommendations

High Priority:

  1. Add test coverage for timeout scenarios
  2. Review if ACTOR_FORCE_WAKE_PENDING_TIMEOUT should be configurable
  3. Verify backwards compatibility with existing workflows that have serialized state with wake_for_alarm

Medium Priority:

  1. Refactor duplicated code in timeout handling branch
  2. Add documentation comments to AllocationOverride
  3. Fix typo in PR title

Low Priority:

  1. Consider extracting timeout pattern matching for readability

Verdict

Overall Assessment: APPROVE with suggestions

This is a solid implementation that addresses a real need (timing out force reschedules). The code quality is good, follows repository conventions, and handles edge cases properly. The main concerns are around test coverage and minor refactoring opportunities for maintainability.

The removal of wake_for_alarm is a good simplification, and the new AllocationOverride pattern is clean and extensible.


Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd Compare November 20, 2025 22:32
@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state branch from 8b84eb1 to 4f58c50 Compare November 20, 2025 22:32
@claude

claude bot commented Nov 20, 2025

Code Review

Summary

This PR adds the ability to time out the force rescheduling of actors that are stuck in the pending allocation state. The changes introduce an AllocationOverride enum to control allocation behavior and add timeout support when waiting for pending allocations.

Positive Changes

  1. Good enum design: The AllocationOverride enum is well-structured with clear semantics
  2. Proper timeout handling: The new timeout logic in spawn_actor (runtime.rs:634-690) correctly handles the race condition where an actor might be allocated while clearing from the queue
  3. Cleanup of unused state: Removing the wake_for_alarm field simplifies the state machine
  4. Consistent API: The changes propagate AllocationOverride consistently through all the relevant call sites

Issues and Concerns

1. Missing Test Coverage (Critical)

No tests found in the pegboard package. This PR adds complex timeout logic that should be tested:

  • Actor pending timeout triggers correctly
  • Race condition handling when actor is allocated during timeout
  • Wake signal with/without timeout behaves correctly
  • Backward compatibility of Wake signal deserialization

2. Timeout Value Not Configurable

Location: engine/packages/guard/src/routing/pegboard_gateway.rs:11

The 60-second timeout is hardcoded. Consider whether this should be configurable per-deployment or per-namespace, and add a comment explaining why 60 seconds was chosen.

3. Duplicated Code in Guard

Locations: engine/packages/guard/src/routing/pegboard_gateway.rs:172-180 and 201-209

The same Wake signal construction appears twice. Consider extracting to a helper function.
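
A minimal sketch of such a helper follows. The types are simplified stand-ins for the real `Wake` signal and `AllocationOverride` enum, `force_wake_signal` is a hypothetical name, and the constant is written out in milliseconds as an assumption about what `util::duration::seconds(60)` returns.

```rust
/// 60 seconds, assumed to be in milliseconds.
const ACTOR_FORCE_WAKE_PENDING_TIMEOUT: i64 = 60 * 1000;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AllocationOverride {
    None,
    DontSleep { pending_timeout: Option<i64> },
}

/// Simplified stand-in for the Wake signal.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Wake {
    pub allocation_override: AllocationOverride,
}

/// Hypothetical helper: builds the force-wake signal once so both call sites
/// in the gateway stay in sync.
pub fn force_wake_signal() -> Wake {
    Wake {
        allocation_override: AllocationOverride::DontSleep {
            pending_timeout: Some(ACTOR_FORCE_WAKE_PENDING_TIMEOUT),
        },
    }
}
```

Both the initial wake and the retry would then call `force_wake_signal()`, so a later change to the timeout or override only needs to happen in one place.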

4. Potential Logic Issue in Alarm Wake

Location: engine/packages/pegboard/src/workflows/actor/mod.rs:280

When an alarm triggers to wake an actor, it uses AllocationOverride::DontSleep with pending_timeout: None. This means alarm-based wakes will force allocation without a timeout. If an alarm wakes a sleeping actor but no capacity is available, it will wait indefinitely in the pending state. Consider whether alarm wakes should also have a timeout.

5. Documentation Needed

The AllocationOverride enum should be documented:

  • When should DontSleep be used vs None?
  • What happens when pending_timeout expires?
  • What's the expected behavior for different crash policies?

6. Logging Consistency

Location: engine/packages/pegboard/src/workflows/actor/runtime.rs:635

Consider logging the timeout duration for debugging in the timeout message.

Performance Considerations

  • No performance regressions expected
  • The timeout prevents actors from waiting indefinitely in pending state, improving system responsiveness

Security Considerations

No security concerns identified.

Recommendations

Before merging:

  1. Add tests for the timeout logic
  2. Consider making the timeout configurable
  3. Add documentation for AllocationOverride
  4. Verify backward compatibility of Wake signal deserialization
  5. Decide if alarm-based wakes should also have timeouts

Nice to have:

  • Extract duplicated wake signal logic in guard
  • Add timeout value to log messages for observability

Overall Assessment

The PR is well-structured and addresses a real issue with actors getting stuck in pending state. The core logic is sound, but it would benefit from additional testing and documentation. The changes are backward-compatible and follow existing patterns.

Status: Recommend requesting changes for test coverage before merging.

@claude

claude bot commented Nov 20, 2025

PR Review: Add ability to timeout force rescheduling pending state

Summary

This PR introduces a timeout mechanism for actors in the pending allocation state when force-waking, allowing them to fall back to sleep mode if allocation doesn't happen within the specified timeout (60 seconds). This is a valuable improvement for handling resource contention scenarios.

Code Quality & Best Practices ✅

Strengths:

  • Clean refactoring with the new AllocationOverride enum providing clear intent
  • Proper removal of the wake_for_alarm flag simplifies state management
  • Good use of Rust's pattern matching and type safety
  • Consistent with existing codebase patterns (workspace dependencies, error handling)
  • Follows CLAUDE.md conventions (lowercase log messages, structured logging)

Improvements:

  1. Line 280 in mod.rs: The fake Wake signal uses pending_timeout: None for alarm-based wakes. Consider adding a comment explaining why alarms don't need a timeout:

    // Alarm wakes don't need pending timeout since they're for scheduled wake-ups, not force allocation
    Main::Wake(Wake { allocation_override: AllocationOverride::DontSleep { pending_timeout: None } })

  2. Lines 634-690 in runtime.rs: The timeout handling logic has significant duplication with the Destroy signal handler (lines 607-632). Consider extracting the common "wait for late allocation" pattern into a helper function:

    async fn handle_late_allocation(ctx: &mut WorkflowCtx, input: &Input, generation: u32, ...) -> Result<Option<SpawnActorOutput>>

Potential Issues 🔍

  1. Race condition handling (runtime.rs:648-689): The code correctly handles the race where an actor gets allocated after the timeout but before clearing the pending queue. However, there's a subtle issue: if cleared is false, you wait for Allocate signal and start the actor, but you don't check if a Destroy signal was sent in the meantime. The timeout branch should use listen::<PendingAllocation>() like line 622 instead of listen::<Allocate>() on line 649.

    Current code:

    if !cleared {
        let sig = ctx.listen::<Allocate>().await?;  // ❌ Ignores Destroy signal

    Suggested fix:

    if !cleared {
        match ctx.listen::<PendingAllocation>().await? {
            PendingAllocation::Allocate(sig) => {
                // ... existing allocation handling ...
            }
            PendingAllocation::Destroy(_) => {
                return Ok(SpawnActorOutput::Destroy)
            }
        }
    }

  2. Timeout value hardcoded (pegboard_gateway.rs:11): The ACTOR_FORCE_WAKE_PENDING_TIMEOUT is hardcoded at 60 seconds. Consider making this configurable through the pegboard config for different deployment scenarios. Some environments might need shorter/longer timeouts.

  3. Simplified stop code handling (mod.rs:568-578): The change from matching only StopCode::Ok to matching all normal stops is correct, but the comment "Reset retry count on successful exit" at line 570 is now misleading: the arm matches every normal stop, while the retry count is reset only for Ok exits.
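The decision logic proposed in issue 1 can be modeled in isolation. This is a minimal sketch, not the workflow code: the enums are simplified stand-ins for the types named above, runner_id is reduced to a u64, and the next_signal closure is a hypothetical substitute for ctx.listen::<PendingAllocation>().

```rust
/// Simplified stand-in for the joined signal named in the review.
#[derive(Debug, PartialEq)]
pub enum PendingAllocation {
    Allocate { runner_id: u64 },
    Destroy,
}

/// Simplified stand-in for the spawn result.
#[derive(Debug, PartialEq)]
pub enum SpawnActorOutput {
    Allocated { runner_id: u64 },
    Sleep,
    Destroy,
}

/// Decide the outcome once the pending timeout has fired.
///
/// `cleared` is true when the actor was still queued and has been removed.
/// `next_signal` is only invoked when the actor was NOT cleared, meaning an
/// allocation (or a destroy) is already in flight.
pub fn resolve_after_timeout(
    cleared: bool,
    next_signal: impl FnOnce() -> PendingAllocation,
) -> SpawnActorOutput {
    if cleared {
        // Nothing allocated the actor in the meantime: fall back to sleep.
        return SpawnActorOutput::Sleep;
    }
    match next_signal() {
        PendingAllocation::Allocate { runner_id } => SpawnActorOutput::Allocated { runner_id },
        // Listening on the joined signal means a destroy is not missed.
        PendingAllocation::Destroy => SpawnActorOutput::Destroy,
    }
}
```

Keeping the decision separate from the signal plumbing also makes the three outcomes straightforward to unit test.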

Performance Considerations ⚡

  1. Positive: The timeout mechanism prevents actors from waiting indefinitely in the pending queue, which improves resource utilization
  2. Positive: Falling back to sleep after timeout is appropriate for actors with CrashPolicy::Sleep
  3. Consideration: 60 seconds is reasonable, but under heavy load this might cause actors to timeout and sleep when they could have been allocated shortly after. Monitor metrics to ensure this doesn't cause excessive wake/sleep cycles.

Security Concerns 🔒

No security issues identified. The changes are internal to the actor lifecycle management and don't expose new attack surfaces.

Test Coverage 🧪

Missing test coverage:

  • No tests found for the new timeout behavior
  • Should add tests for:
    1. Actor times out in pending state and falls back to sleep
    2. Actor gets allocated just after timeout (race condition)
    3. Destroy signal arrives during timeout handling
    4. Guard successfully wakes sleeping actor with force allocation

Test recommendation:

// Test actor timeout in pending state falls back to sleep
// Test actor allocated after timeout but before queue clear
// Test destroy signal during pending timeout

Since this is workflow engine code, you'd likely want integration tests that exercise the full actor lifecycle. Check if there are existing actor lifecycle tests in the pegboard test suite to add these scenarios.

Additional Observations

  1. Removed state field: The removal of wake_for_alarm (line 38-39 in mod.rs) is a good simplification. The allocation override pattern is more explicit and maintainable.

  2. Typo in PR title: "resceduling" should be "rescheduling"

  3. Documentation: Consider adding rustdoc comments to the AllocationOverride enum explaining when each variant should be used and what the pending_timeout parameter controls.

Verdict

This is a solid improvement to actor lifecycle management. The main concern is the potential race condition in the timeout handler (Issue #1 above). Please address that before merging. The code quality is otherwise excellent and follows project conventions well.

Recommended Actions:

  1. Fix the race condition in runtime.rs:648-689 (high priority)
  2. Update the misleading comment at mod.rs:570
  3. Consider making timeout configurable
  4. Add test coverage for timeout scenarios
  5. Fix typo in PR title

@MasterPtato MasterPtato force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from c20b8cd to a78fb2c Compare November 20, 2025 22:38
@MasterPtato MasterPtato force-pushed the 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state branch from 4f58c50 to 8b84eb1 Compare November 20, 2025 22:38
@claude

claude bot commented Nov 20, 2025

PR Review: Fix Pegboard Timeout for Force Reschedule Pending State

Summary

This PR adds the ability to timeout when force rescheduling actors in a pending state, addressing scenarios where actors may be stuck waiting for allocation indefinitely.

Code Quality & Best Practices

Strengths ✅

  1. Well-structured enum for allocation override: The AllocationOverride enum (runtime.rs:475-482) is clean and self-documenting with good use of Rust's type system.

  2. Proper use of structured logging: The code follows the project's tracing patterns correctly, e.g., tracing::debug!(actor_id=?input.actor_id, "timed out before actor allocated") (runtime.rs:635).

  3. Good state management: The removal of the wake_for_alarm field simplifies the state machine, making it less error-prone.

  4. Consistent timeout handling: The new timeout logic in runtime.rs:558-566 properly uses listen_with_timeout and handles all three cases (Allocate signal, Destroy signal, timeout).
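To make the three cases in point 4 concrete, here is a toy model built on std::sync::mpsc rather than the workflow engine. The type and function names are hypothetical, and the channel is only an analogy for listen_with_timeout, not its real implementation.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical signals an actor can receive while pending.
#[derive(Debug, PartialEq)]
pub enum PendingSignal {
    Allocate,
    Destroy,
}

/// Hypothetical outcome of waiting in the pending state.
#[derive(Debug, PartialEq)]
pub enum PendingOutcome {
    Allocated,
    Destroyed,
    TimedOut,
}

/// Wait for a signal, optionally giving up after `timeout`.
/// `None` mirrors `pending_timeout: None`: block until a signal arrives.
pub fn wait_pending(rx: &Receiver<PendingSignal>, timeout: Option<Duration>) -> PendingOutcome {
    let received = match timeout {
        Some(t) => rx.recv_timeout(t),
        None => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
    };
    match received {
        Ok(PendingSignal::Allocate) => PendingOutcome::Allocated,
        Ok(PendingSignal::Destroy) => PendingOutcome::Destroyed,
        // In this toy model a dropped sender is treated like a timeout.
        Err(_) => PendingOutcome::TimedOut,
    }
}
```

With no signal queued and a short timeout, the function returns TimedOut, which corresponds to the new fall-back-to-sleep path.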

Areas for Improvement 🔧

  1. Magic number in guard: The 60-second timeout at pegboard_gateway.rs:11 is hardcoded. Consider making it configurable via the pegboard config or adding a comment explaining why 60 seconds was chosen.

  2. Duplicate code pattern: The timeout handling logic at runtime.rs:634-689 duplicates actor start logic that already exists at runtime.rs:571-605. Consider extracting this into a helper function to reduce duplication.

  3. Missing test coverage: There don't appear to be tests covering the timeout scenario when an actor is pending, the race condition handling when cleared is false, and the interaction between timeout and destroy signals.

Potential Bugs 🐛

  1. Race condition edge case (runtime.rs:648-686): When cleared is false and we wait for the Allocate signal, we don't check if a Destroy signal arrives. This could cause an actor to start even though it should be destroyed. Consider using the PendingAllocation join signal here instead of just listening for Allocate.

  2. Inconsistent timeout behavior: In the guard (pegboard_gateway.rs:173), the timeout is always set to 60 seconds, but in the fake Wake signal from an alarm (mod.rs:280), it's set to None. This inconsistency could lead to different behavior for user-initiated wakes vs alarm wakes. Is this intentional?

Performance Considerations 🚀

  1. Timeout duration: A 60-second timeout is quite long. If many actors timeout simultaneously, they'll all fall back to sleep mode, which could create thundering herd issues. Consider adding jitter to the timeout, using exponential backoff for retries, or making the timeout dynamic based on system load.

  2. Database transaction in timeout path: The ClearPendingAllocation activity (runtime.rs:772-798) runs a transaction that may conflict with allocation attempts. Under high load, this could create contention. The existing serializable read is good, but monitor for performance impact.
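As an illustration of the jitter idea in point 1, the sketch below derives a per-actor offset from a hash of the actor ID so that no random number crate is needed. The function name, the string actor ID, and the millisecond unit are all assumptions; nothing like this exists in the PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Spread a base timeout by up to `max_jitter_ms`, derived from the actor ID.
/// Hypothetical helper; units are assumed to be milliseconds.
pub fn jittered_timeout_ms(base_ms: i64, max_jitter_ms: i64, actor_id: &str) -> i64 {
    if max_jitter_ms <= 0 {
        // No jitter requested: return the base timeout unchanged.
        return base_ms;
    }
    let mut hasher = DefaultHasher::new();
    actor_id.hash(&mut hasher);
    // Offset lies in the inclusive range 0..=max_jitter_ms.
    let offset = (hasher.finish() % (max_jitter_ms as u64 + 1)) as i64;
    base_ms.saturating_add(offset)
}
```

A hash-based offset is stable per actor, which keeps retries predictable; a random offset would spread load just as well if predictability is not needed.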

Recommendations 📋

High Priority:

  • Fix the potential race condition in the timeout handler
  • Add integration tests for the timeout scenario
  • Document why 60 seconds was chosen or make it configurable

Medium Priority:

  • Refactor duplicate actor start logic into a helper
  • Add metrics for timeout occurrences to monitor system health

Low Priority:

  • Add jitter to prevent thundering herd
  • Upgrade log level for timeout-to-sleep transitions

Overall Assessment

This is a solid PR that addresses a real issue with pending actors. The code quality is good and follows the project's conventions. The main concerns are around the potential race condition, duplicated code, and the hardcoded timeout value. With the suggested fixes, this will be a robust improvement to the actor lifecycle management.

Recommendation: Approve with minor changes requested (fix race condition, add tests).

@claude

claude bot commented Nov 20, 2025

Code Review: Add ability to timeout force rescheduling pending state

Summary

This PR adds timeout functionality for actors that are stuck in a pending allocation state during force rescheduling. The main changes introduce an AllocationOverride enum that allows callers to specify whether to force allocation (preventing sleep) and optionally set a timeout for pending allocation.

✅ Strengths

  1. Well-structured refactoring: The introduction of AllocationOverride enum is a clean abstraction that replaces the boolean force_allocate parameter, making the code more extensible and self-documenting.

  2. Proper timeout handling: The timeout logic correctly handles the race condition where an actor might be allocated while we're trying to clear it from the pending queue (runtime.rs:646-689).

  3. Consistent signal handling: The Wake signal is properly extended with allocation_override field and includes proper serde defaults.

  4. Code deduplication: The timeout branch in runtime.rs:634-689 properly reuses allocation logic, reducing code duplication.

🔍 Potential Issues

1. Race condition with duplicate actor start commands (runtime.rs:649-682)

When a timeout occurs and the actor was already allocated (cleared == false), the code sends a CommandStartActor signal. However, this appears to duplicate logic from earlier in the function. If the actor was already allocated, it should have already received its start command from the allocator. This could result in:

  • Duplicate start commands being sent to the same actor
  • Potential state inconsistencies

Suggestion: Consider whether this branch should return SpawnActorOutput::Allocated directly without sending another start command, or add a comment explaining why a duplicate command is necessary.

2. Missing tracing for timeout scenario (runtime.rs:634)

The timeout log at line 635 is helpful, but there's no subsequent logging about whether the actor was actually cleared or had to wait for allocation. Adding more granular logging would help with debugging:

if !cleared {
    tracing::debug!(actor_id=?input.actor_id, "timeout occurred but actor was already allocated");
    // ... rest of logic
} else {
    tracing::debug!(actor_id=?input.actor_id, "actor successfully cleared from pending queue after timeout");
}

3. Timeout value is hardcoded (pegboard_gateway.rs:11)

The 60-second timeout ACTOR_FORCE_WAKE_PENDING_TIMEOUT is hardcoded. Consider whether this should be:

  • Configurable via environment/config
  • Different based on actor priority or type
  • Documented why 60 seconds was chosen

4. StoppedVariant pattern match simplification (mod.rs:568-578)

The change to the pattern match is good, but it removes the explicit _ catch-all branch. While this works because the enum is exhaustive, the original code was more defensive. Consider if there might be future variants added to StoppedVariant.

Current:

let force_reschedule = match &variant {
    StoppedVariant::Normal { code } => { /* ... */ false }
    StoppedVariant::Lost { force_reschedule } => *force_reschedule,
};

Previous (more defensive):

let force_reschedule = match &variant {
    StoppedVariant::Normal { code: protocol::StopCode::Ok } => { /* ... */ false }
    StoppedVariant::Lost { force_reschedule } => *force_reschedule,
    _ => false,
};

🎯 Best Practices & Style

  1. ✅ Proper error handling: Uses Result types and ? operator consistently
  2. ✅ Follows repository conventions: Uses structured logging with tracing::debug!(?actor_id, ...)
  3. ✅ Good use of enums: The AllocationOverride enum with serde defaults is well-designed
  4. ✅ Maintains consistency: All call sites to spawn_actor and reschedule_actor are updated consistently

🔒 Security Considerations

No security concerns identified. The timeout mechanism prevents actors from being stuck indefinitely, which is actually a good defensive measure against resource exhaustion.

⚡ Performance Considerations

  1. Positive impact: The timeout prevents actors from waiting indefinitely in pending state, improving responsiveness
  2. Minor concern: The timeout handling path includes an additional ctx.listen::<Allocate>() call in the race condition case (runtime.rs:649), which adds latency. However, this is unavoidable for correctness.

🧪 Test Coverage

Missing test coverage - The PR doesn't include tests for:

  • Timeout behavior when pending allocation takes too long
  • Race condition handling when timeout occurs just as allocation completes
  • Wake signal with different AllocationOverride values

Recommendation: Add integration tests to cover:

// 1. Actor times out in pending state and falls back to sleep
// 2. Actor times out but was already allocated (race condition)
// 3. Wake signal with timeout override wakes sleeping actor

📝 Documentation

Consider adding:

  • Doc comments on AllocationOverride explaining when to use each variant
  • Comments explaining why 60 seconds was chosen as the timeout
  • Comments on the race condition handling in the timeout path

Minor Observations

  1. Removed unused field (mod.rs:36-39): Good cleanup removing wake_for_alarm from LifecycleState
  2. Alarm wake behavior (mod.rs:280): The alarm-triggered wake now uses AllocationOverride::DontSleep { pending_timeout: None }, which is correct since alarm wakes should force allocation without timeout
  3. Guard integration (pegboard_gateway.rs:169-210): Properly passes timeout to both initial wake and retry wakes

Overall Assessment

Status: ✅ Approved with minor suggestions

This is a solid PR that adds important functionality for preventing actors from getting stuck in pending state. The code is well-structured and mostly follows best practices. The main concerns are around potential duplicate actor start commands in the timeout race condition path and missing test coverage.

Priority fixes:

  1. Verify the duplicate CommandStartActor in the timeout race condition handler
  2. Add test coverage for timeout scenarios

Nice-to-haves:

  1. Add more granular logging
  2. Make timeout configurable
  3. Add documentation comments

@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_stop_actors_from_destroying_on_goingaway branch from a78fb2c to c20b8cd Compare November 20, 2025 23:32
@NathanFlurry NathanFlurry force-pushed the 11-20-fix_pb_add_ability_to_timeout_force_resceduling_pending_state branch from 8b84eb1 to 4f58c50 Compare November 20, 2025 23:32
@claude

claude bot commented Nov 20, 2025

PR Review: Add ability to timeout force rescheduling pending state

Summary

This PR adds timeout functionality for actors stuck in pending allocation when being force-woken. The changes introduce an AllocationOverride enum that allows specifying a timeout for pending allocations, preventing actors from waiting indefinitely when capacity is unavailable.


Strengths

  1. Clean abstraction: The AllocationOverride enum provides a clear API for controlling allocation behavior
  2. Proper timeout handling: The implementation correctly handles the race condition where an actor might be allocated just as the timeout expires (lines 648-686 in runtime.rs)
  3. Backwards compatibility: Using serde default ensures existing signals without the new field continue to work
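  4. Consistent application: The allocation override is threaded through all wake scenarios (guard gateway, alarm-based wakes, rewakes); note that alarm-based wakes pass pending_timeout: None (mod.rs:280), so only guard-initiated wakes actually time out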

Code Quality Issues

1. Title typo
The PR title has a typo: fix(pb): add ability to timeout force resceduling pending state
Should be: fix(pb): add ability to timeout force rescheduling pending state (the "h" is missing)

2. Inconsistent timeout value
In pegboard_gateway.rs:11, the timeout is hardcoded to 60 seconds with no justification. This is inconsistent with other timeouts that are configurable via Pegboard config (e.g., actor_start_threshold, actor_stop_threshold).

Recommendation: Make this configurable via ctx.config().pegboard().actor_pending_wake_timeout()
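A rough sketch of what that accessor could look like. The PegboardConfig struct here is a hypothetical stand-in, not the real config type; the optional-field-with-default pattern and the millisecond unit are assumptions, and only the 60-second default and the accessor name come from this review.

```rust
/// Hypothetical slice of the Pegboard config, for illustration only.
#[derive(Debug, Default, Clone)]
pub struct PegboardConfig {
    /// Optional override, assumed to be in milliseconds.
    pub actor_pending_wake_timeout: Option<i64>,
}

impl PegboardConfig {
    /// Matches the currently hardcoded 60 seconds.
    pub const DEFAULT_ACTOR_PENDING_WAKE_TIMEOUT_MS: i64 = 60 * 1000;

    /// Returns the configured timeout, or the default when unset.
    pub fn actor_pending_wake_timeout(&self) -> i64 {
        self.actor_pending_wake_timeout
            .unwrap_or(Self::DEFAULT_ACTOR_PENDING_WAKE_TIMEOUT_MS)
    }
}
```

Keeping the default equal to the current constant would make the change behavior-preserving for existing deployments.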

3. Removed state field without cleanup
The wake_for_alarm field was removed from LifecycleState without migration logic. Impact is mitigated by serde default, but the field will silently be ignored during deserialization. Acceptable for boolean fields but worth documenting.

4. Logic change in stopped variant handling
Lines 568-578 in mod.rs now reset reschedule state only on StopCode::Ok. Previously the wildcard would match all Normal variants. This is actually an improvement (more precise) - reschedule state should only reset on successful exits, not crashes. Worth calling out as a subtle behavioral change.

5. Timeout behavior on sleep
In runtime.rs:687-689, when timeout occurs and actor isn't in queue, it returns SpawnActorOutput::Sleep. This means guard gets a timeout and user gets ActorReadyTimeout error. Consider adding a log message to distinguish timeout-triggered sleep from intentional sleep (no capacity).

6. Missing observability
The timeout path is a new failure mode but lacks counter for timeouts, log distinguishing timeout vs intentional sleep, and metric tracking timeout frequency.
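One way to cover points 5 and 6 together is to tag each sleep with a reason and count the timeout case. The sketch below uses a plain atomic counter as a stand-in for a real metrics library; SleepReason, PendingMetrics, and the message strings are hypothetical names, not part of the PR.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical reason attached to a sleep transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SleepReason {
    NoCapacity,
    PendingTimeout,
}

/// Stand-in for a metrics registry: counts timeout-triggered sleeps.
#[derive(Debug, Default)]
pub struct PendingMetrics {
    pending_timeouts: AtomicU64,
}

impl PendingMetrics {
    /// Record a sleep and return the (lowercase) log message to emit.
    pub fn record_sleep(&self, reason: SleepReason) -> &'static str {
        match reason {
            SleepReason::PendingTimeout => {
                self.pending_timeouts.fetch_add(1, Ordering::Relaxed);
                "actor sleeping after pending timeout"
            }
            SleepReason::NoCapacity => "actor sleeping, no capacity",
        }
    }

    /// Number of timeout-triggered sleeps recorded so far.
    pub fn pending_timeouts(&self) -> u64 {
        self.pending_timeouts.load(Ordering::Relaxed)
    }
}
```

Distinct messages per reason let operators separate the two cases in logs even before a proper metric is wired up.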

@claude

claude bot commented Nov 20, 2025

Security Concerns

None identified. The timeout prevents indefinite waiting, which is a reliability improvement.


Performance Considerations

  1. Positive impact: Timeout prevents actors from waiting indefinitely, improving user experience
  2. Rewake retry logic: The 16-retry limit in pegboard_gateway.rs:197 remains in place
  3. No additional allocations or significant overhead introduced

Test Coverage

Major concern: No tests added for the new timeout functionality.

Recommended tests:

  1. Test timeout triggers when actor stays pending beyond threshold
  2. Test race condition: actor allocated just as timeout expires
  3. Test timeout with destroy signal received during pending
  4. Test backward compatibility: signals without allocation_override field
  5. Integration test: end-to-end wake with timeout via guard

Suggestions

High Priority:

  1. Make timeout configurable via Pegboard config
  2. Add test coverage for timeout scenarios
  3. Add observability (logging/metrics) for timeout path

Medium Priority:

  4. Fix PR title typo
  5. Add documentation comment explaining timeout behavior on AllocationOverride::DontSleep

Low Priority:

  6. Consider distinct return variant to differentiate timeout-sleep from capacity-sleep


Best Practices Adherence

  • Follows error handling patterns (anyhow errors)
  • Uses structured logging with ?actor_id
  • Lowercase log messages
  • Proper use of serde default for backward compatibility
  • Follows existing code patterns

Conclusion

This is a solid improvement that addresses a real operational issue (actors stuck pending). The core logic is sound and handles race conditions properly.

Main concerns:

  1. Lack of test coverage
  2. Hardcoded timeout value
  3. Missing observability

Recommendation: Request changes for test coverage and configuration. The functionality is good but needs testing before merge.

Overall assessment: Approve with requested changes

@claude

claude bot commented Nov 21, 2025

Pull Request Review

Summary

This PR adds the ability to force-wake sleeping actors with a configurable timeout for the pending allocation state. The main change introduces an AllocationOverride enum that allows the Guard service to bypass sleep behavior and force actors into pending state with an optional timeout.


Strengths

  1. Well-structured refactoring: The introduction of AllocationOverride enum is a clean abstraction that replaces the boolean force_allocate parameter, making the code more maintainable and self-documenting.

  2. Good timeout handling: The timeout logic in spawn_actor (runtime.rs:558-566) properly handles the case where an actor times out in pending state and falls back to sleep.

  3. Race condition handling: The code at runtime.rs:647-687 correctly handles the race condition where an actor might be allocated between clearing from the queue and checking.

  4. State cleanup: Removal of wake_for_alarm field simplifies the state machine and reduces potential bugs from stale flags.

  5. Pattern matching improvement: The change in mod.rs:569-575 from matching only StopCode::Ok to matching all codes and resetting only on Ok is more correct.


Issues and Concerns

1. Typo in PR title

resceduling should be rescheduling

2. Code duplication (runtime.rs:634-690)

There is significant duplication between the None timeout case and the Some(PendingAllocation::Allocate) case. The entire CommandStartActor signal block is duplicated. Consider extracting this into a helper function to reduce duplication.

3. Unclear timeout value (pegboard_gateway.rs:11)

The constant ACTOR_FORCE_WAKE_PENDING_TIMEOUT is set to 60 seconds, but there is no documentation explaining why this specific value was chosen. Consider adding a comment explaining the rationale, making this configurable via the pegboard config, or discussing if 60 seconds is appropriate for all use cases.

4. Pattern matching inconsistency (mod.rs:569-575)

The new pattern match removes the explicit handling of non-Ok stop codes. While technically correct, consider adding a comment explaining that non-Ok stop codes intentionally do not reset the retry count.

5. Missing error handling for timeout

When timeout occurs (runtime.rs:634), the code logs and proceeds to sleep, but there is no mechanism to notify the caller (guard) that the wake request failed. The guard will wait indefinitely for the actor to become ready. Consider signaling back to the caller or emitting a metric when timeout occurs so operators can monitor this condition.

6. Inconsistent serde default usage

The Wake signal struct has serde default on allocation_override (mod.rs:759), which is good for backwards compatibility. However, verify this is intentional and document why existing Wake signals in flight should default to AllocationOverride::None.


Security Concerns

No critical security issues identified. The timeout mechanism prevents indefinite waiting which could be a DoS vector.


Performance Considerations

  1. Additional database activity: The new timeout path adds an extra ClearPendingAllocation activity call. This is acceptable but worth noting for high-frequency wake scenarios.

  2. Duplicate signal sends: When timing out and the actor was allocated (runtime.rs:658-681), we still send the start command. This is correct but adds latency.


Test Coverage

Issue: No test coverage is included for timeout behavior in pending state, race condition handling when timeout occurs mid-allocation, or the guard's wake behavior with the new timeout.

Recommendation: Add integration tests for these scenarios, in a follow-up PR if not in this one.


Documentation

Missing documentation:

  1. No comments explaining the timeout behavior in AllocationOverride
  2. No update to any README or architecture docs about this new capability
  3. The pending_timeout field could use a doc comment explaining units and behavior

Suggestions

  1. Add tracing for metrics to help operators monitor timeout events
  2. Consider making timeout configurable per-request rather than hardcoding 60s
  3. Add documentation comments to the AllocationOverride enum

Final Verdict

Approval with minor concerns

The core implementation is solid and addresses a real need. The main issues are code duplication that should be refactored, missing test coverage, documentation gaps, and minor clarifications needed. The changes are backwards compatible and do not introduce security vulnerabilities. I would recommend addressing the code duplication before merging and adding tests, in a follow-up PR if not in this one.


Generated with Claude Code

@claude

claude bot commented Nov 21, 2025

PR Review: Add Timeout for Force Rescheduling Pending State

Overview

This PR adds the ability to timeout actors stuck in pending allocation state when force-woken, preventing indefinite hangs. The changes introduce an AllocationOverride enum and add timeout handling for pending allocations.

Critical Issues

🚨 Bug: Missing Command Transmission on Timeout Path

Location: engine/packages/pegboard/src/workflows/actor/runtime.rs:634-689

When a pending allocation times out and the actor was allocated during the race condition (i.e., !cleared), the code correctly waits for the Allocate signal and updates the runner state, but fails to send the CommandStartActor signal to the runner workflow.

Current code (lines 648-687):

if !cleared {
    let sig = ctx.listen::<Allocate>().await?;
    
    ctx.activity(UpdateRunnerInput {
        actor_id: input.actor_id,
        runner_id: sig.runner_id,
        runner_workflow_id: sig.runner_workflow_id,
    })
    .await?;
    
    ctx.signal(crate::workflows::runner::Command {
        inner: protocol::Command::CommandStartActor(
            protocol::CommandStartActor {
                // ... command details
            },
        ),
    })
    .to_workflow_id(sig.runner_workflow_id)
    .send()
    .await?;
    
    Ok(SpawnActorOutput::Allocated {
        runner_id: sig.runner_id,
        runner_workflow_id: sig.runner_workflow_id,
    })
} else {
    Ok(SpawnActorOutput::Sleep)
}

Correction: on closer inspection of the diff at lines 634-689, the CommandStartActor signal is sent in the timeout path (lines 658-681), as the snippet above shows, so this is not a bug. The code structure mirrors the destroy path correctly. The rest of this review reflects that correction.


Code Quality & Best Practices

Good: Cleaner State Management

The removal of wake_for_alarm simplifies state management. Instead of tracking whether a wake was from an alarm with a boolean flag, the code now uses the more explicit AllocationOverride enum pattern. This is more maintainable and type-safe.

Good: Type-Safe Allocation Override

The new AllocationOverride enum provides clear intent:

pub enum AllocationOverride {
    None,
    DontSleep { pending_timeout: Option<i64> },
}

This is better than passing boolean flags like the old force_allocate parameter.

⚠️ Code Duplication Concern

Location: runtime.rs:607-689

The timeout handling path (None branch) has significant code duplication with the Destroy path:

  • Both handle the race condition where !cleared means actor was allocated
  • Both wait for Allocate signal, update runner, and send CommandStartActor
  • Lines 649-686 are nearly identical to lines 622-630 (with differences in continuation vs early return)

Suggestion: Extract this into a helper function:

async fn handle_late_allocation(
    ctx: &mut WorkflowCtx,
    input: &Input,
    generation: u32,
    cleared: bool,
) -> Result<Option<SpawnActorOutput>>

This would reduce duplication and make the logic easier to maintain.

Performance Considerations

Appropriate Timeout Value

The 60-second timeout (ACTOR_FORCE_WAKE_PENDING_TIMEOUT) seems reasonable for preventing indefinite hangs while allowing enough time for allocation in busy periods.

💡 Consider Making Timeout Configurable

Currently hardcoded as a const. Consider whether this should be part of the Pegboard config for different deployment scenarios.

Security Concerns

No security issues identified. The timeout mechanism actually improves resilience against potential DoS scenarios where actors could get stuck indefinitely.

Test Coverage

⚠️ Missing Test Coverage

The PR doesn't include tests for:

  1. Timeout behavior when pending allocation exceeds 60 seconds
  2. Race condition handling when actor gets allocated after timeout but before queue clear
  3. Behavior when AllocationOverride::DontSleep is used with and without timeout

Recommendation: Add integration tests covering:

  • Actor times out while pending → should transition to Sleep state
  • Actor gets allocated during timeout race condition → should properly start
  • Multiple wake attempts with different override settings

Specific Code Review

mod.rs:568-578 - Simplified StoppedVariant Matching

let force_reschedule = match &variant {
    StoppedVariant::Normal { code } => {
        if let protocol::StopCode::Ok = code {
            state.reschedule_state = Default::default();
        }
        false
    }
    StoppedVariant::Lost { force_reschedule } => *force_reschedule,
};

This is cleaner than the previous version. The logic correctly only resets retry count on successful exit (StopCode::Ok). Consider adding a comment explaining this behavior:

// Only reset retry count on successful exit to maintain backoff for crashes
if let protocol::StopCode::Ok = code {
    state.reschedule_state = Default::default();
}
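The same behavior can be checked in isolation with stand-in types. This is a sketch under assumptions: StopCode::Error, the retry_count field, and the handle_stopped function are hypothetical names chosen for illustration, while the match structure follows the snippet quoted above.

```rust
/// Stand-in for protocol::StopCode; `Error` is a hypothetical non-Ok code.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StopCode {
    Ok,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum StoppedVariant {
    Normal { code: StopCode },
    Lost { force_reschedule: bool },
}

/// Stand-in for the reschedule state; `retry_count` is an assumed field.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct RescheduleState {
    pub retry_count: u32,
}

/// Returns whether the actor should be force-rescheduled.
pub fn handle_stopped(variant: &StoppedVariant, reschedule_state: &mut RescheduleState) -> bool {
    match variant {
        StoppedVariant::Normal { code } => {
            // Only reset retry count on successful exit to keep backoff for crashes.
            if let StopCode::Ok = code {
                *reschedule_state = RescheduleState::default();
            }
            false
        }
        StoppedVariant::Lost { force_reschedule } => *force_reschedule,
    }
}
```

A non-Ok normal stop leaves the retry count untouched, which is the behavior the suggested comment is meant to document.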

mod.rs:280 - Fake Wake Signal

Main::Wake(Wake { 
    allocation_override: AllocationOverride::DontSleep { 
        pending_timeout: None 
    } 
})

When the alarm fires, it creates a wake with DontSleep override. This is correct for alarm-based wakes (you don't want an alarm to immediately put the actor back to sleep), but consider documenting why pending_timeout: None is appropriate here (presumably because we want to wait indefinitely for alarm-based wakes).

guard/routing/pegboard_gateway.rs:11 - Magic Number

const ACTOR_FORCE_WAKE_PENDING_TIMEOUT: i64 = util::duration::seconds(60);

Good use of a named constant. Consider adding a comment explaining why 60 seconds was chosen.

Logging Patterns

Good Logging

The structured logging follows the repository conventions:

tracing::debug!(actor_id=?input.actor_id, "timed out before actor allocated");

Uses the ? sigil for debug formatting (actor_id=?input.actor_id) and a lowercase message.

Summary

Overall Assessment: This is a solid improvement that addresses a real issue (actors hanging in pending state). The code follows Rivet conventions well.

Must Fix Before Merge:

  • None (critical bug I initially identified was incorrect)

Should Fix:

  1. Reduce code duplication in timeout/destroy paths
  2. Add test coverage for timeout behavior and race conditions

Nice to Have:

  1. Make timeout configurable via Pegboard config
  2. Add explanatory comments for alarm wake behavior
  3. Document why 60 seconds was chosen for timeout

Recommendations:

  • Add integration/unit tests before merging
  • Consider extracting duplicated race-condition handling logic
  • Monitor metrics after deployment to validate timeout duration choice
