-
Notifications
You must be signed in to change notification settings - Fork 135
fix(pb): add ability to timeout force resceduling pending state #3502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 11-20-fix_pb_stop_actors_from_destroying_on_goingaway
Are you sure you want to change the base?
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
3 Skipped Deployments
|
Code Review: Timeout for Force Rescheduling Pending StateSummaryThis PR adds the ability to timeout when force-rescheduling actors in a pending state, preventing actors from waiting indefinitely for allocation. The changes introduce an ✅ Strengths
🔍 Issues & Concerns1. Critical: Hard-coded timeout constant in guard package// engine/packages/guard/src/routing/pegboard_gateway.rs:11
const ACTOR_FORCE_WAKE_PENDING_TIMEOUT: i64 = util::duration::seconds(60);Issue: This 60-second timeout is defined in the guard package rather than being configurable or centralized. Recommendations:
2. Potential issue: Change in
|
8b84eb1 to
4f58c50
Compare
a78fb2c to
c20b8cd
Compare
Code ReviewThanks for this PR! I've reviewed the changes and have the following feedback: OverviewThis PR adds the ability to timeout force rescheduling when actors are in pending state. The implementation introduces an ✅ Strengths
🔍 Issues and Suggestions1. Typo in PR Title (Minor)The PR title has a typo: "resceduling" should be "rescheduling" 2. Code Duplication in Timeout Handler (Moderate)In 3. Missing Error Context (Minor)When timeout occurs ( 4. Magic Number Should Be Configurable (Moderate)In 5. Incomplete Pattern Match Simplification (Minor)In 6. Removed Field Not Documented (Minor)The 🔒 Security Considerations✅ No security issues identified. The timeout mechanism properly prevents indefinite waits. ⚡ Performance Considerations✅ The changes are performance-neutral. Adding a timeout doesn't add overhead when not used. 🧪 Test Coverage
📝 DocumentationThe new RecommendationsMust Fix:
Should Fix:
Nice to Have:
Overall, this is a solid implementation that properly handles a tricky edge case. The main gaps are in testing and documentation rather than the code itself. |
PR Review: Add ability to timeout force rescheduling pending stateThanks for this PR! This is a solid improvement that addresses an important edge case in the actor lifecycle management. SummaryThis PR adds timeout handling for actors stuck in the pending allocation state, preventing indefinite waiting when waking sleeping actors. The implementation introduces a new AllocationOverride enum to control allocation behavior and adds a 60-second timeout for force-wake scenarios. Strengths
|
Code Quality Observations1. Removed wake_for_alarm field - The removal of this field simplifies state management. However, verify that alarm-based wake behavior still works correctly. The fake signal now always uses AllocationOverride::DontSleep { pending_timeout: None } instead of distinguishing alarm-based wakes. 2. Simplified StoppedVariant handling - The refactor at lines 568-577 in mod.rs is cleaner and more explicit than the previous wildcard pattern. Potential Issues1. Missing documentation on timeout value - The ACTOR_FORCE_WAKE_PENDING_TIMEOUT constant (60 seconds) lacks justification. Add a comment explaining why 60 seconds was chosen and under what conditions this timeout triggers. 2. Code duplication in timeout handler - Lines 658-681 in runtime.rs duplicate the actor start command logic that appears earlier at lines 520-543. Consider extracting this into a helper function to reduce duplication. |
|
3. Inconsistent pending_timeout handling - The timeout is hardcoded in guard but AllocationOverride allows None. In guard/pegboard_gateway.rs it always sets Some(ACTOR_FORCE_WAKE_PENDING_TIMEOUT), while in mod.rs it uses None for alarm-based wakes. Question: Should alarm-based wakes also have a timeout? If actors can get stuck in pending state during regular wakes, they could also get stuck during alarm wakes. 4. Behavior change in alarm wakes - The fake signal for alarms (line 279 in mod.rs) now uses AllocationOverride::DontSleep { pending_timeout: None } instead of tracking wake_for_alarm. This means alarm-based wakes will now force actors to pend instead of potentially sleeping. Verify: Is this the intended behavior? Should alarms also allow actors to sleep if no capacity exists? |
Testing ConcernsCritical: No test files found in the pegboard package. Consider adding integration tests for:
|
Security and PerformanceSecurity: No concerns identified. The timeout prevents resource exhaustion by not allowing actors to wait indefinitely. Performance:
StyleAll code follows the project conventions from CLAUDE.md (hard tabs, structured logging with ? prefix, lowercase log messages, proper imports). Recommendations
Overall AssessmentThis is a well-implemented fix for an important edge case. The core logic is sound and properly handles race conditions. Main concerns are code duplication, missing documentation on timeout values, and potential behavior change in alarm wake handling. Once these clarifications are addressed, this will be a solid improvement to actor lifecycle reliability. Approved with minor suggestions for improvement |
c20b8cd to
a78fb2c
Compare
4f58c50 to
8b84eb1
Compare
PR Review: Fix Pegboard Actor Force Rescheduling Pending TimeoutSummaryThis PR adds the ability to timeout while waiting for actor allocation during force wake/rescheduling operations. The main change introduces an Code Quality & Best PracticesPositive Changes:
Areas for Improvement:
Potential Bugs & Issues
Performance Considerations
Security ConcernsNo security issues identified. The changes are internal state management and don't affect authentication, authorization, or data validation. Test CoverageMissing tests:
Suggestion: Add a test case in the pegboard test suite that simulates no available runners and verifies the timeout behavior. Additional Observations
RecommendationsHigh Priority:
Medium Priority:
Low Priority:
VerdictThis is a solid improvement that adds necessary timeout handling to prevent indefinite waiting. The implementation correctly handles race conditions and maintains state consistency. The main concerns are around observability (metrics/logging) and test coverage. Approval Status: Approve with suggestions for follow-up improvements (particularly testing and metrics). |
Pull Request Review: Add Timeout for Force Rescheduling Pending StateSummaryThis PR adds the ability to timeout force rescheduling when actors are in a pending allocation state. The implementation introduces an Code Quality ✅Strengths:
Improvements Needed:
Potential Bugs 🔍
Performance Considerations ⚡
Security Concerns 🔒No significant security concerns identified. The timeout mechanism properly handles:
Test Coverage 🧪Missing Test Scenarios:
Recommendation: Add integration tests covering the timeout scenarios, especially:
Architecture & Design 📐Positive:
Suggestions:
Style & Conventions ✨Follows Repository Standards:
Minor Issues:
RecommendationsHigh Priority:
Medium Priority:
Low Priority:
VerdictOverall Assessment: APPROVE with suggestions ✅ This is a solid implementation that addresses a real need (timing out force reschedules). The code quality is good, follows repository conventions, and handles edge cases properly. The main concerns are around test coverage and minor refactoring opportunities for maintainability. The removal of Generated with Claude Code |
a78fb2c to
c20b8cd
Compare
8b84eb1 to
4f58c50
Compare
Code ReviewSummaryThis PR adds the ability to timeout and force reschedule actors that are stuck in the pending allocation state. The changes introduce an AllocationOverride enum to control allocation behavior and add timeout support when waiting for pending allocations. Positive Changes
Issues and Concerns1. Missing Test Coverage (Critical)No tests found in the pegboard package. This PR adds complex timeout logic that should be tested:
2. Timeout Value Not ConfigurableLocation: engine/packages/guard/src/routing/pegboard_gateway.rs:11 The 60-second timeout is hardcoded. Consider whether this should be configurable per-deployment or per-namespace, and add a comment explaining why 60 seconds was chosen. 3. Duplicated Code in GuardLocations: engine/packages/guard/src/routing/pegboard_gateway.rs:172-180 and 201-209 The same Wake signal construction appears twice. Consider extracting to a helper function. 4. Potential Logic Issue in Alarm WakeLocation: engine/packages/pegboard/src/workflows/actor/mod.rs:280 When an alarm triggers to wake an actor, it uses AllocationOverride::DontSleep with pending_timeout: None. This means alarm-based wakes will force allocation without a timeout. If an alarm wakes a sleeping actor but no capacity is available, it will wait indefinitely in the pending state. Consider whether alarm wakes should also have a timeout. 5. Documentation NeededThe AllocationOverride enum should be documented:
6. Logging ConsistencyLocation: engine/packages/pegboard/src/workflows/actor/runtime.rs:635 Consider logging the timeout duration for debugging in the timeout message. Performance Considerations
Security ConsiderationsNo security concerns identified. RecommendationsBefore merging:
Nice to have:
Overall AssessmentThe PR is well-structured and addresses a real issue with actors getting stuck in pending state. The core logic is sound, but it would benefit from additional testing and documentation. The changes are backward-compatible and follow existing patterns. Status: Recommend requesting changes for test coverage before merging. |
PR Review: Add ability to timeout force rescheduling pending stateSummaryThis PR introduces a timeout mechanism for actors in the pending allocation state when force-waking, allowing them to fall back to sleep mode if allocation doesn't happen within the specified timeout (60 seconds). This is a valuable improvement for handling resource contention scenarios. Code Quality & Best Practices ✅Strengths:
Improvements:
Potential Issues 🔍
Performance Considerations ⚡
Security Concerns 🔒No security issues identified. The changes are internal to the actor lifecycle management and don't expose new attack surfaces. Test Coverage 🧪Missing test coverage:
Test recommendation: // Test actor timeout in pending state falls back to sleep
// Test actor allocated after timeout but before queue clear
// Test destroy signal during pending timeoutSince this is workflow engine code, you'd likely want integration tests that exercise the full actor lifecycle. Check if there are existing actor lifecycle tests in the pegboard test suite to add these scenarios. Additional Observations
VerdictThis is a solid improvement to actor lifecycle management. The main concern is the potential race condition in the timeout handler (Issue #1 above). Please address that before merging. The code quality is otherwise excellent and follows project conventions well. Recommended Actions:
|
c20b8cd to
a78fb2c
Compare
4f58c50 to
8b84eb1
Compare
PR Review: Fix Pegboard Timeout for Force Reschedule Pending StateSummaryThis PR adds the ability to timeout when force rescheduling actors in a pending state, addressing scenarios where actors may be stuck waiting for allocation indefinitely. Code Quality & Best PracticesStrengths ✅
Areas for Improvement 🔧
Potential Bugs 🐛
Performance Considerations 🚀
Recommendations 📋High Priority:
Medium Priority:
Low Priority:
Overall AssessmentThis is a solid PR that addresses a real issue with pending actors. The code quality is good and follows the project's conventions. The main concerns are around the potential race condition, duplicated code, and the hardcoded timeout value. With the suggested fixes, this will be a robust improvement to the actor lifecycle management. Recommendation: Approve with minor changes requested (fix race condition, add tests). |
Code Review: Add ability to timeout force rescheduling pending stateSummaryThis PR adds timeout functionality for actors that are stuck in a pending allocation state during force rescheduling. The main changes introduce an ✅ Strengths
🔍 Potential Issues1. Race condition with duplicate actor start commands (runtime.rs:649-682)When a timeout occurs and the actor was already allocated (cleared == false), the code sends a
Suggestion: Consider whether this branch should return 2. Missing tracing for timeout scenario (runtime.rs:634)The timeout log at line 635 is helpful, but there's no subsequent logging about whether the actor was actually cleared or had to wait for allocation. Adding more granular logging would help with debugging: if !cleared {
tracing::debug!(actor_id=?input.actor_id, "timeout occurred but actor was already allocated");
// ... rest of logic
} else {
tracing::debug!(actor_id=?input.actor_id, "actor successfully cleared from pending queue after timeout");
}3. Timeout value is hardcoded (pegboard_gateway.rs:11)The 60-second timeout
4. StoppedVariant pattern match simplification (mod.rs:568-578)The change to the pattern match is good, but it removes the explicit Current: let force_reschedule = match &variant {
StoppedVariant::Normal { code } => { /* ... */ false }
StoppedVariant::Lost { force_reschedule } => *force_reschedule,
};Previous (more defensive): let force_reschedule = match &variant {
StoppedVariant::Normal { code: protocol::StopCode::Ok } => { /* ... */ false }
StoppedVariant::Lost { force_reschedule } => *force_reschedule,
_ => false,
};🎯 Best Practices & Style
🔒 Security ConsiderationsNo security concerns identified. The timeout mechanism prevents actors from being stuck indefinitely, which is actually a good defensive measure against resource exhaustion. ⚡ Performance Considerations
🧪 Test CoverageMissing test coverage - The PR doesn't include tests for:
Recommendation: Add integration tests to cover: // 1. Actor times out in pending state and falls back to sleep
// 2. Actor times out but was already allocated (race condition)
// 3. Wake signal with timeout override wakes sleeping actor📝 DocumentationConsider adding:
Minor Observations
Overall AssessmentStatus: ✅ Approved with minor suggestions This is a solid PR that adds important functionality for preventing actors from getting stuck in pending state. The code is well-structured and mostly follows best practices. The main concerns are around potential duplicate actor start commands in the timeout race condition path and missing test coverage. Priority fixes:
Nice-to-haves:
|
a78fb2c to
c20b8cd
Compare
8b84eb1 to
4f58c50
Compare
PR Review: Add ability to timeout force rescheduling pending stateSummaryThis PR adds timeout functionality for actors stuck in pending allocation when being force-woken. The changes introduce an AllocationOverride enum that allows specifying a timeout for pending allocations, preventing actors from waiting indefinitely when capacity is unavailable. Strengths
Code Quality Issues1. Title typo 2. Inconsistent timeout value Recommendation: Make this configurable via ctx.config().pegboard().actor_pending_wake_timeout() 3. Removed state field without cleanup 4. Logic change in stopped variant handling 5. Timeout behavior on sleep 6. Missing observability |
Security ConcernsNone identified. The timeout prevents indefinite waiting, which is a reliability improvement. Performance Considerations
Test CoverageMajor concern: No tests added for the new timeout functionality. Recommended tests:
SuggestionsHigh Priority:
Medium Priority: Low Priority: Best Practices Adherence
ConclusionThis is a solid improvement that addresses a real operational issue (actors stuck pending). The core logic is sound and handles race conditions properly. Main concerns:
Recommendation: Request changes for test coverage and configuration. The functionality is good but needs testing before merge. Overall assessment: Approve with requested changes |
c20b8cd to
a78fb2c
Compare
4f58c50 to
8b84eb1
Compare
Pull Request ReviewSummaryThis PR adds the ability to force-wake sleeping actors with a configurable timeout for the pending allocation state. The main change introduces an AllocationOverride enum that allows the Guard service to bypass sleep behavior and force actors into pending state with an optional timeout. Strengths
Issues and Concerns1. Typo in PR titleresceduling should be rescheduling 2. Code duplication (runtime.rs:634-690)There is significant duplication between the None timeout case and the Some(PendingAllocation::Allocate) case. The entire CommandStartActor signal block is duplicated. Consider extracting this into a helper function to reduce duplication. 3. Unclear timeout value (pegboard_gateway.rs:11)The constant ACTOR_FORCE_WAKE_PENDING_TIMEOUT is set to 60 seconds, but there is no documentation explaining why this specific value was chosen. Consider adding a comment explaining the rationale, making this configurable via the pegboard config, or discussing if 60 seconds is appropriate for all use cases. 4. Pattern matching inconsistency (mod.rs:569-575)The new pattern match removes the explicit handling of non-Ok stop codes. While technically correct, consider adding a comment explaining that non-Ok stop codes intentionally do not reset the retry count. 5. Missing error handling for timeoutWhen timeout occurs (runtime.rs:634), the code logs and proceeds to sleep, but there is no mechanism to notify the caller (guard) that the wake request failed. The guard will wait indefinitely for the actor to become ready. Consider signaling back to the caller or emitting a metric when timeout occurs so operators can monitor this condition. 6. Inconsistent serde default usageThe Wake signal struct has serde default on allocation_override (mod.rs:759), which is good for backwards compatibility. However, verify this is intentional and document why existing Wake signals in flight should default to AllocationOverride::None. Security ConcernsNo critical security issues identified. The timeout mechanism prevents indefinite waiting which could be a DoS vector. Performance Considerations
Test CoverageIssue: No test coverage is included for timeout behavior in pending state, race condition handling when timeout occurs mid-allocation, or guards wake behavior with the new timeout. Recommendation: Add integration tests for these scenarios in a follow-up PR if not immediately. DocumentationMissing documentation:
Suggestions
Final VerdictApproval with minor concerns The core implementation is solid and addresses a real need. The main issues are code duplication that should be refactored, missing test coverage, documentation gaps, and minor clarifications needed. The changes are backwards compatible and do not introduce security vulnerabilities. I would recommend addressing the code duplication before merging and adding tests in a follow-up PR if not immediately. Generated with Claude Code |
PR Review: Add Timeout for Force Rescheduling Pending StateOverviewThis PR adds the ability to timeout actors stuck in pending allocation state when force-woken, preventing indefinite hangs. The changes introduce an Critical Issues🚨 Bug: Missing Command Transmission on Timeout PathLocation: When a pending allocation times out and the actor was allocated during the race condition (i.e., Current code (lines 648-687): if !cleared {
let sig = ctx.listen::<Allocate>().await?;
ctx.activity(UpdateRunnerInput {
actor_id: input.actor_id,
runner_id: sig.runner_id,
runner_workflow_id: sig.runner_workflow_id,
})
.await?;
ctx.signal(crate::workflows::runner::Command {
inner: protocol::Command::CommandStartActor(
protocol::CommandStartActor {
// ... command details
},
),
})
.to_workflow_id(sig.runner_workflow_id)
.send()
.await?;
Ok(SpawnActorOutput::Allocated {
runner_id: sig.runner_id,
runner_workflow_id: sig.runner_workflow_id,
})
} else {
Ok(SpawnActorOutput::Sleep)
}Wait, I need to re-check this - looking at the diff again, it appears the command IS being sent at lines 658-681. Let me verify... Actually, upon closer inspection of the diff at lines 634-689, the Let me revise my review: Code Quality & Best Practices✅ Good: Cleaner State ManagementThe removal of ✅ Good: Type-Safe Allocation OverrideThe new pub enum AllocationOverride {
None,
DontSleep { pending_timeout: Option<i64> },
}This is better than passing boolean flags like the old
|

No description provided.