Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

Copy link
Contributor Author

MasterPtato commented Jan 12, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from f7e33e7 to b613527 Compare January 12, 2026 19:11
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 9297e74 to 442651f Compare January 12, 2026 19:11
@claude
Copy link

claude bot commented Jan 12, 2026

Code Review: Set failure reason regardless of crash policy

Summary

This PR fixes a logic issue where failure reasons were only being set for certain crash policies, when they should be set unconditionally. The changes improve error tracking and debugging by ensuring failure reasons are always recorded when actors crash or are lost.

Changes Overview

  1. Restructured failure reason handling (mod.rs:879-919) - Moved failure reason setting to happen immediately when an actor stops, before crash policy handling
  2. Removed conditional setting (mod.rs:1073) - Removed the duplicate failure reason setting that was only happening in the CrashPolicy::Sleep path
  3. Fixed workflow debug query (debug.rs:670) - Changed from read() to read_opt() to handle missing workflows gracefully
  4. Removed premature clearing (runtime.rs:1131) - Removed line that was clearing failure_reason when actor started (now handled in update_runner instead)

Positive Aspects

Correct logic flow - The restructuring makes the code much clearer: failure reasons are set early regardless of what happens next

Better error visibility - Failure reasons are now tracked consistently across all crash policies (Restart, Sleep, and Complete)

Improved workflow debug - The fix in debug.rs prevents panics when querying workflows that no longer exist

Proper failure reason precedence - Runner failures continue to take precedence over actor crashes (via is_runner_failure() check in set_failure_reason)

Issues & Concerns

1. Missing Test Coverage ⚠️

This PR modifies critical actor lifecycle logic but does not include any tests. Consider adding tests for:

  • Actor crash with CrashPolicy::Restart should set failure reason
  • Actor crash with CrashPolicy::Sleep should set failure reason
  • Actor crash with CrashPolicy::Complete should set failure reason
  • Verify runner failures are not overwritten by subsequent crash failures
  • Verify failure reason is cleared when actor successfully reallocates

2. Potential State Inconsistency ⚠️

Location: runtime.rs:1131

The removal of state.failure_reason = None in set_started() changes when failure reasons are cleared. Previously they were cleared when the actor started; now they are only cleared in update_runner() (lines 137, 478).

Question: What happens if an actor:

  1. Crashes (sets failure_reason)
  2. Restarts and gets allocated (update_runner clears failure_reason ✅)
  3. Starts successfully (set_started no longer clears failure_reason)

Is there a scenario where set_started is called without update_runner being called first? If so, the failure reason from a previous crash could persist incorrectly.

3. Code Structure: Pattern Matching Clarity 💭

Location: mod.rs:879-919

The new pattern matching structure is more explicit but slightly verbose. Consider if this structure is clearer:

let force_reschedule = match &variant {
    StoppedVariant::Normal { code, message } => {
        match code {
            protocol::mk2::StopCode::Ok => {
                // Reset retry count on successful exit.
                state.reschedule_state = Default::default();
            }
            protocol::mk2::StopCode::Error => {
                ctx.v(3)
                    .activity(runtime::SetFailureReasonInput {
                        failure_reason: FailureReason::Crashed {
                            message: message.clone(),
                        },
                    })
                    .await?;
            }
        }
        false
    }
    StoppedVariant::Lost {
        force_reschedule,
        failure_reason,
    } => {
        // Set runner failure reason if actor was lost unexpectedly.
        if let Some(failure_reason) = failure_reason {
            ctx.v(3)
                .activity(runtime::SetFailureReasonInput {
                    failure_reason: failure_reason.clone(),
                })
                .await?;
        }
        *force_reschedule
    }
};

This groups the Normal variants together and avoids repeating false. However, the current structure is acceptable if you prefer the explicit separation.

4. Minor: Logging Consistency 💭

Location: debug.rs:692

The new warning uses a different format than typical tracing patterns in the codebase:

tracing::warn!("workflow {workflow_id} does not exist");

Based on CLAUDE.md conventions, consider using structured logging:

tracing::warn!(?workflow_id, "workflow does not exist");

Questions for Clarification

  1. Failure reason lifecycle: Can you confirm that update_runner() is always called before set_started() during allocation? If not, we may need to keep the failure_reason = None line in set_started().

  2. Testing strategy: Are there existing integration tests that cover these actor lifecycle paths? If so, have they been run to verify the changes?

  3. Deployment risk: This changes core actor lifecycle logic. What is the rollback plan if unexpected issues arise in production?

Recommendations

Must Address:

  • Verify the failure_reason clearing logic - ensure there is no scenario where stale failure reasons persist

Should Address:

  • Add test coverage for failure reason tracking across crash policies
  • Update the warning log to use structured logging format

Optional:

  • Consider simplifying the pattern matching structure (see suggestion above)

Conclusion

This is a good fix that addresses a real issue with incomplete failure tracking. The main concern is ensuring the failure reason clearing logic is correct after removing the line from set_started(). With test coverage and verification of the state lifecycle, this should be safe to merge.

Recommendation: Request confirmation on the failure reason clearing logic before merging.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Jan 12, 2026

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3835

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3835

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3835

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3835

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3835

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3835

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3835

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3835

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3835

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3835

commit: 250667d

@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from b613527 to b4daf93 Compare January 13, 2026 00:27
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 7915d6f to 258ae43 Compare January 13, 2026 00:44
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch 2 times, most recently from 0b9ca2f to 1a08758 Compare January 13, 2026 01:01
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 42cfc6d to 506742d Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 1a08758 to 5b1c724 Compare January 13, 2026 01:29
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 506742d to ca4cbfc Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 5b1c724 to 8152462 Compare January 14, 2026 02:05
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from ca4cbfc to 8102256 Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 8152462 to 82ed36d Compare January 14, 2026 03:06
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 82ed36d to f5a4f68 Compare January 14, 2026 19:47
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 4b907cb to dc985c0 Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from f5a4f68 to 9e4ae6f Compare January 14, 2026 20:01
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from dc985c0 to f3f7d72 Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 9e4ae6f to 595d7b8 Compare January 14, 2026 22:52
@MasterPtato MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 595d7b8 to 250667d Compare January 14, 2026 23:02
@MasterPtato MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from f3f7d72 to ae1c40c Compare January 14, 2026 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants