fix: set failure reason regardless of crash policy #3835

MasterPtato · 2026-01-12T19:09:00Z

No description provided.

MasterPtato · 2026-01-12T19:09:10Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

_{An organization admin has enabled the Graphite Merge Queue in this repository.} _{Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.}

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-01-12T19:11:36Z

Code Review: Set failure reason regardless of crash policy

Summary

This PR fixes a logic issue where failure reasons were only being set for certain crash policies, when they should be set unconditionally. The changes improve error tracking and debugging by ensuring failure reasons are always recorded when actors crash or are lost.

Changes Overview

Restructured failure reason handling (mod.rs:879-919) - Moved failure reason setting to happen immediately when an actor stops, before crash policy handling
Removed conditional setting (mod.rs:1073) - Removed the duplicate failure reason setting that was only happening in the CrashPolicy::Sleep path
Fixed workflow debug query (debug.rs:670) - Changed from read() to read_opt() to handle missing workflows gracefully
Removed premature clearing (runtime.rs:1131) - Removed line that was clearing failure_reason when actor started (now handled in update_runner instead)

Positive Aspects

✅ Correct logic flow - The restructuring makes the code much clearer: failure reasons are set early regardless of what happens next

✅ Better error visibility - Failure reasons are now tracked consistently across all crash policies (Restart, Sleep, and Complete)

✅ Improved workflow debug - The fix in debug.rs prevents panics when querying workflows that no longer exist

✅ Proper failure reason precedence - Runner failures continue to take precedence over actor crashes (via is_runner_failure() check in set_failure_reason)

Issues & Concerns

1. Missing Test Coverage ⚠️

This PR modifies critical actor lifecycle logic but does not include any tests. Consider adding tests for:

Actor crash with CrashPolicy::Restart should set failure reason
Actor crash with CrashPolicy::Sleep should set failure reason
Actor crash with CrashPolicy::Complete should set failure reason
Verify runner failures are not overwritten by subsequent crash failures
Verify failure reason is cleared when actor successfully reallocates

2. Potential State Inconsistency ⚠️

Location: runtime.rs:1131

The removal of state.failure_reason = None in set_started() changes when failure reasons are cleared. Previously they were cleared when the actor started; now they are only cleared in update_runner() (lines 137, 478).

Question: What happens if an actor:

Crashes (sets failure_reason)
Restarts and gets allocated (update_runner clears failure_reason ✅)
Starts successfully (set_started no longer clears failure_reason)

Is there a scenario where set_started is called without update_runner being called first? If so, the failure reason from a previous crash could persist incorrectly.

3. Code Structure: Pattern Matching Clarity 💭

Location: mod.rs:879-919

The new pattern matching structure is more explicit but slightly verbose. Consider if this structure is clearer:

let force_reschedule = match &variant {
    StoppedVariant::Normal { code, message } => {
        match code {
            protocol::mk2::StopCode::Ok => {
                // Reset retry count on successful exit.
                state.reschedule_state = Default::default();
            }
            protocol::mk2::StopCode::Error => {
                ctx.v(3)
                    .activity(runtime::SetFailureReasonInput {
                        failure_reason: FailureReason::Crashed {
                            message: message.clone(),
                        },
                    })
                    .await?;
            }
        }
        false
    }
    StoppedVariant::Lost {
        force_reschedule,
        failure_reason,
    } => {
        // Set runner failure reason if actor was lost unexpectedly.
        if let Some(failure_reason) = failure_reason {
            ctx.v(3)
                .activity(runtime::SetFailureReasonInput {
                    failure_reason: failure_reason.clone(),
                })
                .await?;
        }
        *force_reschedule
    }
};

This groups the Normal variants together and avoids repeating false. However, the current structure is acceptable if you prefer the explicit separation.

4. Minor: Logging Consistency 💭

Location: debug.rs:692

The new warning uses a different format than typical tracing patterns in the codebase:

tracing::warn!("workflow {workflow_id} does not exist");

Based on CLAUDE.md conventions, consider using structured logging:

tracing::warn!(?workflow_id, "workflow does not exist");

Questions for Clarification

Failure reason lifecycle: Can you confirm that update_runner() is always called before set_started() during allocation? If not, we may need to keep the failure_reason = None line in set_started().
Testing strategy: Are there existing integration tests that cover these actor lifecycle paths? If so, have they been run to verify the changes?
Deployment risk: This changes core actor lifecycle logic. What is the rollback plan if unexpected issues arise in production?

Recommendations

Must Address:

Verify the failure_reason clearing logic - ensure there is no scenario where stale failure reasons persist

Should Address:

Add test coverage for failure reason tracking across crash policies
Update the warning log to use structured logging format

Optional:

Consider simplifying the pattern matching structure (see suggestion above)

Conclusion

This is a good fix that addresses a real issue with incomplete failure tracking. The main concern is ensuring the failure reason clearing logic is correct after removing the line from set_started(). With test coverage and verification of the state lifecycle, this should be safe to merge.

Recommendation: Request confirmation on the failure reason clearing logic before merging.

pkg-pr-new · 2026-01-12T19:11:57Z

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3835

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3835

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3835

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3835

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3835

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3835

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3835

@rivetkit/virtual-websocket

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/virtual-websocket@3835

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3835

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3835

commit: 250667d

MasterPtato requested a review from NathanFlurry January 12, 2026 19:09

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from f7e33e7 to b613527 Compare January 12, 2026 19:11

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 9297e74 to 442651f Compare January 12, 2026 19:11

This was referenced Jan 12, 2026

fix(pb): have restart policy actors sleep after pending for too long #3837

Open

chore: clean up errors #3846

Open

fix(epoxy): add metrics #3847

Open

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from b613527 to b4daf93 Compare January 13, 2026 00:27

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 7915d6f to 258ae43 Compare January 13, 2026 00:44

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch 2 times, most recently from 0b9ca2f to 1a08758 Compare January 13, 2026 01:01

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 42cfc6d to 506742d Compare January 13, 2026 01:29

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 1a08758 to 5b1c724 Compare January 13, 2026 01:29

This was referenced Jan 13, 2026

fix: add ready_chunks to worker bumps, bumps per tick metric #3855

Open

fix(guard): clean up guard #3862

Open

fix(otel): enrich http traces #3878

Open

chore(gas): add overview and history docs #3879

Open

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from 506742d to ca4cbfc Compare January 14, 2026 02:05

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 5b1c724 to 8152462 Compare January 14, 2026 02:05

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from ca4cbfc to 8102256 Compare January 14, 2026 03:06

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 8152462 to 82ed36d Compare January 14, 2026 03:06

NathanFlurry approved these changes Jan 14, 2026

View reviewed changes

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 82ed36d to f5a4f68 Compare January 14, 2026 19:47

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch 2 times, most recently from 4b907cb to dc985c0 Compare January 14, 2026 20:01

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from f5a4f68 to 9e4ae6f Compare January 14, 2026 20:01

This was referenced Jan 14, 2026

fix(engine-runner): handle shutdown close codes correctly #3896

Open

fix(pegboard): cache none runner pool errors too #3899

Open

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from dc985c0 to f3f7d72 Compare January 14, 2026 22:52

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 9e4ae6f to 595d7b8 Compare January 14, 2026 22:52

fix: set failure reason regardless of crash policy

250667d

MasterPtato force-pushed the 01-12-fix_set_failure_reason_regardless_of_crash_policy branch from 595d7b8 to 250667d Compare January 14, 2026 23:02

MasterPtato force-pushed the 01-09-feat_add_metrics_to_pb_guard branch from f3f7d72 to ae1c40c Compare January 14, 2026 23:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: set failure reason regardless of crash policy #3835

fix: set failure reason regardless of crash policy #3835

MasterPtato commented Jan 12, 2026

Uh oh!

MasterPtato commented Jan 12, 2026 •

edited

Loading

Uh oh!

claude bot commented Jan 12, 2026 •

edited

Loading

Uh oh!

pkg-pr-new bot commented Jan 12, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: set failure reason regardless of crash policy #3835

Are you sure you want to change the base?

fix: set failure reason regardless of crash policy #3835

Conversation

MasterPtato commented Jan 12, 2026

Uh oh!

MasterPtato commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to use the Graphite Merge Queue

Uh oh!

claude bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review: Set failure reason regardless of crash policy

Summary

Changes Overview

Positive Aspects

Issues & Concerns

1. Missing Test Coverage ⚠️

2. Potential State Inconsistency ⚠️

3. Code Structure: Pattern Matching Clarity 💭

4. Minor: Logging Consistency 💭

Questions for Clarification

Recommendations

Conclusion

Uh oh!

pkg-pr-new bot commented Jan 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

MasterPtato commented Jan 12, 2026 •

edited

Loading

claude bot commented Jan 12, 2026 •

edited

Loading

pkg-pr-new bot commented Jan 12, 2026 •

edited

Loading