feat(engine): return error for pending actors #3730

NathanFlurry · 2026-01-01T23:25:06Z

No description provided.

vercel · 2026-01-01T23:25:11Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
rivetkit-serverless	Error		Jan 3, 2026 0:06am

3 Skipped Deployments

Project	Deployment	Review	Updated (UTC)
rivet-cloud	Ignored		Jan 3, 2026 0:06am
rivet-inspector	Ignored	Preview	Jan 3, 2026 0:06am
rivet-site	Ignored	Preview	Jan 3, 2026 0:06am

NathanFlurry · 2026-01-01T23:25:14Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository. Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-01-01T23:29:46Z

PR Review: feat(engine): return error for pending actors

Summary

This PR implements a comprehensive error tracking and reporting system for actor scheduling failures. It introduces a two-layer error architecture with enriched error details for serverless endpoints and a fail-fast mechanism in Guard to prevent long waits when runner pools are experiencing issues.

Architecture & Design ✅

Excellent documentation: The ACTOR_ERRORS.md file provides clear context with a helpful Mermaid diagram explaining the error flow.

Clean separation of concerns: The two-layer error representation is well-designed:

Internal FailureReason enum in workflow state (simple, storage-efficient)
API-facing ActorError enum with enriched context (user-friendly)

Runner pool error tracker: The dedicated workflow for tracking serverless pool health is a smart architectural choice. The hysteresis logic (3 consecutive successes to clear) prevents flapping.
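
The hysteresis rule described above can be sketched as a small state machine. This is an illustration only, not the workflow code from this PR: `PoolErrorTracker`, `record_failure`, `record_success`, and `CLEAR_AFTER_SUCCESSES` are hypothetical names, and only the threshold of 3 consecutive successes is taken from the review.

```rust
/// Hypothetical constant: consecutive successes needed to clear an active
/// error (the review states 3).
const CLEAR_AFTER_SUCCESSES: u32 = 3;

/// Hypothetical tracker state; the real workflow state is not shown here.
#[derive(Debug, Default)]
struct PoolErrorTracker {
    active_error: Option<String>,
    consecutive_successes: u32,
}

impl PoolErrorTracker {
    /// A failure sets (or replaces) the active error and resets the streak.
    fn record_failure(&mut self, message: &str) {
        self.active_error = Some(message.to_string());
        self.consecutive_successes = 0;
    }

    /// A success clears the error only once the streak reaches the threshold,
    /// so a single lucky request does not hide a failing pool.
    fn record_success(&mut self) {
        if self.active_error.is_none() {
            return;
        }
        self.consecutive_successes += 1;
        if self.consecutive_successes >= CLEAR_AFTER_SUCCESSES {
            self.active_error = None;
            self.consecutive_successes = 0;
        }
    }
}
```

Under this sketch, a failure that arrives after two successes resets the streak, so the error stays visible until three successes occur in a row.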

Code Quality

Strengths ✅

Error Handling: Custom error types follow the project's RivetError pattern correctly
Caching Strategy: Short 500ms TTL in get_error.rs:32 is appropriate for rapidly changing error states
Test Coverage: Comprehensive integration tests covering multiple error scenarios
TypeScript Integration: Client-side error types properly match the Rust API types

Issues & Recommendations

1. Missing Error Staleness Check ⚠️

File: engine/packages/pegboard/src/ops/runner_config/get_error.rs:111

The code returns active errors without checking staleness. According to the documentation (ACTOR_ERRORS.md:15), "Errors older than 5 minutes are considered stale and ignored." However, this staleness check is not implemented.

Recommendation: Add timestamp validation:

if let Some(active_error) = &state.active_error {
    let now = util::timestamp::now();
    let error_age_ms = now - active_error.timestamp;
    if error_age_ms < 300_000 { // 5 minutes
        result.push(RunnerPoolErrorEntry { ... });
    }
}
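
A self-contained version of the same check, which makes the 5-minute boundary easy to unit test, might look like the following. The `ActiveError` struct shape, the `STALE_AFTER_MS` name, and passing `now_ms` as a parameter are assumptions made for illustration; only the 5-minute window comes from ACTOR_ERRORS.md as quoted above.

```rust
/// Hypothetical constant mirroring the documented 5-minute staleness window.
const STALE_AFTER_MS: i64 = 5 * 60 * 1000;

/// Hypothetical stand-in for the error stored by the tracker.
#[derive(Debug, Clone, PartialEq)]
struct ActiveError {
    timestamp: i64, // milliseconds since the Unix epoch
    message: String,
}

/// Returns the error only if it is younger than the staleness window.
/// `now_ms` is injected so tests do not depend on the wall clock.
fn fresh_error(active: Option<&ActiveError>, now_ms: i64) -> Option<&ActiveError> {
    active.filter(|e| now_ms.saturating_sub(e.timestamp) < STALE_AFTER_MS)
}
```

In this sketch an error that is exactly 300,000 ms old is already treated as stale, matching the strict `<` comparison in the recommendation. An error timestamped in the future (clock skew) is treated as fresh.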

2. TypeScript Error Type Mismatch ⚠️

The TypeScript types don't perfectly match the Rust types:

TypeScript (rivetkit-typescript/packages/rivetkit/src/client/errors.ts:64-67):

export type ActorErrorDetails =
    | { serverless_error: ServerlessConnectionError }
    | { no_capacity: { runner_name: string } }
    | { runner_no_response: { runner_id: string } };

Rust (engine/packages/types/src/actor/error.rs:7-25):

pub enum ActorError {
    RunnerPoolError(RunnerPoolError),  // Not "serverless_error"
    NoCapacity,  // No nested object
    RunnerNoResponse { runner_id: Id },
    Crashed { message: Option<String> },  // Missing in TypeScript
}

Issues:

RunnerPoolError is named serverless_error in TypeScript (misleading, as it's not serverless-specific)
TypeScript NoCapacity has a nested object with runner_name, but Rust doesn't
Crashed variant is missing from TypeScript types

Recommendation: Align the types exactly, or document why they diverge.

3. Potential Race Condition in Guard Fail-Fast 🔍

File: engine/packages/guard/src/routing/pegboard_gateway.rs:255-259

res = &mut pool_error_check_fut => {
    if res? {
        return Err(errors::ActorRunnerFailed { actor_id }.build());
    }
}

Issue: If the actor becomes ready while we're checking pool errors, we might incorrectly return ActorRunnerFailed instead of the successful runner_id.

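Recommendation: The pool_error_check_fut result should be ignored once ready_sub fires. tokio::select! returns from whichever branch completes first, so the common case is already handled. When both branches are ready in the same poll, however, select! picks one at random unless `biased;` is specified, so the error branch can still win over a ready actor. Consider listing the ready branch first under `biased;` so the outcome is deterministic.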

4. Inconsistent Error Information 🔍

File: engine/packages/types/src/actor/error.rs:33

The NoCapacity variant doesn't include which runner was requested:

NoCapacity,  // Missing context

Recommendation: Add runner_name_selector to NoCapacity for better debugging:

NoCapacity { runner_name: String },

This would match the TypeScript error type which already includes it.

Performance Considerations

Positive ✅

Efficient caching: The 500ms TTL prevents excessive workflow queries
Batch processing: Error tracker processes up to 100 signals at once
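
To illustrate why a short TTL bounds the number of workflow queries, here is a minimal in-memory sketch. It is not the engine's cache implementation: `TtlCache`, `get_or_load`, and `ERROR_CACHE_TTL` are hypothetical names, and the `load` closure stands in for the workflow query. Only the 500ms value comes from the review.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical TTL matching the 500ms value mentioned in the review.
const ERROR_CACHE_TTL: Duration = Duration::from_millis(500);

/// Hypothetical in-memory cache keyed by runner name.
struct TtlCache<V> {
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Returns the cached value if it is younger than the TTL; otherwise
    /// calls `load` (standing in for the workflow query) and stores the result.
    /// `now` is injected so the expiry boundary can be tested deterministically.
    fn get_or_load(&mut self, key: &str, now: Instant, load: impl FnOnce() -> V) -> V {
        if let Some((stored_at, value)) = self.entries.get(key) {
            if now.duration_since(*stored_at) < ERROR_CACHE_TTL {
                return value.clone();
            }
        }
        let value = load();
        self.entries.insert(key.to_string(), (now, value.clone()));
        value
    }
}
```

With this shape, repeated lookups for the same key within the TTL trigger at most one load, at the cost of reporting an error state that may be up to 500ms out of date.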

Minor Concerns 🔍

N+1 Query Pattern: In api-peer/src/runner_configs.rs, the code fetches runner configs, then separately fetches pool errors. This is acceptable since both operations are cached.

Security Concerns

Minor 🔍

Error Message Leakage: ServerlessHttpError includes the full response body, which could leak sensitive information from serverless endpoints. Consider truncating or sanitizing error bodies.
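
One way to truncate the body is sketched below. The function names and the 512-byte limit are hypothetical; the review does not propose a specific size. Truncation only bounds how much is exposed and does not redact secrets that appear near the start of the body.

```rust
/// Hypothetical limit; the review does not propose a specific size.
const MAX_ERROR_BODY_BYTES: usize = 512;

/// Keeps at most `max_bytes` bytes of `body` without splitting a UTF-8
/// character, and appends a marker when anything was cut.
fn truncate_body(body: &str, max_bytes: usize) -> String {
    if body.len() <= max_bytes {
        return body.to_string();
    }
    // Walk back to the nearest character boundary at or below the limit.
    let mut end = max_bytes;
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    format!("{}...[truncated]", &body[..end])
}

/// Applies the hypothetical default limit before the body is stored or returned.
fn sanitize_error_body(body: &str) -> String {
    truncate_body(body, MAX_ERROR_BODY_BYTES)
}
```

Slicing a `&str` at an arbitrary byte offset panics if the offset falls inside a multi-byte character, which is why the sketch checks `is_char_boundary` before slicing.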

Testing

Excellent Coverage ✅

Mock serverless servers for different error scenarios
Guard fail-fast behavior validation
Error tracker workflow state management
Integration tests for the full error flow

Missing Tests 🔍

Staleness logic: No tests verify that errors older than 5 minutes are filtered out
Error clearing: Tests verify error setting but not the "3 consecutive successes" clearing logic

Documentation

Strengths ✅

Comprehensive ACTOR_ERRORS.md with flow diagram
Well-commented code explaining complex logic
Clear error messages for users

Improvements 🔍

Add doc comments to the get_error operation explaining staleness behavior
Document the guard fail-fast delay rationale (why 1 second initial delay?)

Minor Issues

Typo: rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts:141 - Missing 'h' in comment URL

Recommendations Summary

High Priority

⚠️ Implement staleness checking in get_error.rs - This is documented but not implemented
⚠️ Align Rust and TypeScript error types - Prevents runtime errors from type mismatches

Medium Priority

🔍 Add runner_name to NoCapacity error variant
🔍 Add test coverage for staleness and error clearing logic
🔍 Consider sanitizing serverless error response bodies

Low Priority

📝 Fix typo in TypeScript comment
🔍 Make timing constants configurable

Conclusion

This is a well-designed feature with excellent documentation and comprehensive testing. The two-layer error architecture is clean, and the runner pool error tracker is a smart solution for monitoring serverless health.

Main concerns:

Missing staleness check implementation (documented but not coded)
Type mismatches between Rust and TypeScript
Minor race condition consideration in guard fail-fast logic

Overall: Approve with minor changes recommended ✅

The implementation is production-ready after addressing the staleness check issue. The other recommendations are improvements but not blockers.

engine/packages/pegboard/src/ops/runner_config/get_error.rs

engine/packages/pegboard/src/workflows/actor/mod.rs

engine/packages/pegboard/src/workflows/actor/setup.rs

engine/packages/pegboard/src/workflows/serverless/conn.rs

engine/packages/pegboard/src/workflows/runner_pool_error_tracker.rs

engine/packages/pegboard/src/workflows/serverless/conn.rs

engine/packages/pegboard/src/ops/actor/get_for_gateway.rs

engine/packages/pegboard/src/workflows/actor/setup.rs

engine/packages/pegboard/src/workflows/serverless/conn.rs

This was referenced Jan 1, 2026

chore(rivetkit): clean up logging configuration #3672

Open

fix(rivetkit): fix c.client not respecting correct origin #3668

Open

chore(docs): document abort signal #3673

Open

This was referenced Jan 1, 2026

fix(rivetkit): allow providing no args to createRivetKit #3667

Open

chore(rivetkit): specialize context types for better compiler errors #3669

Open

chore(rivetkit): RIVET_EXPOSE_ERRORS #3671

Open

vercel bot had a problem deploying to Preview – rivetkit-serverless January 1, 2026 23:25 Failure

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 27bd42b to fa230af Compare January 2, 2026 09:20

vercel bot had a problem deploying to Preview – rivetkit-serverless January 2, 2026 09:21 Failure

MasterPtato requested changes Jan 2, 2026

Copilot AI mentioned this pull request Jan 2, 2026

Add error indicator column to runner pool table #3735

Draft

NathanFlurry changed the base branch from 12-23-chore_docs_document_abort_signal to graphite-base/3730 January 2, 2026 22:30

NathanFlurry force-pushed the graphite-base/3730 branch from 4072138 to d3bc69d Compare January 2, 2026 22:30

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from fa230af to 7039e59 Compare January 2, 2026 22:30

NathanFlurry changed the base branch from graphite-base/3730 to 01-02-chore_gas_add_support_for_find_workflows_ January 2, 2026 22:31

NathanFlurry mentioned this pull request Jan 2, 2026

chore(gas): add support for find_workflows #3739

Open

vercel bot had a problem deploying to Preview – rivetkit-serverless January 2, 2026 22:32 Failure

jog1t force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to 79307ef Compare January 2, 2026 23:33

jog1t force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 7039e59 to 653f032 Compare January 2, 2026 23:33

This was referenced Jan 2, 2026

chore: update engine-full sdk #3740

Open

feat(dash): show runner pool errors #3741

Open

fix(dash): use ineligible icon for runners #3742

Open

vercel bot had a problem deploying to Preview – rivetkit-serverless January 2, 2026 23:34 Failure

NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 79307ef to d3bc69d Compare January 2, 2026 23:59

NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 653f032 to 48e0be0 Compare January 2, 2026 23:59

jog1t mentioned this pull request Jan 2, 2026

feat(dash): surface actor errors in ui #3743

Draft

vercel bot had a problem deploying to Preview – rivetkit-serverless January 3, 2026 00:00 Failure

NathanFlurry requested a review from MasterPtato January 3, 2026 00:02

feat(engine): return error for pending actors

d0406b1

jog1t force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 48e0be0 to d0406b1 Compare January 3, 2026 00:05

jog1t force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to 79307ef Compare January 3, 2026 00:05

vercel bot had a problem deploying to Preview – rivetkit-serverless January 3, 2026 00:06 Failure

MasterPtato requested changes Jan 3, 2026

MasterPtato reviewed Jan 3, 2026

engine/packages/pegboard/src/workflows/serverless/conn.rs (resolved)
