Skip to content

Conversation

@NathanFlurry
Copy link
Member

No description provided.

@vercel
Copy link

vercel bot commented Jan 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rivetkit-serverless Error Error Jan 3, 2026 0:06am
3 Skipped Deployments
Project Deployment Review Updated (UTC)
rivet-cloud Ignored Ignored Jan 3, 2026 0:06am
rivet-inspector Ignored Ignored Preview Jan 3, 2026 0:06am
rivet-site Ignored Ignored Preview Jan 3, 2026 0:06am

Copy link
Member Author

NathanFlurry commented Jan 1, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Jan 1, 2026

PR Review: feat(engine): return error for pending actors

Summary

This PR implements a comprehensive error tracking and reporting system for actor scheduling failures. It introduces a two-layer error architecture with enriched error details for serverless endpoints and a fail-fast mechanism in Guard to prevent long waits when runner pools are experiencing issues.

Architecture & Design ✅

Excellent documentation: The ACTOR_ERRORS.md file provides clear context with a helpful Mermaid diagram explaining the error flow.

Clean separation of concerns: The two-layer error representation is well-designed:

  • Internal FailureReason enum in workflow state (simple, storage-efficient)
  • API-facing ActorError enum with enriched context (user-friendly)

Runner pool error tracker: The dedicated workflow for tracking serverless pool health is a smart architectural choice. The hysteresis logic (3 consecutive successes to clear) prevents flapping.

Code Quality

Strengths ✅

  1. Error Handling: Custom error types follow the project's RivetError pattern correctly
  2. Caching Strategy: Short 500ms TTL in get_error.rs:32 is appropriate for rapidly changing error states
  3. Test Coverage: Comprehensive integration tests covering multiple error scenarios
  4. TypeScript Integration: Client-side error types properly match the Rust API types

Issues & Recommendations

1. Missing Error Staleness Check ⚠️

File: engine/packages/pegboard/src/ops/runner_config/get_error.rs:111

The code returns active errors without checking staleness. According to the documentation (ACTOR_ERRORS.md:15), "Errors older than 5 minutes are considered stale and ignored." However, this staleness check is not implemented.

Recommendation: Add timestamp validation:

if let Some(active_error) = &state.active_error {
    let now = util::timestamp::now();
    let error_age_ms = now - active_error.timestamp;
    if error_age_ms < 300_000 { // 5 minutes
        result.push(RunnerPoolErrorEntry { ... });
    }
}

2. TypeScript Error Type Mismatch ⚠️

The TypeScript types don't perfectly match the Rust types:

TypeScript (rivetkit-typescript/packages/rivetkit/src/client/errors.ts:64-67):

export type ActorErrorDetails =
    | { serverless_error: ServerlessConnectionError }
    | { no_capacity: { runner_name: string } }
    | { runner_no_response: { runner_id: string } };

Rust (engine/packages/types/src/actor/error.rs:7-25):

pub enum ActorError {
    RunnerPoolError(RunnerPoolError),  // Not "serverless_error"
    NoCapacity,  // No nested object
    RunnerNoResponse { runner_id: Id },
    Crashed { message: Option<String> },  // Missing in TypeScript
}

Issues:

  1. RunnerPoolError is named serverless_error in TypeScript (misleading, as it's not serverless-specific)
  2. TypeScript NoCapacity has a nested object with runner_name, but Rust doesn't
  3. Crashed variant is missing from TypeScript types

Recommendation: Align the types exactly, or document why they diverge.

3. Potential Race Condition in Guard Fail-Fast 🔍

File: engine/packages/guard/src/routing/pegboard_gateway.rs:255-259

res = &mut pool_error_check_fut => {
    if res? {
        return Err(errors::ActorRunnerFailed { actor_id }.build());
    }
}

Issue: If the actor becomes ready while we're checking pool errors, we might incorrectly return ActorRunnerFailed instead of the successful runner_id.

Recommendation: The pool_error_check_fut should be cancelled/ignored once ready_sub fires. The current implementation via tokio::select! should handle this correctly (first branch to complete wins), but this assumes the ready check completes before the error check in the race case.

4. Inconsistent Error Information 🔍

File: engine/packages/types/src/actor/error.rs:33

The NoCapacity variant doesn't include which runner was requested:

NoCapacity,  // Missing context

Recommendation: Add runner_name_selector to NoCapacity for better debugging:

NoCapacity { runner_name: String },

This would match the TypeScript error type which already includes it.

Performance Considerations

Positive ✅

  1. Efficient caching: The 500ms TTL prevents excessive workflow queries
  2. Batch processing: Error tracker processes up to 100 signals at once

Minor Concerns 🔍

  1. N+1 Query Pattern: In api-peer/src/runner_configs.rs, the code fetches runner configs, then separately fetches pool errors. This is acceptable since both operations are cached.

Security Concerns

Minor 🔍

  1. Error Message Leakage: ServerlessHttpError includes the full response body, which could leak sensitive information from serverless endpoints. Consider truncating or sanitizing error bodies.

Testing

Excellent Coverage ✅

  • Mock serverless servers for different error scenarios
  • Guard fail-fast behavior validation
  • Error tracker workflow state management
  • Integration tests for the full error flow

Missing Tests 🔍

  1. Staleness logic: No tests verify that errors older than 5 minutes are filtered out
  2. Error clearing: Tests verify error setting but not the "3 consecutive successes" clearing logic

Documentation

Strengths ✅

  • Comprehensive ACTOR_ERRORS.md with flow diagram
  • Well-commented code explaining complex logic
  • Clear error messages for users

Improvements 🔍

  1. Add doc comments to the get_error operation explaining staleness behavior
  2. Document the guard fail-fast delay rationale (why 1 second initial delay?)

Minor Issues

  1. Typo: rivetkit-typescript/packages/rivetkit/src/client/actor-conn.ts:141 - Missing 'h' in comment URL

Recommendations Summary

High Priority

  1. ⚠️ Implement staleness checking in get_error.rs - This is documented but not implemented
  2. ⚠️ Align Rust and TypeScript error types - Prevents runtime errors from type mismatches

Medium Priority

  1. 🔍 Add runner_name to NoCapacity error variant
  2. 🔍 Add test coverage for staleness and error clearing logic
  3. 🔍 Consider sanitizing serverless error response bodies

Low Priority

  1. 📝 Fix typo in TypeScript comment
  2. 🔍 Make timing constants configurable

Conclusion

This is a well-designed feature with excellent documentation and comprehensive testing. The two-layer error architecture is clean, and the runner pool error tracker is a smart solution for monitoring serverless health.

Main concerns:

  1. Missing staleness check implementation (documented but not coded)
  2. Type mismatches between Rust and TypeScript
  3. Minor race condition consideration in guard fail-fast logic

Overall: Approve with minor changes recommended ✅

The implementation is production-ready after addressing the staleness check issue. The other recommendations are improvements but not blockers.

@NathanFlurry NathanFlurry changed the base branch from 12-23-chore_docs_document_abort_signal to graphite-base/3730 January 2, 2026 22:30
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from fa230af to 7039e59 Compare January 2, 2026 22:30
@NathanFlurry NathanFlurry changed the base branch from graphite-base/3730 to 01-02-chore_gas_add_support_for_find_workflows_ January 2, 2026 22:31
@jog1t jog1t force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to 79307ef Compare January 2, 2026 23:33
@jog1t jog1t force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 7039e59 to 653f032 Compare January 2, 2026 23:33
@NathanFlurry NathanFlurry force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from 79307ef to d3bc69d Compare January 2, 2026 23:59
@NathanFlurry NathanFlurry force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 653f032 to 48e0be0 Compare January 2, 2026 23:59
@jog1t jog1t force-pushed the 12-23-feat_engine_return_error_for_pending_actors branch from 48e0be0 to d0406b1 Compare January 3, 2026 00:05
@jog1t jog1t force-pushed the 01-02-chore_gas_add_support_for_find_workflows_ branch from d3bc69d to 79307ef Compare January 3, 2026 00:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants