Skip to content

Conversation

@graphite-app
Copy link
Contributor

@graphite-app graphite-app bot commented Oct 24, 2025

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

@vercel
Copy link

vercel bot commented Oct 24, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Oct 24, 2025 6:47am
rivet-inspector Ignored Ignored Preview Oct 24, 2025 6:47am
rivet-site Ignored Ignored Preview Oct 24, 2025 6:47am

@graphite-app graphite-app bot closed this Oct 24, 2025
@graphite-app graphite-app bot deleted the gtmq_spec_7f4c8b_1761288462971-68c21977-5c6a-49f7-bb03-4bcb5893317b branch October 24, 2025 06:49
@claude
Copy link

claude bot commented Oct 24, 2025

Code Review for PR #3257

This PR combines 4 separate PRs (#3237, #3247, #3248, #3253) focused on engine runner improvements, logging additions, actor sleep edge case handling, and parallelization of runner config endpoints.

✅ Strengths

  1. Parallelization Improvements - The conversion to parallel processing in runner_configs/upsert.rs and runner_configs/delete.rs using buffer_unordered(16) is excellent for performance. This should significantly reduce latency when operating across multiple datacenters.

  2. Better Error Handling - The actor sleep edge case handling in pegboard_gateway.rs with retry logic (up to 3 attempts) is a solid improvement for reliability.

  3. Improved Logging - Added instrumentation spans like custom_instrument(tracing::info_span!("fetch_metadata_request")) and structured logging improvements follow the project's conventions well.

  4. Type Safety - Added Clone derives where needed for parallelization without compromising type safety.

⚠️ Issues & Concerns

1. Typo in Log Message (pegboard_gateway.rs:152)

"actor stopped while we were waiting for it to beocme ready, attempting rewake"

Should be "become" not "beocme".

2. Inconsistent Error Handling in Parallel Operations (api-public/src/runner_configs/upsert.rs:154)

The comment states:

// NOTE: We must error when any peer request fails, not all

This is correct behavior for try_collect, but the comment in api-util/lib.rs:160 has different semantics:

// Error only if all requests failed

These two approaches are inconsistent. For critical operations like runner config updates, failing fast (any error) is correct, but this inconsistency should be documented or reconsidered.

3. Missing Error Context (api-public/src/runner_configs/upsert.rs:123)

anyhow::Ok(response.endpoint_config_changed)

While this works, consider adding context about which datacenter failed if the request errors out, to aid debugging in multi-DC setups.

4. Race Condition Risk (guard/src/routing/pegboard_gateway.rs:144-172)

The retry logic for stopped actors has a potential race condition:

  • Actor stops while waiting for ready
  • We send wake signal
  • Actor might be destroyed between checking and waking

The code handles WorkflowNotFound gracefully, but there's no check if the actor was destroyed between iterations. Consider subscribing to destroy events within the retry loop.

5. Magic Number (guard/src/routing/pegboard_gateway.rs:151)

if wake_retries < 3 {

The value 3 should be a named constant like MAX_WAKE_RETRIES for better maintainability.

6. Breaking Change (sdks/typescript/runner/src/mod.ts:48-49)

fetch: (runner: Runner, actorId: string, request: Request) => Promise<Response>;
websocket?: (runner: Runner, actorId: string, ws: any, request: Request) => Promise<void>;

The addition of the runner parameter to these callbacks is a breaking API change. Ensure this is documented in release notes and migration guides.

7. Unused Code Removal (engine/src/util/wf/signal.rs:57-60)

The diff shows duplicate code was removed, which is good, but ensure this wasn't masking a bug where the conditional check was intended.

🔒 Security Considerations

  1. Header Validation - The change from X_RIVET_NAMESPACE_ID to X_RIVET_NAMESPACE_NAME in pegboard-serverless/src/lib.rs:29 is good, but the deprecated header is still supported. Ensure there's a plan to remove the deprecated header in a future version.

  2. No Direct Security Issues - No obvious security vulnerabilities introduced.

📊 Performance

The parallelization changes should significantly improve performance for multi-datacenter operations. The use of buffer_unordered(16) is appropriate for I/O-bound operations.

🧪 Testing

Consider adding tests for:

  1. Actor wake retry logic edge cases
  2. Parallel datacenter operations with partial failures
  3. The race condition scenario mentioned above

📝 Documentation

  • The TypeScript SDK breaking changes need documentation
  • Consider adding comments explaining the retry logic and why 3 attempts was chosen
  • Document the different error handling strategies between upsert and the generic request_all_datacenters_aggregated function

Minor Suggestions

  1. In pegboard/src/workflows/actor/mod.rs:295-298, consider extracting the tuple unpacking to improve readability:
let Some(runner_id) = state.runner_id else { ... };
let Some(runner_workflow_id) = state.runner_workflow_id else { ... };
  1. The error rename from WorkflowStopped to WorkflowEvicted is semantically clearer ✅

Overall, this is solid work with meaningful improvements to reliability and performance. The issues noted are mostly minor and should be addressed before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant