[Graphite MQ] Draft PR GROUP:spec_7f4c8b (PRs 3237, 3247, 3248, 3253) #3257

graphite-app · 2025-10-24T06:47:45Z

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

vercel · 2025-10-24T06:47:50Z

The latest updates on your projects. Learn more about Vercel for GitHub.

3 Skipped Deployments

Project	Deployment	Preview	Updated (UTC)
rivet-cloud	Ignored		Oct 24, 2025 6:47am
rivet-inspector	Ignored	Preview	Oct 24, 2025 6:47am
rivet-site	Ignored	Preview	Oct 24, 2025 6:47am

claude · 2025-10-24T06:49:20Z

Code Review for PR #3257

This PR combines 4 separate PRs (#3237, #3247, #3248, #3253) focused on engine runner improvements, logging additions, actor sleep edge case handling, and parallelization of runner config endpoints.

✅ Strengths

Parallelization Improvements - The conversion to parallel processing in runner_configs/upsert.rs and runner_configs/delete.rs using buffer_unordered(16) is excellent for performance. This should significantly reduce latency when operating across multiple datacenters.
Better Error Handling - The actor sleep edge case handling in pegboard_gateway.rs with retry logic (up to 3 attempts) is a solid improvement for reliability.
Improved Logging - Added instrumentation spans like custom_instrument(tracing::info_span!("fetch_metadata_request")) and structured logging improvements follow the project's conventions well.
Type Safety - Added Clone derives where needed for parallelization without compromising type safety.

⚠️ Issues & Concerns

1. Typo in Log Message (`pegboard_gateway.rs:152`)

"actor stopped while we were waiting for it to beocme ready, attempting rewake"

Should be "become" not "beocme".

2. Inconsistent Error Handling in Parallel Operations (`api-public/src/runner_configs/upsert.rs:154`)

The comment states:

// NOTE: We must error when any peer request fails, not all

This is correct behavior for try_collect, but the comment in api-util/lib.rs:160 has different semantics:

// Error only if all requests failed

These two approaches are inconsistent. For critical operations like runner config updates, failing fast (any error) is correct, but this inconsistency should be documented or reconsidered.

3. Missing Error Context (`api-public/src/runner_configs/upsert.rs:123`)

anyhow::Ok(response.endpoint_config_changed)

While this works, consider adding context about which datacenter failed if the request errors out, to aid debugging in multi-DC setups.

4. Race Condition Risk (`guard/src/routing/pegboard_gateway.rs:144-172`)

The retry logic for stopped actors has a potential race condition:

Actor stops while waiting for ready
We send wake signal
Actor might be destroyed between checking and waking

The code handles WorkflowNotFound gracefully, but there's no check if the actor was destroyed between iterations. Consider subscribing to destroy events within the retry loop.

5. Magic Number (`guard/src/routing/pegboard_gateway.rs:151`)

if wake_retries < 3 {

The value 3 should be a named constant like MAX_WAKE_RETRIES for better maintainability.

6. Breaking Change (`sdks/typescript/runner/src/mod.ts:48-49`)

fetch: (runner: Runner, actorId: string, request: Request) => Promise<Response>;
websocket?: (runner: Runner, actorId: string, ws: any, request: Request) => Promise<void>;

The addition of the runner parameter to these callbacks is a breaking API change. Ensure this is documented in release notes and migration guides.

7. Unused Code Removal (`engine/src/util/wf/signal.rs:57-60`)

The diff shows duplicate code was removed, which is good, but ensure this wasn't masking a bug where the conditional check was intended.

🔒 Security Considerations

Header Validation - The change from X_RIVET_NAMESPACE_ID to X_RIVET_NAMESPACE_NAME in pegboard-serverless/src/lib.rs:29 is good, but the deprecated header is still supported. Ensure there's a plan to remove the deprecated header in a future version.
No Direct Security Issues - No obvious security vulnerabilities introduced.

📊 Performance

The parallelization changes should significantly improve performance for multi-datacenter operations. The use of buffer_unordered(16) is appropriate for I/O-bound operations.

🧪 Testing

Consider adding tests for:

Actor wake retry logic edge cases
Parallel datacenter operations with partial failures
The race condition scenario mentioned above

📝 Documentation

The TypeScript SDK breaking changes need documentation
Consider adding comments explaining the retry logic and why 3 attempts was chosen
Document the different error handling strategies between upsert and the generic request_all_datacenters_aggregated function

Minor Suggestions

In pegboard/src/workflows/actor/mod.rs:295-298, consider extracting the tuple unpacking to improve readability:

let Some(runner_id) = state.runner_id else { ... };
let Some(runner_workflow_id) = state.runner_workflow_id else { ... };

The error rename from WorkflowStopped to WorkflowEvicted is semantically clearer ✅

Overall, this is solid work with meaningful improvements to reliability and performance. The issues noted are mostly minor and should be addressed before merging.

MasterPtato added 4 commits October 24, 2025 06:47

chore: engine runner improvements (#3237)

cdbe39a

chore: add logs (#3247)

3d7a3d3

fix: handle edge cases with gateway + actor sleep (#3248)

edd820d

fix: parallelize runner config endpoints (#3253)

a82d479

graphite-app bot closed this Oct 24, 2025

graphite-app bot deleted the gtmq_spec_7f4c8b_1761288462971-68c21977-5c6a-49f7-bb03-4bcb5893317b branch October 24, 2025 06:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Graphite MQ] Draft PR GROUP:spec_7f4c8b (PRs 3237, 3247, 3248, 3253) #3257

[Graphite MQ] Draft PR GROUP:spec_7f4c8b (PRs 3237, 3247, 3248, 3253) #3257

Uh oh!

graphite-app bot commented Oct 24, 2025

Uh oh!

vercel bot commented Oct 24, 2025 •

edited

Loading

Uh oh!

claude bot commented Oct 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

[Graphite MQ] Draft PR GROUP:spec_7f4c8b (PRs 3237, 3247, 3248, 3253) #3257

[Graphite MQ] Draft PR GROUP:spec_7f4c8b (PRs 3237, 3247, 3248, 3253) #3257

Uh oh!

Conversation

graphite-app bot commented Oct 24, 2025

Uh oh!

vercel bot commented Oct 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude bot commented Oct 24, 2025

Code Review for PR #3257

✅ Strengths

⚠️ Issues & Concerns

1. Typo in Log Message (pegboard_gateway.rs:152)

2. Inconsistent Error Handling in Parallel Operations (api-public/src/runner_configs/upsert.rs:154)

3. Missing Error Context (api-public/src/runner_configs/upsert.rs:123)

4. Race Condition Risk (guard/src/routing/pegboard_gateway.rs:144-172)

5. Magic Number (guard/src/routing/pegboard_gateway.rs:151)

6. Breaking Change (sdks/typescript/runner/src/mod.ts:48-49)

7. Unused Code Removal (engine/src/util/wf/signal.rs:57-60)

🔒 Security Considerations

📊 Performance

🧪 Testing

📝 Documentation

Minor Suggestions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel bot commented Oct 24, 2025 •

edited

Loading

1. Typo in Log Message (`pegboard_gateway.rs:152`)

2. Inconsistent Error Handling in Parallel Operations (`api-public/src/runner_configs/upsert.rs:154`)

3. Missing Error Context (`api-public/src/runner_configs/upsert.rs:123`)

4. Race Condition Risk (`guard/src/routing/pegboard_gateway.rs:144-172`)

5. Magic Number (`guard/src/routing/pegboard_gateway.rs:151`)

6. Breaking Change (`sdks/typescript/runner/src/mod.ts:48-49`)

7. Unused Code Removal (`engine/src/util/wf/signal.rs:57-60`)