Conversation

@MasterPtato (Contributor)

No description provided.

MasterPtato (Contributor, Author) commented Oct 23, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account to use the merge queue.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite.

vercel bot commented Oct 23, 2025

The latest updates on your projects.

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | | Oct 23, 2025 10:54pm |
| rivet-inspector | Ignored | Ignored | | Oct 23, 2025 10:54pm |
| rivet-site | Ignored | Ignored | | Oct 23, 2025 10:54pm |

claude bot commented Oct 23, 2025

Pull Request Review: Parallelize Runner Config Endpoints

Summary

This PR successfully parallelizes datacenter operations in runner config endpoints (upsert and delete) using futures_util::stream with buffer_unordered(16). The changes improve performance by making concurrent requests to multiple datacenters instead of sequential requests.

Code Quality ✅

Strengths

  1. Consistent Pattern Applied: The parallelization pattern is consistently applied across both delete.rs and upsert.rs, and matches the existing pattern in api-util/src/lib.rs
  2. Proper Clone Derives: Added Clone derive to DeleteQuery and DeletePath structs in api-peer/src/runner_configs.rs:124,134 to enable sharing across async tasks
  3. Clear Intent: The inline comments on lines 78 and 154 clarify the error-handling behavior: "NOTE: We must error when any peer request fails, not all"
  4. Observability: Added tracing instrumentation to HTTP requests in runner_configs/utils.rs:92 and namespace/ops/resolve_for_name_global.rs:38 for better debugging

Refactoring Quality

The transformation from sequential to parallel follows a clean pattern (a minimal sketch follows the list):

  • Sequential loop → futures_util::stream::iter()
  • Move cloning inside the async closure to avoid lifetime issues
  • Use buffer_unordered(16) for concurrent execution (matching api-util)
  • Use try_collect::<Vec<_>>() for proper error propagation
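
A minimal sketch of that shape, with a placeholder datacenter type and per-datacenter request standing in for the PR's actual code:

use anyhow::Result;
use futures_util::{StreamExt, TryStreamExt};

// Placeholder type; the real code iterates ctx.config().topology().datacenters.
struct Datacenter {
	name: String,
}

// Placeholder for the real per-datacenter peer request.
async fn request_datacenter(dc: Datacenter) -> Result<bool> {
	Ok(!dc.name.is_empty())
}

async fn fanout(dcs: Vec<Datacenter>) -> Result<Vec<bool>> {
	futures_util::stream::iter(dcs)
		// Move per-item state into the closure so each future owns what it needs.
		.map(|dc| async move { request_datacenter(dc).await })
		// Run up to 16 requests concurrently, completing in any order.
		.buffer_unordered(16)
		// Propagate the first error: any failed peer request fails the whole call.
		.try_collect::<Vec<_>>()
		.await
}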

Potential Issues 🔍

1. Error Handling Inconsistency (Minor)

Location: upsert.rs:148 and delete.rs:72

The error handling differs between branches:

// In upsert.rs:123 (with runner_config)
anyhow::Ok(response.endpoint_config_changed)

// In upsert.rs:148 (without runner_config)  
Ok(false)

While both work, using anyhow::Ok consistently would be clearer. The Ok(false) on line 148 relies on type inference to resolve the error type to anyhow::Error, which is less explicit.

Recommendation: Use anyhow::Ok(false) on line 148 for consistency.
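
For context, anyhow::Ok is plain Ok with the error type pinned to anyhow::Error; a trivial standalone illustration (the function names here are made up):

// With an annotated return type, plain Ok(false) infers fine:
fn annotated() -> anyhow::Result<bool> {
	Ok(false)
}

// Inside an async block there is no written return type, so
// anyhow::Ok(false) states the error type at the expression site:
async fn inside_closure() -> anyhow::Result<bool> {
	let fut = async { anyhow::Ok(false) };
	fut.await
}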

2. Unnecessary Cloning in upsert.rs (Performance)

Location: upsert.rs:89-95

let dcs = ctx
    .config()
    .topology()
    .datacenters
    .iter()
    .map(|dc| (dc.clone(), body.datacenters.remove(&dc.name)))
    .collect::<Vec<_>>();

This creates an intermediate Vec that clones all datacenters, then immediately consumes it. This could be streamlined to:

futures_util::stream::iter(
    ctx.config()
        .topology()
        .datacenters
        .iter()
        .map(|dc| (dc.clone(), body.datacenters.remove(&dc.name)))
)

However, there's a subtle issue: body.datacenters.remove() mutates body, which may require the current approach. The pattern is acceptable but worth noting.

3. Context vs with_context Usage (Style)

Location: api-util/src/lib.rs:163

Changed from:

return Err(res).with_context(|| "all datacenter requests failed");

to:

return Err(res).context("all datacenter requests failed");

This is an improvement! For a static message, .context() is simpler than .with_context(|| ...), which wraps the string in a closure that buys nothing when there is no lazy computation to defer. Good catch!
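
A standalone sketch of the trade-off (the file paths and messages are placeholders):

use anyhow::{Context, Result};

fn eager_message() -> Result<String> {
	// .context takes the message eagerly; for a static &str nothing is
	// formatted or allocated up front, so there is no reason to defer it.
	std::fs::read_to_string("config.toml").context("failed to read config")
}

fn lazy_message(path: &str) -> Result<String> {
	// .with_context defers building the message until an error actually
	// occurs, so the format! allocation is skipped on the success path.
	std::fs::read_to_string(path).with_context(|| format!("failed to read {path}"))
}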

Performance Considerations ✅

Positive Impact

  1. Concurrency: The buffer_unordered(16) allows up to 16 concurrent datacenter requests, which should significantly reduce latency for multi-datacenter deployments
  2. No Blocking: Changed from sequential blocking to parallel async execution
  3. Consistent Buffer Size: Uses the same buffer size (16) as the existing fanout_to_datacenters function in api-util

Considerations

  • Network Pressure: With 16 concurrent requests, this could create spikes in network traffic. This seems reasonable for datacenter-to-datacenter communication and matches existing patterns
  • Memory Usage: Collecting results into Vec<_> is fine for a small number of datacenters

Security Concerns 🔒

No Issues Found

  1. ✅ Authentication check (ctx.auth().await?) remains at the start of both functions
  2. ✅ Error propagation with ? ensures failures are handled properly
  3. ✅ No sensitive data exposed in new tracing spans

Test Coverage ⚠️

Missing Tests

I couldn't find existing tests for the runner config endpoints. While this PR doesn't introduce new functionality (only parallelization), it would be beneficial to have:

  • Integration tests verifying concurrent datacenter operations
  • Tests ensuring error handling when some (but not all) datacenters fail
  • Tests confirming the endpoint_config_changed aggregation logic works correctly

Recommendation: Consider adding integration tests in a follow-up PR to cover these critical multi-datacenter workflows.
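
As one concrete shape for the aggregation test, a minimal sketch; the helper name is hypothetical and stands in for whatever logic folds the per-datacenter endpoint_config_changed flags:

// Hypothetical helper: the endpoint reports a change if any datacenter did.
fn aggregate_endpoint_config_changed(per_dc: &[bool]) -> bool {
	per_dc.iter().any(|&changed| changed)
}

#[cfg(test)]
mod tests {
	use super::*;

	#[test]
	fn changed_when_any_datacenter_changed() {
		assert!(aggregate_endpoint_config_changed(&[false, true, false]));
	}

	#[test]
	fn unchanged_when_no_datacenter_changed() {
		assert!(!aggregate_endpoint_config_changed(&[false, false]));
	}
}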

Best Practices Alignment ✅

Follows CLAUDE.md Guidelines

  1. ✅ Uses anyhow::* import pattern (line 1 in both files)
  2. ✅ Proper error handling with Result<T> return types
  3. ✅ Structured logging with tracing (existing #[tracing::instrument] preserved)
  4. ✅ Hard tabs for formatting (consistent with rustfmt.toml)

Minor Observations

  1. Import Organization (delete.rs:1-3, upsert.rs:1-5): Imports are well-organized and include only what's needed
  2. Code Duplication: The parallel pattern appears three times now (delete, upsert, fanout_to_datacenters). Consider extracting a helper function if this pattern grows further, but current duplication is acceptable given the slight variations

Conclusion

This is a well-executed performance improvement that follows existing patterns in the codebase. The changes are clean, consistent, and maintain proper error handling.

Recommendations Priority

  1. Low Priority: Use anyhow::Ok consistently in upsert.rs:148
  2. Low Priority: Add integration tests in a follow-up PR
  3. Nice to Have: Consider a helper function if this pattern needs to be used elsewhere

Overall: Approved

Great work parallelizing these endpoints! The performance improvement should be significant for multi-datacenter deployments.


graphite-app bot commented Oct 24, 2025

Merge activity

  • Oct 24, 6:47 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Oct 24, 6:47 AM UTC: CI is running for this pull request on a draft pull request (#3257) due to your merge queue CI optimization settings.
  • Oct 24, 6:48 AM UTC: Merged by the Graphite merge queue via draft PR: #3257.

graphite-app bot pushed a commit that referenced this pull request on Oct 24, 2025.
graphite-app bot closed this on Oct 24, 2025.
graphite-app bot deleted the 10-23-fix_parallelize_runner_config_endpoints branch on Oct 24, 2025 at 06:48 UTC.