Conversation

@MasterPtato (Contributor)

No description provided.

MasterPtato (Contributor, Author) commented Oct 23, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account to use the merge queue.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite.

vercel bot commented Oct 23, 2025

The latest updates on your projects.

3 Skipped Deployments

| Project | Deployment | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| rivet-cloud | Ignored | Ignored | | Oct 23, 2025 10:54pm |
| rivet-inspector | Ignored | Ignored | | Oct 23, 2025 10:54pm |
| rivet-site | Ignored | Ignored | | Oct 23, 2025 10:54pm |

claude bot commented Oct 23, 2025

Pull Request Review: Parallelize Runner Config Endpoints

Summary

This PR successfully parallelizes datacenter operations in runner config endpoints (upsert and delete) using futures_util::stream with buffer_unordered(16). The changes improve performance by making concurrent requests to multiple datacenters instead of sequential requests.

Code Quality ✅

Strengths

  1. Consistent Pattern Applied: The parallelization pattern is consistently applied across both delete.rs and upsert.rs, and matches the existing pattern in api-util/src/lib.rs
  2. Proper Clone Derives: Added Clone derive to DeleteQuery and DeletePath structs in api-peer/src/runner_configs.rs:124,134 to enable sharing across async tasks
  3. Clear Intent: The inline comments on lines 78 and 154 clarify the error-handling behavior: "NOTE: We must error when any peer request fails, not all"
  4. Observability: Added tracing instrumentation to HTTP requests in runner_configs/utils.rs:92 and namespace/ops/resolve_for_name_global.rs:38 for better debugging

Refactoring Quality

The transformation from sequential to parallel follows a clean pattern (a minimal sketch follows the list):

  • Sequential loop → futures_util::stream::iter()
  • Move cloning inside the async closure to avoid lifetime issues
  • Use buffer_unordered(16) for concurrent execution (matching api-util)
  • Use try_collect::<Vec<_>>() for proper error propagation
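
A minimal sketch of that shape, with a placeholder datacenter type and per-datacenter request standing in for the PR's actual code:

use anyhow::Result;
use futures_util::{StreamExt, TryStreamExt};

// Placeholder type; the real code iterates ctx.config().topology().datacenters.
struct Datacenter {
	name: String,
}

// Placeholder for the real per-datacenter peer request.
async fn request_datacenter(dc: Datacenter) -> Result<bool> {
	Ok(!dc.name.is_empty())
}

async fn fanout(dcs: Vec<Datacenter>) -> Result<Vec<bool>> {
	futures_util::stream::iter(dcs)
		// Move per-item state into the closure so each future owns what it needs.
		.map(|dc| async move { request_datacenter(dc).await })
		// Run up to 16 requests concurrently, completing in any order.
		.buffer_unordered(16)
		// Propagate the first error: any failed peer request fails the whole call.
		.try_collect::<Vec<_>>()
		.await
}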

Potential Issues 🔍

1. Error Handling Inconsistency (Minor)

Location: upsert.rs:148 and delete.rs:72

The error handling differs between branches:

// In upsert.rs:123 (with runner_config)
anyhow::Ok(response.endpoint_config_changed)

// In upsert.rs:148 (without runner_config)  
Ok(false)

While both work, using anyhow::Ok consistently would be clearer. The Ok(false) on line 148 relies on type inference to resolve the error type to anyhow::Error, which is less explicit.

Recommendation: Use anyhow::Ok(false) on line 148 for consistency.
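
For context, anyhow::Ok is plain Ok with the error type pinned to anyhow::Error; a trivial standalone illustration (the function names here are made up):

// With an annotated return type, plain Ok(false) infers fine:
fn annotated() -> anyhow::Result<bool> {
	Ok(false)
}

// Inside an async block there is no written return type, so
// anyhow::Ok(false) states the error type at the expression site:
async fn inside_closure() -> anyhow::Result<bool> {
	let fut = async { anyhow::Ok(false) };
	fut.await
}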

2. Unnecessary Cloning in upsert.rs (Performance)

Location: upsert.rs:89-95

let dcs = ctx
    .config()
    .topology()
    .datacenters
    .iter()
    .map(|dc| (dc.clone(), body.datacenters.remove(&dc.name)))
    .collect::<Vec<_>>();

This creates an intermediate Vec that clones all datacenters, then immediately consumes it. This could be streamlined to:

futures_util::stream::iter(
    ctx.config()
        .topology()
        .datacenters
        .iter()
        .map(|dc| (dc.clone(), body.datacenters.remove(&dc.name)))
)

However, there's a subtle issue: body.datacenters.remove() mutates body, which may require the current approach. The pattern is acceptable but worth noting.

3. Context vs with_context Usage (Style)

Location: api-util/src/lib.rs:163

Changed from:

return Err(res).with_context(|| "all datacenter requests failed");

to:

return Err(res).context("all datacenter requests failed");

This is an improvement! For a static message, .context() is simpler than .with_context(|| ...), which wraps the string in a closure that buys nothing when there is no lazy computation to defer. Good catch!
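
A standalone sketch of the trade-off (the file paths and messages are placeholders):

use anyhow::{Context, Result};

fn eager_message() -> Result<String> {
	// .context takes the message eagerly; for a static &str nothing is
	// formatted or allocated up front, so there is no reason to defer it.
	std::fs::read_to_string("config.toml").context("failed to read config")
}

fn lazy_message(path: &str) -> Result<String> {
	// .with_context defers building the message until an error actually
	// occurs, so the format! allocation is skipped on the success path.
	std::fs::read_to_string(path).with_context(|| format!("failed to read {path}"))
}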

Performance Considerations ✅

Positive Impact

  1. Concurrency: The buffer_unordered(16) allows up to 16 concurrent datacenter requests, which should significantly reduce latency for multi-datacenter deployments
  2. No Blocking: Changed from sequential blocking to parallel async execution
  3. Consistent Buffer Size: Uses the same buffer size (16) as the existing fanout_to_datacenters function in api-util

Considerations

  • Network Pressure: With 16 concurrent requests, this could create spikes in network traffic. This seems reasonable for datacenter-to-datacenter communication and matches existing patterns
  • Memory Usage: Collecting results into Vec<_> is fine for a small number of datacenters

Security Concerns 🔒

No Issues Found

  1. ✅ Authentication check (ctx.auth().await?) remains at the start of both functions
  2. ✅ Error propagation with ? ensures failures are handled properly
  3. ✅ No sensitive data exposed in new tracing spans

Test Coverage ⚠️

Missing Tests

I couldn't find existing tests for the runner config endpoints. While this PR doesn't introduce new functionality (only parallelization), it would be beneficial to have:

  • Integration tests verifying concurrent datacenter operations
  • Tests ensuring error handling when some (but not all) datacenters fail
  • Tests confirming the endpoint_config_changed aggregation logic works correctly

Recommendation: Consider adding integration tests in a follow-up PR to cover these critical multi-datacenter workflows.
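
As one concrete shape for the aggregation test, a minimal sketch; the helper name is hypothetical and stands in for whatever logic folds the per-datacenter endpoint_config_changed flags:

// Hypothetical helper: the endpoint reports a change if any datacenter did.
fn aggregate_endpoint_config_changed(per_dc: &[bool]) -> bool {
	per_dc.iter().any(|&changed| changed)
}

#[cfg(test)]
mod tests {
	use super::*;

	#[test]
	fn changed_when_any_datacenter_changed() {
		assert!(aggregate_endpoint_config_changed(&[false, true, false]));
	}

	#[test]
	fn unchanged_when_no_datacenter_changed() {
		assert!(!aggregate_endpoint_config_changed(&[false, false]));
	}
}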

Best Practices Alignment ✅

Follows CLAUDE.md Guidelines

  1. ✅ Uses anyhow::* import pattern (line 1 in both files)
  2. ✅ Proper error handling with Result<T> return types
  3. ✅ Structured logging with tracing (existing #[tracing::instrument] preserved)
  4. ✅ Hard tabs for formatting (consistent with rustfmt.toml)

Minor Observations

  1. Import Organization (delete.rs:1-3, upsert.rs:1-5): Imports are well-organized and include only what's needed
  2. Code Duplication: The parallel pattern appears three times now (delete, upsert, fanout_to_datacenters). Consider extracting a helper function if this pattern grows further, but current duplication is acceptable given the slight variations

Conclusion

This is a well-executed performance improvement that follows existing patterns in the codebase. The changes are clean, consistent, and maintain proper error handling.

Recommendations Priority

  1. Low Priority: Use anyhow::Ok consistently in upsert.rs:148
  2. Low Priority: Add integration tests in a follow-up PR
  3. Nice to Have: Consider a helper function if this pattern needs to be used elsewhere

Overall: Approved

Great work parallelizing these endpoints! The performance improvement should be significant for multi-datacenter deployments.


graphite-app bot commented Oct 24, 2025

Merge activity

  • Oct 24, 6:47 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Oct 24, 6:47 AM UTC: CI is running for this pull request on a draft pull request (#3257) due to your merge queue CI optimization settings.
  • Oct 24, 6:48 AM UTC: Merged by the Graphite merge queue via draft PR: #3257.

graphite-app bot pushed a commit that referenced this pull request on Oct 24, 2025.
graphite-app bot closed this on Oct 24, 2025.
graphite-app bot deleted the 10-23-fix_parallelize_runner_config_endpoints branch on Oct 24, 2025 at 06:48 UTC.