Skip to content

Conversation

@graphite-app
Copy link
Contributor

@graphite-app graphite-app bot commented Dec 12, 2025

This draft PR was created by the Graphite merge queue.
Trunk will be fast forwarded to the HEAD of this PR when CI passes, and the original PRs will be closed.

The following PRs are included in this draft PR:

@vercel
Copy link

vercel bot commented Dec 12, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Dec 12, 2025 1:58am
rivet-inspector Ignored Ignored Preview Dec 12, 2025 1:58am
rivet-site Ignored Ignored Preview Dec 12, 2025 1:58am
rivetkit-serverless Skipped Skipped Dec 12, 2025 1:58am

@vercel vercel bot temporarily deployed to Preview – rivetkit-serverless December 12, 2025 01:58 Inactive
@graphite-app graphite-app bot closed this Dec 12, 2025
@vercel vercel bot temporarily deployed to Production – rivetkit-serverless December 12, 2025 01:58 Inactive
@graphite-app graphite-app bot deleted the gtmq_spec_590012_1765504676576-f4911fcb-efdc-4398-80a9-7d33b63e64cd branch December 12, 2025 01:59
@claude
Copy link

claude bot commented Dec 12, 2025

Pull Request Review

This PR combines three related changes (#3609, #3610, #3611) focused on improving cross-datacenter request handling for actors and gateway routing. Overall, the changes are well-structured and address important routing issues.

✅ Strengths

  1. Good Code Reuse: Moving shared types to api-types and the get_or_create implementation to api-peer follows good architectural patterns
  2. Improved Error Handling: Enhanced logging and error context throughout, particularly in api-util and guard-core
  3. Better Separation of Concerns: The refactoring clearly separates datacenter selection logic from the actual get-or-create operation
  4. Consistent Error Messages: The logging improvements use structured logging correctly per CLAUDE.md guidelines

🔍 Code Quality Observations

1. Duplicate Code in extract_duplicate_key_error (engine/packages/api-peer/src/actors/get_or_create.rs:101-139)

The extract_duplicate_key_error function is duplicated from api-public/src/actors/utils.rs. This should be extracted to a shared location:

  • Consider moving to api-util or creating a shared module
  • The function is identical in both places, which creates maintenance burden

2. Path Parameter Naming Inconsistency (engine/packages/guard/src/routing/pegboard_gateway.rs)

Good improvement introducing original_path and stripped_path for clarity:

  • original_path: Full path including actor ID prefix
  • stripped_path: Path with actor prefix removed

However, in route_request_inner (line 147), only original_path is used when routing to peer datacenters, while stripped_path is used for local routing (line 260). This seems correct but deserves a comment explaining why.

Suggestion: Add a comment explaining:

// Use original_path for peer DC routing since the peer will strip the prefix
path: original_path.to_owned(),
// Use stripped_path for local gateway since prefix already removed
stripped_path.to_string(),

3. Error Context Improvements (engine/packages/api-util/src/lib.rs)

Excellent improvements to error handling:

  • Line 183-188: Early validation of response status with detailed error messages
  • Line 210-214, 216-219: Better context on parse failures including ray_id and body

One minor suggestion: The send_request helper (lines 13-21) is good, but consider whether timeout errors should be handled specially.

4. Logging Improvements (engine/packages/guard-core/src/proxy_service.rs)

Great improvements to WebSocket error logging:

  • Lines 1493-1497: Changed from debug to warn level - appropriate for errors
  • Lines 1571-1577, 2107-2113: Consistent error logging before closing WebSocket

Minor issue: Line 1494 uses tracing::warn!(?err, "...") consistently, which is correct per CLAUDE.md.

⚠️ Potential Issues

1. Race Condition in get_or_create (engine/packages/api-peer/src/actors/get_or_create.rs:23-38)

The check-then-create pattern has a race condition:

let existing = ctx.op(pegboard::ops::actor::get_for_key::Input { ... }).await?;
if let Some(actor) = existing.actor {
    return Ok(GetOrCreateResponse { actor, created: false });
}
// Actor doesn't exist, create it
let actor_id = Id::new_v1(ctx.config().dc_label());
match ctx.op(pegboard::ops::actor::create::Input { ... }).await { ... }

Issue: Between checking (line 24) and creating (line 43), another request could create the same actor. The duplicate key error handling (lines 64-90) mitigates this, but it's not ideal.

Recommendation: Consider documenting this is expected behavior, or explore whether the database operation can be atomic. The current approach works but relies on error recovery rather than prevention.

2. Missing datacenter Field Validation (engine/packages/api-types/src/actors/get_or_create.rs:15)

The comment says "Ignored in api-peer" but there's no validation that it's actually ignored:

// Ignored in api-peer
pub datacenter: Option<String>,

In api-peer/src/actors/get_or_create.rs, line 56 sets datacenter_name: None, which is correct. However, if a client sends this to api-peer directly (bypassing api-public), it will silently ignore the datacenter preference.

Recommendation: Either document this as expected behavior in the function doc comment, or add a warning log if datacenter is set when called via api-peer.

🎯 Performance Considerations

  1. String Cloning: Lines 18, 27, 28, 47, 48, 50 in get_or_create.rs clone strings multiple times. Consider if references could be used instead, though the impact is likely minimal.

  2. Double Parsing: In extract_duplicate_key_error, the actor_id is parsed from string twice (lines 111, 130). This is acceptable given it's an error path.

🔒 Security Considerations

  1. Token Validation: Line 115 in pegboard_gateway.rs has a comment "NOTE: Token validation implemented in EE". This is fine for the open-source version but ensure the EE implementation is in place.

  2. Error Message Exposure: The improved error messages in api-util (lines 185-186, 212, 218) expose internal details like ray_id and response bodies. This is appropriate for debugging but ensure sensitive data isn't logged.

🧪 Test Coverage

  1. Test Updates: The test updates in parse_actor_path.rs correctly reflect the field rename from remaining_path to stripped_path. All tests updated consistently.

  2. Missing Tests: No new tests for the get_or_create cross-datacenter routing logic. Consider adding integration tests for:

    • get_or_create with existing actor in remote DC
    • get_or_create with duplicate key race condition
    • Error handling when remote DC is unavailable

📝 Minor Issues

  1. OpenAPI Schema: Adding "additionalProperties": false to ActorsGetOrCreateResponse is good for API validation.

  2. Import Organization: All imports are at the top of files as required by CLAUDE.md ✓

  3. Structured Logging: All new logging uses structured format correctly (e.g., tracing::warn!(?err, "message")) ✓

  4. Hard Tabs: Assuming rustfmt is handling this correctly ✓

📋 Recommendations Summary

High Priority:

  1. Extract extract_duplicate_key_error to shared location to eliminate duplication
  2. Document the race condition behavior in get_or_create or make it atomic
  3. Add integration tests for cross-datacenter get_or_create scenarios

Medium Priority:

  1. Add explanatory comments for why original_path vs stripped_path are used in different contexts
  2. Document or validate the datacenter field behavior in api-peer

Low Priority:

  1. Consider minor optimizations to reduce string cloning where possible

✨ Overall Assessment

This is a solid PR that improves cross-datacenter routing reliability and observability. The code follows Rivet conventions well, and the error handling improvements are valuable. The main concerns are code duplication and the check-then-create race condition, which should be addressed before merging.

Recommendation: Approve with minor changes requested for code deduplication and documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants