Skip to content

Conversation

@MasterPtato
Copy link
Contributor

No description provided.

@vercel
Copy link

vercel bot commented Nov 5, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Comments Updated (UTC)
rivetkit-serverless Ready Ready Preview Comment Nov 6, 2025 5:37am
3 Skipped Deployments
Project Deployment Preview Comments Updated (UTC)
rivet-cloud Ignored Ignored Nov 6, 2025 5:37am
rivet-inspector Ignored Ignored Preview Nov 6, 2025 5:37am
rivet-site Ignored Ignored Preview Nov 6, 2025 5:37am

Copy link
Contributor Author

MasterPtato commented Nov 5, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link

claude bot commented Nov 5, 2025

Pull Request Review: Fix Boot Dupe Runners

Summary

This PR addresses duplicate runner boot issues by implementing an eviction mechanism using publish/subscribe patterns. The changes prevent race conditions when multiple runners with the same ID or namespace+name+key combination attempt to connect.

Code Quality & Best Practices

✅ Strengths

  1. Race Condition Prevention: The subscribe-then-publish pattern (lines 103-130 in pegboard-runner/src/lib.rs) correctly prevents race conditions by subscribing before publishing eviction messages, ensuring the new connection receives confirmation.

  2. Proper Error Handling: New WsError::Eviction error type provides clear semantics for eviction scenarios (errors.rs:10-13).

  3. Structured Concurrency: The refactored task lifecycle uses tokio::join! instead of tokio::select!, allowing all tasks to complete gracefully before determining the overall result (lines 163-205).

  4. Abort Coordination: Watch channels are used for coordinating task aborts, preventing orphaned tasks.

⚠️ Concerns & Issues

1. Critical: Race Condition in Eviction Logic

Location: pegboard-runner/src/lib.rs:103-130

tokio::try_join!(
    async {
        ups.publish(&eviction_topic, &[], PublishOpts::broadcast()).await?;
        eviction_sub.next().await  // Skips first message
    },
    async {
        ups.publish(&eviction_topic2, &[], PublishOpts::broadcast()).await?;
        eviction_sub2.next().await  // Skips first message
    },
)?;

Issue: This assumes that the first message received will always be the self-published message, but in a distributed system with multiple runners connecting simultaneously, there's no guarantee. If another runner publishes an eviction message between your subscribe and publish, you might skip the wrong message.

Recommendation: Use a correlation ID or timestamp to identify and skip only the self-published message, or use a different mechanism like comparing message metadata.

2. Potential Denial of Service

Location: pegboard-runner/src/lib.rs:103-130

The eviction mechanism could be exploited to repeatedly evict legitimate runners by continuously reconnecting with the same credentials. Consider:

  • Adding rate limiting for evictions
  • Implementing a grace period or backoff strategy
  • Logging suspicious eviction patterns for monitoring

3. Inconsistent Error Prioritization

Location: pegboard-runner/src/lib.rs:211-220 vs pegboard-gateway/src/lib.rs:486-493

In pegboard-runner, the error priority order was changed to prioritize the first task's error:

// Prefer error
(Err(err), _, _) => Err(err),
(_, Err(err), _) => Err(err),
(_, _, Err(err)) => Err(err),

However, in pegboard-gateway, only the order was swapped without clear reasoning:

(Err(err), _) => Err(err),  // Previously was second
(_, Err(err)) => Err(err),  // Previously was first

Recommendation: Document why specific error priorities matter, or make the behavior consistent (e.g., prefer the first error that occurs chronologically, not positionally).

4. Memory Leak Risk in Active Requests

Location: pegboard-runner/src/tunnel_to_ws_task.rs & ws_to_tunnel_task.rs

When tasks are aborted via the watch channel, there's no explicit cleanup of conn.active_tunnel_requests or conn.active_ws_requests. If a task aborts mid-request processing, these HashMaps could retain stale entries.

Recommendation: Add explicit cleanup in abort paths or implement Drop handlers for request tracking.

5. Deprecated Field Without Migration Path

Location: pegboard/src/workflows/runner.rs:459

struct InitOutput {
    /// Deprecated.
    evict_workflow_id: Option<Id>,
}

This field is now always None but still exists in the struct. If this is part of a serialized workflow state, old workflows may have non-None values that are silently ignored.

Recommendation: Either:

  • Add a code comment explaining the migration strategy
  • Add a warning log if a non-None value is encountered
  • Schedule the field for removal in a future version

Performance Considerations

🔍 Observations

  1. Additional PubSub Subscriptions: Each runner now subscribes to 3 topics instead of 1 (receiver + 2 eviction topics). This increases overhead but is acceptable for the functionality gained.

  2. Synchronous Eviction Check: The try_join! on lines 103-130 blocks runner initialization until both eviction messages are published and received. This adds latency to the connection handshake.

Recommendation: Profile the connection establishment time to ensure it remains within acceptable bounds (< 100ms ideally).

  1. Event Filtering: The change in workflows/runner.rs:151-155 filters events twice:
let new_events = events.iter().filter(|event| event.index > state.last_event_idx);
// ... later ...
events: events.into_iter().filter(|event| event.index > state.last_event_idx).collect(),

Recommendation: Store the filtered events once to avoid redundant iteration.

Security Concerns

🔒 Issues

  1. Eviction Topic Predictability: The eviction topics use predictable patterns:
    • pegboard.runner.eviction-by-id.{runner_id}
    • pegboard.runner.eviction-by-name.{namespace_id}.{runner_name}.{runner_key}

If these topics are not properly secured in the pubsub system, an attacker could subscribe to or publish on these topics to cause disruption.

Recommendation: Ensure pubsub access controls are properly configured to prevent unauthorized access to eviction topics.

  1. Missing Authentication Context: The eviction mechanism doesn't verify that the eviction request comes from a legitimate source. The code publishes empty payloads (&[]), so there's no way to validate authenticity.

Recommendation: Consider including signed payloads or using a separate authenticated channel for evictions.

Test Coverage

❌ Missing Tests

The PR doesn't include any test additions for:

  • The eviction mechanism
  • Race condition prevention
  • Concurrent runner connections with duplicate IDs
  • Concurrent runner connections with duplicate names
  • Edge cases in task abort coordination

Recommendation: Add integration tests covering:

#[tokio::test]
async fn test_duplicate_runner_id_eviction() { /* ... */ }

#[tokio::test]
async fn test_duplicate_runner_name_eviction() { /* ... */ }

#[tokio::test]
async fn test_simultaneous_connections_same_credentials() { /* ... */ }

Additional Observations

Minor Issues

  1. Typo Fix: types.ts:3 fixes rivetNamepacerivetNamespace (good catch!)

  2. Formatting Change: test-runner/src/index.ts:163 adds unnecessary whitespace in empty function body. Prefer keeping it as () => {} for consistency.

  3. Unnecessary UPS Context Call: In lib.rs:277, the code now uses the locally cached ups variable instead of calling self.ctx.ups() again - good optimization!

  4. UniversalPubSub Documentation: The updated comment in universalpubsub/src/driver/mod.rs:16-19 better clarifies the OneSubscriber behavior - excellent improvement.

Recommendations Summary

Must Fix

  1. ⚠️ Address the race condition in eviction message skipping (use correlation IDs)
  2. ⚠️ Clarify and document error prioritization logic
  3. ⚠️ Add test coverage for the eviction mechanism

Should Fix

  1. Implement cleanup for active requests on task abort
  2. Add DoS protection (rate limiting) for evictions
  3. Fix double-filtering of events in workflow runner
  4. Document or remove deprecated evict_workflow_id field

Nice to Have

  1. Add security considerations for eviction topics to documentation
  2. Profile connection establishment performance
  3. Add monitoring/alerting for suspicious eviction patterns

Conclusion

This PR makes significant improvements to prevent duplicate runner boots, but has some critical issues around race conditions and security that should be addressed before merging. The overall architecture is sound, but the implementation needs refinement to be production-ready.

Recommendation: Request changes, particularly for the eviction message skipping logic and test coverage.

@pkg-pr-new
Copy link

pkg-pr-new bot commented Nov 5, 2025

More templates

@rivetkit/actor

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/actor@3377

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3377

@rivetkit/core

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/core@3377

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3377

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3377

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3377

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3377

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3377

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3377

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3377

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3377

commit: 4e625cf

create_ts: ctx.create_ts(),
})
.await?;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't removing this cause history diversion?

}
}

pub struct RunnerEvictionByNameSubject {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is misleading, isn't this ByKey? the name is just like the "pool"

.subscribe(&topic)
.await
.with_context(|| format!("failed to subscribe to runner receiver topic: {}", topic))?;
let mut eviction_sub = ups.subscribe(&eviction_topic).await.with_context(|| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

join sub futs

let (ws_to_tunnel_abort_tx, ws_to_tunnel_abort_rx) = watch::channel(());
let (ping_abort_tx, ping_abort_rx) = watch::channel(());

let tunnel_to_ws = tokio::spawn(tunnel_to_ws_task::task(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imo this naming is confusing. most of this file is tunnel logic, but it's also where all of the other ws<->pb forwarding logic happens.

@claude
Copy link

claude bot commented Nov 6, 2025

PR Review: Fix Boot Dupe Runners

I've completed a comprehensive review of this PR. Overall, the implementation addresses the duplicate runner issue with a robust eviction mechanism. Here are my findings:

✅ Strengths

  1. Race Condition Prevention: The publish-subscribe-then-consume pattern (lines 121-133 in lib.rs) effectively prevents race conditions by:

    • Subscribing to eviction topics first
    • Publishing eviction messages
    • Consuming the self-sent messages before proceeding
  2. Dual Eviction Strategy: Using both ID-based and name-based eviction topics provides comprehensive coverage for duplicate detection

  3. Clean Task Lifecycle Management: The refactored task orchestration using watch channels and LifecycleResult enum is much cleaner than the previous tokio::select! + .abort() pattern

  4. Error Handling: Proper error propagation and logging throughout

🔍 Issues & Concerns

High Priority

  1. Potential Deadlock in Task Completion Logic (lines 165-211 in lib.rs)

    The abort logic has a subtle issue. Each task's completion handler sends abort signals to the other tasks, but there's duplication of the abort senders:

    let tunnel_to_ws_abort_tx2 = tunnel_to_ws_abort_tx.clone();
    let ws_to_tunnel_abort_tx2 = ws_to_tunnel_abort_tx.clone();
    let ping_abort_tx2 = ping_abort_tx.clone();

    Then in the completion handlers, both the original and cloned versions are used inconsistently. For example:

    • Line 173: ping_abort_tx.send(())
    • Line 174: ws_to_tunnel_abort_tx.send(())
    • Line 188: ping_abort_tx2.send(())
    • Line 189: tunnel_to_ws_abort_tx.send(())

    Issue: In the ws_to_tunnel completion handler (line 189), you're sending to tunnel_to_ws_abort_tx instead of tunnel_to_ws_abort_tx2. This means the clones aren't being used consistently.

    Recommendation: Simplify by removing the _tx2 clones and just use the original senders in all three handlers. The current pattern is unnecessarily complex.

  2. Eviction Message Handling After Initial Connection (lines 33-40 in tunnel_to_ws_task.rs and lines 38-40 in ws_to_tunnel_task.rs)

    After the initial eviction messages are consumed (lines 126, 131), the eviction_sub and eviction_sub2 subscribers remain active in the tasks. This means:

    • Any subsequent eviction message published to these topics will immediately terminate the connection
    • This is likely intentional, but there's no documentation explaining this behavior

    Question: Is this the intended behavior? If another runner with the same ID/name connects later, should it evict this runner?

  3. PublishOpts Usage Inconsistency

    • Line 123, 129: PublishOpts::broadcast() for eviction messages
    • Line 278: PublishOpts::one() for close messages to gateway
    • Line 391 in ws_to_tunnel_task.rs: PublishOpts::one() for tunnel messages

    The documentation in universalpubsub/src/driver/mod.rs warns:

    "This should not be used if there will ever be more than one subscription at a time to the given topic on a global scale."

    Concern: Are we certain that gateway reply topics will only have one subscriber? If multiple subscribers exist, messages might be delivered to the wrong gateway.

Medium Priority

  1. Incomplete Result Pattern Matching (lines 214-224 in lib.rs)

    The lifecycle result determination logic has this pattern:

    match (tunnel_to_ws_res, ws_to_tunnel_res, ping_res) {
        (Err(err), _, _) => Err(err),
        (_, Err(err), _) => Err(err),
        (_, _, Err(err)) => Err(err),
        (Ok(res), Ok(LifecycleResult::Aborted), _) => Ok(res),
        (Ok(LifecycleResult::Aborted), Ok(res), _) => Ok(res),
        (res, _, _) => res,  // "Unlikely case"

    Issue: The "unlikely case" fallback doesn't handle all combinations. For example:

    • (Ok(Aborted), Ok(Aborted), Ok(Closed)) → Returns Ok(Aborted) from first position
    • What if ping completes normally but both others are aborted?

    Recommendation: Make the pattern matching exhaustive or add documentation explaining why these cases are impossible.

  2. Workflow State Cleanup (lines 473-495 in workflows/runner.rs)

    The init activity was modified to remove the eviction logic, but:

    • The InitOutput.evict_workflow_id field is now always None but still exists (marked as "Deprecated")
    • The runner-by-key slot is still written to the database but never read

    Question: Is there a follow-up PR to remove the deprecated field and the unused database write? This creates technical debt.

  3. Event Filtering Logic (lines 150-156, 183-190 in workflows/runner.rs)

    Events are filtered twice:

    let new_events = events
        .iter()
        .filter(|event| event.index > state.last_event_idx);
    
    // ... later ...
    
    events: events
        .into_iter()
        .filter(|event| event.index > state.last_event_idx)
        .collect(),

    Issue: This filters the same events twice with identical logic. The first filter uses .iter() (borrowed), the second uses .into_iter() (owned).

    Recommendation: Filter once and collect into a Vec, then use that for both the signal forwarding loop and the database insert.

Low Priority

  1. Error Prioritization Order (line 39 in pegboard-gateway/src/lib.rs)

    The error preference was swapped:

    -	(_, Err(err)) => Err(err),
    	(Err(err), _) => Err(err),
    +	(_, Err(err)) => Err(err),

    Question: What was the reason for this change? The commit message doesn't explain it, and it seems unrelated to fixing duplicate runners. Was there a specific error that was being masked?

  2. Module Renaming Without Functional Changes

    • client_to_pubsub_task.rsws_to_tunnel_task.rs
    • pubsub_to_client_task.rstunnel_to_ws_task.rs

    While these names are more descriptive, the rename makes the diff harder to review. Consider whether this refactoring should have been in a separate commit.

🧪 Testing Recommendations

  1. Race Condition Test: Verify that rapid reconnections of the same runner don't create duplicate registrations
  2. Eviction Test: Test that old runners are properly evicted when a new one connects
  3. Error Path Test: Verify all three tasks handle errors correctly and abort the others
  4. Load Test: Test behavior under high connection/disconnection rates

📝 Style & Documentation

  1. Missing Documentation: The eviction mechanism is complex but lacks comments explaining the race condition it prevents
  2. Logging Consistency: Good use of structured logging with tracing::debug!(?res, ...)
  3. Code Style: Follows the repository's conventions (hard tabs, error handling patterns)

🔒 Security Considerations

  1. Eviction Authority: The current design allows any runner with knowledge of namespace_id + runner_name + runner_key to evict existing runners. Ensure this is the intended behavior and these values are properly authenticated upstream.

Minor Issues

  • Typo Fix: rivetNamepacerivetNamespace in types.ts:3
  • Unrelated Change: examples/next-js/src/rivet/registry.ts adds sleepTimeout: 2000 - seems unrelated to this PR

Summary

The core eviction mechanism is sound, but there are several implementation details that need attention:

  1. Must Fix: Simplify the abort sender logic (potential inconsistency)
  2. Should Fix: Document the eviction behavior and clarify the lifecycle result matching
  3. Nice to Have: Clean up the deprecated evict_workflow_id field and optimize event filtering

Please address the high-priority issues before merging. I'm happy to discuss any of these points further!

@graphite-app
Copy link
Contributor

graphite-app bot commented Nov 6, 2025

Merge activity

  • Nov 6, 5:47 AM UTC: NathanFlurry added this pull request to the Graphite merge queue.
  • Nov 6, 5:48 AM UTC: CI is running for this pull request on a draft pull request (#3395) due to your merge queue CI optimization settings.
  • Nov 6, 5:50 AM UTC: Merged by the Graphite merge queue via draft PR: #3395.

graphite-app bot pushed a commit that referenced this pull request Nov 6, 2025
@graphite-app graphite-app bot closed this Nov 6, 2025
@graphite-app graphite-app bot deleted the 11-05-fix_boot_dupe_runners branch November 6, 2025 05:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants