
Conversation

@MasterPtato (Contributor)

No description provided.

vercel bot commented Nov 18, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

4 Skipped Deployments

| Project             | Deployment | Preview | Comments | Updated (UTC)       |
| ------------------- | ---------- | ------- | -------- | ------------------- |
| rivet-cloud         | Ignored    | Ignored |          | Nov 18, 2025 2:30am |
| rivet-inspector     | Ignored    | Ignored |          | Nov 18, 2025 2:30am |
| rivet-site          | Ignored    | Ignored |          | Nov 18, 2025 2:30am |
| rivetkit-serverless | Skipped    | Skipped |          | Nov 18, 2025 2:30am |

MasterPtato (Contributor, Author) commented Nov 18, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

pkg-pr-new bot commented Nov 18, 2025

More templates

@rivetkit/cloudflare-workers

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/cloudflare-workers@3480

@rivetkit/db

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/db@3480

@rivetkit/framework-base

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/framework-base@3480

@rivetkit/next-js

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/next-js@3480

@rivetkit/react

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/react@3480

rivetkit

pnpm add https://pkg.pr.new/rivet-dev/rivet/rivetkit@3480

@rivetkit/sql-loader

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/sql-loader@3480

@rivetkit/engine-runner

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner@3480

@rivetkit/engine-runner-protocol

pnpm add https://pkg.pr.new/rivet-dev/rivet/@rivetkit/engine-runner-protocol@3480

commit: b180eca

claude bot commented Nov 18, 2025

Code Review: Rewrite runner workflow to handle batch signals

Summary

This PR introduces runner2.rs as a rewrite of the runner workflow to handle batch signal processing. The key improvement is switching from processing signals one-at-a-time to processing them in batches using listen_n_with_timeout instead of listen_with_timeout.

Positive Aspects

Architecture & Design

  • Batch Processing: The switch to listen_n_with_timeout with a batch size of 1024 is excellent for performance and throughput
  • State Separation: Good separation between State (persistent workflow state) and LifecycleState (loop iteration state)
  • Signal Batching: Well-designed batching logic that accumulates signals before processing (lines 94-183)
  • Message Batching: Efficient batching of outbound messages before sending (line 185)

Error Handling

  • Graceful Degradation: Proper use of .graceful_not_found() when signaling actor workflows (lines 163, 247, 362)
  • Race Condition Handling: Good handling of draining runners receiving start actor commands (lines 139-172)

Code Quality

  • Clear Comments: Good inline documentation explaining non-obvious behavior (lines 236, 819, 967, 985)
  • Consistent Patterns: Activities follow consistent naming and structure
  • Logging: Appropriate use of structured logging with tracing::warn!

Issues & Concerns

Critical Issues

1. Unused Import (line 11)

use vbare::OwnedVersionedData;

This import is not used anywhere in the file and should be removed. The project uses workspace dependencies, so unnecessary imports should be cleaned up.

Fix: Remove line 11

2. Potential Division by Zero (lines 620, 1057)

let remaining_millislots = (remaining_slots * 1000) / input.total_slots;

If input.total_slots is 0, this will panic. While this may be prevented by validation elsewhere, defensive coding suggests adding a check or assertion.

Recommendation: Add validation or use checked_div with proper error handling
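A defensive version of that calculation could be sketched as follows; `remaining_slots`, `total_slots`, and the string error type are stand-ins for the fields and error handling in the actual workflow input:

```rust
// Sketch: guard the millislot calculation against total_slots == 0.
// `remaining_slots` and `total_slots` stand in for fields on the real input.
fn remaining_millislots(remaining_slots: u64, total_slots: u64) -> Result<u64, String> {
    (remaining_slots * 1000)
        .checked_div(total_slots)
        .ok_or_else(|| "total_slots must be non-zero".to_string())
}

fn main() {
    assert_eq!(remaining_millislots(3, 4), Ok(750));
    assert!(remaining_millislots(3, 0).is_err());
}
```

`checked_div` returns `None` on a zero divisor instead of panicking, so the error surfaces through the workflow's normal `Result` path.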

3. Missing Signal Handler?

The Main signal enum at line 1170 includes CheckQueue, Command, Forward, and Stop, and the batch processing loop currently handles all of them. However, nothing validates that every variant stays handled: if a new signal type is added to Main, the compiler won't force an update here, since the match isn't exhaustive over the enum.

Recommendation: Consider if this is intentional or if the signal handling should be refactored for better compile-time safety
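One way to get that compile-time guarantee, sketched with a stand-in enum rather than the project's actual Main type: match exhaustively with no wildcard arm, so adding a variant becomes a compile error until it is handled.

```rust
// Stand-in for the workflow's Main signal enum.
enum Main {
    CheckQueue,
    Command(u32),
    Forward(String),
    Stop,
}

// No `_` arm: adding a new variant to Main fails to compile until it is
// handled here, which is the compile-time safety the review suggests.
fn handle(sig: Main) -> &'static str {
    match sig {
        Main::CheckQueue => "check_queue",
        Main::Command(_) => "command",
        Main::Forward(_) => "forward",
        Main::Stop => "stop",
    }
}

fn main() {
    assert_eq!(handle(Main::Stop), "stop");
    assert_eq!(handle(Main::Command(7)), "command");
}
```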

Performance Considerations

4. Sequential Signal Processing (lines 239-256)

// NOTE: This should not be parallelized because signals should be sent in order
// Forward to actor workflows
// Process events
for event in &events {
    // ... sends signal to actor workflow
}

While the comment explains this must be sequential, this could become a bottleneck with many events. Each signal send is an async operation that must complete before the next begins.

Consider:

  • Is strict ordering truly required for all events, or just events for the same actor?
  • Could you batch events by actor_id and parallelize across different actors while maintaining order per-actor?
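The per-actor variant could be sketched like this, assuming ordering only matters within a single actor; `Event` and the numeric ids are illustrative, not the project's types:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Event { actor_id: u32, seq: u32 }

// Group events by actor while preserving arrival order within each actor.
// Each group could then be sent concurrently with the others.
fn group_by_actor(events: Vec<Event>) -> HashMap<u32, Vec<Event>> {
    let mut groups: HashMap<u32, Vec<Event>> = HashMap::new();
    for ev in events {
        groups.entry(ev.actor_id).or_default().push(ev);
    }
    groups
}

fn main() {
    let events = vec![
        Event { actor_id: 1, seq: 0 },
        Event { actor_id: 2, seq: 0 },
        Event { actor_id: 1, seq: 1 },
    ];
    let groups = group_by_actor(events);
    // Order within actor 1 is preserved even though actor 2 interleaved.
    assert_eq!(groups[&1].iter().map(|e| e.seq).collect::<Vec<_>>(), vec![0, 1]);
    assert_eq!(groups[&2].len(), 1);
}
```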

5. Sequential Allocation Signals (lines 315-321)

for alloc in res.allocations {
    ctx.signal(alloc.signal)
        .to_workflow::<crate::workflows::actor::Workflow>()
        .tag("actor_id", alloc.actor_id)
        .send()
        .await?;
}

As with point 4 above, these allocations are sent sequentially but could be parallelized, since each one targets a different actor.

Recommendation: Use futures::future::try_join_all or similar to parallelize these independent operations
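The shape of that change, sketched with scoped threads standing in for the concurrent async sends; `send_signal` is a placeholder for the real `ctx.signal(...).send()` call, and in the async workflow code the equivalent would be `futures::future::try_join_all` over the send futures:

```rust
use std::thread;

// Placeholder for the real per-actor signal send; returns Err to simulate failure.
fn send_signal(actor_id: u32) -> Result<u32, String> {
    if actor_id == 0 { Err("bad actor".into()) } else { Ok(actor_id) }
}

// Fan out the independent sends concurrently and collect all results,
// analogous to futures::future::try_join_all in the async version.
fn send_all(actor_ids: &[u32]) -> Result<Vec<u32>, String> {
    thread::scope(|s| {
        let handles: Vec<_> = actor_ids
            .iter()
            .map(|&id| s.spawn(move || send_signal(id)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    assert_eq!(send_all(&[1, 2, 3]), Ok(vec![1, 2, 3]));
    assert!(send_all(&[1, 0]).is_err());
}
```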

6. Message-by-Message Publishing (lines 1140-1147)

for message in &input.messages {
    let message_serialized = versioned::ToClient::wrap_latest(message.clone())
        .serialize_with_embedded_version(PROTOCOL_VERSION)?;
    
    ctx.ups()?
        .publish(&receiver_subject, &message_serialized, PublishOpts::one())
        .await?;
}

Each message is published individually in a loop. If the pubsub system supports batch publishing, this could be optimized.

Recommendation: Check if ups() supports batch publishing and utilize it

Code Quality Issues

7. Commented Out Code (lines 36, 819-825)

// events: Vec<EventRow>,
// TODO: Storing events is disabled for now, otherwise state will grow indefinitely

While the TODO explains why events aren't stored, the commented code should either be removed or the decision should be finalized.

Recommendation: Either implement a proper solution (e.g., event truncation/archival) or remove the dead code

8. Deprecated Field (lines 455-456)

#[derive(Debug, Serialize, Deserialize)]
struct InitOutput {
    /// Deprecated.
    evict_workflow_id: Option<Id>,
}

If this field is deprecated, there should be a plan to remove it. Is this for backwards compatibility? Should it have a timeline for removal?

Recommendation: Add context on when this can be removed or if it's needed for backward compatibility

9. Clone Overhead (lines 70 and 284)

let input = input.clone();  // line 70
commands: commands.clone(),  // line 284

The input is cloned on every loop iteration. While Input is relatively small, commands could grow large and is cloned even when it might not be needed (the clone happens before the activity, but the activity might not use all fields).

Recommendation: Consider if these clones are necessary or if references could be used

Testing & Documentation

10. No Tests Included

This is a significant rewrite (1176 lines) that changes core behavior from single-signal to batch-signal processing, but no tests are included in the PR.

Critical Recommendation: Add tests covering:

  • Batch signal processing with various batch sizes
  • Timeout behavior when receiving fewer signals than batch size
  • Race conditions (draining runner receiving start commands)
  • Event ordering guarantees
  • Command acknowledgment logic
  • State persistence and recovery

11. Missing Migration Path

The file is named runner2.rs suggesting this runs alongside the old runner.rs.

Questions:

  • How will existing runner workflows migrate from runner to runner2?
  • Are both implementations expected to run concurrently?
  • Is there a feature flag or rollout plan?
  • Will runner.rs be deprecated?

Recommendation: Document the migration strategy

Minor Issues

12. Custom Constructor Instead of Default (lines 427-442)

LifecycleState::new() could use Default trait instead of a custom constructor since it just initializes with default values.

Recommendation:

impl Default for LifecycleState {
    fn default() -> Self { ... }
}

13. Magic Number (line 16)

const EVENT_ACK_BATCH_SIZE: i64 = 500;

Good use of a constant, but consider if this should be configurable via the config system like other thresholds.

Security Concerns

14. Unbounded Batch Size

While the batch size is limited to 1024 signals (line 77), each signal could contain variable-sized data (e.g., prepopulate_actor_names in init, metadata). This could potentially lead to memory issues if many large signals arrive.

Recommendation: Consider adding size-based limits in addition to count-based limits
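A size-aware batching cutoff could look like the following sketch; the limits and `Vec<u8>` payloads are illustrative, not the workflow's actual signal types:

```rust
// Accumulate items into a batch until either a count limit or a total
// byte-size limit is hit, whichever comes first.
fn take_batch(items: &[Vec<u8>], max_count: usize, max_bytes: usize) -> usize {
    let mut total = 0;
    let mut taken = 0;
    for item in items.iter().take(max_count) {
        if taken > 0 && total + item.len() > max_bytes {
            break;
        }
        total += item.len();
        taken += 1;
    }
    taken
}

fn main() {
    let items = vec![vec![0u8; 400], vec![0u8; 400], vec![0u8; 400]];
    // The count limit alone would admit all 3, but the 1024-byte cap stops at 2.
    assert_eq!(take_batch(&items, 1024, 1024), 2);
    // A single oversized item is still admitted so progress is always made.
    assert_eq!(take_batch(&[vec![0u8; 4096]], 1024, 1024), 1);
}
```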

15. Error Message Information Disclosure (lines 130-133)

bail!(
    "received message that should not be sent to runner workflow: {:?}",
    sig.inner
)

Logging the full message content in an error could potentially expose sensitive information.

Recommendation: Log message type only, not full content, or ensure sensitive fields are redacted
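One redaction pattern, sketched with an illustrative enum rather than the project's protocol types: format only a variant name instead of the Debug representation of the full payload.

```rust
// Stand-in for an inbound protocol message carrying possibly sensitive data.
enum ToServer {
    Init { token: String },
    Ping,
}

impl ToServer {
    // Expose only the variant name for error messages and logs.
    fn kind(&self) -> &'static str {
        match self {
            ToServer::Init { .. } => "Init",
            ToServer::Ping => "Ping",
        }
    }
}

fn main() {
    let msg = ToServer::Init { token: "secret".into() };
    let err = format!(
        "received message that should not be sent to runner workflow: {}",
        msg.kind()
    );
    // The token never appears in the error text.
    assert!(!err.contains("secret"));
    assert!(err.contains("Init"));
}
```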

Performance Metrics

The batch processing approach should provide significant performance improvements:

  • Reduced syscalls: Batching reduces context switches
  • Better throughput: Processing 1024 signals vs 1 signal per iteration
  • Reduced activity overhead: Batching messages reduces workflow activity invocations

Recommendation: Add metrics to track:

  • Average batch sizes received
  • Processing latency per batch
  • Time spent waiting vs processing
  • Signal queue depths

Summary & Recommendations

Must Fix Before Merge

  1. Remove unused import vbare::OwnedVersionedData (line 11)
  2. Add division-by-zero safety for total_slots calculations (lines 620, 1057)
  3. Add comprehensive tests for batch signal processing behavior
  4. Document migration strategy from runner.rs to runner2.rs

Should Fix Before Merge

  1. Parallelize independent signal sends (allocation signals, lines 315-321)
  2. Review event processing ordering requirements - consider per-actor parallelization
  3. Resolve the commented-out event storage code - implement or remove
  4. Add performance metrics for batch processing

Consider for Future

  1. Make EVENT_ACK_BATCH_SIZE configurable
  2. Implement batch publishing if supported by pubsub system
  3. Add size-based limits in addition to count-based batch limits
  4. Use Default trait for LifecycleState

Conclusion

This is a well-architected improvement to the runner workflow that should significantly improve performance through batch processing. The code quality is generally good with clear separation of concerns and proper error handling. However, the lack of tests for such a critical component is concerning, and the migration strategy needs clarification. With the critical issues addressed, this will be a solid improvement to the system.

Great work on the batch processing implementation! The signal accumulation logic is particularly well done.
