Skip to content

fix: add global envoy to envoy client, remove crash policy from actors v2#4593

Merged
MasterPtato merged 1 commit intomainfrom
04-08-fix_add_global_envoy_to_envoy_client_remove_crash_policy_from_actors_v2
Apr 8, 2026
Merged

fix: add global envoy to envoy client, remove crash policy from actors v2#4593
MasterPtato merged 1 commit intomainfrom
04-08-fix_add_global_envoy_to_envoy_client_remove_crash_policy_from_actors_v2

Conversation

@MasterPtato
Copy link
Copy Markdown
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@MasterPtato MasterPtato requested a review from NathanFlurry April 8, 2026 22:46
@railway-app
Copy link
Copy Markdown

railway-app bot commented Apr 8, 2026

🚅 Deployed to the rivet-pr-4593 environment in rivet-frontend

Service Status Web Updated (UTC)
website ❌ Build Failed (View Logs) Web Apr 8, 2026 at 10:47 pm
mcp-hub ✅ Success (View Logs) Web Apr 8, 2026 at 10:47 pm
frontend-cloud ❌ Build Failed (View Logs) Web Apr 8, 2026 at 10:47 pm
frontend-inspector ❌ Build Failed (View Logs) Web Apr 8, 2026 at 10:47 pm
ladle ❌ Build Failed (View Logs) Web Apr 8, 2026 at 10:47 pm
kitchen-sink ❌ Build Failed (View Logs) Web Apr 8, 2026 at 10:47 pm

@railway-app railway-app bot temporarily deployed to rivet-frontend / rivet-pr-4593 April 8, 2026 22:46 Destroyed
Copy link
Copy Markdown
Contributor Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown

claude bot commented Apr 8, 2026

PR Review: fix: add global envoy to envoy client, remove crash policy from actors v2

Summary

This PR does two things: (1) introduces a global singleton envoy via OnceLock, and (2) removes per-actor crash_policy from actors v2, hardcoding Sleep behavior. The code is generally clean but there are a few issues worth addressing.


Issues

started() regression — potential hang after reconnect

File: engine/sdks/rust/envoy-client/src/handle.rs

The simplification from the poll-loop to a single changed().await changes semantics in a way that can hang:

// New
pub async fn started(&self) {
    let _ = self.started_rx.clone().changed().await;
}

clone() on a watch::Receiver marks the current value as already seen. If started() is called after the envoy has already connected (e.g. on a handle obtained from GLOBAL_ENVOY after startup), changed() will wait for the next send() — which only happens on reconnect — and block indefinitely. The old loop checked the boolean value first:

while !*rx.borrow_and_update() { ... }

With the () type this cannot be done the same way, but the intent should be preserved. One option: use has_changed() + changed() on the original receiver (not a clone), or restructure the channel to be a oneshot.

Silent config discard on repeated global calls

File: engine/sdks/rust/envoy-client/src/envoy.rs

GLOBAL_ENVOY
    .get_or_init(|| start_envoy_sync_inner(config))
    .clone()

If start_envoy_sync is called a second time with a different config (e.g. different endpoint, pool name), the new config is silently discarded and the original global handle is returned. This is a footgun. At minimum, add a debug-level log warning that the config was ignored, or document this contract clearly (a debug_assert! comparing key fields could help catch misconfiguration in tests).

Signal handler spawned per non-global envoy

File: engine/sdks/rust/envoy-client/src/envoy.rs

let handle2 = handle.clone();
tokio::spawn(async move {
    let _ = tokio::signal::ctrl_c().await;
    handle2.shutdown(false);
});

This is spawned inside start_envoy_sync_inner, which is called for every not_global: true envoy. If a test or process creates multiple non-global envoys, it accumulates N Ctrl+C listeners. Each will call shutdown on its respective handle, which is probably benign, but the extra tasks are noise. Consider only installing the signal handler once (e.g. guard with its own OnceLock<()> or scope it to the global path only).

Additionally, libraries generally should not intercept process signals. The signal handler here would interfere with any application-level Ctrl+C handling (e.g. graceful HTTP server shutdown). Consider exposing a shutdown_on_signal helper that the caller can opt into instead.

not_global is a negative boolean flag

File: engine/sdks/rust/envoy-client/src/config.rs

not_global: bool is harder to read than a positive name. not_global: false (the common case) reads as a double negative. Consider renaming to something like isolated: bool or per_instance: bool where true means "don't share the global instance".

is_shutdown is ineffective in the serverless SSE path

File: engine/sdks/rust/test-envoy/src/server.rs

In start_serverless, create_envoy is called with not_global: false, so it returns the global envoy handle (with the original callbacks from startup). The new DefaultTestCallbacks and its is_shutdown arc are created fresh but never wired to the global envoy's actual shutdown. The SSE loop polling is_shutdown.load() will never see true via this mechanism. This is a test server so it's not critical, but the dead code is misleading.


Minor Notes

  • SleepIntent { rewake_after_stop: false } behavior change: In the old code both true and false mapped to Decision::Reallocate. The new code maps false to Decision::Sleep. This looks intentional (the field name implies it), but it's a behavioral change worth calling out explicitly in the PR description.

  • CrashPolicy import in util.rs: CrashPolicy is now imported only to hardcode CrashPolicy::Sleep. If this is a permanent change, consider whether the type is still meaningful in the public API or if it should be phased out.

  • rivet-util-serde split: Clean factoring — no concerns.

  • events.rs refactor: The two-pass cleanup loop elimination is an improvement. Looks correct.


Overall

The changes are well-scoped. The global envoy pattern is a reasonable optimization, but the started() hang regression and signal handler scope are the two items I'd want resolved before merge.

Copy link
Copy Markdown
Contributor Author

MasterPtato commented Apr 8, 2026

Merge activity

  • Apr 8, 11:24 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Apr 8, 11:25 PM UTC: @MasterPtato merged this pull request with Graphite.

@MasterPtato MasterPtato merged commit 7ab128e into main Apr 8, 2026
7 of 20 checks passed
@MasterPtato MasterPtato deleted the 04-08-fix_add_global_envoy_to_envoy_client_remove_crash_policy_from_actors_v2 branch April 8, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant