fix(pegboard): refresh runner config after envoy connect #4778

Draft
NathanFlurry wants to merge 1 commit into driver-fixes/gasoline-skip-corrupt-active-worker from driver-fixes/refresh-runner-config-after-envoy-connect

Conversation

@NathanFlurry
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Member Author

NathanFlurry commented Apr 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude Bot commented Apr 26, 2026

PR Review: fix(pegboard): refresh runner config after envoy connect

This is a focused, two-file fix for a stale-cache bug: a serverful envoy reconnection updates the runner-config protocol version in UDB (inside the init_conn transaction at line 227 of conn.rs), but the cache layer is not invalidated, so the first actor-create request reads a stale pre-mk2 config.

Summary of Changes

| File | Change |
| --- | --- |
| engine/packages/pegboard-envoy/src/conn.rs | Calls purge_runner_config_caches after the init_conn UDB transaction commits |
| engine/packages/pegboard/src/ops/actor/create.rs | Sets bypass_cache: true on the runner_config::get call that decides mk1 vs mk2 dispatch |

Correctness

conn.rs — cache purge placement

The purge is placed after the tokio::try_join! that commits the UDB transaction (which writes ProtocolVersionKey), and before missed_commands are sent. This ordering is correct: the cache is purged only once the new protocol version is durably written, ensuring any subsequent reader gets fresh data.
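
The commit-then-purge ordering can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the actual pegboard code: `Store` plays the role of UDB, `Cache` is a simple read-through cache, and `on_envoy_connect` models the connect path.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the durable store (UDB in the PR).
#[derive(Default)]
struct Store {
    protocol_version: HashMap<String, u32>,
}

/// Hypothetical read-through cache in front of the store.
#[derive(Default)]
struct Cache {
    entries: HashMap<String, u32>,
}

impl Cache {
    /// Returns the cached value if present, otherwise reads the store
    /// and remembers the result.
    fn get(&mut self, store: &Store, pool: &str) -> Option<u32> {
        if let Some(v) = self.entries.get(pool) {
            return Some(*v);
        }
        let v = *store.protocol_version.get(pool)?;
        self.entries.insert(pool.to_string(), v);
        Some(v)
    }

    /// Drops the cached entry so the next read goes to the store.
    fn purge(&mut self, pool: &str) {
        self.entries.remove(pool);
    }
}

/// Models the connect path: write the new version first, purge second.
/// Purging before the write would let a concurrent reader re-cache the
/// old value.
fn on_envoy_connect(store: &mut Store, cache: &mut Cache, pool: &str, version: u32) {
    store.protocol_version.insert(pool.to_string(), version); // the "commit"
    cache.purge(pool);
}
```

In this model, a reader that cached version 1 before the reconnect sees version 2 on its next read, because the purge only runs once the new value is in the store.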

The purge is applied unconditionally, regardless of is_serverless. The code comment says "serverful pools update protocol version directly from the envoy connection path", and that context only applies to serverful pools. For serverless pools the metadata_poller workflow is the source of truth and already calls purge_runner_config_caches after updating the protocol version. The unconditional purge here is harmless for serverless pools (it is simply redundant), but an if !is_serverless guard would make the intent clearer and avoid the extra purge on every serverless reconnect.
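
A rough sketch of the suggested guard, with hypothetical names: `PurgeStats` is a counter standing in for the real purge call, and `after_init_conn` stands in for the code that runs after the transaction commits. The actual signature of purge_runner_config_caches is not shown in this review, so it is not reproduced here.

```rust
/// Hypothetical counter standing in for the real cache purge call.
#[derive(Default)]
struct PurgeStats {
    purges: u32,
}

/// Models the step after the init_conn transaction commits.
fn after_init_conn(is_serverless: bool, stats: &mut PurgeStats) {
    // Assumption taken from the review text: serverless pools are already
    // purged by the metadata poller, so only serverful pools purge here.
    if !is_serverless {
        stats.purges += 1;
    }
}
```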

create.rs - bypass_cache: true

The change from bypass_cache: false to bypass_cache: true ensures that every actor creation pays one UDB read. This is a conservative but correct choice: it closes the race where a fresh envoy connection updates the protocol version immediately before the first pegboard_actor_create call, but the cached value causes mk1 dispatch for what should be an mk2 actor.

Two minor concerns:

  1. Performance regression at scale. The pegboard_runner_config_get operation is called inside a tokio::try_join! alongside six subscription setups. With bypass_cache: false (TTL = 5 s), this was a cheap in-process cache hit. With bypass_cache: true, every actor creation now incurs a full UDB round-trip. For high-throughput workloads this could be noticeable. The combination of the cache purge in conn.rs (which clears the stale entry on reconnect) plus keeping bypass_cache: false in create.rs would achieve the same correctness while keeping the fast path for subsequent actor creates after the cache is already fresh. Worth considering whether bypass_cache: true is necessary given the purge is already added in conn.rs.

  2. The fix is now slightly over-specified. Both changes together ensure correctness, but they are redundant: if conn.rs purges the cache on reconnect, create.rs can continue using the (now-fresh) cache. If create.rs always bypasses the cache, the conn.rs purge only helps list_runner_config_enabled_dcs consumers. The PR description doesn't explain the intended long-term ownership of this invariant.
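
The trade-off between the two approaches can be sketched with a toy TTL cache. This is an illustration under stated assumptions, not the engine's cache: `ConfigCache`, its `store_reads` counter, and the `store_value` argument (the value currently in the durable store) are all hypothetical.

```rust
use std::time::{Duration, Instant};

/// Hypothetical single-entry TTL cache that counts store reads.
struct ConfigCache {
    ttl: Duration,
    entry: Option<(u32, Instant)>,
    store_reads: u32,
}

impl ConfigCache {
    fn new(ttl: Duration) -> Self {
        ConfigCache { ttl, entry: None, store_reads: 0 }
    }

    /// `store_value` is what a store read would return right now.
    /// With `bypass_cache` set, the cached entry is never consulted.
    fn get(&mut self, store_value: u32, bypass_cache: bool) -> u32 {
        if !bypass_cache {
            if let Some((value, cached_at)) = self.entry {
                if cached_at.elapsed() < self.ttl {
                    return value; // cache hit, possibly stale
                }
            }
        }
        // Cache miss, expired entry, or bypass: pay for a store read.
        self.store_reads += 1;
        self.entry = Some((store_value, Instant::now()));
        store_value
    }

    /// Drops the entry so the next non-bypass read refreshes it.
    fn purge(&mut self) {
        self.entry = None;
    }
}
```

In this model, purging on reconnect costs one store read for the first create and none for the creates that follow within the TTL, while bypassing the cache costs one store read per create. Without either, a read inside the TTL window returns the stale value.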


Code Quality

Comments are clear and follow conventions. Both comment blocks are complete sentences that explain the "why," consistent with the CLAUDE.md guideline. No fragments or em dashes.

The conn.rs comment (lines 282–284) is slightly imprecise: it says "serverful pools", but the purge applies unconditionally. Consider either scoping the purge with if !is_serverless { ... } or adjusting the comment to match the unconditional behavior.

No logging added. A tracing::debug! at the purge site (with namespace_id and pool_name as structured fields) would make it easier to confirm the purge ran in production traces. This is a minor suggestion, not a blocker.


Style / Conventions

  • No println! / eprintln! introduced.
  • No new Mutex<HashMap> patterns.
  • anyhow / .context() used correctly (the ? operator propagates the purge_runner_config_caches error via .await?).
  • Hard tabs preserved in Rust files.
  • No new _ => fallthrough arms added.

Verdict

The fix is logically sound and closes the race condition. The two concerns worth discussing before merge:

  1. Consider scoping the conn.rs purge to !is_serverless to match the stated intent and avoid unnecessary cache churn on serverless reconnects.
  2. Evaluate whether bypass_cache: true in create.rs is necessary given the purge already covers the staleness window, or whether the performance trade-off is intentional and acceptable.

Neither is a blocker, but the second point in particular could be a latency regression under load. If the team has measured this or considers the UDB round-trip cost acceptable, the PR is good to go.

@github-actions
Contributor

Preview packages published to npm

Install with:

npm install rivetkit@pr-4778

All packages published as 0.0.0-pr.4778.331136c with tag pr-4778.

Engine binary is shipped via @rivetkit/engine-cli on linux-x64-musl, linux-arm64-musl, darwin-x64, and darwin-arm64. Windows users should use the release installer or set RIVET_ENGINE_BINARY.

Docker images:

docker pull rivetdev/engine:slim-331136c
docker pull rivetdev/engine:full-331136c

Individual packages:

npm install rivetkit@pr-4778
npm install @rivetkit/react@pr-4778
npm install @rivetkit/rivetkit-napi@pr-4778
npm install @rivetkit/workflow-engine@pr-4778
