fix: various serverless bug fixes#4313

Open
MasterPtato wants to merge 1 commit into 02-20-fix_guard_runner_enforce_http_body_size_and_ws_msg_size from 02-25-fix_various_serverless_bug_fixes

Conversation

@MasterPtato
Contributor

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app

railway-app bot commented Feb 26, 2026

🚅 Deployed to the rivet-pr-4313 environment in rivet-frontend

| Service | Status | Web | Updated (UTC) |
| --- | --- | --- | --- |
| website | 😴 Sleeping (View Logs) | Web | Feb 26, 2026 at 1:25 am |
| frontend-inspector | ❌ Build Failed (View Logs) | Web | Feb 26, 2026 at 1:13 am |
| ladle | ❌ Build Failed (View Logs) | Web | Feb 26, 2026 at 1:13 am |
| mcp-hub | ✅ Success (View Logs) | Web | Feb 26, 2026 at 1:13 am |
| frontend-cloud | ❌ Build Failed (View Logs) | Web | Feb 26, 2026 at 1:13 am |

Contributor Author

MasterPtato commented Feb 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.


How to use the Graphite Merge Queue

Add the label merge-queue to this PR to add it to the merge queue.

You must have a Graphite account in order to use the merge queue. Sign up using this link.

An organization admin has enabled the Graphite Merge Queue in this repository.

Please do not merge from GitHub as this will restart CI on PRs being processed by the merge queue.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude

claude bot commented Feb 26, 2026

PR Review: fix: various serverless bug fixes

Solid set of targeted bug fixes for the serverless runner path. Most changes are correct and address real issues. A few things worth discussing below.


Correctness: Good Fixes

1. GetTs activity for workflow determinism (actor/mod.rs)

Moving util::timestamp::now() into a GetTs activity (under ctx.v(3)) is the correct fix. Inline now() calls inside a workflow loop are non-deterministic and will produce different values on replay, which can corrupt alarm state. Wrapping it as an activity ensures the same timestamp is returned during replay. Additionally, the now variable was previously fetched unconditionally before the signal loop but only used in the Main::Wake handler — the new placement inside the handler is more precise and avoids an unnecessary activity call on every iteration.
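To illustrate why the activity wrapper matters, here is a minimal, self-contained sketch of record-and-replay. It does not use the project's workflow engine: `History`, `activity`, and `now_ms` are hypothetical stand-ins, and the only assumption is that an activity's result is recorded on first execution and read back on replay.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Hypothetical stand-in for a workflow history: results of completed
/// activities, in the order they ran.
struct History {
    recorded: Vec<i64>,
    cursor: usize,
}

impl History {
    fn new() -> Self {
        History { recorded: Vec::new(), cursor: 0 }
    }

    /// Start a replay over the same recorded results.
    fn replay(&self) -> Self {
        History { recorded: self.recorded.clone(), cursor: 0 }
    }

    /// Run `f` as an "activity": on first execution, call it and record the
    /// result; on replay, return the recorded result without calling `f`.
    fn activity<F: FnOnce() -> i64>(&mut self, f: F) -> i64 {
        let value = match self.recorded.get(self.cursor).copied() {
            Some(v) => v,
            None => {
                let v = f();
                self.recorded.push(v);
                v
            }
        };
        self.cursor += 1;
        value
    }
}

/// Wall-clock time in milliseconds; non-deterministic by nature.
fn now_ms() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is before the Unix epoch")
        .as_millis() as i64
}

fn main() {
    let mut first = History::new();
    let ts = first.activity(now_ms);

    // On replay the closure is never called, so the clock skew is invisible.
    let mut replayed = first.replay();
    let ts_replayed = replayed.activity(|| now_ms() + 1_000);

    println!("replay matches first run: {}", ts == ts_replayed);
}
```

An inline `now_ms()` call in place of `activity(now_ms)` would return a fresh value on every replay, which is the failure mode described above.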

2. drain_runner signal target (serverless/conn.rs)

Changing from runner::Stop to runner2::Stop looks like a genuine bug fix: the serverless drain path was signaling mk1 runners instead of mk2. The fallback-to-mk1 pattern used elsewhere in the file suggests the old target was an oversight.

3. clear_pending_allocation API fix (actor/runtime.rs)

Switching from manual subspace().pack(&key) + tx.get()/tx.clear() to tx.with_subspace() + tx.exists()/tx.delete() is a cleaner use of the typed DB API and avoids a class of subtle key-encoding bugs.

4. Load shedding curve adjustment (runtime.rs, config-schema.json)

The 60%→70% / 80%→90% shift gives more headroom before shedding begins. Config and schema are in sync.
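As a rough illustration of the effect, the sketch below compares the two curves. It assumes the two numbers are the load at which shedding starts and the load at which it is total, with a linear ramp between them; the actual curve shape in runtime.rs is not shown in this review, and `shed_probability` is a hypothetical name.

```rust
/// Hypothetical shed probability: 0 at or below `start`, 1 at or above
/// `full`, linear in between. Assumes `start < full`.
fn shed_probability(load: f64, start: f64, full: f64) -> f64 {
    ((load - start) / (full - start)).clamp(0.0, 1.0)
}

fn main() {
    for load in [0.60, 0.65, 0.70, 0.80, 0.90] {
        let old = shed_probability(load, 0.60, 0.80);
        let new = shed_probability(load, 0.70, 0.90);
        println!("load {:.2}: old {:.2}, new {:.2}", load, old, new);
    }
}
```

Under that assumption, a load of 65% sheds about a quarter of requests on the old curve and none on the new one.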


Issues

1. Duplicate Bump signal logic (actor/runtime.rs, lines ~887-910 and ~998-1020)

The two Bump signal blocks (one in the Destroy branch, one in the Sleep branch after timeout) are nearly identical — same signal, same tag construction, same error handling pattern. This is a maintenance risk: any future change to the error handling or signal parameters needs to be applied in both places. Consider extracting a helper function for the bump + WorkflowNotFound-tolerant error handling pattern.
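One possible shape for such a helper, as a sketch only: `SignalError` and `bump_tolerating_missing` are hypothetical names, and the real signal call is replaced by a closure so the example stands alone.

```rust
/// Hypothetical error type standing in for the workflow engine's errors.
#[derive(Debug, PartialEq)]
enum SignalError {
    WorkflowNotFound,
    Other(String),
}

/// Send a bump via `send` and treat a missing target workflow as a no-op.
/// Returns Ok(true) if delivered, Ok(false) if the target was already gone.
fn bump_tolerating_missing<F>(send: F) -> Result<bool, SignalError>
where
    F: FnOnce() -> Result<(), SignalError>,
{
    match send() {
        Ok(()) => Ok(true),
        Err(SignalError::WorkflowNotFound) => Ok(false),
        Err(other) => Err(other),
    }
}

fn main() {
    // Both the Destroy path and the Sleep-timeout path would call the helper.
    let delivered = bump_tolerating_missing(|| Ok(()));
    let missing = bump_tolerating_missing(|| Err(SignalError::WorkflowNotFound));
    let failed = bump_tolerating_missing(|| Err(SignalError::Other("db unavailable".into())));
    println!("{:?} {:?} {:?}", delivered, missing, failed);
}
```

With the error handling in one place, a later change to which errors are tolerated only needs to be made once.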

2. Removed debug log reduces observability (serverless/conn.rs, line ~517)

The removed "reached drain grace period before runner shut down" log was useful for diagnosing drain timing issues in production. The empty match arm now silently swallows the grace period expiry. A tracing::debug! would be preferable to no log at all.

3. attempted field naming is ambiguous (runner2.rs)

attempted is set to pending_actors.len() before any filtering/selection occurs during allocation. The name implies actors that were actively tried for allocation, but it captures all pending actors at that snapshot. A short comment clarifying the semantics (e.g., "count of pending actors at the start of this allocation round") would help. The #[serde(default)] for backward compatibility is correct.

4. listen_n reduced from 512 to 256 without explanation (runner_pool.rs)

The signal batch size dropped from 512 to 256. For high-traffic namespaces with many concurrent actor lifecycle events, this could increase loop iterations and latency. Is there a specific reason (memory pressure, replay cost) for the reduction? A comment would help future readers.
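For a sense of scale, the arithmetic is just a ceiling division. The backlog of 1,000 signals below is an arbitrary illustrative figure, not a measurement, and `iterations_needed` is a hypothetical helper.

```rust
/// Number of loop iterations needed to drain `backlog` signals when each
/// iteration receives at most `batch` of them. Assumes `batch > 0`.
fn iterations_needed(backlog: usize, batch: usize) -> usize {
    (backlog + batch - 1) / batch
}

fn main() {
    let backlog = 1_000; // illustrative backlog, not taken from the PR
    println!("batch 512: {} iterations", iterations_needed(backlog, 512));
    println!("batch 256: {} iterations", iterations_needed(backlog, 256));
}
```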


Minor Notes

  • commit_interval(5): Persisting every 5 iterations is a reasonable I/O optimization. Fine as long as the pool loop is idempotent enough to tolerate replaying up to 4 un-committed iterations on crash, which appears to be the case given the signal-driven structure.
  • /health endpoint in test-runner: Straightforward addition, no concerns.
  • Removed load test http_req_duration thresholds: Loosening test constraints may hide regressions. If the previous bounds were too tight for the CI environment, consider loose bounds (e.g. p(95)<15000) rather than removing them entirely.
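
To make the commit_interval(5) note concrete, here is a small sketch of the replay window. `LoopState` is a hypothetical model, not the workflow engine's type; it only assumes that state is persisted on every fifth completed iteration.

```rust
/// Hypothetical loop model: iterations completed vs. iterations persisted.
struct LoopState {
    interval: u64,
    completed: u64,
    committed: u64,
}

impl LoopState {
    fn new(interval: u64) -> Self {
        LoopState { interval, completed: 0, committed: 0 }
    }

    /// Finish one iteration and persist only on every `interval`-th one.
    fn finish_iteration(&mut self) {
        self.completed += 1;
        if self.completed % self.interval == 0 {
            self.committed = self.completed;
        }
    }

    /// Iterations that would have to run again after a crash right now.
    fn replayed_on_crash(&self) -> u64 {
        self.completed - self.committed
    }
}

fn main() {
    let mut state = LoopState::new(5);
    let mut worst = 0;
    for _ in 0..23 {
        state.finish_iteration();
        worst = worst.max(state.replayed_on_crash());
    }
    // prints: committed 20 of 23, worst-case replay 4
    println!(
        "committed {} of {}, worst-case replay {}",
        state.committed, state.completed, worst
    );
}
```

The worst case of four replayed iterations is why the loop body needs to be safe to run more than once.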

MasterPtato marked this pull request as ready for review February 26, 2026 01:41
MasterPtato force-pushed the 02-25-fix_various_serverless_bug_fixes branch from 460c180 to 1f8ab8e February 26, 2026 02:14
MasterPtato force-pushed the 02-20-fix_guard_runner_enforce_http_body_size_and_ws_msg_size branch from e89cab8 to a7065e4 February 26, 2026 02:14
MasterPtato force-pushed the 02-25-fix_various_serverless_bug_fixes branch from 1f8ab8e to ea391b1 February 26, 2026 19:42
MasterPtato force-pushed the 02-20-fix_guard_runner_enforce_http_body_size_and_ws_msg_size branch from a7065e4 to 6df70f8 February 26, 2026 19:42