WIP: counter latency monitor by abcxff · Pull Request #5093 · rivet-dev/rivet

abcxff · 2026-05-22T04:17:35Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

railway-app · 2026-05-22T04:17:42Z

🚅 Deployed to the rivet-pr-5093 environment in rivet-frontend

Service	Status	Web	Updated (UTC)
kitchen-sink	❌ Build Failed (View Logs)	Web	May 22, 2026 at 5:09 am
ladle	❌ Build Failed (View Logs)	Web	May 22, 2026 at 4:19 am
frontend-cloud	❌ Build Failed (View Logs)	Web	May 22, 2026 at 4:19 am
mcp-hub	✅ Success (View Logs)	Web	May 22, 2026 at 4:18 am
frontend-inspector	❌ Build Failed (View Logs)	Web	May 22, 2026 at 4:18 am
website	❌ Build Failed (View Logs)	Web	May 22, 2026 at 4:18 am

abcxff · 2026-05-22T04:18:05Z

fix(rivetkit): wire CancellationToken through waitForNamesAvailable to eliminate KV busy-polling #5063 : 2 dependent PRs (#5050 , #5089 )
fix(rivetkit): fix BigInt serialization in inspector workflow history and state deserialization #5058
WIP: counter latency monitor #5093 👈 (View in Graphite)
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-05-22T05:17:54Z

Code Review: PR #5093 — WIP: counter latency monitor

State: Draft. The PR is explicitly marked WIP; this review is advisory.

Summary of Changes

This is a large, multi-concern PR (~28K additions) that bundles:

A new Rust counter-latency benchmarking tool (examples/kitchen-sink/scripts/counter-latency/)
Significant VFS (SQLite) hardening: generation-fenced transport, retain_read_cache flag, has_readable_page skip for read-ahead, head-fence validation split
New VfsContext::Drop summary log, promoted debug → info VFS logs
Actor metrics refactor (labels simplified to actor_name, new in-memory state tracking instead of global retained map)
New /metrics Prometheus endpoint in rivetkit-core and NAPI bindings
New registry.routes API (health, metadata, prometheusMetrics) in TypeScript
Envoy ping health semantics change: last_ping_ts now starts at 0 (unhealthy) instead of now() (healthy by default)
Shutdown grace period now driven by engine-provided actor_stop_threshold from protocol metadata
Metric renames: pegboard_envoy_* → envoy_*, sqlite_commit_envoy_* → envoy_sqlite_commit_*, actor_kv_* → pegboard_actor_kv_*; removal of rivet_ Prometheus registry prefix
mpsc::channel(N) → mpsc::unbounded_channel() across gateway, universaldb, guard-core
Several new VFS correctness tests including warm-PIDX hazard repro attempts and batch-atomic fence tests

Critical Issues

1. /metrics endpoint has no authorization (security)

handle_metrics_fetch in rivetkit-rust/packages/rivetkit-core/src/registry/http.rs calls render_prometheus_metrics() directly without invoking authorize_metrics_request. The authorize_metrics_request function and RIVETKIT_METRICS_ENABLED / RIVETKIT_METRICS_TOKEN env vars exist in metrics_endpoint.rs but are never called from handle_metrics_fetch.

The same applies in the NAPI metrics() method in rivetkit-typescript/packages/rivetkit-napi/src/registry.rs: it renders metrics without checking any token.

The prometheusMetrics() TypeScript route accepts a Request argument (ostensibly to read an Authorization header) but the argument is named _request and discarded. The token check infrastructure exists but is wired to nothing.

In practice, Prometheus metrics are scraped from internal network paths, but the endpoint is exposed on the same port as user traffic and actor requests. Any HTTP request to /metrics on an actor port returns all metrics without credentials. If authorize_metrics_request is the intended gate, it must be called before rendering.

2. Renamed Prometheus metrics are a breaking change for deployed monitoring

The PR renames ~20 metrics:

pegboard_envoy_connection_total → envoy_connection_total
sqlite_commit_envoy_dispatch_duration_seconds → envoy_sqlite_commit_dispatch_duration_seconds
actor_kv_operation_duration_seconds → pegboard_actor_kv_operation_duration_seconds
(etc.)

Additionally, the Registry::new_custom call that previously set Some("rivet".to_string()) as the global prefix is changed to None, dropping the rivet_ prefix from all metrics in the shared registry.

These are breaking changes for any deployed Grafana dashboards, alerting rules, or Prometheus recording rules that reference the old names. The diff updates SQLITE_METRICS.md but does not mention the prefix removal or provide a migration table. There should be either a transition period with both old and new names, or explicit documentation that these dashboards/alerts need updating as part of this merge.

Code Quality Issues

3. eprintln! in VFS tests bypasses structured logging

The three warm-PIDX hazard repro tests use eprintln! for diagnostic output. Per CLAUDE.md, eprintln! should not be used for logging even in test code; tracing::info! / tracing::warn! should be used. The same tests also already use tracing::warn! for summary logging, making the eprintln! calls inconsistent.

4. Per-page-fetch logs promoted from debug to info

In engine/packages/depot-client/src/vfs.rs, the per-get_pages-call log ("vfs get_pages fetch") and per-commit log are promoted from debug to info. These fire on every page fetch during every SQLite operation. On a loaded actor that runs frequent SQL, this will produce very high-volume info logs. The structured fields are valuable for debugging; keep them at debug or add a sampling condition.

5. GenerationFencedTransport::get_pages silently overrides user-provided expected_generation

request.expected_generation.get_or_insert(self.generation);

get_or_insert only sets the field if it is None. If a caller explicitly sets expected_generation to a different value, the fence silently skips. The semantics should be documented with a comment explaining "callers may not override the fenced generation" or an explicit assert/error to prevent future silent bypasses.

6. validate_sqlite_actor_for_request falls back to weaker validation when generation is absent

When expected_generation is None (old clients that don't send a generation), the request falls back to the non-generation validate_sqlite_actor() check. There should be explicit documentation that the fallback is intentional and describes the security model for pre-generation clients.

7. unbounded_channel migration removes backpressure

The PR migrates mpsc::channel(128) to mpsc::unbounded_channel() in pegboard-gateway, pegboard-gateway2, and universaldb transaction drivers. The original caps were presumably chosen deliberately. If backpressure is not needed (e.g. because the sender detects consumer death via a drop_rx watch), that should be documented.

Potential Bugs

8. is_ping_healthy returns false until the first ping arrives — startup health probes may fail

Previously, last_ping_ts was initialized to now() so a fresh envoy was healthy during the ping handshake window. Now it is initialized to 0 and is_ping_healthy returns false when last == 0. During the window between envoy creation and first engine ping (which can be seconds), any health check probe will receive 503. Depending on the container host's readiness-probe behavior, this could trigger premature recycling of a healthy starting container. Consider adding a startup grace window.

9. Shutdown grace period fallback of 30 minutes

let stop_threshold = handle
    .get_protocol_metadata()
    .await
    .map(|x| x.actor_stop_threshold)
    .unwrap_or(30 * 60 * 1000);

If get_protocol_metadata returns None, the drain timeout is 30 minutes — a silent, very long wait compared to the prior 20-second SHUTDOWN_DRAIN_TIMEOUT. This is probably intentional (matching Pegboard's hard stop threshold), but it needs an explanatory comment.

10. VFS generation parsed from name string after registration

In SqliteVfs::register_with_transport_and_initial_pages, the generation is extracted from the VFS name string via rsplit_once("-g"). If the naming convention changes or a test uses a VFS name without a -g suffix, generation silently becomes None instead of an error. Consider asserting the parse succeeds in non-test paths or passing the generation as an explicit parameter.

Design / Architecture Notes

11. Warm-PIDX hazard tests are reproduction attempts, not regression guards

All three warm-PIDX tests end with tracing::info!("... no malformed DB observed on this run") when no corruption is detected — the tests always pass whether or not the bug manifests. These should either: (a) be moved to a dedicated debug/exploration module with #[ignore] until the underlying fix is confirmed, or (b) be replaced by a deterministic repro once the root cause is fixed at the Depot layer.

12. Committed build artifact dist-server/server.mjs

The PR adds examples/kitchen-sink/dist-server/server.mjs (13K+ lines) and its .map file as tracked files. Built artifacts should not be committed; the file should be generated as part of the Docker build step and excluded via .gitignore.

13. actor_name label cardinality

The metrics refactor changes label set from (actor_id_gen, actor_key, envoy_key) to just (actor_name). This is a good cardinality improvement. However, actor_name is still user-controlled at registration time — if users register many actor names, each distinct name creates a new label set. A note or bound on expected cardinality would help operators.

Minor Issues

disabled_by_default is a near-duplicate of enabled_by_default, differing only in the default value. A single parse_bool_env(var, default) -> bool helper would remove the duplication.
// TODO: Move into envoy-client since timing out has to do with protocol compliance in rivetkit-core/src/registry/mod.rs should be tracked as a proper todo item (.agent/todo/) rather than left as an inline comment.
The _request parameter in prometheusMetricsRoute(_request?: Request) confirms the auth gap noted in finding [SVC-2555] Set up issue templates #1 — the underscore signals intentional discard.

railway-app Bot had a problem deploying to rivet-frontend / rivet-pr-5093 May 22, 2026 04:17 Failure

abcxff mentioned this pull request May 22, 2026

feat(workflow-engine): add retryOnTimeout opt-in for step timeout retries #5089

Draft

11 tasks

NathanFlurry and others added 21 commits May 22, 2026 04:59

WIP: counter latency monitor

8e664fc

feat(rivetkit-core): expose metrics endpoint

c5e2cf1

fix(sqlite): fence remote requests by generation

1455f67

fix(runtime): avoid bounded dispatch channel stalls

88102f1

fix(rivetkit): expose client actor metadata

a8c93fa

feat(kitchen-sink): add load testing harness

12f4efc

docs(serverless): document timeout tuning

063f2e1

feat(rivetkit): expose low-cardinality metrics routes

9b85cf3

feat(rivetkit): add current actor metrics

5bed104

fix(rivetkit): require engine ping for health

dc3f075

fix(wasm): point wasm-pack build to new wasm-bindgen repo

b717224

fix(envoy-client): dont abort drain after 20s

04d580f

fix(rivetkit-core): decrement active actor metrics

909edbc

fix(rivetkit): exit pid1 after signal shutdown

ad401cc

fix(rivetkit): use engine actor stop threshold for shutdown

1debedc

feat(kitchen-sink): rust counter-latency harness

a777712

chore(kitchen-sink): refresh bench + smoke scripts

c73cd5d

chore(kitchen-sink): counter actor + sigterm probe tweaks

407633a

feat(kitchen-sink): ws-ping fast-path on tunnel-stress + load-test-agent

58f88e0

test(depot-client): stale vfs cache reads fail closed

e902ea3

test(depot-client): head fence read poisons vfs

b62b00b

NathanFlurry and others added 10 commits May 22, 2026 05:04

test(depot-client): vfs stale page cache writer

989dc51

test(depot-client): delayed read ahead stale pages

955920a

test(depot-client): startup preload stale pages

3162460

test(rivetkit-core): sqlite lifecycle fuzz harness

ab06f95

chore(kitchen-sink): agent load test

260927e

test(depot-client): batch atomic cap repro

d1e97b6

test(depot-client): warm pidx stale read rmw repro

6cebc46

test(depot-client): natural warm pidx repro

fab8ec5

test(depot-client): natural reopen warm pidx repro

27cbff1

fix(rivetkit): Check for runner config before upserting

49e17ec

abcxff force-pushed the compute/nathan/2.3.0-rc branch from 9a2f692 to 49e17ec Compare May 22, 2026 05:08

railway-app Bot had a problem deploying to rivet-frontend / rivet-pr-5093 May 22, 2026 05:08 Failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: counter latency monitor#5093

WIP: counter latency monitor#5093
abcxff wants to merge 31 commits into
mainfrom
compute/nathan/2.3.0-rc

abcxff commented May 22, 2026

Uh oh!

railway-app Bot commented May 22, 2026 •

edited

Loading

Uh oh!

abcxff commented May 22, 2026 •

edited

Loading

Uh oh!

claude Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

abcxff commented May 22, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

railway-app Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

abcxff commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented May 22, 2026

Code Review: PR #5093 — WIP: counter latency monitor

Summary of Changes

Critical Issues

Code Quality Issues

Potential Bugs

Design / Architecture Notes

Minor Issues

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

railway-app Bot commented May 22, 2026 •

edited

Loading

abcxff commented May 22, 2026 •

edited

Loading