Skip to content

WIP: counter latency monitor#5093

Draft
abcxff wants to merge 31 commits into
mainfrom
compute/nathan/2.3.0-rc
Draft

WIP: counter latency monitor#5093
abcxff wants to merge 31 commits into
mainfrom
compute/nathan/2.3.0-rc

Conversation

@abcxff
Copy link
Copy Markdown
Contributor

@abcxff abcxff commented May 22, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 22, 2026

🚅 Deployed to the rivet-pr-5093 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web May 22, 2026 at 5:09 am
ladle ❌ Build Failed (View Logs) Web May 22, 2026 at 4:19 am
frontend-cloud ❌ Build Failed (View Logs) Web May 22, 2026 at 4:19 am
mcp-hub ✅ Success (View Logs) Web May 22, 2026 at 4:18 am
frontend-inspector ❌ Build Failed (View Logs) Web May 22, 2026 at 4:18 am
website ❌ Build Failed (View Logs) Web May 22, 2026 at 4:18 am

Copy link
Copy Markdown
Contributor Author

abcxff commented May 22, 2026

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Code Review: PR #5093 — WIP: counter latency monitor

State: Draft. The PR is explicitly marked WIP; this review is advisory.


Summary of Changes

This is a large, multi-concern PR (~28K additions) that bundles:

  1. A new Rust counter-latency benchmarking tool (examples/kitchen-sink/scripts/counter-latency/)
  2. Significant VFS (SQLite) hardening: generation-fenced transport, retain_read_cache flag, has_readable_page skip for read-ahead, head-fence validation split
  3. New VfsContext::Drop summary log, promoted debuginfo VFS logs
  4. Actor metrics refactor (labels simplified to actor_name, new in-memory state tracking instead of global retained map)
  5. New /metrics Prometheus endpoint in rivetkit-core and NAPI bindings
  6. New registry.routes API (health, metadata, prometheusMetrics) in TypeScript
  7. Envoy ping health semantics change: last_ping_ts now starts at 0 (unhealthy) instead of now() (healthy by default)
  8. Shutdown grace period now driven by engine-provided actor_stop_threshold from protocol metadata
  9. Metric renames: pegboard_envoy_*envoy_*, sqlite_commit_envoy_*envoy_sqlite_commit_*, actor_kv_*pegboard_actor_kv_*; removal of rivet_ Prometheus registry prefix
  10. mpsc::channel(N)mpsc::unbounded_channel() across gateway, universaldb, guard-core
  11. Several new VFS correctness tests including warm-PIDX hazard repro attempts and batch-atomic fence tests

Critical Issues

1. /metrics endpoint has no authorization (security)

handle_metrics_fetch in rivetkit-rust/packages/rivetkit-core/src/registry/http.rs calls render_prometheus_metrics() directly without invoking authorize_metrics_request. The authorize_metrics_request function and RIVETKIT_METRICS_ENABLED / RIVETKIT_METRICS_TOKEN env vars exist in metrics_endpoint.rs but are never called from handle_metrics_fetch.

The same applies in the NAPI metrics() method in rivetkit-typescript/packages/rivetkit-napi/src/registry.rs: it renders metrics without checking any token.

The prometheusMetrics() TypeScript route accepts a Request argument (ostensibly to read an Authorization header) but the argument is named _request and discarded. The token check infrastructure exists but is wired to nothing.

In practice, Prometheus metrics are scraped from internal network paths, but the endpoint is exposed on the same port as user traffic and actor requests. Any HTTP request to /metrics on an actor port returns all metrics without credentials. If authorize_metrics_request is the intended gate, it must be called before rendering.

2. Renamed Prometheus metrics are a breaking change for deployed monitoring

The PR renames ~20 metrics:

  • pegboard_envoy_connection_totalenvoy_connection_total
  • sqlite_commit_envoy_dispatch_duration_secondsenvoy_sqlite_commit_dispatch_duration_seconds
  • actor_kv_operation_duration_secondspegboard_actor_kv_operation_duration_seconds
  • (etc.)

Additionally, the Registry::new_custom call that previously set Some("rivet".to_string()) as the global prefix is changed to None, dropping the rivet_ prefix from all metrics in the shared registry.

These are breaking changes for any deployed Grafana dashboards, alerting rules, or Prometheus recording rules that reference the old names. The diff updates SQLITE_METRICS.md but does not mention the prefix removal or provide a migration table. There should be either a transition period with both old and new names, or explicit documentation that these dashboards/alerts need updating as part of this merge.


Code Quality Issues

3. eprintln! in VFS tests bypasses structured logging

The three warm-PIDX hazard repro tests use eprintln! for diagnostic output. Per CLAUDE.md, eprintln! should not be used for logging even in test code; tracing::info! / tracing::warn! should be used. The same tests also already use tracing::warn! for summary logging, making the eprintln! calls inconsistent.

4. Per-page-fetch logs promoted from debug to info

In engine/packages/depot-client/src/vfs.rs, the per-get_pages-call log ("vfs get_pages fetch") and per-commit log are promoted from debug to info. These fire on every page fetch during every SQLite operation. On a loaded actor that runs frequent SQL, this will produce very high-volume info logs. The structured fields are valuable for debugging; keep them at debug or add a sampling condition.

5. GenerationFencedTransport::get_pages silently overrides user-provided expected_generation

request.expected_generation.get_or_insert(self.generation);

get_or_insert only sets the field if it is None. If a caller explicitly sets expected_generation to a different value, the fence silently skips. The semantics should be documented with a comment explaining "callers may not override the fenced generation" or an explicit assert/error to prevent future silent bypasses.

6. validate_sqlite_actor_for_request falls back to weaker validation when generation is absent

When expected_generation is None (old clients that don't send a generation), the request falls back to the non-generation validate_sqlite_actor() check. There should be explicit documentation that the fallback is intentional and describes the security model for pre-generation clients.

7. unbounded_channel migration removes backpressure

The PR migrates mpsc::channel(128) to mpsc::unbounded_channel() in pegboard-gateway, pegboard-gateway2, and universaldb transaction drivers. The original caps were presumably chosen deliberately. If backpressure is not needed (e.g. because the sender detects consumer death via a drop_rx watch), that should be documented.


Potential Bugs

8. is_ping_healthy returns false until the first ping arrives — startup health probes may fail

Previously, last_ping_ts was initialized to now() so a fresh envoy was healthy during the ping handshake window. Now it is initialized to 0 and is_ping_healthy returns false when last == 0. During the window between envoy creation and first engine ping (which can be seconds), any health check probe will receive 503. Depending on the container host's readiness-probe behavior, this could trigger premature recycling of a healthy starting container. Consider adding a startup grace window.

9. Shutdown grace period fallback of 30 minutes

let stop_threshold = handle
    .get_protocol_metadata()
    .await
    .map(|x| x.actor_stop_threshold)
    .unwrap_or(30 * 60 * 1000);

If get_protocol_metadata returns None, the drain timeout is 30 minutes — a silent, very long wait compared to the prior 20-second SHUTDOWN_DRAIN_TIMEOUT. This is probably intentional (matching Pegboard's hard stop threshold), but it needs an explanatory comment.

10. VFS generation parsed from name string after registration

In SqliteVfs::register_with_transport_and_initial_pages, the generation is extracted from the VFS name string via rsplit_once("-g"). If the naming convention changes or a test uses a VFS name without a -g suffix, generation silently becomes None instead of an error. Consider asserting the parse succeeds in non-test paths or passing the generation as an explicit parameter.


Design / Architecture Notes

11. Warm-PIDX hazard tests are reproduction attempts, not regression guards

All three warm-PIDX tests end with tracing::info!("... no malformed DB observed on this run") when no corruption is detected — the tests always pass whether or not the bug manifests. These should either: (a) be moved to a dedicated debug/exploration module with #[ignore] until the underlying fix is confirmed, or (b) be replaced by a deterministic repro once the root cause is fixed at the Depot layer.

12. Committed build artifact dist-server/server.mjs

The PR adds examples/kitchen-sink/dist-server/server.mjs (13K+ lines) and its .map file as tracked files. Built artifacts should not be committed; the file should be generated as part of the Docker build step and excluded via .gitignore.

13. actor_name label cardinality

The metrics refactor changes label set from (actor_id_gen, actor_key, envoy_key) to just (actor_name). This is a good cardinality improvement. However, actor_name is still user-controlled at registration time — if users register many actor names, each distinct name creates a new label set. A note or bound on expected cardinality would help operators.


Minor Issues

  • disabled_by_default is a near-duplicate of enabled_by_default, differing only in the default value. A single parse_bool_env(var, default) -> bool helper would remove the duplication.
  • // TODO: Move into envoy-client since timing out has to do with protocol compliance in rivetkit-core/src/registry/mod.rs should be tracked as a proper todo item (.agent/todo/) rather than left as an inline comment.
  • The _request parameter in prometheusMetricsRoute(_request?: Request) confirms the auth gap noted in finding [SVC-2555] Set up issue templates #1 — the underscore signals intentional discard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants