Skip to content

feat(rivetkit): expose runtime diagnostics#4918

Closed
NathanFlurry wants to merge 1 commit into05-03-chore_depot-client_migrate_to_worker_threadfrom
driver-test-complaints/expose-runtime-diagnosts
Closed

feat(rivetkit): expose runtime diagnostics#4918
NathanFlurry wants to merge 1 commit into05-03-chore_depot-client_migrate_to_worker_threadfrom
driver-test-complaints/expose-runtime-diagnosts

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented May 4, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 4, 2026

Code Review — feat(rivetkit): expose runtime diagnostics

Overview

This PR adds runtime diagnostics across the stack: active actor count, SQLite VFS metrics, SSE stopping event when an actor shuts down, a malloc_trim call after actor shutdown on glibc targets, and debug HTTP endpoints (/debug/memory, /debug/heap-snapshot) in the kitchen-sink example. It also refactors onSleep/onDestroy native callbacks to always run cleanup regardless of whether the user supplied them.


Issues

1. Hardcoded zeros in JsSqliteVfsMetrics — misleading diagnostic data

rivetkit-typescript/packages/rivetkit-napi/src/database.rs (lines ~714–727):

page_cache_entries: 0.0,
page_cache_weighted_size: 0.0,
page_cache_capacity_pages: 0.0,
write_buffer_dirty_pages: 0.0,
db_size_pages: 0.0,

These fields are in the public JsSqliteVfsMetrics struct and returned from metrics(), but they always return zero. Consumers will silently get wrong data. Either populate them from the underlying struct or remove them from the interface until the data is available.


2. Race condition in wait_actor_registered_then_stopped

engine/sdks/rust/envoy-client/src/handle.rs:

If the actor registers and deregisters before wait_actor_registered_then_stopped is called, the function will wait forever:

  1. notified is created — no pending wakeup because both events already fired.
  2. registered is false, so registered && !actor_is_registered is false.
  3. The function enters select! and waits indefinitely (unless the envoy stops).

In practice the caller is an SSE handler spawned right after decoding the start payload, but the actor could theoretically be very short-lived. A simpler fix: if registered is still false when the state shows the actor is not present, treat that as "never registered" and exit early (or add a timeout arm).


3. Path traversal in /debug/heap-snapshot

examples/kitchen-sink/src/server.ts:

const path = c.req.query("path");
if (!path) {
    return c.json({ error: "missing path" }, 400);
}
const writtenPath = v8.writeHeapSnapshot(path);

path is user-controlled and passed directly to v8.writeHeapSnapshot, allowing a caller to write a heap dump to any location writable by the process (e.g. overwriting config files, secrets, or binary artifacts). The env-flag guard is a soft gate. At minimum, restrict the output path to a fixed temp directory (/tmp/, or better, a pre-configured dir), and reject paths containing .. or absolute paths outside that directory.


4. Test uses a retry poll loop — violates CLAUDE.md testing policy

rivetkit-rust/packages/rivetkit-core/tests/task.rs (action_dispatch_blocks_idle_sleep_until_reply):

for _ in 0..10 {
    if ctx.internal_keep_awake_count() == 0 {
        break;
    }
    yield_now().await;
}
assert_eq!(ctx.internal_keep_awake_count(), 0);

CLAUDE.md: "Never paper over flakes with retry loops or bumped waits." The keep-awake guard drop is async, but the right approach is to add a Notify (or watch channel) that fires when the count drops to zero, then wait on that directly instead of polling with yield_now.


5. Dead code: normalizeExecuteRoute

rivetkit-typescript/packages/rivetkit/src/registry/native-database.ts:

function normalizeExecuteRoute(route: string): SqliteExecuteResult["route"] { ... }

This function is defined but never called in this PR (or apparently anywhere in the diff). It should be removed or used.


6. Double-decode of serverless start payload

handle.rs / serverless.rs: The caller decodes the payload via handle.decode_serverless_actor_start(&payload) to get the actor info, then later start_serverless_actor(&payload) re-decodes the same bytes. Consider returning the parsed (message, actor_start) from a single decode call and threading the message through instead of decoding twice.


7. HTTP server timeouts set to 0 — needs a comment

examples/kitchen-sink/src/server.ts:

httpServer.requestTimeout = 0;
httpServer.headersTimeout = 0;
httpServer.keepAliveTimeout = 0;
httpServer.timeout = 0;

Setting all timeouts to 0 (infinite) is intentional here (long-running SSE streams), but there's no comment explaining why. A reader will assume these are bugs. Add a one-line comment per CLAUDE.md conventions.


8. diagnostics() can hang if the registry is still initialising

rivetkit-typescript/packages/rivetkit/src/registry/index.ts:

for (const candidate of candidates) {
    const { runtime, registry } = await candidate;  // may block indefinitely
    ...
}

If buildConfiguredRegistry is in progress (e.g. compiling WASM), this will block the caller until it resolves. For a diagnostics endpoint this is surprising. Either add a Promise.race with a short timeout or check if the promise is settled before awaiting.


9. malloc_trim(0) called after every actor shutdown — potential cost concern

rivetkit-rust/packages/rivetkit-core/src/actor/task.rs:

malloc_trim is a glibc system call that walks the heap — its cost scales with heap size and fragmentation. Calling it after every single actor shutdown may add measurable tail latency in high-throughput scenarios (many short-lived actors). Consider rate-limiting (e.g. only trim when idle actor count drops to 0) or only calling it from a periodic background task.


10. Split impl EnvoyHandle blocks — style

engine/sdks/rust/envoy-client/src/handle.rs: The free function decode_serverless_actor_start_payload is sandwiched between two separate impl EnvoyHandle blocks. This is valid Rust but unusual. Move the free function before or after all impl blocks so the impl block is contiguous.


11. normalizeNativeMetrics snake_case fallback — dead code

rivetkit-typescript/packages/rivetkit/src/registry/native-database.ts:

const numberField = (camel: string, snake: string) =>
    Number(raw[camel] ?? raw[snake] ?? 0);

NAPI always returns camelCase keys. The snake_case fallback will never be hit and creates confusion about whether the format is stable. Remove the snake fallback parameter.


Minor / Nits

  • SSE_PING_FRAME / SSE_STOPPING_FRAME constants — good cleanup, no issues.
  • onSleep/onDestroy always registered — the semantic change (always run cleanup) is safer and intentional. The comment inside the wasm branch is helpful.
  • actors_notify placement and pairing — the notify_waiters() calls at actor add and remove sites look correct.
  • active_actor_count filtering closed handles — reasonable; worth a comment explaining what "closed" means in this context.
  • Process lifecycle logging in server.ts guarded by SQLITE_MEMORY_SOAK_DIAGNOSTICS — appropriate scoping.

Summary

The main concerns before merging are the hardcoded zeros in the metrics struct (silent misinformation), the path traversal in the heap snapshot endpoint (security), and the race in wait_actor_registered_then_stopped (correctness for very short-lived actors). The retry loop in the test and the dead normalizeExecuteRoute function are small but violate project conventions. Everything else is style or advisory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant