feat(rivetkit): expose runtime diagnostics by NathanFlurry · Pull Request #4918 · rivet-dev/rivet

NathanFlurry · 2026-05-04T00:46:09Z

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

My code follows the style guidelines of this project
I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

NathanFlurry · 2026-05-04T00:46:59Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

fix(rivetkit-core): fail closed sqlite during shutdown #4921
docs(rivetkit): raise serverless start payload limit #4920
test(kitchen-sink): add sqlite memory soak harness #4919
feat(rivetkit): expose runtime diagnostics #4918 👈 (View in Graphite)
chore(depot-client): migrate to worker thread #4907 : 2 other dependent PRs (#4909 , #4912 )
test(driver): select matrix cells with vitest filters #4906
fix(pegboard): restore hibernating requests on serverful start #4904 : 1 other dependent PR (#4905 )
fix(rivetkit): drain shutdown work before sleep #4902 : 1 other dependent PR (#4903 )
feat(sqlite): wire remote sqlite execute handlers #4897
test(serverless): avoid pinning header validation order #4896
chore(driver-test-runner): add native/wasm runtime matrix to skill #4884 : 3 other dependent PRs (#4892 , #4894 , #4901 )
chroe: rename rivetkit-sqlite to depot-client #4883
fix(rivetkit-wasm): fix mem leaks #4880
chore(rivetkit): wasm support #4860
feat(sqlite): add cold read benchmarks and simplify optimizations #4857
feat(sqlite): benchmark cold reads #4848 : 1 other dependent PR (#4864 )
docs(actors): document manual lifecycle controls #4847
test(sqlite): expand db stress coverage #4846 : 1 other dependent PR (#4849 )
test(raw-websocket): add serverless smoke coverage #4845
feat(serverless): configure drain grace period #4844
fix(publish): enable frontend build for engine target and use turbo build instead of turbo build:engine #4843
feat(rivetkit): expose conn handle in onWebSocket context #4842
fix(rivetkit): normalize waitUntil/keepAwake promises to null to avoid serde undefined error #4841
fix(rivetkit): remove requirement for token #4839 : 1 other dependent PR (#4840 )
fix(pegboard-envoy): allow sqlite requests across lifecycle states #4838
fix(pegboard): route scoped actor key reads by runner dc #4837
fix(sqlite-storage): let sqlite open take over stale owners #4834
fix(engine-tests): parallelize multi-DC startup + enable 6 multi-DC tests #4833
fix(api): honor cursor in actor_ids list path #4832
fix(api): silently filter invalid IDs from actor_ids query instead of 400 #4831
fix(api-peer): make actors_delete idempotent for already-destroyed actors #4830
fix(pegboard): use Iterator streaming mode in list_names so limit param is honored #4829
chore(engine-tests): convert envoy tests to Sleep crash policy + triage ignored #4828
feat(engine-tests): wire envoy actor names through prepopulate_actor_names #4827
chore(depot): fault injection tests #4868
feat(sqlite): pitr & forking #4859 : 1 other dependent PR (#4863 )
chore(sqlite): stateless storage refactor #4856
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

claude · 2026-05-04T00:51:21Z

Code Review — feat(rivetkit): expose runtime diagnostics

Overview

This PR adds runtime diagnostics across the stack: active actor count, SQLite VFS metrics, SSE stopping event when an actor shuts down, a malloc_trim call after actor shutdown on glibc targets, and debug HTTP endpoints (/debug/memory, /debug/heap-snapshot) in the kitchen-sink example. It also refactors onSleep/onDestroy native callbacks to always run cleanup regardless of whether the user supplied them.

Issues

1. Hardcoded zeros in `JsSqliteVfsMetrics` — misleading diagnostic data

rivetkit-typescript/packages/rivetkit-napi/src/database.rs (lines ~714–727):

page_cache_entries: 0.0,
page_cache_weighted_size: 0.0,
page_cache_capacity_pages: 0.0,
write_buffer_dirty_pages: 0.0,
db_size_pages: 0.0,

These fields are in the public JsSqliteVfsMetrics struct and returned from metrics(), but they always return zero. Consumers will silently get wrong data. Either populate them from the underlying struct or remove them from the interface until the data is available.

2. Race condition in `wait_actor_registered_then_stopped`

engine/sdks/rust/envoy-client/src/handle.rs:

If the actor registers and deregisters before wait_actor_registered_then_stopped is called, the function will wait forever:

notified is created — no pending wakeup because both events already fired.
registered is false, so registered && !actor_is_registered is false.
The function enters select! and waits indefinitely (unless the envoy stops).

In practice the caller is an SSE handler spawned right after decoding the start payload, but the actor could theoretically be very short-lived. A simpler fix: if registered is still false when the state shows the actor is not present, treat that as "never registered" and exit early (or add a timeout arm).

3. Path traversal in `/debug/heap-snapshot`

examples/kitchen-sink/src/server.ts:

const path = c.req.query("path");
if (!path) {
    return c.json({ error: "missing path" }, 400);
}
const writtenPath = v8.writeHeapSnapshot(path);

path is user-controlled and passed directly to v8.writeHeapSnapshot, allowing a caller to write a heap dump to any location writable by the process (e.g. overwriting config files, secrets, or binary artifacts). The env-flag guard is a soft gate. At minimum, restrict the output path to a fixed temp directory (/tmp/, or better, a pre-configured dir), and reject paths containing .. or absolute paths outside that directory.

4. Test uses a retry poll loop — violates CLAUDE.md testing policy

rivetkit-rust/packages/rivetkit-core/tests/task.rs (action_dispatch_blocks_idle_sleep_until_reply):

for _ in 0..10 {
    if ctx.internal_keep_awake_count() == 0 {
        break;
    }
    yield_now().await;
}
assert_eq!(ctx.internal_keep_awake_count(), 0);

CLAUDE.md: "Never paper over flakes with retry loops or bumped waits." The keep-awake guard drop is async, but the right approach is to add a Notify (or watch channel) that fires when the count drops to zero, then wait on that directly instead of polling with yield_now.

5. Dead code: `normalizeExecuteRoute`

rivetkit-typescript/packages/rivetkit/src/registry/native-database.ts:

function normalizeExecuteRoute(route: string): SqliteExecuteResult["route"] { ... }

This function is defined but never called in this PR (or apparently anywhere in the diff). It should be removed or used.

6. Double-decode of serverless start payload

handle.rs / serverless.rs: The caller decodes the payload via handle.decode_serverless_actor_start(&payload) to get the actor info, then later start_serverless_actor(&payload) re-decodes the same bytes. Consider returning the parsed (message, actor_start) from a single decode call and threading the message through instead of decoding twice.

7. HTTP server timeouts set to 0 — needs a comment

examples/kitchen-sink/src/server.ts:

httpServer.requestTimeout = 0;
httpServer.headersTimeout = 0;
httpServer.keepAliveTimeout = 0;
httpServer.timeout = 0;

Setting all timeouts to 0 (infinite) is intentional here (long-running SSE streams), but there's no comment explaining why. A reader will assume these are bugs. Add a one-line comment per CLAUDE.md conventions.

8. `diagnostics()` can hang if the registry is still initialising

rivetkit-typescript/packages/rivetkit/src/registry/index.ts:

for (const candidate of candidates) {
    const { runtime, registry } = await candidate;  // may block indefinitely
    ...
}

If buildConfiguredRegistry is in progress (e.g. compiling WASM), this will block the caller until it resolves. For a diagnostics endpoint this is surprising. Either add a Promise.race with a short timeout or check if the promise is settled before awaiting.

9. `malloc_trim(0)` called after every actor shutdown — potential cost concern

rivetkit-rust/packages/rivetkit-core/src/actor/task.rs:

malloc_trim is a glibc system call that walks the heap — its cost scales with heap size and fragmentation. Calling it after every single actor shutdown may add measurable tail latency in high-throughput scenarios (many short-lived actors). Consider rate-limiting (e.g. only trim when idle actor count drops to 0) or only calling it from a periodic background task.

10. Split `impl EnvoyHandle` blocks — style

engine/sdks/rust/envoy-client/src/handle.rs: The free function decode_serverless_actor_start_payload is sandwiched between two separate impl EnvoyHandle blocks. This is valid Rust but unusual. Move the free function before or after all impl blocks so the impl block is contiguous.

11. `normalizeNativeMetrics` snake_case fallback — dead code

rivetkit-typescript/packages/rivetkit/src/registry/native-database.ts:

const numberField = (camel: string, snake: string) =>
    Number(raw[camel] ?? raw[snake] ?? 0);

NAPI always returns camelCase keys. The snake_case fallback will never be hit and creates confusion about whether the format is stable. Remove the snake fallback parameter.

Minor / Nits

SSE_PING_FRAME / SSE_STOPPING_FRAME constants — good cleanup, no issues.
onSleep/onDestroy always registered — the semantic change (always run cleanup) is safer and intentional. The comment inside the wasm branch is helpful.
actors_notify placement and pairing — the notify_waiters() calls at actor add and remove sites look correct.
active_actor_count filtering closed handles — reasonable; worth a comment explaining what "closed" means in this context.
Process lifecycle logging in server.ts guarded by SQLITE_MEMORY_SOAK_DIAGNOSTICS — appropriate scoping.

Summary

The main concerns before merging are the hardcoded zeros in the metrics struct (silent misinformation), the path traversal in the heap snapshot endpoint (security), and the race in wait_actor_registered_then_stopped (correctness for very short-lived actors). The retry loop in the test and the dead normalizeExecuteRoute function are small but violate project conventions. Everything else is style or advisory.

feat(rivetkit): expose runtime diagnostics

34179c1

This was referenced May 4, 2026

test(kitchen-sink): add sqlite memory soak harness #4919

Closed

docs(rivetkit): raise serverless start payload limit #4920

Closed

fix(rivetkit-core): fail closed sqlite during shutdown #4921

Closed

chore(depot-client): split sqlite vfs transports #4922

Closed

NathanFlurry closed this May 4, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(rivetkit): expose runtime diagnostics#4918

feat(rivetkit): expose runtime diagnostics#4918
NathanFlurry wants to merge 1 commit into05-03-chore_depot-client_migrate_to_worker_threadfrom
driver-test-complaints/expose-runtime-diagnosts

NathanFlurry commented May 4, 2026

Uh oh!

NathanFlurry commented May 4, 2026 •

edited

Loading

Uh oh!

claude Bot commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

NathanFlurry commented May 4, 2026

Description

Type of change

How Has This Been Tested?

Checklist:

Uh oh!

NathanFlurry commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

claude Bot commented May 4, 2026

Code Review — feat(rivetkit): expose runtime diagnostics

Overview

Issues

1. Hardcoded zeros in JsSqliteVfsMetrics — misleading diagnostic data

2. Race condition in wait_actor_registered_then_stopped

3. Path traversal in /debug/heap-snapshot

4. Test uses a retry poll loop — violates CLAUDE.md testing policy

5. Dead code: normalizeExecuteRoute

6. Double-decode of serverless start payload

7. HTTP server timeouts set to 0 — needs a comment

8. diagnostics() can hang if the registry is still initialising

9. malloc_trim(0) called after every actor shutdown — potential cost concern

10. Split impl EnvoyHandle blocks — style

11. normalizeNativeMetrics snake_case fallback — dead code

Minor / Nits

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

NathanFlurry commented May 4, 2026 •

edited

Loading

1. Hardcoded zeros in `JsSqliteVfsMetrics` — misleading diagnostic data

2. Race condition in `wait_actor_registered_then_stopped`

3. Path traversal in `/debug/heap-snapshot`

5. Dead code: `normalizeExecuteRoute`

8. `diagnostics()` can hang if the registry is still initialising

9. `malloc_trim(0)` called after every actor shutdown — potential cost concern

10. Split `impl EnvoyHandle` blocks — style

11. `normalizeNativeMetrics` snake_case fallback — dead code