Skip to content

chore(rivetkit): add child spans for logging#4734

Draft
NathanFlurry wants to merge 1 commit into04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashesfrom
04-23-chore_rivetkit_add_child_spans_for_logging
Draft

chore(rivetkit): add child spans for logging#4734
NathanFlurry wants to merge 1 commit into04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashesfrom
04-23-chore_rivetkit_add_child_spans_for_logging

Conversation

@NathanFlurry
Copy link
Copy Markdown
Member

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Copy link
Copy Markdown
Member Author

NathanFlurry commented Apr 24, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@NathanFlurry NathanFlurry force-pushed the 04-23-fix_rivetkit-core_orphan_engine_process_and_detect_crashes branch from fbf0f28 to b73cea0 Compare April 24, 2026 09:52
@NathanFlurry NathanFlurry force-pushed the 04-23-chore_rivetkit_add_child_spans_for_logging branch from fbc8b08 to 475a302 Compare April 24, 2026 09:52
@NathanFlurry NathanFlurry mentioned this pull request Apr 24, 2026
11 tasks
@claude
Copy link
Copy Markdown

claude Bot commented Apr 24, 2026

Code Review: chore(rivetkit): add child spans for logging

Note: This PR is a draft. The scope is larger than the title suggests — it also adds graceful shutdown, refactors the NAPI registry state machine, and removes the logfmt TS formatter.


Summary of changes

  • Propagates tracing spans to spawned tasks via .in_current_span() / .instrument()
  • Adds #[tracing::instrument] to actor_inner, ActorTask::run, and dispatch_event
  • Removes per-statement actor_id fields where they're now captured at span level
  • Adds graceful shutdown (CancellationToken on serve, EnvoyHandle::wait_stopped, shutdown_and_wait)
  • Refactors NAPI CoreRegistry into a RegistryState state machine
  • Installs SIGINT/SIGTERM handlers in TypeScript for Mode A (persistent envoy)
  • Removes logfmt.ts and the Pino browser custom writer
  • Switches Rust tracing subscriber to JSON format

Issues

Breaking API change (no migration path documented)

CoreRegistry::serve and serve_with_config now require an explicit CancellationToken:

// Before
pub async fn serve(self) -> Result<()>
pub async fn serve_with_config(self, config: ServeConfig) -> Result<()>

// After
pub async fn serve(self, shutdown: CancellationToken) -> Result<()>
pub async fn serve_with_config(self, config: ServeConfig, shutdown: CancellationToken) -> Result<()>

Same break on the rivetkit thin wrapper. The examples are updated, but any external consumer of the rivetkit-core or rivetkit crates will fail to compile. If this is public API, the breakage should be called out explicitly in the PR description.

Log level inflation — info used for internal lifecycle noise

Many of the new statements log at info for per-request, per-action, and per-message lifecycle events. In production with concurrent actors these will flood structured log pipelines. They should be debug:

  • registry/dispatch.rs: "dispatch_action: sending DispatchCommand::Action", "command queued, awaiting reply", "reply received", "reply was an error", "reply channel dropped"
  • actor/task.rs: "actor task: handling DispatchCommand::Action", "ActorEvent::Action enqueued", "tracked reply received, forwarding", "tracked reply dropped"
  • registry/inspector.rs: every "inspector RPC: ..." entry
  • registry/inspector_ws.rs: "on_message fired", "decoded client message", "processed client message", "ActionRequest received", "ActionResponse ready to send", "sent response"
  • registry/websocket.rs: "handle_websocket: routing", "dispatching to ... handler"
  • napi_actor_events.rs: "napi: dispatching ActorEvent::Action to JS", "napi: invoking action JS callback", "napi: action JS callback returned"

Concretely, a single client action today generates at least 10 separate info-level lines (dispatch → enqueue → napi dispatch → napi invoke → napi return → task reply → forward → dispatch reply), all of which are structural duplicates of what the tracing spans already capture.

Redundant actor_id re-logged inside an instrumented span

ActorTask::run is now instrumented with actor_id as a span field, but several log calls added inside task.rs still carry actor_id = %self.ctx.actor_id() individually. Same pattern in schedule.rssync_alarm_logged / sync_future_alarm_logged add the field back manually. The point of propagating it to the span is to avoid exactly this redundancy.

SHUTDOWN_DRAIN_TIMEOUT defined twice

// rivetkit-core/src/registry/mod.rs
const SHUTDOWN_DRAIN_TIMEOUT: Duration = Duration::from_secs(20);

// rivetkit-core/src/serverless.rs
const SHUTDOWN_DRAIN_TIMEOUT: Duration = Duration::from_secs(20);

Define once in a shared location; drift between these two will cause inconsistent drain behavior.

ensure_envoy TOCTOU race in serverless.rs

if self.shutting_down.load(Ordering::Acquire) {
    return Err(RuntimeShutDown.build());
}
let mut guard = self.envoy.lock().await;
// ... gap where shutdown() can run and take() the handle ...

shutdown() does self.envoy.lock().await.take(), so if it wins the lock between the Ordering::Acquire check and the subsequent self.envoy.lock().await in ensure_envoy, ensure_envoy will then install a fresh handle that shutdown already finished and missed. The post-build check in build_serverless_runtime handles one race window in the NAPI layer, but this earlier window inside the Rust CoreServerlessRuntime::ensure_envoy is not guarded. Fix: do the shutdown check inside the same lock guard.

Comment style violations (CLAUDE.md §Comments)

Several new comments reference implementation artifacts that will rot:

  • serverless.rs: // not_global: true to avoid caching the handle in the process-wide GLOBAL_ENVOY OnceLock. Without this, a shutdown-during-build race (spec §3 step 7) leaves a dead handle cached... — "spec §3 step 7" is meaningless to any future reader.
  • registry.rs (NAPI): the RegistryState doc comment references "spec invariant" without defining it.

Per CLAUDE.md: comments should explain why in terms of constraints or invariants, not reference current task, fix, or spec sections.

Missing newline at end of artifact file

rivetkit-rust/engine/artifacts/errors/registry.shut_down.json has no trailing newline. All other artifact files in that directory end with a newline.

JSON tracing format is now unconditional in NAPI

tracing_subscriber::fmt()
    .json()
    // ...

The format switch is global and permanent for any Node process embedding rivetkit. Logfmt was removed from the TS side in the same PR, so the two layers now diverge: JS/Pino emits its own format, Rust emits JSON on stdout. If the intent is structured JSON end-to-end that is fine, but it should be configurable (or at minimum called out explicitly since it changes developer UX).


Minor observations

  • format_actor_key empty-segment encoding: the \0 escape for empty segments is non-obvious and there is no corresponding decoder. If this is only used for tracing labels (display only), a simpler sentinel like "(empty)" would be clearer.

  • spawn_task signature change: the new actor_id: String parameter is plumbed through to error logging only. Consider using a span field on the spawn instead, consistent with the rest of the PR.

  • logfmt.ts removal: verify no remaining references before merging (grep -r 'logfmt' in the TS packages).

  • with_span_list(false): the tracing subscriber emits the current span but not the ancestor list. This limits traceability when debugging nested span relationships. A comment explaining the trade-off (log volume vs. depth) would help.


What looks good

  • Propagating spans to detached tokio::spawn tasks via .in_current_span() is the correct pattern and was a real gap before.
  • The RegistryState state machine in NAPI correctly handles the BuildingServerless TOCTOU race with the Notify arm-before-drop pattern.
  • EnvoyHandle::wait_stopped using a latched watch::Sender<bool> is clean and avoids the closed-channel case.
  • Removing tokio::signal::ctrl_c() from start_envoy_sync_inner is correct per CLAUDE.md: signal policy belongs to the host, not the library.
  • The bounded SHUTDOWN_DRAIN_TIMEOUT guard prevents indefinite hangs when the engine is unreachable.
  • The TypeScript double-Ctrl-C re-raise pattern (process.kill(pid, signal) after removing handlers) is the right idiom for graceful shutdown.

🤖 Generated with Claude Code

@NathanFlurry NathanFlurry force-pushed the 04-23-chore_rivetkit_add_child_spans_for_logging branch from 36cef20 to a70e758 Compare April 24, 2026 10:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant