perf(wasi): cut the shim's path_filestat_get cost 10x #226
Merged
NathanFlurry merged 13 commits intoJul 2, 2026
…s via event The sidecar drains process output before emitting process_exited and the frame stream + event pump are FIFO, so once the exit event is observed no trailing output can follow it; the host-side quiet-turn drain (2 turns at 10ms) only remains for the snapshot-poll fallback exit path. Also stop awaiting signal-state refreshes on the exec critical path (start + first output) — kill paths await the in-flight refresh instead — and clean up the parked refresh in finishProcess so fallback exits release it too. Warm host-driven exec p50: node -e '' 30ms -> ~17ms, wasm true 52ms -> 39ms; whole wasm-command-floor lane down ~13ms/row; bench:gate green. New regression test: 10 sequential fast-exit 64KiB stdout captures.
Bake the constant wasm runner + wasi shim (~300KB of JS previously recompiled per exec) into the per-process V8 userland snapshot via guest_runtime.snapshot_userland_code, keyed and cached process-wide. Mode env AGENTOS_WASM_SNAPSHOT_RUNNER=auto|block|off: auto probes the snapshot cache without blocking (async warm kicked once per process) and falls back to the byte-identical inline runner until ready, so the cold path is unchanged. The sync-RPC glue is re-evaluated per exec (the snapshot-baked copy cannot bind session bridge fns). Module bytes are now base64-encoded once per module in a bounded, fingerprint-validated cache (64 entries, warn on evict, debug log reports per-entry and cumulative cache bytes) instead of fs::read+encode per exec. Adds AGENTOS_V8_SESSION_PHASES timers (snapshot_get / blob_clone / isolate_new / user_code_execute) and a measured NOTE against EagerCompile at snapshot build (moves cost to isolate deserialize). Warm wasm floor p50: true 39->33ms, pwd 48->43ms, ls-empty 81->75ms, date-version 127->115ms (52/62/96/140 before PR #212); cold first-exec improved; wasm suite green; bench:gate green. Remaining floor is isolate-per-exec + module decode — pooling tracked as Stage C.
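A minimal sketch of the bounded, fingerprint-validated cache idea described above. The class name, the fingerprint shape (size plus mtime), and the least-recently-used eviction order are assumptions made for illustration; the commit only states the 64-entry bound, fingerprint validation, and a warning on eviction.

```typescript
// Hypothetical fingerprint: the real one may hash contents or use other metadata.
interface Fingerprint {
  size: number;
  mtimeMs: number;
}

interface CacheEntry<V> {
  fingerprint: Fingerprint;
  value: V;
}

class BoundedModuleCache<V> {
  // Map iterates in insertion order, so the first key is the least recently used.
  private entries = new Map<string, CacheEntry<V>>();
  private capacity: number;
  private warn: (msg: string) => void;
  evictions = 0;

  constructor(capacity = 64, warn: (msg: string) => void = (m) => console.warn(m)) {
    this.capacity = capacity;
    this.warn = warn;
  }

  get size(): number {
    return this.entries.size;
  }

  // Returns the cached value when the fingerprint still matches; otherwise
  // calls load() once and stores the result under the new fingerprint.
  get(path: string, fp: Fingerprint, load: () => V): V {
    const hit = this.entries.get(path);
    if (hit && hit.fingerprint.size === fp.size && hit.fingerprint.mtimeMs === fp.mtimeMs) {
      this.entries.delete(path); // re-insert to refresh recency
      this.entries.set(path, hit);
      return hit.value;
    }
    const value = load();
    this.entries.delete(path); // drop a stale entry for the same path, if any
    if (this.entries.size >= this.capacity) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
      this.evictions++;
      this.warn(`module cache evicted ${oldest}`);
    }
    this.entries.set(path, { fingerprint: fp, value });
    return value;
  }
}
```

The point of validating the fingerprint on every lookup is that a module rebuilt on disk is reloaded once, while an unchanged module never pays the read and encode cost again.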
The filtered install (--filter @secure-exec/benchmarks...) omitted packages/build-tools, so every bench-gate run died in the v8-bridge build script (missing Node dependencies) before gating anything — the PR gate has never actually run green on CI.
Pre-created V8 session workers (thread + snapshot-restored isolate, keyed by snapshot digest + heap limit) are claimed at session create via a warm hint, taking isolate_new (~4.2ms) and blob_clone off the per-exec critical path. Workers are never reused across guests (each exec still gets a virgin isolate); a wrong hint falls back to the existing lazy-create path. Capacity per key via AGENTOS_V8_WARM_ISOLATES (default 2, 0 disables); refill after claim; pool transitions logged; warm_worker_hit/miss counters in the session phases output. Seeding is deliberately conservative: auto mode stays snapshot-only and only AGENTOS_WASM_SNAPSHOT_RUNNER=block seeds workers (wasm-runner key + default node key). Reason: creating isolates on background threads while other isolates execute guest WebAssembly deterministically SIGSEGVs the pinned V8 130 (WasmCodePointerTable::AllocateUninitializedEntry inside Isolate::New, even under the isolate lifecycle lock — the race is against V8-internal wasm threads). Backtraces + notes in the rusty-v8-upgrade todo; enabling default seeding rides that upgrade. Warm floors (block, this host): node -e '' 17 -> 14.4ms p50, wasm true 33 -> ~30ms (warm_worker_hit 64/67, isolate_new down to 3 background calls); pwd 42.6 -> 37.5ms, printf-0b 44.7 -> 40.2ms, date 115 -> 111ms. wasm suite 8.9s (was 96s, pool+snapshot amortize debug isolates); v8-runtime tests and bench:gate green.
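The claim-or-fall-back behavior of the warm pool can be sketched as below. This is an illustrative model only: `WarmPool`, `PoolKey`, and `keyId` are hypothetical names, the worker type is left generic, and the refill happens synchronously here whereas the commit describes background creation.

```typescript
// Key fields follow the description: snapshot digest plus heap limit.
interface PoolKey {
  snapshotDigest: string;
  heapLimitBytes: number;
}

function keyId(key: PoolKey): string {
  return `${key.snapshotDigest}:${key.heapLimitBytes}`;
}

class WarmPool<W> {
  private parked = new Map<string, W[]>();
  private capacity: number;
  private create: (key: PoolKey) => W;
  hits = 0;
  misses = 0;

  // capacity 0 disables the pool: seed() parks nothing and every claim misses.
  constructor(capacity: number, create: (key: PoolKey) => W) {
    this.capacity = capacity;
    this.create = create;
  }

  // Park workers for this key until the per-key capacity is reached.
  seed(key: PoolKey): void {
    const id = keyId(key);
    const list = this.parked.get(id) ?? [];
    while (list.length < this.capacity) list.push(this.create(key));
    if (list.length > 0) this.parked.set(id, list);
  }

  // Take a parked worker, or return undefined so the caller can fall back
  // to its lazy-create path. A claimed worker is never returned to the pool.
  claim(key: PoolKey): W | undefined {
    const list = this.parked.get(keyId(key));
    if (list === undefined || list.length === 0) {
      this.misses++;
      return undefined;
    }
    const worker = list.pop() as W;
    this.hits++;
    this.seed(key); // refill after claim
    return worker;
  }
}
```

Because a claim removes the worker for good, each exec still starts from a fresh instance; the pool only moves the creation cost off the critical path.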
Module bytes now ride the in-process runtime protocol as Arc<Vec<u8>> from a raw-bytes cache (same bounded/fingerprinted semantics as the base64 cache it replaces) and are injected pre-exec as a __agentOSWasmModuleBytes Uint8Array; the runner prefers it over the AGENTOS_WASM_MODULE_BASE64 env fallback, so the per-exec base64 encode/decode (4.3ms @267KB, ~23ms @2.5MB) is gone in both snapshot and inline modes. The userland (wasm-runner) snapshot script is now compiled with EagerCompile: the fatter-blob isolate cost it causes is prepaid by parked warm workers, so the runner function no longer lazy-compiles inside user_code_execute. Warm floors (block, this host): wasm true 30 -> 19.7ms p50 (52ms at the start of the 3.2 campaign), echo hello 24.0ms, pwd 29.3ms, ls-empty ~39ms (was 96), date --version 30.4ms (was 140 — module-size tax eliminated), printf-64k 136ms (stdout streaming path, separate finding). Injection failure fails the exec loudly. wasm + v8-runtime suites green; bench:gate green (require_100_small flaked 4.3x once and passed 1.13x on rerun — bimodal, pre-existing, tracked).
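The preference order described above (injected bytes first, base64 environment variable as the fallback) might look roughly like this in the runner. The function name and its parameters are hypothetical, and throwing when neither source is present is an assumption about what "fails loudly" could mean; only `__agentOSWasmModuleBytes` and `AGENTOS_WASM_MODULE_BASE64` come from the commit text.

```typescript
// Resolve the wasm module bytes for one exec.
// `globals` stands in for the guest global object, `env` for the process env.
function resolveModuleBytes(
  globals: { __agentOSWasmModuleBytes?: unknown },
  env: Record<string, string | undefined>,
): Uint8Array {
  const injected = globals.__agentOSWasmModuleBytes;
  if (injected instanceof Uint8Array) {
    return injected; // fast path: no base64 encode/decode at all
  }
  const b64 = env.AGENTOS_WASM_MODULE_BASE64;
  if (b64 === undefined) {
    // Assumption: with neither source available there is nothing to run.
    throw new Error("no wasm module bytes were injected and no base64 fallback is set");
  }
  // Fallback path: decode base64 into raw bytes.
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}
```

The decode loop in the fallback is the per-exec work that scales with module size, which is why skipping it matters most for the larger binaries.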
WASM/shell commands write the kernel VFS, but a JS guest's mapped-shadow read arms returned the shadow's answer — including ENOENT — without materializing kernel state: readdir listed only shadow entries, exists and readlink probed only the shadow. Anything a wasm command created was invisible to node (fs.readdirSync ENOENT on a directory the client API could list). readdir now materializes the path from the kernel, merges kernel children into the shadow listing by name (shadow wins), and tolerates kernel ENOENT/ENOTDIR for shadow-only dirs; exists/readlink materialize first like fs.access always did. Mapped unlink/rmdir now mirror the removal into the kernel (best-effort, ENOENT tolerated) — the shadow cannot express deletions, so the new merge would otherwise resurrect a kernel-backed file in the very listing that follows the unlink. Five service regressions: wasm-mkdir visible to node readdir+stat, shell mkdir+redirect children merged, shadow/kernel union lists both writers' files exactly once, same-process shadow readdir, and unlink-no-resurrect (same process + next process). Discriminator repro (direct mkdir / shell mkdir / node mkdir / shell redirect) all visible; wasm suite green; focused readdir lane runs again (was ENOENT on its shell-made fixtures); bench:gate green (readdir_small +0.05ms from the kernel merge — 3.3 targets that row).
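The readdir union described above (kernel children merged into the shadow listing by name, shadow wins, kernel ENOENT/ENOTDIR tolerated) can be illustrated with a small pure function. The types and the sorted output order are assumptions for the sketch; the actual merge lives in the service, not in guest TypeScript.

```typescript
type EntryKind = "file" | "dir" | "symlink";

interface ListingEntry {
  name: string;
  kind: EntryKind;
}

// The kernel side either lists children or reports that the path is not a
// directory it knows about (tolerated for shadow-only directories).
type KernelListing = ListingEntry[] | "ENOENT" | "ENOTDIR";

function mergeListing(shadow: ListingEntry[], kernel: KernelListing): ListingEntry[] {
  const byName = new Map<string, ListingEntry>();
  if (Array.isArray(kernel)) {
    for (const entry of kernel) byName.set(entry.name, entry);
  }
  // Shadow entries are written last, so they win on a name collision.
  for (const entry of shadow) byName.set(entry.name, entry);
  // Sorting is only for a deterministic result in this sketch.
  return Array.from(byName.values()).sort((a, b) =>
    a.name < b.name ? -1 : a.name > b.name ? 1 : 0,
  );
}
```

Keying by name is what guarantees that a file written by both sides appears exactly once. It also shows why unlink has to be mirrored into the kernel: a name removed only from the shadow would come straight back from the kernel side of the merge.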
readdir (spec 3.3): the mapped-shadow arm now derives is_directory from the getdents d_type (DirEntry::file_type) instead of an anchored openat2 + statx per entry (symlinks keep the resolving fallback so symlink-to-dir still reports a directory), drops its redundant materialize-from-kernel side effect (the kernel union merge subsumes it for listing; opens still materialize), and the response crosses as one raw status-2 buffer ([kind][u32le len][name] per entry) decoded directly in the bridge instead of JSON->CBOR->V8 objects. fs/readdir_small guest 0.23 -> ~0.09ms p50 (4.5x node; Accept <=0.1ms met); readdir_big 0.57ms. promises.readdir stays on the CBOR path. streams (spec 3.4): new _fsWritevRaw / fs.writevSync raw multi-buffer RPC ([u32 count] + per-buffer [u32 len][bytes], sequential-write semantics), and WriteStream's end-flush sends its buffered chunks as ONE writev instead of an RPC per chunk. fs/stream_copy_big guest 7.6 -> 4.3ms, stream_copy_small 1.03 -> 0.63ms. The <=3ms Accept is NOT met — measured floor: the residual is sync-bridge byte throughput (~1.5ms per MiB per direction; a 1MiB copy is 2 large transfers), proven by an RPC-count probe (read-ahead batching to 1 RPC per direction made it SLOWER from extra copies and was reverted; the probe artifacts and the rejected round-2 read-ahead live in ~/progress/secure-exec/2026-07-02-*). Bridge byte-throughput is filed as the Phase 5 unblocker (same root cause as the stdout 1.3ms/KiB-chunk finding). Service regressions: raw readdir Dirent semantics (incl. symlink cases), writev stream-copy order, all five mapped-shadow tests; bench:gate green.
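To make the raw readdir layout concrete, here is a sketch of an encoder and decoder for the `[kind][u32le len][name]` framing. The kind byte values and the UTF-8 name encoding are assumptions for illustration; the commit specifies only the field order and the little-endian length.

```typescript
// Hypothetical kind values: 0 = file, 1 = directory, 2 = symlink.
interface RawDirent {
  kind: number;
  name: string;
}

const HEADER_BYTES = 5; // 1 kind byte + 4 length bytes

function encodeDirents(entries: RawDirent[]): Uint8Array {
  const encoder = new TextEncoder();
  const encoded = entries.map((e) => ({ kind: e.kind, bytes: encoder.encode(e.name) }));
  let total = 0;
  for (const e of encoded) total += HEADER_BYTES + e.bytes.length;
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  let offset = 0;
  for (const e of encoded) {
    view.setUint8(offset, e.kind);
    view.setUint32(offset + 1, e.bytes.length, true); // little-endian length
    out.set(e.bytes, offset + HEADER_BYTES);
    offset += HEADER_BYTES + e.bytes.length;
  }
  return out;
}

function decodeDirents(buf: Uint8Array): RawDirent[] {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const decoder = new TextDecoder();
  const entries: RawDirent[] = [];
  let offset = 0;
  while (offset < buf.length) {
    if (offset + HEADER_BYTES > buf.length) throw new RangeError("truncated dirent header");
    const kind = view.getUint8(offset);
    const len = view.getUint32(offset + 1, true);
    const end = offset + HEADER_BYTES + len;
    if (end > buf.length) throw new RangeError("truncated dirent name");
    entries.push({ kind, name: decoder.decode(buf.subarray(offset + HEADER_BYTES, end)) });
    offset = end;
  }
  return entries;
}
```

A single flat buffer like this is decoded in one pass with no intermediate object graph, which is the saving over the JSON to CBOR to V8 object path.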
Every agent-facing command now has a verified host-vs-VM row: sh_pipeline, cd_tmp_pwd, echo_hello, cat_small/big, grep_small/big, sed_substitution, find_1000, tar_small/big, gzip_small/big, jq_extract joining ls_100 and git_init_commit — 16 rows, zero skips, every row asserts output correctness (match counts, extract-diff round trips, exact stdout, sha shape). Vendored wasm command binaries are validated at startup and missing ones fail loudly. Two-tier fixtures on the size-sensitive commands per the 1.5 convention; memory column included. First release-build numbers (warm vmCmd p50 vs host): echo 29ms/11x, cd 51ms/14x, cat 42ms/18x, sed 49ms/23x, jq 102ms/6.5x, git 1.17s/47x, gzip_big 629ms/54x, tar_big 5.1s/631x, grep_big 7.7s/961x, find_1000 5.3s/1470x — the Phase 5.2 worklist, now measured honestly.
Adds service regressions for the fs.promises metadata surface (stat/lstat shapes + ENOENT codes, rename/unlink round trip, mkdir/rmdir recursive + EEXIST/ENOENT, access rejection, chmod/utimes visibility, readlink/realpath). Written while evaluating routing these ops onto the sync bridge facade (spec 5.1 candidate for the fs_promises_stat_x32 128x row): a runtime probe showed promises-vs-sync is a WASH (~40us/stat either way) — the row sits at the sync-RPC per-call floor (32 x ~40us ~= 1.3ms) and the conversion moved nothing, so per the revert rule only the tests land. The row is recorded as a dated exception in the backlog: ~= 5x is unreachable for per-call metadata RPCs against sub-microsecond native statx without client-side stat caching (semantics hazard) or app-visible batching; the real lever is the 3.5 sync-RPC floor itself.
Loopback sockets now get the same event push external sockets earned in PR #199: the kernel socket table fires edge-triggered readiness callbacks (empty->nonempty on recv buffers, datagram queues, and listener pending queues; emitted outside the state lock), the sidecar routes them into net_socket stream events (data/accept/dgram) keyed by kernel socket id, the unix reader thread pushes like the TCP one, and the guest gained wake latches so a wake landing while a pump is active re-pumps instead of no-oping. Three guest cadence taxes found en route: the per-chunk yieldBridgeMacrotask (first event-woken chunk now delivers immediately), and reply-writes issued inside data handlers being microtask-coalesced (handler-originated writes flush immediately). Loopback peer-wait backoff capped at 1ms. Release rows (vs pre-change): tcp_tiny_writes 4.55->3.0ms, udp_echo_big 1.25->0.67, unix_echo_small 5.74->3.5, unix_echo_big 7.38->5-7, tcp_echo_small 4.02->3.2-4.8, tcp_concurrent 5.82->3.4-5. KNOWN RESIDUAL: rows are bimodal run-to-run — a wake/pump race remains (event consumed mid-pump -> EAGAIN -> 10ms park with no re-fire since the edge already triggered); documented in the backlog for the next pass. http_loopback_get carries ~+0.3ms from event overhead (0.89->1.16-1.9). Gate green; kernel edge-trigger unit tests + isolated service regressions for TCP wake/churn, unix echo, dgram.
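The wake-latch behavior ("a wake landing while a pump is active re-pumps instead of no-oping") can be modeled in a few lines. This is a simplified, synchronous sketch with hypothetical names; the real guest pump is asynchronous and tied to socket state.

```typescript
class WakeLatchPump {
  private pumping = false;
  private wakePending = false;
  private drain: () => void;
  rounds = 0;

  // drain() stands in for one pass that reads whatever is currently ready.
  constructor(drain: () => void) {
    this.drain = drain;
  }

  wake(): void {
    if (this.pumping) {
      // A pump is already running: remember the wake instead of dropping it.
      this.wakePending = true;
      return;
    }
    this.pumping = true;
    try {
      do {
        this.wakePending = false;
        this.rounds++;
        this.drain();
      } while (this.wakePending); // a wake arrived mid-pump, so go around again
    } finally {
      this.pumping = false;
    }
  }
}
```

Without the latch, a wake that arrives mid-pump is lost. With edge-triggered events there is no second notification, so the data would then wait for the fallback timer.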
The http2 poll-retain loop (PR #193's 1ms bounded-poll workaround) now gets event pushes: the sidecar's http2 server/session event queues fire an edge-triggered net_socket {event:'http2'} push on empty->nonempty (computed before push_back, sent via the retained V8 session captured at listen/ connect), routed to the guest retain dispatcher which wake-latches and polls immediately; the idle fallback stretches 1ms -> 10ms since events carry latency now. net/http2_loopback_get 95 -> ~53-57ms. Residual (documented in the backlog): ~5 turns per h2c GET still wake by timer. A round-2 attempt reached ZERO timer-origin wakes but broke the stream-API server path (server.on('stream') — the bench op) while the compat API kept working; it was reverted per the revert rule. Retry needs a regression test on the stream-API path first; artifacts in ~/progress/secure-exec/2026-07-02-http2-retain-push/. Adds an h2c request-handler round-trip service regression (runs twice in one VM).
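The edge-triggered push on empty->nonempty, with the edge computed before the item is enqueued, is sketched below. `EdgeQueue` and its callback are hypothetical; the sidecar's queues are native and push through the retained V8 session rather than a plain callback.

```typescript
class EdgeQueue<T> {
  private items: T[] = [];
  private notify: () => void;

  constructor(notify: () => void) {
    this.notify = notify;
  }

  push(item: T): void {
    // Decide whether this push crosses the empty -> nonempty edge
    // before mutating the queue.
    const wasEmpty = this.items.length === 0;
    this.items.push(item);
    if (wasEmpty) this.notify();
  }

  // Hand over everything queued so far and reset to empty, which re-arms the edge.
  drain(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }
}
```

Only the first item after a drain produces a notification, so the consumer has to drain fully on each wake; anything it leaves behind will not trigger another push.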
website/vendor/ was gitignored while website/package.json depended on file:vendor/theme, so every fresh checkout and CI run died at pnpm install (spec item 0.5; main CI has been red since the website landed). The theme is now published: docs-theme v0.3.1 (cut today from the previously-vendored 0.3.0 surface, dist committed and un-gitignored — 0.3.0's tag dropped dist during pnpm git-dep packing). Deps move to github:rivet-dev/docs-theme#v0.3.1 with root pnpm overrides mapping @rivet-gg/components|icons to the tag's vendor subpackages (the theme's internal file: deps don't resolve for git consumers). Verified: pnpm install from a tree with NO website/vendor, and pnpm --dir website build (32 pages) both green.
The wasi shim charged 637-1283us PER path_filestat_get (attribution: AGENTOS_WASI_SYSCALL_PHASES instrumentation, now permanently available env-gated): repeated requireBuiltin lookups per call, multi-attempt path resolution over the guest mappings, and redundant bridge stat fallbacks. Now: fs/path module memos, single-pass direct-stat path resolution, one bridge stat per lookup, mode-derived wasi filetype, and a same-process stat cache cleared on every mutating op. Real-stat per-call cost 637 -> ~60-130us. Review catch (removed before ship): the draft also PREFILLED the stat cache from readdir dirents with SYNTHESIZED attrs (size 0, mtime 0) — that fabricates ls -l sizes / find -newer results and was stripped; only real stat results are cached. DISCOVERED en route (pre-existing, filed as 0.11): wasm-lane stat of KERNEL-backed files (e.g. shell-created) already reports size 0 on trunk — reads return real bytes, only filestat size lies. Rows (vs pre-change): ecosystem ls_100 452 -> ~270-324ms (gate shows 0.55x of baseline-local), tar_small 1439 -> 1398ms, git 1169 -> ~1340 (noise); find_1000 flat at ~5.2s — its cost is 200 x ~1ms stdout fd_writes (the L3 byte/event path), not stats. wasm suite green; gate green.
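Two of the listed changes, the same-process stat cache cleared on every mutating op and the mode-derived WASI filetype, can be sketched together. The class and function names are hypothetical and the bridge is modeled as a plain callback; the filetype numbers are the WASI preview1 `filetype` values and the masks are the POSIX `S_IFMT` bits.

```typescript
interface StatResult {
  size: number;
  mtimeMs: number;
  mode: number;
}

// Map POSIX mode bits to a WASI preview1 filetype number.
// Sockets are left as unknown here because the mode alone cannot tell
// stream from datagram sockets.
function filetypeFromMode(mode: number): number {
  switch (mode & 0o170000) {
    case 0o060000: return 1; // block_device
    case 0o020000: return 2; // character_device
    case 0o040000: return 3; // directory
    case 0o100000: return 4; // regular_file
    case 0o120000: return 7; // symbolic_link
    default: return 0;       // unknown
  }
}

class StatCache {
  private cache = new Map<string, StatResult>();
  private bridgeStat: (path: string) => StatResult;
  bridgeCalls = 0;

  constructor(bridgeStat: (path: string) => StatResult) {
    this.bridgeStat = bridgeStat;
  }

  // Only real stat results are cached; a throwing bridge call caches nothing.
  stat(path: string): StatResult {
    const cached = this.cache.get(path);
    if (cached !== undefined) return cached;
    this.bridgeCalls++;
    const result = this.bridgeStat(path);
    this.cache.set(path, result);
    return result;
  }

  // Wrap every mutating operation so the whole cache is dropped afterwards,
  // even when the operation throws.
  mutate<T>(op: () => T): T {
    try {
      return op();
    } finally {
      this.cache.clear();
    }
  }
}
```

Clearing everything on any mutation is deliberately coarse: it gives up some hits in exchange for never serving a stat that a write, rename, or unlink in the same process has made stale.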