perf(wasi): cut the shim's path_filestat_get cost 10x #226

Merged
NathanFlurry merged 13 commits into main from stack/perf-wasi-cut-the-shim-s-path_filestat_get-cost-10x-opqrluul on Jul 2, 2026

@NathanFlurry NathanFlurry commented Jul 2, 2026

The wasi shim charged 637-1283us PER path_filestat_get (attribution:
AGENTOS_WASI_SYSCALL_PHASES instrumentation, now permanently available
env-gated): repeated requireBuiltin lookups per call, multi-attempt path
resolution over the guest mappings, and redundant bridge stat fallbacks.
Now: fs/path module memos, single-pass direct-stat path resolution, one
bridge stat per lookup, mode-derived wasi filetype, and a same-process stat
cache cleared on every mutating op. Real-stat per-call cost 637 -> ~60-130us.
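
A minimal sketch of two of the pieces named above: a mode-derived WASI filetype and a same-process stat cache that is dropped wholesale on any mutating op. `StatCache`, `filetypeFromMode`, and the `realStat` callback are hypothetical names for illustration, not the shim's actual code; the filetype numbers are the WASI preview1 values and the mode bits are the POSIX `S_IFMT` family.

```typescript
// WASI preview1 filetype values (subset).
const FILETYPE = {
  unknown: 0,
  blockDevice: 1,
  characterDevice: 2,
  directory: 3,
  regularFile: 4,
  symbolicLink: 7,
} as const;

// POSIX st_mode file-type mask.
const S_IFMT = 0o170000;

// Derive the WASI filetype from st_mode instead of issuing extra probes.
function filetypeFromMode(mode: number): number {
  switch (mode & S_IFMT) {
    case 0o040000: return FILETYPE.directory;
    case 0o100000: return FILETYPE.regularFile;
    case 0o120000: return FILETYPE.symbolicLink;
    case 0o020000: return FILETYPE.characterDevice;
    case 0o060000: return FILETYPE.blockDevice;
    default: return FILETYPE.unknown;
  }
}

interface StatResult { mode: number; size: number; mtimeMs: number; }

// Hypothetical cache: stores only real stat results, never synthesized ones.
class StatCache {
  private entries = new Map<string, StatResult>();
  private realStat: (path: string) => StatResult;
  hits = 0;
  misses = 0;

  constructor(realStat: (path: string) => StatResult) {
    this.realStat = realStat;
  }

  stat(path: string): StatResult {
    const cached = this.entries.get(path);
    if (cached !== undefined) {
      this.hits++;
      return cached;
    }
    this.misses++;
    const fresh = this.realStat(path); // one real stat per uncached lookup
    this.entries.set(path, fresh);
    return fresh;
  }

  // Assumed to be called from every mutating op (write, unlink, rename, ...).
  invalidateAll(): void {
    this.entries.clear();
  }
}
```

Clearing the whole map on mutation is coarse, but it avoids having to reason about which paths a rename or rmdir touched.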

Review catch (removed before ship): the draft also PREFILLED the stat cache
from readdir dirents with SYNTHESIZED attrs (size 0, mtime 0) — that
fabricates ls -l sizes / find -newer results and was stripped; only real
stat results are cached. DISCOVERED en route (pre-existing, filed as 0.11):
wasm-lane stat of KERNEL-backed files (e.g. shell-created) already reports
size 0 on trunk — reads return real bytes, only filestat size lies.

Rows (vs pre-change): ecosystem ls_100 452 -> ~270-324ms (gate shows 0.55x
of baseline-local), tar_small 1439 -> 1398ms, git 1169 -> ~1340 (noise);
find_1000 flat at ~5.2s — its cost is 200 x ~1ms stdout fd_writes (the L3
byte/event path), not stats. wasm suite green; gate green.

…s via event

The sidecar drains process output before emitting process_exited and the
frame stream + event pump are FIFO, so once the exit event is observed no
trailing output can follow it; the host-side quiet-turn drain (2 turns at
10ms) only remains for the snapshot-poll fallback exit path. Also stop
awaiting signal-state refreshes on the exec critical path (start + first
output) — kill paths await the in-flight refresh instead — and clean up
the parked refresh in finishProcess so fallback exits release it too.

Warm host-driven exec p50: node -e '' 30ms -> ~17ms, wasm true 52ms ->
39ms; whole wasm-command-floor lane down ~13ms/row; bench:gate green.
New regression test: 10 sequential fast-exit 64KiB stdout captures.

Bake the constant wasm runner + wasi shim (~300KB of JS previously
recompiled per exec) into the per-process V8 userland snapshot via
guest_runtime.snapshot_userland_code, keyed and cached process-wide.
Mode env AGENTOS_WASM_SNAPSHOT_RUNNER=auto|block|off: auto probes the
snapshot cache without blocking (async warm kicked once per process)
and falls back to the byte-identical inline runner until ready, so the
cold path is unchanged. The sync-RPC glue is re-evaluated per exec (the
snapshot-baked copy cannot bind session bridge fns). Module bytes are
now base64-encoded once per module in a bounded, fingerprint-validated
cache (64 entries, warn on evict, debug log reports per-entry and
cumulative cache bytes) instead of fs::read+encode per exec. Adds
AGENTOS_V8_SESSION_PHASES timers (snapshot_get / blob_clone /
isolate_new / user_code_execute) and a measured NOTE against
EagerCompile at snapshot build (moves cost to isolate deserialize).
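
The bounded, fingerprint-validated cache described above could look roughly like the sketch below. It is written in TypeScript for illustration only (the commit mentions `fs::read`, so the real cache presumably lives on the Rust side). The `(size, mtimeMs)` fingerprint, the oldest-inserted eviction order, and the names `ModuleCache` and `encode` are all assumptions.

```typescript
interface Fingerprint { size: number; mtimeMs: number; }
interface Entry { fingerprint: Fingerprint; base64: string; }

class ModuleCache {
  private entries = new Map<string, Entry>();
  private capacity: number;
  private encode: (path: string) => string;
  evictions = 0;

  constructor(capacity: number, encode: (path: string) => string) {
    this.capacity = capacity;
    this.encode = encode;
  }

  // Return the cached encoding if the file's fingerprint still matches,
  // otherwise re-encode and replace the entry.
  get(path: string, current: Fingerprint): string {
    const hit = this.entries.get(path);
    if (
      hit !== undefined &&
      hit.fingerprint.size === current.size &&
      hit.fingerprint.mtimeMs === current.mtimeMs
    ) {
      return hit.base64;
    }
    const base64 = this.encode(path);
    this.entries.delete(path); // drop a stale entry, if any
    if (this.entries.size >= this.capacity) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
        this.evictions++;
        console.warn(`module cache full, evicting ${oldest}`);
      }
    }
    this.entries.set(path, { fingerprint: current, base64 });
    return base64;
  }
}
```

With a capacity of 64, as in the commit, an eviction warning would signal that more distinct modules are in rotation than the cache was sized for.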

Warm wasm floor p50: true 39->33ms, pwd 48->43ms, ls-empty 81->75ms,
date-version 127->115ms (52/62/96/140 before PR #212); cold first-exec
improved; wasm suite green; bench:gate green. Remaining floor is
isolate-per-exec + module decode; pooling tracked as Stage C.

The filtered install (--filter @secure-exec/benchmarks...) omitted
packages/build-tools, so every bench-gate run died in the v8-bridge build
script (missing Node dependencies) before gating anything — the PR gate has
never actually run green on CI.

Pre-created V8 session workers (thread + snapshot-restored isolate, keyed by
snapshot digest + heap limit) are claimed at session create via a warm hint,
taking isolate_new (~4.2ms) and blob_clone off the per-exec critical path.
Workers are never reused across guests (each exec still gets a virgin
isolate); a wrong hint falls back to the existing lazy-create path. Capacity
per key via AGENTOS_V8_WARM_ISOLATES (default 2, 0 disables); refill after
claim; pool transitions logged; warm_worker_hit/miss counters in the session
phases output.
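
As an illustration of the claim/refill behaviour described above, here is a small single-threaded sketch. `WarmPool`, `Worker`, and the `digest:heapLimit` key format are hypothetical; the real pool pre-creates threads and isolates in the background, which this sketch does not model.

```typescript
interface Worker { key: string; id: number; }

class WarmPool {
  private parked = new Map<string, Worker[]>();
  private capacity: number;
  private nextId = 1;
  hits = 0;
  misses = 0;

  // capacity plays the role of AGENTOS_V8_WARM_ISOLATES (0 disables).
  constructor(capacity: number) {
    this.capacity = capacity;
  }

  private create(key: string): Worker {
    return { key, id: this.nextId++ };
  }

  // Top the key's parked list up to capacity.
  seed(key: string): void {
    if (this.capacity === 0) return;
    const list = this.parked.get(key) ?? [];
    while (list.length < this.capacity) list.push(this.create(key));
    this.parked.set(key, list);
  }

  // A claimed worker leaves the pool for good, so it is never reused
  // across guests. A wrong or unseeded hint falls back to lazy creation.
  claim(key: string): Worker {
    const warm = this.parked.get(key)?.shift();
    if (warm !== undefined) {
      this.hits++;
      this.seed(key); // refill after claim
      return warm;
    }
    this.misses++;
    return this.create(key);
  }
}
```

The `hits` and `misses` fields stand in for the warm_worker_hit/miss counters mentioned in the commit.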

Seeding is deliberately conservative: auto mode stays snapshot-only and only
AGENTOS_WASM_SNAPSHOT_RUNNER=block seeds workers (wasm-runner key + default
node key). Reason: creating isolates on background threads while other
isolates execute guest WebAssembly deterministically SIGSEGVs the pinned V8
130 (WasmCodePointerTable::AllocateUninitializedEntry inside Isolate::New,
even under the isolate lifecycle lock — the race is against V8-internal wasm
threads). Backtraces + notes in the rusty-v8-upgrade todo; enabling default
seeding rides that upgrade.

Warm floors (block, this host): node -e '' 17 -> 14.4ms p50, wasm true
33 -> ~30ms (warm_worker_hit 64/67, isolate_new down to 3 background calls);
pwd 42.6 -> 37.5ms, printf-0b 44.7 -> 40.2ms, date 115 -> 111ms. wasm suite
8.9s (was 96s, pool+snapshot amortize debug isolates); v8-runtime tests and
bench:gate green.

Module bytes now ride the in-process runtime protocol as Arc<Vec<u8>> from a
raw-bytes cache (same bounded/fingerprinted semantics as the base64 cache it
replaces) and are injected pre-exec as a __agentOSWasmModuleBytes Uint8Array;
the runner prefers it over the AGENTOS_WASM_MODULE_BASE64 env fallback, so the
per-exec base64 encode/decode (4.3ms @267KB, ~23ms @2.5MB) is gone in both
snapshot and inline modes. The userland (wasm-runner) snapshot script is now
compiled with EagerCompile: the fatter-blob isolate cost it causes is prepaid
by parked warm workers, so the runner function no longer lazy-compiles inside
user_code_execute.

Warm floors (block, this host): wasm true 30 -> 19.7ms p50 (52ms at the start
of the 3.2 campaign), echo hello 24.0ms, pwd 29.3ms, ls-empty ~39ms (was 96),
date --version 30.4ms (was 140 — module-size tax eliminated), printf-64k 136ms
(stdout streaming path, separate finding). Injection failure fails the exec
loudly. wasm + v8-runtime suites green; bench:gate green (require_100_small
flaked 4.3x once and passed 1.13x on rerun; bimodal, pre-existing, tracked).

WASM/shell commands write the kernel VFS, but a JS guest's mapped-shadow
read arms returned the shadow's answer — including ENOENT — without
materializing kernel state: readdir listed only shadow entries, exists and
readlink probed only the shadow. Anything a wasm command created was
invisible to node (fs.readdirSync ENOENT on a directory the client API could
list). readdir now materializes the path from the kernel, merges kernel
children into the shadow listing by name (shadow wins), and tolerates
kernel ENOENT/ENOTDIR for shadow-only dirs; exists/readlink materialize
first like fs.access always did. Mapped unlink/rmdir now mirror the removal
into the kernel (best-effort, ENOENT tolerated) — the shadow cannot express
deletions, so the new merge would otherwise resurrect a kernel-backed file
in the very listing that follows the unlink.
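
The merge rule above (union by name, shadow wins, kernel ENOENT/ENOTDIR tolerated) can be sketched as follows. `mergeListing`, `DirEntry`, and the `readKernel` callback are hypothetical names; errors are assumed to carry a Node-style `code` property.

```typescript
interface DirEntry { name: string; isDirectory: boolean; }

function mergeListing(
  shadow: DirEntry[],
  readKernel: () => DirEntry[],
): DirEntry[] {
  let kernel: DirEntry[] = [];
  try {
    kernel = readKernel();
  } catch (err) {
    // A shadow-only directory has no kernel counterpart; that is not an error.
    const code = (err as { code?: string }).code;
    if (code !== "ENOENT" && code !== "ENOTDIR") throw err;
  }
  const merged = new Map<string, DirEntry>();
  for (const entry of shadow) merged.set(entry.name, entry);
  for (const entry of kernel) {
    // Shadow wins: kernel entries only fill in names the shadow lacks.
    if (!merged.has(entry.name)) merged.set(entry.name, entry);
  }
  return Array.from(merged.values());
}
```

Because the result is a union, a name removed only from the shadow would reappear from the kernel side, which is why unlink/rmdir have to be mirrored into the kernel as the commit describes.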

Five service regressions: wasm-mkdir visible to node readdir+stat, shell
mkdir+redirect children merged, shadow/kernel union lists both writers'
files exactly once, same-process shadow readdir, and unlink-no-resurrect
(same process + next process). Discriminator repro (direct mkdir / shell
mkdir / node mkdir / shell redirect) all visible; wasm suite green; focused
readdir lane runs again (was ENOENT on its shell-made fixtures); bench:gate
green (readdir_small +0.05ms from the kernel merge; 3.3 targets that row).

readdir (spec 3.3): the mapped-shadow arm now derives is_directory from the
getdents d_type (DirEntry::file_type) instead of an anchored openat2 + statx
per entry (symlinks keep the resolving fallback so symlink-to-dir still
reports a directory), drops its redundant materialize-from-kernel side effect
(the kernel union merge subsumes it for listing; opens still materialize),
and the response crosses as one raw status-2 buffer ([kind][u32le len][name]
per entry) decoded directly in the bridge instead of JSON->CBOR->V8 objects.
fs/readdir_small guest 0.23 -> ~0.09ms p50 (4.5x node; Accept <=0.1ms met);
readdir_big 0.57ms. promises.readdir stays on the CBOR path.
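
To make the raw entry layout concrete, here is a sketch of encoding and decoding a buffer of `[kind][u32le len][name]` records. It assumes `kind` is a single byte and names are UTF-8; the meaning of individual `kind` values is not stated in the commit, so the sketch treats it as an opaque number. `encodeDirents` and `decodeDirents` are hypothetical names.

```typescript
interface RawDirent { kind: number; name: string; }

function encodeDirents(entries: RawDirent[]): Uint8Array {
  const encoder = new TextEncoder();
  const encoded = entries.map((e) => ({ kind: e.kind, name: encoder.encode(e.name) }));
  const total = encoded.reduce((sum, e) => sum + 1 + 4 + e.name.length, 0);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  let offset = 0;
  for (const e of encoded) {
    view.setUint8(offset, e.kind);
    view.setUint32(offset + 1, e.name.length, true); // little-endian length
    out.set(e.name, offset + 5);
    offset += 5 + e.name.length;
  }
  return out;
}

function decodeDirents(buf: Uint8Array): RawDirent[] {
  const view = new DataView(buf.buffer, buf.byteOffset, buf.byteLength);
  const decoder = new TextDecoder();
  const out: RawDirent[] = [];
  let offset = 0;
  while (offset < buf.byteLength) {
    if (offset + 5 > buf.byteLength) throw new Error("truncated dirent header");
    const kind = view.getUint8(offset);
    const len = view.getUint32(offset + 1, true);
    offset += 5;
    if (offset + len > buf.byteLength) throw new Error("truncated dirent name");
    out.push({ kind, name: decoder.decode(buf.subarray(offset, offset + len)) });
    offset += len;
  }
  return out;
}
```

A single pass over one buffer like this avoids building intermediate objects per entry, which is the cost the commit attributes to the JSON->CBOR->V8 path.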

streams (spec 3.4): new _fsWritevRaw / fs.writevSync raw multi-buffer RPC
([u32 count] + per-buffer [u32 len][bytes], sequential-write semantics), and
WriteStream's end-flush sends its buffered chunks as ONE writev instead of an
RPC per chunk. fs/stream_copy_big guest 7.6 -> 4.3ms, stream_copy_small
1.03 -> 0.63ms. The <=3ms Accept is NOT met — measured floor: the residual is
sync-bridge byte throughput (~1.5ms per MiB per direction; a 1MiB copy is 2
large transfers), proven by an RPC-count probe (read-ahead batching to 1 RPC
per direction made it SLOWER from extra copies and was reverted; the probe
artifacts and the rejected round-2 read-ahead live in
~/progress/secure-exec/2026-07-02-*). Bridge byte-throughput is filed as the
Phase 5 unblocker (same root cause as the stdout 1.3ms/KiB-chunk finding).
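
The writev frame described above (`[u32 count]` followed by `[u32 len][bytes]` per buffer) can be sketched like this. Little-endian lengths are an assumption carried over from the readdir format; the commit does not state the byte order for writev. `encodeWritev` and `applyWritev` are hypothetical names.

```typescript
// Pack several chunks into one frame so they cross the bridge as a single RPC.
function encodeWritev(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, c) => sum + 4 + c.length, 4);
  const out = new Uint8Array(total);
  const view = new DataView(out.buffer);
  view.setUint32(0, chunks.length, true);
  let offset = 4;
  for (const chunk of chunks) {
    view.setUint32(offset, chunk.length, true);
    out.set(chunk, offset + 4);
    offset += 4 + chunk.length;
  }
  return out;
}

// Receiver side: replay the buffers in order (sequential-write semantics)
// and return the total number of payload bytes written.
function applyWritev(frame: Uint8Array, write: (bytes: Uint8Array) => void): number {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  const count = view.getUint32(0, true);
  let offset = 4;
  let written = 0;
  for (let i = 0; i < count; i++) {
    const len = view.getUint32(offset, true);
    if (offset + 4 + len > frame.byteLength) throw new Error("truncated writev frame");
    write(frame.subarray(offset + 4, offset + 4 + len));
    offset += 4 + len;
    written += len;
  }
  return written;
}
```

Batching removes per-chunk RPC overhead but not the bytes themselves, which fits the commit's finding that the remaining cost is bridge byte throughput rather than RPC count.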

Service regressions: raw readdir Dirent semantics (incl. symlink cases),
writev stream-copy order, all five mapped-shadow tests; bench:gate green.

Every agent-facing command now has a verified host-vs-VM row: sh_pipeline,
cd_tmp_pwd, echo_hello, cat_small/big, grep_small/big, sed_substitution,
find_1000, tar_small/big, gzip_small/big, jq_extract joining ls_100 and
git_init_commit — 16 rows, zero skips, every row asserts output correctness
(match counts, extract-diff round trips, exact stdout, sha shape). Vendored
wasm command binaries are validated at startup and missing ones fail loudly.
Two-tier fixtures on the size-sensitive commands per the 1.5 convention;
memory column included. First release-build numbers (warm vmCmd p50 vs host):
echo 29ms/11x, cd 51ms/14x, cat 42ms/18x, sed 49ms/23x, jq 102ms/6.5x,
git 1.17s/47x, gzip_big 629ms/54x, tar_big 5.1s/631x, grep_big 7.7s/961x,
find_1000 5.3s/1470x; this is the Phase 5.2 worklist, now measured honestly.

Adds service regressions for the fs.promises metadata surface (stat/lstat
shapes + ENOENT codes, rename/unlink round trip, mkdir/rmdir recursive +
EEXIST/ENOENT, access rejection, chmod/utimes visibility, readlink/realpath).

Written while evaluating routing these ops onto the sync bridge facade
(spec 5.1 candidate for the fs_promises_stat_x32 128x row): a runtime probe
showed promises-vs-sync is a WASH (~40us/stat either way) — the row sits at
the sync-RPC per-call floor (32 x ~40us ~= 1.3ms) and the conversion moved
nothing, so per the revert rule only the tests land. The row is recorded as
a dated exception in the backlog: ~= 5x is unreachable for per-call metadata
RPCs against sub-microsecond native statx without client-side stat caching
(semantics hazard) or app-visible batching; the real lever is the 3.5
sync-RPC floor itself.

Loopback sockets now get the same event push external sockets earned in PR
#199: the kernel socket table fires edge-triggered readiness callbacks
(empty->nonempty on recv buffers, datagram queues, and listener pending
queues; emitted outside the state lock), the sidecar routes them into
net_socket stream events (data/accept/dgram) keyed by kernel socket id, the
unix reader thread pushes like the TCP one, and the guest gained wake
latches so a wake landing while a pump is active re-pumps instead of
no-oping. Three guest cadence taxes found and removed en route: the per-chunk
yieldBridgeMacrotask (first event-woken chunk now delivers immediately),
microtask coalescing of reply-writes issued inside data handlers
(handler-originated writes now flush immediately), and the loopback
peer-wait backoff (now capped at 1ms).
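
The wake latch mentioned above can be illustrated with a synchronous sketch: a wake that arrives while a pump is running sets a flag, and the pump loops once more instead of dropping the wake. `WakeLatchedPump` and its `drain` callback are hypothetical; the guest's real pump is asynchronous.

```typescript
class WakeLatchedPump {
  private pumping = false;
  private wakePending = false;
  private drain: () => void;
  pumps = 0;

  constructor(drain: () => void) {
    this.drain = drain;
  }

  wake(): void {
    if (this.pumping) {
      // Latch the wake instead of no-oping; the active pump will re-run.
      this.wakePending = true;
      return;
    }
    this.pumping = true;
    try {
      do {
        this.wakePending = false;
        this.pumps++;
        this.drain();
      } while (this.wakePending);
    } finally {
      this.pumping = false;
    }
  }
}
```

Without the latch, data queued between the pump's last read and its exit would sit until the next timer or event, which is the kind of stall the commit describes.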

Release rows (vs pre-change): tcp_tiny_writes 4.55->3.0ms, udp_echo_big
1.25->0.67, unix_echo_small 5.74->3.5, unix_echo_big 7.38->5-7,
tcp_echo_small 4.02->3.2-4.8, tcp_concurrent 5.82->3.4-5. KNOWN RESIDUAL:
rows are bimodal run-to-run — a wake/pump race remains (event consumed
mid-pump -> EAGAIN -> 10ms park with no re-fire since the edge already
triggered); documented in the backlog for the next pass. http_loopback_get
carries ~+0.3ms from event overhead (0.89->1.16-1.9). Gate green; kernel
edge-trigger unit tests + isolated service regressions for TCP wake/churn,
unix echo, dgram.

The http2 poll-retain loop (PR #193's 1ms bounded-poll workaround) now gets
event pushes: the sidecar's http2 server/session event queues fire an
edge-triggered net_socket {event:'http2'} push on empty->nonempty (computed
before push_back, sent via the retained V8 session captured at listen/
connect), routed to the guest retain dispatcher which wake-latches and polls
immediately; the idle fallback stretches 1ms -> 10ms since events carry
latency now. net/http2_loopback_get 95 -> ~53-57ms.
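
The empty->nonempty edge trigger above, with emptiness sampled before the push, can be sketched as a tiny queue. `EdgeTriggeredQueue` and `onReadable` are hypothetical names, and the sketch ignores locking since it is single-threaded.

```typescript
class EdgeTriggeredQueue<T> {
  private items: T[] = [];
  private onReadable: () => void;

  constructor(onReadable: () => void) {
    this.onReadable = onReadable;
  }

  push(item: T): void {
    // Sample emptiness BEFORE pushing; afterwards the queue is never empty.
    const wasEmpty = this.items.length === 0;
    this.items.push(item);
    if (wasEmpty) this.onReadable(); // fire only on the empty -> nonempty edge
  }

  // Take everything; the next push will fire a fresh edge.
  drain(): T[] {
    const out = this.items;
    this.items = [];
    return out;
  }
}
```

Edge triggering keeps the push count low, but the consumer has to drain fully on each wake because no further notification arrives while the queue stays nonempty.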

Residual (documented in the backlog): ~5 turns per h2c GET still wake by
timer. A round-2 attempt reached ZERO timer-origin wakes but broke the
stream-API server path (server.on('stream') — the bench op) while the compat
API kept working; it was reverted per the revert rule. Retry needs a
regression test on the stream-API path first; artifacts in
~/progress/secure-exec/2026-07-02-http2-retain-push/. Adds an h2c
request-handler round-trip service regression (runs twice in one VM).

website/vendor/ was gitignored while website/package.json depended on
file:vendor/theme, so every fresh checkout and CI run died at pnpm install
(spec item 0.5; main CI has been red since the website landed). The theme
is now published: docs-theme v0.3.1 (cut today from the previously-vendored
0.3.0 surface, dist committed and un-gitignored — 0.3.0's tag dropped dist
during pnpm git-dep packing). Deps move to github:rivet-dev/docs-theme#v0.3.1
with root pnpm overrides mapping @rivet-gg/components|icons to the tag's
vendor subpackages (the theme's internal file: deps don't resolve for git
consumers). Verified: pnpm install from a tree with NO website/vendor, and
pnpm --dir website build (32 pages) both green.

NathanFlurry commented Jul 2, 2026

Stack for rivet-dev/secure-exec

Get stack: forklift get 226
Push local edits: forklift submit
Merge when ready: forklift merge 226

@NathanFlurry NathanFlurry changed the base branch from stack/fix-website-pin-docs-theme-to-the-published-v0-3-1-tag-urvqxsnp to main July 2, 2026 23:42
@NathanFlurry NathanFlurry merged commit 96a6eaf into main Jul 2, 2026
3 checks passed
@NathanFlurry NathanFlurry deleted the stack/perf-wasi-cut-the-shim-s-path_filestat_get-cost-10x-opqrluul branch July 2, 2026 23:42