Skip to content

Phase 0 / IPC / IPC complete (SCM_RIGHTS primary attach + Windows + replay + nightly fuzz)#22

Merged
guysenpai merged 30 commits into
mainfrom
phase-0/ipc/scm-rights-and-windows-and-fuzz
Jun 6, 2026
Merged

Phase 0 / IPC / IPC complete (SCM_RIGHTS primary attach + Windows + replay + nightly fuzz)#22
guysenpai merged 30 commits into
mainfrom
phase-0/ipc/scm-rights-and-windows-and-fuzz

Conversation

@guysenpai

@guysenpai guysenpai commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

M0.7 — IPC complete (SCM_RIGHTS primary attach + Windows + replay + nightly fuzz)

Completes and hardens the S6 IPC skeleton and absorbs the carried-forward Phase −1 IPC debts. No Tier 0 surface widening beyond what the SCM_RIGHTS pivot requires. Advances criterion C0.4 (editor↔runtime IPC stable).

Brief (source of truth): briefs/M0.7-ipc-scm-rights-windows-fuzz.md — Status CLOSED (2026-06-06).


Closing notes (copy from the brief)

What shipped (E1→E4)

  • E1 — SCM_RIGHTS primary attach (POSIX). Cross-process shm_open(O_RDWR) (refused EACCES for non-creator siblings on macOS BSD shm) is gone. The editor creates each region, keeps the fd, and hands the viewport fd to the runtime via IpcSocket.sendWithHandles in a ShmRegionsHandoff right after ProtocolHelloAck; the runtime maps it with ShmRegion.fromFd / ShmViewport.fromFd. open is demoted to intra-process. Protocol bumped to v3. Startup orphan reaping added. EINTR loops attested non-regressed (M0.5-owned, not re-implemented).
  • E2 — extended catalogue. ShmRegionsHandoff/ShmRegionDesc, Play/Pause/Stop/LoadScene/HotReloadScript, SaveScene, SaveProjectProjectSaved (transactional, same seq_id), RuntimeError (non-fatal event). msgTypeOf + isKnown extended to 1..23; schema_hash uniqueness guard kept complete. SaveScene declared but intentionally not wired (out of scope).
  • E3 — Windows editor path. src/editor/main.zig out of error.Unimplemented: CreateProcessW spawn + named pipe + the S2 Win32 window. Editor switched to full std.process.Init (Juicy Main) for init.io/init.arena.
  • E4 — best-effort replay + nightly fuzz. Editor CommandLog ring (1024); last_clean_line set on the ProjectSaved ack. SaveProject persists a minimal binary snapshot (option 1; no .scene.etch writer), reloaded on restart. connection.replayCommands re-sends pending post-clean-line frames and awaits matching acks, stopping hard on nack/timeout/desync (no idempotence, §7.2/§7.3). crash_recovery.zig un-gated to Windows (detection < 100 ms, replay < 500 ms aggregate). fuzz_1h.zig promoted to a nightly cron (Linux + Windows, artifact), rewritten to fuzz the full catalogue with leak detection.

Review addendums acted

  • E3-a (handoff §8.3 hardening, 8859ff9)acceptShmHandoff rejects region_count == 0, > MAX_SHM_REGIONS, or an fd-count mismatch; closes every received fd on rejection and every non-viewport fd on success. Negative tests added.
  • E3-b/c/d (Windows portability, 32e4fbb + 1c3e31d + 34a057e) — surfaced by Guy's real Windows runs, invisible to compile checks: parseArgsArgs.Iterator.initAllocator (.init is a @compileError on Windows); spawn_process error set remapped InvalidUtf8InvalidArgument; CreateProcessW error.SpawnFailed fixed by resolving the runtime path from the editor's exe dir + .exe (with a GetLastError diagnostic); MSVCRT-correct command-line quoting (ArgvQuote).
  • E4-a (fuzz teardown, 3b53414) — the bad-magic teardown sentinel had reclassified InvalidMagic as benign once stop == 1, a narrow window that could mask a real desync on the last backlog frames. Replaced by a well-formed sentinel frame of a type kept out of fuzz_types (ShutdownAck); InvalidMagic / UnknownMsgType / size-mismatch are now unconditional faults at any time. Harness only — prod transport unchanged.

Acted deviations (recap)

  • Path reconciliationsrc/ipc/ipc_messages.etch (no .etch codegen exists) → src/core/ipc/messages.zig; the <platform>/ipc.zig wrapper → src/core/ipc/{shm,shm_posix,shm_windows,transport,transport_windows,viewport}.zig.
  • Dedicated rooted modules instead of inlining into the mains: cleanup.zig (E1), command_log.zig + snapshot.zig (E4) — reusable, unit-tested Tier 0 primitives, rooted per §13.
  • connection.replayCommands at the connection layer (not the editor main) — keeps replay unit-testable.
  • src/runtime/main.zig + src/core/platform/process.zig Windows paths beyond the E3 Files list — a green Windows handshake required the runtime to leave error.Unimplemented and spawn_process to implement CreateProcessW. Confirmed acted with Guy.
  • .gitignore (snapshot artifacts), build.zig (handoff_fd/catalogue registration + nightly arg forwarding), schema_hash.zig/process.zig (guard + native Windows spawn test).

Validation checklist

  • POSIX — macOS dev (Apple Silicon): zig build test / test-ipc green incl. all 4 crash_recovery cases; fuzz 60 s smoke sent == recv == 20 440 075, fault=0, alloc/free balanced; fmt + lint + pre-push (build + test + test-release + tsan-wayland) green.
  • POSIX — Linux CI (ubuntu-24.04, Debug + ReleaseSafe): green.
  • Windows — real PC (Guy): native zig build + test-ipc + test green; interactive editor run ("ipc demo completed cleanly": CreateProcessW + handshake + round-trip + clean shutdown); crash_recovery green; fuzz 60 s smoke green.
  • Windows CI (windows-2025, Debug + ReleaseSafe): green.
  • Nightly fuzz cron wired (.github/workflows/nightly-fuzz.yml) — activates on squash-merge to main (scheduled runs fire only from the default branch); workflow_dispatch available meanwhile.
  • Brief CLOSED, no merge-blocking IN PROGRESS marker; git diff main..HEAD files all in the brief's Files list or journaled as acted deviations.

Note: tests/assets/cache_diff.zig flakes intermittently on Windows — pre-existing M0.6 debt (absolute cook-time-ratio assertion), out of M0.7 scope, not an M0.7 regression. A follow-up chip is spawned.


Doc files touched (in this repo)

  • briefs/M0.7-ipc-scm-rights-windows-fuzz.md — milestone brief (journal E1→E4 + addendums, acted deviations, Closing notes; Status CLOSED).

CLAUDE.md is intentionally not touched here — milestone-state table + tag are Guy's closing gestures.

Out-of-repo (Claude.ai KB — NOT in this PR)

engine-ipc.md (§4.7/§4.8 fd-passing, §7 replay) and engine-phase-0-plan.md are patched by Claude.ai in the knowledge base. They live outside the repo and are not part of this PR.


Review notes (non-blocking)

  • (a) Fuzz teardown reuses ShutdownAck. The well-formed teardown sentinel reuses the catalogue's ShutdownAck type, recognised by msg_type and kept out of fuzz_types via a comptime guard. This is a harness-internal semantic (no dedicated test-only msg_type was added to the prod enum, to avoid widening the catalogue).
  • (b) Replay per-command timeout is delegated to the caller's SO_RCVTIMEO. replayCommands relies on the socket recv timeout the caller arms; the timeout code path itself has no dedicated test (the happy path + nack/desync stop are covered by the crash_recovery replay case).
  • (c) ✅ Addressed (50fa16f). The synchronous-replay seq_id invariant is now documented explicitly in replayCommands (doc comment + inline note): the pass drains each command (send → same-seq_id ack → markAcked) before the editor resumes emitting new commands, which is what keeps a replayed seq_id from colliding with a fresh one in the seq_id→callback map (§3.4); making replay async or pipelined would break it. Comment only — no behavior change.
  • (d) ProjectSaved.ok = 0 (snapshot write failure) path is untested. The runtime surfaces a snapshot write failure via ok = 0 so the editor does not advance its clean line, but there is no test injecting a write failure.

🤖 Generated with Claude Code

guysenpai added 30 commits June 5, 2026 16:28
E1 of M0.7: the POSIX cross-process shm attach pivots to SCM_RIGHTS
fd-passing as the primary path (engine-ipc.md §4.8).

- protocol: WELD_IPC_PROTOCOL_VERSION 2 -> 3
- messages: ShmRegionsHandoff + ShmRegionDesc (msg_type 14)
- connection: recvFrameWithHandles (frame + ancillary fds)
- editor: send ShmRegionsHandoff after the handshake; reap orphan
  sockets/regions at startup (cleanup.zig, is_alive PID check)
- runtime: attach the viewport via ShmViewport.fromFd, no shm_open
- crash_recovery: send the handoff; un-gated to macOS since the pivot
  removes the BSD shm cross-process EACCES
Review fix (E1 addendum). The runtime accepted any handoff with at
least one fd and region_count >= 1; tighten to the §8.3 contract.

- connection.acceptShmHandoff: region_count must be in [1, MAX_SHM_REGIONS]
  and equal the received fd count exactly, else error.InvalidHandoff.
- On rejection, every received fd is closed; on success only the
  viewport (regions[0]) is kept and any further region fd is closed —
  no descriptor leak from a malformed or multi-region handoff.
- transport.closeHandle: platform-abstracted single-handle close.
- handoff_fd.zig: negative tests (count mismatch, over-cap) + multi-region
  success, asserting the unmapped/excess fds are actually closed.
E2 of M0.7. Adds the 9 messages the brief scopes (engine-ipc.md §3.3),
hand-written extern-struct POD like the S6 catalogue (no .etch codegen
exists in-repo). MsgType range 1..23.

- Play / Pause / Stop (fire-and-forget): the runtime tracks play_state;
  the render loop advances the mire only while playing (default playing,
  so S6 behaviour and crash_recovery are unchanged).
- LoadScene / HotReloadScript (fire-and-forget): handlers decode + act;
  an empty LoadScene path emits a non-fatal RuntimeError event.
- SaveScene: declared with NO wired handler (scene granularity, deferred).
- SaveProject -> ProjectSaved: transactional, ack carries the same seq_id.
- RuntimeError: runtime->editor non-fatal event (distinct from CrashReport).
- tests/ipc/catalogue.zig: framing round-trips for every new message +
  end-to-end handler tests against the spawned runtime (SaveProject ack,
  LoadScene->RuntimeError, Play/Pause/Stop sync). schema_hash uniqueness
  extended to all 23 messages.
E3 of M0.7. The editor and runtime leave error.Unimplemented on Windows;
the IPC round-trip now uses the existing named-pipe transport + by-name
shm attach.

- platform/process.zig: spawn_process implements CreateProcessW (UTF-16
  command line, STARTUPINFOW + PROCESS_INFORMATION), replacing the stub.
- editor/main.zig: drops the early error.Unimplemented; OS-correct
  endpoints (transport.buildSocketPath -> `\\.\pipe\weld-<pid>`; shm name
  `Local\weld-shm-viewport-<pid>`); the SCM_RIGHTS handoff is gated to
  POSIX (Windows attaches the named mapping by name, §2.2).
- runtime/main.zig: drops the early error.Unimplemented; branches the
  viewport attach POSIX (fromFd via handoff) vs Windows (open by name).
- runtime reader buffer (review point 1): sized over the FULL incoming
  editor->runtime catalogue, not @max(Echo, LoadScene). Frame sizes
  (24 + @sizeof): Heartbeat 32, Shutdown 25, Echo 88, SpawnEntity 28,
  ModifyComponent 80, Play/Pause/Stop 25, LoadScene 280, SaveScene 280,
  HotReloadScript 32, SaveProject 25 -> max 280. No undersize today, but
  the explicit enumeration is regression-proof.

Windows compile verified by cross-compiling editor + runtime to
x86_64-windows (the full `zig build` cross is blocked by a native-tool
toolchain quirk, so a direct build-exe per binary is used). Runtime
behaviour on Windows is validated by CI + the E4 crash_recovery un-gate.
Review fix (E3 addendum). The naive `"arg"` wrapping in spawn_process
corrupts any argument ending in backslashes (a `\` before the closing
`"` is read as an escaped quote, swallowing the next arg) or containing
a `"`. Replace it with the standard ArgvQuote / CommandLineToArgvW
algorithm in a reusable, testable `quoteArg`.

- platform/process.zig: `pub fn quoteArg` (UTF-8, ASCII-transparent):
  verbatim when no whitespace/quote; else wrap in `"` and double runs of
  backslashes before a `"` (literal or closing). spawn_process uses it
  per arg (argv[0] included, by convention).
- tests/ipc/process.zig: golden cases (empty, verbatim, space, trailing
  backslash with/without space) + a round-trip through a reference
  CommandLineToArgvW parser over the nasty cases (internal quote,
  backslash+quote, tab, multiple trailing backslashes). Pure +
  cross-platform — runs without Windows. Failure-injection confirmed the
  assertions actually run.
Two Windows compile errors found on Guy's real Windows build (E3
addendum) — my cross-compile had missed them (see the method note below).

1. parseArgs (editor + runtime): `std.process.Args.Iterator.init` is a
   `@compileError` on Windows (no POSIX argv). Switch to the allocator
   variant `Iterator.initAllocator(init.args, gpa)` (preserves the Juicy
   Main `init.args`; `deinit` frees the Windows buffer). Works on POSIX too.
2. spawn_process: `utf8ToUtf16LeAllocZ` can return `error.InvalidUtf8`,
   which is outside the process `Error` set. New `utf8ToUtf16Z` helper
   remaps it to `error.InvalidArgument` (non-UTF-8 argv = invalid input)
   rather than widening `Error` with a Unicode member.

Verification-method fix: a `zig build-exe -target x86_64-windows` with
`-target` placed AFTER the `-M` module definitions silently compiled the
modules for the NATIVE (macOS) target — so the Windows-gated paths were
never analyzed (that is why the earlier "cross GREEN" missed both bugs).
With `-target` BEFORE the `-M` flags, `builtin.os.tag == .windows` holds
and the broken `Iterator.init` is correctly rejected; both binaries then
compile clean for Windows. Real `zig build` on Windows remains the
authority.
E3 addendum 2 — CreateProcessW failed (error.SpawnFailed) on Guy's real
Windows run, the first real exercise of the Windows runtime path.

1. Diagnostic: on CreateProcessW failure, log path + GetLastError()
   before returning error.SpawnFailed (e.g. 2 = ERROR_FILE_NOT_FOUND),
   so the cause is confirmable instead of opaque.
2. Root fix: the default runtime_path was the POSIX-inherited
   "zig-out/bin/weld-runtime" — no `.exe`, CWD-relative. CreateProcessW
   with lpApplicationName needs the exact path and does not add `.exe`
   nor search PATH. The editor now derives it from its own executable
   directory (std.process.executableDirPathAlloc) + "weld-runtime" +
   ".exe" on Windows. The `--runtime=` override is kept (tests pass it).
   The editor switches to the full `std.process.Init` (Juicy Main for
   dev tools, §2) to obtain `io` for the lookup.
3. Anti-regression: native Windows spawn_process test (cmd.exe /c exit 0
   → wait_nonblock == 0), gated Windows — covers the path the
   interactive run revealed.

Verified: editor + runtime + tests/ipc/process.zig compile clean for
x86_64-windows (corrected target-first cross method); POSIX test-ipc +
fmt + lint green. Guy re-tests the live CreateProcessW run on Windows.
E4 of M0.7. src/core/ipc/command_log.zig: a 1024-entry FIFO ring of the
editor->runtime commands sent, each retaining its encoded frame for
verbatim replay (engine-ipc.md §7, engine-tools-editor.md §2.7.3).
last_clean_line advances to head on markCleanLine (called when the
ProjectSaved ack arrives); replaySince iterates the post-clean-line
entries still pending (never acked/nacked), in send order, clamped to the
retained window. droppedUnsaved flags ring overflow since the last save.
Rooted in core/root.zig (§13); 6 inline tests.
SaveProject now writes a minimal fixed-size binary snapshot (magic +
version + frame_id marker) as the best-effort-replay reload point
(engine-ipc.md §7.1, brief E4 option 1) — not a .scene.etch writer and
no project-settings serialization (out of Phase 0). The runtime reloads
the marker on restart to resume the mire; a write failure surfaces via
ProjectSaved.ok = 0 so the editor does not advance its clean line on a
save that did not persist. File I/O goes through std.Io.Dir + io (0.16
API; std.fs.cwd() is gone). Snapshot .bin artifacts are gitignored.
replayCommands re-sends each CommandLog entry pending since the last
clean line and awaits a reply carrying the same seq_id; the first nack,
timeout, or seq desync stops the pass hard with complete = false (no
idempotence, engine-ipc.md §7.2/§7.3). The caller arms the per-command
timeout via the socket recv timeout. Never raises — a failure ends the
pass cleanly so the editor can surface it without crashing.
crash_recovery now compiles + runs on Windows too (was POSIX-only): the
per-OS differences are isolated in spawnAndHandshake (POSIX hands the
viewport fd off via SCM_RIGHTS; Windows opens the named mapping by name)
and the cleanup helpers; clock/sleep go through the cross-platform
platform-time wrapper (std.time.milliTimestamp / std.Thread.sleep are
gone in 0.16). Adds a fourth case: SaveProject establishes a clean line,
3 post-save commands are queued unacked, the runtime is killed, then a
restart + replayCommands re-acks all 3 in < 500 ms aggregate.
fuzz_1h sent only Echo frames; it now picks a random message type each
iteration across a 15-type catalogue slice (incl. ShmRegionsHandoff and
every E2 command). Interleaving heterogeneous frame sizes is the real
test — it stresses the length-prefixed framing's delimiting over tens of
millions of back-to-back frames (the no-desync gate), which a fixed-size
Echo stream never exercised. A counting allocator over page_allocator
turns leaks into a non-zero exit (alloc/free + byte tallies must
balance); sent == recv and a clean reader confirm no desync. A
deterministic bad-magic teardown sentinel (published after the stop
flag) unblocks the reader's blocking recv instead of relying on a recv
timeout. Duration is overridable via argv (env vars are gone in 0.16):
zig build test-ipc-fuzz-1h -- --duration-ms=N.
Scheduled (04:00 UTC) + workflow_dispatch job that runs the 1 h
full-catalogue IPC fuzz in ReleaseSafe on ubuntu-24.04 and windows-2025,
uploading each run's stdout digest as an artifact (G3 gate). Scheduled
runs fire only from the default branch, so this activates once M0.7 is
squash-merged to main; workflow_dispatch (with a duration_ms input)
allows manual runs in the meantime.
The bad-magic teardown sentinel reclassified error.InvalidMagic as
benign once stop == 1, opening a masking window: a real framing desync on
the last backlog data frames — sent before the writer published stop but
received after — would be scored as teardown instead of a fault, blunting
the no-desync gate.

The sentinel is now a well-formed frame of a catalogue type kept out of
fuzz_types (ShutdownAck), recognised by the reader via its msg_type. The
clean end-of-stream is thus a valid frame, never an error — so
InvalidMagic / UnknownMsgType / version / size mismatches are
unconditional faults at any time, regardless of stop. Only a post-stop
socket EOF stays tolerated (outside the targeted bug class). A comptime
guard keeps the sentinel type out of fuzz_types. Prod transport
unchanged. 60 s smoke: sent == recv == 20440075, fault=0, alloc/free
balanced.
Review note (c): make explicit that replayCommands is synchronous and
drains each replayed command (send → recv same-seq_id ack → markAcked)
before the editor resumes emitting new commands. That serialization is
what keeps a replayed seq_id from ever colliding with a fresh one in the
seq_id->callback map (§3.4); making replay async or pipelined would break
it. Comment only — no behavior change.
@guysenpai guysenpai merged commit cc782d1 into main Jun 6, 2026
7 checks passed
@guysenpai guysenpai deleted the phase-0/ipc/scm-rights-and-windows-and-fuzz branch June 6, 2026 07:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant