Add opt-in Linux io_uring worktree writer by russellromney · Pull Request #4 · russellromney/ripclone

russellromney · 2026-06-20T23:50:43Z

Summary

Adds an opt-in Linux io_uring worktree writer while keeping POSIX as the portable default for macOS and normal Linux runs.

RIPCLONE_IO_URING behavior:

unset or 0: POSIX writer
1: force the Linux io_uring writer
auto: probe io_uring and fall back to POSIX if unavailable

The original high-level tokio-uring path has been replaced with raw io-uring batching. Each extractor writer thread now lazily owns a thread-local ring, so it preserves the existing extraction parallelism and avoids a cross-thread request/reply per file.

The current Linux fast path uses one linked chain per regular file:

openat into a registered direct/fixed file slot
fixed-file write, hard-linked to the open
fixed-file close, hard-linked to the write

That keeps each batch in flight without first materializing normal file descriptors. Fly's kernel rejected direct openat while O_CLOEXEC was present, so the io_uring path strips that flag for direct opens. If a kernel still rejects direct descriptors with EINVAL/EOPNOTSUPP, the writer logs once and retries the batch with the normal-fd batched io_uring path.

The implementation keeps the existing path validation, symlink final-component protection, executable modes, deterministic mtimes, and portable POSIX fallback behavior.

Fly validation

Validated on the consolidated Fly setup from main:

server: https://ripclone.fly.dev
client app/image: ripclone-client-test
client region: ewr
benchmark machine: 8 shared CPUs, 2 GB, auto-stop/no restart
target path: mounted ext4 Fly volume at /data
repo: oven-sh/bun synced to 88417471cb28aab8943eb6227c014ac3f1c50cbc

Latest explicit /data comparison after direct descriptors started working:

io_uring direct descriptors: extraction 542ms, bench write=755ms, total=2904ms
POSIX comparison: extraction 573ms, bench write=722ms, total=2062ms

The write benchmark field is not a pure syscall/write timer; it covers working-tree extraction plus local blob pack work, and total time is noisier because resolve/fetch vary between runs. The direct-descriptor io_uring path is now functioning and is slightly faster in the extractor loop on this Fly volume run, but remaining cost is mixed with decompression, SHA-1 verification, mtime syscalls, and git skip-worktree cleanup.

Temporary benchmark volume vol_re1076l6w6y7mz14 was destroyed after testing. One-off machines exited with --rm; the standing 8-core app machines were stopped after validation.

Local validation

rustfmt --edition 2024 src/worktree_writer.rs
cargo check
Linux Docker cargo check
git diff --check

A generic rust:latest Docker image cannot link the focused worktree_writer test binary without the system libfuse package, so the post-direct-descriptor Linux validation used cargo check there.

The grace cutoff that protects in-flight uploads is anchored on each object's mtime — when it was last *written*. A sync that re-references an already-stored object without re-uploading it (a reused pack/chunk) leaves that object's mtime old, so to GC it looks unreachable-and-aged-out even though a concurrent sync just started pointing at it. The reachable set is snapshotted before the long list-every-object + walk-every-manifest pass, so any sync that lands during the pass is invisible and its reused objects can be deleted out from under it. Add a reference-time double-check: after building the delete candidates, re-collect the reachable set reading *through* the ref cache (a new `fresh` mode on `collect_reachable_hashes` that invalidates each branch before loading, a no-op for non-caching stores) and drop any candidate that became reachable during the pass. Freshly *written* objects stay protected by the mtime grace; this closes the reused-object window, leaving only a sub-second recheck→delete gap in place of the whole-pass window. Test (suggested #4): a concurrent sync re-references an aged reused object mid-pass (reproduced deterministically via a stale ref cache — first scan sees the pre-sync ref, recheck reads through to the fresh ref); the object must survive. Verified it fails without the recheck. Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…1/A2, A3, U1/U2, S1, AU1) (#62) * fix(refs): enforce "newer never loses" atomically on every backend (M1/M2/M3) The README promises "a newer sync never loses to an older one." That was enforced on none of: the file ref store (the default when no S3), S3 branch refs, and was a get-then-write TOCTOU on every SQL store — so the same race resolved differently depending on where refs happened to live. - Hoist the compare-policy into one shared `should_replace_ref` helper (same-commit always wins; otherwise newer-or-equal `synced_at` wins; a missing timestamp on either side defers to the backend's atomic tie-break). - FileRefStore: give save/save_branch the read-compare-then-rename they lacked, serialized by an in-process write lock (the file store is per-host by design). Also fixes two writers fighting over the shared `.json.tmp`. - SqlRefStore: replace get-then-unconditional-upsert with a single atomic conditional upsert via `MetaDb::save_ordered`. sqlite/postgres/libsql use `ON CONFLICT DO UPDATE ... WHERE`; MySQL (no WHERE on ON DUPLICATE KEY) uses an `@ripl := (...)` IF() session-variable computed once against the original row. Verified policy parity across all four dialects. - S3RefStore: route branch saves through the same ETag CAS as HEAD; unify both into one read-compare-CAS `save_keyed` with a bounded retry loop, and add `S3Storage::put_object_cas` (precondition failure → Ok(false) to retry). Adds the concurrency property test (suggested test #1): N tasks race save_branch with shuffled synced_at; the max must always win. FileRefStore and SQLite run unconditionally (both failed before this change); postgres/ mysql/libsql/S3 arms run when their connection env vars are set. Refs: ADVERSARIAL_REVIEW_2026-06-25.md M1/M2/M3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(queue): guard job settlement and bound attempts to a dead-letter (A1/A2) Two cross-process queue hazards on a shared mirror: - A2 (double-settle): after a time-based reclaim re-owns a job, the original slow-but-alive worker's `finish` still settled it — the UPDATE matched only `WHERE id = ?`. `finish` is now conditional on still owning the claim (`WHERE id = ? AND worker_id = ? AND status = 'claimed'`) and returns whether it actually settled; `ack` threads the worker id and returns that bool. The worker logs and discards a rejected (stale) result instead of clobbering the current owner's build. - A1 (crash-loop): a SIGKILL/OOM build never acks, so its claim goes stale and was requeued forever with no terminal state. Added an `attempts` column (incremented on each claim) and a `RIPCLONE_QUEUE_MAX_ATTEMPTS` cap (default 5): `reclaim_stale` now dead-letters a stale claim that is at/over the cap to terminal `failed` (clearing its credential) and only requeues the rest. Applied identically across all four adapters (sqlite, postgres, mysql, libsql) with a best-effort `attempts` column migration for legacy tables. New tests: a reclaimed worker's late ack is rejected; a hard-killed build dead-letters after the cap. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A1/A2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(server): make two-phase phase 2 durable on ephemeral workers (A3) Two-phase publish acked the job `done` after phase 1 (depth=1) and built the full history in a detached `tokio::spawn`. On a long-lived in-process server that task survives, but an ephemeral/cross-process worker that exits right after acking loses it — the full clonepack is never built and nothing re-drives it. Phase 2 also ran after `do_sync` returned, outside `repo_lock`. `do_sync` / `build_and_publish_two_phase` now take an `inline_full_history` flag. The cross-process worker (`process_build_job` when the queue is not in-process, `inproc_wait() == false`) builds phase 2 inline before returning, so the job is acked `done` only once the full clonepack is durable; if that worker dies mid-build the claim goes stale and is reclaimed/retried (and dead-lettered after the cap — see A1/A2). Running inline keeps phase 2 under the `repo_lock` that `process_build_job` already holds across `do_sync`, so it is serialized against other syncs for the repo. The long-lived in-process server keeps the background spawn, so `/sync` stays fast there. Note: a faithful "SIGKILL the worker right after the phase-1 ack" regression test needs real subprocess control; the in-process e2e harness can't simulate it. The inline path's correctness is covered by the existing cross-process clone e2e (full history present after the job reports Done), and the re-drive-on-death backstop by the A1/A2 dead-letter test. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(io_uring): drain the ring to quiescence before freeing window buffers (U1/U2) A `PendingDirectWindow` owns the buffers the kernel reads/writes for its in-flight SQEs — the file content, the path `CString`, and the kernel-written `statx` results. The kernel owns those until the matching CQEs are reaped, so a window must never be dropped while still in flight. Two paths broke that: - An interrupted wait (`submit_and_wait` returning EINTR) with windows in flight propagated as an error, unwinding and dropping the window while the kernel still owned its buffers → UAF / data race (incl. the kernel writing freed `statx` memory). All submit/wait calls now retry EINTR via `submit_uninterrupted` / `submit_and_wait_uninterrupted` instead of returning. - `Drop` did a best-effort `flush_deferred_writes`, which bails on the first harvest error and leaves later windows in the deque to drop in flight. `Drop` now drains every pending window to `is_complete()` (`drain_pending_to_quiescence`, bounded, EINTR-retrying). The harvest error path likewise quiesces its owned window before returning the error. Adds the missing SAFETY note on the `push_multiple` unsafe block tying each SQE's buffer lifetime to its `PendingDirectWindow` living in `pending_windows` until complete. io_uring is Linux-only (`cfg(target_os = "linux")`); these changes were typechecked with `cargo check --all-targets` in a Linux container (toolchain 1.96.0) since the dev host is macOS. A true UAF regression test needs ASan + kernel EINTR fault injection (report's suggested test #5) and isn't added here; the existing scheduler tests cover happy-path no-regression. Refs: ADVERSARIAL_REVIEW_2026-06-25.md U1/U2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(gc): anchor delete decision on reference time, not object mtime (S1) The grace cutoff that protects in-flight uploads is anchored on each object's mtime — when it was last *written*. A sync that re-references an already-stored object without re-uploading it (a reused pack/chunk) leaves that object's mtime old, so to GC it looks unreachable-and-aged-out even though a concurrent sync just started pointing at it. The reachable set is snapshotted before the long list-every-object + walk-every-manifest pass, so any sync that lands during the pass is invisible and its reused objects can be deleted out from under it. Add a reference-time double-check: after building the delete candidates, re-collect the reachable set reading *through* the ref cache (a new `fresh` mode on `collect_reachable_hashes` that invalidates each branch before loading, a no-op for non-caching stores) and drop any candidate that became reachable during the pass. Freshly *written* objects stay protected by the mtime grace; this closes the reused-object window, leaving only a sub-second recheck→delete gap in place of the whole-pass window. Test (suggested #4): a concurrent sync re-references an aged reused object mid-pass (reproduced deterministically via a stale ref cache — first scan sees the pre-sync ref, recheck reads through to the fresh ref); the object must survive. Verified it fails without the recheck. Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(auth): enforce per-repo read access for private repos (AU1) The shared RIPCLONE_SERVER_TOKEN only authenticates "may talk to this backend"; it is not per-repo authorization. Private-repo reads were gated by the caller's credential only on a cache *miss* (the upstream fetch). On a cache hit the ref response — including signed artifact URLs — was returned before the credential was looked at, so any holder of the shared token could read any already-cached private repo. A standing backend token widened this, and visibility was read from a client-supplied header that fails open to "public". There is no separate auth gateway in this architecture (public repos are anonymous; private repos are gated by the caller's own git credential), so the fix is per-repo verification in the backend, not a signed principal: - New `auth::access::AccessVerifier`. `HttpAccessVerifier` proves access the way a clone does — a `git-upload-pack` `info/refs` probe against the provider (anonymous → public; with the caller's credential → authorized private), provider-agnostic, results cached for `RIPCLONE_REPO_AUTH_TTL_SECS` (default 60s). - `authorize_repo_read` gate runs at every repo-read entry point BEFORE serving or signing — get_ref, sync, git smart-http (info/refs + upload-pack), and repo status — and crucially before the ref-response cache-hit return. Public → serve anonymously; authorized private → serve (private URL TTL); denied → 403. - Visibility is now derived from the verifier, not the client header. - On by default. `RIPCLONE_TRUST_GATEWAY=1` restores single-tenant trust mode (skip the check, header visibility) for a network-isolated self-host; a startup log states which mode is active. Documented in docs/BACKENDS.md. Tests: gate maps public/authorized/denied correctly and falls back to the header in trust mode; a denied caller gets 403 through the real `refs` route (previously 200 from cache). e2e harness uses trust mode (its file:// origins can't be probed over HTTP). Refs: ADVERSARIAL_REVIEW_2026-06-25.md AU1 (+ AU3 visibility fail-open) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(ci): fix lint (rustfmt) and e2e (AU1 trust-mode for file:// origin) - rustfmt across the branch: I had run `cargo test` after each fix but not `cargo fmt --check`, so several files (mostly multi-line SQL/signatures from the queue and auth changes) tripped the lint job. Formatting only. - `e2e_local.sh` clones from a `file://` origin, which the new per-repo access enforcement (AU1) can't probe over HTTP, so the real-binary e2e 403'd on clone. Set `RIPCLONE_TRUST_GATEWAY=1` there — the documented single-tenant escape hatch — matching what the in-process e2e harness already does. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

russellromney force-pushed the io-uring-materialization branch from 84682db to 9b5dac4 Compare June 21, 2026 01:19

russellromney changed the title ~~Add Linux io_uring worktree writer~~ Add opt-in Linux io_uring worktree writer Jun 21, 2026

russellromney force-pushed the io-uring-materialization branch from 76c5ea8 to 0322dc8 Compare June 22, 2026 01:22

russellromney marked this pull request as ready for review June 22, 2026 03:05

russellromney added 12 commits June 21, 2026 20:11

feat(client): add opt-in io_uring worktree writer

b44634c

perf(client): batch io_uring writes per extractor thread

ddedd6b

perf(client): link io_uring writes and closes

e87846b

perf(client): use direct descriptors for io_uring writes

77174fa

perf(client): optimize io_uring archive materialization

ac50b2c

perf(client): batch pack materialization writes

f2b4c0c

docs(client): record rebased io_uring benchmarks

ea1703e

perf(client): tune io_uring defaults and ring waits

53a0d21

perf(client): reduce io_uring completions and grow windows

5868554

perf(client): refresh index stats after unstamped writes

8effbf8

docs(client): record io_uring bdf benchmark

277e1f6

perf(client): collect index stats in write path

b0c3677

russellromney force-pushed the io-uring-materialization branch from d25739f to b0c3677 Compare June 22, 2026 03:43

russellromney merged commit 5c072f2 into main Jun 22, 2026
1 of 3 checks passed

russellromney mentioned this pull request Jun 25, 2026

fix: adversarial review — correctness + security cluster (M1/M2/M3, A1/A2, A3, U1/U2, S1, AU1) #62

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add opt-in Linux io_uring worktree writer#4

Add opt-in Linux io_uring worktree writer#4
russellromney merged 12 commits into
mainfrom
io-uring-materialization

russellromney commented Jun 20, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

russellromney commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Fly validation

Local validation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

russellromney commented Jun 20, 2026 •

edited

Loading