Skip to content

Add opt-in Linux io_uring worktree writer#4

Merged
russellromney merged 12 commits into
mainfrom
io-uring-materialization
Jun 22, 2026
Merged

Add opt-in Linux io_uring worktree writer#4
russellromney merged 12 commits into
mainfrom
io-uring-materialization

Conversation

@russellromney

@russellromney russellromney commented Jun 20, 2026

Copy link
Copy Markdown
Owner

Summary

Adds an opt-in Linux io_uring worktree writer while keeping POSIX as the portable default for macOS and normal Linux runs.

RIPCLONE_IO_URING behavior:

  • unset or 0: POSIX writer
  • 1: force the Linux io_uring writer
  • auto: probe io_uring and fall back to POSIX if unavailable

The original high-level tokio-uring path has been replaced with raw io-uring batching. Each extractor writer thread now lazily owns a thread-local ring, so it preserves the existing extraction parallelism and avoids a cross-thread request/reply per file.

The current Linux fast path uses one linked chain per regular file:

  • openat into a registered direct/fixed file slot
  • fixed-file write, hard-linked to the open
  • fixed-file close, hard-linked to the write

That keeps each batch in flight without first materializing normal file descriptors. Fly's kernel rejected direct openat while O_CLOEXEC was present, so the io_uring path strips that flag for direct opens. If a kernel still rejects direct descriptors with EINVAL/EOPNOTSUPP, the writer logs once and retries the batch with the normal-fd batched io_uring path.

The implementation keeps the existing path validation, symlink final-component protection, executable modes, deterministic mtimes, and portable POSIX fallback behavior.

Fly validation

Validated on the consolidated Fly setup from main:

  • server: https://ripclone.fly.dev
  • client app/image: ripclone-client-test
  • client region: ewr
  • benchmark machine: 8 shared CPUs, 2 GB, auto-stop/no restart
  • target path: mounted ext4 Fly volume at /data
  • repo: oven-sh/bun synced to 88417471cb28aab8943eb6227c014ac3f1c50cbc

Latest explicit /data comparison after direct descriptors started working:

  • io_uring direct descriptors: extraction 542ms, bench write=755ms, total=2904ms
  • POSIX comparison: extraction 573ms, bench write=722ms, total=2062ms

The write benchmark field is not a pure syscall/write timer; it covers working-tree extraction plus local blob pack work, and total time is noisier because resolve/fetch vary between runs. The direct-descriptor io_uring path is now functioning and is slightly faster in the extractor loop on this Fly volume run, but remaining cost is mixed with decompression, SHA-1 verification, mtime syscalls, and git skip-worktree cleanup.

Temporary benchmark volume vol_re1076l6w6y7mz14 was destroyed after testing. One-off machines exited with --rm; the standing 8-core app machines were stopped after validation.

Local validation

  • rustfmt --edition 2024 src/worktree_writer.rs
  • cargo check
  • Linux Docker cargo check
  • git diff --check

A generic rust:latest Docker image cannot link the focused worktree_writer test binary without the system libfuse package, so the post-direct-descriptor Linux validation used cargo check there.

@russellromney russellromney force-pushed the io-uring-materialization branch from 84682db to 9b5dac4 Compare June 21, 2026 01:19
@russellromney russellromney changed the title Add Linux io_uring worktree writer Add opt-in Linux io_uring worktree writer Jun 21, 2026
@russellromney russellromney force-pushed the io-uring-materialization branch from 76c5ea8 to 0322dc8 Compare June 22, 2026 01:22
@russellromney russellromney marked this pull request as ready for review June 22, 2026 03:05
@russellromney russellromney force-pushed the io-uring-materialization branch from d25739f to b0c3677 Compare June 22, 2026 03:43
@russellromney russellromney merged commit 5c072f2 into main Jun 22, 2026
1 of 3 checks passed
russellromney added a commit that referenced this pull request Jun 26, 2026
The grace cutoff that protects in-flight uploads is anchored on each object's
mtime — when it was last *written*. A sync that re-references an already-stored
object without re-uploading it (a reused pack/chunk) leaves that object's mtime
old, so to GC it looks unreachable-and-aged-out even though a concurrent sync
just started pointing at it. The reachable set is snapshotted before the long
list-every-object + walk-every-manifest pass, so any sync that lands during the
pass is invisible and its reused objects can be deleted out from under it.

Add a reference-time double-check: after building the delete candidates,
re-collect the reachable set reading *through* the ref cache (a new `fresh`
mode on `collect_reachable_hashes` that invalidates each branch before loading,
a no-op for non-caching stores) and drop any candidate that became reachable
during the pass. Freshly *written* objects stay protected by the mtime grace;
this closes the reused-object window, leaving only a sub-second recheck→delete
gap in place of the whole-pass window.

Test (suggested #4): a concurrent sync re-references an aged reused object
mid-pass (reproduced deterministically via a stale ref cache — first scan sees
the pre-sync ref, recheck reads through to the fresh ref); the object must
survive. Verified it fails without the recheck.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
russellromney added a commit that referenced this pull request Jun 26, 2026
…1/A2, A3, U1/U2, S1, AU1) (#62)

* fix(refs): enforce "newer never loses" atomically on every backend (M1/M2/M3)

The README promises "a newer sync never loses to an older one." That was
enforced on none of: the file ref store (the default when no S3), S3 branch
refs, and was a get-then-write TOCTOU on every SQL store — so the same race
resolved differently depending on where refs happened to live.

- Hoist the compare-policy into one shared `should_replace_ref` helper
  (same-commit always wins; otherwise newer-or-equal `synced_at` wins; a
  missing timestamp on either side defers to the backend's atomic tie-break).
- FileRefStore: give save/save_branch the read-compare-then-rename they
  lacked, serialized by an in-process write lock (the file store is per-host
  by design). Also fixes two writers fighting over the shared `.json.tmp`.
- SqlRefStore: replace get-then-unconditional-upsert with a single atomic
  conditional upsert via `MetaDb::save_ordered`. sqlite/postgres/libsql use
  `ON CONFLICT DO UPDATE ... WHERE`; MySQL (no WHERE on ON DUPLICATE KEY)
  uses an `@ripl := (...)` IF() session-variable computed once against the
  original row. Verified policy parity across all four dialects.
- S3RefStore: route branch saves through the same ETag CAS as HEAD; unify
  both into one read-compare-CAS `save_keyed` with a bounded retry loop, and
  add `S3Storage::put_object_cas` (precondition failure → Ok(false) to retry).

Adds the concurrency property test (suggested test #1): N tasks race
save_branch with shuffled synced_at; the max must always win. FileRefStore
and SQLite run unconditionally (both failed before this change); postgres/
mysql/libsql/S3 arms run when their connection env vars are set.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md M1/M2/M3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(queue): guard job settlement and bound attempts to a dead-letter (A1/A2)

Two cross-process queue hazards on a shared mirror:

- A2 (double-settle): after a time-based reclaim re-owns a job, the original
  slow-but-alive worker's `finish` still settled it — the UPDATE matched only
  `WHERE id = ?`. `finish` is now conditional on still owning the claim
  (`WHERE id = ? AND worker_id = ? AND status = 'claimed'`) and returns whether
  it actually settled; `ack` threads the worker id and returns that bool. The
  worker logs and discards a rejected (stale) result instead of clobbering the
  current owner's build.
- A1 (crash-loop): a SIGKILL/OOM build never acks, so its claim goes stale and
  was requeued forever with no terminal state. Added an `attempts` column
  (incremented on each claim) and a `RIPCLONE_QUEUE_MAX_ATTEMPTS` cap (default
  5): `reclaim_stale` now dead-letters a stale claim that is at/over the cap to
  terminal `failed` (clearing its credential) and only requeues the rest.

Applied identically across all four adapters (sqlite, postgres, mysql, libsql)
with a best-effort `attempts` column migration for legacy tables. New tests:
a reclaimed worker's late ack is rejected; a hard-killed build dead-letters
after the cap.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md A1/A2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(server): make two-phase phase 2 durable on ephemeral workers (A3)

Two-phase publish acked the job `done` after phase 1 (depth=1) and built the
full history in a detached `tokio::spawn`. On a long-lived in-process server
that task survives, but an ephemeral/cross-process worker that exits right
after acking loses it — the full clonepack is never built and nothing
re-drives it. Phase 2 also ran after `do_sync` returned, outside `repo_lock`.

`do_sync` / `build_and_publish_two_phase` now take an `inline_full_history`
flag. The cross-process worker (`process_build_job` when the queue is not
in-process, `inproc_wait() == false`) builds phase 2 inline before returning,
so the job is acked `done` only once the full clonepack is durable; if that
worker dies mid-build the claim goes stale and is reclaimed/retried (and
dead-lettered after the cap — see A1/A2). Running inline keeps phase 2 under
the `repo_lock` that `process_build_job` already holds across `do_sync`, so it
is serialized against other syncs for the repo. The long-lived in-process
server keeps the background spawn, so `/sync` stays fast there.

Note: a faithful "SIGKILL the worker right after the phase-1 ack" regression
test needs real subprocess control; the in-process e2e harness can't simulate
it. The inline path's correctness is covered by the existing cross-process
clone e2e (full history present after the job reports Done), and the
re-drive-on-death backstop by the A1/A2 dead-letter test.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md A3

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(io_uring): drain the ring to quiescence before freeing window buffers (U1/U2)

A `PendingDirectWindow` owns the buffers the kernel reads/writes for its
in-flight SQEs — the file content, the path `CString`, and the kernel-written
`statx` results. The kernel owns those until the matching CQEs are reaped, so a
window must never be dropped while still in flight. Two paths broke that:

- An interrupted wait (`submit_and_wait` returning EINTR) with windows in flight
  propagated as an error, unwinding and dropping the window while the kernel
  still owned its buffers → UAF / data race (incl. the kernel writing freed
  `statx` memory). All submit/wait calls now retry EINTR via
  `submit_uninterrupted` / `submit_and_wait_uninterrupted` instead of returning.
- `Drop` did a best-effort `flush_deferred_writes`, which bails on the first
  harvest error and leaves later windows in the deque to drop in flight. `Drop`
  now drains every pending window to `is_complete()` (`drain_pending_to_quiescence`,
  bounded, EINTR-retrying). The harvest error path likewise quiesces its owned
  window before returning the error.

Adds the missing SAFETY note on the `push_multiple` unsafe block tying each
SQE's buffer lifetime to its `PendingDirectWindow` living in `pending_windows`
until complete.

io_uring is Linux-only (`cfg(target_os = "linux")`); these changes were
typechecked with `cargo check --all-targets` in a Linux container (toolchain
1.96.0) since the dev host is macOS. A true UAF regression test needs ASan +
kernel EINTR fault injection (report's suggested test #5) and isn't added here;
the existing scheduler tests cover happy-path no-regression.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md U1/U2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(gc): anchor delete decision on reference time, not object mtime (S1)

The grace cutoff that protects in-flight uploads is anchored on each object's
mtime — when it was last *written*. A sync that re-references an already-stored
object without re-uploading it (a reused pack/chunk) leaves that object's mtime
old, so to GC it looks unreachable-and-aged-out even though a concurrent sync
just started pointing at it. The reachable set is snapshotted before the long
list-every-object + walk-every-manifest pass, so any sync that lands during the
pass is invisible and its reused objects can be deleted out from under it.

Add a reference-time double-check: after building the delete candidates,
re-collect the reachable set reading *through* the ref cache (a new `fresh`
mode on `collect_reachable_hashes` that invalidates each branch before loading,
a no-op for non-caching stores) and drop any candidate that became reachable
during the pass. Freshly *written* objects stay protected by the mtime grace;
this closes the reused-object window, leaving only a sub-second recheck→delete
gap in place of the whole-pass window.

Test (suggested #4): a concurrent sync re-references an aged reused object
mid-pass (reproduced deterministically via a stale ref cache — first scan sees
the pre-sync ref, recheck reads through to the fresh ref); the object must
survive. Verified it fails without the recheck.

Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(auth): enforce per-repo read access for private repos (AU1)

The shared RIPCLONE_SERVER_TOKEN only authenticates "may talk to this backend";
it is not per-repo authorization. Private-repo reads were gated by the caller's
credential only on a cache *miss* (the upstream fetch). On a cache hit the ref
response — including signed artifact URLs — was returned before the credential
was looked at, so any holder of the shared token could read any already-cached
private repo. A standing backend token widened this, and visibility was read
from a client-supplied header that fails open to "public".

There is no separate auth gateway in this architecture (public repos are
anonymous; private repos are gated by the caller's own git credential), so the
fix is per-repo verification in the backend, not a signed principal:

- New `auth::access::AccessVerifier`. `HttpAccessVerifier` proves access the way
  a clone does — a `git-upload-pack` `info/refs` probe against the provider
  (anonymous → public; with the caller's credential → authorized private),
  provider-agnostic, results cached for `RIPCLONE_REPO_AUTH_TTL_SECS` (default
  60s).
- `authorize_repo_read` gate runs at every repo-read entry point BEFORE serving
  or signing — get_ref, sync, git smart-http (info/refs + upload-pack), and
  repo status — and crucially before the ref-response cache-hit return. Public →
  serve anonymously; authorized private → serve (private URL TTL); denied → 403.
- Visibility is now derived from the verifier, not the client header.
- On by default. `RIPCLONE_TRUST_GATEWAY=1` restores single-tenant trust mode
  (skip the check, header visibility) for a network-isolated self-host; a
  startup log states which mode is active. Documented in docs/BACKENDS.md.

Tests: gate maps public/authorized/denied correctly and falls back to the
header in trust mode; a denied caller gets 403 through the real `refs` route
(previously 200 from cache). e2e harness uses trust mode (its file:// origins
can't be probed over HTTP).

Refs: ADVERSARIAL_REVIEW_2026-06-25.md AU1 (+ AU3 visibility fail-open)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(ci): fix lint (rustfmt) and e2e (AU1 trust-mode for file:// origin)

- rustfmt across the branch: I had run `cargo test` after each fix but not
  `cargo fmt --check`, so several files (mostly multi-line SQL/signatures
  from the queue and auth changes) tripped the lint job. Formatting only.
- `e2e_local.sh` clones from a `file://` origin, which the new per-repo access
  enforcement (AU1) can't probe over HTTP, so the real-binary e2e 403'd on
  clone. Set `RIPCLONE_TRUST_GATEWAY=1` there — the documented single-tenant
  escape hatch — matching what the in-process e2e harness already does.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant