Add opt-in Linux io_uring worktree writer#4
Merged
Conversation
84682db to
9b5dac4
Compare
76c5ea8 to
0322dc8
Compare
d25739f to
b0c3677
Compare
russellromney
added a commit
that referenced
this pull request
Jun 26, 2026
The grace cutoff that protects in-flight uploads is anchored on each object's mtime — when it was last *written*. A sync that re-references an already-stored object without re-uploading it (a reused pack/chunk) leaves that object's mtime old, so to GC it looks unreachable-and-aged-out even though a concurrent sync just started pointing at it. The reachable set is snapshotted before the long list-every-object + walk-every-manifest pass, so any sync that lands during the pass is invisible and its reused objects can be deleted out from under it. Add a reference-time double-check: after building the delete candidates, re-collect the reachable set reading *through* the ref cache (a new `fresh` mode on `collect_reachable_hashes` that invalidates each branch before loading, a no-op for non-caching stores) and drop any candidate that became reachable during the pass. Freshly *written* objects stay protected by the mtime grace; this closes the reused-object window, leaving only a sub-second recheck→delete gap in place of the whole-pass window. Test (suggested #4): a concurrent sync re-references an aged reused object mid-pass (reproduced deterministically via a stale ref cache — first scan sees the pre-sync ref, recheck reads through to the fresh ref); the object must survive. Verified it fails without the recheck. Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
russellromney
added a commit
that referenced
this pull request
Jun 26, 2026
…1/A2, A3, U1/U2, S1, AU1) (#62) * fix(refs): enforce "newer never loses" atomically on every backend (M1/M2/M3) The README promises "a newer sync never loses to an older one." That was enforced on none of: the file ref store (the default when no S3), S3 branch refs, and was a get-then-write TOCTOU on every SQL store — so the same race resolved differently depending on where refs happened to live. - Hoist the compare-policy into one shared `should_replace_ref` helper (same-commit always wins; otherwise newer-or-equal `synced_at` wins; a missing timestamp on either side defers to the backend's atomic tie-break). - FileRefStore: give save/save_branch the read-compare-then-rename they lacked, serialized by an in-process write lock (the file store is per-host by design). Also fixes two writers fighting over the shared `.json.tmp`. - SqlRefStore: replace get-then-unconditional-upsert with a single atomic conditional upsert via `MetaDb::save_ordered`. sqlite/postgres/libsql use `ON CONFLICT DO UPDATE ... WHERE`; MySQL (no WHERE on ON DUPLICATE KEY) uses an `@ripl := (...)` IF() session-variable computed once against the original row. Verified policy parity across all four dialects. - S3RefStore: route branch saves through the same ETag CAS as HEAD; unify both into one read-compare-CAS `save_keyed` with a bounded retry loop, and add `S3Storage::put_object_cas` (precondition failure → Ok(false) to retry). Adds the concurrency property test (suggested test #1): N tasks race save_branch with shuffled synced_at; the max must always win. FileRefStore and SQLite run unconditionally (both failed before this change); postgres/ mysql/libsql/S3 arms run when their connection env vars are set. Refs: ADVERSARIAL_REVIEW_2026-06-25.md M1/M2/M3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(queue): guard job settlement and bound attempts to a dead-letter (A1/A2) Two cross-process queue hazards on a shared mirror: - A2 (double-settle): after a time-based reclaim re-owns a job, the original slow-but-alive worker's `finish` still settled it — the UPDATE matched only `WHERE id = ?`. `finish` is now conditional on still owning the claim (`WHERE id = ? AND worker_id = ? AND status = 'claimed'`) and returns whether it actually settled; `ack` threads the worker id and returns that bool. The worker logs and discards a rejected (stale) result instead of clobbering the current owner's build. - A1 (crash-loop): a SIGKILL/OOM build never acks, so its claim goes stale and was requeued forever with no terminal state. Added an `attempts` column (incremented on each claim) and a `RIPCLONE_QUEUE_MAX_ATTEMPTS` cap (default 5): `reclaim_stale` now dead-letters a stale claim that is at/over the cap to terminal `failed` (clearing its credential) and only requeues the rest. Applied identically across all four adapters (sqlite, postgres, mysql, libsql) with a best-effort `attempts` column migration for legacy tables. New tests: a reclaimed worker's late ack is rejected; a hard-killed build dead-letters after the cap. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A1/A2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(server): make two-phase phase 2 durable on ephemeral workers (A3) Two-phase publish acked the job `done` after phase 1 (depth=1) and built the full history in a detached `tokio::spawn`. On a long-lived in-process server that task survives, but an ephemeral/cross-process worker that exits right after acking loses it — the full clonepack is never built and nothing re-drives it. Phase 2 also ran after `do_sync` returned, outside `repo_lock`. `do_sync` / `build_and_publish_two_phase` now take an `inline_full_history` flag. The cross-process worker (`process_build_job` when the queue is not in-process, `inproc_wait() == false`) builds phase 2 inline before returning, so the job is acked `done` only once the full clonepack is durable; if that worker dies mid-build the claim goes stale and is reclaimed/retried (and dead-lettered after the cap — see A1/A2). Running inline keeps phase 2 under the `repo_lock` that `process_build_job` already holds across `do_sync`, so it is serialized against other syncs for the repo. The long-lived in-process server keeps the background spawn, so `/sync` stays fast there. Note: a faithful "SIGKILL the worker right after the phase-1 ack" regression test needs real subprocess control; the in-process e2e harness can't simulate it. The inline path's correctness is covered by the existing cross-process clone e2e (full history present after the job reports Done), and the re-drive-on-death backstop by the A1/A2 dead-letter test. Refs: ADVERSARIAL_REVIEW_2026-06-25.md A3 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(io_uring): drain the ring to quiescence before freeing window buffers (U1/U2) A `PendingDirectWindow` owns the buffers the kernel reads/writes for its in-flight SQEs — the file content, the path `CString`, and the kernel-written `statx` results. The kernel owns those until the matching CQEs are reaped, so a window must never be dropped while still in flight. Two paths broke that: - An interrupted wait (`submit_and_wait` returning EINTR) with windows in flight propagated as an error, unwinding and dropping the window while the kernel still owned its buffers → UAF / data race (incl. the kernel writing freed `statx` memory). All submit/wait calls now retry EINTR via `submit_uninterrupted` / `submit_and_wait_uninterrupted` instead of returning. - `Drop` did a best-effort `flush_deferred_writes`, which bails on the first harvest error and leaves later windows in the deque to drop in flight. `Drop` now drains every pending window to `is_complete()` (`drain_pending_to_quiescence`, bounded, EINTR-retrying). The harvest error path likewise quiesces its owned window before returning the error. Adds the missing SAFETY note on the `push_multiple` unsafe block tying each SQE's buffer lifetime to its `PendingDirectWindow` living in `pending_windows` until complete. io_uring is Linux-only (`cfg(target_os = "linux")`); these changes were typechecked with `cargo check --all-targets` in a Linux container (toolchain 1.96.0) since the dev host is macOS. A true UAF regression test needs ASan + kernel EINTR fault injection (report's suggested test #5) and isn't added here; the existing scheduler tests cover happy-path no-regression. Refs: ADVERSARIAL_REVIEW_2026-06-25.md U1/U2 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(gc): anchor delete decision on reference time, not object mtime (S1) The grace cutoff that protects in-flight uploads is anchored on each object's mtime — when it was last *written*. A sync that re-references an already-stored object without re-uploading it (a reused pack/chunk) leaves that object's mtime old, so to GC it looks unreachable-and-aged-out even though a concurrent sync just started pointing at it. The reachable set is snapshotted before the long list-every-object + walk-every-manifest pass, so any sync that lands during the pass is invisible and its reused objects can be deleted out from under it. Add a reference-time double-check: after building the delete candidates, re-collect the reachable set reading *through* the ref cache (a new `fresh` mode on `collect_reachable_hashes` that invalidates each branch before loading, a no-op for non-caching stores) and drop any candidate that became reachable during the pass. Freshly *written* objects stay protected by the mtime grace; this closes the reused-object window, leaving only a sub-second recheck→delete gap in place of the whole-pass window. Test (suggested #4): a concurrent sync re-references an aged reused object mid-pass (reproduced deterministically via a stale ref cache — first scan sees the pre-sync ref, recheck reads through to the fresh ref); the object must survive. Verified it fails without the recheck. Refs: ADVERSARIAL_REVIEW_2026-06-25.md S1 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(auth): enforce per-repo read access for private repos (AU1) The shared RIPCLONE_SERVER_TOKEN only authenticates "may talk to this backend"; it is not per-repo authorization. Private-repo reads were gated by the caller's credential only on a cache *miss* (the upstream fetch). On a cache hit the ref response — including signed artifact URLs — was returned before the credential was looked at, so any holder of the shared token could read any already-cached private repo. A standing backend token widened this, and visibility was read from a client-supplied header that fails open to "public". There is no separate auth gateway in this architecture (public repos are anonymous; private repos are gated by the caller's own git credential), so the fix is per-repo verification in the backend, not a signed principal: - New `auth::access::AccessVerifier`. `HttpAccessVerifier` proves access the way a clone does — a `git-upload-pack` `info/refs` probe against the provider (anonymous → public; with the caller's credential → authorized private), provider-agnostic, results cached for `RIPCLONE_REPO_AUTH_TTL_SECS` (default 60s). - `authorize_repo_read` gate runs at every repo-read entry point BEFORE serving or signing — get_ref, sync, git smart-http (info/refs + upload-pack), and repo status — and crucially before the ref-response cache-hit return. Public → serve anonymously; authorized private → serve (private URL TTL); denied → 403. - Visibility is now derived from the verifier, not the client header. - On by default. `RIPCLONE_TRUST_GATEWAY=1` restores single-tenant trust mode (skip the check, header visibility) for a network-isolated self-host; a startup log states which mode is active. Documented in docs/BACKENDS.md. Tests: gate maps public/authorized/denied correctly and falls back to the header in trust mode; a denied caller gets 403 through the real `refs` route (previously 200 from cache). e2e harness uses trust mode (its file:// origins can't be probed over HTTP). Refs: ADVERSARIAL_REVIEW_2026-06-25.md AU1 (+ AU3 visibility fail-open) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * chore(ci): fix lint (rustfmt) and e2e (AU1 trust-mode for file:// origin) - rustfmt across the branch: I had run `cargo test` after each fix but not `cargo fmt --check`, so several files (mostly multi-line SQL/signatures from the queue and auth changes) tripped the lint job. Formatting only. - `e2e_local.sh` clones from a `file://` origin, which the new per-repo access enforcement (AU1) can't probe over HTTP, so the real-binary e2e 403'd on clone. Set `RIPCLONE_TRUST_GATEWAY=1` there — the documented single-tenant escape hatch — matching what the in-process e2e harness already does. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds an opt-in Linux
io_uringworktree writer while keeping POSIX as the portable default for macOS and normal Linux runs.RIPCLONE_IO_URINGbehavior:0: POSIX writer1: force the Linuxio_uringwriterauto: probeio_uringand fall back to POSIX if unavailableThe original high-level
tokio-uringpath has been replaced with rawio-uringbatching. Each extractor writer thread now lazily owns a thread-local ring, so it preserves the existing extraction parallelism and avoids a cross-thread request/reply per file.The current Linux fast path uses one linked chain per regular file:
openatinto a registered direct/fixed file slotwrite, hard-linked to the openclose, hard-linked to the writeThat keeps each batch in flight without first materializing normal file descriptors. Fly's kernel rejected direct
openatwhileO_CLOEXECwas present, so the io_uring path strips that flag for direct opens. If a kernel still rejects direct descriptors withEINVAL/EOPNOTSUPP, the writer logs once and retries the batch with the normal-fd batched io_uring path.The implementation keeps the existing path validation, symlink final-component protection, executable modes, deterministic mtimes, and portable POSIX fallback behavior.
Fly validation
Validated on the consolidated Fly setup from
main:https://ripclone.fly.devripclone-client-testewr/dataoven-sh/bunsynced to88417471cb28aab8943eb6227c014ac3f1c50cbcLatest explicit
/datacomparison after direct descriptors started working:542ms, benchwrite=755ms,total=2904ms573ms, benchwrite=722ms,total=2062msThe
writebenchmark field is not a pure syscall/write timer; it covers working-tree extraction plus local blob pack work, and total time is noisier because resolve/fetch vary between runs. The direct-descriptor io_uring path is now functioning and is slightly faster in the extractor loop on this Fly volume run, but remaining cost is mixed with decompression, SHA-1 verification, mtime syscalls, and git skip-worktree cleanup.Temporary benchmark volume
vol_re1076l6w6y7mz14was destroyed after testing. One-off machines exited with--rm; the standing 8-core app machines were stopped after validation.Local validation
rustfmt --edition 2024 src/worktree_writer.rscargo checkcargo checkgit diff --checkA generic
rust:latestDocker image cannot link the focusedworktree_writertest binary without the systemlibfusepackage, so the post-direct-descriptor Linux validation usedcargo checkthere.