perf(pm): probe — #2836 minus OnceMap #2837

Closed

elrrrrrrr wants to merge 104 commits into next from perf/strip-wp-oncemap
Conversation

@elrrrrrrr (Contributor)

Summary

Surgical strip of OnceMap dedup from `UnifiedRegistry::resolve_full_manifest` on top of #2836 (which was #2834 minus worker-pool). Direct fetch on every caller, no per-name coalescing.

Driver hunt scoreboard

| variant | probe npmjs p1_resolve |
| --- | --- |
| baseline (origin/next) | 5.45s |
| #2832 mt-pool only | 4.59s ±1.66 |
| #2835 aws-lc-rs only | 6.13s ±1.00 |
| #2834 all 101 commits | 2.62s ±0.07 (-52%) |
| #2836 = #2834 − worker-pool | 4.27s ±0.05 (-22%) |
| this = #2836 − OnceMap | TBD |

If perf collapses back to ~5.4s → OnceMap is the remaining synergy partner with worker-pool.
If perf holds ~4.3s → driver is elsewhere (aws-lc-rs / DNS / cache hot paths).

🤖 Generated with Claude Code

elrrrrrrr and others added 30 commits April 27, 2026 18:02
Replace intra-package `par_iter` with a sequential loop when writing
extracted tar entries to disk. Each tar entry is typically small and
writes complete in microseconds, so splitting them into rayon tasks
was causing heavy work-stealing (futex park/unpark) and dominating
context switches on large dep graphs. Cross-package parallelism is
preserved by the outer `rayon::spawn` in `extract_tarball`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
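A minimal sketch of the change's shape, with a hypothetical `ExtractedEntry` standing in for the extractor's real entry type:

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

// Hypothetical decoded tar entry; the real type lives in the extractor.
struct ExtractedEntry {
    path: PathBuf,
    data: Vec<u8>,
}

// Before: entries.par_iter().try_for_each(..) — one rayon task per tiny
// file, so work-stealing (futex park/unpark) dominated. After: a plain
// loop inside the per-package rayon::spawn task; cross-package
// parallelism is untouched.
fn write_entries(entries: &[ExtractedEntry]) -> io::Result<()> {
    for entry in entries {
        if let Some(parent) = entry.path.parent() {
            fs::create_dir_all(parent)?;
        }
        fs::write(&entry.path, &entry.data)?;
    }
    Ok(())
}
```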
- Cold bench: drop `| tail -1` so hyperfine's full summary (mean,
  stddev, range) reaches the log. Failure detection now uses exit
  status instead of piping.
- `BENCH_WARM_RUNS=0` skips the warm phase entirely (previously the
  warm function always ran and hyperfine would reject --runs 0).
- Result aggregator tolerates empty or malformed export-json files
  (e.g. when a PM's cold install fails): the offending file is
  reported and skipped instead of crashing the whole summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the sequential `for` loop over extracted tar entries with
`par_chunks(WRITE_CHUNK_SIZE)` — each rayon task writes a contiguous
run of 32 files sequentially. This retains multi-core IO overlap for
large packages while cutting the rayon task count (and its
work-stealing futex traffic) by the chunk factor versus a per-file
par_iter. Cross-package parallelism is preserved by the outer
rayon::spawn in extract_tarball.

Local (macOS, antd-test, 3 runs avg):
  before par_iter: wall 17.2s  sys 6.18s  ivcsw 208k
  for-loop:        wall 15.3s  sys 2.36s  ivcsw  61k
  par_chunks(32):  wall 13.9s  sys 5.77s  ivcsw 191k

chunks wins wall but loses the ctx-switch reduction relative to the
pure sequential version; CI with a large dep graph (ant-design-x)
is the authoritative measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
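Sketch of the chunked variant (same hypothetical `ExtractedEntry` as above; the real constant and types live in the extractor):

```rust
use rayon::prelude::*;
use std::fs;
use std::io;
use std::path::PathBuf;

const WRITE_CHUNK_SIZE: usize = 32;

struct ExtractedEntry {
    path: PathBuf,
    data: Vec<u8>,
}

fn write_entries_chunked(entries: &[ExtractedEntry]) -> io::Result<()> {
    // One rayon task per contiguous run of 32 entries: keeps multi-core IO
    // overlap for big packages while cutting task count ~32× versus a
    // per-file par_iter.
    entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each(|chunk| {
        for entry in chunk {
            fs::write(&entry.path, &entry.data)?;
        }
        Ok(())
    })
}
```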
Accumulate wall microseconds for download, extract, and clone across
all packages during install. Print a one-line summary alongside the
existing `added / reused / downloaded` counts, e.g.

  + 513 added · 3017 reused · 123 downloaded
    download 135.8s · extract 2.3s · clone 0.4s · 19.0 MB fetched

The sums are non-exclusive across cores: dividing by wall clock
gives the effective concurrency for each phase, and the ratio
between phases shows where cold-install CPU time actually lands.
Overhead is three atomics per downloaded tarball.

Local antd-test (macOS, npmmirror, 77 packages, wall 16s): download
dominates 98% of the CPU budget, extract 1.6%, clone 0.3% — reshapes
where we should look for cold-install wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
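A sketch of the accumulation pattern, assuming free-standing atomics (the real code presumably keeps them in an install-stats struct):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

// Illustrative global accumulators, one per phase.
static DOWNLOAD_US: AtomicU64 = AtomicU64::new(0);
static EXTRACT_US: AtomicU64 = AtomicU64::new(0);
static CLONE_US: AtomicU64 = AtomicU64::new(0);

// Wrap a phase; Relaxed suffices since only the final sums are read.
fn timed<T>(counter: &AtomicU64, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    counter.fetch_add(start.elapsed().as_micros() as u64, Ordering::Relaxed);
    out
}

// The sums are non-exclusive across cores: sum ÷ wall clock ≈ effective
// concurrency of that phase.
```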
Needed so the per-phase timings line (`download · extract · clone · bytes`)
printed at the end of each install reaches the CI log. Trade-off is noisier
logs — registry INFO/WARN lines come through — but that's the price for
visibility into where cold-install CPU actually lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Separates three independent measurements for utoo vs bun so each
phase's improvement can be judged on its own baseline:

  Phase 1 · resolve       utoo deps    / bun install --lockfile-only
  Phase 3 · cold install  utoo install / bun install  (empty cache)
  Phase 4 · warm link     utoo install / bun install  (cache warm)

Phase 3 uses the lockfile generated by phase 1, with cache reset
between iterations. Phase 4 resets only node_modules so only the
cache → node_modules link step is measured.

Uses hyperfine --show-output so utoo's phase-timings line
(`download · extract · clone · bytes`) reaches the CI log alongside
the wall-clock summary.

Triggered via workflow_dispatch with configurable project / registry
/ runs. Defaults to ant-design against npmjs.org, 3 runs per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anch merge

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous inline `bash -c` prepare was silently a no-op on CI: utoo's
runs 2/3 showed '3280 reused', meaning the cache wasn't actually cleared,
and bun hit InvalidNPMLockfile because utoo's package-lock.json leaked
across iterations.

Now each phase writes a dedicated prepare shell script per-PM that:
- always drops node_modules (incl. workspace package trees),
- clears exactly the lockfiles that would confuse this PM,
- wipes the right cache for this phase,
- prints a '[prep]' line so the CI log proves prepare ran.

Also factored out seed_for_phase so lockfile / cache warmup happens once
before the benchmark, not leaking into the measurement.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che wipe

Path-based rm -rf of $HOME/.cache/nm wasn't actually emptying the cache
on the CI runner — utoo runs 2/3 of phase 3 still showed '3280 reused',
wall was 0.8-1.1s instead of the 10s cold-install baseline, and hyperfine
itself warned about caches not being filled until after run 1.

Let each PM clean its own cache via its CLI so we don't rely on
guessing where it stores things.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`utoo clean` / `bun pm cache rm` didn't empty the cache on the CI
runner either — so now use explicit bench-local paths the rm -rf
prepare can guarantee to wipe:

  utoo: --cache-dir=/tmp/utoo-bench-cache on every invocation
  bun:  BUN_INSTALL_CACHE_DIR=/tmp/bun-bench-cache (env var)

Gets us deterministic cold/warm state between hyperfine iterations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop into diagnostic mode to figure out why hyperfine's --prepare
still leaves utoo's cache intact across iterations despite the
explicit --cache-dir. Prints the generated prepare script, and logs
each per-iteration invocation's before/after du -sh of both caches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `case $phase in` arms (`p1)`, `p3)`, `p4)`) never matched
against actual phase strings like "p1_resolve" / "p3_cold_install" /
"p4_warm_link". Result: write_prepare produced a script containing
only the common header and no phase-specific cache-wipe logic, so
every run after the first hit a warm cache and timings collapsed.

Same off-by-name bug in seed_for_phase: "p3:utoo" pattern never
matched "p3_cold_install:utoo", skipping lockfile seeding and
warm-cache priming. Switched both to "p*_*" globs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-size before/after logs + generated-script dumps were
diagnostic scaffolding used to trace the p* vs p*_resolve pattern
mismatch. With that fixed, keep the plain hyperfine --prepare
invocation so CI logs are readable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…time

Each hyperfine iteration now runs inside a metrics wrapper that greps
/usr/bin/time -v output for RSS, voluntary/involuntary context switches,
page faults, and IO read/write counts. Per-PM per-phase averages across
the 3 runs are shown alongside the wall-clock table so we can see, e.g.,
whether utoo's resolve phase costs more syscalls than bun's, or whether
its warm-link advantage comes at a memory cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Expand the metrics wrapper to collect everything that's cheap on Linux:

- user / sys CPU seconds (from /usr/bin/time -v, lets us see CPU share)
- RSS, voluntary + involuntary ctx, major + minor page faults
- network RX / TX bytes (system-wide /proc/net/dev delta, excludes lo)
- disk page-in / page-out bytes (/proc/vmstat pgpg{in,out} × 4K pages)

Summary prints two tables per phase:
  A. wall / ±σ / user / sys / RSS / minor faults
  B. vCtx / iCtx / net RX / net TX / disk R / disk W

This makes resolve-phase vs link-phase comparison legible: e.g. network
cost should dominate download phases while disk writes dominate link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous run attributed 525MB of writes to utoo's resolve phase when
local check showed utoo only wrote ~28MB to its cache. The overshoot
came from /proc/vmstat pgpgout being system-wide — it picked up ext4
journal, page-cache writeback, and other kernel activity unrelated to
the benchmarked process.

Switch to du-before/after on the paths that matter (cache dir, project
node_modules, lockfiles) for a per-PM figure that reflects what the
install actually produced. Summary now shows Δcache / Δnode_mod / Δlock
per phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Measuring disk footprint via du before+after each iteration added
2-3s of traversal to every run (wall jumped from 2.3s → 4.9s on the
warm-link phase). Both snapshots happened inside hyperfine's timed
region because the wrapper runs as the benchmark command.

Hot path keeps only /usr/bin/time + /proc/net/dev snapshots now. After
hyperfine exits, capture_footprint does one du pass per phase/PM to
record the final on-disk size of the cache, node_modules, and
lockfile. Summary prints absolute sizes instead of per-iteration
deltas — single sample is enough to compare what each PM produced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parseKey matched both `_${phase}_${pm}.json` (hyperfine export) and
`_${phase}_${pm}_footprint.json` (our new du snapshot), so the loop
tried to read .results[0] off the footprint and crashed the whole
summary. Add footprint suffix to the exclusion filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
npm registries compress manifest responses ~13× (antd abbreviated goes
from 4.2MB to 309KB with gzip), but ruborist's reqwest client had
neither compression feature enabled — so it never advertised
`Accept-Encoding: gzip,br` and the server delivered raw JSON.

Adding `gzip` + `brotli` to the feature list cuts the cold
`utoo deps` manifest traffic on ant-design from ~275 MB of JSON
over the wire to ~21 MB. Wall improvement is modest on high-latency
links (connection setup dominates) but the bandwidth reduction is
real and the CPU cost of decompression is negligible next to simd_json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
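Roughly what the opt-in looks like on the reqwest side, assuming the `gzip`/`brotli` cargo features are enabled — not the exact ruborist builder:

```rust
// Cargo.toml (sketch): reqwest = { version = "...", features = ["gzip", "brotli"] }
fn build_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // With the features compiled in, reqwest advertises
        // `Accept-Encoding: gzip, br` and decompresses bodies transparently.
        .gzip(true)
        .brotli(true)
        .build()
}
```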
reqwest's HTTP/2 client multiplexes every manifest fetch over a SINGLE
TCP connection to each registry host. Bun opens ~10 parallel HTTP/2
connections and gets proportional extra bandwidth; we can't reproduce
that through reqwest without custom pooling.

Falling back to HTTP/1.1 with pool_max_idle_per_host(64) lets the pool
open independent connections (one request per connection, 64 parallel).
Local cold `utoo deps` on ant-design against registry.antgroup-inc.cn:

  HTTP/2 single connection: 4.9s avg
  HTTP/1.1 + pool of 64:    4.0s avg  (-18%)
  bun (reference):          3.2s

Full parity with bun still wants multi-connection HTTP/2 (bun's
strategy), which reqwest doesn't expose without a custom client pool —
future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
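Sketch of the builder change, using the methods reqwest exposes for this (`http1_only`, `pool_max_idle_per_host`); the rest of the client configuration is omitted:

```rust
fn build_h1_pool_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        // Force HTTP/1.1: one request per connection, so concurrent manifest
        // fetches spread across independent TCP streams instead of
        // multiplexing over a single HTTP/2 connection.
        .http1_only()
        // Let the keep-alive pool retain up to 64 idle connections per host.
        .pool_max_idle_per_host(64)
        .build()
}
```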
Temporary diagnostic. Tracks send_us / body_us / bytes per
fetch_full_manifest call and prints p50/p90/p99/max every 500 samples
so the final output reflects the tail distribution of the full run.

Remove before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest multiplexes all requests over a single HTTP/2 connection by
default, which causes head-of-line blocking on npm registries with
high RTT: a slow tail response stalls the whole manifest fetch phase.

An HTTP/1.1 pool lets concurrent manifest requests open independent
TCP streams, so a single slow response no longer blocks the rest.
Locally on ant-design with npmjs, this cut cold deps-resolve from
~121s (H2 single) to ~21s (H1 pool) — 5.75× faster. On low-latency
registries (antgroup) the two are neutral, so there is no downside.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a per-name single-flight gate to UnifiedRegistry::resolve_full_manifest.
Concurrent callers for the same package name now serialize on a per-name
mutex; the first caller hits the network and populates the memory cache,
the rest re-check the cache after the gate and return the cached manifest.

On ant-design cold deps this eliminates ~100+ duplicate full-manifest
fetches observed when many deps point at the same transitive package.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
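A condensed sketch of the per-name gate; the `SingleFlight` name and map shape are illustrative, not the actual registry fields:

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::Mutex;

// One async Mutex per package name. The first caller through the gate
// fetches and fills the memory cache; later callers re-check the cache
// after the gate and return the hit instead of refetching.
#[derive(Default)]
struct SingleFlight {
    gates: Mutex<HashMap<String, Arc<Mutex<()>>>>,
}

impl SingleFlight {
    async fn gate(&self, name: &str) -> Arc<Mutex<()>> {
        let mut gates = self.gates.lock().await;
        gates.entry(name.to_string()).or_default().clone()
    }
}

// Caller shape:
//   let gate = flights.gate(name).await;
//   let _guard = gate.lock().await;
//   if let Some(hit) = cache.get_full_manifest(name) { return Ok(hit); }
//   ... network fetch, then cache.set_full_manifest(...) ...
```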
Reverts the temporary record_sample() and per-request timing diagnostics
added in 14f2777 / 50a7014. The distribution data was used to identify
HTTP/2 head-of-line blocking; now that H1 + pool and dedup are in, the
diagnostic prints are no longer needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs the complete cold install (utoo install / bun install) with
everything wiped — lockfile, all caches, node_modules. Matches the
end-to-end "freshly cloned repo" user scenario and is directly
comparable to pm-bench.yml's cold install number.

Reported alongside the existing p1_resolve / p3_cold_install / p4_warm_link
phases; does not replace any of them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reqwest pins every new connection to the first resolved IP even when
DNS returns multiple A records. On registries backed by a CDN with
many IPs (antgroup returns 8, npm/Cloudflare returns 2-4) this means
all concurrent pool connections land on one IP, which caps effective
parallelism regardless of `pool_max_idle_per_host`.

Rotate the returned address list by an atomic counter on every
`resolve` call so reqwest's connect loop picks a different IP per
new connection. Connections end up uniformly distributed across all
A records returned by DNS.

Measured on ant-design / antgroup registry (cold deps, local):
- utoo-h1 (single IP): 5.38s HTTP phase, 120 conn on 1 IP
- utoo-h1 + DNS rotation: 3.95s HTTP phase, 8 IPs × 8 conn each
- bun baseline: 3.72s HTTP phase, 4 IPs × 64 conn each

Total deps-resolve wall time now matches bun (~3.3s vs 3.3s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
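The core of the rotation, reduced to the address-list handling (the real code wraps this in reqwest's resolver plumbing; `RotatingList` is a made-up name):

```rust
use std::net::SocketAddr;
use std::sync::atomic::{AtomicUsize, Ordering};

// Each resolve() rotates the list by an incrementing counter, so the
// connect loop — which tries addresses in order — starts each new
// connection on a different A record.
struct RotatingList {
    counter: AtomicUsize,
}

impl RotatingList {
    fn rotate(&self, mut addrs: Vec<SocketAddr>) -> Vec<SocketAddr> {
        if addrs.len() > 1 {
            let shift = self.counter.fetch_add(1, Ordering::Relaxed) % addrs.len();
            addrs.rotate_left(shift);
        }
        addrs
    }
}
```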
Local antgroup runs show DNS rotation cuts utoo's resolve HTTP phase
from 5.38s to 3.95s (matching bun). On CI against npmjs however the
resolve wall time is flat — possibly because:
  - npmjs from GH Actions returns fewer A records (Cloudflare Anycast)
  - low RTT already masks HOL tail

Capture a single cold resolve run per PM under tcpdump so we can see
the actual connection topology on CI and compare against the local
antgroup evidence. Output uploaded as pm-bench-pcap artifact.

Runs once after the main phased bench; reuses the already-cloned
project directory and wipes lockfiles + caches itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pcap comparison against bun on both local (antgroup) and CI (npmjs)
consistently shows bun opens ~256 parallel TCP connections during
a cold install (4 IPs × 64 conn each), while utoo was capped at
64 — ~1/4 the effective parallelism even after the DNS round-robin
fix, because reqwest treats all addresses of a host as a single pool
rather than per-IP like bun.

Raise the default concurrent manifest fetch count from 64 to 256 to
match bun's observed network footprint. The CLI flag
`--manifests-concurrency-limit` still overrides it. Pool idle cap
bumped to 256 so the keep-alive pool can park every in-flight
connection without churning.

Risk: with DNS returning few A records the 256 connections may
concentrate on one IP and trigger per-IP rate limits. Pushing to
CI to measure before committing to this as the default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
elrrrrrrr and others added 23 commits April 27, 2026 18:03
Standalone manifest-bench cap=128 hits avg_conc=95 with the same
reqwest stack; ruborist stalls at avg_conc=56. Per-completion
indicatif Mutex contention is the remaining gap source after
dropping log_progress(format!()) (commit f455a0b) and reverting
the over-aggressive dedup-by-name.

Each PreloadQueued / PreloadProgress event calls
PROGRESS_BAR.inc[_length](1), each grabbing indicatif's internal
ProgressBar Mutex. With 4571 dispatches + 4571 completions the
main task pays ~9000 lock acquisitions during a 3-4 s phase, all
contending with the steady_tick draw thread (100 ms). That cap on
main loop throughput is what holds avg_conc at 56 vs the
standalone reqwest-only sweep's 95.

Drop the per-event bar updates entirely during preload. Phase
spinner still animates via steady_tick so the user sees activity;
PreloadComplete prints the final ok/fail summary. The numeric
during-preload counter is gone but the phase is short (3-4 s) and
the user sees the finished totals.

Expected: ruborist p1_resolve preload wall drops toward standalone
manifest-bench's 2.4 s, closing most of the remaining gap to bun.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Standalone manifest-bench cap=128 hits avg_conc=95 with the same
reqwest stack; ruborist stuck at avg_conc=56 even after dropping
indicatif Mutex calls (commit 2b89d0b). Same-CI-run comparison
under matched Cloudflare conditions: standalone wall=2.06s vs
ruborist wall=3.09s — 15-conc gap that isn't HTTP, isn't parse, and
isn't progress-bar lock contention.

Hypothesis: `MemoryCache::get_full_manifest` returned `FullManifest`
by value, deep-cloning the per-version `HashMap<String,
Arc<simd_json::OwnedValue>>` (100-500 entries, key Strings + Arc
bumps per entry) on every cache hit. Each `resolve_package` call
issues this read at line 226 of registry.rs as its first sync step,
running on the main task that owns `FuturesUnordered` — so the
deep clone serialises directly with the fill-and-drain loop and
caps in-flight count.

Change cache storage to `Arc<FullManifest>`:
- `MemoryCache.full_manifests: RwLock<HashMap<String, Arc<FullManifest>>>`
- `get_full_manifest -> Option<Arc<FullManifest>>` (atomic-bump clone)
- `set_full_manifest(name, Arc<FullManifest>)` (avoid wrapping at boundary)
- `FullManifestResult::Full(Arc<FullManifest>)` so OnceMap dedup also
  hands shared `Arc`s to coalesced waiters instead of cloning the
  whole struct per caller

`UnifiedRegistry::resolve_full_manifest` constructs the `Arc` once
on the network path (line 281, 318) and passes the same handle to
both `cache.set` and `Ok(FullManifestResult::Full)`. Trait method
`get_cached_full_manifest` keeps its `Option<FullManifest>`
signature (one external caller is `ut view`, off the hot path) and
deep-clones on demand from the `Arc`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
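Sketch of the cache-storage change with a stand-in `FullManifest`; error handling and the wasm cfg are omitted:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Stand-in for the real FullManifest (per-version metadata map).
struct FullManifest;

// Cache hits now hand out an Arc clone (one atomic refcount bump) instead
// of deep-cloning the whole per-version HashMap on every read.
struct MemoryCache {
    full_manifests: RwLock<HashMap<String, Arc<FullManifest>>>,
}

impl MemoryCache {
    fn get_full_manifest(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.full_manifests.read().unwrap().get(name).cloned()
    }

    fn set_full_manifest(&self, name: String, manifest: Arc<FullManifest>) {
        self.full_manifests.write().unwrap().insert(name, manifest);
    }
}
```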
Final hypothesis after Arc<FullManifest> didn't lift the avg_conc=56
ceiling: ruborist hot paths emit ~5-10 `tracing::debug!()` per
resolved manifest (cache hits, preload events, BFS dispatch). With
2730+ manifests during cold preload that's 15-30k events. Even
through tracing_appender's non_blocking channel, each event pays
format/serialise CPU on the resolving thread before the channel
send. The standalone manifest-bench has zero tracing calls and
hits avg_conc=92 at cap=128 with the same reqwest stack.

Drop file-layer default from `utoo=debug` to `utoo=info`. The hot
debug events stop firing entirely (no format, no channel send).

Override path preserved: `UTOO_FILE_LOG=debug` (or any
RUST_LOG-style spec) re-enables verbose file capture when actually
diagnosing. Console filter behaviour unchanged.

Expected: avg_conc lifts from 56 toward standalone's 92, p1_resolve
preload wall drops toward standalone's 2.0-2.4 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`resolve_package`'s full-manifest cache-hit branch (registry.rs:541)
was cloning the entire `versions.keys: Vec<String>` (100-500 entries
per package) just to pass `&[String]` to `resolve_target_version`.

Cold ant-design preload hits this branch ~1800 times (every dep
beyond the first unique-(name) pop falls through here once preload
has populated the full manifest). 1800 × ~200 entries = ≈360k
String allocations on the resolver worker pool — global allocator
contention that doesn't show up in our HTTP/parse diag because it
runs on resumed-future threads, not the main task.

Borrow `&full_manifest.versions.keys` directly; `Arc<FullManifest>`
auto-derefs and the slice coercion satisfies the API. Zero alloc.

Diagnostic context: standalone manifest-bench cap=128 hits
avg_conc=92 with the same reqwest stack; ruborist held at 55-57
even after Mutex/clone hot-path eliminations elsewhere. Allocator
pressure on resolver threads is a remaining structural source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
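Shape of the fix, with illustrative struct layouts and a placeholder version-selection function:

```rust
use std::sync::Arc;

struct Versions { keys: Vec<String> }
struct FullManifest { versions: Versions }

// Placeholder for the real semver-aware selection.
fn resolve_target_version(keys: &[String]) -> Option<&String> {
    keys.last()
}

fn cache_hit_path(full_manifest: &Arc<FullManifest>) {
    // Before: full_manifest.versions.keys.clone() — ~200 String
    // allocations per hit, ~1800 hits per cold ant-design preload.
    // After: borrow through the Arc; auto-deref plus slice coercion
    // satisfies the &[String] parameter with zero allocations.
    let _v = resolve_target_version(&full_manifest.versions.keys);
}
```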
`normalize_spec` unconditionally allocated `(String, String)` —
including the ~99% case where the spec has no `npm:` or
`workspace:` prefix and no normalisation is needed. ~5460 String
allocs per ant-design preload (2 per `resolve_package` call ×
2730 unique deps), all on resolver futures driven by main task's
cooperative polling.

Switch return type to `(Cow<'a, str>, Cow<'a, str>)`. Common path
returns `Cow::Borrowed` and pays zero allocations. `npm:` /
`workspace:` prefix paths still build the substring borrow without
allocating (they're already slices into the input). Callers (3
sites: traits/registry.rs, service/registry.rs, resolver/registry.rs)
work unchanged thanks to Cow's `Deref<Target=str>`.

Diagnostic context: standalone manifest-bench cap=128 reaches
avg_conc=92 with the same reqwest stack; ruborist held at 55-58
even after Mutex / FullManifest / progress-bar / tracing /
keys.clone() eliminations. Allocator pressure on the resolver
worker pool — each per-future hot-path String alloc compounds
across 2700+ futures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
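A sketch of the Cow-returning signature; the `npm:` parsing shown is simplified relative to the real normaliser:

```rust
use std::borrow::Cow;

fn normalize_spec<'a>(name: &'a str, spec: &'a str) -> (Cow<'a, str>, Cow<'a, str>) {
    if let Some(rest) = spec.strip_prefix("npm:") {
        // "npm:@scope/pkg@^1.0.0" → alias target + inner spec; both are
        // still borrowed slices into the input, no allocation.
        if let Some(at) = rest.rfind('@').filter(|&i| i > 0) {
            return (Cow::Borrowed(&rest[..at]), Cow::Borrowed(&rest[at + 1..]));
        }
    }
    // ~99% case: nothing to normalise, zero allocations.
    (Cow::Borrowed(name), Cow::Borrowed(spec))
}
```

Callers are unchanged because `Cow<str>` derefs to `&str`.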
Old design: main task owned `FuturesUnordered`, polled all preload
futures cooperatively, and ran every per-future continuation
(post-await body, completion handler, dispatch refill) on the same
single task. The deeper await chain inside `resolve_package`
(cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` +
`bytes` + parse `spawn_blocking`) made each future yield 5+ times,
and every yield round-tripped through main — saturating it. CI
ant-design preload sustained avg_conc=55-61 even after Mutex /
allocator hot-path eliminations, while the standalone manifest-bench
(same reqwest stack, no resolver) hit 92 at the same cap.

New design: N long-lived `tokio::spawn` workers pulling from a
shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker
owns an `Arc<R>` clone and runs `resolve_package` on tokio's global
executor — futures progress fully independently, no cooperative
poll bottleneck. Main task only drains an `mpsc::unbounded_channel`
of completions to fire receiver events + on_manifest callback.

Termination: workers track `dispatched`/`completed: AtomicUsize` and
park on a shared `Notify` when the queue is empty. When the last
completion makes `completed == dispatched` and the queue is empty,
the finishing worker raises a `shutdown` flag and wakes others; all
workers drop their result_tx clones, the channel closes, and the
main `recv().await` loop exits.

Trait surface change:
- `RegistryClient`'s default-method futures gained `+ Send` bounds
  (and `Self: Sync` where blanket-default fn calls into `&self`)
- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so tests
  can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call site
  in `run_preload_phase` clones the borrowed registry into a fresh
  `Arc`. Bound at every public surface up the chain bumped to
  `R: RegistryClient + Clone + Send + Sync + 'static`,
  `R::Error: Send`.
- `resolve_package` / `resolve_registry_dep` / `process_dependency`
  helper bounds gained `+ Sync` (their `R::Future: Send` bounds are
  inherited from the trait change above).

Local npmmirror smoke (cap=256 via DEFAULT_CONCURRENCY): avg_conc
jumped from ~55 (old) to 86.8 (new). Worker-pool delivers the
parallelism standalone manifest-bench was already showing.

Tests use `#[tokio::test(flavor = "multi_thread", worker_threads = 2)]`
since worker-pool needs spawn-able runtime; ruborist's
dev-dependencies on `tokio` add the `rt-multi-thread` feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
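A heavily condensed sketch of the pool's shape. The real implementation parks idle workers on a `tokio::sync::Notify` and threads the registry client, retries, and receiver events through; here idle workers just sleep-and-recheck, and resolution itself is elided:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::Duration;
use crossbeam_queue::SegQueue;
use dashmap::DashSet;
use tokio::sync::mpsc;

type Dep = String; // illustrative; the real Dep carries name + spec + parent

async fn preload(seed: Vec<Dep>, n_workers: usize) {
    let queue = Arc::new(SegQueue::new());
    let seen = Arc::new(DashSet::new());
    let dispatched = Arc::new(AtomicUsize::new(0));
    let completed = Arc::new(AtomicUsize::new(0));
    let (tx, mut rx) = mpsc::unbounded_channel::<Dep>();

    for dep in seed {
        if seen.insert(dep.clone()) {
            dispatched.fetch_add(1, Ordering::SeqCst);
            queue.push(dep);
        }
    }

    for _ in 0..n_workers {
        let (queue, tx) = (queue.clone(), tx.clone());
        let (dispatched, completed) = (dispatched.clone(), completed.clone());
        tokio::spawn(async move {
            loop {
                match queue.pop() {
                    Some(dep) => {
                        // resolve_package(&dep).await runs here; newly
                        // discovered deps bump `dispatched` and get pushed.
                        completed.fetch_add(1, Ordering::SeqCst);
                        let _ = tx.send(dep); // completion event for main task
                    }
                    None if completed.load(Ordering::SeqCst)
                        == dispatched.load(Ordering::SeqCst) =>
                    {
                        return; // queue drained and nothing in flight
                    }
                    None => tokio::time::sleep(Duration::from_millis(1)).await,
                }
            }
        });
    }
    drop(tx); // workers hold the remaining senders

    // Main task only drains completions (events / on_manifest callback);
    // the channel closes once every worker has returned.
    while rx.recv().await.is_some() {}
}
```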
Worker-pool preload (ruborist ed7b551) sustains avg_conc=66 at
cap=96 on CI vs the prior FuturesUnordered's 58 — and same-run
standalone manifest-bench reached 93/2.14s at cap=128 with the
identical reqwest stack. With workers running independently on
tokio's global executor (no cooperative-poll serialisation through
one task), more cap slots translate directly to more parallel
TCP requests in flight.

The Cloudflare per-req throttle curve we measured under the old
architecture (per-req wall doubled at cap 128→256) was conflated
with the FuturesUnordered ceiling. With workers decoupled the
curve needs re-measurement; cap=128 is the cheapest experiment
that brings ruborist to standalone parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool sweep on CI ant-design p1_resolve:
  cap=96:  wall=2.23s avg_conc=66 per-req=53ms
  cap=128: wall=2.15s avg_conc=84 per-req=66ms
  → per-req drops with cap (refutes the FuturesUnordered-era
    "server throttle past 70 conc" reading; that was main-task
    saturation). Same-run standalone manifest-bench cap=192 hit
    130 conc / 2.10s, so cap=160 should bring another 0.1-0.2s
    out of preload before the curve flattens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Worker-pool preload at cap=160 surfaced parse blocking-pool queue
saturation: parse diag showed `queue p95=200ms sum=70-89s` over
2730 manifests — ~26ms average queue wait per parse. That accounted
for the entire ruborist-vs-standalone per-req gap (55ms vs 28ms
under identical Cloudflare conditions).

Cause: blocking pool is sized to `worker_threads` (= num_cpus = 4 on
CI). Worker-pool preload sustains 80+ concurrent fetches; each
spawn_blocking parse goes into a 4-slot queue and waits behind
others. Original spawn_blocking offload was justified under
FuturesUnordered + main-task polling (would have stalled the single
poll loop), but worker-pool runs each future on tokio's global
executor — a brief 1-5ms sync CPU burst on a worker is cheaper than
spawn_blocking dispatch + queue wait.

Inline simd_json parse on the resolving worker. Each worker thread
parses its own response immediately after `bytes().await`; no extra
hop. Worker-pool's independent task scheduling means one stalled
worker doesn't starve the others — we just lose ~5ms of one
worker's cycle, which is far less than the dispatch-and-queue
round-trip we were paying.

Both fetch sites updated (`fetch_full_manifest` for npmjs full
manifest path, `fetch_version_manifest` for semver registries like
npmmirror).

Expected: ruborist preload per-req drops from 55-66ms → ~30-40ms
(matching standalone), wall toward ~1.7s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
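Sketch of the inline-parse shape at a fetch site, with names abridged and retry/error plumbing removed:

```rust
async fn fetch_full_manifest(
    client: &reqwest::Client,
    url: &str,
) -> Result<simd_json::OwnedValue, Box<dyn std::error::Error + Send + Sync>> {
    let body = client.get(url).send().await?.bytes().await?;
    // Before: tokio::task::spawn_blocking(move || simd_json::to_owned_value(..))
    // queued behind a blocking pool sized to worker_threads (4 on CI).
    // After: parse inline on the worker — a brief sync burst on an
    // independently scheduled task is cheaper than dispatch + queue wait.
    let mut bytes = body.to_vec(); // simd_json parses in place, needs &mut
    Ok(simd_json::to_owned_value(&mut bytes)?)
}
```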
cap=160 + inline parse pushed avg_conc to 119 — past the
per-source Cloudflare throttle threshold. Per-req inflated
55 ms → 93 ms; net wall flat at 2.14s.

cap=128 + inline parse: avg_conc target ~85-95 (matching standalone
manifest-bench cap=128 = 70-90 / 1.6-2.0s under similar Cloudflare
conditions). Inline parse alone (no spawn_blocking queue) plus
sane concurrency should land preload at ~1.7-1.8s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`find_workspaces_from_pkg` was reading every workspace's package.json
sequentially in a `for path in matched_paths { read_package_json(...).await }`
loop. Ant-design has ~200 workspace packages; at ~1 ms per single-file
async FS round-trip on CI runners that's ~150-200 ms of serial I/O —
the largest unmeasured chunk between preload completion and lockfile
write (hyperfine total p1 minus instrumented sub-phases).

Collect workspace paths from every glob pattern first, then dispatch
all `read_package_json` calls into a `FuturesUnordered` for parallel
execution. Each read is small (typical workspace package.json < 4 KB)
so completion order is irrelevant — just push results as they land.

Expected: ant-design p1_resolve hyperfine wall drops by 100-150 ms
(toward ~2.40s vs current 2.58s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
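Sketch of the fan-out, with a hypothetical `read_package_json` that just returns raw bytes:

```rust
use futures::stream::{FuturesUnordered, StreamExt};
use std::path::PathBuf;

// Hypothetical reader; the real one parses into a manifest type.
async fn read_package_json(dir: PathBuf) -> std::io::Result<(PathBuf, Vec<u8>)> {
    let bytes = tokio::fs::read(dir.join("package.json")).await?;
    Ok((dir, bytes))
}

// Collect every matched workspace path first, then let all reads race.
async fn read_workspaces(
    paths: Vec<PathBuf>,
) -> std::io::Result<Vec<(PathBuf, Vec<u8>)>> {
    let mut futs: FuturesUnordered<_> =
        paths.into_iter().map(read_package_json).collect();
    let mut out = Vec::new();
    while let Some(res) = futs.next().await {
        out.push(res?); // completion order is irrelevant for small files
    }
    Ok(out)
}
```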
p1_resolve hyperfine still has ~80 ms of unmeasured wall after
parallel workspace reads (commit bf14995). Suspected: 2-3 MB
package-lock.json serialize + atomic-write-rename. Add per-step
timing log so we know which knob to turn (compact-json,
to_writer streaming, async fs::rename quirks, etc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add timer covering find_root_path → read root package.json → engines
inject → graph init → root edges → workspace discovery → workspace
nodes/edges. This is the chunk between hyperfine start and
build_deps entry — currently uninstrumented and the residual ~85ms
gap source after lockfile timing showed save is only 11ms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Linter-applied formatting cleanup, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Original cap was sized for the FuturesUnordered preload that
dispatched 128 simd_json parses through `spawn_blocking` in a
burst — letting the default 512 cap run gave bimodal wall (M2:
2.7s fast / 6.9s thrash). Capping at `worker_threads` eliminated
the thrash peak.

After commit f3f616d (inline parse) preload no longer uses the
blocking pool. The dominant consumer is now `cloner.rs` during
the install phase: every file's hardlink / clonefile / copy goes
through `spawn_blocking`, ~50000 short syscalls per ant-design
install. Each syscall is near-instant, so the cap rarely
backpressures, but cap=4 on CI does limit how fast cloner can
fire syscalls in parallel.

Raise cap to `max(worker_threads * 4, 32)`: enough headroom for
cloner to keep multiple syscalls in flight, low enough that the
historical thrash regime (hundreds of churning threads) stays
avoided. Pool is per-runtime; idle threads die after 10s.

Expected: small p3_cold_install improvement (current utoo 5.74s
vs bun 7.71s); preload phase unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
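Roughly what the cap looks like on a hand-built runtime (the actual entry point may configure this elsewhere):

```rust
fn build_runtime() -> std::io::Result<tokio::runtime::Runtime> {
    let workers = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4);
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(workers)
        // Headroom for cloner's ~50k short spawn_blocking syscalls, while
        // staying far below the 512 default that caused thread thrash.
        .max_blocking_threads((workers * 4).max(32))
        .enable_all()
        .build()
}
```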
A/B test: replace `entries.par_chunks(WRITE_CHUNK_SIZE).try_for_each`
with a plain sequential `for entry in &entries` loop. Each tarball
still runs in its own outer `rayon::spawn` task (cross-package
parallelism preserved); only the within-tarball write fan-out is
removed.

Goal: measure whether rayon's intra-package parallelism still earns
its keep after the worker-pool preload rewrite. Cross-package
parallelism alone may already saturate IO; if so, removing the
inner par_chunks cuts work-stealing futex traffic + thread sync
overhead with zero throughput cost.

If p3_cold_install regresses ≥0.3s → intra-package writes are
genuinely IO-bound across cores, restore par_chunks.
If p3 unchanged or improves → simpler sequential code wins.

This is a test commit. Will be reverted if regression measured.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`clone_dir` (Linux hardlink/copy path) was using
`tokio::task::spawn_blocking` per package — at default cap=4 on CI,
only 4 packages cloned at once, each running all file hardlinks
sequentially internally. ~3500 packages × N files per install all
funneled through that bounded pool.

Switch to the same pattern `extractor.rs` already uses:
- `rayon::spawn` per package replaces `spawn_blocking` (cross-package
  parallelism via rayon work-stealing — global pool, not capped at
  worker_threads)
- `par_chunks(CLONE_CHUNK_SIZE)` for the inner hardlink/copy loop
  (intra-package fan-out across cores; same chunk size = 32 as
  extractor)

Trade-offs:
- EXDEV `force_copy` latch is now per-chunk instead of global per
  clone — chunks each rediscover cross-device errors and fall back
  locally. A few extra hardlink-then-copy round-trips at chunk
  boundaries, acceptable for the rare cross-device install.
- Pool unification: tokio blocking pool now mostly idle (just git +
  http tarball + a few one-shot commands), rayon handles all the
  high-volume IO. Cuts the 3-pool fragmentation observed earlier.

Tested:
- Iter 1 of this loop (cap bump from N to max(N*4, 32)): no p3 win,
  p4 regressed → cap raise alone wasn't the answer.
- Iter 2 (drop intra-package par_chunks in extractor): p3 +3.67s,
  σ exploded 0.04 → 2.85s → intra-package fan-out is essential.
- This commit applies the same fan-out to clone_dir for the same
  reason.

macOS `clonefile` path (target_os = "macos") unchanged — clonefile
is a single syscall per file, different perf profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
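Condensed sketch of the Linux path; the real code latches EXDEV per chunk rather than retrying per file, and file enumeration is elided:

```rust
use rayon::prelude::*;
use std::path::PathBuf;

const CLONE_CHUNK_SIZE: usize = 32;

fn clone_package(files: Vec<(PathBuf, PathBuf)>) {
    // Cross-package parallelism: one rayon task per package.
    rayon::spawn(move || {
        // Intra-package fan-out: contiguous runs of 32 hardlinks per task.
        files.par_chunks(CLONE_CHUNK_SIZE).for_each(|chunk| {
            for (src, dst) in chunk {
                if std::fs::hard_link(src, dst).is_err() {
                    let _ = std::fs::copy(src, dst); // cross-device fallback
                }
            }
        });
    });
}
```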
- delete crates/manifest-bench (debug-only, never merged)
- tombi format crates/ruborist/Cargo.toml
- typos: unparseable → unparsable in bench/pm-bench.sh
elrrrrrrr added the `benchmark` (Run pm-bench on PR) and `bench-phases` (Run pm-bench-phases workflow) labels on Apr 27, 2026
@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements a wide range of performance optimizations to the package manager and the ruborist library, targeting bottlenecks in manifest parsing, dependency resolution, and network throughput. Notable improvements include a lazy manifest parsing strategy with memoization, the use of lock-free queues, and a custom DNS resolver with round-robin rotation. The HTTP stack was also updated to use aws-lc-rs for TLS and forced HTTP/1.1 to improve concurrency. Review feedback identifies a regression caused by the removal of request deduping in the registry client, leading to redundant fetches and dead code. Concerns were also raised regarding the reliability of fire-and-forget disk cache writes in a CLI tool, and a more efficient implementation for DNS address partitioning was suggested.

Comment on lines +231 to +232
// PROBE: OnceMap dedup stripped — direct fetch on every caller.
self.fetch_full_manifest_network(name).await

high

The removal of OnceMap deduping in resolve_full_manifest is a significant regression in network efficiency. In a typical dependency graph, many packages share common dependencies. Without coalescing, concurrent requests for the same package manifest (especially during the preload phase) will result in redundant network fetches, increasing load on the registry and potentially triggering rate limits. If this is intended as a performance 'probe', it should be reverted before merging to production.

/// lock-free, avoiding the serialisation the previous per-name
/// `tokio::sync::Mutex<()>` gate imposed on the hot dispatch path.
#[cfg(not(target_arch = "wasm32"))]
inflight: Arc<OnceMap<String, FullManifestResult>>,

medium

The inflight field in UnifiedRegistry appears to be dead code now that OnceMap deduping has been stripped from resolve_full_manifest. It is still being initialized in the builder and cloned in the Clone implementation, but it is never used in the hot path. This should be removed to maintain code clarity and avoid unnecessary overhead.

} else {
    tracing::debug!("Wrote versions to disk cache: {name}");
}
tokio::spawn(async move {

medium

Using tokio::spawn for fire-and-forget disk cache writes in a CLI tool is risky. Since the main process does not wait for these tasks to complete, it is highly likely that the process will exit before the cache is fully written to disk, leading to an unreliable or corrupted disk cache. Consider collecting the JoinHandles and awaiting them before the program terminates, or using a dedicated background worker that can be gracefully shut down.

Comment on lines +128 to +129
let v6: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv6()).copied().collect();
let v4: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv4()).copied().collect();

medium

The address family separation can be performed more efficiently in a single pass using partition instead of two separate filter calls.

Suggested change:
- let v6: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv6()).copied().collect();
- let v4: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv4()).copied().collect();
+ let (v6, v4): (Vec<_>, Vec<_>) = addrs.iter().copied().partition(|a| a.is_ipv6());

@github-actions

📊 pm-bench-phases · 027ded9 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 11.18s | 3.04s | 9.96s | 9.67s | 599M | 302.7K |
| utoo-npm | 12.53s | 1.49s | 11.59s | 13.73s | 1.19G | 165.8K |
| utoo | 11.29s | 1.04s | 11.28s | 12.70s | 2.24G | 243.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 17.6K | 17.2K | 1.16G | 6M | 1.83G | 1.72G | 1M |
| utoo-npm | 209.5K | 170.8K | 1.14G | 6M | 1.68G | 1.68G | 2M |
| utoo | 159.4K | 81.5K | 1.19G | 7M | 1.68G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.60s | 0.12s | 3.67s | 1.08s | 495M | 181.6K |
| utoo-npm | 7.23s | 2.27s | 5.12s | 1.78s | 426M | 75.8K |
| utoo | 5.95s | 1.46s | 4.63s | 2.11s | 1.37G | 161.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 13.3K | 2.9K | 201M | 3M | 104M | - | 1M |
| utoo-npm | 69.1K | 2.2K | 204M | 2M | 9M | 5M | 2M |
| utoo | 88.4K | 5.4K | 224M | 3M | 7M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 7.27s | 0.79s | 6.11s | 9.55s | 576M | 205.5K |
| utoo-npm | 9.28s | 3.11s | 5.60s | 11.42s | 992M | 120.7K |
| utoo | 10.11s | 3.40s | 5.59s | 10.75s | 880M | 105.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 6.9K | 7.1K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 142.0K | 88.1K | 965M | 4M | 1.67G | 1.67G | 2M |
| utoo | 130.9K | 52.4K | 966M | 4M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.39s | 0.09s | 0.19s | 2.44s | 137M | 32.4K |
| utoo-npm | 2.64s | 0.12s | 0.59s | 3.86s | 81M | 18.7K |
| utoo | 2.07s | 0.07s | 0.40s | 3.37s | 61M | 13.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 258 | 60 | 7M | 59K | 1.88G | 1.72G | 1M |
| utoo-npm | 47.8K | 21.8K | 16K | 9K | 1.67G | 1.67G | 2M |
| utoo | 16.6K | 9.1K | 16K | 9K | 1.68G | 1.67G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 24.68s | 2.74s | 9.06s | 9.50s | 540M | 385.7K |
| utoo-npm | 21.72s | 4.72s | 8.03s | 13.37s | 785M | 109.2K |
| utoo | 20.54s | 7.78s | 7.23s | 11.53s | 711M | 114.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 57.7K | 4.9K | 1.12G | 10M | 1.83G | 1.72G | 2M |
| utoo-npm | 246.7K | 104.4K | 1.01G | 8M | 1.67G | 1.68G | 2M |
| utoo | 163.3K | 63.9K | 983M | 9M | 1.67G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.27s | 2.89s | 3.92s | 1.10s | 594M | 191.5K |
| utoo-npm | 5.51s | 1.10s | 1.54s | 0.78s | 75M | 16.0K |
| utoo | 0.98s | 0.11s | 0.88s | 0.33s | 81M | 17.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 5.2K | 5.7K | 152M | 3M | 106M | - | 2M |
| utoo-npm | 48.2K | 553 | 13M | 2M | - | 4M | 2M |
| utoo | 17.0K | 315 | 16M | 3M | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 18.37s | 1.81s | 5.96s | 8.93s | 251M | 103.9K |
| utoo-npm | 24.43s | 2.70s | 6.22s | 12.23s | 606M | 88.0K |
| utoo | 19.44s | 2.11s | 5.88s | 10.88s | 662M | 98.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 35.9K | 3.9K | 998M | 7M | 1.73G | 1.73G | 2M |
| utoo-npm | 197.6K | 107.7K | 966M | 6M | 1.67G | 1.67G | 2M |
| utoo | 134.6K | 60.1K | 968M | 6M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.35s | 0.11s | 0.20s | 2.39s | 136M | 31.5K |
| utoo-npm | 2.36s | 0.00s | 0.60s | 3.85s | 82M | 18.7K |
| utoo | 2.09s | 0.10s | 0.42s | 3.36s | 61M | 13.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 406 | 27 | 7M | 44K | 1.88G | 1.72G | 2M |
| utoo-npm | 48.5K | 20.3K | 41K | 12K | 1.67G | 1.67G | 2M |
| utoo | 16.3K | 8.9K | 42K | 14K | 1.67G | 1.67G | 2M |

@github-actions

📊 pm-bench-phases · 027ded9 · mac (macos-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 13.95s | 0.39s | 5.14s | 13.35s | 759M | 49.0K |
| utoo-npm | 14.99s | 1.15s | 7.64s | 15.16s | 1.01G | 103.7K |
| utoo | 16.11s | 0.35s | 7.76s | 14.89s | 1.99G | 178.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 15.9K | 140.0K | - | - | 1.76G | 1.91G | 1M |
| utoo-npm | 11.8K | 411.8K | - | - | 1.63G | 1.87G | 2M |
| utoo | 10.3K | 235.7K | - | - | 1.63G | 1.85G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.48s | 0.19s | 2.21s | 0.90s | 471M | 30.6K |
| utoo-npm | 7.47s | 2.72s | 4.00s | 1.94s | 551M | 37.2K |
| utoo | 4.88s | 0.59s | 3.71s | 1.91s | 1.63G | 107.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 8 | 23.5K | - | - | 110M | - | 1M |
| utoo-npm | 14 | 80.6K | - | - | 28M | 5M | 2M |
| utoo | 38 | 83.5K | - | - | 27M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 15.62s | 4.66s | 3.09s | 14.01s | 485M | 31.3K |
| utoo-npm | 13.44s | 4.02s | 3.20s | 12.59s | 780M | 75.7K |
| utoo | 10.09s | 0.47s | 3.08s | 12.98s | 724M | 76.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 6.5K | 131.1K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.6K | 253.4K | - | - | 1.61G | 1.87G | 2M |
| utoo | 1.4K | 153.9K | - | - | 1.61G | 1.87G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 4.20s | 0.66s | 0.09s | 1.95s | 51M | 3.9K |
| utoo-npm | 3.47s | 0.48s | 0.49s | 2.54s | 90M | 6.6K |
| utoo | 3.21s | 0.47s | 0.34s | 2.23s | 87M | 6.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 15.4K | 933 | - | - | 1.86G | 1.90G | 1M |
| utoo-npm | 12.8K | 71.7K | - | - | 1.61G | 1.82G | 2M |
| utoo | 13.7K | 19.2K | - | - | 1.63G | 1.82G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 34.34s | 6.82s | 6.97s | 21.51s | 567M | 36.7K |
| utoo-npm | 42.12s | 20.63s | 6.58s | 18.09s | 643M | 75.4K |
| utoo | 23.40s | 4.66s | 4.77s | 13.22s | 859M | 80.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.9K | 166.0K | - | - | 1.82G | 1.94G | 2M |
| utoo-npm | 4.2K | 448.3K | - | - | 1.61G | 1.87G | 2M |
| utoo | 3.3K | 305.1K | - | - | 1.61G | 1.84G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.59s | 0.13s | 2.43s | 1.15s | 595M | 38.7K |
| utoo-npm | 10.04s | 11.19s | 1.41s | 0.71s | 75M | 5.5K |
| utoo | 5.67s | 8.08s | 0.98s | 0.30s | 86M | 6.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 19 | 22.9K | - | - | 111M | - | 2M |
| utoo-npm | 5 | 43.7K | - | - | - | 4M | 2M |
| utoo | 24 | 20.2K | - | - | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 22.19s | 2.25s | 3.45s | 14.92s | 261M | 17.3K |
| utoo-npm | 37.15s | 2.32s | 5.76s | 21.21s | 731M | 78.6K |
| utoo | 35.24s | 5.55s | 4.50s | 15.77s | 697M | 80.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 1.9K | 152.1K | - | - | 1.65G | 1.92G | 2M |
| utoo-npm | 1.6K | 342.6K | - | - | 1.60G | 1.88G | 2M |
| utoo | 1.6K | 270.9K | - | - | 1.60G | 1.88G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.54s | 0.56s | 0.07s | 1.73s | 44M | 3.4K |
| utoo-npm | 3.79s | 0.92s | 0.49s | 2.54s | 94M | 6.9K |
| utoo | 3.73s | 0.94s | 0.31s | 2.11s | 92M | 6.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 12.9K | 718 | - | - | 1.78G | 1.91G | 2M |
| utoo-npm | 12.5K | 72.1K | - | - | 1.61G | 1.83G | 2M |
| utoo | 12.7K | 19.5K | - | - | 1.61G | 1.83G | 2M |
