
perf(pm): manifest cache & resolver alloc cleanup#2826

Draft
elrrrrrrr wants to merge 4 commits into next from perf/manifest-cache

Conversation

@elrrrrrrr
Contributor

Summary

Third of 4 split PRs from #2818. Independently-motivated allocator + cache hot-path optimisations for the resolver. Each landed during the worker-pool exploration but stands alone — they do not depend on the worker-pool architecture.

Changes (each ~50ms preload savings, cumulative ~200ms)

  • TLS provider: aws-lc-rs instead of ring (~420ms saved on cold preload TLS handshakes — measured CCS→AppData 78ms→17ms)
  • DNS per-family rotation: cycle v4 and v6 independently so connection pool spreads evenly across all addresses (matches bun's pcap-observed 4×64 distribution)
  • Disk-cache bulk-readdir ETag index: lazy HashSet<String> of cached names from one read_dir, restores warm 304 path without per-package try_exists storm
  • Lazy per-version CoreVersionManifest parse via simd_json::OwnedValue + DashMap memoisation — resolver typically reads 1-3 of ~500 versions per manifest
  • Arc<FullManifest> in MemoryCache — atomic-bump clone instead of deep HashMap clone (~500k allocs eliminated)
  • normalize_spec returns Cow<'_, str> — common path now zero-alloc (~5460 allocs eliminated)
  • Drop versions.keys.clone() on cache-hit path (~360k String allocs eliminated)
  • OnceMap dedup for concurrent resolve_full_manifest callers
  • tracing file_filter info+ default — drops format/serialize CPU for ~15-30k hot-path debug events per cold preload (override via UTOO_FILE_LOG=debug)
  • indicatif progress bar: drop per-package message updates (was ~9000 lock acquisitions per ant-design preload)
  • HTTP + parse diagnostic infrastructure for #PR4 to wire in

Trait surface change

RegistryClient's default-method futures gain + Send and Self: Sync bounds. Required by the spawn use in #PR4, but equally correct for single-threaded resolvers. Adds a + Sync bound on resolve_package / resolve_registry_dep / process_dependency / preload helpers.

Test plan

Stacking

  • Base: next
  • Stacked-on-top: PR4 (perf/preload-worker-pool) targets perf/manifest-cache and adds the worker-pool spawn refactor + Send/Clone/Sync/'static bound propagation.

Context

Full exploration journey + failed-experiments catalog: #2818
Bench infrastructure: #2824

🤖 Generated with Claude Code


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request implements a series of performance optimizations for the package resolver, focusing on reducing memory allocations, improving concurrency, and optimizing network and disk I/O. Key enhancements include lazy manifest parsing using simd_json, a OnceMap utility to deduplicate concurrent fetches, and a round-robin DNS resolver for better connection distribution. Memory efficiency is improved by utilizing Arc in caches and Cow for string normalization. Review feedback points out a compilation risk from using unstable let_chains, potential data loss due to fire-and-forget background writes, and an opportunity to further optimize DNS rotation allocations.

Comment on lines +113 to +140
fn rotate_addrs(addrs: &[SocketAddr], offset: usize) -> Vec<SocketAddr> {
if addrs.is_empty() {
return Vec::new();
}
let rotate = |slice: &[SocketAddr]| -> Vec<SocketAddr> {
if slice.is_empty() {
return Vec::new();
}
let start = offset % slice.len();
slice[start..]
.iter()
.chain(&slice[..start])
.copied()
.collect()
};
let v6: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv6()).copied().collect();
let v4: Vec<SocketAddr> = addrs.iter().filter(|a| a.is_ipv4()).copied().collect();
let v6_rot = rotate(&v6);
let v4_rot = rotate(&v4);
// Preserve v6-first ordering if that's what the resolver gave us;
// Happy Eyeballs will still prefer v6 when it's reachable.
let v6_first = addrs.first().map(|a| a.is_ipv6()).unwrap_or(true);
if v6_first {
v6_rot.into_iter().chain(v4_rot).collect()
} else {
v4_rot.into_iter().chain(v6_rot).collect()
}
}

medium

The rotate_addrs function performs multiple Vec allocations (filtering, collecting, and rotating) on every DNS resolution, including cache hits. Since this is a hot path and the PR aims for allocation cleanup, this can be optimized to use fewer allocations by leveraging partition and rotate_left.

    fn rotate_addrs(addrs: &[SocketAddr], offset: usize) -> Vec<SocketAddr> {
        if addrs.is_empty() {
            return Vec::new();
        }

        let (mut v6, mut v4): (Vec<_>, Vec<_>) = addrs.iter().copied().partition(|a| a.is_ipv6());

        if !v6.is_empty() {
            let mid = offset % v6.len();
            v6.rotate_left(mid);
        }
        if !v4.is_empty() {
            let mid = offset % v4.len();
            v4.rotate_left(mid);
        }

        let v6_first = addrs.first().map(|a| a.is_ipv6()).unwrap_or(true);
        if v6_first {
            v6.into_iter().chain(v4).collect()
        } else {
            v4.into_iter().chain(v6).collect()
        }
    }

Comment on lines +203 to +207
if let Some(entry) = self.map.get(&key)
&& let Value::Done(result) = entry.value()
{
return Some(Arc::clone(result));
}

medium

The use of let chains (e.g., if let ... && let ...) is an unstable Rust feature (let_chains). Unless the project is explicitly using a nightly compiler, this will cause compilation errors on stable Rust. It is recommended to use nested if let blocks instead. This also applies to lines 235-239 and 245-249.

Suggested change
if let Some(entry) = self.map.get(&key)
&& let Value::Done(result) = entry.value()
{
return Some(Arc::clone(result));
}
if let Some(entry) = self.map.get(&key) {
if let Value::Done(result) = entry.value() {
return Some(Arc::clone(result));
}
}

/// tuning showed the previous inline `.await` + `serde_json::to_string_pretty`
/// burned ~1–3 ms per call on the hot path, stalling the main preload
/// task and causing the 24..62 active-stream dip observed on CI.
pub fn set_versions_to_disk(&self, name: &str, info: &VersionsInfo) {

medium

Using tokio::spawn for fire-and-forget disk writes in a CLI tool can lead to data loss or cache corruption if the process exits before the background tasks complete. Since there is no mechanism to await these tasks during shutdown, the disk cache might not be reliably updated. Consider tracking these tasks or providing a flush mechanism to ensure writes complete before the program terminates.

@elrrrrrrr elrrrrrrr marked this pull request as draft April 25, 2026 15:22
elrrrrrrr added a commit that referenced this pull request Apr 25, 2026
The headline architectural change of #2818. ruborist's preload
phase shifts from a single-task `FuturesUnordered` cooperative
poller to N long-lived `tokio::spawn` workers (or
`wasm_bindgen_futures::spawn_local` on wasm32 where Send isn't
satisfied). Stacks on top of #2826.

## Why

Old design: main task owned `FuturesUnordered`, polled all
preload futures cooperatively, and ran every per-future
continuation (post-await body, completion handler, dispatch
refill) on the same single task. The deeper await chain inside
`resolve_package` (cache check + `OnceMap::get_or_init` +
`RetryIf` + `request.send` + `bytes` + parse spawn_blocking)
made each future yield 5+ times, and every yield round-tripped
through main — saturating it. CI ant-design preload sustained
avg_conc=55-61 even after Mutex / allocator hot-path
eliminations, while the standalone manifest-bench (same reqwest
stack, no resolver — see #2824) hit 92 at the same cap.

## How

N long-lived `tokio::spawn` workers pulling from a shared
lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker
owns an `Arc<R>` clone and runs `resolve_package` on tokio's
global executor — futures progress fully independently, no
cooperative poll bottleneck. Main task only drains an
`mpsc::unbounded_channel` of completions to fire receiver events
+ on_manifest callback.

Termination: workers track `dispatched` / `completed:
AtomicUsize` and park on a shared `Notify` when the queue is
empty. When the last completion makes `completed == dispatched`
and the queue is empty, the finishing worker raises a `shutdown`
flag and wakes others; all workers drop their result_tx clones,
the channel closes, and the main `recv().await` loop exits.
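The dispatched/completed + Notify handshake above can be sketched with std primitives alone — `Mutex<VecDeque>` standing in for `SegQueue`, a `Condvar` for `Notify`, OS threads for the tokio workers. All names here are hypothetical stand-ins, not the real ruborist types:

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Condvar, Mutex};
use std::thread;

struct Pool {
    queue: Mutex<VecDeque<u32>>, // stand-in for SegQueue<Dep>
    parked: Condvar,             // stand-in for Notify
    dispatched: AtomicUsize,
    completed: AtomicUsize,
    shutdown: AtomicBool,
}

fn worker(pool: Arc<Pool>, tx: mpsc::Sender<u32>) {
    loop {
        let job = {
            let mut q = pool.queue.lock().unwrap();
            loop {
                if pool.shutdown.load(Ordering::Acquire) {
                    return; // dropping tx lets the main recv loop end
                }
                if let Some(job) = q.pop_front() {
                    break job;
                }
                q = pool.parked.wait(q).unwrap(); // park until work or shutdown
            }
        };
        tx.send(job * 2).unwrap(); // stand-in for resolve_package + completion
        let done = pool.completed.fetch_add(1, Ordering::AcqRel) + 1;
        if done == pool.dispatched.load(Ordering::Acquire) {
            // Last completion: do the emptiness check, shutdown store and
            // wakeup under the queue lock, so a worker that is about to
            // park cannot miss the notification.
            let q = pool.queue.lock().unwrap();
            if q.is_empty() {
                pool.shutdown.store(true, Ordering::Release);
                pool.parked.notify_all();
            }
        }
    }
}

fn run_pool(jobs: u32, workers: usize) -> u32 {
    let pool = Arc::new(Pool {
        queue: Mutex::new((0..jobs).collect()),
        parked: Condvar::new(),
        dispatched: AtomicUsize::new(jobs as usize),
        completed: AtomicUsize::new(0),
        shutdown: AtomicBool::new(jobs == 0), // nothing to do: shut down at once
    });
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let (p, t) = (Arc::clone(&pool), tx.clone());
            thread::spawn(move || worker(p, t))
        })
        .collect();
    drop(tx); // channel closes once every worker has dropped its sender
    let sum: u32 = rx.iter().sum(); // main task: drain completions
    for h in handles {
        h.join().unwrap();
    }
    sum
}

fn main() {
    println!("sum = {}", run_pool(8, 3));
}
```

Note the emptiness check and the shutdown store happen under the queue lock: done lock-free, they could race with a worker that has just seen an empty queue and is about to park, losing the wakeup.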

## Trait surface change

- `MockRegistryClient` + `MockPackage` now `derive(Clone)` so
  tests can wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call
  site in `run_preload_phase` clones the borrowed registry into
  a fresh `Arc`. Bound at every public surface up the chain
  bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync +
  'static`, `R::Error: MaybeSend`. The `MaybeSend` /
  `MaybeSync` shims (added in #2826) keep the trait surface
  wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking`
  in `service/manifest.rs`. Worker-pool surfaced parse blocking-
  pool queue saturation: `queue p95=200ms sum=70-89s` over 2730
  manifests on cap=4 CI runners. Inline parse on the worker
  thread eliminates dispatch + queue overhead; 1-5ms CPU per
  manifest is acceptable on async worker.
- **Workspace package.json parallel reads** — `find_workspaces_from_pkg`
  switched from sequential `for path in matched_paths { read }`
  loop to `FuturesUnordered` fan-out. ant-design has ~200
  workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the
  per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — worker-pool
  delivered the parallelism that justifies the cap raise. CI
  ant-design avg_conc 84 at cap=128 (up from 55 under the old
  architecture); preload wall 3.10s → 2.15s.

## Tests

`#[tokio::test(flavor = "multi_thread", worker_threads = 2)]`
since worker-pool needs a spawn-able runtime; ruborist's
dev-dependencies on `tokio` add the `rt-multi-thread` feature.

164 ruborist + 10 doctests + 248/249 utoo-pm pass (1 pre-existing
flake on `test_update_package_binary_fsevents`, runs green alone).

## Wasm CI

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local`
on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers
still run independently — single-threaded under wasm but the
queue + Notify + mpsc termination story is unchanged.
`cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr force-pushed the perf/manifest-cache branch from 3be6b63 to 2831262 Compare April 25, 2026 15:30
@elrrrrrrr elrrrrrrr changed the base branch from next to perf/bench-infra April 25, 2026 15:30
elrrrrrrr added a commit that referenced this pull request Apr 25, 2026
@elrrrrrrr elrrrrrrr changed the title perf(ruborist): manifest cache & resolver alloc cleanup perf(pm): manifest cache & resolver alloc cleanup Apr 25, 2026
Base automatically changed from perf/bench-infra to next April 27, 2026 02:14
elrrrrrrr and others added 4 commits April 27, 2026 11:30
Compares this PR's utoo against next-branch HEAD (the merged baseline)
instead of just utoo-npm (latest published, can be days/weeks behind).
The utoo-next column isolates THIS PR's perf delta from any other
unmerged-since-publish work.

Two new build jobs (build-next-{linux,mac-arm64}) checkout origin/next
and build utoo from there in parallel with the main builds. Bench
phases pick up both artifacts via the new setup-utoo-next-baseline
composite action and pass utoo-next through PM_LIST.

Build jobs gate on the same `benchmark` label / dispatch trigger as
bench-phases — they only fire when bench-phases will actually run.

Bench script (bench/pm-bench-phases.sh) gets parallel utoo-next
support: UTOO_NEXT_BIN env, UTOO_NEXT_CACHE, and case statements
mirroring the existing utoo-npm pattern across install_cmd,
resolve_cmd, write_prepare, capture_footprint, seed_for_phase.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle of independently-motivated allocator + cache hot-path
optimisations from the parent perf branch (#2818). Each landed
during the worker-pool exploration but doesn't depend on the
worker-pool architecture itself — they stand alone as
straightforward perf wins for the resolver.

## TLS provider — `aws-lc-rs` instead of `ring`

`reqwest` 0.12's default `rustls-tls-native-roots` feature pins
`ring` via Cargo's feature unification. Switch to
`rustls-tls-native-roots-no-provider`, build our own
`rustls::ClientConfig` with the `aws_lc_rs` provider, pass via
`Client::use_preconfigured_tls`. CI measurement (4-core ubuntu vs
npmjs.org): ring's per-handshake CCS→AppData was 78 ms p50 / 154
ms max, all 128 parallel handshakes serialising across 4 cores.
aws-lc-rs (BoringSSL primitives) is ~3× faster on x86_64. Saved
~420 ms preload on cold ant-design.

## DNS — per-family rotation

`getaddrinfo` typically returns 10 v6 + 12 v4 for npmjs.org. A
flat rotation across the joined list meant offsets 0..10 all
started inside the v6 range; on hosts where v6 routing fails
(GitHub Actions runners), every connection fell through to the
*same* first-reachable v4. Rotate per-family so v4 conns cycle
across all v4 addresses (and v6 over v6) — observed pcap on bun
shows the same 4×64 distribution we now produce.

## Disk-cache bulk-readdir ETag index

`PackageCache` lazy-builds a `HashSet<String>` of names with
existing disk cache entries from a single `read_dir(cache_dir)` +
per-`@scope` recurse. `get_versions_from_disk` and
`get_version_manifest_from_disk` short-circuit via the index.
Restores the warm-run 304 path that was temporarily removed in
46cb803 (per-package `try_exists` was 16 ms avg on the cold-run
critical path; now zero).
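A minimal sketch of the bulk-readdir index (file names and layout are illustrative; the real `PackageCache` builds the set lazily and keeps it for the process lifetime):

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::Path;

/// One read_dir pass (plus a one-level recurse into `@scope` dirs)
/// replaces a per-package `try_exists` call on every lookup.
fn build_cache_index(cache_dir: &Path) -> io::Result<HashSet<String>> {
    let mut index = HashSet::new();
    for entry in fs::read_dir(cache_dir)? {
        let entry = entry?;
        let name = entry.file_name().to_string_lossy().into_owned();
        if name.starts_with('@') && entry.file_type()?.is_dir() {
            for scoped in fs::read_dir(entry.path())? {
                let scoped = scoped?;
                index.insert(format!("{name}/{}", scoped.file_name().to_string_lossy()));
            }
        } else {
            index.insert(name);
        }
    }
    Ok(index)
}

fn main() {
    let dir = std::env::temp_dir().join("cache-index-demo");
    let _ = fs::remove_dir_all(&dir);
    fs::create_dir_all(dir.join("@scope")).unwrap();
    fs::write(dir.join("lodash"), b"{}").unwrap();
    fs::write(dir.join("@scope").join("pkg"), b"{}").unwrap();
    let index = build_cache_index(&dir).unwrap();
    // Membership check is now a hash lookup, no disk I/O per package.
    println!("indexed @scope/pkg: {}", index.contains("@scope/pkg"));
}
```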

## Lazy per-version `CoreVersionManifest` via `simd_json::OwnedValue`

`Versions` now stores `keys: Vec<String>` (ordered version list)
+ `trees: HashMap<String, Arc<simd_json::OwnedValue>>`
(pre-parsed JSON subtrees). Strongly-typed `CoreVersionManifest`
is materialised on demand via
`CoreVersionManifest::deserialize(tree.as_ref())` — zero-copy
through `simd_json::OwnedValue`'s `Deserializer` impl, memoised in
a `DashMap`. Resolver typically reads 1-3 of the ~500 versions
per manifest; previous design built every one eagerly.
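The lazy-materialise + memoise shape, with plain strings standing in for `simd_json::OwnedValue` subtrees, a toy parse standing in for `CoreVersionManifest::deserialize`, and a `Mutex<HashMap>` standing in for the `DashMap` (all stand-ins, not the real types):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Toy stand-in for the strongly-typed CoreVersionManifest.
#[derive(Debug)]
struct VersionManifest {
    version: String,
    dep_count: usize,
}

/// `trees` holds raw, unparsed subtrees; `memo` memoises the typed form,
/// so each version is parsed at most once, and only if actually requested.
struct Versions {
    keys: Vec<String>,
    trees: HashMap<String, String>,
    memo: Mutex<HashMap<String, Arc<VersionManifest>>>,
}

impl Versions {
    fn manifest(&self, version: &str) -> Option<Arc<VersionManifest>> {
        if let Some(m) = self.memo.lock().unwrap().get(version) {
            return Some(Arc::clone(m)); // memoised: no reparse
        }
        let raw = self.trees.get(version)?;
        // Toy parse standing in for deserializing the JSON subtree.
        let parsed = Arc::new(VersionManifest {
            version: version.to_string(),
            dep_count: raw.split(',').count(),
        });
        self.memo
            .lock()
            .unwrap()
            .insert(version.to_string(), Arc::clone(&parsed));
        Some(parsed)
    }
}

fn main() {
    let mut trees = HashMap::new();
    trees.insert("1.0.0".to_string(), "dep-a,dep-b".to_string());
    let versions = Versions {
        keys: vec!["1.0.0".to_string()],
        trees,
        memo: Mutex::new(HashMap::new()),
    };
    let m = versions.manifest("1.0.0").unwrap();
    println!("{} of {} keys parsed: {} deps", m.version, versions.keys.len(), m.dep_count);
}
```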

## `Arc<FullManifest>` in `MemoryCache`

Cache previously returned `FullManifest` by value, deep-cloning
the per-version HashMap (100-500 entries × String key clone + Arc
bump per cache hit) on the resolver hot path. ~2730 cache hits
during cold preload × ~200-entry HashMap clone =
~500k allocations on shared resolver threads, contending the
allocator. Wrap in `Arc<FullManifest>`; cache hit becomes one
atomic bump.
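The hot-path difference fits in a few lines (field names are illustrative, not the real types):

```rust
use std::collections::HashMap;
use std::sync::Arc;

struct FullManifest {
    versions: HashMap<String, String>, // e.g. version -> tarball URL
}

struct MemoryCache {
    entries: HashMap<String, Arc<FullManifest>>,
}

impl MemoryCache {
    /// A hit is one atomic refcount bump; the per-version HashMap behind
    /// the Arc is shared, never deep-cloned.
    fn get(&self, name: &str) -> Option<Arc<FullManifest>> {
        self.entries.get(name).map(Arc::clone)
    }
}

fn main() {
    let manifest = Arc::new(FullManifest {
        versions: HashMap::from([("18.2.0".to_string(), "react-18.2.0.tgz".to_string())]),
    });
    let cache = MemoryCache {
        entries: HashMap::from([("react".to_string(), manifest)]),
    };
    let a = cache.get("react").unwrap();
    let b = cache.get("react").unwrap();
    // Both hits point at the same allocation.
    println!("shared: {}", Arc::ptr_eq(&a, &b));
}
```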

## `normalize_spec` returns `Cow<'a, str>`

Was unconditionally allocating `(String, String)` even for the
~99 % of deps with no `npm:` / `workspace:` prefix. ~5460 String
allocations per ant-design preload, all on resolver hot path.
Common path now returns `Cow::Borrowed`.
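A sketch of the `Cow` signature (simplified: the real `normalize_spec` also handles `workspace:` and splits the alias into name and range):

```rust
use std::borrow::Cow;

/// Zero-alloc common path: plain specs are returned borrowed; only the
/// rare `npm:` alias form pays for an owned String.
fn normalize_spec(spec: &str) -> Cow<'_, str> {
    match spec.strip_prefix("npm:") {
        Some(aliased) => Cow::Owned(aliased.to_string()),
        None => Cow::Borrowed(spec),
    }
}

fn main() {
    // Common path: the input str is borrowed straight through.
    assert!(matches!(normalize_spec("^1.2.3"), Cow::Borrowed(_)));
    // Alias path: the only branch that allocates.
    assert_eq!(normalize_spec("npm:string-width@^4"), "string-width@^4");
    println!("ok");
}
```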

## Drop `versions.keys.clone()` from cache-hit path

`resolve_package`'s full-manifest cache-hit branch was cloning
the entire `versions.keys: Vec<String>` (~200 entries) just to
pass `&[String]` to `resolve_target_version`. Borrow directly via
Arc auto-deref. ~360k String allocs eliminated (~1800 cache hits
× ~200 entries).

## OnceMap dedup

New `crate::util::oncemap` module: `DashMap` + `tokio::sync::Notify`
coalescer for concurrent `resolve_full_manifest` callers of the
same name. First caller fetches the network; others wait on the
shared `Notify` and read the cached `Arc<V>`. Replaces the prior
per-name `tokio::sync::Mutex<()>` gate that serialised the hot
dispatch path.
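A blocking std-only analogue of the coalescer — `Mutex<HashMap>` in place of `DashMap`, `Condvar` in place of `tokio::sync::Notify`; the real `OnceMap` is async, so this is shape, not implementation:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};

enum Slot<V> {
    Pending,      // first caller is fetching
    Done(Arc<V>), // result available to everyone
}

struct OnceMap<V> {
    map: Mutex<HashMap<String, Slot<V>>>,
    cv: Condvar,
}

impl<V> OnceMap<V> {
    fn new() -> Self {
        OnceMap { map: Mutex::new(HashMap::new()), cv: Condvar::new() }
    }

    /// First caller for a key runs `init` (with the lock released);
    /// concurrent callers block on the Condvar and get the shared Arc.
    fn get_or_init(&self, key: &str, init: impl FnOnce() -> V) -> Arc<V> {
        let mut map = self.map.lock().unwrap();
        loop {
            match map.get(key) {
                Some(Slot::Done(v)) => return Arc::clone(v),
                Some(Slot::Pending) => map = self.cv.wait(map).unwrap(),
                None => {
                    map.insert(key.to_string(), Slot::Pending);
                    drop(map); // don't hold the lock across the "fetch"
                    let value = Arc::new(init());
                    let mut map = self.map.lock().unwrap();
                    map.insert(key.to_string(), Slot::Done(Arc::clone(&value)));
                    self.cv.notify_all();
                    return value;
                }
            }
        }
    }
}

fn main() {
    let map = OnceMap::new();
    let first = map.get_or_init("ant-design", || String::from("manifest"));
    let second = map.get_or_init("ant-design", || String::from("refetched"));
    println!("coalesced: {}", Arc::ptr_eq(&first, &second));
}
```

One deliberate simplification: if `init` panics, the key is stuck at Pending forever; the real async version needs the same care around a cancelled or failed first caller.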

## tracing file_filter info+ default

File-layer log filter dropped from `utoo=debug` to `utoo=info`.
Hot-path `tracing::debug!()` calls (cache hits, BFS dispatch,
preload events) emit ~5-10 events per resolved manifest. With
2730+ manifests during cold preload that's 15-30k events that —
even routed through the non_blocking appender's channel — pay
format/serialise CPU on the resolving thread before the channel
send. Override via `UTOO_FILE_LOG=debug` for diagnostics.

## indicatif progress bar — drop per-package message updates

`PreloadFetching` and `PreloadProgress` used to call
`format!("fetching/resolved {}", name)` + `PROGRESS_BAR.set_message()`
per event. With ~9000 such calls per ant-design preload and an
indicatif-internal `Mutex` per call, this serialised the main
loop's fill-and-drain rate. The user can't visually parse 5460
message swaps in 3 seconds anyway. Counter still ticks via
`PROGRESS_BAR.inc(1)`.

## HTTP + parse diagnostic infrastructure (used by PR4)

`service/http.rs` ships `start_http_trace` / `finish_http_trace`
+ `start_parse_trace` / `finish_parse_trace` plus
`record_http_interval` + `record_parse_interval` callbacks.
`#[allow(dead_code)]` on the start/finish for now — the preload
worker-pool refactor in the next PR (#TBD) wires them in.

Also bumps the `+ Sync` bound on `RegistryClient` callers in
`builder.rs` / `preload.rs` / `resolver/registry.rs` — required
because the trait's default-method futures gained `+ Send`
(needed downstream by tokio::spawn, but already correct for
single-threaded resolvers too).

Tests: 164 ruborist + 248/249 utoo-pm pass (1 pre-existing flake
on `test_update_package_binary_fsevents` when run in parallel,
passes alone).

Stacks: PR4 (preload worker-pool architecture) targets this
branch and adds the bound propagation + spawn refactor on top.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four adjustments needed for the wasm target after introducing
aws-lc-rs and Send/Sync trait bounds:

1. Move `rustls` (with `aws-lc-rs` feature) and `rustls-native-certs`
   under `[target.'cfg(not(target_arch = "wasm32"))'.dependencies]`.
   `aws-lc-sys` builds BoringSSL via `cc` and doesn't support the
   `wasm32-unknown-unknown` target (no `stdlib.h` etc). The wasm
   reqwest path uses the browser fetch API and ignores rustls.

2. Add `MaybeSend` / `MaybeSync` shim traits in `util::maybe_send`:
   on native they expand to `Send` / `Sync`; on wasm32 they are
   vacuous (impl for every type). wasm-bindgen's `JsFuture` is
   `!Send` so the trait surface had to either drop the bound on
   wasm or use a conditional shim. Replace `+ Send` and
   `Self: Sync` in the `RegistryClient` trait + caller bounds in
   `builder.rs` / `preload.rs` / `resolver/registry.rs` with
   `+ MaybeSend` / `Self: MaybeSync`.

3. cfg-gate `service/cache.rs` `tokio::spawn` for fire-and-forget
   disk writes — wasm uses `wasm_bindgen_futures::spawn_local`
   instead since the futures are `!Send`.

4. cfg-gate the `OnceMap` coalescer in `service/registry.rs` —
   wasm runs single-threaded so coalescing concurrent fetches is
   a no-op anyway; call the network path directly.
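The shim itself is small enough to sketch in full (names per the description above; the exact bounds in the real module may differ):

```rust
// Native targets: the shims are literal aliases for Send / Sync.
#[cfg(not(target_arch = "wasm32"))]
mod maybe_send {
    pub trait MaybeSend: Send {}
    impl<T: Send> MaybeSend for T {}
    pub trait MaybeSync: Sync {}
    impl<T: Sync> MaybeSync for T {}
}

// wasm32: single-threaded, so every type satisfies the shims vacuously.
#[cfg(target_arch = "wasm32")]
mod maybe_send {
    pub trait MaybeSend {}
    impl<T> MaybeSend for T {}
    pub trait MaybeSync {}
    impl<T> MaybeSync for T {}
}

use maybe_send::{MaybeSend, MaybeSync};

// A generic bound written once: Send + Sync on native, unconstrained on
// wasm32 (where wasm-bindgen's JsFuture is !Send).
fn check_bounds<T: MaybeSend + MaybeSync>(_value: &T) -> bool {
    true
}

fn main() {
    println!("bounds hold: {}", check_bounds(&vec![1, 2, 3]));
}
```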

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr elrrrrrrr force-pushed the perf/manifest-cache branch from 2831262 to 2d5befb Compare April 27, 2026 03:31
elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
@elrrrrrrr elrrrrrrr added the benchmark Run pm-bench on PR label Apr 27, 2026
@github-actions

📊 pm-bench-phases · cf7889e · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-next (next-branch baseline) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 8.88s 0.20s 10.08s 10.11s 735M 325.9K
utoo-next 9.86s 0.11s 11.50s 13.05s 1.27G 154.5K
utoo-npm 9.90s 0.24s 11.46s 13.08s 1.30G 157.2K
utoo 9.24s 0.70s 10.74s 13.06s 2.28G 268.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 15.1K 17.4K 1.17G 6M 1.85G 1.73G 1M
utoo-next 165.6K 151.6K 1.15G 4M 1.70G 1.69G 2M
utoo-npm 165.0K 143.6K 1.15G 4M 1.70G 1.69G 2M
utoo 152.5K 97.1K 1.13G 5M 1.70G 1.69G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 1.98s 0.05s 3.90s 1.05s 496M 168.3K
utoo-next 5.45s 0.41s 6.04s 1.13s 430M 75.0K
utoo-npm 5.16s 0.08s 6.00s 1.15s 431M 74.2K
utoo 5.88s 2.39s 4.53s 1.95s 1.37G 169.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 8.1K 4.6K 200M 3M 104M - 1M
utoo-next 66.3K 2.9K 204M 2M 9M 5M 2M
utoo-npm 65.4K 2.9K 201M 2M 9M 5M 2M
utoo 76.9K 7.3K 196M 3M 7M 5M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 6.86s 0.41s 6.20s 9.88s 607M 197.9K
utoo-next 6.89s 1.23s 5.51s 11.11s 754M 117.6K
utoo-npm 7.03s 1.16s 5.59s 11.16s 856M 120.5K
utoo 6.89s 0.72s 5.46s 11.31s 952M 119.8K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 4.5K 7.0K 1004M 4M 1.75G 1.75G 1M
utoo-next 104.1K 64.7K 975M 2M 1.69G 1.69G 2M
utoo-npm 106.6K 69.9K 975M 3M 1.69G 1.69G 2M
utoo 116.2K 86.7K 975M 3M 1.69G 1.69G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 3.16s 0.07s 0.22s 2.30s 137M 32.1K
utoo-next 2.48s 0.06s 0.62s 3.89s 81M 19.0K
utoo-npm 2.46s 0.02s 0.64s 3.89s 82M 19.0K
utoo 2.32s 0.05s 0.55s 3.84s 82M 19.5K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 245 30 7M 24K 1.90G 1.71G 1M
utoo-next 46.9K 19.7K 16K 27K 1.69G 1.69G 2M
utoo-npm 48.8K 21.5K 16K 15K 1.69G 1.69G 2M
utoo 49.6K 21.1K 15K 10K 1.70G 1.69G 2M

npmmirror.com

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 42.63s 8.15s 10.05s 10.61s 551M 364.2K
utoo-next 81.74s 51.23s 8.50s 14.40s 879M 118.7K
utoo-npm 24.07s 7.92s 8.33s 13.81s 882M 118.2K
utoo 142.78s 117.12s 8.00s 13.86s 792M 105.9K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 131.2K 6.1K 1.13G 16M 1.85G 1.74G 2M
utoo-next 244.9K 126.8K 991M 10M 1.69G 1.69G 2M
utoo-npm 221.1K 123.8K 990M 8M 1.69G 1.69G 2M
utoo 238.7K 87.6K 1019M 12M 1.69G 1.69G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 2.03s 0.32s 3.85s 1.17s 592M 195.9K
utoo-next 5.47s 0.28s 2.09s 0.56s 75M 15.8K
utoo-npm 3.34s 0.12s 1.96s 0.57s 75M 16.1K
utoo 9.29s 12.21s 1.28s 0.23s 78M 17.1K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 8.4K 5.2K 151M 3M 106M - 2M
utoo-next 47.5K 575 13M 2M - 4M 2M
utoo-npm 43.9K 1.1K 13M 2M - 4M 2M
utoo 24.6K 67 16M 2M - 4M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 22.69s 5.05s 6.12s 9.25s 244M 101.8K
utoo-next 44.32s 38.85s 6.42s 12.93s 644M 100.3K
utoo-npm 72.66s 20.68s 6.60s 13.60s 578M 86.9K
utoo 70.07s 19.12s 6.42s 13.33s 673M 91.4K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 75.7K 3.5K 996M 10M 1.71G 1.71G 2M
utoo-next 199.2K 100.9K 1005M 8M 1.69G 1.69G 2M
utoo-npm 227.3K 92.6K 979M 10M 1.69G 1.69G 2M
utoo 225.8K 79.6K 990M 11M 1.69G 1.69G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 3.18s 0.08s 0.20s 2.26s 135M 30.9K
utoo-next 2.41s 0.22s 0.63s 3.94s 82M 19.1K
utoo-npm 2.46s 0.04s 0.63s 3.93s 82M 19.2K
utoo 2.35s 0.06s 0.58s 3.84s 83M 19.5K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 643 26 3M 50K 1.84G 1.74G 2M
utoo-next 46.9K 20.4K 47K 38K 1.69G 1.69G 2M
utoo-npm 48.5K 22.3K 43K 13K 1.69G 1.69G 2M
utoo 50.0K 21.6K 52K 17K 1.69G 1.69G 2M

@github-actions

📊 pm-bench-phases · cf7889e · mac (macos-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-next (next-branch baseline) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 18.58s 2.53s 6.48s 18.91s 742M 47.9K
utoo-next 23.68s 6.53s 10.98s 25.76s 1.17G 111.5K
utoo-npm 16.68s 1.54s 8.33s 17.22s 1.02G 101.0K
utoo 29.78s 1.53s 13.70s 34.22s 1.98G 172.6K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 15.8K 143.4K - - 1.79G 1.91G 1M
utoo-next 13.1K 359.9K - - 1.64G 1.84G 2M
utoo-npm 12.7K 361.7K - - 1.64G 1.88G 2M
utoo 10.9K 341.1K - - 1.64G 1.84G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 2.62s 0.11s 2.99s 1.39s 476M 30.9K
utoo-next 9.47s 0.47s 7.13s 4.93s 542M 36.5K
utoo-npm 6.94s 0.62s 5.38s 3.61s 545M 37.1K
utoo 7.65s 0.90s 5.76s 4.72s 1.37G 95.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 27 23.7K - - 110M - 1M
utoo-next 20 70.8K - - 28M 5M 2M
utoo-npm 11 72.0K - - 28M 5M 2M
utoo 38 82.2K - - 27M 5M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 19.22s 2.62s 3.73s 19.20s 541M 35.2K
utoo-next 13.59s 2.83s 3.62s 14.83s 747M 75.4K
utoo-npm 14.29s 3.45s 3.51s 14.16s 810M 77.0K
utoo 15.56s 5.52s 4.36s 19.87s 754M 74.8K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 5.7K 137.1K - - 1.70G 1.94G 1M
utoo-next 1.5K 231.2K - - 1.61G 1.87G 2M
utoo-npm 1.4K 230.5K - - 1.61G 1.87G 2M
utoo 1.4K 226.2K - - 1.61G 1.87G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 4.94s 0.49s 0.10s 2.12s 53M 4.0K
utoo-next 5.78s 1.03s 0.73s 4.03s 94M 6.8K
utoo-npm 4.20s 0.58s 0.51s 2.79s 89M 6.7K
utoo 5.97s 0.52s 0.64s 4.23s 89M 6.7K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 15.8K 843 - - 1.87G 1.92G 1M
utoo-next 12.4K 74.7K - - 1.61G 1.86G 2M
utoo-npm 12.5K 75.8K - - 1.61G 1.86G 2M
utoo 12.9K 71.5K - - 1.63G 1.86G 2M

npmmirror.com

p0_full_cold

PM wall ±σ user sys RSS pgMinor
bun 29.56s 6.65s 6.63s 19.34s 615M 39.8K
utoo-next 28.27s 3.54s 7.87s 24.40s 693M 76.6K
utoo-npm 25.46s 1.06s 8.09s 25.60s 813M 77.0K
utoo 30.32s 11.86s 6.67s 20.11s 697M 75.7K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 14.9K 146.5K - - 1.77G 1.91G 2M
utoo-next 4.3K 384.7K - - 1.61G 1.84G 2M
utoo-npm 994 370.2K - - 1.61G 1.87G 2M
utoo 4.8K 376.2K - - 1.61G 1.84G 2M

p1_resolve

PM wall ±σ user sys RSS pgMinor
bun 2.50s 0.07s 3.05s 1.73s 601M 39.1K
utoo-next 11.70s 13.43s 1.56s 0.86s 79M 5.8K
utoo-npm 4.91s 0.51s 1.41s 0.81s 79M 5.7K
utoo 10.15s 14.81s 1.26s 0.43s 84M 6.1K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 20 22.2K - - 111M - 2M
utoo-next 5 46.8K - - - 4M 2M
utoo-npm 6 45.3K - - - 4M 2M
utoo 25 25.4K - - - 4M 2M

p3_cold_install

PM wall ±σ user sys RSS pgMinor
bun 20.26s 2.51s 4.03s 19.09s 296M 19.5K
utoo-next 26.42s 2.12s 4.72s 16.88s 598M 73.0K
utoo-npm 28.27s 2.14s 4.59s 16.37s 674M 74.9K
utoo 27.41s 1.57s 4.99s 18.54s 647M 74.3K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 2.1K 146.6K - - 1.65G 1.92G 2M
utoo-next 1.6K 345.5K - - 1.61G 1.83G 2M
utoo-npm 1.6K 366.2K - - 1.61G 1.83G 2M
utoo 1.6K 339.1K - - 1.61G 1.83G 2M

p4_warm_link

PM wall ±σ user sys RSS pgMinor
bun 4.56s 0.06s 0.08s 1.94s 44M 3.4K
utoo-next 3.80s 0.51s 0.52s 2.68s 95M 7.1K
utoo-npm 4.56s 0.65s 0.57s 3.01s 90M 6.8K
utoo 4.12s 0.78s 0.46s 3.01s 95M 7.1K
PM vCtx iCtx netRX netTX cache node_mod lock
bun 13.5K 634 - - 1.78G 1.91G 2M
utoo-next 12.2K 72.3K - - 1.61G 1.83G 2M
utoo-npm 12.2K 73.1K - - 1.61G 1.83G 2M
utoo 12.3K 81.2K - - 1.61G 1.83G 2M
