
perf(pm): preload multi-thread worker pool (tokio::spawn) #2832

Closed · elrrrrrrr wants to merge 2 commits into `next` from `perf/pm-mt-worker-pool`

Conversation

@elrrrrrrr (Contributor)

Summary

  • Preload phase: `FuturesUnordered` → N `tokio::spawn` workers on lock-free `SegQueue` + `DashSet` dedup
  • Multi-thread parse parallelism (simd_json across cores) is the actual perf driver — confirmed by the closed #2830 ("perf(pm): preload worker pool replaces FuturesUnordered", spawn_local-only), which showed no measurable gain
  • New `MaybeSend`/`MaybeSync` shim keeps wasm32 building (`spawn_local` fallback)

Iteration history

| PR | Status | Mechanism | Bench result |
| --- | --- | --- | --- |
| #2827 | closed | tokio::spawn + bundled changes (cap, inline parse, workspace, cache) | -30% claimed; silent linux crash |
| #2830 | closed | `spawn_local` only (single thread) | no-op on p1_resolve |
| this PR | open | `tokio::spawn` + minimal `MaybeSend` shim, no other changes | TBD |

Why this should work where #2830 didn't

Single-thread `spawn_local` keeps all parsing serial on one OS thread, the same as `FuturesUnordered`. Multi-thread `tokio::spawn` lets parse work run on whichever tokio worker thread is free while other workers drive their network IO. Local verification: `utoo deps` on ant-design ran across 11 worker threads (`Caching` log events spread across TIDs 02 through 12+, ~1.3K events each).
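As a toy illustration (not the PR's code): CPU-bound work inside `tokio::spawn` tasks spreads across the runtime's worker threads, whereas the same futures polled from a single task run back-to-back.

```rust
use std::time::Instant;

#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let t0 = Instant::now();
    // Four CPU-bound tasks, standing in for simd_json manifest parses.
    let handles: Vec<_> = (0..4u64)
        .map(|seed| {
            tokio::spawn(async move {
                let mut acc = seed;
                for x in 0..200_000_000u64 {
                    acc = acc.wrapping_mul(6364136223846793005).wrapping_add(x);
                }
                acc
            })
        })
        .collect();
    for h in handles {
        h.await.unwrap();
    }
    // On 4 worker threads this finishes in roughly the time of one loop;
    // driven serially from a single task (the FuturesUnordered /
    // spawn_local shape), it takes roughly 4x as long.
    println!("elapsed: {:?}", t0.elapsed());
}
```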

Risk: PR #2827 silent crash

PR #2827 silently exited with code 1 on the linux Case 2 ant-design e2e (root cause never diagnosed). This PR is a clean rebuild from `origin/next` with only the worker-pool change plus the `MaybeSend` shim. macOS local verification passes (preload 108.9s, 4571 manifests, exit 0). CI `utoopm-e2e-ubuntu-latest-x86_64-unknown-linux-gnu` is the smoke test that matters.

Test plan

  • `cargo fmt -p utoo-ruborist -p utoo-pm`
  • `cargo clippy -p utoo-ruborist -p utoo-pm --all-targets -- -D warnings --no-deps`
  • `cargo test -p utoo-ruborist` — 160 unit + 10 doctests pass
  • `cargo build --release -p utoo-pm`
  • ant-design `utoo deps` macOS: exit 0, 4571 success / 0 failed, multi-thread distribution confirmed
  • CI `pm-e2e-*` (4 platforms; linux is the silent-crash smoke)
  • CI `utooweb-ci-build-wasm`
  • CI `pm-bench-phases-*` head-to-head vs `utoo-next`

🤖 Generated with Claude Code

N long-lived `tokio::spawn` workers pull work from a shared
lock-free `SegQueue`. Each worker is an independent task on
tokio's multi-thread runtime, so while one worker parses
manifest JSON (CPU-bound, simd_json), other workers keep
driving their network IO and other parses run on different
cores. This replaces the prior `FuturesUnordered` design,
where the main task owned all preload futures and polled
them cooperatively — every per-future await continuation
(including parse) ran serialised in that single task.
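A minimal sketch of that shape; names like `PreloadTask` and
`run_pool`, and the simplified dispatched/done termination
counter, are assumptions rather than the PR's exact code:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering};

use crossbeam_queue::SegQueue;
use dashmap::DashSet;
use tokio::sync::Notify;

/// Hypothetical unit of preload work: a (name, version-range) pair.
type PreloadTask = (String, String);

async fn run_pool(n_workers: usize, seeds: Vec<PreloadTask>) {
    let pending: Arc<SegQueue<PreloadTask>> = Arc::new(SegQueue::new());
    let processed: Arc<DashSet<String>> = Arc::new(DashSet::new());
    let notify = Arc::new(Notify::new());
    // Termination: the pool drains once done == dispatched and the
    // queue is empty; every enqueue bumps `dispatched` first.
    let dispatched = Arc::new(AtomicUsize::new(0));
    let done = Arc::new(AtomicUsize::new(0));

    for task in seeds {
        if processed.insert(format!("{}@{}", task.0, task.1)) {
            dispatched.fetch_add(1, Ordering::Release);
            pending.push(task);
        }
    }

    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let (pending, processed) = (pending.clone(), processed.clone());
            let (notify, dispatched, done) =
                (notify.clone(), dispatched.clone(), done.clone());
            tokio::spawn(async move {
                loop {
                    if let Some(task) = pending.pop() {
                        // A real worker fetches the manifest, parses it
                        // with simd_json, dedups transitive deps via
                        // `processed`, bumps `dispatched`, pushes them,
                        // then wakes parked peers.
                        let _ = (&task, &processed);
                        done.fetch_add(1, Ordering::Release);
                        notify.notify_waiters();
                    } else if done.load(Ordering::Acquire)
                        == dispatched.load(Ordering::Acquire)
                    {
                        notify.notify_waiters(); // release parked peers
                        break;
                    } else {
                        // Park until a peer pushes work or finishes.
                        // (A production pool also needs a lost-wakeup
                        // guard, e.g. enabling the Notified future
                        // before the empty-queue re-check.)
                        notify.notified().await;
                    }
                }
            })
        })
        .collect();

    for w in workers {
        let _ = w.await;
    }
}
```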

The previous spawn_local-only PR (#2830, closed) showed no
measurable gain on p1_resolve, confirming that the
architectural fix alone — N independent task identities —
isn't the perf driver. The actual driver is multi-thread
parse parallelism across cores; that requires `tokio::spawn`,
which requires `Send` bounds.

## Trait surface

A new `crate::maybe_send::{MaybeSend, MaybeSync}` shim
expands to `Send`/`Sync` on native and to no-op marker
traits on `wasm32`. The `RegistryClient` trait's
`impl Future<...>` return types gain `+ MaybeSend`, and the
default impls add `where Self: MaybeSync, Self::Error:
MaybeSend`. Callers use a new `PreloadRegistry` trait alias
that bundles `RegistryClient + Clone + MaybeSend + MaybeSync
+ 'static`, applied at the 5 resolver entry points
(`build_deps*`, `resolve*`, `run_preload_phase`).
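One plausible shape for the shim and the alias (the real
module may differ; `RegistryClient` is stubbed here to keep
the sketch self-contained):

```rust
// crate::maybe_send — real Send/Sync bounds on native targets,
// no-op marker traits on wasm32 (where JsFuture-holding futures
// are !Send).

#[cfg(not(target_arch = "wasm32"))]
mod imp {
    pub trait MaybeSend: Send {}
    impl<T: Send + ?Sized> MaybeSend for T {}

    pub trait MaybeSync: Sync {}
    impl<T: Sync + ?Sized> MaybeSync for T {}
}

#[cfg(target_arch = "wasm32")]
mod imp {
    pub trait MaybeSend {}
    impl<T: ?Sized> MaybeSend for T {}

    pub trait MaybeSync {}
    impl<T: ?Sized> MaybeSync for T {}
}

pub use imp::{MaybeSend, MaybeSync};

// Stand-in for the crate's real RegistryClient trait.
pub trait RegistryClient {}

// "Trait alias" via a blanket impl: any qualifying client is a
// PreloadRegistry, so each resolver entry point needs one bound.
pub trait PreloadRegistry:
    RegistryClient + Clone + MaybeSend + MaybeSync + 'static
{
}
impl<T> PreloadRegistry for T where
    T: RegistryClient + Clone + MaybeSend + MaybeSync + 'static
{
}
```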

## Wasm

`tokio::spawn` is cfg-gated to
`wasm_bindgen_futures::spawn_local` on wasm32. Workers run
independently on the JS event loop — single-thread there,
since `JsFuture` is `!Send` — but the queue + Notify + mpsc
termination story is identical.
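The cfg gate might reduce to a helper along these lines
(`spawn_worker` is a hypothetical name):

```rust
use std::future::Future;

/// Native: spawn onto tokio's multi-thread runtime (requires Send).
#[cfg(not(target_arch = "wasm32"))]
fn spawn_worker<F>(fut: F)
where
    F: Future<Output = ()> + Send + 'static,
{
    tokio::spawn(fut);
}

/// wasm32: spawn onto the JS event loop. No Send bound, since
/// JsFuture is !Send and everything runs on one thread there anyway.
#[cfg(target_arch = "wasm32")]
fn spawn_worker<F>(fut: F)
where
    F: Future<Output = ()> + 'static,
{
    wasm_bindgen_futures::spawn_local(fut);
}
```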

## Replaces #2827, closed #2830

- #2827 (silent linux e2e crash, root cause undiagnosed) —
  this PR re-derives the architectural change with the same
  shim approach but on a clean origin/next baseline,
  skipping unrelated companion changes (cap raise / inline
  parse / workspace parallel reads / manifest cache stack).
- #2830 (spawn_local minimal) — bench showed no measurable
  change, confirming multi-thread parse is the real driver.

## Local verification

- `cargo fmt -p utoo-ruborist -p utoo-pm`
- `cargo clippy -p utoo-ruborist -p utoo-pm --all-targets -- -D warnings --no-deps`
- `cargo test -p utoo-ruborist` — 160 unit + 10 doctests pass
- `cargo build --release -p utoo-pm`
- `utoo deps` on ant-design (Case 2 silent-crash test from
  #2827) passes — preload 108.9s wall on local flaky network,
  4571 success / 0 failed, 11 worker threads scheduled.

CI will run `pm-e2e-*` (4 platforms, especially linux which
crashed on #2827) plus `pm-bench-phases-*` for the real
head-to-head against `utoo-next`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr added the `benchmark` (Run pm-bench on PR) label on Apr 27, 2026

@gemini-code-assist (Bot) left a comment

Code Review

This pull request refactors the manifest preloading logic to use a multi-threaded worker pool based on a lock-free queue (SegQueue) and a concurrent set (DashSet), replacing the previous FuturesUnordered implementation. It also introduces a maybe_send shim to abstract Send and Sync requirements for compatibility between native multi-threaded runtimes and WASM. Feedback identifies a critical race condition in the worker termination logic where the dispatch counter is updated after items are queued, as well as an inefficient snapshotting of the processed set that should be replaced with a simple length check.

Comment on lines +187 to +196

```rust
for tdep in extract_transitive_deps(&resolved.manifest, &config_for_worker)
{
    let key = format!("{}@{}", tdep.0, tdep.1);
    if processed.insert(key) {
        pending.push(tdep);
        count += 1;
    }
}
if count > 0 {
    dispatched.fetch_add(count, Ordering::Release);
```
Severity: high

There is a potential race condition in the termination logic. The dispatched counter is incremented after items are pushed to the pending queue. If a worker thread finishes its current task and checks the termination condition (done == dispatched.load(Ordering::Acquire)) before the producer thread has incremented dispatched for the new transitive dependencies, the worker pool might shut down prematurely while there is still work in the queue. To fix this, always increment the dispatched counter before pushing the item to the queue.

Suggested change:

```rust
for tdep in extract_transitive_deps(&resolved.manifest, &config_for_worker)
{
    let key = format!("{}@{}", tdep.0, tdep.1);
    if processed.insert(key) {
        dispatched.fetch_add(1, Ordering::Release);
        pending.push(tdep);
        count += 1;
    }
}
if count > 0 {
    // Wake parked workers so they pick up the new
    // work before checking the termination condition.
    notify.notify_waiters();
}
```

Comment on lines +301 to +307
```rust
stats.total_processed = {
    let mut set: HashSet<String> = HashSet::with_capacity(processed.len());
    for entry in processed.iter() {
        set.insert(entry.key().clone());
    }
    set.len()
};
```
Severity: medium

Calculating total_processed by cloning the entire DashSet into a new HashSet is highly inefficient, especially for large dependency trees. Since the workers have finished at this point, processed.len() provides an accurate count without the overhead of allocations and string cloning.

Suggested change:

```rust
stats.total_processed = processed.len();
```

@github-actions

📊 pm-bench-phases · 906381d · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.43s | 0.28s | 10.04s | 10.17s | 669M | 322.8K |
| utoo-npm | 10.34s | 0.21s | 11.54s | 13.34s | 1.17G | 154.9K |
| utoo | 10.75s | 1.49s | 12.58s | 13.03s | 1.42G | 145.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.1K | 18.6K | 1.16G | 7M | 1.83G | 1.72G | 1M |
| utoo-npm | 176.0K | 165.0K | 1.14G | 5M | 1.68G | 1.68G | 2M |
| utoo | 156.7K | 122.4K | 1.15G | 4M | 1.68G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.21s | 0.04s | 3.88s | 1.06s | 480M | 174.7K |
| utoo-npm | 5.31s | 0.05s | 5.91s | 1.04s | 433M | 76.9K |
| utoo | 4.59s | 1.66s | 6.86s | 1.61s | 551M | 67.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 10.2K | 3.9K | 200M | 3M | 104M | - | 1M |
| utoo-npm | 65.8K | 2.9K | 204M | 2M | 9M | 5M | 2M |
| utoo | 66.7K | 14.5K | 211M | 2M | 9M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 7.51s | 1.52s | 6.12s | 9.98s | 598M | 203.7K |
| utoo-npm | 8.19s | 1.54s | 5.47s | 11.51s | 920M | 111.1K |
| utoo | 6.17s | 0.24s | 5.37s | 10.89s | 817M | 111.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 7.2K | 7.8K | 993M | 4M | 1.73G | 1.73G | 1M |
| utoo-npm | 124.6K | 81.2K | 965M | 3M | 1.67G | 1.67G | 2M |
| utoo | 101.0K | 68.1K | 964M | 2M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.34s | 0.12s | 0.20s | 2.33s | 136M | 32.4K |
| utoo-npm | 2.65s | 0.41s | 0.59s | 3.85s | 82M | 19.0K |
| utoo | 2.39s | 0.04s | 0.58s | 3.84s | 81M | 19.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 521 | 24 | 7M | 31K | 1.88G | 1.72G | 1M |
| utoo-npm | 49.2K | 21.8K | 19K | 46K | 1.67G | 1.67G | 2M |
| utoo | 46.6K | 19.4K | 14K | 3K | 1.68G | 1.67G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 24.21s | 4.20s | 9.25s | 9.81s | 545M | 379.5K |
| utoo-npm | 25.56s | 4.24s | 8.36s | 13.78s | 771M | 113.8K |
| utoo | 26.26s | 4.52s | 8.13s | 13.86s | 711M | 103.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 69.3K | 6.0K | 1.12G | 11M | 1.84G | 1.72G | 2M |
| utoo-npm | 238.9K | 112.8K | 992M | 8M | 1.67G | 1.67G | 2M |
| utoo | 233.8K | 96.3K | 988M | 8M | 1.67G | 1.68G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.57s | 0.20s | 3.94s | 1.07s | 643M | 189.5K |
| utoo-npm | 3.32s | 0.12s | 1.97s | 0.61s | 76M | 15.7K |
| utoo | 3.37s | 0.10s | 1.94s | 0.70s | 76M | 16.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 4.9K | 5.9K | 151M | 2M | 106M | - | 2M |
| utoo-npm | 43.8K | 1.2K | 14M | 2M | - | 4M | 2M |
| utoo | 40.0K | 2.6K | 14M | 2M | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 22.26s | 13.17s | 5.94s | 9.43s | 240M | 92.9K |
| utoo-npm | 22.22s | 2.01s | 6.12s | 12.60s | 566M | 103.4K |
| utoo | 21.98s | 1.53s | 6.13s | 12.59s | 577M | 98.0K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 46.8K | 4.9K | 999M | 8M | 1.73G | 1.73G | 2M |
| utoo-npm | 182.7K | 98.4K | 984M | 6M | 1.67G | 1.67G | 2M |
| utoo | 186.7K | 98.1K | 978M | 6M | 1.67G | 1.67G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.08s | 0.20s | 0.21s | 2.30s | 135M | 32.1K |
| utoo-npm | 2.16s | 0.09s | 0.59s | 3.79s | 82M | 19.1K |
| utoo | 2.27s | 0.11s | 0.58s | 3.84s | 82M | 18.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 700 | 31 | 7M | 51K | 1.88G | 1.72G | 2M |
| utoo-npm | 48.2K | 21.3K | 56K | 36K | 1.67G | 1.67G | 2M |
| utoo | 49.2K | 20.3K | 55K | 13K | 1.67G | 1.67G | 2M |

@github-actions

📊 pm-bench-phases · 906381d · mac (macos-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 21.95s | 0.35s | 6.90s | 22.76s | 777M | 50.2K |
| utoo-npm | 23.47s | 0.59s | 11.17s | 27.31s | 1.01G | 102.2K |
| utoo | 23.53s | 2.57s | 11.91s | 28.32s | 1.18G | 106.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 16.3K | 137.0K | - | - | 1.82G | 1.94G | 1M |
| utoo-npm | 13.4K | 367.8K | - | - | 1.63G | 1.83G | 2M |
| utoo | 8.9K | 345.5K | - | - | 1.63G | 1.87G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.55s | 0.09s | 3.18s | 1.56s | 495M | 32.1K |
| utoo-npm | 6.77s | 0.06s | 5.40s | 4.02s | 540M | 36.7K |
| utoo | 4.40s | 0.12s | 5.57s | 3.15s | 601M | 40.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 37 | 18.9K | - | - | 111M | - | 1M |
| utoo-npm | 58 | 72.8K | - | - | 28M | 5M | 2M |
| utoo | 14 | 71.9K | - | - | 28M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 22.59s | 3.85s | 4.09s | 23.47s | 538M | 35.0K |
| utoo-npm | 20.51s | 2.26s | 5.09s | 24.19s | 857M | 79.5K |
| utoo | 17.87s | 0.74s | 4.92s | 23.92s | 741M | 72.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 6.0K | 135.0K | - | - | 1.70G | 1.94G | 1M |
| utoo-npm | 1.9K | 242.9K | - | - | 1.61G | 1.83G | 2M |
| utoo | 1.3K | 231.5K | - | - | 1.61G | 1.83G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 8.95s | 0.39s | 0.17s | 3.51s | 51M | 3.9K |
| utoo-npm | 6.81s | 0.40s | 0.88s | 4.95s | 87M | 6.5K |
| utoo | 6.96s | 0.50s | 0.79s | 5.06s | 92M | 6.9K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 17.6K | 1.2K | - | - | 1.86G | 1.91G | 1M |
| utoo-npm | 12.6K | 81.1K | - | - | 1.61G | 1.83G | 2M |
| utoo | 13.4K | 81.5K | - | - | 1.63G | 1.83G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 127.93s | 115.64s | 7.35s | 19.55s | 558M | 36.1K |
| utoo-npm | 39.29s | 8.02s | 6.62s | 18.01s | 780M | 77.1K |
| utoo | 44.33s | 7.65s | 8.31s | 23.56s | 761M | 77.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.8K | 166.1K | - | - | 1.79G | 1.90G | 2M |
| utoo-npm | 4.3K | 460.3K | - | - | 1.61G | 1.84G | 2M |
| utoo | 4.1K | 438.6K | - | - | 1.61G | 1.87G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 5.90s | 5.16s | 2.52s | 1.39s | 545M | 35.5K |
| utoo-npm | 5.44s | 0.26s | 1.35s | 0.72s | 78M | 5.7K |
| utoo | 6.12s | 1.47s | 1.68s | 0.92s | 78M | 5.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 28 | 27.2K | - | - | 111M | - | 2M |
| utoo-npm | 5 | 44.0K | - | - | - | 4M | 2M |
| utoo | 17 | 37.1K | - | - | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 23.94s | 1.81s | 4.16s | 19.44s | 309M | 20.3K |
| utoo-npm | 58.43s | 30.68s | 6.90s | 24.48s | 751M | 76.9K |
| utoo | 90.61s | 39.32s | 7.01s | 23.66s | 631M | 74.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 1.9K | 162.5K | - | - | 1.65G | 1.92G | 2M |
| utoo-npm | 1.7K | 368.6K | - | - | 1.61G | 1.87G | 2M |
| utoo | 1.7K | 404.1K | - | - | 1.61G | 1.87G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 4.05s | 0.89s | 0.08s | 1.82s | 44M | 3.4K |
| utoo-npm | 3.40s | 0.57s | 0.49s | 2.70s | 97M | 7.1K |
| utoo | 3.71s | 0.15s | 0.45s | 2.77s | 97M | 7.3K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 13.2K | 955 | - | - | 1.78G | 1.90G | 2M |
| utoo-npm | 12.1K | 72.2K | - | - | 1.61G | 1.82G | 2M |
| utoo | 12.2K | 71.4K | - | - | 1.61G | 1.82G | 2M |
