
perf(pm): preload worker-pool replaces FuturesUnordered #2827

Draft · elrrrrrrr wants to merge 1 commit into perf/manifest-cache from perf/preload-worker-pool

Conversation

@elrrrrrrr (Contributor)

Summary

Fourth and final perf PR from #2818, and the headline architectural change. Stacks on #2826 (perf/manifest-cache) — the Send/Sync trait surface and the MaybeSend/MaybeSync shims live there.

p1_resolve preload wall: 3.10s → 2.15s (−31%); avg_conc 55 → 84 on CI. ruborist now matches the standalone manifest-bench HTTP stack ceiling.

Why

The old FuturesUnordered-on-main-task design saturated under the deeper await chain inside resolve_package (cache check + OnceMap::get_or_init + RetryIf + send + bytes + parse). avg_conc plateaued at 55-61 even after every Mutex / allocator hot path was eliminated, while the standalone manifest-bench (#2824) sustained 92 on the same reqwest stack.

How

N long-lived tokio::spawn workers pull from a shared Arc<SegQueue<Dep>> with DashSet dedup. Termination runs via dispatched / completed AtomicUsize counters + Notify + mpsc close. The main task only drains completions to fire receiver events + the on_manifest callback — no more cooperative-poll bottleneck. A sketch of the worker loop follows below.
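A minimal sketch of one worker under this protocol, with illustrative names (`Pool`, `Dep`, `Completion`, `resolve_one` stand in for the real types; the DashSet dedup and child-dep dispatch onto the queue are elided):

```rust
use std::sync::{
    atomic::{AtomicBool, AtomicUsize, Ordering},
    Arc,
};

use crossbeam_queue::SegQueue;
use tokio::sync::{mpsc, Notify};

pub struct Dep(pub String);
pub struct Completion(pub String);

pub struct Pool {
    pub queue: SegQueue<Dep>,    // lock-free MPMC work queue
    pub dispatched: AtomicUsize, // deps pushed onto the queue so far
    pub completed: AtomicUsize,  // resolve calls finished so far
    pub shutdown: AtomicBool,    // raised by the finishing worker
    pub notify: Notify,          // wakes workers parked on an empty queue
}

// Stand-in for resolve_package (cache check, HTTP fetch, parse).
async fn resolve_one(dep: &Dep) -> Completion {
    Completion(dep.0.clone())
}

pub async fn worker(pool: Arc<Pool>, result_tx: mpsc::UnboundedSender<Completion>) {
    loop {
        if let Some(dep) = pool.queue.pop() {
            let completion = resolve_one(&dep).await;
            let done = pool.completed.fetch_add(1, Ordering::AcqRel) + 1;
            // Last completion with an empty queue: raise shutdown and
            // wake the parked peers so they can observe it and exit.
            if done == pool.dispatched.load(Ordering::Acquire) && pool.queue.is_empty() {
                pool.shutdown.store(true, Ordering::Release);
                pool.notify.notify_waiters();
            }
            if result_tx.send(completion).is_err() {
                break; // main task dropped the receiver
            }
            continue;
        }
        // Queue looked empty: park, registering for wakeups *before* the
        // final re-check so a concurrent notify_waiters() is not lost.
        let notified = pool.notify.notified();
        tokio::pin!(notified);
        notified.as_mut().enable();
        if pool.shutdown.load(Ordering::Acquire) {
            break;
        }
        if !pool.queue.is_empty() {
            continue; // work arrived between pop() and enable()
        }
        notified.await;
    }
    // result_tx drops here; once every worker exits, the channel closes
    // and the main task's recv().await loop ends.
}
```

The `enable()` step mirrors tokio's documented pattern for pairing `Notify::notify_waiters()` with a condition check without losing wakeups.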

Trait surface change

  • MockRegistryClient derives Clone
  • preload_manifests takes registry: Arc<R>; R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static, R::Error: MaybeSend. The bound is propagated up the public API chain (a typical shim shape is sketched below).
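The shim definitions live in #2826; the usual shape for such cfg-gated aliases is sketched here (an assumption about their form, not a copy of the actual code):

```rust
// Native targets: the shims are exactly Send / Sync.
#[cfg(not(target_arch = "wasm32"))]
pub trait MaybeSend: Send {}
#[cfg(not(target_arch = "wasm32"))]
impl<T: Send> MaybeSend for T {}

#[cfg(not(target_arch = "wasm32"))]
pub trait MaybeSync: Sync {}
#[cfg(not(target_arch = "wasm32"))]
impl<T: Sync> MaybeSync for T {}

// wasm32: single-threaded, so every type qualifies and the bound
// imposes nothing. Non-Send clients (Rc, RefCell) still compile.
#[cfg(target_arch = "wasm32")]
pub trait MaybeSend {}
#[cfg(target_arch = "wasm32")]
impl<T> MaybeSend for T {}

#[cfg(target_arch = "wasm32")]
pub trait MaybeSync {}
#[cfg(target_arch = "wasm32")]
impl<T> MaybeSync for T {}
```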

Companion changes folded in (all motivated by the worker-pool shift)

  • Inline simd_json parse — drop spawn_blocking (the cap=4 blocking pool was queue-saturated under worker-pool concurrency)
  • Workspace package.json parallel reads — ~150ms saved (see the sketch after this list)
  • Setup + lockfile-write timing logs — complete the per-phase wall-time accounting for the auto-comment from #2824 (ci(pm): pm-e2e-bench unified workflow + phase-isolated bench infrastructure)
  • Manifests concurrency cap 64 → 128 — the worker-pool unblocks the cap raise
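For the workspace-reads bullet, a sketch of the fan-out shape, assuming tokio::fs and a pre-matched path list (glob matching and manifest parsing elided; function name is illustrative):

```rust
use std::path::PathBuf;

use futures::stream::{FuturesUnordered, StreamExt};

// Fan out one read per matched workspace package.json instead of the
// old serial `for path in matched_paths` loop.
async fn read_workspace_manifests(
    paths: Vec<PathBuf>,
) -> Vec<(PathBuf, std::io::Result<Vec<u8>>)> {
    let mut tasks: FuturesUnordered<_> = paths
        .into_iter()
        .map(|path| async move {
            let bytes = tokio::fs::read(&path).await;
            (path, bytes)
        })
        .collect();

    let mut out = Vec::with_capacity(tasks.len());
    while let Some(item) = tasks.next().await {
        out.push(item);
    }
    out
}
```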

Wasm CI

cfg-gates tokio::spawn → wasm_bindgen_futures::spawn_local on wasm32 (JsFuture is !Send). Single-threaded there, but the queue + Notify + mpsc termination is unchanged; a sketch of the gate follows below. cargo check -p utoo-wasm --target wasm32-unknown-unknown is clean.
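The gate is typically a small spawn shim along these lines (an assumed shape, not the PR's exact code):

```rust
/// Spawn a worker future on the right executor for the target.
#[cfg(not(target_arch = "wasm32"))]
fn spawn_worker<F>(fut: F)
where
    F: std::future::Future<Output = ()> + Send + 'static,
{
    tokio::spawn(fut);
}

/// wasm32: JsFuture is !Send, so run on the single-threaded
/// browser executor instead. No Send bound required.
#[cfg(target_arch = "wasm32")]
fn spawn_worker<F>(fut: F)
where
    F: std::future::Future<Output = ()> + 'static,
{
    wasm_bindgen_futures::spawn_local(fut);
}
```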

Stack order

  1. #2824 ci(pm): pm-e2e-bench unified workflow + phase-isolated bench infrastructure (bench infra) — independent base
  2. #2825 perf(pm): chunked parallel writes within tarball extraction + Tar Slip guard (install path) — independent base
  3. #2826 perf(pm): manifest cache & resolver alloc cleanup (manifest cache cleanup) — base for this PR
  4. This PR — targets perf/manifest-cache; will rebase onto next once #2826 lands

Test plan

  • cargo fmt + cargo clippy --all-targets -- -D warnings --no-deps: clean
  • cargo test -p utoo-ruborist: 164 tests + 10 doctests pass
  • cargo test -p utoo-pm: 248/249 pass (1 pre-existing flake)
  • cargo check -p utoo-wasm --target wasm32-unknown-unknown: clean
  • CI bench-phases auto-comment with p1 wall delta vs the next baseline (depends on #2824 landing; triggers on the bench-phases label)

Context

Full journey + failed-experiments catalog: #2818

🤖 Generated with Claude Code


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a worker-pool architecture for manifest preloading to improve parallelism and performance, replacing the previous FuturesUnordered approach. It also parallelizes workspace discovery and adds detailed diagnostic logging and timing instrumentation across various phases of the resolution process. Feedback focuses on aligning concurrency defaults to avoid throttling, restoring Wasm compatibility by reverting explicit Send/Sync trait bounds to the previous MaybeSend/MaybeSync shims, and ensuring proper worker shutdown on channel failures to prevent task leaks.

///
/// Preload now runs N long-lived worker tasks; this is N. Each worker
/// processes one resolve_package at a time on tokio's global pool.
pub const DEFAULT_CONCURRENCY: usize = 256;
medium

The DEFAULT_CONCURRENCY has been increased to 256, which appears to be a discrepancy with the PR description and the pm crate's configuration where 128 is identified as the "sweet spot" to avoid Cloudflare throttling. Using 256 as a library default might lead to unexpected throttling for other consumers of the library. Consider aligning this default with the 128 limit used in user_config.rs.

Suggested change
pub const DEFAULT_CONCURRENCY: usize = 256;
pub const DEFAULT_CONCURRENCY: usize = 128;

Comment thread: crates/ruborist/src/resolver/preload.rs (outdated)
Comment on lines +117 to +118
R: RegistryClient + Send + Sync + 'static,
R::Error: Send,
medium

The trait bounds for R and R::Error have been changed from the MaybeSend/MaybeSync shims to explicit Send and Sync. This is a regression for Wasm support, as it prevents using RegistryClient implementations that are not Send or Sync (e.g., those using Rc or RefCell), which is common in single-threaded Wasm environments. Since wasm_bindgen_futures::spawn_local is used on Wasm, these strict bounds are not necessary there.

Suggested change
R: RegistryClient + Send + Sync + 'static,
R::Error: Send,
R: RegistryClient + crate::util::maybe_send::MaybeSend + crate::util::maybe_send::MaybeSync + 'static,
R::Error: crate::util::maybe_send::MaybeSend,

Comment on lines +202 to +205
if result_tx.send((name, result, elapsed_ms)).is_err() {
// Main task dropped the receiver — done collecting.
break;
}
medium

If result_tx.send fails, it indicates that the main task has dropped the receiver (e.g., due to an error or early return). In this case, the worker should signal a shutdown to ensure that other workers waiting on notify.notified() can also exit, preventing a potential leak of worker tasks that would otherwise hang until the process terminates.

Suggested change
if result_tx.send((name, result, elapsed_ms)).is_err() {
// Main task dropped the receiver — done collecting.
break;
}
if result_tx.send((name, result, elapsed_ms)).is_err() {
// Main task dropped the receiver — signal shutdown to other workers.
shutdown.store(true, Ordering::Release);
notify.notify_waiters();
break;
}

Comment thread: crates/ruborist/src/resolver/builder.rs (outdated)
Comment on lines +710 to +717
pub async fn build_deps<R: RegistryClient + Clone + Send + Sync + 'static>(
graph: &mut DependencyGraph,
registry: &R,
peer_deps: PeerDeps,
) -> Result<(), ResolveError<R::Error>> {
) -> Result<(), ResolveError<R::Error>>
where
R::Error: Send,
{
medium

The trait bounds for R and R::Error have been changed to explicit Send and Sync, which breaks compatibility with non-thread-safe registry clients on Wasm. Reverting to the MaybeSend and MaybeSync shims (which were used previously in this file) maintains the intended cross-platform flexibility while still satisfying tokio::spawn requirements on native targets.

Suggested change
pub async fn build_deps<R: RegistryClient + Clone + Send + Sync + 'static>(
graph: &mut DependencyGraph,
registry: &R,
peer_deps: PeerDeps,
) -> Result<(), ResolveError<R::Error>> {
) -> Result<(), ResolveError<R::Error>>
where
R::Error: Send,
{
pub async fn build_deps<R: RegistryClient + Clone + crate::util::maybe_send::MaybeSend + crate::util::maybe_send::MaybeSync + 'static>(
graph: &mut DependencyGraph,
registry: &R,
peer_deps: PeerDeps,
) -> Result<(), ResolveError<R::Error>>
where
R::Error: crate::util::maybe_send::MaybeSend,
{

/// println!("Resolved to {}@{}", resolved.name, resolved.version);
/// ```
pub async fn resolve_package<R: RegistryClient + crate::util::maybe_send::MaybeSync>(
pub async fn resolve_package<R: RegistryClient + Sync>(
medium

Replacing crate::util::maybe_send::MaybeSync with Sync here restricts the registry client to be Sync even on Wasm, where it might not be necessary or possible for certain implementations. Using the shim preserves the intended cross-platform abstraction.

Suggested change
pub async fn resolve_package<R: RegistryClient + Sync>(
pub async fn resolve_package<R: RegistryClient + crate::util::maybe_send::MaybeSync>(

@elrrrrrrr marked this pull request as draft on April 25, 2026 15:22
@elrrrrrrr force-pushed the perf/preload-worker-pool branch from 0c25e43 to ce129d1 on April 25, 2026 15:24
@elrrrrrrr force-pushed the perf/manifest-cache branch from 3be6b63 to 2831262 on April 25, 2026 15:30
@elrrrrrrr force-pushed the perf/preload-worker-pool branch from ce129d1 to d141a6d on April 25, 2026 15:34
@elrrrrrrr changed the title from perf(ruborist): preload worker-pool replaces FuturesUnordered to perf(pm): preload worker-pool replaces FuturesUnordered on Apr 25, 2026
@elrrrrrrr added the benchmark (Run pm-bench on PR) label on Apr 27, 2026
@elrrrrrrr force-pushed the perf/manifest-cache branch from 2831262 to 2d5befb on Apr 27, 2026 03:31
The headline architectural change of #2818 — the preload phase
shifts from a single-task `FuturesUnordered` cooperative poller
to N long-lived `tokio::spawn` workers (or
`wasm_bindgen_futures::spawn_local` on wasm32, where `Send` isn't
satisfied). Stacks on top of #2826.

## Why

Old design: main task owned `FuturesUnordered`, polled all preload
futures cooperatively, and ran every per-future continuation
(post-await body, completion handler, dispatch refill) on the same
single task. The deeper await chain inside `resolve_package`
(cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send`
+ `bytes` + parse spawn_blocking) made each future yield 5+ times,
and every yield round-tripped through main — saturating it. CI
ant-design preload sustained avg_conc=55-61 even after Mutex /
allocator hot-path eliminations, while the standalone
manifest-bench (#2824) hit 92 on the same reqwest stack.

## How

N long-lived `tokio::spawn` workers pulling from a shared
lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns
an `Arc<R>` clone and runs `resolve_package` on tokio's global
executor — futures progress fully independently, no cooperative
poll bottleneck. Main task only drains an `mpsc::unbounded_channel`
of completions to fire receiver events + on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize`
and park on a shared `Notify` when the queue is empty. When the
last completion makes `completed == dispatched` and the queue is
empty, the finishing worker raises a `shutdown` flag and wakes
others; all workers drop their result_tx clones, the channel
closes, and the main `recv().await` loop exits.
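The main-task side then reduces to a plain drain loop. A sketch with illustrative types (`Completion` as in the worker sketch under "How" in the PR description):

```rust
use tokio::sync::mpsc;

pub struct Completion(pub String); // as in the worker sketch

pub async fn drain_completions(
    mut result_rx: mpsc::UnboundedReceiver<Completion>,
    mut on_manifest: impl FnMut(&Completion),
) {
    // No cooperative polling of worker futures — just drain completions.
    while let Some(completion) = result_rx.recv().await {
        on_manifest(&completion); // fire receiver events + callback
    }
    // recv() returned None: every worker dropped its result_tx clone,
    // i.e. the termination protocol above ran to completion.
}
```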

## Trait surface change

- `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can
  wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call
  site in `run_preload_phase` clones the borrowed registry into a
  fresh `Arc`. Bound at every public surface up the chain bumped
  to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`,
  `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims
  (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking`
  in `service/manifest.rs`. The worker pool surfaced blocking-pool
  queue saturation in the parse path: `queue p95=200ms sum=70-89s`
  over 2730 manifests on cap=4 CI runners. Parsing inline on the
  worker task eliminates the dispatch + queue overhead (see the
  sketch after this list).
- **Workspace package.json parallel reads** — switch the per-pattern
  `for path in matched_paths` serial loop to `FuturesUnordered`
  fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the
  per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — worker-pool delivers
  the parallelism that justifies the cap raise. CI ant-design
  avg_conc 84 at cap=128 (up from 55 under the old architecture);
  preload wall 3.10s → 2.15s.
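For the inline-parse bullet, the before/after shape is roughly the following; `Manifest` and the function names are illustrative, and the real code parses into the crate's own manifest types:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Manifest {
    name: String,
    version: String,
}

// Before: parse dispatched to tokio's blocking pool. Under the worker
// pool the cap=4 blocking threads became the bottleneck
// (queue p95=200ms over 2730 manifests).
async fn parse_blocking(mut bytes: Vec<u8>) -> Result<Manifest, simd_json::Error> {
    tokio::task::spawn_blocking(move || simd_json::from_slice(&mut bytes))
        .await
        .expect("blocking task panicked")
}

// After: parse inline on the worker task. A few ms of CPU per manifest
// is cheaper than blocking-pool dispatch + queueing.
fn parse_inline(bytes: &mut [u8]) -> Result<Manifest, simd_json::Error> {
    simd_json::from_slice(bytes)
}
```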

## Wasm CI

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local`
on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers
still run independently — single-threaded under wasm but the
queue + Notify + mpsc termination story is unchanged.

`cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Tests: 164 ruborist + 10 doctests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr force-pushed the perf/preload-worker-pool branch from d141a6d to 9990542 on April 27, 2026 03:34
@github-actions

📊 pm-bench-phases · 1828361 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-next (next-branch baseline) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.13s | 0.08s | 10.08s | 10.12s | 739M | 344.4K |
| utoo-next | 10.14s | 0.20s | 11.46s | 13.32s | 1.29G | 156.2K |
| utoo-npm | 9.94s | 0.41s | 11.48s | 13.22s | 1.29G | 164.3K |
| utoo | 27.01s | 32.11s | 11.02s | 12.92s | 2.45G | 280.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.2K | 16.8K | 1.17G | 6M | 1.86G | 1.75G | 1M |
| utoo-next | 171.4K | 161.3K | 1.15G | 4M | 1.70G | 1.69G | 2M |
| utoo-npm | 170.4K | 160.2K | 1.15G | 4M | 1.70G | 1.69G | 2M |
| utoo | 125.7K | 70.8K | 1.15G | 5M | 1.70G | 1.69G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.87s | 0.02s | 3.91s | 1.00s | 484M | 161.7K |
| utoo-next | 5.03s | 0.06s | 5.87s | 1.09s | 432M | 69.7K |
| utoo-npm | 4.99s | 0.09s | 5.76s | 1.10s | 430M | 74.0K |
| utoo | 2.55s | 0.19s | 5.39s | 2.01s | 1.48G | 190.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 7.9K | 4.3K | 200M | 3M | 104M | - | 1M |
| utoo-next | 64.7K | 2.7K | 203M | 2M | 9M | 5M | 2M |
| utoo-npm | 64.4K | 3.0K | 204M | 2M | 9M | 5M | 2M |
| utoo | 23.7K | 17.6K | 196M | 3M | 7M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 7.11s | 1.25s | 6.04s | 9.83s | 635M | 203.8K |
| utoo-next | 7.26s | 1.53s | 5.35s | 11.30s | 954M | 112.5K |
| utoo-npm | 7.37s | 0.85s | 5.48s | 11.41s | 1018M | 127.8K |
| utoo | 6.59s | 0.44s | 5.42s | 11.24s | 937M | 134.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 4.0K | 6.4K | 1003M | 3M | 1.75G | 1.75G | 1M |
| utoo-next | 105.7K | 65.3K | 974M | 2M | 1.69G | 1.69G | 2M |
| utoo-npm | 115.0K | 77.5K | 974M | 3M | 1.69G | 1.69G | 2M |
| utoo | 106.9K | 70.1K | 975M | 3M | 1.69G | 1.69G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.28s | 0.13s | 0.20s | 2.33s | 137M | 33.8K |
| utoo-next | 2.72s | 0.56s | 0.56s | 3.86s | 81M | 19.3K |
| utoo-npm | 2.34s | 0.16s | 0.57s | 3.82s | 81M | 19.2K |
| utoo | 2.29s | 0.02s | 0.49s | 3.77s | 82M | 18.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 334 | 22 | 7M | 27K | 1.90G | 1.71G | 1M |
| utoo-next | 47.7K | 19.5K | 37K | 13K | 1.69G | 1.69G | 2M |
| utoo-npm | 47.0K | 20.5K | 38K | 17K | 1.69G | 1.69G | 2M |
| utoo | 47.5K | 19.7K | 36K | 10K | 1.70G | 1.69G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 31.58s | 4.75s | 9.76s | 10.79s | 517M | 357.3K |
| utoo-next | 37.01s | 4.11s | 8.22s | 14.59s | 722M | 118.1K |
| utoo-npm | 37.74s | 4.24s | 8.33s | 14.68s | 671M | 139.2K |
| utoo | 27.76s | 3.49s | 7.54s | 13.28s | 789M | 127.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 149.8K | 3.6K | 1.13G | 17M | 1.85G | 1.74G | 2M |
| utoo-next | 261.7K | 89.0K | 1006M | 11M | 1.69G | 1.69G | 2M |
| utoo-npm | 268.0K | 96.5K | 1023M | 11M | 1.69G | 1.69G | 2M |
| utoo | 206.6K | 79.4K | 1.01G | 10M | 1.69G | 1.69G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.73s | 0.04s | 3.62s | 1.28s | 473M | 219.7K |
| utoo-next | 6.66s | 0.19s | 2.22s | 0.63s | 75M | 16.1K |
| utoo-npm | 6.46s | 0.04s | 2.14s | 0.56s | 75M | 16.0K |
| utoo | 3.55s | 0.84s | 1.21s | 0.38s | 84M | 17.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 24.6K | 2.2K | 151M | 4M | 106M | - | 2M |
| utoo-next | 48.6K | 547 | 14M | 2M | - | 4M | 2M |
| utoo-npm | 48.4K | 524 | 14M | 2M | - | 4M | 2M |
| utoo | 24.7K | 149 | 16M | 3M | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 22.42s | 0.21s | 6.18s | 10.01s | 229M | 82.5K |
| utoo-next | 42.61s | 0.46s | 6.05s | 13.50s | 626M | 118.3K |
| utoo-npm | 47.34s | 7.52s | 6.06s | 13.28s | 630M | 111.2K |
| utoo | 37.18s | 1.95s | 5.93s | 12.44s | 590M | 116.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 112.5K | 3.3K | 997M | 12M | 1.71G | 1.71G | 2M |
| utoo-next | 216.6K | 81.0K | 978M | 10M | 1.69G | 1.69G | 2M |
| utoo-npm | 203.0K | 81.2K | 977M | 9M | 1.69G | 1.69G | 2M |
| utoo | 168.8K | 72.6K | 992M | 8M | 1.69G | 1.69G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.50s | 0.40s | 0.18s | 2.31s | 135M | 30.9K |
| utoo-next | 2.37s | 0.07s | 0.60s | 3.86s | 82M | 18.7K |
| utoo-npm | 2.35s | 0.08s | 0.57s | 3.85s | 82M | 19.0K |
| utoo | 2.14s | 0.08s | 0.49s | 3.81s | 83M | 19.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 876 | 23 | 3M | 66K | 1.84G | 1.74G | 2M |
| utoo-next | 48.1K | 20.4K | 52K | 8K | 1.69G | 1.69G | 2M |
| utoo-npm | 47.7K | 21.4K | 55K | 22K | 1.69G | 1.69G | 2M |
| utoo | 49.1K | 20.1K | 50K | 13K | 1.69G | 1.69G | 2M |

elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
Probe whether the concurrency cap (not the worker-pool
architecture) is the actual perf driver behind PR #2827's
claimed -30% preload wall. The spawn_local worker-pool alone
showed no measurable change vs the FuturesUnordered baseline
on p1_resolve (utoo ≈ utoo-next), so isolate the cap bump
on top of our existing minimal architecture change.

If bench shows p1_resolve gain at cap=128 → cap raise is
the driver; we can keep this and consider whether worker-
pool itself adds anything. If still no gain → cap isn't
the driver either, and PR #2827's number probably came
from inline simd_json parse / workspace parallel reads /
manifest cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

benchmark (Run pm-bench on PR)