
perf(pm): preload worker-pool replaces FuturesUnordered #2827

Draft · elrrrrrrr wants to merge 1 commit into perf/manifest-cache from perf/preload-worker-pool

Conversation

@elrrrrrrr (Contributor)

Summary

Fourth and final perf PR from #2818, and the headline architectural change. Stacks on #2826 (perf/manifest-cache) — the Send/Sync trait surface and the MaybeSend/MaybeSync shims live there.

p1_resolve preload wall: 3.10s → 2.15s (−31%); avg_conc 55 → 84 on CI. ruborist now matches the standalone manifest-bench HTTP stack ceiling.

Why

The old FuturesUnordered-on-main-task design saturated under the deeper await chain inside resolve_package (cache check + OnceMap::get_or_init + RetryIf + send + bytes + parse). avg_conc plateaued at 55-61 even after every Mutex / allocator hot path was eliminated, while the standalone manifest-bench (#2824) sustained 92 on the same reqwest stack.

How

N long-lived tokio::spawn workers pull from a shared Arc<SegQueue<Dep>> with DashSet dedup. Termination runs via dispatched / completed AtomicUsize counters + Notify + mpsc close. The main task only drains completions to fire receiver events + the on_manifest callback — no more cooperative-poll bottleneck. A sketch of the worker loop follows below.
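A minimal sketch of one worker under this protocol, with illustrative names (`Pool`, `Dep`, `Completion`, `resolve_one` stand in for the real types; the DashSet dedup and child-dep dispatch onto the queue are elided):

```rust
use std::sync::{
    atomic::{AtomicBool, AtomicUsize, Ordering},
    Arc,
};

use crossbeam_queue::SegQueue;
use tokio::sync::{mpsc, Notify};

pub struct Dep(pub String);
pub struct Completion(pub String);

pub struct Pool {
    pub queue: SegQueue<Dep>,    // lock-free MPMC work queue
    pub dispatched: AtomicUsize, // deps pushed onto the queue so far
    pub completed: AtomicUsize,  // resolve calls finished so far
    pub shutdown: AtomicBool,    // raised by the finishing worker
    pub notify: Notify,          // wakes workers parked on an empty queue
}

// Stand-in for resolve_package (cache check, HTTP fetch, parse).
async fn resolve_one(dep: &Dep) -> Completion {
    Completion(dep.0.clone())
}

pub async fn worker(pool: Arc<Pool>, result_tx: mpsc::UnboundedSender<Completion>) {
    loop {
        if let Some(dep) = pool.queue.pop() {
            let completion = resolve_one(&dep).await;
            let done = pool.completed.fetch_add(1, Ordering::AcqRel) + 1;
            // Last completion with an empty queue: raise shutdown and
            // wake the parked peers so they can observe it and exit.
            if done == pool.dispatched.load(Ordering::Acquire) && pool.queue.is_empty() {
                pool.shutdown.store(true, Ordering::Release);
                pool.notify.notify_waiters();
            }
            if result_tx.send(completion).is_err() {
                break; // main task dropped the receiver
            }
            continue;
        }
        // Queue looked empty: park, registering for wakeups *before* the
        // final re-check so a concurrent notify_waiters() is not lost.
        let notified = pool.notify.notified();
        tokio::pin!(notified);
        notified.as_mut().enable();
        if pool.shutdown.load(Ordering::Acquire) {
            break;
        }
        if !pool.queue.is_empty() {
            continue; // work arrived between pop() and enable()
        }
        notified.await;
    }
    // result_tx drops here; once every worker exits, the channel closes
    // and the main task's recv().await loop ends.
}
```

The `enable()` step mirrors tokio's documented pattern for pairing `Notify::notify_waiters()` with a condition check without losing wakeups.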

Trait surface change

  • MockRegistryClient derives Clone
  • preload_manifests takes registry: Arc<R>; R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static, R::Error: MaybeSend. The bound is propagated up the public API chain (a typical shim shape is sketched below).
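The shim definitions live in #2826; the usual shape for such cfg-gated aliases is sketched here (an assumption about their form, not a copy of the actual code):

```rust
// Native targets: the shims are exactly Send / Sync.
#[cfg(not(target_arch = "wasm32"))]
pub trait MaybeSend: Send {}
#[cfg(not(target_arch = "wasm32"))]
impl<T: Send> MaybeSend for T {}

#[cfg(not(target_arch = "wasm32"))]
pub trait MaybeSync: Sync {}
#[cfg(not(target_arch = "wasm32"))]
impl<T: Sync> MaybeSync for T {}

// wasm32: single-threaded, so every type qualifies and the bound
// imposes nothing. Non-Send clients (Rc, RefCell) still compile.
#[cfg(target_arch = "wasm32")]
pub trait MaybeSend {}
#[cfg(target_arch = "wasm32")]
impl<T> MaybeSend for T {}

#[cfg(target_arch = "wasm32")]
pub trait MaybeSync {}
#[cfg(target_arch = "wasm32")]
impl<T> MaybeSync for T {}
```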

Companion changes folded in (all motivated by the worker-pool shift)

  • Inline simd_json parse — drop spawn_blocking (the cap=4 blocking pool was queue-saturated under worker-pool concurrency)
  • Workspace package.json parallel reads — ~150ms saved (see the sketch after this list)
  • Setup + lockfile-write timing logs — complete the per-phase wall-time accounting for the auto-comment from #2824 (ci(pm): pm-e2e-bench unified workflow + phase-isolated bench infrastructure)
  • Manifests concurrency cap 64 → 128 — the worker-pool unblocks the cap raise
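For the workspace-reads bullet, a sketch of the fan-out shape, assuming tokio::fs and a pre-matched path list (glob matching and manifest parsing elided; function name is illustrative):

```rust
use std::path::PathBuf;

use futures::stream::{FuturesUnordered, StreamExt};

// Fan out one read per matched workspace package.json instead of the
// old serial `for path in matched_paths` loop.
async fn read_workspace_manifests(
    paths: Vec<PathBuf>,
) -> Vec<(PathBuf, std::io::Result<Vec<u8>>)> {
    let mut tasks: FuturesUnordered<_> = paths
        .into_iter()
        .map(|path| async move {
            let bytes = tokio::fs::read(&path).await;
            (path, bytes)
        })
        .collect();

    let mut out = Vec::with_capacity(tasks.len());
    while let Some(item) = tasks.next().await {
        out.push(item);
    }
    out
}
```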

Wasm CI

cfg-gates tokio::spawn → wasm_bindgen_futures::spawn_local on wasm32 (JsFuture is !Send). Single-threaded there, but the queue + Notify + mpsc termination is unchanged; a sketch of the gate follows below. cargo check -p utoo-wasm --target wasm32-unknown-unknown is clean.
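The gate is typically a small spawn shim along these lines (an assumed shape, not the PR's exact code):

```rust
/// Spawn a worker future on the right executor for the target.
#[cfg(not(target_arch = "wasm32"))]
fn spawn_worker<F>(fut: F)
where
    F: std::future::Future<Output = ()> + Send + 'static,
{
    tokio::spawn(fut);
}

/// wasm32: JsFuture is !Send, so run on the single-threaded
/// browser executor instead. No Send bound required.
#[cfg(target_arch = "wasm32")]
fn spawn_worker<F>(fut: F)
where
    F: std::future::Future<Output = ()> + 'static,
{
    wasm_bindgen_futures::spawn_local(fut);
}
```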

Stack order

  1. #2824 ci(pm): pm-e2e-bench unified workflow + phase-isolated bench infrastructure (bench infra) — independent base
  2. #2825 perf(pm): chunked parallel writes within tarball extraction + Tar Slip guard (install path) — independent base
  3. #2826 perf(pm): manifest cache & resolver alloc cleanup (manifest cache cleanup) — base for this PR
  4. This PR — targets perf/manifest-cache; will rebase onto next once #2826 lands

Test plan

  • cargo fmt + cargo clippy --all-targets -- -D warnings --no-deps: clean
  • cargo test -p utoo-ruborist: 164 tests + 10 doctests pass
  • cargo test -p utoo-pm: 248/249 pass (1 pre-existing flake)
  • cargo check -p utoo-wasm --target wasm32-unknown-unknown: clean
  • CI bench-phases auto-comment with p1 wall delta vs the next baseline (depends on #2824 landing; triggers on the bench-phases label)

Context

Full journey + failed-experiments catalog: #2818

🤖 Generated with Claude Code


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a worker-pool architecture for manifest preloading to improve parallelism and performance, replacing the previous FuturesUnordered approach. It also parallelizes workspace discovery and adds detailed diagnostic logging and timing instrumentation across various phases of the resolution process. Feedback focuses on aligning concurrency defaults to avoid throttling, restoring Wasm compatibility by reverting explicit Send/Sync trait bounds to the previous MaybeSend/MaybeSync shims, and ensuring proper worker shutdown on channel failures to prevent task leaks.

///
/// Preload now runs N long-lived worker tasks; this is N. Each worker
/// processes one resolve_package at a time on tokio's global pool.
pub const DEFAULT_CONCURRENCY: usize = 256;
medium

The DEFAULT_CONCURRENCY has been increased to 256, which appears to be a discrepancy with the PR description and the pm crate's configuration where 128 is identified as the "sweet spot" to avoid Cloudflare throttling. Using 256 as a library default might lead to unexpected throttling for other consumers of the library. Consider aligning this default with the 128 limit used in user_config.rs.

Suggested change
pub const DEFAULT_CONCURRENCY: usize = 256;
pub const DEFAULT_CONCURRENCY: usize = 128;

Comment thread: crates/ruborist/src/resolver/preload.rs (outdated)
Comment on lines +117 to +118
R: RegistryClient + Send + Sync + 'static,
R::Error: Send,
medium

The trait bounds for R and R::Error have been changed from the MaybeSend/MaybeSync shims to explicit Send and Sync. This is a regression for Wasm support, as it prevents using RegistryClient implementations that are not Send or Sync (e.g., those using Rc or RefCell), which is common in single-threaded Wasm environments. Since wasm_bindgen_futures::spawn_local is used on Wasm, these strict bounds are not necessary there.

Suggested change
R: RegistryClient + Send + Sync + 'static,
R::Error: Send,
R: RegistryClient + crate::util::maybe_send::MaybeSend + crate::util::maybe_send::MaybeSync + 'static,
R::Error: crate::util::maybe_send::MaybeSend,

Comment on lines +202 to +205
if result_tx.send((name, result, elapsed_ms)).is_err() {
// Main task dropped the receiver — done collecting.
break;
}
medium

If result_tx.send fails, it indicates that the main task has dropped the receiver (e.g., due to an error or early return). In this case, the worker should signal a shutdown to ensure that other workers waiting on notify.notified() can also exit, preventing a potential leak of worker tasks that would otherwise hang until the process terminates.

Suggested change
if result_tx.send((name, result, elapsed_ms)).is_err() {
// Main task dropped the receiver — done collecting.
break;
}
if result_tx.send((name, result, elapsed_ms)).is_err() {
// Main task dropped the receiver — signal shutdown to other workers.
shutdown.store(true, Ordering::Release);
notify.notify_waiters();
break;
}

Comment thread: crates/ruborist/src/resolver/builder.rs (outdated)
Comment on lines +710 to +717
pub async fn build_deps<R: RegistryClient + Clone + Send + Sync + 'static>(
graph: &mut DependencyGraph,
registry: &R,
peer_deps: PeerDeps,
) -> Result<(), ResolveError<R::Error>> {
) -> Result<(), ResolveError<R::Error>>
where
R::Error: Send,
{
medium

The trait bounds for R and R::Error have been changed to explicit Send and Sync, which breaks compatibility with non-thread-safe registry clients on Wasm. Reverting to the MaybeSend and MaybeSync shims (which were used previously in this file) maintains the intended cross-platform flexibility while still satisfying tokio::spawn requirements on native targets.

Suggested change
pub async fn build_deps<R: RegistryClient + Clone + Send + Sync + 'static>(
graph: &mut DependencyGraph,
registry: &R,
peer_deps: PeerDeps,
) -> Result<(), ResolveError<R::Error>> {
) -> Result<(), ResolveError<R::Error>>
where
R::Error: Send,
{
pub async fn build_deps<R: RegistryClient + Clone + crate::util::maybe_send::MaybeSend + crate::util::maybe_send::MaybeSync + 'static>(
graph: &mut DependencyGraph,
registry: &R,
peer_deps: PeerDeps,
) -> Result<(), ResolveError<R::Error>>
where
R::Error: crate::util::maybe_send::MaybeSend,
{

/// println!("Resolved to {}@{}", resolved.name, resolved.version);
/// ```
pub async fn resolve_package<R: RegistryClient + crate::util::maybe_send::MaybeSync>(
pub async fn resolve_package<R: RegistryClient + Sync>(
medium

Replacing crate::util::maybe_send::MaybeSync with Sync here restricts the registry client to be Sync even on Wasm, where it might not be necessary or possible for certain implementations. Using the shim preserves the intended cross-platform abstraction.

Suggested change
pub async fn resolve_package<R: RegistryClient + Sync>(
pub async fn resolve_package<R: RegistryClient + crate::util::maybe_send::MaybeSync>(

@elrrrrrrr marked this pull request as draft on April 25, 2026 15:22
@elrrrrrrr force-pushed the perf/preload-worker-pool branch from 0c25e43 to ce129d1 on April 25, 2026 15:24
@elrrrrrrr force-pushed the perf/manifest-cache branch from 3be6b63 to 2831262 on April 25, 2026 15:30
@elrrrrrrr force-pushed the perf/preload-worker-pool branch from ce129d1 to d141a6d on April 25, 2026 15:34
@elrrrrrrr changed the title from perf(ruborist): preload worker-pool replaces FuturesUnordered to perf(pm): preload worker-pool replaces FuturesUnordered on Apr 25, 2026
@elrrrrrrr added the benchmark (Run pm-bench on PR) label on Apr 27, 2026
@elrrrrrrr force-pushed the perf/manifest-cache branch from 2831262 to 2d5befb on Apr 27, 2026 03:31
The headline architectural change of #2818 — the preload phase
shifts from a single-task `FuturesUnordered` cooperative poller
to N long-lived `tokio::spawn` workers (or
`wasm_bindgen_futures::spawn_local` on wasm32, where `Send` isn't
satisfied). Stacks on top of #2826.

## Why

Old design: main task owned `FuturesUnordered`, polled all preload
futures cooperatively, and ran every per-future continuation
(post-await body, completion handler, dispatch refill) on the same
single task. The deeper await chain inside `resolve_package`
(cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send`
+ `bytes` + parse spawn_blocking) made each future yield 5+ times,
and every yield round-tripped through main — saturating it. CI
ant-design preload sustained avg_conc=55-61 even after Mutex /
allocator hot-path eliminations, while the standalone
manifest-bench (#2824) hit 92 on the same reqwest stack.

## How

N long-lived `tokio::spawn` workers pulling from a shared
lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns
an `Arc<R>` clone and runs `resolve_package` on tokio's global
executor — futures progress fully independently, no cooperative
poll bottleneck. Main task only drains an `mpsc::unbounded_channel`
of completions to fire receiver events + on_manifest callback.

Termination: workers track `dispatched` / `completed: AtomicUsize`
and park on a shared `Notify` when the queue is empty. When the
last completion makes `completed == dispatched` and the queue is
empty, the finishing worker raises a `shutdown` flag and wakes
others; all workers drop their result_tx clones, the channel
closes, and the main `recv().await` loop exits.
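The main-task side then reduces to a plain drain loop. A sketch with illustrative types (`Completion` as in the worker sketch under "How" in the PR description):

```rust
use tokio::sync::mpsc;

pub struct Completion(pub String); // as in the worker sketch

pub async fn drain_completions(
    mut result_rx: mpsc::UnboundedReceiver<Completion>,
    mut on_manifest: impl FnMut(&Completion),
) {
    // No cooperative polling of worker futures — just drain completions.
    while let Some(completion) = result_rx.recv().await {
        on_manifest(&completion); // fire receiver events + callback
    }
    // recv() returned None: every worker dropped its result_tx clone,
    // i.e. the termination protocol above ran to completion.
}
```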

## Trait surface change

- `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can
  wrap the mock in `Arc` for the new signature
- `preload_manifests` takes `registry: Arc<R>` (was `&R`); call
  site in `run_preload_phase` clones the borrowed registry into a
  fresh `Arc`. Bound at every public surface up the chain bumped
  to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`,
  `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims
  (added in #2826) keep the trait surface wasm-compatible.

## Companion changes folded in

- **Inline simd_json parse** — drop `tokio::task::spawn_blocking`
  in `service/manifest.rs`. The worker pool surfaced blocking-pool
  queue saturation in the parse path: `queue p95=200ms sum=70-89s`
  over 2730 manifests on cap=4 CI runners. Parsing inline on the
  worker task eliminates the dispatch + queue overhead (see the
  sketch after this list).
- **Workspace package.json parallel reads** — switch the per-pattern
  `for path in matched_paths` serial loop to `FuturesUnordered`
  fan-out. ant-design has ~200 workspace packages; saved ~150ms.
- **Setup phase + lockfile-write timing logs** — round out the
  per-phase wall account for the bench-comment infrastructure.
- **Manifests concurrency cap 64 → 128** — worker-pool delivers
  the parallelism that justifies the cap raise. CI ant-design
  avg_conc 84 at cap=128 (up from 55 under the old architecture);
  preload wall 3.10s → 2.15s.
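For the inline-parse bullet, the before/after shape is roughly the following; `Manifest` and the function names are illustrative, and the real code parses into the crate's own manifest types:

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct Manifest {
    name: String,
    version: String,
}

// Before: parse dispatched to tokio's blocking pool. Under the worker
// pool the cap=4 blocking threads became the bottleneck
// (queue p95=200ms over 2730 manifests).
async fn parse_blocking(mut bytes: Vec<u8>) -> Result<Manifest, simd_json::Error> {
    tokio::task::spawn_blocking(move || simd_json::from_slice(&mut bytes))
        .await
        .expect("blocking task panicked")
}

// After: parse inline on the worker task. A few ms of CPU per manifest
// is cheaper than blocking-pool dispatch + queueing.
fn parse_inline(bytes: &mut [u8]) -> Result<Manifest, simd_json::Error> {
    simd_json::from_slice(bytes)
}
```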

## Wasm CI

cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local`
on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers
still run independently — single-threaded under wasm but the
queue + Notify + mpsc termination story is unchanged.

`cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean.

Tests: 164 ruborist + 10 doctests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@elrrrrrrr force-pushed the perf/preload-worker-pool branch from d141a6d to 9990542 on April 27, 2026 03:34
@github-actions

📊 pm-bench-phases · 1828361 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-next (next-branch baseline) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.13s | 0.08s | 10.08s | 10.12s | 739M | 344.4K |
| utoo-next | 10.14s | 0.20s | 11.46s | 13.32s | 1.29G | 156.2K |
| utoo-npm | 9.94s | 0.41s | 11.48s | 13.22s | 1.29G | 164.3K |
| utoo | 27.01s | 32.11s | 11.02s | 12.92s | 2.45G | 280.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 14.2K | 16.8K | 1.17G | 6M | 1.86G | 1.75G | 1M |
| utoo-next | 171.4K | 161.3K | 1.15G | 4M | 1.70G | 1.69G | 2M |
| utoo-npm | 170.4K | 160.2K | 1.15G | 4M | 1.70G | 1.69G | 2M |
| utoo | 125.7K | 70.8K | 1.15G | 5M | 1.70G | 1.69G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 1.87s | 0.02s | 3.91s | 1.00s | 484M | 161.7K |
| utoo-next | 5.03s | 0.06s | 5.87s | 1.09s | 432M | 69.7K |
| utoo-npm | 4.99s | 0.09s | 5.76s | 1.10s | 430M | 74.0K |
| utoo | 2.55s | 0.19s | 5.39s | 2.01s | 1.48G | 190.5K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 7.9K | 4.3K | 200M | 3M | 104M | - | 1M |
| utoo-next | 64.7K | 2.7K | 203M | 2M | 9M | 5M | 2M |
| utoo-npm | 64.4K | 3.0K | 204M | 2M | 9M | 5M | 2M |
| utoo | 23.7K | 17.6K | 196M | 3M | 7M | 5M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 7.11s | 1.25s | 6.04s | 9.83s | 635M | 203.8K |
| utoo-next | 7.26s | 1.53s | 5.35s | 11.30s | 954M | 112.5K |
| utoo-npm | 7.37s | 0.85s | 5.48s | 11.41s | 1018M | 127.8K |
| utoo | 6.59s | 0.44s | 5.42s | 11.24s | 937M | 134.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 4.0K | 6.4K | 1003M | 3M | 1.75G | 1.75G | 1M |
| utoo-next | 105.7K | 65.3K | 974M | 2M | 1.69G | 1.69G | 2M |
| utoo-npm | 115.0K | 77.5K | 974M | 3M | 1.69G | 1.69G | 2M |
| utoo | 106.9K | 70.1K | 975M | 3M | 1.69G | 1.69G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.28s | 0.13s | 0.20s | 2.33s | 137M | 33.8K |
| utoo-next | 2.72s | 0.56s | 0.56s | 3.86s | 81M | 19.3K |
| utoo-npm | 2.34s | 0.16s | 0.57s | 3.82s | 81M | 19.2K |
| utoo | 2.29s | 0.02s | 0.49s | 3.77s | 82M | 18.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 334 | 22 | 7M | 27K | 1.90G | 1.71G | 1M |
| utoo-next | 47.7K | 19.5K | 37K | 13K | 1.69G | 1.69G | 2M |
| utoo-npm | 47.0K | 20.5K | 38K | 17K | 1.69G | 1.69G | 2M |
| utoo | 47.5K | 19.7K | 36K | 10K | 1.70G | 1.69G | 2M |

npmmirror.com

p0_full_cold

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 31.58s | 4.75s | 9.76s | 10.79s | 517M | 357.3K |
| utoo-next | 37.01s | 4.11s | 8.22s | 14.59s | 722M | 118.1K |
| utoo-npm | 37.74s | 4.24s | 8.33s | 14.68s | 671M | 139.2K |
| utoo | 27.76s | 3.49s | 7.54s | 13.28s | 789M | 127.6K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 149.8K | 3.6K | 1.13G | 17M | 1.85G | 1.74G | 2M |
| utoo-next | 261.7K | 89.0K | 1006M | 11M | 1.69G | 1.69G | 2M |
| utoo-npm | 268.0K | 96.5K | 1023M | 11M | 1.69G | 1.69G | 2M |
| utoo | 206.6K | 79.4K | 1.01G | 10M | 1.69G | 1.69G | 2M |

p1_resolve

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.73s | 0.04s | 3.62s | 1.28s | 473M | 219.7K |
| utoo-next | 6.66s | 0.19s | 2.22s | 0.63s | 75M | 16.1K |
| utoo-npm | 6.46s | 0.04s | 2.14s | 0.56s | 75M | 16.0K |
| utoo | 3.55s | 0.84s | 1.21s | 0.38s | 84M | 17.8K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 24.6K | 2.2K | 151M | 4M | 106M | - | 2M |
| utoo-next | 48.6K | 547 | 14M | 2M | - | 4M | 2M |
| utoo-npm | 48.4K | 524 | 14M | 2M | - | 4M | 2M |
| utoo | 24.7K | 149 | 16M | 3M | - | 4M | 2M |

p3_cold_install

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 22.42s | 0.21s | 6.18s | 10.01s | 229M | 82.5K |
| utoo-next | 42.61s | 0.46s | 6.05s | 13.50s | 626M | 118.3K |
| utoo-npm | 47.34s | 7.52s | 6.06s | 13.28s | 630M | 111.2K |
| utoo | 37.18s | 1.95s | 5.93s | 12.44s | 590M | 116.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 112.5K | 3.3K | 997M | 12M | 1.71G | 1.71G | 2M |
| utoo-next | 216.6K | 81.0K | 978M | 10M | 1.69G | 1.69G | 2M |
| utoo-npm | 203.0K | 81.2K | 977M | 9M | 1.69G | 1.69G | 2M |
| utoo | 168.8K | 72.6K | 992M | 8M | 1.69G | 1.69G | 2M |

p4_warm_link

| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.50s | 0.40s | 0.18s | 2.31s | 135M | 30.9K |
| utoo-next | 2.37s | 0.07s | 0.60s | 3.86s | 82M | 18.7K |
| utoo-npm | 2.35s | 0.08s | 0.57s | 3.85s | 82M | 19.0K |
| utoo | 2.14s | 0.08s | 0.49s | 3.81s | 83M | 19.1K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 876 | 23 | 3M | 66K | 1.84G | 1.74G | 2M |
| utoo-next | 48.1K | 20.4K | 52K | 8K | 1.69G | 1.69G | 2M |
| utoo-npm | 47.7K | 21.4K | 55K | 22K | 1.69G | 1.69G | 2M |
| utoo | 49.1K | 20.1K | 50K | 13K | 1.69G | 1.69G | 2M |

elrrrrrrr added a commit that referenced this pull request Apr 27, 2026
Probe whether the concurrency cap (not the worker-pool
architecture) is the actual perf driver behind PR #2827's
claimed -30% preload wall. The spawn_local worker-pool alone
showed no measurable change vs the FuturesUnordered baseline
on p1_resolve (utoo ≈ utoo-next), so isolate the cap bump
on top of our existing minimal architecture change.

If bench shows p1_resolve gain at cap=128 → cap raise is
the driver; we can keep this and consider whether worker-
pool itself adds anything. If still no gain → cap isn't
the driver either, and PR #2827's number probably came
from inline simd_json parse / workspace parallel reads /
manifest cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels

benchmark (Run pm-bench on PR)