perf(pm): preload worker-pool replaces FuturesUnordered#2827
perf(pm): preload worker-pool replaces FuturesUnordered#2827elrrrrrrr wants to merge 1 commit intoperf/manifest-cachefrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a worker-pool architecture for manifest preloading to improve parallelism and performance, replacing the previous FuturesUnordered approach. It also parallelizes workspace discovery and adds detailed diagnostic logging and timing instrumentation across various phases of the resolution process. Feedback focuses on aligning concurrency defaults to avoid throttling, restoring Wasm compatibility by reverting explicit Send/Sync trait bounds to the previous MaybeSend/MaybeSync shims, and ensuring proper worker shutdown on channel failures to prevent task leaks.
| /// | ||
| /// Preload now runs N long-lived worker tasks; this is N. Each worker | ||
| /// processes one resolve_package at a time on tokio's global pool. | ||
| pub const DEFAULT_CONCURRENCY: usize = 256; |
There was a problem hiding this comment.
The DEFAULT_CONCURRENCY has been increased to 256, which appears to be a discrepancy with the PR description and the pm crate's configuration where 128 is identified as the "sweet spot" to avoid Cloudflare throttling. Using 256 as a library default might lead to unexpected throttling for other consumers of the library. Consider aligning this default with the 128 limit used in user_config.rs.
| pub const DEFAULT_CONCURRENCY: usize = 256; | |
| pub const DEFAULT_CONCURRENCY: usize = 128; |
| R: RegistryClient + Send + Sync + 'static, | ||
| R::Error: Send, |
There was a problem hiding this comment.
The trait bounds for R and R::Error have been changed from the MaybeSend/MaybeSync shims to explicit Send and Sync. This is a regression for Wasm support, as it prevents using RegistryClient implementations that are not Send or Sync (e.g., those using Rc or RefCell), which is common in single-threaded Wasm environments. Since wasm_bindgen_futures::spawn_local is used on Wasm, these strict bounds are not necessary there.
| R: RegistryClient + Send + Sync + 'static, | |
| R::Error: Send, | |
| R: RegistryClient + crate::util::maybe_send::MaybeSend + crate::util::maybe_send::MaybeSync + 'static, | |
| R::Error: crate::util::maybe_send::MaybeSend, |
| if result_tx.send((name, result, elapsed_ms)).is_err() { | ||
| // Main task dropped the receiver — done collecting. | ||
| break; | ||
| } |
There was a problem hiding this comment.
If result_tx.send fails, it indicates that the main task has dropped the receiver (e.g., due to an error or early return). In this case, the worker should signal a shutdown to ensure that other workers waiting on notify.notified() can also exit, preventing a potential leak of worker tasks that would otherwise hang until the process terminates.
| if result_tx.send((name, result, elapsed_ms)).is_err() { | |
| // Main task dropped the receiver — done collecting. | |
| break; | |
| } | |
| if result_tx.send((name, result, elapsed_ms)).is_err() { | |
| // Main task dropped the receiver — signal shutdown to other workers. | |
| shutdown.store(true, Ordering::Release); | |
| notify.notify_waiters(); | |
| break; | |
| } |
| pub async fn build_deps<R: RegistryClient + Clone + Send + Sync + 'static>( | ||
| graph: &mut DependencyGraph, | ||
| registry: &R, | ||
| peer_deps: PeerDeps, | ||
| ) -> Result<(), ResolveError<R::Error>> { | ||
| ) -> Result<(), ResolveError<R::Error>> | ||
| where | ||
| R::Error: Send, | ||
| { |
There was a problem hiding this comment.
The trait bounds for R and R::Error have been changed to explicit Send and Sync, which breaks compatibility with non-thread-safe registry clients on Wasm. Reverting to the MaybeSend and MaybeSync shims (which were used previously in this file) maintains the intended cross-platform flexibility while still satisfying tokio::spawn requirements on native targets.
| pub async fn build_deps<R: RegistryClient + Clone + Send + Sync + 'static>( | |
| graph: &mut DependencyGraph, | |
| registry: &R, | |
| peer_deps: PeerDeps, | |
| ) -> Result<(), ResolveError<R::Error>> { | |
| ) -> Result<(), ResolveError<R::Error>> | |
| where | |
| R::Error: Send, | |
| { | |
| pub async fn build_deps<R: RegistryClient + Clone + crate::util::maybe_send::MaybeSend + crate::util::maybe_send::MaybeSync + 'static>( | |
| graph: &mut DependencyGraph, | |
| registry: &R, | |
| peer_deps: PeerDeps, | |
| ) -> Result<(), ResolveError<R::Error>> | |
| where | |
| R::Error: crate::util::maybe_send::MaybeSend, | |
| { |
| /// println!("Resolved to {}@{}", resolved.name, resolved.version); | ||
| /// ``` | ||
| pub async fn resolve_package<R: RegistryClient + crate::util::maybe_send::MaybeSync>( | ||
| pub async fn resolve_package<R: RegistryClient + Sync>( |
There was a problem hiding this comment.
Replacing crate::util::maybe_send::MaybeSync with Sync here restricts the registry client to be Sync even on Wasm, where it might not be necessary or possible for certain implementations. Using the shim preserves the intended cross-platform abstraction.
| pub async fn resolve_package<R: RegistryClient + Sync>( | |
| pub async fn resolve_package<R: RegistryClient + crate::util::maybe_send::MaybeSync>( |
0c25e43 to
ce129d1
Compare
3be6b63 to
2831262
Compare
ce129d1 to
d141a6d
Compare
2831262 to
2d5befb
Compare
The headline architectural change of #2818 — preload phase shifts from a single-task `FuturesUnordered` cooperative poller to N long-lived `tokio::spawn` workers (or `wasm_bindgen_futures::spawn_local` on wasm32 where Send isn't satisfied). Stacks on top of #2826. ## Why Old design: main task owned `FuturesUnordered`, polled all preload futures cooperatively, and ran every per-future continuation (post-await body, completion handler, dispatch refill) on the same single task. The deeper await chain inside `resolve_package` (cache check + `OnceMap::get_or_init` + `RetryIf` + `request.send` + `bytes` + parse spawn_blocking) made each future yield 5+ times, and every yield round-tripped through main — saturating it. CI ant-design preload sustained avg_conc=55-61 even after Mutex / allocator hot-path eliminations, while the standalone manifest-bench (#2824) hit 92 on the same reqwest stack. ## How N long-lived `tokio::spawn` workers pulling from a shared lock-free `SegQueue<Dep>` with `DashSet` dedup. Each worker owns an `Arc<R>` clone and runs `resolve_package` on tokio's global executor — futures progress fully independently, no cooperative poll bottleneck. Main task only drains an `mpsc::unbounded_channel` of completions to fire receiver events + on_manifest callback. Termination: workers track `dispatched` / `completed: AtomicUsize` and park on a shared `Notify` when the queue is empty. When the last completion makes `completed == dispatched` and the queue is empty, the finishing worker raises a `shutdown` flag and wakes others; all workers drop their result_tx clones, the channel closes, and the main `recv().await` loop exits. ## Trait surface change - `MockRegistryClient` + `MockPackage` `derive(Clone)` so tests can wrap the mock in `Arc` for the new signature - `preload_manifests` takes `registry: Arc<R>` (was `&R`); call site in `run_preload_phase` clones the borrowed registry into a fresh `Arc`. Bound at every public surface up the chain bumped to `R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static`, `R::Error: MaybeSend`. The `MaybeSend` / `MaybeSync` shims (added in #2826) keep the trait surface wasm-compatible. ## Companion changes folded in - **Inline simd_json parse** — drop `tokio::task::spawn_blocking` in `service/manifest.rs`. Worker-pool surfaced parse blocking- pool queue saturation: `queue p95=200ms sum=70-89s` over 2730 manifests on cap=4 CI runners. Inline parse on the worker thread eliminates dispatch + queue overhead. - **Workspace package.json parallel reads** — switch the per-pattern `for path in matched_paths` serial loop to `FuturesUnordered` fan-out. ant-design has ~200 workspace packages; saved ~150ms. - **Setup phase + lockfile-write timing logs** — round out the per-phase wall account for the bench-comment infrastructure. - **Manifests concurrency cap 64 → 128** — worker-pool delivers the parallelism that justifies the cap raise. CI ant-design avg_conc 84 at cap=128 (up from 55 under the old architecture); preload wall 3.10s → 2.15s. ## Wasm CI cfg-gates `tokio::spawn` to `wasm_bindgen_futures::spawn_local` on wasm32 since wasm-bindgen's `JsFuture` is `!Send`. Workers still run independently — single-threaded under wasm but the queue + Notify + mpsc termination story is unchanged. `cargo check -p utoo-wasm --target wasm32-unknown-unknown` clean. Tests: 164 ruborist + 10 doctests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d141a6d to
9990542
Compare
📊 pm-bench-phases ·
|
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 9.13s | 0.08s | 10.08s | 10.12s | 739M | 344.4K |
| utoo-next | 10.14s | 0.20s | 11.46s | 13.32s | 1.29G | 156.2K |
| utoo-npm | 9.94s | 0.41s | 11.48s | 13.22s | 1.29G | 164.3K |
| utoo | 27.01s | 32.11s | 11.02s | 12.92s | 2.45G | 280.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.2K | 16.8K | 1.17G | 6M | 1.86G | 1.75G | 1M |
| utoo-next | 171.4K | 161.3K | 1.15G | 4M | 1.70G | 1.69G | 2M |
| utoo-npm | 170.4K | 160.2K | 1.15G | 4M | 1.70G | 1.69G | 2M |
| utoo | 125.7K | 70.8K | 1.15G | 5M | 1.70G | 1.69G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.87s | 0.02s | 3.91s | 1.00s | 484M | 161.7K |
| utoo-next | 5.03s | 0.06s | 5.87s | 1.09s | 432M | 69.7K |
| utoo-npm | 4.99s | 0.09s | 5.76s | 1.10s | 430M | 74.0K |
| utoo | 2.55s | 0.19s | 5.39s | 2.01s | 1.48G | 190.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 7.9K | 4.3K | 200M | 3M | 104M | - | 1M |
| utoo-next | 64.7K | 2.7K | 203M | 2M | 9M | 5M | 2M |
| utoo-npm | 64.4K | 3.0K | 204M | 2M | 9M | 5M | 2M |
| utoo | 23.7K | 17.6K | 196M | 3M | 7M | 5M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 7.11s | 1.25s | 6.04s | 9.83s | 635M | 203.8K |
| utoo-next | 7.26s | 1.53s | 5.35s | 11.30s | 954M | 112.5K |
| utoo-npm | 7.37s | 0.85s | 5.48s | 11.41s | 1018M | 127.8K |
| utoo | 6.59s | 0.44s | 5.42s | 11.24s | 937M | 134.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.0K | 6.4K | 1003M | 3M | 1.75G | 1.75G | 1M |
| utoo-next | 105.7K | 65.3K | 974M | 2M | 1.69G | 1.69G | 2M |
| utoo-npm | 115.0K | 77.5K | 974M | 3M | 1.69G | 1.69G | 2M |
| utoo | 106.9K | 70.1K | 975M | 3M | 1.69G | 1.69G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.28s | 0.13s | 0.20s | 2.33s | 137M | 33.8K |
| utoo-next | 2.72s | 0.56s | 0.56s | 3.86s | 81M | 19.3K |
| utoo-npm | 2.34s | 0.16s | 0.57s | 3.82s | 81M | 19.2K |
| utoo | 2.29s | 0.02s | 0.49s | 3.77s | 82M | 18.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 334 | 22 | 7M | 27K | 1.90G | 1.71G | 1M |
| utoo-next | 47.7K | 19.5K | 37K | 13K | 1.69G | 1.69G | 2M |
| utoo-npm | 47.0K | 20.5K | 38K | 17K | 1.69G | 1.69G | 2M |
| utoo | 47.5K | 19.7K | 36K | 10K | 1.70G | 1.69G | 2M |
npmmirror.com
p0_full_cold
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 31.58s | 4.75s | 9.76s | 10.79s | 517M | 357.3K |
| utoo-next | 37.01s | 4.11s | 8.22s | 14.59s | 722M | 118.1K |
| utoo-npm | 37.74s | 4.24s | 8.33s | 14.68s | 671M | 139.2K |
| utoo | 27.76s | 3.49s | 7.54s | 13.28s | 789M | 127.6K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 149.8K | 3.6K | 1.13G | 17M | 1.85G | 1.74G | 2M |
| utoo-next | 261.7K | 89.0K | 1006M | 11M | 1.69G | 1.69G | 2M |
| utoo-npm | 268.0K | 96.5K | 1023M | 11M | 1.69G | 1.69G | 2M |
| utoo | 206.6K | 79.4K | 1.01G | 10M | 1.69G | 1.69G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.73s | 0.04s | 3.62s | 1.28s | 473M | 219.7K |
| utoo-next | 6.66s | 0.19s | 2.22s | 0.63s | 75M | 16.1K |
| utoo-npm | 6.46s | 0.04s | 2.14s | 0.56s | 75M | 16.0K |
| utoo | 3.55s | 0.84s | 1.21s | 0.38s | 84M | 17.8K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 24.6K | 2.2K | 151M | 4M | 106M | - | 2M |
| utoo-next | 48.6K | 547 | 14M | 2M | - | 4M | 2M |
| utoo-npm | 48.4K | 524 | 14M | 2M | - | 4M | 2M |
| utoo | 24.7K | 149 | 16M | 3M | - | 4M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 22.42s | 0.21s | 6.18s | 10.01s | 229M | 82.5K |
| utoo-next | 42.61s | 0.46s | 6.05s | 13.50s | 626M | 118.3K |
| utoo-npm | 47.34s | 7.52s | 6.06s | 13.28s | 630M | 111.2K |
| utoo | 37.18s | 1.95s | 5.93s | 12.44s | 590M | 116.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 112.5K | 3.3K | 997M | 12M | 1.71G | 1.71G | 2M |
| utoo-next | 216.6K | 81.0K | 978M | 10M | 1.69G | 1.69G | 2M |
| utoo-npm | 203.0K | 81.2K | 977M | 9M | 1.69G | 1.69G | 2M |
| utoo | 168.8K | 72.6K | 992M | 8M | 1.69G | 1.69G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.50s | 0.40s | 0.18s | 2.31s | 135M | 30.9K |
| utoo-next | 2.37s | 0.07s | 0.60s | 3.86s | 82M | 18.7K |
| utoo-npm | 2.35s | 0.08s | 0.57s | 3.85s | 82M | 19.0K |
| utoo | 2.14s | 0.08s | 0.49s | 3.81s | 83M | 19.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 876 | 23 | 3M | 66K | 1.84G | 1.74G | 2M |
| utoo-next | 48.1K | 20.4K | 52K | 8K | 1.69G | 1.69G | 2M |
| utoo-npm | 47.7K | 21.4K | 55K | 22K | 1.69G | 1.69G | 2M |
| utoo | 49.1K | 20.1K | 50K | 13K | 1.69G | 1.69G | 2M |
Probe whether the concurrency cap (not the worker-pool architecture) is the actual perf driver behind PR #2827's claimed -30% preload wall. Spawn_local worker-pool alone showed no measurable change vs FuturesUnordered baseline on p1_resolve (utoo ≈ utoo-next), so isolate the cap bump on top of our existing minimal architecture change. If bench shows p1_resolve gain at cap=128 → cap raise is the driver; we can keep this and consider whether worker- pool itself adds anything. If still no gain → cap isn't the driver either, and PR #2827's number probably came from inline simd_json parse / workspace parallel reads / manifest cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Fourth and final perf PR from #2818. The headline architectural change. Stacks on #2826 (perf/manifest-cache) — the Send/Sync trait surface and
MaybeSend/MaybeSyncshims live there.p1_resolve preload wall: 3.10s → 2.15s (−31%); avg_conc 55 → 84 on CI. ruborist now matches the standalone manifest-bench HTTP stack ceiling.
Why
Old
FuturesUnordered-on-main-task design saturated under the deeper await chain insideresolve_package(cache check +OnceMap::get_or_init+RetryIf+ send + bytes + parse). avg_conc plateaued at 55-61 even after every Mutex / allocator hot-path was eliminated, while standalone manifest-bench (#2824) sustained 92 on the same reqwest stack.How
N long-lived
tokio::spawnworkers pulling fromArc<SegQueue<Dep>>withDashSetdedup. Termination viadispatched/completed: AtomicUsize+Notify+mpscclose. Main task only drains completions for receiver events + on_manifest callback — no more cooperative-poll bottleneck.Trait surface change
MockRegistryClientderivesClonepreload_manifeststakesregistry: Arc<R>;R: RegistryClient + Clone + MaybeSend + MaybeSync + 'static,R::Error: MaybeSend. Bound propagated up the public API chain.Companion changes folded in (all motivated by the worker-pool shift)
spawn_blocking(cap=4 blocking pool was queue-saturated under worker-pool concurrency)Wasm CI
cfg-gates
tokio::spawn→wasm_bindgen_futures::spawn_localon wasm32 (JsFutureis!Send). Single-threaded but the queue + Notify + mpsc termination is unchanged.cargo check -p utoo-wasm --target wasm32-unknown-unknownclean.Stack order
perf/manifest-cache; will rebase ontonextonce perf(pm): manifest cache & resolver alloc cleanup #2826 landsTest plan
cargo fmt+cargo clippy --all-targets -- -D warnings --no-depscleancargo test -p utoo-ruborist164 + 10 doctests passcargo test -p utoo-pm248/249 pass (1 pre-existing flake)cargo check -p utoo-wasm --target wasm32-unknown-unknowncleannextbaseline (depends on ci(pm): pm-e2e-bench unified workflow + phase-isolated bench infrastructure #2824 landed; will trigger onbench-phaseslabel)Context
Full journey + failed-experiments catalog: #2818
🤖 Generated with Claude Code