perf(pm): batch install clone completions #2989

Draft
elrrrrrrr wants to merge 2 commits into exp/pm-install-inline-download-futures from exp/pm-install-clone-batch-completions

@elrrrrrrr commented May 19, 2026

Summary

  • Batch install clone completions so the scheduler receives one completion message for a small group of clone jobs instead of one per package.
  • Keep clone materialization on rayon workers; the scheduler only dispatches work and records completion state.
  • Use a bounded clone worker count derived from the existing clone concurrency limit, with a small fixed batch size to avoid long serialized hardlink runs.
  • Integrate #2991 (perf(pm): prioritize clone unblockers): prioritize parent clone unblockers and send them as single-job batches, so that nested dependency placement is not delayed behind unrelated hardlink work in the same batch.
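
The batching rule described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `CloneJob`, `plan_batches`, `job`, and the `is_parent_unblocker` flag are hypothetical names, and the batch limit of 3 is an assumption based on the "up to three operations" wording in the code review on this PR.

```rust
/// Hypothetical clone job. `is_parent_unblocker` marks a package whose
/// completion lets its nested dependencies enter the queue.
#[derive(Debug, Clone, PartialEq)]
struct CloneJob {
    name: String,
    is_parent_unblocker: bool,
}

/// Assumed batch size (not confirmed by the PR description itself).
const CLONE_BATCH_LIMIT: usize = 3;

/// Group pending jobs into batches: parent unblockers go first, one job per
/// batch; ordinary leaf clones follow in groups of up to CLONE_BATCH_LIMIT.
fn plan_batches(pending: Vec<CloneJob>) -> Vec<Vec<CloneJob>> {
    let (parents, leaves): (Vec<_>, Vec<_>) =
        pending.into_iter().partition(|job| job.is_parent_unblocker);

    let mut batches: Vec<Vec<CloneJob>> =
        parents.into_iter().map(|job| vec![job]).collect();
    batches.extend(leaves.chunks(CLONE_BATCH_LIMIT).map(|chunk| chunk.to_vec()));
    batches
}

/// Small constructor helper for the example.
fn job(name: &str, is_parent_unblocker: bool) -> CloneJob {
    CloneJob { name: name.to_string(), is_parent_unblocker }
}

fn main() {
    let pending = vec![
        job("leaf-a", false),
        job("parent-x", true),
        job("leaf-b", false),
        job("leaf-c", false),
        job("leaf-d", false),
    ];
    for (i, batch) in plan_batches(pending).iter().enumerate() {
        let names: Vec<&str> = batch.iter().map(|j| j.name.as_str()).collect();
        println!("batch {i}: {names:?}");
    }
}
```

With this input the parent lands alone in the first batch, so its completion message does not wait on any leaf clone.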

Why

The p3 (cold install) and p4 (warm link) phases are dominated by materialization pressure after the earlier install scheduler changes. Batching clone completions reduces scheduler wakeups, but parent packages sit on the critical path for nested placement: if a parent clone finishes early inside a batch but its completion is reported only after the other leaf clones in that batch finish, its children cannot enter the queue until then. The #2991 follow-up fixes this without disabling batching for ordinary leaf clones.
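
The critical-path argument can be made concrete with a small timing sketch. It assumes a worker runs the jobs of a batch serially and reports completion once at the end of the batch; the function names and the millisecond durations are hypothetical and chosen only for illustration.

```rust
/// Time (ms since batch start) at which each job finishes its own work,
/// assuming the worker runs the batch serially.
fn finish_times(durations_ms: &[u64]) -> Vec<u64> {
    durations_ms
        .iter()
        .scan(0u64, |elapsed, d| {
            *elapsed += d;
            Some(*elapsed)
        })
        .collect()
}

/// Extra time each job's completion stays invisible to the scheduler when
/// the whole batch is reported in one message at the end.
fn report_delays(durations_ms: &[u64]) -> Vec<u64> {
    let batch_end: u64 = durations_ms.iter().sum();
    finish_times(durations_ms)
        .into_iter()
        .map(|finished| batch_end - finished)
        .collect()
}

fn main() {
    // Illustrative durations: a quick parent clone followed by two leaves.
    let mixed_batch: [u64; 3] = [5, 20, 30];
    println!("mixed batch delays: {:?}", report_delays(&mixed_batch));

    // The same parent sent as a single-job batch.
    let parent_only: [u64; 1] = [5];
    println!("single-job delays: {:?}", report_delays(&parent_only));
}
```

In the mixed batch the first job's children would wait for the two leaves to finish; as a single-job batch its delay is zero.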

Validation

  • cargo fmt
  • cargo test -p utoo-pm install_scheduler
  • cargo clippy -p utoo-pm --all-targets -- -D warnings --no-deps

Note: the full-workspace cargo clippy --all-targets -- -D warnings --no-deps is currently blocked in this worktree by a pack-core/next.js submodule API mismatch, which is unrelated to the PM scheduler change.

Benchmark

Latest #2989 reference before #2991 integration, GHA Linux npmjs run 26089858682:

| phase | utoo wall | utoo vCtx/iCtx |
| --- | --- | --- |
| p0_full_cold | 7.48s | 66.5K / 43.2K |
| p1_resolve | 2.37s | 15.2K / 18.2K |
| p3_cold_install | 5.29s | 53.0K / 32.9K |
| p4_warm_link | 2.23s | 7.8K / 4.5K |

#2991 AB before integration, GHA Linux npmjs run 26093544293:

| phase | utoo wall | utoo vCtx/iCtx |
| --- | --- | --- |
| p3_cold_install | 5.21s | 48.4K / 34.3K |
| p4_warm_link | 1.86s | 7.6K / 4.4K |

Integrated #2989 head (fd38cfff), GHA Linux npmjs run 26094747554:

| phase | utoo wall | utoo vCtx/iCtx |
| --- | --- | --- |
| p0_full_cold | 7.28s | 67.4K / 44.2K |
| p1_resolve | 2.39s | 14.5K / 19.7K |
| p3_cold_install | 5.30s | 52.1K / 35.0K |
| p4_warm_link | 2.07s | 7.5K / 4.5K |

Latest full green integrated run 26095851789:

| phase | utoo wall | utoo vCtx/iCtx |
| --- | --- | --- |
| p0_full_cold | 10.51s | 89.5K / 57.4K |
| p1_resolve | 2.43s | 15.0K / 21.2K |
| p3_cold_install | 5.66s | 54.7K / 37.6K |
| p4_warm_link | 2.16s | 7.8K / 4.6K |

Conclusion: #2991 remains directionally positive for p4/warm materialize (2.23s -> 2.07s/2.16s, ctx stable). p3 is neutral to slightly noisy, which is acceptable because this change targets clone/materialize ordering rather than cold extract/cache population.

Additional GHA Reruns

Label-triggered Linux npmjs rerun 26102023925:

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 9.15s | 11.98s | 8.32s | 7.55s | 69.8K / 45.2K |
| p1 resolve | 2.09s | 3.12s | 3.10s | 2.41s | 14.1K / 18.5K |
| p3 cold install | 6.68s | 6.46s | 6.13s | 8.07s | 71.9K / 42.5K |
| p4 warm link | 3.38s | 2.44s | 2.35s | 2.08s | 7.6K / 4.4K |

Label-triggered Linux npmjs rerun 26102983933:

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 9.30s | 8.06s | 8.32s | 7.34s | 65.1K / 45.7K |
| p1 resolve | 1.97s | 3.05s | 3.04s | 2.39s | 14.9K / 19.3K |
| p3 cold install | 6.68s | 9.25s | 6.95s | 5.30s | 52.4K / 33.8K |
| p4 warm link | 3.34s | 2.37s | 2.37s | 2.20s | 7.9K / 4.6K |

Latest label-triggered Linux npmjs rerun 26110858368:

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 8.88s | 8.10s | 9.20s | 8.56s | 80.8K / 51.0K |
| p1 resolve | 1.94s | 2.94s | 3.00s | 3.04s | 14.2K / 19.7K |
| p3 cold install | 6.46s | 6.93s | 5.96s | 6.63s | 60.9K / 39.8K |
| p4 warm link | 3.35s | 2.33s | 2.30s | 2.11s | 7.5K / 4.5K |

Label-triggered Linux npmjs rerun 26112228488:

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 7.38s | 7.44s | 7.22s | 6.79s | 88.2K / 53.6K |
| p1 resolve | 2.53s | 3.09s | 3.81s | 2.69s | 17.7K / 18.9K |
| p3 cold install | 5.53s | 5.63s | 5.63s | 6.17s | 76.2K / 43.3K |
| p4 warm link | 2.11s | 1.74s | 1.60s | 1.67s | 8.5K / 4.1K |

p4 remains stable across reruns (2.07s/2.16s/2.08s/2.20s/2.11s, ctx about 7.5K/4.5K). p3 is noisier because it includes network download plus cache extraction writes. Use multiple runs for p3 rather than a single sample.
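
One way to compare phases across runs rather than by a single sample is to look at the median and the min-to-max spread. The sketch below uses the utoo wall times from the five runs whose p4 values are quoted above (integrated head, full green run, and the first three label-triggered reruns); the helper names `median` and `spread` are illustrative and not part of pm-bench.

```rust
/// utoo p3 cold-install wall times (seconds) from the five runs above.
const P3_WALL_S: [f64; 5] = [5.30, 5.66, 8.07, 5.30, 6.63];
/// utoo p4 warm-link wall times (seconds) from the same runs.
const P4_WALL_S: [f64; 5] = [2.07, 2.16, 2.08, 2.20, 2.11];

/// Median of a non-empty sample.
fn median(samples: &[f64]) -> f64 {
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Difference between the largest and smallest sample.
fn spread(samples: &[f64]) -> f64 {
    let max = samples.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let min = samples.iter().copied().fold(f64::INFINITY, f64::min);
    max - min
}

fn main() {
    println!("p3: median {:.2}s, spread {:.2}s", median(&P3_WALL_S), spread(&P3_WALL_S));
    println!("p4: median {:.2}s, spread {:.2}s", median(&P4_WALL_S), spread(&P4_WALL_S));
}
```

On these samples the p3 spread is more than twenty times the p4 spread, which is why a single p3 run is a weak basis for accepting or rejecting a change.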

Label-triggered Linux npmjs rerun 26115911963 (bench job succeeded):

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 9.36s | 8.13s | 8.90s | 7.36s | 83.3K / 51.8K |
| p1 resolve | 2.19s | 3.26s | 3.07s | 2.50s | 15.6K / 19.6K |
| p3 cold install | 6.50s | 7.44s | 7.71s | 5.90s | 65.5K / 41.3K |
| p4 warm link | 3.55s | 2.17s | 2.34s | 2.17s | 7.7K / 4.5K |

GHA pcap diagnostic run 26116045424 (diagnostic-only; pcap overhead/noise makes wall time unsuitable for ranking):

| phase | wall | streams | zero-window / retransmit | p99 stream gap | max IO util | max write await |
| --- | --- | --- | --- | --- | --- | --- |
| utoo install | 17.86s | 70 | 6 / 48 | 1.96ms | 97.3% | 481ms |
| utoo-next install | 12.45s | 69 | 4 / 9 | 2.41ms | 77.9% | 215ms |
| bun install | 13.58s | 259 | 10 / 605 | 41.97ms | 69.2% | 123ms |

Conclusion from pcap: p3 cold install does not show an obvious socket-drain starvation issue for utoo; utoo has fewer zero-window/retransmit events than bun in the install capture. The stronger signal is disk-write tail latency (io_util_max=97.3%, w_await_max=481ms), so the remaining p3 work should focus on cache/extract/write scheduling rather than raising generic tarball concurrency.

Label-triggered Linux npmjs rerun 26117122045 (bench job succeeded; p3 wall noisy):

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 9.47s | 9.28s | 9.29s | 11.05s | 71.7K / 48.4K |
| p1 resolve | 2.11s | 3.10s | 3.39s | 2.49s | 14.7K / 20.4K |
| p3 cold install | 6.96s | 10.11s | 6.30s | 8.32s | 66.5K / 40.8K |
| p4 warm link | 3.52s | 2.27s | 2.28s | 1.93s | 7.3K / 4.3K |

Label-triggered Linux npmjs rerun 26118716599 (bench job succeeded):

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 10.29s | 8.36s | 8.20s | 7.94s | 75.8K / 55.6K |
| p1 resolve | 2.24s | 3.28s | 3.25s | 2.46s | 16.1K / 20.9K |
| p3 cold install | 6.47s | 6.62s | 11.02s | 5.83s | 59.7K / 39.0K |
| p4 warm link | 3.47s | 2.41s | 2.31s | 2.18s | 7.6K / 4.6K |

Rejected / Inconclusive Follow-ups

| PR | idea | result |
| --- | --- | --- |
| #2992 | Pump clones before extracts | p3 regressed (6.22s, 56.8K/36.5K); reject for now. |
| #2993 | Halve extract slots while clone work exists | p3 wall improved (4.84s) but ctx regressed (59.0K/39.2K) and p4 no-op control moved heavily; inconclusive. |
| #2994 | Flush parent clone unblockers early from a worker | Does not beat #2991 clearly (p3 5.38s, p4 2.12s); extra completion state is not justified. |
| #2995 | Clone first waiter immediately after extract | p3 does not improve and vCtx rises (5.75s, 59.7K/33.9K); do not integrate. |
| #2996 | Cap install worker pressure | GHA regression vs #2989 (p3 5.68s, 62.3K/40.6K; p0 also regressed); do not integrate. |
| #2997 | Prioritize demand downloads over preload queue | Two GHA runs did not improve p3; rerun regressed (8.02s, 78.6K/46.1K); do not integrate. |
| #2998 | Raise install tarball default concurrency to 256 | N=2 GHA/npmjs did not validate p3 (8.03s, 77.1K/40.2K; rerun 6.35s, 68.7K/39.8K); do not use as a global/non-semver default. |
| #2999 | Spawn install download workers instead of polling download futures in scheduler | No target ctx/wall signal (p3 7.88s, 82.2K/29.4K) and p4 control was noisy; do not integrate. |
| #3000 | Use bounded chunked file writes for scheduler indexed extracts | GHA p3 wall improved but ctx was neutral (5.56s, 64.6K/41.2K) and p0 ctx regressed (90.5K/56.3K); poolab internal AB also showed higher ctx/sys. Do not integrate. |
| #3001 | Bound downloaded tarball backlog to 4 * extract_limit | GHA p3 regressed vs #2989 (6.12s, 74.5K/43.4K vs 5.83s, 59.7K/39.0K); poolab internal AB also showed slow-tarball tail risk. Do not integrate. |

Current pick status: #2991 is already integrated into this PR. No other follow-up experiment has enough GHA evidence to pick into #2989.

Label-triggered Linux npmjs rerun 26119947793 (bench job succeeded):

| phase | bun | utoo-next | utoo-npm | utoo | utoo ctx |
| --- | --- | --- | --- | --- | --- |
| p0 full cold | 9.00s | 7.94s | 8.07s | 7.65s | 81.0K / 51.4K |
| p1 resolve | 2.09s | 3.11s | 3.00s | 2.30s | 15.8K / 19.7K |
| p3 cold install | 6.68s | 6.48s | 6.14s | 5.47s | 68.8K / 42.3K |
| p4 warm link | 3.54s | 2.48s | 2.38s | 2.13s | 7.6K / 4.5K |

This keeps #2989 in the same GHA/npmjs band: p3 remains noisy but competitive on wall time, while p4 stays stable around 2.1s and 7.5K/4.5K ctx. This run did not capture npmmirror output.

@elrrrrrrr added the "A-Pkg Manager" (Area: Package Manager) and "benchmark" (Run pm-bench on PR) labels May 19, 2026

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces batching for clone operations in the installation scheduler to improve performance by reducing scheduler wakeups. It replaces individual clone tasks with batches of up to three operations and introduces a worker-based concurrency limit. Feedback indicates that the heuristic used to calculate the worker limit causes total concurrency to scale non-linearly and potentially exceed intended limits. It is suggested to either refine the formula to account for the batch size or document the rationale for the current implementation.

Comment on lines +634 to +639
fn clone_worker_limit(clone_limit: usize) -> usize {
    clone_limit
        .saturating_div(2)
        .saturating_add(2)
        .clamp(1, clone_limit.max(1))
}

medium

The clone_worker_limit calculation uses a heuristic (limit / 2 + 2) that doesn't explicitly account for CLONE_BATCH_LIMIT. This results in a total potential concurrency (workers * batch size) that scales non-linearly with the original clone_limit. For example, a clone_limit of 4 results in 4 workers (up to 12 concurrent clones), while a clone_limit of 16 results in 10 workers (up to 30 concurrent clones).

If the intention is to maintain a total concurrency close to the original clone_limit while batching, consider a formula like (clone_limit / CLONE_BATCH_LIMIT).max(1). If the increased concurrency is intentional to saturate the Rayon pool, documenting the rationale for this specific heuristic would improve maintainability.
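
To make the comparison concrete, the sketch below evaluates the quoted heuristic next to the suggested batch-aware formula. It assumes `CLONE_BATCH_LIMIT` is 3, based on the "up to three operations" description in the review summary; `batch_aware_worker_limit` and `max_assigned_jobs` are hypothetical helper names and not part of the PR.

```rust
/// Assumed batch size (inferred from the review summary, not confirmed).
const CLONE_BATCH_LIMIT: usize = 3;

/// Heuristic quoted in this review comment: limit / 2 + 2, clamped so it
/// never exceeds the original clone limit and never drops below 1.
fn clone_worker_limit(clone_limit: usize) -> usize {
    clone_limit
        .saturating_div(2)
        .saturating_add(2)
        .clamp(1, clone_limit.max(1))
}

/// Alternative suggested in the review: divide by the batch size.
fn batch_aware_worker_limit(clone_limit: usize) -> usize {
    (clone_limit / CLONE_BATCH_LIMIT).max(1)
}

/// Upper bound on clone jobs handed to workers at the same time.
fn max_assigned_jobs(workers: usize) -> usize {
    workers * CLONE_BATCH_LIMIT
}

fn main() {
    for clone_limit in [1, 4, 8, 16, 32] {
        let current = clone_worker_limit(clone_limit);
        let alternative = batch_aware_worker_limit(clone_limit);
        println!(
            "clone_limit={clone_limit}: current {current} workers ({} jobs), \
             batch-aware {alternative} workers ({} jobs)",
            max_assigned_jobs(current),
            max_assigned_jobs(alternative),
        );
    }
}
```

For a `clone_limit` of 4 and 16 this gives 4 and 10 workers (12 and 30 assigned jobs), matching the figures in the comment, versus 1 and 5 workers (3 and 15 jobs) for the batch-aware variant. If the jobs inside a batch run serially on one worker, as the PR summary's mention of serialized hardlink runs suggests, the number of clones actually running at once is bounded by the worker count rather than by workers times batch size.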

@github-actions

📊 pm-bench-phases · e4bbe06 · linux (ubuntu-latest)

Workflow run — ant-design

PMs: utoo (this branch) · utoo-npm (latest published) · bun (latest)

npmjs.org

p0_full_cold

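| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 9.00s | 0.22s | 10.25s | 9.90s | 763M | 336.4K |
| utoo-next | 7.94s | 0.37s | 10.53s | 12.21s | 1002M | 123.0K |
| utoo-npm | 8.07s | 0.10s | 10.69s | 12.06s | 1.00G | 126.3K |
| utoo | 7.65s | 0.24s | 11.07s | 10.86s | 892M | 145.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 15.6K | 18.2K | 1.19G | 6M | 1.86G | 1.75G | 1M |
| utoo-next | 129.1K | 96.1K | 1.16G | 5M | 1.71G | 1.70G | 2M |
| utoo-npm | 129.5K | 97.9K | 1.16G | 5M | 1.71G | 1.70G | 2M |
| utoo | 81.0K | 51.4K | 1.16G | 6M | 1.71G | 1.70G | 2M |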

p1_resolve

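| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 2.09s | 0.08s | 3.92s | 1.15s | 521M | 186.7K |
| utoo-next | 3.11s | 0.36s | 5.05s | 2.17s | 611M | 85.5K |
| utoo-npm | 3.00s | 0.02s | 5.21s | 2.14s | 606M | 81.0K |
| utoo | 2.30s | 0.03s | 5.71s | 1.65s | 644M | 120.4K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 10.2K | 4.2K | 202M | 3M | 107M | - | 1M |
| utoo-next | 74.7K | 86.1K | 200M | 2M | 7M | 3M | 2M |
| utoo-npm | 73.9K | 92.3K | 200M | 2M | 7M | 3M | 2M |
| utoo | 15.8K | 19.7K | 202M | 3M | 7M | 3M | 2M |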

p3_cold_install

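| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 6.68s | 0.06s | 6.31s | 9.56s | 581M | 192.1K |
| utoo-next | 6.48s | 0.32s | 5.04s | 10.74s | 505M | 60.0K |
| utoo-npm | 6.14s | 0.22s | 5.13s | 10.61s | 475M | 60.4K |
| utoo | 5.47s | 0.15s | 5.04s | 9.36s | 557M | 63.7K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 5.3K | 7.0K | 1019M | 4M | 1.76G | 1.76G | 1M |
| utoo-next | 115.0K | 56.6K | 989M | 3M | 1.70G | 1.70G | 2M |
| utoo-npm | 110.7K | 54.9K | 989M | 3M | 1.70G | 1.70G | 2M |
| utoo | 68.8K | 42.3K | 989M | 3M | 1.70G | 1.70G | 2M |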

p4_warm_link

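| PM | wall | ±σ | user | sys | RSS | pgMinor |
| --- | --- | --- | --- | --- | --- | --- |
| bun | 3.54s | 0.07s | 0.19s | 2.42s | 135M | 33.5K |
| utoo-next | 2.48s | 0.29s | 0.49s | 3.80s | 79M | 18.5K |
| utoo-npm | 2.38s | 0.08s | 0.47s | 3.73s | 77M | 17.7K |
| utoo | 2.13s | 0.05s | 0.34s | 3.26s | 50M | 11.2K |

| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
| --- | --- | --- | --- | --- | --- | --- | --- |
| bun | 241 | 25 | 5M | 37K | 1.91G | 1.75G | 1M |
| utoo-next | 43.7K | 21.4K | 5K | 4K | 1.70G | 1.70G | 2M |
| utoo-npm | 39.3K | 18.2K | 4K | 8K | 1.70G | 1.70G | 2M |
| utoo | 7.6K | 4.5K | 5K | 23K | 1.71G | 1.70G | 2M |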

npmmirror.com: no output captured.
