perf(pm): prioritize clone unblockers#2991
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the clone batching logic in install_scheduler.rs to prioritize 'unblocker' jobs—parent clones that unblock nested dependencies. It introduces take_clone_batch and pop_next_clone_job to handle this prioritization, ensuring that unblockers are processed individually to allow their children to enter the queue sooner. Unit tests were added to verify the prioritization and batching behavior. Review feedback highlights a potential performance bottleneck in pop_next_clone_job due to its O(N) complexity and suggests using separate queues for unblocker and leaf jobs to achieve O(1) performance.
| fn pop_next_clone_job(&mut self) -> Option<ReadyClone> { | ||
| let unblocker_index = if self.blocked_by_parent.is_empty() { | ||
| None | ||
| } else { | ||
| self.clone_queue | ||
| .iter() | ||
| .position(|job| self.blocked_by_parent.contains_key(&job.spec.target)) | ||
| }; | ||
| match unblocker_index { | ||
| Some(index) => self.clone_queue.remove(index), | ||
| None => self.clone_queue.pop_front(), | ||
| } | ||
| } |
There was a problem hiding this comment.
The current implementation of pop_next_clone_job iterates through the clone_queue to find a priority job. This is an O(N) operation, and VecDeque::remove() is also O(N) on average. This could introduce a performance bottleneck if the queue becomes large.
For a more efficient implementation, consider using two separate queues: one for high-priority "unblocker" jobs and one for regular "leaf" jobs.
-
In
SchedulerState, replaceclone_queue: VecDeque<ReadyClone>with:unblocker_clone_queue: VecDeque<ReadyClone>, leaf_clone_queue: VecDeque<ReadyClone>,
-
When a job is ready to be cloned (e.g., in
handle_doneforSeededCacheand incomplete_download), classify it and push it to the appropriate queue:let ready_clone = ReadyClone { spec, cache }; if self.blocked_by_parent.contains_key(&ready_clone.spec.target) { self.unblocker_clone_queue.push_back(ready_clone); } else { self.leaf_clone_queue.push_back(ready_clone); }
-
Then,
pop_next_clone_jobcan be simplified to an O(1) operation. It could also return whether the job is an unblocker to avoid a second lookup intake_clone_batch:fn pop_next_clone_job(&mut self) -> Option<(ReadyClone, bool)> { if let Some(job) = self.unblocker_clone_queue.pop_front() { return Some((job, true)); } if let Some(job) = self.leaf_clone_queue.pop_front() { return Some((job, false)); } None }
This would require adjusting
take_clone_batchto handle the new return type.
This approach would eliminate the linear scan and make the scheduling logic more efficient, which could be important for the performance experiment you are running.
📊 pm-bench-phases ·
|
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 8.99s | 0.25s | 10.36s | 10.29s | 753M | 337.1K |
| utoo-next | 8.95s | 0.88s | 10.47s | 12.57s | 1007M | 125.2K |
| utoo-npm | 7.92s | 0.06s | 10.51s | 12.07s | 980M | 117.5K |
| utoo | 8.52s | 1.65s | 11.20s | 11.20s | 924M | 143.5K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.5K | 18.4K | 1.19G | 6M | 1.86G | 1.75G | 1M |
| utoo-next | 141.1K | 98.9K | 1.16G | 5M | 1.71G | 1.70G | 2M |
| utoo-npm | 118.3K | 79.6K | 1.16G | 5M | 1.71G | 1.70G | 2M |
| utoo | 81.1K | 53.9K | 1.16G | 7M | 1.71G | 1.70G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 1.88s | 0.03s | 4.03s | 1.04s | 507M | 163.7K |
| utoo-next | 3.00s | 0.01s | 5.37s | 2.09s | 612M | 88.7K |
| utoo-npm | 2.99s | 0.01s | 5.53s | 2.03s | 604M | 78.3K |
| utoo | 2.37s | 0.03s | 5.95s | 1.63s | 644M | 118.4K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 8.2K | 4.6K | 203M | 3M | 107M | - | 1M |
| utoo-next | 70.8K | 87.2K | 200M | 2M | 7M | 3M | 2M |
| utoo-npm | 71.0K | 91.1K | 200M | 2M | 7M | 3M | 2M |
| utoo | 14.0K | 17.0K | 202M | 3M | 7M | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.90s | 0.30s | 6.28s | 9.95s | 609M | 203.8K |
| utoo-next | 7.22s | 2.10s | 4.95s | 10.93s | 501M | 63.8K |
| utoo-npm | 6.65s | 1.45s | 5.08s | 11.12s | 501M | 65.0K |
| utoo | 5.12s | 0.18s | 4.88s | 9.44s | 517M | 58.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.1K | 7.0K | 1018M | 3M | 1.76G | 1.76G | 1M |
| utoo-next | 110.2K | 48.4K | 989M | 3M | 1.70G | 1.70G | 2M |
| utoo-npm | 115.3K | 53.0K | 989M | 3M | 1.70G | 1.70G | 2M |
| utoo | 50.3K | 33.1K | 989M | 2M | 1.70G | 1.70G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.28s | 0.03s | 0.21s | 2.35s | 134M | 32.9K |
| utoo-next | 2.33s | 0.12s | 0.48s | 3.76s | 78M | 18.4K |
| utoo-npm | 2.26s | 0.01s | 0.49s | 3.79s | 80M | 18.7K |
| utoo | 2.08s | 0.04s | 0.33s | 3.30s | 50M | 11.1K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 320 | 29 | 5M | 33K | 1.91G | 1.75G | 1M |
| utoo-next | 40.0K | 18.6K | 2K | 8K | 1.70G | 1.70G | 2M |
| utoo-npm | 44.5K | 20.2K | 1K | 5K | 1.70G | 1.70G | 2M |
| utoo | 7.7K | 4.6K | 3K | 24K | 1.71G | 1.70G | 2M |
npmmirror.com: no output captured.
Summary
This is a p3/p4 scheduler experiment on top of #2989.
#2989 batches clone completions to reduce scheduler wakeups, but that can delay parent package completion inside a batch. A parent clone completion is on the materialize critical path because it releases
blocked_by_parentchildren. This PR keeps ordinary leaf clones batched, while prioritizing ready parent unblockers and sending them as a single-job batch.Hypothesis
Validation
cargo fmtcargo test -p utoo-pm install_schedulercargo clippy -p utoo-pm --all-targets -- -D warnings --no-depsBench Result
GHA Linux npmjs run
26093544293:Compared with #2989 latest npmjs reference (
p3 5.29s, 53.0K/32.9K;p4 2.23s, 7.8K/4.5K), this is directionally positive. p4 is the strongest signal because this PR directly changes warm materialize scheduling.Conclusion: keep as a candidate; the hypothesis that batched parent clone completions delay nested placement is supported.