perf(pm): spawn install download workers#2999
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the installation scheduler to use tokio::spawn for backgrounding package downloads instead of directly pinning them to the operations list. Feedback highlights a reliability risk where detached tasks could cause the scheduler to hang if a worker panics, as the download_active state would never be cleared. It is recommended to push the JoinHandle into the existing ops collection to ensure panics are correctly propagated and the scheduler remains robust.
| let done_tx = self.done_tx.clone(); | ||
| tokio::spawn(async move { | ||
| let result = download_bytes(&package.tarball_url) | ||
| .await | ||
| .map(DownloadOutcome::Bytes) | ||
| .map_err(|e| format!("{e:#}")); | ||
| OpDone::Download { package, result } | ||
| })); | ||
| let _ = done_tx.send(OpDone::Download { package, result }); | ||
| }); |
There was a problem hiding this comment.
While moving downloads to background workers improves scheduler performance, using tokio::spawn with a detached channel send creates a reliability risk. If a download worker panics, the download_active entry will never be cleared, and the scheduler will hang during shutdown while waiting for a message that will never arrive.
Since self.ops is already used to track asynchronous operations, it is safer to push the JoinHandle into self.ops. This ensures that if a worker panics, the scheduler task will also panic (adhering to the project's policy of not recovering from panics) and prevents potential hangs. Polling a JoinHandle is extremely efficient and will not cause the scheduler interference issues described in the PR.
let handle = tokio::spawn(async move {
let result = download_bytes(&package.tarball_url)
.await
.map(DownloadOutcome::Bytes)
.map_err(|e| format!("{e:#}"));
OpDone::Download { package, result }
});
self.ops.push(Box::pin(async move {
handle.await.expect("Download worker panicked")
}));References
- Do not implement recovery logic for panics. Panics should be treated as unrecoverable bugs that need to be fixed, not as transient, recoverable errors.
|
Closing this AB: GHA did not show the target p3 ctx/wall improvement. The Windows e2e failure is the known npmjs 406 flake, but the perf signal is not good enough to keep. |
📊 pm-bench-phases ·
|
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 9.33s | 0.39s | 10.43s | 10.38s | 704M | 331.6K |
| utoo-next | 8.92s | 1.31s | 10.83s | 12.82s | 992M | 128.2K |
| utoo-npm | 9.04s | 1.52s | 10.65s | 12.38s | 961M | 125.6K |
| utoo | 9.90s | 4.26s | 11.28s | 11.35s | 905M | 137.2K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 14.7K | 18.0K | 1.19G | 6M | 1.88G | 1.76G | 1M |
| utoo-next | 137.3K | 98.7K | 1.16G | 5M | 1.71G | 1.70G | 2M |
| utoo-npm | 120.2K | 84.4K | 1.16G | 5M | 1.71G | 1.70G | 2M |
| utoo | 86.5K | 50.8K | 1.16G | 6M | 1.71G | 1.70G | 2M |
p1_resolve
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 2.06s | 0.02s | 4.02s | 1.04s | 504M | 171.4K |
| utoo-next | 3.00s | 0.04s | 5.45s | 2.06s | 616M | 82.3K |
| utoo-npm | 3.09s | 0.05s | 5.48s | 2.05s | 623M | 76.7K |
| utoo | 2.35s | 0.09s | 5.96s | 1.62s | 645M | 115.8K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 9.4K | 4.2K | 203M | 3M | 108M | - | 1M |
| utoo-next | 72.4K | 88.4K | 200M | 2M | 7M | 3M | 2M |
| utoo-npm | 72.1K | 91.0K | 200M | 2M | 7M | 3M | 2M |
| utoo | 13.9K | 19.7K | 202M | 3M | 7M | 3M | 2M |
p3_cold_install
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 6.80s | 0.12s | 6.32s | 9.98s | 600M | 205.7K |
| utoo-next | 6.09s | 0.09s | 4.97s | 10.95s | 489M | 59.6K |
| utoo-npm | 7.95s | 3.46s | 5.15s | 11.41s | 525M | 61.0K |
| utoo | 5.04s | 0.11s | 4.89s | 9.73s | 527M | 57.9K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 4.4K | 6.8K | 1018M | 4M | 1.76G | 1.76G | 1M |
| utoo-next | 99.0K | 51.2K | 989M | 3M | 1.70G | 1.70G | 2M |
| utoo-npm | 123.5K | 48.2K | 990M | 3M | 1.70G | 1.70G | 2M |
| utoo | 59.2K | 38.7K | 989M | 2M | 1.70G | 1.70G | 2M |
p4_warm_link
| PM | wall | ±σ | user | sys | RSS | pgMinor |
|---|---|---|---|---|---|---|
| bun | 3.48s | 0.06s | 0.20s | 2.35s | 133M | 32.5K |
| utoo-next | 2.21s | 0.07s | 0.49s | 3.78s | 80M | 18.9K |
| utoo-npm | 2.15s | 0.03s | 0.49s | 3.75s | 80M | 18.9K |
| utoo | 2.05s | 0.06s | 0.34s | 3.25s | 50M | 11.3K |
| PM | vCtx | iCtx | netRX | netTX | cache | node_mod | lock |
|---|---|---|---|---|---|---|---|
| bun | 371 | 25 | 5M | 18K | 1.91G | 1.75G | 1M |
| utoo-next | 43.1K | 19.8K | 6K | 10K | 1.70G | 1.70G | 2M |
| utoo-npm | 43.9K | 20.4K | 5K | 7K | 1.70G | 1.70G | 2M |
| utoo | 7.3K | 4.2K | 6K | 6K | 1.71G | 1.70G | 2M |
npmmirror.com: no output captured.
Summary
FuturesUnorderedand into short-lived tokio workers.OpDone::Downloadpath.Why
p3 cold install is dominated by registry tarball downloads plus extract/cache writes. Polling thousands of download futures directly inside the scheduler can wake the scheduler on network readiness even though the scheduler only needs the final downloaded bytes. This AB checks whether keeping network polling in worker tasks reduces scheduler interference and context-switch pressure.
Validation
cargo fmtcargo test -p utoo-pm install_schedulercargo clippy -p utoo-pm --all-targets -- -D warnings --no-depsBenchmark
GHA Linux npmjs run
26114547814:Windows e2e failed on npmjs
HTTP 406 Not Acceptablewhile resolving@isaacs/cliui, matching known registry flake behavior and not specific to this install scheduler change.Conclusion
Do not pick this into #2989. The intended signal was lower p3 scheduler/network context pressure, but the run shows higher p3 vCtx than #2989 and no wall improvement. p4 was also globally noisy, but the experiment still does not produce the target signal.