feat(bootstrapper): run phase items in parallel when possible #1150

dhellmann wants to merge 2 commits
Conversation
Important: Review skipped (draft detected). Configuration used: `.coderabbit.yaml`; review profile: CHILL; plan: Pro.
📝 Walkthrough

This PR implements parallel execution for the iterative bootstrap loop.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/fromager/commands/bootstrap.py (1)
193-202: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a `max_workers` parameter to the `bootstrap()` function signature and pass it through to `Bootstrapper`.

`bootstrap_parallel()` forwards `max_workers` via `ctx.invoke(bootstrap, ..., max_workers=max_workers, ...)`, but the `bootstrap()` function does not declare this parameter, so the forwarded value is rejected or ignored. Additionally, the `Bootstrapper` is hardcoded with `max_workers=1` at line 200, preventing any configuration of parallel bootstrap execution. Either add `max_workers: int | None` to the `bootstrap()` signature and pass it through to the `Bootstrapper` constructor instead of hardcoding 1, or remove the forwarded kwarg from `bootstrap_parallel()`.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/fromager/commands/bootstrap.py` around lines 193 - 202, The bootstrap function is missing a max_workers parameter and always constructs Bootstrapper with max_workers=1; update the bootstrap(...) function signature to accept max_workers: int | None (default as needed) and pass that value into the Bootstrapper constructor instead of the hardcoded 1 so ctx.invoke(bootstrap, ..., max_workers=max_workers, ...) from bootstrap_parallel() is honored; ensure the parameter name matches exactly (max_workers) and propagate it through any intermediate calls or default handling in bootstrap to the Bootstrapper instantiation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e/test_bootstrap_parallel_multiple_versions.sh`:
- Around line 70-110: The current test inspects "starting batch: phase=${phase}
items=N" which is logged before work begins and therefore doesn't prove
concurrency; change the check in the phase loop to detect actual overlapping
execution by parsing per-item start/finish logs with timestamps (or
worker/thread IDs) in bootstrap.log and asserting that for at least one phase an
item start timestamp occurs before another item's finish timestamp (or two
different worker IDs run concurrently) for that phase; update the grep/sed logic
that currently looks for "starting batch: phase=${phase} items=" and instead
search for per-item "start"/"finish" log lines (or worker-id tokens) and
implement an overlap detection routine to set PHASES_WITHOUT_PARALLEL_BATCH when
no overlap is found for a phase (keep using PARALLEL_PHASES and
PHASES_WITHOUT_PARALLEL_BATCH identifiers).
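The overlap check the comment asks for could be implemented with a small helper like this (hypothetical, not part of the existing test script), given `(start, finish)` timestamps parsed from the per-item log lines:

```python
def has_concurrent_items(intervals: list[tuple[float, float]]) -> bool:
    """Return True if any two (start, finish) intervals overlap,
    i.e. some item started before another item finished."""
    latest_finish = float("-inf")
    for start, finish in sorted(intervals):
        if start < latest_finish:
            # This item started before an earlier item finished.
            return True
        latest_finish = max(latest_finish, finish)
    return False
```

A phase would set `PHASES_WITHOUT_PARALLEL_BATCH` only when this check returns `False` for its items.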
In `@src/fromager/bootstrapper.py`:
- Around line 716-720: The lock _wheel_dir_lock currently protects
update_wheel_mirror calls from _download_prebuilt() but not the identical
server.update_wheel_mirror(self.ctx) invocation inside _build_wheel(), leaving a
TOCTOU race with _find_cached_wheel; fix by acquiring the same _wheel_dir_lock
around the server.update_wheel_mirror(self.ctx) call in _build_wheel() (and any
other places that move files out of wheels_build) so that _download_prebuilt(),
_build_wheel(), and _find_cached_wheel() serialize on the same lock and
eliminate the race window.
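The shared-lock pattern described here can be sketched with a set standing in for the on-disk wheel directory (illustrative names, not fromager's actual API):

```python
import threading


class WheelMirror:
    """Simplified model: a set stands in for the wheel mirror directory."""

    def __init__(self) -> None:
        # One lock shared by every path that touches the mirror, so the
        # download, build, and cache-lookup paths all serialize on it
        # and the TOCTOU window disappears.
        self._wheel_dir_lock = threading.Lock()
        self._wheels: set[str] = set()

    def find_cached_wheel(self, name: str) -> bool:
        with self._wheel_dir_lock:
            return name in self._wheels

    def download_prebuilt(self, name: str) -> None:
        with self._wheel_dir_lock:
            self._wheels.add(name)  # update_wheel_mirror(...) in the real code

    def build_wheel(self, name: str) -> None:
        # The review's point: this path must take the *same* lock as above.
        with self._wheel_dir_lock:
            self._wheels.add(name)
```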
- Around line 457-465: The ThreadPoolExecutor block with futures and run_one
allows started workers to perform irreversible writes even if one future fails;
modify run_one to avoid side effects (return intended mutations/data instead of
writing wheels/state), collect all_new_items from futures, then perform a single
commit/apply step that writes to disk only if every future succeeded;
alternatively, if run_one cannot be made pure, detect the phase and fall back to
sequential execution (no ThreadPoolExecutor) so writes are atomic per-item or
wrap mutations behind an explicit commit function (e.g., apply_batch_changes)
invoked only after all futures complete successfully.
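The collect-then-commit pattern the comment proposes, as a minimal sketch with a pure stand-in `run_one`:

```python
from concurrent.futures import ThreadPoolExecutor


def run_one(item: str) -> list[str]:
    # Pure worker: return the would-be mutations instead of applying them.
    return [f"{item}:done"]


def run_batch(items: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(run_one, item) for item in items]
        # result() re-raises any worker exception, so the commit step
        # below is never reached unless every future succeeded.
        results = [future.result() for future in futures]
    # Single commit step: apply all collected mutations at once.
    all_new_items: list[str] = []
    for result in results:
        all_new_items.extend(result)
    return all_new_items
```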
In `@tests/test_bootstrapper.py`:
- Around line 614-662: The tests mock _dispatch_phase away so _phase_resolve
never runs (first test) and the second only exercises a single RESOLVE (no
siblings), so neither actually exercises parallel RESOLVE batching; update the
tests to let _phase_resolve run by removing or changing the _dispatch_phase
patch and instead arrange for a RESOLVE phase with multiple items (e.g., patch
_dispatch_phase to return a list containing two RESOLVE work items or call
bootstrap with two top-level Requirement objects) so bt._resolver.resolve is
invoked concurrently twice; for the test_mode failure case, likewise ensure two
sibling RESOLVE items are present so the first resolve raises (via the existing
mock_resolve) and the second succeeds, validating bt.failed_packages records the
failure while the sibling completes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2d93f827-752d-448c-a6c8-055c4f8afa79
📒 Files selected for processing (7)
- e2e/ci_bootstrap_parallel_suite.sh
- e2e/test_bootstrap_parallel_multiple_versions.sh
- src/fromager/bootstrap_requirement_resolver.py
- src/fromager/bootstrapper.py
- src/fromager/commands/bootstrap.py
- tests/test_bootstrapper.py
- tests/test_bootstrapper_iterative.py
Create the executor once per `bootstrap()` call rather than creating and destroying it on every batch iteration. Serial (non-parallelizable) phases now run `run_one()` directly in the main thread instead of going through a single-worker pool, eliminating unnecessary thread-pool overhead for those phases. Wrap the executor block in `try/finally` to guarantee the `why` stack is restored even if an exception escapes. Addresses tiran's review comment on PR python-wheel-build#1150. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
Register an `add_done_callback` on each submitted future so the progress bar updates as individual items complete rather than waiting for the entire batch. Serial phases update inline after each item. The callback holds `_state_lock` around progress bar calls for thread safety. Addresses tiran's review comment on PR python-wheel-build#1150. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
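The per-item progress pattern this commit describes can be sketched like this, with a toy counter standing in for the real progress bar:

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ProgressBar:
    """Toy stand-in for the real progress bar."""

    def __init__(self) -> None:
        self._state_lock = threading.Lock()
        self.completed = 0

    def tick(self) -> None:
        # Callbacks fire in worker threads, so hold the lock while updating.
        with self._state_lock:
            self.completed += 1


def run_items(items: list[int]) -> int:
    bar = ProgressBar()
    with ThreadPoolExecutor(max_workers=4) as executor:
        for item in items:
            future = executor.submit(pow, item, 2)
            # Update as each item finishes, not at the batch boundary.
            future.add_done_callback(lambda _f: bar.tick())
    # Executor shutdown waits for all work (and its callbacks) to finish.
    return bar.completed
```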
Make all shared state in `Bootstrapper` and `BootstrapRequirementResolver` thread-safe, then restructure the `bootstrap()` main loop to process work in uniform batches: collect consecutive same-phase items → run via `ThreadPoolExecutor` → restore results to the stack in original order. `BootstrapPhase.can_parallelize` controls whether a batch collects multiple items or just one. All work goes through the executor; batch size determines effective parallelism. Phases that install packages are marked as needing to run serially, so they are processed one at a time after all dependencies are actually built. This allows most work to happen in parallel, at least somewhat, based on the content of the work item stack.

Pass `ctx.settings.max_jobs` as `max_workers` to `ThreadPoolExecutor` so the number of parallel threads respects the `--jobs` flag. `None` (no flag given) uses the executor's default; a positive int caps the thread count.

Thread-safety changes:
- Use `contextvars.ContextVar` for the `why` dependency chain stack. `ContextVar` is safe across both threads and asyncio contexts. `ThreadPoolExecutor` copies the current context for each submitted task (Python 3.7+), so each task naturally inherits the caller's context without manual snapshot/restore overhead.
- `self._state_lock` added to `Bootstrapper`; held at all check-then-set write sites: `_phase_start` (seen dedup), `_add_to_build_order`, `_add_to_graph`, `_handle_phase_error` (multiple-versions block), `_record_test_mode_failure`, and progress bar calls in `_handle_build_requirements`.
- Add `_wheel_dir_lock` to `Bootstrapper.__init__` and hold it around the operations that can race.
- Extract `UPDATE_BUILD_SEQUENCE` as a separate phase so it can run serially, ensuring items are added in the right order.
- The `bootstrap` command always sets the worker pool size to 1, so even though tasks run in a thread they are serialized.
- Register an `add_done_callback` on each submitted future so the progress bar updates as individual items complete rather than waiting for the entire batch. Serial phases update inline after each item. The callback holds `_state_lock` around progress bar calls for thread safety.
- Add `GET_BUILD_DEPS` as a new parallel phase after the existing `PREPARE_BUILD` phase. `PREPARE_BUILD` now only installs build system deps (must be serial to avoid races between packages sharing build deps). `GET_BUILD_DEPS` queries build backend and sdist deps, which are independent per-package and safe to run concurrently.
- Split the `INSTALL_BUILD_DEPS` phase out of the `BUILD` phase to install build backend/sdist deps into the build environment; this is guaranteed safe because only one package runs it at a time, so every dep's wheel is in the mirror before installation.

Closes: python-wheel-build#1149
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
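A sketch of the `ContextVar` pattern for the `why` chain. Note this sketch snapshots the caller's context explicitly at submit time via `contextvars.copy_context()`, which makes the inheritance behavior explicit regardless of interpreter version; all names are illustrative:

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# The dependency chain: each task mutates only its own copy of the context.
why: contextvars.ContextVar[tuple[str, ...]] = contextvars.ContextVar(
    "why", default=()
)


def resolve(name: str) -> tuple[str, ...]:
    # Appending builds a new tuple in this task's context only; the
    # caller's chain and sibling tasks are unaffected.
    why.set(why.get() + (name,))
    return why.get()


def submit_with_context(executor: ThreadPoolExecutor, fn, *args):
    # Explicit snapshot: a fresh context copy per task, entered in the worker.
    snapshot = contextvars.copy_context()
    return executor.submit(snapshot.run, fn, *args)


why.set(("toplevel",))
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [submit_with_context(executor, resolve, n) for n in ("a", "b")]
    chains = [f.result() for f in futures]
```

Each worker sees the `("toplevel",)` chain it inherited, while the caller's own chain is left untouched.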
470f7dd to 45a2420 (compare)
After experimenting further, this approach isn't going to work. It's too hard to combine a stack with a topology-aware prioritization system. I keep running into issues because we're marking something seen when we start it, even though it's not built. I'm going to salvage an idea or two and experiment with other approaches.
Summary

Restructure the `bootstrap-parallel` command's main loop to run independent work items concurrently using a `ThreadPoolExecutor`.

How it works

The bootstrap loop maintains a LIFO stack of `WorkItem` objects. Each iteration collects a batch from the top of the stack and submits it to the executor. `BootstrapPhase.can_parallelize` controls batch size: parallel phases collect all consecutive same-phase items (running them concurrently); serial phases pop only the single top item (allowing other phases to interleave while still processing one at a time). All items, serial or parallel, go through the same executor.

Phase breakdown after this change
- RESOLVE
- START
- PREPARE_SOURCE
- PREPARE_BUILD
- GET_BUILD_DEPS
- INSTALL_BUILD_DEPS
- BUILD
- UPDATE_BUILD_SEQUENCE (updates `build-order.json`)
- PROCESS_INSTALL_DEPS
- COMPLETE

Thread-safety changes
- `why` dependency-chain stack converted to a `contextvars.ContextVar`; `ThreadPoolExecutor` copies the caller's context automatically for each task
- `_state_lock` guards all check-then-set write sites (`_seen_requirements`, `_build_requirements`, `_build_stack`, `failed_packages`, `_failed_versions`, the dependency graph, and the progress bar)
- `_wheel_dir_lock` guards wheel directory operations to prevent TOCTOU races between `_find_cached_wheel` and `update_wheel_mirror`
- `add_done_callback` registered on each future so items tick off as they complete rather than at batch boundaries

Testing
New e2e test `test_bootstrap_parallel_multiple_versions` bootstraps flit-core 3.0.0 through 3.12.0 with 5 workers, verifying that all buildable versions are produced and that each parallel phase logged at least one parallel batch.

Closes: #1149