WIP, ENH: Linux simultaneous install locking #13217
Conversation
* Draft a new regression test for handling simultaneous installs in the same Spack instance; this is currently a Linux-only test.
* The test, `test_simultaneous_installs()`, is currently rather ugly, running through `subprocess`, but for now is at least reproducibly failing locally.
* Add locking to `spack install` on Linux; this allows the regression test to pass and appears to work well locally in practical testing.
* flake8 cleanups.
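In sketch form, the kind of advisory file lock a `spack install` entry point might take on Linux could look like the following (illustrative only; `run_with_install_lock` is a hypothetical helper and `fcntl.flock` an assumed mechanism, not necessarily what this PR uses):

```python
import fcntl
import os


def run_with_install_lock(lock_path, install_fn):
    """Serialize concurrent installs: block until no other process holds
    the advisory lock on lock_path, then run install_fn."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until an exclusive lock is acquired; a second process
        # running the same command waits here instead of racing.
        fcntl.flock(fd, fcntl.LOCK_EX)
        return install_fn()
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Because the lock is advisory and process-scoped, it only coordinates invocations that agree on the same lock file path (e.g., one inside the Spack instance's own directory).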
CI fail is just
ping @becker33
Just reading the description I think this might be in direct conflict with a feature on the 0.15 roadmap 😟 The WIP is in #13100 and should ensure that multiple invocations of the same Spack instance coordinate via filesystem locks. Let me know if you think I am wrong or misunderstood the scope of the PR.
Spack already takes write locks on the package prefix and read locks on all of its dependencies before building, as well as appropriate locking of internal databases and file caches. If you're seeing bugs in the locking code I would be more inclined to work on debugging that than to add another layer of locking, unless you can make an argument that you're locking something of a fundamentally different type here. As far as bugs in the locking mechanism go, hopefully they will be addressed as part of @tldahlgren's current work on locking. Can you give some more details about what hasn't been working for you that this is designed to solve?
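The write-lock-on-prefix / read-locks-on-dependencies pattern described in this comment can be sketched as follows (my own illustration, with hypothetical names, of shared vs. exclusive locking; not Spack's actual lock API):

```python
import fcntl
import os


def acquire(lock_path, exclusive):
    """Open (creating if needed) and lock the given lock file,
    exclusively for writers or shared for readers."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
    return fd


def lock_for_build(prefix_lock, dependency_locks):
    """Exclusive (write) lock on the package being built, shared (read)
    locks on each of its dependencies, so concurrent readers of a dep
    coexist while writers to a prefix exclude each other."""
    fds = [acquire(prefix_lock, exclusive=True)]
    fds.extend(acquire(dep, exclusive=False) for dep in dependency_locks)
    return fds


def release(fds):
    for fd in fds:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The key property is that many processes may hold the shared lock on a dependency simultaneously (to verify it is installed) while exactly one may hold the exclusive lock on a prefix being built.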
The bug and target of the fix are pretty clear here--there's a minimal working reproducer built right into the regression test I've added. Perhaps you can make more specific suggestions as to what needs to be changed. It seems to me that I've identified a clear bug and written a test for it along with the fix. Add the regression test to the competing pull request if you want to prove that it resolves this issue.
Fixes #9394. Closes #13217.

## Background

Spack provides the ability to enable/disable parallel builds through two options: the package `parallel` property and the `build_jobs` configuration setting. This PR changes the algorithm to allow multiple, simultaneous processes to coordinate the installation of the same spec (and specs with overlapping dependencies). The `parallel` (boolean) property sets the default for its package, though the value can be overridden in the `install` method. Spack's current parallel builds are limited to build tools supporting `jobs` arguments (e.g., Makefiles). The number of jobs actually used is calculated as `min(config:build_jobs, # cores, 16)`, which can be overridden in the package or on the command line (i.e., `spack install -j <# jobs>`).

This PR adds support for distributed (single- and multi-node) parallel builds. The goals of this work include improving the efficiency of installing packages with many dependencies and reducing the repetition associated with concurrent installations of (dependency) packages.

## Approach

### File System Locks

Coordination between concurrent installs of overlapping packages to a Spack instance is accomplished through bottom-up dependency DAG processing and file system locks. The runs can be a combination of interactive and batch processes affecting the same file system. Exclusive prefix locks are required to install a package, while shared prefix locks are required to check whether the package is installed. Failures are communicated through a separate exclusive prefix failure lock (for concurrent processes) combined with a persistent store (for separate, related build processes). The resulting file contains the failing spec to facilitate manual debugging.

### Priority Queue

Management of dependency builds changed from reliance on recursion to use of a priority queue, where the priority of a spec is based on the number of its remaining uninstalled dependencies.
Using a queue required a change to dependency build exception handling, with the most visible issue being that the `install` method *must* install something in the prefix. Consequently, packages can no longer get away with an `install` method consisting of `pass`, for example.

## Caveats

- This still only parallelizes a single-rooted build. Multi-rooted installs (e.g., for environments) are TBD in a future PR.

Tasks:

- [x] Adjust package lock timeout to correspond to value used in the demo
- [x] Adjust database lock timeout to reduce contention on startup of concurrent `spack install <spec>` calls
- [x] Replace (test) package's `install: pass` methods with file creation since post-install `sanity_check_prefix` will otherwise error out with `Install failed .. Nothing was installed!`
- [x] Resolve remaining existing test failures
- [x] Respond to alalazo's initial feedback
- [x] Remove `bin/demo-locks.py`
- [x] Add new tests to address new coverage issues
- [x] Replace built-in package's `def install(..): pass` to "install" something (i.e., only `apple-libunwind`)
- [x] Increase code coverage
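One plausible reading of the priority-queue scheme described above, sketched as standalone code (my own illustration; `build_order` and its plain-dict input are hypothetical, not the PR's implementation):

```python
import heapq


def build_order(dependencies):
    """dependencies: dict mapping each spec name to the set of specs it
    depends on.  Returns specs in an installable order, always popping a
    spec with the fewest remaining uninstalled dependencies."""
    counts = {spec: len(deps) for spec, deps in dependencies.items()}
    dependents = {spec: [] for spec in dependencies}
    for spec, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(spec)

    heap = [(count, spec) for spec, count in counts.items()]
    heapq.heapify(heap)
    done, order = set(), []
    while heap:
        count, spec = heapq.heappop(heap)
        if spec in done or count != counts[spec]:
            continue  # stale entry superseded by a later decrement
        if count > 0:
            break  # nothing is ready: there must be a dependency cycle
        done.add(spec)
        order.append(spec)
        for dependent in dependents[spec]:
            counts[dependent] -= 1  # one fewer uninstalled dependency
            heapq.heappush(heap, (counts[dependent], dependent))
    return order
```

Specs with zero uninstalled dependencies (the DAG leaves) surface first, and each completed install lowers the priority count of its dependents, which matches the bottom-up DAG processing described in the Approach section.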
Linux-only for now. Running `spack install package` multiple times simultaneously with the same Spack instance can be a problem (leading to errors in certain CI testing situations, for example). The regression test currently runs through `subprocess` shell usage---there may be better ways to use your mocking test infrastructure here? Here's a simple example of this working for me locally on a Linux box, along with output:
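For illustration, the shape of a `subprocess`-driven simultaneous-install check might look like this (a hedged sketch, not the PR's actual test; `simultaneous_installs` and the example spec name are hypothetical):

```python
import subprocess


def simultaneous_installs(cmd, n=2):
    """Launch the same command n times concurrently and return the list
    of exit codes once all processes have finished."""
    procs = [subprocess.Popen(cmd) for _ in range(n)]
    return [p.wait() for p in procs]


# Example usage (hypothetical spec name; requires a Spack install):
#   codes = simultaneous_installs(["spack", "install", "zlib"])
#   assert all(code == 0 for code in codes)
```

Without install locking, one of the concurrent invocations would typically fail while racing on the shared prefix or database; with locking, both exit cleanly.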
cc @junghans