locks: allow locks to work under high contention #27846
Conversation
Accidentally posted this early, sorry for the noise; some unit tests are still failing.
This is a bug found by Harshitha Menon. This solution is a bit different from what we discussed. Based on my testing, the `lock=None` line shouldn't be a release; it should be `return (lock_type, None)` to inform the caller that it couldn't get the requested lock type without disturbing the existing lock object in the database. There were also a couple of other bugs: write locks were taken at the beginning without any checking or release, and read locks were not released before requeueing. This version no longer gives me read-to-write upgrade errors, even running 200 instances on one box.

* Change the lock in `check_deps_status` to read, releasing it if the package is not installed. I'm not sure why this was ever write; read is definitely more appropriate here, and the read lock is only held beyond the scope if the package is installed.
* Release the read lock before requeueing to reduce the chance of livelock. The timeout that caused the original issue now happens in roughly 3 of 200 workers instead of 199 on average.
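To illustrate the first fix, here is a minimal sketch (not Spack's actual classes; the names `Instance`, `ensure_locked`, and `_file_lock` are hypothetical) of reporting a failed acquisition back to the caller with `(lock_type, None)` instead of releasing or replacing the lock entry already on record:

```python
import threading

# Toy stand-in for per-package file locks shared across spack instances.
_file_locks = {}  # pkg_id -> threading.Lock


def _file_lock(pkg_id):
    return _file_locks.setdefault(pkg_id, threading.Lock())


class Instance:
    """One simulated spack instance's lock bookkeeping (hypothetical names)."""

    def __init__(self):
        self.locks = {}  # pkg_id -> (lock_type, lock)

    def ensure_locked(self, lock_type, pkg_id, timeout=0.05):
        lock = _file_lock(pkg_id)
        if not lock.acquire(timeout=timeout):
            # The fix described above: tell the caller the requested lock
            # type was unavailable, WITHOUT releasing or overwriting any
            # existing lock object for this package.
            return (lock_type, None)
        self.locks[pkg_id] = (lock_type, lock)
        return (lock_type, lock)
```

A second instance contending for the same package then gets `("write", None)` back and can requeue, while the first instance's lock entry is untouched.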
Ok, as far as I can tell, this is now ready for review. Codecov does not like it, but when tested with a clean spack I get a clean unit-test run, and the same with actions. Tested on ruby and a local box with up to 300 and 200 spack instances respectively, with no issues.
So I still have reservations about the changes at lines 800 and 819-820 based on the original design. Are they really required?
As I recall, we required a `write` lock to ensure shared spack instances -- or multi-tasking users -- cannot uninstall a dependency while the software is being installed.

Is this change -- from `write` to `read` -- actually necessary? If it is, then you need to change the comment at lines 797-799.
I can certainly change the comment. You mentioned your reservations earlier, and I understand and certainly don't want to break that case. What I don't understand is how the difference between a read and a write lock can matter for that case. If uninstall can remove a package while a read lock is held on it, but not while a write lock is held, that sounds like a separate and significant bug. I would be happy to ensure that uninstall honors read locks.

Part of my reasoning is that I'm having trouble coming up with a way those locks could ever have produced a parallel install without other errors or timeouts to clear them. The write locks were never intentionally released, and if they were taken they would all have been held by one instance. In the best case that would serialize everything on that one instance, unless an error, a timeout, or similar caused it to back off and release the lock somewhere else in the process. The same is true of the original design diagram: it has a designed-in deadlock unless a timeout on upgrading a read lock to a write lock results in the read lock being released. This change removes both of these cases by eagerly releasing the read locks when they aren't protecting anything.
Ok, I have updated the comment to match the new behavior. Additionally, spelunking through the uninstall code shows that it does in fact take a write lock before performing an uninstall, so a read lock is sufficient to protect a package from uninstallation.

On the package-only case: as far as I can tell, it has never been safe to execute in parallel. I can't imagine why, if you want to install a set of packages, you would say package-only. Either way, the updated version removes this error, and correctly either provides a missing-dependencies error or builds the package.
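The reasoning above rests on standard reader-writer lock semantics: because uninstall acquires the write lock, it cannot proceed while any dependent holds a read lock. Here is a minimal sketch (a toy `RWLock`, not Spack's actual lock implementation) demonstrating that property:

```python
import threading
import time

class RWLock:
    """Toy reader-writer lock: writers are excluded while readers are held."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout=None):
        """Return True if the write lock was acquired, False on timeout."""
        with self._cond:
            deadline = None if timeout is None else time.monotonic() + timeout
            while self._writer or self._readers:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return False  # a reader (dependent install) blocks us
                self._cond.wait(remaining)
            self._writer = True
            return True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With a read lock held on a dependency, a simulated uninstall's `acquire_write` times out; once the reader releases, it succeeds.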
@trws @tldahlgren FYI, this is planned to be backported to v0.17.1, which should be released at the end of next week. When the PR gets merged, can you add an entry for it in the description of #27261?
Sure, assuming it gets approved and merged, I'll be happy to add an entry for it.
I was hoping @tgamblin @becker33 @scheibelp would comment since they were instrumental in the decision to use a write lock at that point.
But I will not stand in the way of this PR.
@tgamblin, any chance you could look this over?
Reading the code, this seems like a sensible change. Dependencies should already be installed when we lock them, and since they shouldn't be modified by the dependent, a read lock seems good. As far as I understand, this code has also been battle-tested by @trws, so LGTM.
It's also been battle-tested by Dinos and Harshitha's huge, super-wide build suite. We actually found a hash-collision bug because we got a lock upgrade error after this patch; with this change, locks are never upgraded unless there's a conflict. Come to think of it, I could put up a PR to print a better error for that. Thanks @alalazo!
* locks: allow locks to work under high contention

  This is a bug found by Harshitha Menon. The `lock=None` line shouldn't be a release but should be

  ```
  return (lock_type, None)
  ```

  to inform the caller it couldn't get the requested lock type without disturbing the existing lock object in the database. There were also a couple of bugs due to taking write locks at the beginning without any checking or release, and not releasing read locks before requeueing. This version no longer gives me read-to-write upgrade errors, even running 200 instances on one box.

* Change the lock in `check_deps_status` to read, releasing it if the package is not installed. Not sure why this was ever write, but read is definitely more appropriate here, and the read lock is only held beyond the scope if the package is installed.

* Release the read lock before requeueing to reduce the chance of livelock. The timeout that caused the original issue now happens in roughly 3 of 200 workers instead of 199 on average.
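The requeue change in the last bullet can be sketched as follows. This is a hypothetical illustration (the names `requeue_task`, `Task`, and `ReadLock` are not Spack's), showing the read lock being released before the task goes back on the queue so that other instances can make progress instead of all waiters starving each other:

```python
from collections import namedtuple

Task = namedtuple("Task", ["pkg_id"])

class ReadLock:
    """Toy read lock that just records whether it was released."""

    def __init__(self):
        self.released = False

    def release_read(self):
        self.released = True

def requeue_task(queue, task, locks):
    """Requeue a task, releasing its read lock first (anti-livelock fix).

    Holding the read lock across the requeue blocks any instance that needs
    to upgrade to a write lock on the same package; releasing it eagerly
    lets another worker proceed while this task waits to run again.
    """
    held = locks.pop(task.pkg_id, None)
    if held is not None:
        lock_type, lock = held
        if lock is not None and lock_type == "read":
            lock.release_read()
    queue.append(task)
```

Note that the lock entry is dropped from the bookkeeping dict as well, so the next attempt reacquires from a clean state rather than finding a stale handle.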
@tgamblin this was already backported to 0.17.1; removed it from 0.17.2.