
locks: allow locks to work under high contention #27846

Merged
merged 2 commits into develop from fix-locks on Dec 22, 2021

Conversation

@trws (Contributor) commented Dec 7, 2021

This is a bug found by @harshithamenon. This solution is a bit different
from what we discussed, so it needs some explanation. Based on my testing,
the `lock=None` line shouldn't be a release but should be
`return (lock_type, None)`, to inform the caller it couldn't get the lock
type requested without disturbing the existing lock object in the database.
There were also a couple of bugs due to taking write locks during the
dependency status check at the beginning without any checking or release,
and due to not releasing read locks before requeueing. This version no
longer gives me read-upgrade-to-write errors, even running >200 instances.
(A sketch of the intended locking behavior follows the list below.)

  • Change the lock in check_deps_status to a read lock, released if the package is
    not installed. It's not clear why this was ever a write lock; a read lock is more
    appropriate here, and it is only held beyond this scope if the package is installed.
  • Release the read lock before requeueing to reduce the chance of livelock; the
    timeout that caused the original issue now happens in roughly 3 of 200 workers
    instead of 199 on average.
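
To make the description concrete, here is a minimal sketch of the two behaviors,
written against a hypothetical, simplified installer. The names `InstallerSketch`,
`_ensure_locked`, `_requeue`, `LockTimeoutError`, and the lock interface are
illustrative stand-ins, not Spack's actual code: a failed acquisition returns
`(lock_type, None)` without touching the recorded lock, and a requeue releases any
held read lock first.

```python
# Illustrative sketch only -- simplified stand-ins for the installer's locking
# helpers described above; names and structure are hypothetical.
from collections import deque


class LockTimeoutError(Exception):
    """Raised when a lock cannot be acquired within the timeout."""


class InstallerSketch:
    def __init__(self, lock_timeout=1.0):
        self.lock_timeout = lock_timeout
        self.locks = {}       # pkg_id -> (lock_type, lock) currently held
        self.queue = deque()  # tasks waiting to be (re)tried

    def _ensure_locked(self, lock_type, pkg_id, lock):
        """Try to acquire lock_type ('read' or 'write') on this package's lock.

        On success, record and return (lock_type, lock).  On timeout, return
        (lock_type, None) *without* releasing or replacing whatever lock is
        already recorded for this package; the caller decides what to do next.
        """
        acquire = lock.acquire_read if lock_type == "read" else lock.acquire_write
        try:
            acquire(timeout=self.lock_timeout)
        except LockTimeoutError:
            return (lock_type, None)  # report failure, don't disturb existing lock
        self.locks[pkg_id] = (lock_type, lock)
        return (lock_type, lock)

    def _requeue(self, task):
        """Drop any read lock held for this task before requeueing (avoids livelock)."""
        held = self.locks.pop(task.pkg_id, None)
        if held is not None and held[0] == "read":
            held[1].release_read()
        self.queue.append(task)
```

The important detail is the early `return (lock_type, None)`: the caller learns the
requested lock type is unavailable while the existing lock object is left untouched.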

@trws (Contributor, Author) commented Dec 7, 2021

Accidentally posted this early, sorry for the noise, some unit tests are still failing.

This is a bug found by Harshitha Menon.  This solution is a bit
different from what we discussed.  The `lock=None` line shouldn't be a
release, based on my testing, but should be `return (lock_type, None)`
to inform the caller it couldn't get the lock type requested without
disturbing the existing lock object in the database.  There were also a
couple of bugs due to taking write locks at the beginning without any
checking or release, and not releasing read locks before requeueing.
This version no longer gives me read upgrade to write errors, even
running 200 instances on one box.

* Change lock in check_deps_status to read, release if not installed,
  not sure why this was ever write, but read definitely is more
  appropriate here, and the read lock is only held out of the scope if
  the package is installed.
* Release read lock before requeueing to reduce chance of livelock, the
  timeout that caused the original issue now happens in roughly 3 of 200
  workers instead of 199 on average.
@trws marked this pull request as ready for review December 8, 2021 00:42
@trws (Contributor, Author) commented Dec 8, 2021

OK, as far as I can tell, this is now ready for review. Codecov does not like it, but when tested with a clean Spack I seem to get a clean unit-test run, and the same with Actions. Tested on ruby and a local box with up to 300 and 200 spack instances respectively, with no issues.

@tldahlgren (Contributor) commented:

So I still have reservations about the changes at lines 800, 819-820 based on the original design. Are they really required?

@tldahlgren (Contributor) left a review comment:

As I recall, we required a write lock to ensure shared spack instances -- or multi-tasking users -- cannot uninstall a dependency while the software is being installed.

Is this change -- from write to read -- actually necessary?

IF it is, then you need to change the comment at lines 797-799.

@tgamblin @scheibelp @becker33

@trws (Contributor, Author) commented Dec 8, 2021

I can certainly change the comment.

You mentioned your reservations earlier, and I understand and certainly don't want to break that case. What I don't understand is how the difference between a read and write lock can matter for that case. If uninstall can remove a package while a read lock is held on it but not a write lock, that sounds like a separate and significant bug. I would be happy to ensure that uninstall will honor read locks honestly.

Part of my reasoning is that I'm having trouble coming up with a way those locks could have produced a parallel install without other errors or timeouts to clear them. The write locks were never intentionally released, and if they were taken they would all have been held by one instance. In the best case that would serialize everything on that one instance, unless an error, a timeout, or something similar caused it to back off and drop the lock somewhere else in the process.

That's true of the original design diagram as well: it has a designed-in deadlock unless a timeout on upgrading a read lock to a write lock results in the read lock being released. This change removes both of these cases by eagerly releasing the read locks when they aren't protecting anything.
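
To make the read-versus-write reasoning concrete: as long as `spack uninstall` must
take a write lock on a package's prefix, any concurrent read lock is enough to block
it. Below is a toy reader-writer lock illustrating that invariant (illustrative only,
not Spack's lock implementation).

```python
import threading


class PrefixLock:
    """Toy reader-writer lock: many readers OR one writer (illustrative only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers > 0:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def uninstall(prefix_lock):
    """Uninstall must take the write lock, so it blocks while any reader exists."""
    prefix_lock.acquire_write()  # waits until every dependent's read lock is released
    try:
        pass  # remove the installation prefix here
    finally:
        prefix_lock.release_write()
```

So the question above reduces to whether uninstall really takes a write lock; if it
does, a dependent holding a read lock is already protected.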

@alalazo added this to In progress in Spack v0.18.0 release via automation Dec 8, 2021
@alalazo added this to In progress in Spack v0.17.1 Release via automation Dec 8, 2021
@tgamblin moved this from In progress to Review in progress in Spack v0.17.1 Release Dec 13, 2021
@trws (Contributor, Author) commented Dec 14, 2021

Ok, I have updated the comment to match the new behavior. Additionally, spelunking through the uninstall code shows that it does in fact take a write lock before performing an uninstall, so a read lock is sufficient to protect a package from uninstallation.

On the `_check_deps_status` change: it is necessary for correctness, but that code only runs in a very specific situation, namely when spack is run in `--only package` mode and one or more packages are not in the build_list, so their status has to be checked. If you try current spack, with the write lock, with this command: `mpirun -n 2 spack -d install cmake : spack -d install llvm` (with or without the dependencies available), at least one of the two will fail with an error like this:

==> [2021-12-14-14:48:53.395812] Error: Cannot proceed with cmake-3.21.4-nsylovqdg25zryw76h6a2ljaradwmzyz: pkgconf-1.8.0-kfureok74bufpsvfi4g6f6voopbwwpds is write locked by another process

As far as I can tell, the package-only case has never been safe to execute in parallel, and I can imagine why: if you want to install a set of packages, why would you say package-only? Either way, the updated version removes this error and correctly either reports the missing dependencies or builds the package.
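
As a rough illustration of the `_check_deps_status` change (hypothetical, simplified
code reusing the `_ensure_locked` stand-in from the sketch earlier in this thread,
not the actual installer; `InstallError`, `build_list`, `is_installed`, and
`prefix_lock` are assumed names): each dependency outside the build list is checked
under a read lock, which is released immediately if the dependency is missing and
kept only when it is installed.

```python
class InstallError(Exception):
    """Raised when an install request cannot proceed (assumed name)."""


def check_deps_status(installer, request):
    """Verify that dependencies not being built by this run are already installed.

    Only a READ lock is taken on each dependency; it is released unless the
    dependency is installed, so parallel `spack install --only package` runs
    no longer collide on write locks they never needed.
    """
    for dep_id, dep in request.dependencies.items():
        if dep_id in installer.build_list:
            continue  # this run will build it; no status check needed

        lock_type, lock = installer._ensure_locked("read", dep_id, dep.prefix_lock)
        if lock is None:
            raise InstallError(f"Cannot proceed: {dep_id} is locked by another process")

        if not installer.is_installed(dep):
            lock.release_read()  # don't keep locks on packages we can't use
            raise InstallError(f"Cannot proceed: missing dependency {dep_id}")
        # Installed: keep the read lock so the dependency cannot be
        # uninstalled out from under this build.
```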

@alalazo (Member) commented Dec 15, 2021

@trws @tldahlgren FYI, this is planned to be backported to v0.17.1, which should be released by the end of next week. When the PR gets merged, can you add an entry for it in the description of #27261?

@trws (Contributor, Author) commented Dec 15, 2021

Sure, assuming it gets approved and merged I'll be happy to add an entry for it.

Spack v0.18.0 release automation moved this from In progress to Reviewer approved Dec 15, 2021
@tldahlgren (Contributor) left a review comment:

I was hoping @tgamblin @becker33 @scheibelp would comment since they were instrumental in the decision to use a write lock at that point.

But I will not stand in the way of this PR.

Spack v0.17.1 Release automation moved this from Review in progress to Reviewer approved Dec 15, 2021
@alalazo mentioned this pull request Dec 16, 2021
@trws (Contributor, Author) commented Dec 21, 2021

@tgamblin, any chance to look this over?

@alalazo (Member) left a review comment:

Reading the code, this seems a sensible change. Dependencies should already be installed when we lock them, and since they shouldn't be modified by the dependent, a read lock seems good. As far as I understand, this code is also battle-tested by @trws, so LGTM.

@alalazo merged commit b7b6542 into develop Dec 22, 2021
Spack v0.18.0 release automation moved this from Reviewer approved to Done Dec 22, 2021
@alalazo deleted the fix-locks branch December 22, 2021 15:25
Spack v0.17.1 Release automation moved this from Reviewer approved to Done Dec 22, 2021
@trws (Contributor, Author) commented Dec 22, 2021

It's also battle-tested by Dinos and Harshitha's huge, super-wide build suite. We actually found a hash-collision bug because we got a lock-upgrade error after this patch; locks are never upgraded anymore unless there's a conflict. Come to think of it, I could put up a PR to print a better error for that.

Thanks @alalazo!

alalazo pushed a commit that referenced this pull request Dec 22, 2021
* locks: allow locks to work under high contention

This is a bug found by Harshitha Menon.  

The `lock=None` line shouldn't be a release but should be 
```
return (lock_type, None)
``` 
to inform the caller it couldn't get the lock type requested without
disturbing the existing lock object in the database.  There were also a
couple of bugs due to taking write locks at the beginning without any
checking or release, and not releasing read locks before requeueing.
This version no longer gives me read upgrade to write errors, even
running 200 instances on one box.

* Change lock in check_deps_status to read, release if not installed,
  not sure why this was ever write, but read definitely is more
  appropriate here, and the read lock is only held out of the scope if
  the package is installed.

* Release read lock before requeueing to reduce chance of livelock, the
  timeout that caused the original issue now happens in roughly 3 of 200
  workers instead of 199 on average.
alalazo pushed a commit that referenced this pull request Dec 23, 2021
@tgamblin added this to In progress in Spack v0.17.2 Release via automation Mar 11, 2022
@haampie removed this from In progress in Spack v0.17.2 Release Mar 19, 2022
@haampie (Member) commented Mar 19, 2022

@tgamblin this was already backported to 0.17.1, removed it from 0.17.2.

capitalaslash pushed a commit to capitalaslash/spack that referenced this pull request Aug 30, 2022