locks: allow locks to work under high contention #27846
Conversation
Accidentally posted this early, sorry for the noise; some unit tests are still failing.
This is a bug found by Harshitha Menon. This solution is a bit different from what we discussed. Based on my testing, the `lock=None` line shouldn't be a release; it should be `return (lock_type, None)` to inform the caller that it couldn't get the requested lock type without disturbing the existing lock object in the database. There were also a couple of other bugs: write locks were taken at the beginning without any checking or release, and read locks were not released before requeueing. This version no longer gives me read-to-write upgrade errors, even running 200 instances on one box.

* Change the lock in `check_deps_status` to read, releasing it if the package is not installed. I'm not sure why this was ever write; read is definitely more appropriate here, and the read lock is only held beyond the scope if the package is installed.
* Release the read lock before requeueing to reduce the chance of livelock. The timeout that caused the original issue now happens in roughly 3 of 200 workers instead of 199 on average.
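To illustrate the first fix, here is a minimal sketch (not Spack's actual classes; the names `Instance`, `ensure_locked`, and `_file_lock` are hypothetical) of reporting a failed acquisition back to the caller with `(lock_type, None)` instead of releasing or replacing the lock entry already on record:

```python
import threading

# Toy stand-in for per-package file locks shared across spack instances.
_file_locks = {}  # pkg_id -> threading.Lock


def _file_lock(pkg_id):
    return _file_locks.setdefault(pkg_id, threading.Lock())


class Instance:
    """One simulated spack instance's lock bookkeeping (hypothetical names)."""

    def __init__(self):
        self.locks = {}  # pkg_id -> (lock_type, lock)

    def ensure_locked(self, lock_type, pkg_id, timeout=0.05):
        lock = _file_lock(pkg_id)
        if not lock.acquire(timeout=timeout):
            # The fix described above: tell the caller the requested lock
            # type was unavailable, WITHOUT releasing or overwriting any
            # existing lock object for this package.
            return (lock_type, None)
        self.locks[pkg_id] = (lock_type, lock)
        return (lock_type, lock)
```

A second instance contending for the same package then gets `("write", None)` back and can requeue, while the first instance's lock entry is untouched.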
Ok, as far as I can tell, this is now ready for review. Codecov does not like it, but when tested with a clean spack I get a clean unit-test run, and the same with actions. Tested on ruby and a local box with up to 300 and 200 spack instances respectively, with no issues.
So I still have reservations about the changes at lines 800 and 819-820 based on the original design. Are they really required?
As I recall, we required a `write` lock to ensure shared spack instances -- or multi-tasking users -- cannot uninstall a dependency while the software is being installed.

Is this change -- from `write` to `read` -- actually necessary? If it is, then you need to change the comment at lines 797-799.
I can certainly change the comment. You mentioned your reservations earlier, and I understand and certainly don't want to break that case. What I don't understand is how the difference between a read and a write lock can matter for that case. If uninstall can remove a package while a read lock is held on it, but not while a write lock is held, that sounds like a separate and significant bug. I would be happy to ensure that uninstall honors read locks.

Part of my reasoning is that I'm having trouble coming up with a way those locks could ever have produced a parallel install without other errors or timeouts to clear them. The write locks were never intentionally released, and if they were taken they would all have been held by one instance. In the best case that would serialize everything on that one instance, unless an error, a timeout, or similar caused it to back off and release the lock somewhere else in the process. The same is true of the original design diagram: it has a designed-in deadlock unless a timeout on upgrading a read lock to a write lock results in the read lock being released. This change removes both of these cases by eagerly releasing the read locks when they aren't protecting anything.
Ok, I have updated the comment to match the new behavior. Additionally, spelunking through the uninstall code shows that it does in fact take a write lock before performing an uninstall, so a read lock is sufficient to protect a package from uninstallation.

On the package-only case: as far as I can tell, it has never been safe to execute in parallel. I can't imagine why, if you want to install a set of packages, you would say package-only. Either way, the updated version removes this error, and correctly either provides a missing-dependencies error or builds the package.
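The reasoning above rests on standard reader-writer lock semantics: because uninstall acquires the write lock, it cannot proceed while any dependent holds a read lock. Here is a minimal sketch (a toy `RWLock`, not Spack's actual lock implementation) demonstrating that property:

```python
import threading
import time

class RWLock:
    """Toy reader-writer lock: writers are excluded while readers are held."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_write(self, timeout=None):
        """Return True if the write lock was acquired, False on timeout."""
        with self._cond:
            deadline = None if timeout is None else time.monotonic() + timeout
            while self._writer or self._readers:
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return False  # a reader (dependent install) blocks us
                self._cond.wait(remaining)
            self._writer = True
            return True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

With a read lock held on a dependency, a simulated uninstall's `acquire_write` times out; once the reader releases, it succeeds.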
@trws @tldahlgren FYI, this is planned to be backported to v0.17.1, which should be released at the end of next week. When the PR gets merged, can you add an entry for it in the description of #27261?
Sure, assuming it gets approved and merged, I'll be happy to add an entry for it.
I was hoping @tgamblin @becker33 @scheibelp would comment since they were instrumental in the decision to use a write lock at that point.
But I will not stand in the way of this PR.
@tgamblin, any chance you could look this over?
Reading the code, this seems like a sensible change. Dependencies should already be installed when we lock them, and since they shouldn't be modified by the dependent, a read lock seems good. As far as I understand, this code has also been battle-tested by @trws, so LGTM.
It's also been battle-tested by Dinos and Harshitha's huge, super-wide build suite. We actually found a hash-collision bug because we got a lock upgrade error after this patch; with this change, locks are never upgraded unless there's a conflict. Come to think of it, I could put up a PR to print a better error for that. Thanks @alalazo!
* locks: allow locks to work under high contention

  This is a bug found by Harshitha Menon. The `lock=None` line shouldn't be a release but should be

  ```
  return (lock_type, None)
  ```

  to inform the caller it couldn't get the requested lock type without disturbing the existing lock object in the database. There were also a couple of bugs due to taking write locks at the beginning without any checking or release, and not releasing read locks before requeueing. This version no longer gives me read-to-write upgrade errors, even running 200 instances on one box.

* Change the lock in `check_deps_status` to read, releasing it if the package is not installed. Not sure why this was ever write, but read is definitely more appropriate here, and the read lock is only held beyond the scope if the package is installed.

* Release the read lock before requeueing to reduce the chance of livelock. The timeout that caused the original issue now happens in roughly 3 of 200 workers instead of 199 on average.
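The requeue change in the last bullet can be sketched as follows. This is a hypothetical illustration (the names `requeue_task`, `Task`, and `ReadLock` are not Spack's), showing the read lock being released before the task goes back on the queue so that other instances can make progress instead of all waiters starving each other:

```python
from collections import namedtuple

Task = namedtuple("Task", ["pkg_id"])

class ReadLock:
    """Toy read lock that just records whether it was released."""

    def __init__(self):
        self.released = False

    def release_read(self):
        self.released = True

def requeue_task(queue, task, locks):
    """Requeue a task, releasing its read lock first (anti-livelock fix).

    Holding the read lock across the requeue blocks any instance that needs
    to upgrade to a write lock on the same package; releasing it eagerly
    lets another worker proceed while this task waits to run again.
    """
    held = locks.pop(task.pkg_id, None)
    if held is not None:
        lock_type, lock = held
        if lock is not None and lock_type == "read":
            lock.release_read()
    queue.append(task)
```

Note that the lock entry is dropped from the bookkeeping dict as well, so the next attempt reacquires from a clean state rather than finding a stale handle.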
@tgamblin this was already backported to 0.17.1; removed it from 0.17.2.