-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
repair: Release permit earlier when the repair_reader is done #14677
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should further clarify that this is caused by an edge-case of having switched to empty reader, see #14676 (comment).
In general, the evictable reader used by repair should take care of this deadlock, but we defeated that mechanism by replacing EOF-d evictable readers with empty ones.
CI state |
Consider - 10 repair instances take all the 10 _streaming_concurrency_sem - repair readers are done but the permits are not released since they are waiting for view update _registration_sem - view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them - deadlock happens Note, when the readers are done, i.e., reaching EOS, the repair reader replaces the underlying (evictable) reader with an empty reader. The empty reader is not evictable, so the resources cannot be forcibly released. To fix, release the permits manually as soon as the repair readers are done even if the repair job is waiting for _registration_sem. Fixes scylladb#14676
b0e247c
to
a226493
Compare
Sure. I updated the commit log. |
Is there a way (not in this fix, but later), to add a watchdog to identify
such deadlocks and automatically release pending permits if there is
no progress?
…On Thu, Jul 13, 2023 at 9:28 AM Asias He ***@***.***> wrote:
I think we should further clarify that this is caused by an edge-case of
having switched to empty reader, see #14676 (comment)
<#14676 (comment)>.
In general, the evictable reader used by repair should take care of this
deadlock, but we defeated that mechanism by replacing EOF-d evictable
readers with empty ones.
Sure. I updated the commit log.
—
Reply to this email directly, view it on GitHub
<#14677 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AANHURKRYZ4WXRMUH3ASU7TXP6IRLANCNFSM6AAAAAA2ILYBAE>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
We can start with adding a big timeout (like we do for the repair reader to read fragment, timeout in 30minutes). This way we know something wrong with the lock. It is too hard to fix the deadlock just with a watchdog. However, I have an idea of getting rid of view registration lock completely, so we would not play with two locks to avoid deadlock. |
In addition, Botond has the idea of getting rid of the _streaming_concurrency_sem lock. |
I want to get rid of the (currently very limited) count resources, which is only 10 for streaming. Note that deadlocks can still happen based on memory resources, but that is much-much less likely, because these reads don't use that much memory. |
I think the best we could do with such a watchdog is to forcibly break the cycle by injecting an exception. This can already be achieved by timeouts, which we do use. |
CI state |
Consider - 10 repair instances take all the 10 _streaming_concurrency_sem - repair readers are done but the permits are not released since they are waiting for view update _registration_sem - view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them - deadlock happens Note, when the readers are done, i.e., reaching EOS, the repair reader replaces the underlying (evictable) reader with an empty reader. The empty reader is not evictable, so the resources cannot be forcibly released. To fix, release the permits manually as soon as the repair readers are done even if the repair job is waiting for _registration_sem. Fixes #14676 Closes #14677 (cherry picked from commit 1b577e0)
Consider - 10 repair instances take all the 10 _streaming_concurrency_sem - repair readers are done but the permits are not released since they are waiting for view update _registration_sem - view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them - deadlock happens Note, when the readers are done, i.e., reaching EOS, the repair reader replaces the underlying (evictable) reader with an empty reader. The empty reader is not evictable, so the resources cannot be forcibly released. To fix, release the permits manually as soon as the repair readers are done even if the repair job is waiting for _registration_sem. Fixes #14676 Closes #14677 (cherry picked from commit 1b577e0)
Consider - 10 repair instances take all the 10 _streaming_concurrency_sem - repair readers are done but the permits are not released since they are waiting for view update _registration_sem - view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them - deadlock happens Note, when the readers are done, i.e., reaching EOS, the repair reader replaces the underlying (evictable) reader with an empty reader. The empty reader is not evictable, so the resources cannot be forcibly released. To fix, release the permits manually as soon as the repair readers are done even if the repair job is waiting for _registration_sem. Fixes #14676 Closes #14677 (cherry picked from commit 1b577e0)
Consider
10 repair instances take all the 10 _streaming_concurrency_sem
repair readers are done but the permits are not released since they are waiting for view update _registration_sem
view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them
dead lock happens
To fix, release the permits as soon as the repair readers are done even if the repair job is waiting for _registration_sem.
Fixes #14676