Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repair: Release permit earlier when the repair_reader is done #14677

Closed
wants to merge 1 commit into from

Conversation

asias
Copy link
Contributor

@asias asias commented Jul 13, 2023

Consider

  • 10 repair instances take all the 10 _streaming_concurrency_sem

  • repair readers are done but the permits are not released since they are waiting for view update _registration_sem

  • view updates trying to take the _streaming_concurrency_sem to make progress of view update so it could release _registration_sem, but it could not take _streaming_concurrency_sem since the 10 repair instances have taken them

  • dead lock happens

To fix, release the permits as soon as the repair readers are done even if the repair job is waiting for _registration_sem.

Fixes #14676

@asias asias requested review from tgrabiec and nyh as code owners July 13, 2023 04:44
Copy link
Contributor

@denesb denesb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should further clarify that this is caused by an edge-case of having switched to empty reader, see #14676 (comment).
In general, the evictable reader used by repair should take care of this deadlock, but we defeated that mechanism by replacing EOF-d evictable readers with empty ones.

@scylladb-promoter
Copy link
Contributor

Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes scylladb#14676
@asias
Copy link
Contributor Author

asias commented Jul 13, 2023

I think we should further clarify that this is caused by an edge-case of having switched to empty reader, see #14676 (comment). In general, the evictable reader used by repair should take care of this deadlock, but we defeated that mechanism by replacing EOF-d evictable readers with empty ones.

Sure. I updated the commit log.

@dorlaor
Copy link
Contributor

dorlaor commented Jul 13, 2023 via email

@asias
Copy link
Contributor Author

asias commented Jul 13, 2023

Is there a way (not in this fix, but later), to add a watchdog to identify such deadlocks and automatically release pending permits if there is no progress?

We can start with adding a big timeout (like we do for the repair reader to read fragment, timeout in 30minutes). This way we know something wrong with the lock. It is too hard to fix the deadlock just with a watchdog.

However, I have an idea of getting rid of view registration lock completely, so we would not play with two locks to avoid deadlock.

@asias
Copy link
Contributor Author

asias commented Jul 13, 2023

In addition, Botond has the idea of getting rid of the _streaming_concurrency_sem lock.

@denesb
Copy link
Contributor

denesb commented Jul 13, 2023

In addition, Botond has the idea of getting rid of the _streaming_concurrency_sem lock.

I want to get rid of the (currently very limited) count resources, which is only 10 for streaming. Note that deadlocks can still happen based on memory resources, but that is much-much less likely, because these reads don't use that much memory.

@denesb
Copy link
Contributor

denesb commented Jul 13, 2023

Is there a way (not in this fix, but later), to add a watchdog to identify such deadlocks and automatically release pending permits if there is no progress?

I think the best we could do with such a watchdog is to forcibly break the cycle by injecting an exception. This can already be achieved by timeouts, which we do use.
Knowing which permit to release would require intricate understanding the purpose and lifecycle of each reader involved. The amount of heuristics this would require, would be just as error prone, if not more, as the current code.

@scylladb-promoter
Copy link
Contributor

denesb added a commit that referenced this pull request Jul 14, 2023
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0)
denesb added a commit that referenced this pull request Jul 14, 2023
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0)
denesb added a commit that referenced this pull request Jul 14, 2023
Consider

- 10 repair instances take all the 10 _streaming_concurrency_sem

- repair readers are done but the permits are not released since they
  are waiting for view update _registration_sem

- view updates trying to take the _streaming_concurrency_sem to make
  progress of view update so it could release _registration_sem, but it
  could not take _streaming_concurrency_sem since the 10 repair
  instances have taken them

- deadlock happens

Note, when the readers are done, i.e., reaching EOS, the repair reader
replaces the underlying (evictable) reader with an empty reader. The
empty reader is not evictable, so the resources cannot be forcibly
released.

To fix, release the permits manually as soon as the repair readers are
done even if the repair job is waiting for _registration_sem.

Fixes #14676

Closes #14677

(cherry picked from commit 1b577e0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

deadlock caused by view update _registration_sem and streaming reader _streaming_concurrency_sem
5 participants