Better handle runner concurrency #1197
Is it possible that this prevents Reaper from running parallel repairs on different clusters? It looks like the old code checked existing runs, filtering them by cluster name, but now there is no filtering.
Yes, that's the case. We need to address this so that the limit applies per cluster.
I created an issue to track this.
After thelastpickle#1197, active repairs in one cluster prevent repairs in others. This diff makes repairs in different clusters independent again.
Concurrent runners are currently limited by counting the runners in the local Reaper instance. Paused runs still hold a runner thread, which is counted as well, so paused repairs can block other repairs.
Another problem is that different Reaper instances could start runs in a different order, so which runs are authorized depends on the instance.
This set of changes orders the runners by repair run creation date (the run id is a timeuuid) to prioritize older runs, and also takes the state of each run into account, so that paused runs let running ones proceed.
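
The ordering described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: `RunnerOrdering`, `Run`, `State`, `authorizedRuns`, and `timeUuid` are hypothetical names, and the cluster filter reflects the per-cluster limit requested in the review comments rather than anything confirmed in this change. The only library call relied on is `java.util.UUID.timestamp()`, which returns the embedded timestamp of a version 1 (time-based) UUID.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

// Hypothetical sketch; these types are not Reaper's actual classes.
final class RunnerOrdering {

  enum State { RUNNING, PAUSED }

  // Minimal stand-in for a repair run: timeuuid id, cluster name, state.
  static final class Run {
    final UUID id;
    final String cluster;
    final State state;

    Run(UUID id, String cluster, State state) {
      this.id = id;
      this.cluster = cluster;
      this.state = state;
    }
  }

  // Returns the ids of the runs in `cluster` that may proceed: paused runs
  // are skipped, the rest are sorted oldest first by the timestamp embedded
  // in their timeuuid, and at most `maxConcurrent` are kept. Because the
  // result depends only on the run ids and states, every instance that sees
  // the same runs computes the same answer, whatever order it started them in.
  static List<UUID> authorizedRuns(List<Run> runs, String cluster, int maxConcurrent) {
    return runs.stream()
        .filter(r -> r.cluster.equals(cluster))
        .filter(r -> r.state == State.RUNNING)
        .sorted(Comparator.comparingLong((Run r) -> r.id.timestamp())
            .thenComparing(r -> r.id)) // deterministic tie-break
        .limit(maxConcurrent)
        .map(r -> r.id)
        .collect(Collectors.toList());
  }

  // Demo helper (assumption): builds a version 1 UUID carrying `timestamp`
  // so the example does not need a timeuuid generator library.
  static UUID timeUuid(long timestamp, long seq) {
    long msb = (timestamp & 0xFFFFFFFFL) << 32   // time_low
        | ((timestamp >>> 32) & 0xFFFFL) << 16   // time_mid
        | 0x1000L                                // version 1
        | ((timestamp >>> 48) & 0x0FFFL);        // time_hi
    long lsb = 0x8000000000000000L | seq;        // RFC 4122 variant bits
    return new UUID(msb, lsb);
  }

  public static void main(String[] args) {
    List<Run> runs = new ArrayList<>();
    runs.add(new Run(timeUuid(300L, 1L), "cluster-a", State.RUNNING));
    runs.add(new Run(timeUuid(100L, 2L), "cluster-a", State.PAUSED));
    runs.add(new Run(timeUuid(200L, 3L), "cluster-a", State.RUNNING));
    runs.add(new Run(timeUuid(50L, 4L), "cluster-b", State.RUNNING));
    System.out.println(authorizedRuns(runs, "cluster-a", 1));
  }
}
```

In this sketch the oldest run in `cluster-a` is paused, so it does not take up the single available slot; the next-oldest running run is selected instead, and the run in `cluster-b` is unaffected.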