Stop any ongoing cleanup on Job termination and on cleanup startup #1345
a few nits, other than that /lgtm
Force-pushed from ddffe23 to 18cb78a.
When the Cleanup Job needs to be recreated by the Operator, the new Job could complain that it cannot execute the cleanup operation because there's already one running.
What if the stop operation fails? (It's not retried, it may fail or the pod may terminate ungracefully.)
Is there an option to stop the old one when the new one is starting?
Force-pushed from be97808 to 2efb53c.
/retest-required
/test images
/retest-required
@zimnx shouldn't the cancellation be retried with some backoff, or better, retried on the next job start?
There's a 1s backoff.
Is there any argument for why retrying on the next job start would be better?
ok, I see there is now a backoff for 10s, thx
not better, but in addition to cancellation on termination
What exactly happens if the cancellation fails and a new job tries to start another cleanup while the old one is running? Is it just a warning, is it stuck until completion, or does the new one fail?
Force-pushed from 2efb53c to 31a563f.
The request that triggers the new cleanup fails. The Job would then be restarted (the Job's RestartPolicy is set to OnFailure) and an attempt to stop the old cleanup would be sent. If that attempt is unsuccessful, the Job would keep spinning until this "unstoppable" cleanup finishes, and would eventually trigger a new one. If the stop is successful, a new cleanup should be triggered right after.
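The behaviour described in this reply can be modelled with a small simulation. Everything here is hypothetical: `fakeScylla`, `runJobOnce`, and `runWithRestarts` are illustrative names, the restart loop only imitates what `RestartPolicy: OnFailure` would do, and a failed stop is treated as non-fatal (the outcome is the same either way, since the start request fails while a cleanup is running).

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyRunning = errors.New("cleanup already running")

// fakeScylla is an in-memory stand-in for the node's API (an assumption).
type fakeScylla struct {
	running   bool // a cleanup is in progress
	stoppable bool // whether a stop request takes effect
	ticksLeft int  // restarts until the old cleanup finishes by itself
}

func (s *fakeScylla) stopCleanup() error {
	if s.running && !s.stoppable {
		return errors.New("stop failed")
	}
	s.running = false
	return nil
}

func (s *fakeScylla) startCleanup() error {
	if s.running {
		return errAlreadyRunning
	}
	s.running = true
	return nil
}

// tick models time passing between two Job restarts.
func (s *fakeScylla) tick() {
	if s.running && s.ticksLeft > 0 {
		s.ticksLeft--
		if s.ticksLeft == 0 {
			s.running = false
		}
	}
}

// runJobOnce models one Job attempt: try to stop leftovers, then start.
func runJobOnce(s *fakeScylla) error {
	_ = s.stopCleanup() // non-fatal in this sketch
	return s.startCleanup()
}

// runWithRestarts reruns the Job until it succeeds and reports the attempt count.
func runWithRestarts(s *fakeScylla, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := runJobOnce(s); err == nil {
			return attempt, nil
		}
		s.tick()
	}
	return maxAttempts, errors.New("gave up")
}

func main() {
	stoppable := &fakeScylla{running: true, stoppable: true}
	unstoppable := &fakeScylla{running: true, stoppable: false, ticksLeft: 3}

	a, _ := runWithRestarts(stoppable, 10)
	b, _ := runWithRestarts(unstoppable, 10)
	fmt.Println("attempts with working stop:", a)
	fmt.Println("attempts with failing stop:", b)
}
```

With a working stop the new cleanup starts on the first attempt; with a failing stop the Job only gets through once the old cleanup has finished.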
That would solve the ungraceful shutdown, I'll add it.
/approve
/lgtm
/hold
(for https://github.com/scylladb/scylla-operator/pull/1345/files#r1304437045)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: tnozicka, zimnx
thanks for the updates - one pending question, feel free to
/hold cancel
Description of your changes:
When the Cleanup Job needs to be recreated by the Operator, the new
Job could complain that it cannot execute the cleanup operation because
there's already one running.
Also, when a user observes that maintenance operations have too big
an impact on cluster performance, they could decide to stop the cleanup.
An explicit stop is required, as Scylla keeps cleaning even when the API
request client disconnects.
If the Cleanup Job was terminated ungracefully, a cleanup might
still be running in Scylla. As we can't keep track of it, a new one
needs to be started from the beginning. To overcome the error returned by
the Scylla API when a cleanup is triggered while one is still running, we
will stop all ongoing cleanups at the beginning.
With this change, all ongoing cleanup compactions are stopped when the Cleanup Job
terminates due to a received stop signal, and on each cleanup startup.
Which issue is resolved by this Pull Request:
Resolves #1343