
Stop any ongoing cleanup on Job termination and on cleanup startup #1345

Merged

Conversation

@zimnx (Collaborator) commented Aug 22, 2023

Description of your changes:

When the Cleanup Job needs to be recreated by the Operator, the new
Job could complain that it cannot execute the cleanup operation because
one is already running.
Also, a user who observes that maintenance operations have too big an
impact on cluster performance may decide to stop the cleanup.
An explicit stop is required, as Scylla keeps cleaning even when the API
request's client disconnects.

If the Cleanup Job was terminated ungracefully, a cleanup might
still be running in Scylla. As we can't keep track of it, a new one
needs to be started from the beginning. To avoid the error returned by
the Scylla API when a cleanup is triggered while one is still running, we
stop all ongoing cleanups at the beginning.

With this change, all ongoing cleanup compactions are stopped when the Cleanup Job terminates
due to a received stop signal, and on each cleanup startup.
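
For illustration, a minimal Go sketch of the described flow: stop any ongoing cleanups on startup, run the cleanup, and stop again if a termination signal arrives. The `ScyllaClient` interface and its method names are placeholders made up for this sketch, not the operator's actual client API:

```go
package cleanupjob

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// ScyllaClient is a stand-in for the operator's Scylla API client; the method
// names are assumptions made for this sketch.
type ScyllaClient interface {
	StopOngoingCleanups(ctx context.Context) error
	RunCleanup(ctx context.Context) error
}

func runCleanupJob(client ScyllaClient) error {
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer cancel()

	// Stop any cleanup left over from a previous, possibly ungracefully
	// terminated Job, so Scylla does not reject the new cleanup request.
	if err := client.StopOngoingCleanups(ctx); err != nil {
		return err
	}

	err := client.RunCleanup(ctx)

	if ctx.Err() != nil {
		// A stop signal arrived. Scylla keeps cleaning even after the API
		// client disconnects, so issue an explicit stop with a fresh context.
		stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer stopCancel()
		if stopErr := client.StopOngoingCleanups(stopCtx); stopErr != nil {
			log.Printf("failed to stop ongoing cleanups: %v", stopErr)
		}
	}

	return err
}
```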

Which issue is resolved by this Pull Request:

Resolves #1343

@zimnx zimnx added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 22, 2023
@scylla-operator-bot scylla-operator-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 22, 2023
@rzetelskik (Member) left a comment


a few nits, other than that /lgtm

(Review comments on pkg/cmd/operator/cleanupjob.go, resolved)
@zimnx zimnx force-pushed the mz/1343-cleanup-retriggered branch from ddffe23 to 18cb78a Compare August 22, 2023 15:52
@tnozicka (Member) left a comment


When the Cleanup Job needs to be recreated by the Operator, the new Job could complain that it cannot execute the cleanup operation because one is already running.

What if the stop operation fails? (It's not retried, it may fail or the pod may terminate ungracefully.)
Is there an option to stop the old one when the new one is starting?

(Review comments on pkg/cmd/operator/cleanupjob.go)
@zimnx zimnx force-pushed the mz/1343-cleanup-retriggered branch 2 times, most recently from be97808 to 2efb53c Compare August 24, 2023 08:56
@zimnx (Collaborator, Author) commented Aug 24, 2023

/retest-required

@tnozicka (Member)

/test images

@zimnx (Collaborator, Author) commented Aug 24, 2023

/retest-required

@tnozicka (Member) commented Aug 24, 2023

What if the stop operation fails? (It's not retried, it may fail or the pod may terminate ungracefully.)
Is there an option to stop the old one when the new one is starting?

@zimnx shouldn't the cancellation be retried with some backoff or better retried on next job start?

@zimnx (Collaborator, Author) commented Aug 24, 2023

shouldn't the cancellation be retried with some backoff

There's 1s backoff.

better retried on next job start?

Any argument behind why retrying on next job start would be better?

@tnozicka (Member)

There's 1s backoff.

ok, I see there is now a backoff for 10s, thx

Any argument behind why retrying on next job start would be better?

not better, but in addition to cancellation on termination

  • pods may terminate ungracefully, cleanup won't be stopped at all
  • cleanup cancellation may fail and we can't afford to retry it forever because of terminationGracePeriod. Eventually, there is always a case where this could fail to cancel and something should likely handle it.

the new Job could complain that it cannot execute the cleanup operation because one is already running.

What exactly happens if the cancellation fails and a new job tries to start another cleanup while the old one is running? Is it just a warning, stuck until completion or the new one fails?
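
For illustration, a minimal sketch of what a bounded retry of the cancellation could look like, reusing the assumed `ScyllaClient` placeholder from the sketch in the description above (imports context, fmt, and time). The fixed backoff and the idea of bounding the retries by the termination grace period follow the discussion here, not the PR's actual implementation:

```go
// retryStopCleanups keeps retrying the stop call with a fixed backoff until it
// succeeds or ctx expires. The caller is expected to bound ctx by the time it
// can afford to wait, e.g. the Pod's termination grace period.
func retryStopCleanups(ctx context.Context, client ScyllaClient, backoff time.Duration) error {
	for {
		err := client.StopOngoingCleanups(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Out of time; the leftover cleanup is then handled by the next
			// Job run, which stops ongoing cleanups on startup.
			return fmt.Errorf("can't stop ongoing cleanups: %w", err)
		case <-time.After(backoff):
		}
	}
}
```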

@zimnx zimnx force-pushed the mz/1343-cleanup-retriggered branch from 2efb53c to 31a563f Compare August 24, 2023 13:57
@zimnx (Collaborator, Author) commented Aug 24, 2023

What exactly happens if the cancellation fails and a new job tries to start another cleanup while the old one is running? Is it just a warning, stuck until completion or the new one fails?

ERROR 2023-08-21 09:26:15,917 [shard 0] api - Failed force_keyspace_cleanup of system_schema.table{name=columns, id=24101c25-a2ae-3af7-87c1-b40ee1aca33f}: std::runtime_error (cleanup request failed: there is an ongoing cleanup on system_schema.columns)

This fails the request that triggered it. The Job would be restarted (its RestartPolicy is set to OnFailure), an attempt to stop would be sent, and if that's unsuccessful the Job would spin until the "unstoppable" cleanup finishes, eventually triggering a new one. If the stop is successful, a new cleanup should be triggered right after.


in addition to cancellation on termination

That would solve the ungraceful shutdown, I'll add it.
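
As an aside on the restart behavior mentioned above, the RestartPolicy lives in the Job's pod template. A minimal sketch using the k8s.io/api types, where the Job name, container name, image, and command are placeholders rather than the operator's actual manifest:

```go
package cleanupjob

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func newCleanupJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name: "cleanup", // placeholder name
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// OnFailure makes the kubelet restart the container after a
					// failed run, so a failed stop or cleanup attempt is retried.
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:    "cleanup",                              // placeholder
						Image:   "docker.io/scylladb/scylla-operator",   // placeholder
						Command: []string{"scylla-operator", "cleanup"}, // placeholder
					}},
				},
			},
		},
	}
}
```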

@zimnx zimnx changed the title Stop any ongoing cleanup on Job termination Stop any ongoing cleanup on Job termination and on cleanup startup Aug 24, 2023
@zimnx zimnx requested a review from tnozicka August 24, 2023 13:58
@tnozicka (Member) left a comment


@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2023
@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2023
@scylla-operator-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tnozicka (Member)

thanks for the updates - one pending question, feel free to /hold cancel if that's confirmed

@zimnx (Collaborator, Author) commented Aug 24, 2023

/hold cancel

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2023
@scylla-operator-bot scylla-operator-bot bot merged commit e6bce29 into scylladb:master Aug 24, 2023
10 checks passed
@zimnx zimnx deleted the mz/1343-cleanup-retriggered branch August 25, 2023 08:29
Successfully merging this pull request may close these issues.

Cleanup is triggered multiple times