
Stop any ongoing cleanup on Job termination and on cleanup startup #1345

Merged

Conversation

@zimnx (Collaborator) commented Aug 22, 2023

Description of your changes:

When the Cleanup Job needs to be recreated by the Operator, the new
Job could complain that it cannot execute the cleanup operation because
one is already running.
Also, a user who observes that maintenance operations have too big an
impact on cluster performance may decide to stop the cleanup.
An explicit stop is required, as Scylla keeps cleaning even when the API
request's client disconnects.

If the Cleanup Job was terminated ungracefully, a cleanup might
still be running in Scylla. As we can't keep track of it, a new one
needs to be started from the beginning. To avoid the error returned by
the Scylla API when a cleanup is triggered while one is still running, we
stop all ongoing cleanups at the beginning.

With this change, all ongoing cleanup compactions are stopped when the Cleanup Job terminates
due to a received stop signal, and on each cleanup startup.
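
For illustration, a minimal Go sketch of the described flow: stop any ongoing cleanups on startup, run the cleanup, and stop again if a termination signal arrives. The `ScyllaClient` interface and its method names are placeholders made up for this sketch, not the operator's actual client API:

```go
package cleanupjob

import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// ScyllaClient is a stand-in for the operator's Scylla API client; the method
// names are assumptions made for this sketch.
type ScyllaClient interface {
	StopOngoingCleanups(ctx context.Context) error
	RunCleanup(ctx context.Context) error
}

func runCleanupJob(client ScyllaClient) error {
	ctx, cancel := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer cancel()

	// Stop any cleanup left over from a previous, possibly ungracefully
	// terminated Job, so Scylla does not reject the new cleanup request.
	if err := client.StopOngoingCleanups(ctx); err != nil {
		return err
	}

	err := client.RunCleanup(ctx)

	if ctx.Err() != nil {
		// A stop signal arrived. Scylla keeps cleaning even after the API
		// client disconnects, so issue an explicit stop with a fresh context.
		stopCtx, stopCancel := context.WithTimeout(context.Background(), 10*time.Second)
		defer stopCancel()
		if stopErr := client.StopOngoingCleanups(stopCtx); stopErr != nil {
			log.Printf("failed to stop ongoing cleanups: %v", stopErr)
		}
	}

	return err
}
```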

Which issue is resolved by this Pull Request:

Resolves #1343

@zimnx zimnx added kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 22, 2023
@scylla-operator-bot scylla-operator-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 22, 2023
@rzetelskik (Member) left a comment


a few nits, other than that /lgtm

(Review comments on pkg/cmd/operator/cleanupjob.go, resolved)
@zimnx zimnx force-pushed the mz/1343-cleanup-retriggered branch from ddffe23 to 18cb78a Compare August 22, 2023 15:52
@tnozicka (Member) left a comment


When the Cleanup Job needs to be recreated by the Operator, the new Job could complain that it cannot execute the cleanup operation because one is already running.

What if the stop operation fails? (It's not retried, it may fail or the pod may terminate ungracefully.)
Is there an option to stop the old one when the new one is starting?

(Review comments on pkg/cmd/operator/cleanupjob.go)
@zimnx zimnx force-pushed the mz/1343-cleanup-retriggered branch 2 times, most recently from be97808 to 2efb53c Compare August 24, 2023 08:56
@zimnx (Collaborator, Author) commented Aug 24, 2023

/retest-required

@tnozicka (Member)

/test images

@zimnx (Collaborator, Author) commented Aug 24, 2023

/retest-required

@tnozicka (Member) commented Aug 24, 2023

What if the stop operation fails? (It's not retried, it may fail or the pod may terminate ungracefully.)
Is there an option to stop the old one when the new one is starting?

@zimnx shouldn't the cancellation be retried with some backoff or better retried on next job start?

@zimnx (Collaborator, Author) commented Aug 24, 2023

shouldn't the cancellation be retried with some backoff

There's 1s backoff.

better retried on next job start?

Any argument behind why retrying on next job start would be better?

@tnozicka (Member)

There's 1s backoff.

ok, I see there is now a backoff for 10s, thx

Any argument behind why retrying on next job start would be better?

not better, but in addition to cancellation on termination

  • pods may terminate ungracefully, cleanup won't be stopped at all
  • cleanup cancellation may fail and we can't afford to retry it forever because of terminationGracePeriod. Eventually, there is always a case where this could fail to cancel and something should likely handle it.

the new Job could complain that it cannot execute the cleanup operation because one is already running.

What exactly happens if the cancellation fails and a new job tries to start another cleanup while the old one is running? Is it just a warning, stuck until completion or the new one fails?
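
For illustration, a minimal sketch of what a bounded retry of the cancellation could look like, reusing the assumed `ScyllaClient` placeholder from the sketch in the description above (imports context, fmt, and time). The fixed backoff and the idea of bounding the retries by the termination grace period follow the discussion here, not the PR's actual implementation:

```go
// retryStopCleanups keeps retrying the stop call with a fixed backoff until it
// succeeds or ctx expires. The caller is expected to bound ctx by the time it
// can afford to wait, e.g. the Pod's termination grace period.
func retryStopCleanups(ctx context.Context, client ScyllaClient, backoff time.Duration) error {
	for {
		err := client.StopOngoingCleanups(ctx)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			// Out of time; the leftover cleanup is then handled by the next
			// Job run, which stops ongoing cleanups on startup.
			return fmt.Errorf("can't stop ongoing cleanups: %w", err)
		case <-time.After(backoff):
		}
	}
}
```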

@zimnx zimnx force-pushed the mz/1343-cleanup-retriggered branch from 2efb53c to 31a563f Compare August 24, 2023 13:57
@zimnx (Collaborator, Author) commented Aug 24, 2023

What exactly happens if the cancellation fails and a new job tries to start another cleanup while the old one is running? Is it just a warning, stuck until completion or the new one fails?

ERROR 2023-08-21 09:26:15,917 [shard 0] api - Failed force_keyspace_cleanup of system_schema.table{name=columns, id=24101c25-a2ae-3af7-87c1-b40ee1aca33f}: std::runtime_error (cleanup request failed: there is an ongoing cleanup on system_schema.columns)

This fails the request that triggered it. The Job would be restarted (its RestartPolicy is set to OnFailure), an attempt to stop would be sent, and if that's unsuccessful the Job would spin until the "unstoppable" cleanup finishes, eventually triggering a new one. If the stop is successful, a new cleanup should be triggered right after.


in addition to cancellation on termination

That would solve the ungraceful shutdown, I'll add it.
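
As an aside on the restart behavior mentioned above, the RestartPolicy lives in the Job's pod template. A minimal sketch using the k8s.io/api types, where the Job name, container name, image, and command are placeholders rather than the operator's actual manifest:

```go
package cleanupjob

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func newCleanupJob() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name: "cleanup", // placeholder name
		},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// OnFailure makes the kubelet restart the container after a
					// failed run, so a failed stop or cleanup attempt is retried.
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:    "cleanup",                              // placeholder
						Image:   "docker.io/scylladb/scylla-operator",   // placeholder
						Command: []string{"scylla-operator", "cleanup"}, // placeholder
					}},
				},
			},
		},
	}
}
```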

@zimnx zimnx changed the title Stop any ongoing cleanup on Job termination Stop any ongoing cleanup on Job termination and on cleanup startup Aug 24, 2023
@zimnx zimnx requested a review from tnozicka August 24, 2023 13:58
@tnozicka (Member) left a comment


@scylla-operator-bot scylla-operator-bot bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2023
@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Aug 24, 2023
@scylla-operator-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tnozicka (Member)

thanks for the updates - one pending question, feel free to /hold cancel if that's confirmed

@zimnx (Collaborator, Author) commented Aug 24, 2023

/hold cancel

@scylla-operator-bot scylla-operator-bot bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 24, 2023
@scylla-operator-bot scylla-operator-bot bot merged commit e6bce29 into scylladb:master Aug 24, 2023
10 checks passed
@zimnx zimnx deleted the mz/1343-cleanup-retriggered branch August 25, 2023 08:29
Successfully merging this pull request may close these issues.

Cleanup is triggered multiple times