
Enhancement proposal for running cleanup after scaling #1300

Merged: 1 commit merged into scylladb:master on Aug 9, 2023

Conversation

@zimnx (Collaborator) commented on Jul 24, 2023

The Operator supports both horizontal and vertical scaling, but the procedure isn’t in sync with the ScyllaDB documentation, because the cleanup part is not implemented. This matters because, upon scaling, data stored on a node’s disk might become stale, taking up unnecessary space. The Operator should support running cleanup to keep disk usage as low as possible and to ensure clusters stay stable and reliable over time.

Details can be found in the enhancement document itself.

@zimnx added the kind/design label (Categorizes issue or PR as related to design.) on Jul 24, 2023
@zimnx added this to the v1.10 milestone on Jul 24, 2023
@zimnx requested a review from tnozicka on July 24, 2023 13:19
@mykaul (Contributor) commented on Jul 25, 2023

cc @bhalevy - can you review the suggestion?

@bhalevy (Member) commented on Jul 25, 2023

Since this PR only contains the document for the proposal, not the enhancement itself, how about renaming the PR subject to be the same as the patch: Enhancement proposal for running cleanup after scaling?

@bhalevy (Member) commented on Jul 25, 2023

Also, what is this file for: enhancements/proposals/dir_placeholder.delete_me_with_first_proposal?

Inline review comments on enhancements/proposals/cleanup_after_scaling.md follow.

> ## Summary
>
> The Operator supports both horizontal and vertical scaling, but the procedure isn’t in sync with ScyllaDB documentation,
> because the cleanup part is not implemented. It’s important because upon scaling, the stored on node disk might become

Member: nit: stored data?

Author: done

> The Operator supports both horizontal and vertical scaling, but the procedure isn’t in sync with ScyllaDB documentation,
> because the cleanup part is not implemented. It’s important because upon scaling, the stored on node disk might become
> stale taking up unnecessary space. Operator should support running the cleanup to keep disk space as low as possible and
> ensure clusters are stable and reliable over time.

Member: The most serious problem with not running cleanup is the possibility of data resurrection.
This happens when a tombstone deletes data on other nodes and the tombstone is eventually purged, leaving behind neither the data nor the tombstone. Then decommission or removenode may expose the stale data that wasn't cleaned up, when token ownership is moved back to the original node, which might still have the data. Since the tombstone that deleted it was purged, the data gets resurrected.

Author: Added a mention of data resurrection to the Motivation paragraph.

> the node disks. When nodes are added or removed from the cluster, they gain or lose some tokens, which can result in
> files stored on the node disks still containing data associated with lost tokens. Over time, this can lead to a build-up
> of unnecessary data and cause disk space issues. By running node cleanup after scaling, these files can be cleared,
> freeing up disk space.

Member: See above. It's more than just cleaning up disk space.

Author: Added a mention of data resurrection to the Motivation paragraph.


> ### Non-Goals
>
> Running node cleanup during off-peak hours.

Member: We can add: Running node cleanup after vertical scaling (since it is not needed in this case).

Author (@zimnx, Jul 27, 2023): It's a drawback, not a non-goal. Mentioned it in the Drawbacks paragraph.

> tokens for each node as an annotation in the member service.
> In addition, a new controller in Operator will be responsible for managing Jobs that will execute a cleanup on nodes
> that require it. The trigger for the Job creation will be a mismatch between the current and latest hash. The controller
> will ensure that there will be only one cleanup Job running at the same time to prevent extraneous load on the cluster

Member: nit: s/Job/job/
Should be one cleanup job per node I guess. But we can (and should) run cleanup on all nodes in parallel.

Member: Yep, I think we agreed on running in parallel; the proposal should reflect it.

Author: fixed
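
For readers following the design, here is an editor's sketch (not code from this PR) of the hash-based trigger described in the excerpt above. The annotation key and function names are assumptions made up for illustration; the real Operator code may differ.

```go
// Editor's sketch, not the Operator's actual implementation.
package cleanup

import (
	"crypto/sha512"
	"encoding/base64"
	"sort"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// Assumed annotation key, for illustration only.
const tokenRingHashAnnotation = "internal.scylla-operator.scylladb.com/current-token-ring-hash"

// tokenRingHash hashes the node's tokens after sorting them, so the result does
// not depend on the order in which the tokens are reported.
func tokenRingHash(tokens []string) string {
	sorted := append([]string(nil), tokens...)
	sort.Strings(sorted)
	sum := sha512.Sum512([]byte(strings.Join(sorted, ",")))
	return base64.StdEncoding.EncodeToString(sum[:])
}

// needsCleanup reports whether the hash recorded on the member Service differs
// from the hash of the tokens the node currently owns, i.e. whether the
// controller should create a cleanup job for this node.
func needsCleanup(memberService *corev1.Service, currentTokens []string) bool {
	return memberService.Annotations[tokenRingHashAnnotation] != tokenRingHash(currentTokens)
}
```

With something along these lines, the controller recomputes the hash whenever the token ring changes, compares it to the annotation on each member Service, and creates cleanup jobs (one per node, run in parallel as agreed above) wherever the values diverge.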


> This design doesn’t take into account whether a node received a token or lost it, it only detects ring changes and
> reacts with a cleanup trigger upon change. When a node is decommissioned, tokens are redistributed and nodes getting
> them doesn’t require a cleanup since there’s no stale data on their disks associated with these new tokens. Operator

Member: nit: s/doesn't/don't/

Author: done
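
To make the job side of the design concrete, below is a rough editor's sketch of constructing one cleanup Job per node. The image, command, label key, and function names are placeholders rather than the Operator's real values; in particular, invoking nodetool against the member address is just one possible way to run the cleanup.

```go
// Editor's sketch; image, command, and labels are placeholders.
package cleanup

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// makeCleanupJob builds a Job that runs cleanup against a single ScyllaDB node,
// identified here by the name and address of its member Service.
func makeCleanupJob(namespace, memberName, memberAddress string) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("cleanup-%s", memberName),
			Namespace: namespace,
			Labels: map[string]string{
				// Placeholder label so the controller can list and garbage-collect its jobs.
				"scylla-operator.scylladb.com/cleanup-job-for": memberName,
			},
		},
		Spec: batchv1.JobSpec{
			BackoffLimit: int32Ptr(0),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "cleanup",
						Image: "docker.io/scylladb/scylla:latest", // placeholder image
						// nodetool cleanup drops data the node no longer owns;
						// -h points it at the target node.
						Command: []string{"nodetool", "-h", memberAddress, "cleanup"},
					}},
				},
			},
		},
	}
}
```

Since the jobs are independent, the controller can create one per affected node and let them run in parallel, presumably updating the recorded token-ring hash once a node's cleanup completes so that the job is not re-created.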

> #### Cleanup is not run when necessary
>
> When keyspace RF is decreased, nodes no longer need to keep extraneous copies of the data, cleanup could free the disks.
> Approach designed here doesn't detect this case because the token ring is not changed.

Member: Similarly, cleanup is needed after changing the replication strategy, e.g. from Simple to NetworkTopology, as token ownership of secondary replicas will change.
By the way, speaking of automation in this respect: when increasing RF, repair needs to run in order to build the additional replicas. This is probably not automated as well, IIUC.

Author: Added a mention of it.

@tnozicka added the priority/important-soon label (Must be staffed and worked on either currently, or very soon, ideally in time for the next release.) on Jul 26, 2023

Nine additional review threads on enhancements/proposals/cleanup_after_scaling.md were marked as outdated and resolved.

@tnozicka (Member) commented:

> Also, what is this file for: enhancements/proposals/dir_placeholder.delete_me_with_first_proposal?

git doesn't track folders, so this was used as a placeholder for that directory until we have the first proposal here, which should remove it.

@zimnx changed the title from "Enhancement for running cleanup after scaling" to "Enhancement proposal for running cleanup after scaling" on Jul 27, 2023
@zimnx force-pushed the mz/cleanup-proposal branch 2 times, most recently from 9f4da8f to 377fe05, on July 27, 2023 09:25
@zimnx requested review from tnozicka and bhalevy on July 27, 2023 09:27

@rzetelskik (Member) commented on Aug 8, 2023

I think that instead of adding a file with the proposal directly to the top directory, we should adopt a similar naming scheme to the original KEP repository, i.e. a subdirectory prefixed with a tracking issue number and a readme inside of it. I believe it provides better readability, as well as a reference to the origin of the proposal. See What is the number at the beginning of the KEP name?.

@scylla-operator-bot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Aug 8, 2023

@zimnx (Collaborator, Author) commented on Aug 8, 2023

> I think that instead of adding a file with the proposal directly to the top directory, we should adopt a similar naming scheme to the original KEP repository, i.e. a subdirectory prefixed with a tracking issue number and a readme inside of it. I believe it provides better readability, as well as a reference to the origin of the proposal. See What is the number at the beginning of the KEP name?.

I agree; in addition, having a separate directory allows for adding external files like images, YAML manifests, etc. Moved it.

@zimnx enabled auto-merge on August 9, 2023 13:27

@tnozicka (Member) left a comment:

/approve
/lgtm

thanks

@zimnx merged commit 96dd75e into scylladb:master on Aug 9, 2023
21 checks passed
@scylla-operator-bot added the lgtm label (Indicates that a PR is ready to be merged.) on Aug 9, 2023

@scylla-operator-bot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zimnx deleted the mz/cleanup-proposal branch on August 9, 2023 13:34
Labels: kind/design, lgtm, priority/important-soon, size/L
5 participants