Enhancement proposal for running cleanup after scaling #1300
Conversation
cc @bhalevy - can you review the suggestion?
Since this PR only contains the document for the proposal, not the enhancement itself, how about renaming the PR subject to be the same as the patch:
Also, what is this file for:
## Summary
The Operator supports both horizontal and vertical scaling, but the procedure isn’t in sync with ScyllaDB documentation, because the cleanup part is not implemented. It’s important because upon scaling, the stored on node disk might become
nit: stored data?
done
The Operator supports both horizontal and vertical scaling, but the procedure isn’t in sync with ScyllaDB documentation, because the cleanup part is not implemented. It’s important because upon scaling, the stored on node disk might become stale taking up unnecessary space. Operator should support running the cleanup to keep disk space as low as possible and ensure clusters are stable and reliable over time.
The most serious problem with not running cleanup is the possibility of data resurrection.
This happens when tombstones delete data on other nodes and the tombstone is eventually purged, leaving behind neither the data nor the tombstone. A later decommission or removenode may then move token ownership back to the original node, which might still hold the stale data that was never cleaned up. Since the tombstone that deleted it was purged, the data gets resurrected.
Added a mention of data resurrection to the Motivation paragraph.
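To make the resurrection scenario above concrete, here is a toy, single-process Go sketch; the `cell` type, the merge rule, and the timeline are invented for illustration and do not model ScyllaDB's storage engine.

```go
package main

import "fmt"

// cell is a stored value or a tombstone for a key, with a write timestamp.
type cell struct {
	value     string
	tombstone bool
	ts        int64
}

// merge keeps the newer cell, mimicking last-write-wins reconciliation.
func merge(a, b *cell) *cell {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	if b.ts > a.ts {
		return b
	}
	return a
}

func main() {
	// t=1: key "k" is written; node A owns its token and stores the data.
	nodeA := map[string]*cell{"k": {value: "v", ts: 1}}

	// t=2: ownership of the token moves to node B (e.g. a node was added).
	// Cleanup is NOT run on A, so A keeps its now-stale copy of "k".
	nodeB := map[string]*cell{"k": {value: "v", ts: 1}}

	// t=3: "k" is deleted; the tombstone lands on the current owner, B.
	nodeB["k"] = &cell{tombstone: true, ts: 3}

	// t=4: after gc_grace_seconds, compaction purges the tombstone on B,
	// leaving behind neither the data nor the tombstone.
	delete(nodeB, "k")

	// t=5: B is decommissioned/removed and ownership returns to A, which
	// still holds the stale copy; with no tombstone left to shadow it,
	// the deleted value wins the merge and is resurrected.
	fmt.Printf("read of k: %+v\n", merge(nodeA["k"], nodeB["k"]))
}
```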
the node disks. When nodes are added or removed from the cluster, they gain or lose some tokens, which can result in files stored on the node disks still containing data associated with lost tokens. Over time, this can lead to a build-up of unnecessary data and cause disk space issues. By running node cleanup after scaling, these files can be cleared, freeing up disk space.
See above. It's more than just cleaning up disk space.
Added a mention of data resurrection to the Motivation paragraph.
### Non-Goals
Running node cleanup during off-peak hours. |
We can add: Running node cleanup after vertical scaling.
(Since it is not needed in this case)
It's a drawback, not a non-goal. Mentioned it in the Drawbacks paragraph.
tokens for each node as an annotation in the member service. In addition, a new controller in Operator will be responsible for managing Jobs that will execute a cleanup on nodes that require it. The trigger for the Job creation will be a mismatch between the current and latest hash. The controller will ensure that there will be only one cleanup Job running at the same time to prevent extraneous load on the cluster
nit: s/Job/job/
Should be one cleanup job per node I guess.
But we can (and should) run cleanup on all nodes in parallel.
yep, I think we agreed on running in parallel. the proposal should reflect it.
fixed
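For illustration, here is a minimal Go sketch of the hash-mismatch trigger discussed in this thread. The hashing scheme, the helper names, and the annotation key are assumptions made for the example, not the Operator's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// tokenRingHash returns a stable digest of a node's owned tokens. Sorting
// first makes the hash independent of the order tokens are reported in.
func tokenRingHash(tokens []string) string {
	sorted := append([]string(nil), tokens...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return hex.EncodeToString(sum[:])
}

// needsCleanup compares the current ring hash against the one recorded on
// the member Service's annotations (hypothetical key). A mismatch means the
// ring changed since the last cleanup, so a cleanup Job should be created.
func needsCleanup(annotations map[string]string, currentTokens []string) bool {
	const lastCleanedHashKey = "internal.scylla-operator.scylladb.com/last-cleaned-token-ring-hash" // assumed key
	return annotations[lastCleanedHashKey] != tokenRingHash(currentTokens)
}

func main() {
	svcAnnotations := map[string]string{
		"internal.scylla-operator.scylladb.com/last-cleaned-token-ring-hash": tokenRingHash([]string{"-9223372036854775808", "0"}),
	}
	fmt.Println(needsCleanup(svcAnnotations, []string{"-9223372036854775808", "0"}))       // false: ring unchanged
	fmt.Println(needsCleanup(svcAnnotations, []string{"-9223372036854775808", "0", "42"})) // true: ring changed after scaling
}
```

With parallel execution, the controller would evaluate this predicate per node and create one Job for every node whose hash mismatches, rather than serializing them.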
This design doesn’t take into account whether a node received a token or lost it, it only detects ring changes and reacts with a cleanup trigger upon change. When a node is decommissioned, tokens are redistributed and nodes getting them doesn’t require a cleanup since there’s no stale data on their disks associated with these new tokens. Operator
nit: s/doesn't/don't/
done
#### Cleanup is not run when necessary
When keyspace RF is decreased, nodes no longer need to keep extraneous copies of the data, cleanup could free the disks. Approach designed here doesn't detect this case because the token ring is not changed.
Similarly, cleanup is needed after changing the replication strategy, e.g. from Simple to NetworkTopology, as token ownership of secondary replicas will change.
By the way, speaking of automation in this respect: when increasing RF, repair needs to run in order to build the additional replicas. This is probably not automated either, IIUC.
Added a mention of it.
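A small sketch of why this blind spot follows directly from the design: the trigger hashes only the token set, and an RF or replication-strategy change leaves that input untouched. Function and variable names here are illustrative.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// ringHash digests only the node's token set; keyspace settings such as RF
// are not part of the input, so changing them cannot change the hash.
func ringHash(tokens []string) [sha256.Size]byte {
	s := append([]string(nil), tokens...)
	sort.Strings(s)
	return sha256.Sum256([]byte(strings.Join(s, ",")))
}

func main() {
	tokens := []string{"-100", "0", "100"}
	hashBefore := ringHash(tokens)       // before ALTER KEYSPACE changes replication
	hashAfter := ringHash(tokens)        // after: the token set is identical
	fmt.Println(hashBefore == hashAfter) // true: no mismatch, so no cleanup Job is created
}
```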
git doesn't track folders, so this was used as a placeholder for that directory until we have the first proposal here, which should remove it.
Force-pushed c6bb3c0 to 9ff0699
Force-pushed 9f4da8f to 377fe05
Force-pushed 377fe05 to 4475e8b
Force-pushed 4475e8b to e169e66
I think that instead of adding a file with the proposal directly to the top directory, we should adopt a naming scheme similar to the original KEP repository, i.e. a subdirectory prefixed with a tracking issue number and a readme inside of it. I believe it provides better readability, as well as a reference to the origin of the proposal. See “What is the number at the beginning of the KEP name?”.
Force-pushed e169e66 to 090c893
I agree; in addition, having a separate directory allows for including external files like images, YAMLs, etc. Moved it.
/approve
/lgtm
thanks
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tnozicka, zimnx. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
The Operator supports both horizontal and vertical scaling, but the procedure isn’t in sync with ScyllaDB documentation, because the cleanup part is not implemented. It’s important because upon scaling, the data stored on node disks might become stale, taking up unnecessary space. The Operator should support running cleanup to keep disk usage as low as possible and ensure clusters are stable and reliable over time.
Details can be found in the actual enhancement.