Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run node cleanup after scaling the ScyllaCluster #1294

Merged
merged 2 commits into from Aug 10, 2023

Conversation

zimnx
Copy link
Collaborator

@zimnx zimnx commented Jul 3, 2023

Running node cleanup after scaling a cluster allows to avoid the
accumulation of unnecessary data on the node disks. When nodes are added
or removed from the cluster, they gain or lose some tokens, which can
result in files stored on the node disks still containing data
associated with lost tokens. Over time, this can lead to a build-up of
unnecessary data and cause disk space issues. By running node cleanup
after scaling, these files can be cleared, freeing up disk space.

Scylla Operator was extended with controllers responsible for executing
a cleanup on nodes after horizontally scaling.

Fixes #1317

@zimnx zimnx added the kind/feature Categorizes issue or PR as related to a new feature. label Jul 3, 2023
@zimnx zimnx added this to the v1.10 milestone Jul 3, 2023
@zimnx zimnx requested review from tnozicka and rzetelskik July 3, 2023 16:29
@tzach
Copy link
Contributor

tzach commented Jul 3, 2023

@zimnx to validate the logic:
If a cluster scales from X to 2X nodes, for example, how many times is cleanup executed, X times (on for each new node) or once?

@zimnx
Copy link
Collaborator Author

zimnx commented Jul 4, 2023

If a cluster scales from X to 2X nodes, for example, how many times is cleanup executed, X times (on for each new node) or once?

It's executed once on every node except last one that joined, only when Operator is done with scaling.
So if you have N nodes, and you scale to 2*N, Operator will add N new nodes first, and only then execute 2*N-1 node cleanups, one node at a time.

Edit:
I changed code to run cleanup Jobs in parallel as ScyllaDB core team member pointed out that cleanup no longer influence user traffic that much and it's fine to remove this limitation.

Copy link
Member

@rzetelskik rzetelskik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Very nice job overall!

pkg/cmd/operator/cleanupjob.go Outdated Show resolved Hide resolved
pkg/controller/sidecar/sync.go Show resolved Hide resolved
test/e2e/set/scyllacluster/scyllacluster_cleanup.go Outdated Show resolved Hide resolved
pkg/controller/sidecar/sync.go Outdated Show resolved Hide resolved
@@ -433,13 +434,13 @@ func TestStatefulSetForRack(t *testing.T) {
Kind: "ScyllaCluster",
Name: "basic",
UID: "the-uid",
Controller: pointer.Bool(true),
BlockOwnerDeletion: pointer.Bool(true),
Controller: pointer.Ptr(true),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these changes be in a separate commit? They're not directly related to this feature.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extracted

Copy link
Member

@tnozicka tnozicka Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume they could technically even ship separately (but the bar is lower until we get a new CI)

pkg/controller/sidecar/sync.go Outdated Show resolved Hide resolved
test/e2e/set/scyllacluster/scyllacluster_cleanup.go Outdated Show resolved Hide resolved
test/e2e/utils/observer.go Outdated Show resolved Hide resolved
test/e2e/set/scyllacluster/scyllacluster_cleanup.go Outdated Show resolved Hide resolved
@zimnx zimnx force-pushed the mz/cleanup branch 2 times, most recently from 7d1f6ce to fc4c472 Compare July 17, 2023 12:05
@rzetelskik rzetelskik self-requested a review July 19, 2023 10:17
rzetelskik
rzetelskik previously approved these changes Jul 19, 2023
Copy link
Member

@rzetelskik rzetelskik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, lgtm now

@zimnx zimnx marked this pull request as draft July 20, 2023 10:28
Copy link
Member

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanned through the important parts, got a few question before I go through the rest, thanks!

pkg/cmd/operator/cleanupjob.go Outdated Show resolved Hide resolved
pkg/cmd/operator/cleanupjob.go Outdated Show resolved Hide resolved
pkg/cmd/operator/cleanupjob.go Show resolved Hide resolved
pkg/cmd/operator/cleanupjob.go Show resolved Hide resolved
pkg/cmd/operator/cleanupjob.go Outdated Show resolved Hide resolved
pkg/controller/sidecar/sync.go Outdated Show resolved Hide resolved
pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
@zimnx zimnx marked this pull request as ready for review July 21, 2023 12:19
@zimnx zimnx force-pushed the mz/cleanup branch 2 times, most recently from d18639a to 79b1c0d Compare July 24, 2023 13:20
@zimnx zimnx requested a review from tnozicka July 25, 2023 14:03
Copy link
Member

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

round 2

pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
pkg/controller/scyllacluster/resource.go Show resolved Hide resolved
pkg/controller/scyllacluster/sync_jobs.go Outdated Show resolved Hide resolved
progressingMessages = append(progressingMessages, fmt.Sprintf("Waiting for ScyllaCluster %q to rollout.", naming.ObjRef(sc)))
}

if len(progressingMessages) != 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: being discussed on the proposal - #1300 (comment)

pkg/controller/scyllacluster/resource.go Outdated Show resolved Hide resolved
pkg/controller/scyllacluster/resource.go Outdated Show resolved Hide resolved
pkg/cmd/operator/cleanupjob.go Outdated Show resolved Hide resolved
pkg/controller/scyllacluster/resource_test.go Outdated Show resolved Hide resolved
pkg/controller/sidecar/sync.go Show resolved Hide resolved
pkg/controller/sidecar/sync.go Show resolved Hide resolved
pkg/controllerhelpers/scylla.go Outdated Show resolved Hide resolved
pkg/naming/constants.go Outdated Show resolved Hide resolved
pkg/scyllaclient/client.go Show resolved Hide resolved
test/e2e/set/scyllacluster/scyllacluster_cleanup.go Outdated Show resolved Hide resolved
test/e2e/set/scyllacluster/scyllacluster_cleanup.go Outdated Show resolved Hide resolved
@zimnx zimnx marked this pull request as draft July 31, 2023 12:34
@zimnx zimnx force-pushed the mz/cleanup branch 2 times, most recently from 88e6b3a to f0cdcd8 Compare July 31, 2023 17:01
@zimnx zimnx marked this pull request as ready for review August 2, 2023 09:56
@zimnx zimnx requested a review from tnozicka August 2, 2023 09:56
@mykaul
Copy link
Contributor

mykaul commented Aug 8, 2023

What's the latest with this PR? I see most / all conversations marked as 'outdated' - can we resolve those that were and move forward?

@zimnx
Copy link
Collaborator Author

zimnx commented Aug 8, 2023

What's the latest with this PR? I see most / all conversations marked as 'outdated' - can we resolve those that were and move forward?

It's waiting for @tnozicka feedback about review remarks resolution and approval.

tnozicka
tnozicka previously approved these changes Aug 9, 2023
Copy link
Member

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the updates

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2023
@scylla-operator-bot scylla-operator-bot bot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. do-not-merge/contains-merge-commits size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed lgtm Indicates that a PR is ready to be merged. labels Aug 9, 2023
@scylla-operator-bot scylla-operator-bot bot removed do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. do-not-merge/contains-merge-commits labels Aug 9, 2023
Copy link
Member

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2023
Running node cleanup after scaling a cluster allows to avoid the
accumulation of unnecessary data on the node disks. When nodes are added
or removed from the cluster, they gain or lose some tokens, which can
result in files stored on the node disks still containing data
associated with lost tokens. Over time, this can lead to a build-up of
unnecessary data and cause disk space issues. By running node cleanup
after scaling, these files can be cleared, freeing up disk space.

Scylla Operator was extended with controllers responsible for executing
a cleanup on nodes after horizontally scaling.
@scylla-operator-bot scylla-operator-bot bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2023
@zimnx zimnx enabled auto-merge August 9, 2023 13:35
Copy link
Member

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@scylla-operator-bot scylla-operator-bot bot added the lgtm Indicates that a PR is ready to be merged. label Aug 9, 2023
@scylla-operator-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tnozicka, zimnx

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@zimnx zimnx merged commit d131f8e into scylladb:master Aug 10, 2023
39 of 41 checks passed
@zimnx zimnx deleted the mz/cleanup branch August 10, 2023 10:22
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Aug 15, 2023
Operator v1.10 runs cleanups [1] on each of the old nodes after
adding a new one like it is recommended by the Scylla documentation [2].
So, cover this functionality with the functional test.

[1] scylladb/scylla-operator/pull/1294
[2] https://opensource.docs.scylladb.com/stable/operating-scylla/ \
      procedures/cluster-management/add-node-to-cluster.html
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Aug 15, 2023
Operator v1.10 runs cleanups [1] on each of the old nodes after
adding a new one like it is recommended by the Scylla documentation [2].
So, cover this functionality with the functional test.

[1] scylladb/scylla-operator/pull/1294
[2] https://opensource.docs.scylladb.com/stable/operating-scylla/ \
      procedures/cluster-management/add-node-to-cluster.html
vponomaryov added a commit to vponomaryov/scylla-cluster-tests that referenced this pull request Aug 17, 2023
Operator v1.10 runs cleanups [1] on each of the old nodes after
adding a new one like it is recommended by the Scylla documentation [2].
So, cover this functionality with the functional test.

[1] scylladb/scylla-operator/pull/1294
[2] https://opensource.docs.scylladb.com/stable/operating-scylla/ \
      procedures/cluster-management/add-node-to-cluster.html
fruch pushed a commit to scylladb/scylla-cluster-tests that referenced this pull request Aug 25, 2023
Operator v1.10 runs cleanups [1] on each of the old nodes after
adding a new one like it is recommended by the Scylla documentation [2].
So, cover this functionality with the functional test.

[1] scylladb/scylla-operator/pull/1294
[2] https://opensource.docs.scylladb.com/stable/operating-scylla/ \
      procedures/cluster-management/add-node-to-cluster.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement running node cleanup
5 participants