
The node is already decommissioned error after second decommission of the same Scylla member which was recreated #1293

Open
vponomaryov opened this issue Jul 3, 2023 · 8 comments

@vponomaryov
Contributor

Describe the bug
The following scenario was run twice (a rough kubectl sketch of the steps is shown right after the list):

  • Terminate the K8S node that hosts the last Scylla pod
  • Reduce the members number in the ScyllaCluster object from 3 to 2
  • Wait for the pod deletion
  • Increase the members number in the ScyllaCluster object from 2 to 3
  • Wait for the pod readiness
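
A rough kubectl sketch of these steps (a sketch only: the namespace, cluster and pod names are taken from the logs below, the K8S-node termination is cloud-specific and omitted, and the exact commands SCT runs may differ):

# Scale the rack down from 3 to 2 members and wait for the last pod to go away
kubectl -n scylla patch scyllacluster/sct-cluster --type=json \
  -p='[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 2}]'
kubectl -n scylla wait --timeout=21600s --for=delete pod/sct-cluster-us-east1-b-us-east1-2

# Scale the rack back up to 3 members and wait for the recreated pod to become ready
kubectl -n scylla patch scyllacluster/sct-cluster --type=json \
  -p='[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 3}]'
kubectl -n scylla wait --timeout=21600s --for=condition=Ready pod/sct-cluster-us-east1-b-us-east1-2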

On the second loop, we got the following error after the decommission step:

INFO  2023-07-02 11:16:31,303 [shard 0] storage_service - entering DECOMMISSIONED mode
INFO  2023-07-02 11:16:31,303 [shard 0] storage_service - DECOMMISSIONING: done
INFO  2023-07-02 11:16:31,312 [shard 2] compaction - [Compact system.local e5816660-18c9-11ee-8b2e-3c47bb21f959] Compacted 2 sstables to [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/me-107-big-Data.db:level=0]. 86kB to 45kB (~52% of original) in 8ms = 5MB/s. ~256 total partitions merged to 1.
I0702 11:16:33.064186       1 sidecar/sync.go:92] "The node is already decommissioned"
I0702 11:16:37.630922       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
I0702 11:16:47.630724       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
...
I0702 11:46:57.630567       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Scylla operator
  2. Create a 3-member Scylla cluster using a dynamic volume provisioner
  3. Run the above-mentioned scenario twice
  4. See error

Expected behavior
The scenario described above should work any number of times.

Logs

Environment:

  • Platform: GKE
  • Kubernetes version: v1.27.2-gke.2100
  • Scylla version: 2022.2.9-20230618.843304f9f734
  • Scylla-operator version: 1.9.0-rc.2
@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Jul 3, 2023
@vponomaryov
Contributor Author

In the second run of the same scenario, the bug was not reproduced: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-gke/100/consoleFull
Everything was the same. Looks like some race condition...

@rzetelskik
Member

Everything was the same. Looks like some race condition...

To me it looks like the ScyllaCluster's .spec.datacenter.racks[0].members was updated back to 3 before the operator scaled the StatefulSet down, so that's probably right, but I don't have proof of that yet. Also, @vponomaryov, could you please point me to the code of the failing longevity test?

@vponomaryov
Contributor Author

Ran 2 more test runs and neither hit this bug.
So it turns out to be 1 occurrence out of 4 runs.

@rzetelskik rzetelskik added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jul 5, 2023
@rzetelskik
Member

I believe the scenario you mentioned in the original post wasn't exactly followed here.

Although you can see in sct-8d793a38.log the following logs:

< t:2023-07-02 10:52:53,225 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `2' in sct-cluster's spec
< t:2023-07-02 10:52:53,351 f:local_cmd_runner.py l:62   c:LocalCmdRunner       p:DEBUG > Running command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2"...
...
< t:2023-07-02 10:52:55,672 f:base.py         l:222  c:LocalCmdRunner       p:DEBUG > pod/sct-cluster-us-east1-b-us-east1-2 condition met
< t:2023-07-02 10:52:55,685 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2" finished with status 0
...
< t:2023-07-02 10:52:59,934 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `3' in sct-cluster's spec

It doesn't look like the Pod was ever actually deleted.
ScyllaOperator should've logged the following line:

"Scaling StatefulSet" ScyllaCluster="scylla/sct-cluster" StatefulSet="scylla/sct-cluster-us-east1-b-us-east1" CurrentReplicas=3 UpdatedReplicas=2

and then the opposite. Neither is present in the logs.
Looking at the logs of the actual Scylla Pod (db-cluster-8d793a38/sct-cluster-us-east1-b-us-east1-2/system.log), you can also see that the node decommissioned successfully and then just kept hanging there, without the Pod being terminated.
Unfortunately I can't confirm that by looking at the Pod's manifest, since it randomly died much later in the process, and the creationTimestamp was overwritten.

So what actually happened, and what reproduces the issue is:

1. Deploy a ScyllaCluster with 3 members
2. Wait for ScyllaCluster to rollout
3. Reduce the members number in the ScyllaCluster object from 3 to 2
4. Immediately increase the members number in the ScyllaCluster object from 2 to 3
5. Wait for ScyllaCluster to rollout

The e2e test implementing this scenario:

var _ = g.Describe("ScyllaCluster", func() {
	defer g.GinkgoRecover()

	f := framework.NewFramework("scyllacluster")

	g.FIt("should support scale-up after decommissioning", func() {
		ctx, cancel := context.WithTimeout(context.Background(), testTimeout)
		defer cancel()

		sc := scyllafixture.BasicScyllaCluster.ReadOrFail()
		sc.Spec.Datacenter.Racks[0].Members = 3

		framework.By("Creating a ScyllaCluster with 3 members")
		sc, err := f.ScyllaClient().ScyllaV1().ScyllaClusters(f.Namespace()).Create(ctx, sc, metav1.CreateOptions{})
		o.Expect(err).NotTo(o.HaveOccurred())

		framework.By("Waiting for the ScyllaCluster to rollout (RV=%s)", sc.ResourceVersion)
		waitCtx1, waitCtx1Cancel := utils.ContextForRollout(ctx, sc)
		defer waitCtx1Cancel()
		sc, err = utils.WaitForScyllaClusterState(waitCtx1, f.ScyllaClient().ScyllaV1(), sc.Namespace, sc.Name, utils.WaitForStateOptions{}, utils.IsScyllaClusterRolledOut)
		o.Expect(err).NotTo(o.HaveOccurred())

		verifyScyllaCluster(ctx, f.KubeClient(), sc)
		hosts := getScyllaHostsAndWaitForFullQuorum(ctx, f.KubeClient().CoreV1(), sc)
		o.Expect(hosts).To(o.HaveLen(3))
		diRF1 := insertAndVerifyCQLData(ctx, hosts)
		defer diRF1.Close()

		framework.By("Scaling the ScyllaCluster down to 2 replicas")
		sc, err = f.ScyllaClient().ScyllaV1().ScyllaClusters(sc.Namespace).Patch(
			ctx,
			sc.Name,
			types.JSONPatchType,
			[]byte(`[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 2}]`),
			metav1.PatchOptions{},
		)
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(sc.Spec.Datacenter.Racks[0].Members).To(o.BeEquivalentTo(2))

		framework.By("Scaling the ScyllaCluster back to 3 replicas")
		sc, err = f.ScyllaClient().ScyllaV1().ScyllaClusters(f.Namespace()).Patch(
			ctx,
			sc.Name,
			types.JSONPatchType,
			[]byte(`[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 3}]`),
			metav1.PatchOptions{},
		)
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(sc.Spec.Datacenter.Racks[0].Members).To(o.BeEquivalentTo(3))

		framework.By("Waiting for the ScyllaCluster to rollout (RV=%s)", sc.ResourceVersion)
		waitCtx6, waitCtx6Cancel := utils.ContextForRollout(ctx, sc)
		defer waitCtx6Cancel()
		sc, err = utils.WaitForScyllaClusterState(waitCtx6, f.ScyllaClient().ScyllaV1(), sc.Namespace, sc.Name, utils.WaitForStateOptions{}, utils.IsScyllaClusterRolledOut)
		o.Expect(err).NotTo(o.HaveOccurred())

		verifyScyllaCluster(ctx, f.KubeClient(), sc)
	})
})

Nevertheless, I believe this case should be handled by the operator. Once we've started decommissioning the node, it should be deleted.
I've discussed it with @zimnx and he'll include this case in his recent PR #1283

@vponomaryov
Contributor Author

Although you can see in sct-8d793a38.log the following logs:

< t:2023-07-02 10:52:53,225 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `2' in sct-cluster's spec
< t:2023-07-02 10:52:53,351 f:local_cmd_runner.py l:62   c:LocalCmdRunner       p:DEBUG > Running command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2"...
...
< t:2023-07-02 10:52:55,672 f:base.py         l:222  c:LocalCmdRunner       p:DEBUG > pod/sct-cluster-us-east1-b-us-east1-2 condition met
< t:2023-07-02 10:52:55,685 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2" finished with status 0
...
< t:2023-07-02 10:52:59,934 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `3' in sct-cluster's spec

It doesn't look like the Pod was ever actually deleted.

Hmm, then what does the following mean?

Command "
    kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache \
        --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2
" finished with status 0

@rzetelskik
Member

rzetelskik commented Jul 6, 2023

Read the rest of the comment. Printing the log doesn't magically delete the Pod. As to what exactly happened there - I have no clue. All I can see from the operator logs is that the StatefulSet wasn't scaled down/up.

@vponomaryov
Contributor Author

Read the rest of the comment. Printing the log doesn't magically delete the Pod.

I read everything.
I am saying that in this case K8S returned an OK result when it shouldn't have.

@rzetelskik
Member

I understand that. All I'm saying is that it doesn't look like the Pod was actually deleted. It doesn't imply your test was wrong or anything. Without the audit logs I can't be a hundred percent sure that the Pod wasn't deleted, but looking at the operator logs, the deletion definitely wasn't triggered by the operator. The logs of the Scylla node don't look like the Pod was deleted either.
Anyway - I don't think it matters. Scaling back up before the Pod disappeared shouldn't cause the decommissioned Pod to stay in some limbo forever.
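
For reference, a quick way to observe the limbo state described above (a sketch only: the pod and namespace names come from the logs earlier in this issue, and it assumes the operator sidecar output ends up in the main scylla container's logs):

# The decommissioned pod is still present but stays NotReady
kubectl -n scylla get pod sct-cluster-us-east1-b-us-east1-2 -o wide

# The sidecar keeps logging "The node is already decommissioned" and failing readyz probes
kubectl -n scylla logs sct-cluster-us-east1-b-us-east1-2 -c scylla | grep -E 'decommissioned|readyz'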

@tnozicka tnozicka assigned zimnx and unassigned rzetelskik Oct 16, 2023
@tnozicka tnozicka added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2023