
The node is already decommissioned error after second decommission of the same Scylla member which was recreated #1293

Open
vponomaryov opened this issue Jul 3, 2023 · 8 comments

@vponomaryov
Contributor

Describe the bug
The following scenario was run twice (a rough kubectl sketch of the steps is shown right after the list):

  • Terminate the K8S node that hosts the last Scylla pod
  • Reduce the members number in the ScyllaCluster object from 3 to 2
  • Wait for the pod deletion
  • Increase the members number in the ScyllaCluster object from 2 to 3
  • Wait for the pod readiness
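
A rough kubectl sketch of these steps (a sketch only: the namespace, cluster and pod names are taken from the logs below, the K8S-node termination is cloud-specific and omitted, and the exact commands SCT runs may differ):

# Scale the rack down from 3 to 2 members and wait for the last pod to go away
kubectl -n scylla patch scyllacluster/sct-cluster --type=json \
  -p='[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 2}]'
kubectl -n scylla wait --timeout=21600s --for=delete pod/sct-cluster-us-east1-b-us-east1-2

# Scale the rack back up to 3 members and wait for the recreated pod to become ready
kubectl -n scylla patch scyllacluster/sct-cluster --type=json \
  -p='[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 3}]'
kubectl -n scylla wait --timeout=21600s --for=condition=Ready pod/sct-cluster-us-east1-b-us-east1-2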

On the second loop, we got the following error after the decommission step:

INFO  2023-07-02 11:16:31,303 [shard 0] storage_service - entering DECOMMISSIONED mode
INFO  2023-07-02 11:16:31,303 [shard 0] storage_service - DECOMMISSIONING: done
INFO  2023-07-02 11:16:31,312 [shard 2] compaction - [Compact system.local e5816660-18c9-11ee-8b2e-3c47bb21f959] Compacted 2 sstables to [/var/lib/scylla/data/system/local-7ad54392bcdd35a684174e047860b377/me-107-big-Data.db:level=0]. 86kB to 45kB (~52% of original) in 8ms = 5MB/s. ~256 total partitions merged to 1.
I0702 11:16:33.064186       1 sidecar/sync.go:92] "The node is already decommissioned"
I0702 11:16:37.630922       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
I0702 11:16:47.630724       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"
...
I0702 11:46:57.630567       1 sidecar/probes.go:122] "readyz probe: node is not ready" Service="scylla/sct-cluster-us-east1-b-us-east1-2"

To Reproduce
Steps to reproduce the behavior:

  1. Deploy Scylla operator
  2. Create a 3-member Scylla cluster using a dynamic volume provisioner
  3. Run the above-mentioned scenario twice
  4. See error

Expected behavior
The scenario described above should work any number of times.

Logs

Environment:

  • Platform: GKE
  • Kubernetes version: v1.27.2-gke.2100
  • Scylla version: 2022.2.9-20230618.843304f9f734
  • Scylla-operator version: 1.9.0-rc.2
@vponomaryov vponomaryov added the kind/bug Categorizes issue or PR as related to a bug. label Jul 3, 2023
@vponomaryov
Contributor Author

In the second run of the same scenario, the bug was not reproduced: https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/valerii/job/vp-longevity-scylla-operator-3h-gke/100/consoleFull
Everything was the same. Looks like some race condition...

@rzetelskik
Member

Everything was the same. Looks like some race condition...

To me it looks like the ScyllaCluster's .spec.datacenter.racks[0].members was updated back to 3 before the operator scaled the StatefulSet down, so that's probably right, but I don't have proof of that yet. Also, @vponomaryov, could you please point me to the code of the failing longevity test?

@vponomaryov
Contributor Author

Ran 2 more test runs and neither hit this bug.
So it turns out to be 1 occurrence out of 4 runs.

@rzetelskik rzetelskik added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Jul 5, 2023
@rzetelskik
Member

I believe the scenario you mentioned in the original post wasn't exactly followed here.

Although you can see in sct-8d793a38.log the following logs:

< t:2023-07-02 10:52:53,225 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `2' in sct-cluster's spec
< t:2023-07-02 10:52:53,351 f:local_cmd_runner.py l:62   c:LocalCmdRunner       p:DEBUG > Running command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2"...
...
< t:2023-07-02 10:52:55,672 f:base.py         l:222  c:LocalCmdRunner       p:DEBUG > pod/sct-cluster-us-east1-b-us-east1-2 condition met
< t:2023-07-02 10:52:55,685 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2" finished with status 0
...
< t:2023-07-02 10:52:59,934 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `3' in sct-cluster's spec

It doesn't look like the Pod was ever actually deleted.
ScyllaOperator should've logged the following line:

"Scaling StatefulSet" ScyllaCluster="scylla/sct-cluster" StatefulSet="scylla/sct-cluster-us-east1-b-us-east1" CurrentReplicas=3 UpdatedReplicas=2

and then the opposite. Neither is present in the logs.
Looking at the logs of the actual Scylla Pod (db-cluster-8d793a38/sct-cluster-us-east1-b-us-east1-2/system.log), you can also see that the node decommissioned successfully and then just kept hanging there, without the Pod being terminated.
Unfortunately I can't confirm that by looking at the Pod's manifest, since it randomly died much later in the process, and the creationTimestamp was overwritten.

So what actually happened, and what reproduces the issue is:

1. Deploy a ScyllaCluster with 3 members
2. Wait for ScyllaCluster to rollout
3. Reduce the members number in the ScyllaCluster object from 3 to 2
4. Immediately increase the members number in the ScyllaCluster object from 2 to 3
5. Wait for ScyllaCluster to rollout

The e2e test implementing this scenario:

var _ = g.Describe("ScyllaCluster", func() {
	defer g.GinkgoRecover()

	f := framework.NewFramework("scyllacluster")

	g.FIt("should support scale-up after decommissioning", func() {
		ctx, cancel := context.WithTimeout(context.Background(), testTimeout)
		defer cancel()

		sc := scyllafixture.BasicScyllaCluster.ReadOrFail()
		sc.Spec.Datacenter.Racks[0].Members = 3

		framework.By("Creating a ScyllaCluster with 3 members")
		sc, err := f.ScyllaClient().ScyllaV1().ScyllaClusters(f.Namespace()).Create(ctx, sc, metav1.CreateOptions{})
		o.Expect(err).NotTo(o.HaveOccurred())

		framework.By("Waiting for the ScyllaCluster to rollout (RV=%s)", sc.ResourceVersion)
		waitCtx1, waitCtx1Cancel := utils.ContextForRollout(ctx, sc)
		defer waitCtx1Cancel()
		sc, err = utils.WaitForScyllaClusterState(waitCtx1, f.ScyllaClient().ScyllaV1(), sc.Namespace, sc.Name, utils.WaitForStateOptions{}, utils.IsScyllaClusterRolledOut)
		o.Expect(err).NotTo(o.HaveOccurred())

		verifyScyllaCluster(ctx, f.KubeClient(), sc)
		hosts := getScyllaHostsAndWaitForFullQuorum(ctx, f.KubeClient().CoreV1(), sc)
		o.Expect(hosts).To(o.HaveLen(3))
		diRF1 := insertAndVerifyCQLData(ctx, hosts)
		defer diRF1.Close()

		framework.By("Scaling the ScyllaCluster down to 2 replicas")
		sc, err = f.ScyllaClient().ScyllaV1().ScyllaClusters(sc.Namespace).Patch(
			ctx,
			sc.Name,
			types.JSONPatchType,
			[]byte(`[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 2}]`),
			metav1.PatchOptions{},
		)
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(sc.Spec.Datacenter.Racks[0].Members).To(o.BeEquivalentTo(2))

		framework.By("Scaling the ScyllaCluster back to 3 replicas")
		sc, err = f.ScyllaClient().ScyllaV1().ScyllaClusters(f.Namespace()).Patch(
			ctx,
			sc.Name,
			types.JSONPatchType,
			[]byte(`[{"op": "replace", "path": "/spec/datacenter/racks/0/members", "value": 3}]`),
			metav1.PatchOptions{},
		)
		o.Expect(err).NotTo(o.HaveOccurred())
		o.Expect(sc.Spec.Datacenter.Racks[0].Members).To(o.BeEquivalentTo(3))

		framework.By("Waiting for the ScyllaCluster to rollout (RV=%s)", sc.ResourceVersion)
		waitCtx6, waitCtx6Cancel := utils.ContextForRollout(ctx, sc)
		defer waitCtx6Cancel()
		sc, err = utils.WaitForScyllaClusterState(waitCtx6, f.ScyllaClient().ScyllaV1(), sc.Namespace, sc.Name, utils.WaitForStateOptions{}, utils.IsScyllaClusterRolledOut)
		o.Expect(err).NotTo(o.HaveOccurred())

		verifyScyllaCluster(ctx, f.KubeClient(), sc)
	})
})

Nevertheless, I believe this case should be handled by the operator. Once we've started decommissioning the node, it should be deleted.
I've discussed it with @zimnx and he'll include this case in his recent PR #1283

@vponomaryov
Contributor Author

Although you can see in sct-8d793a38.log the following logs:

< t:2023-07-02 10:52:53,225 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `2' in sct-cluster's spec
< t:2023-07-02 10:52:53,351 f:local_cmd_runner.py l:62   c:LocalCmdRunner       p:DEBUG > Running command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2"...
...
< t:2023-07-02 10:52:55,672 f:base.py         l:222  c:LocalCmdRunner       p:DEBUG > pod/sct-cluster-us-east1-b-us-east1-2 condition met
< t:2023-07-02 10:52:55,685 f:base.py         l:142  c:LocalCmdRunner       p:DEBUG > Command "kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2" finished with status 0
...
< t:2023-07-02 10:52:59,934 f:__init__.py     l:2552 c:sdcm.cluster_k8s     p:DEBUG > Replace `/spec/datacenter/racks/0/members' with `3' in sct-cluster's spec

It doesn't look like the Pod was ever actually deleted.

Hmm, then what does the following mean?

Command "
    kubectl --cache-dir=/home/ubuntu/sct-results/20230702-080104-890728/.kube/http-cache \
        --namespace=scylla wait --timeout=21600s --for=delete pod sct-cluster-us-east1-b-us-east1-2
" finished with status 0

@rzetelskik
Member

rzetelskik commented Jul 6, 2023

Read the rest of the comment. Printing the log doesn't magically delete the Pod. As to what exactly happened there - I have no clue. All I can see from the operator logs is that the StatefulSet wasn't scaled down/up.

@vponomaryov
Contributor Author

Read the rest of the comment. Printing the log doesn't magically delete the Pod.

I read everything.
I am saying that in this case K8S returned an OK result when it shouldn't have.

@rzetelskik
Member

I understand that. All I'm saying is that it doesn't look like the Pod was actually deleted. It doesn't imply your test was wrong or anything. Without the audit logs I can't be a hundred percent sure that the Pod wasn't deleted, but looking at the operator logs, the deletion definitely wasn't triggered by the operator. The logs of the Scylla node don't look like the Pod was deleted either.
Anyway - I don't think it matters. Scaling back up before the Pod disappeared shouldn't cause the decommissioned Pod to stay in some limbo forever.
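
For reference, a quick way to observe the limbo state described above (a sketch only: the pod and namespace names come from the logs earlier in this issue, and it assumes the operator sidecar output ends up in the main scylla container's logs):

# The decommissioned pod is still present but stays NotReady
kubectl -n scylla get pod sct-cluster-us-east1-b-us-east1-2 -o wide

# The sidecar keeps logging "The node is already decommissioned" and failing readyz probes
kubectl -n scylla logs sct-cluster-us-east1-b-us-east1-2 -c scylla | grep -E 'decommissioned|readyz'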

@tnozicka tnozicka assigned zimnx and unassigned rzetelskik Oct 16, 2023
@tnozicka tnozicka added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 16, 2023