
GKE: Can't reduce cluster if kubernetes node is gone #215

Closed
dkropachev opened this issue Oct 23, 2020 · 5 comments · Fixed by #258 or #297
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@dkropachev
Contributor

Describe the bug
Can't reduce cluster if kubernetes node is gone

To Reproduce
Steps to reproduce the behavior:

  1. Deploy scylla-operator on GKE
  2. Deploy scylla cluster with at least 2 nodes:
    kubectl apply -n scylla -f ./examples/eks/cluster.yaml
  3. Get nodeName of the last pod:
    kubectl get pods -n scylla -o yaml | grep nodeName
  4. Kill kubernetes node:
    kubectl delete node ${nodeName}
  5. Reduce scylla cluster by 1 member:
    sed -r 's/([ \t]+)members: 3/\1members: 2/g' ./examples/eks/cluster.yaml
    kubectl apply -n scylla -f ./examples/eks/cluster.yaml
  6. Wait till node is removed:
    kubectl --namespace=scylla wait --timeout=5m --all --for=condition=Ready pod
    error: timed out waiting for the condition on pods/sct-cluster-us-east1-c-us-east1-2
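
For convenience, the steps above can be wrapped in a small script. This is only a sketch and assumes the same namespace (scylla), manifest path (./examples/eks/cluster.yaml), and pod naming as above; adjust before use.

#!/usr/bin/env bash
set -euo pipefail

# Step 3: node of the last Scylla pod (same grep as above, taking the last match).
nodeName=$(kubectl get pods -n scylla -o yaml | grep nodeName | tail -n1 | awk '{print $2}')

# Step 4: kill the Kubernetes node.
kubectl delete node "${nodeName}"

# Step 5: reduce the Scylla cluster by 1 member and re-apply (edits the manifest in place).
sed -r -i 's/([ \t]+)members: 3/\1members: 2/g' ./examples/eks/cluster.yaml
kubectl apply -n scylla -f ./examples/eks/cluster.yaml

# Step 6: wait until the cluster settles.
kubectl --namespace=scylla wait --timeout=5m --all --for=condition=Ready pod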

Expected behavior
Node is removed
The next attempt to increase the number of members succeeds

Logs
operator.zip

Environment:

  • Platform: GKE
  • Kubernetes version: 1.15.12-gke.20
  • Scylla version: 4.1.2
  • Scylla-operator version: 0.2.4
@dkropachev dkropachev added the kind/bug Categorizes issue or PR as related to a bug. label Oct 23, 2020
@zimnx zimnx added this to the 1.0 milestone Oct 27, 2020
@slivne

slivne commented Oct 27, 2020

So this is related to the handling of a lost host and how Kubernetes / scylla-operator handle that.

@slivne slivne added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Oct 27, 2020
zimnx added commits that referenced this issue Nov 20, 2020
When k8s node is gone, PVC might still have node affinity pointing
to lost node. In this situation, PVC is deleted by the Operator
and node replacement logic is triggered to restore cluster RF.

Fixes #215
Fixes #114
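
The node affinity mentioned in the commit message lives on the PV bound to the member's data PVC. A quick way to inspect it (a sketch only; the PVC name follows the data-<pod name> convention used by the cluster in this report, and this assumes the local-volume setup where the PV carries a kubernetes.io/hostname affinity):

$ pvc=data-sct-cluster-us-east1-c-us-east1-2   # data-<pod name>, adjust to your cluster
$ pv=$(kubectl -n scylla get pvc "${pvc}" -o jsonpath='{.spec.volumeName}')
$ kubectl get pv "${pv}" -o jsonpath='{.spec.nodeAffinity}{"\n"}'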
@zimnx
Collaborator

zimnx commented Nov 20, 2020

Tested the following scenario on GKE (with #258):

  1. Deploy 3 node cluster
  2. kubectl delete node <last_pod_node>
  3. Decrease number of Scylla Cluster members to 2

The Operator found that the node had a PV with node affinity set to the missing node, so it started replacing the node. Once the new node reached UN state, decommission happened and the ScyllaCluster was successfully scaled down to 2.

$ kubectl -n scylla get pods -w
scylla-cluster-europe-west2-a-europe-west2-2   2/2     Terminating       0          4m12s
scylla-cluster-europe-west2-a-europe-west2-2   2/2     Terminating       0          4m12s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Pending           0          0s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Pending           0          1s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Init:0/2          0          1s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Init:1/2          0          5s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     PodInitializing   0          6s
scylla-cluster-europe-west2-a-europe-west2-2   1/2     Running           0          51s
scylla-cluster-europe-west2-a-europe-west2-2   2/2     Running           0          3m57s
scylla-cluster-europe-west2-a-europe-west2-2   1/2     Running           0          4m27s
scylla-cluster-europe-west2-a-europe-west2-2   1/2     Terminating       0          5m1s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Terminating       0          5m4s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Terminating       0          5m11s
scylla-cluster-europe-west2-a-europe-west2-2   0/2     Terminating       0          5m11s

$ kubectl -n scylla get pods -o wide                                              
NAME                                           READY   STATUS    RESTARTS   AGE   IP            NODE                                          NOMINATED NODE   READINESS GATES
scylla-cluster-europe-west2-a-europe-west2-0   2/2     Running   0          13m   10.240.0.58   gke-maciej-215-3-default-pool-3e692d62-73g9   <none>           <none>
scylla-cluster-europe-west2-a-europe-west2-1   2/2     Running   0          11m   10.240.0.71   gke-maciej-215-3-default-pool-3e692d62-crqt   <none>           <none>

So the scale-down won't happen immediately; the node must be replaced first.
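
To watch that sequence from the Scylla side, nodetool on one of the surviving members shows the replacement joining and the old member leaving (a sketch; the pod name is taken from the output above, and the scylla container name of the 2/2 pod is assumed):

$ kubectl -n scylla exec -it scylla-cluster-europe-west2-a-europe-west2-0 -c scylla -- nodetool status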

zimnx added commits that referenced this issue Nov 23, 2020
When k8s node is gone, PVC might still have node affinity pointing
to lost node. In this situation, PVC is deleted by the Operator
and node replacement logic is triggered to restore cluster RF.

Fixes #114
Fixes #215
@dkropachev dkropachev reopened this Dec 7, 2020
@dkropachev
Contributor Author

> Tested the following scenario on GKE (with #258): […] So the scale-down won't happen immediately; the node must be replaced first.

It is not happening on GKE; tested multiple times.

scylla-operator stops doing anything further and keeps logging TLS handshake errors:

{"L":"INFO","T":"2020-12-04T19:31:48.383Z","N":"cluster-controller.replace","M":"Replace member Pod found","cluster":"scylla/sct-cluster","resourceVersion":"17507","member":"sct-cluster-us-east1-b-us-east1-2","replace_address":"10.3.249.58","ready":false,"_trace_id":"0pcVUQ4vRLuPa2RAdOlPGQ"}
{"L":"INFO","T":"2020-12-04T19:31:48.383Z","N":"cluster-controller","M":"Reconciliation successful","cluster":"scylla/sct-cluster","resourceVersion":"17507","_trace_id":"UWKHDuVeSG6hQFsMTS4qJg"}
2020/12/04 19:31:48 http: TLS handshake error from 10.0.3.1:54476: EOF
2020/12/04 19:31:58 http: TLS handshake error from 10.0.3.1:54516: EOF
2020/12/04 19:32:08 http: TLS handshake error from 10.0.3.1:54564: EOF
2020/12/04 19:32:18 http: TLS handshake error from 10.0.3.1:54598: EOF
2020/12/04 19:32:28 http: TLS handshake error from 10.0.3.1:54648: EOF
2020/12/04 19:32:38 http: TLS handshake error from 10.0.3.1:54678: EOF
2020/12/04 19:32:48 http: TLS handshake error from 10.0.3.1:54716: EOF
2020/12/04 19:32:58 http: TLS handshake error from 10.0.3.1:54744: EOF
2020/12/04 19:33:08 http: TLS handshake error from 10.0.3.1:54782: EOF
2020/12/04 19:33:18 http: TLS handshake error from 10.0.3.1:54812: EOF

Test-id: 5c40dbc7-ef2b-4792-96cf-b17739c0fbab

db-cluster: https://cloudius-jenkins-test.s3.amazonaws.com/5c40dbc7-ef2b-4792-96cf-b17739c0fbab/20201204_202713/db-cluster-5c40dbc7.zip
kubernetes: https://cloudius-jenkins-test.s3.amazonaws.com/5c40dbc7-ef2b-4792-96cf-b17739c0fbab/20201204_202713/kubernetes-5c40dbc7.zip

Test-id: 65b0a4a9-232e-49ed-9198-820a3e4bc7d3

db-cluster: https://cloudius-jenkins-test.s3.amazonaws.com/65b0a4a9-232e-49ed-9198-820a3e4bc7d3/20201206_074304/db-cluster-65b0a4a9.zip
kubernetes: https://cloudius-jenkins-test.s3.amazonaws.com/65b0a4a9-232e-49ed-9198-820a3e4bc7d3/20201206_074304/kubernetes-65b0a4a9.zip

@dkropachev
Contributor Author

Events in scylla namespace:

0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
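
To confirm that the claim is actually gone rather than just unbound, the PVC list and the pod's events can be checked directly; a quick sketch using the names from the events above:

$ kubectl -n scylla get pvc
$ kubectl -n scylla describe pod sct-cluster-us-east1-b-us-east1-2
$ kubectl -n scylla get events --field-selector involvedObject.name=sct-cluster-us-east1-b-us-east1-2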

zimnx added commits that referenced this issue Dec 15, 2020
Finalizers in PVC caused a race between statefulset controller
and pvc provisioner. Pod spawned on next available node was missing
PVC and manual intervention was needed.
Removing finalizers from PVC prior to PVC and Pod deletion seems to
help.

Fixes #215
@zimnx
Collaborator

zimnx commented Dec 18, 2020

I managed to reproduce it locally two times over 20 runs on minikube and GKE.
I set up a watch on k8s etcd, and it turns out that sometimes the PVC was deleted after the Pod; when that happened, the new Pod wasn't able to spawn due to the above issue.
With the PVC deleted forcefully, the issue no longer happened after 15 runs. I'll leave it running for the next couple of runs to be extra sure.
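
A lighter-weight way to observe that ordering, without watching etcd, is to watch the Pod and PVC in parallel; a sketch (run the two watches in separate terminals):

$ kubectl -n scylla get pvc -w
$ kubectl -n scylla get pods -w    # in a second terminal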

zimnx added a commit that referenced this issue Dec 18, 2020
Finalizers in PVC caused a race between statefulset controller
and pvc provisioner. Pod spawned on next available node was missing
PVC and manual intervention was needed.
Removing finalizers from PVC prior to PVC and Pod deletion seems to
help.

Fixes #215
zimnx added a commit that referenced this issue Dec 29, 2020
First initiate PVC deletion and then clear finalizers
to unblock PVC deletion.

Fixes #215
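
For anyone stuck on an operator build without this fix, the manual equivalent of the sequence described in the commit would look roughly like this (a sketch only; the PVC name is the one from the events above, and kubernetes.io/pvc-protection is the finalizer that normally blocks deletion while a Pod still references the claim):

$ kubectl -n scylla delete pvc data-sct-cluster-us-east1-b-us-east1-2 --wait=false
$ kubectl -n scylla patch pvc data-sct-cluster-us-east1-b-us-east1-2 --type=merge -p '{"metadata":{"finalizers":null}}'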