GKE: Can't reduce cluster if kubernetes node is gone #215
Comments
So this is related to the handling of a lost host and how K8s / scylla-operator deals with it.
When a k8s node is gone, the PVC might still have node affinity pointing to the lost node. In this situation, the PVC is deleted by the Operator and the node replacement logic is triggered to restore cluster RF. Fixes #215
Tested the following scenario on GKE (with #258)
The Operator found that the node has a PV with node affinity set to a missing node, so it started replacing the node. Once the new node reached UN state, the decommission happened and the ScyllaCluster was successfully scaled down to 2.
So the scale-down won't happen immediately; the node must be replaced first.
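The check described above (a PV whose node affinity names a node that no longer exists) can be approximated by hand when debugging. This is only a rough sketch, not the Operator's actual implementation: the function names `find_orphaned_pvs` and `list_orphaned_pvs` are hypothetical, and it assumes local PVs whose first node-affinity term holds the node's hostname and that the hostname equals the Kubernetes node name.

```shell
# Print the names of PVs whose affinity node is not among the live nodes.
# $1: newline-separated live node names
# $2: newline-separated "pv-name node-name" pairs
find_orphaned_pvs() {
  live_nodes="$1"
  printf '%s\n' "$2" | while read -r pv node; do
    # Skip blank lines and PVs that carry no node affinity at all.
    if [ -z "$pv" ] || [ -z "$node" ]; then
      continue
    fi
    # -F fixed string, -x whole-line match, -q quiet.
    if ! printf '%s\n' "$live_nodes" | grep -Fxq -- "$node"; then
      printf '%s\n' "$pv"
    fi
  done
}

# Collect the inputs from the cluster and run the check.
# Assumption: the hostname sits in the first matchExpression of the first term.
list_orphaned_pvs() {
  nodes=$(kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
  pvs=$(kubectl get pv \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeAffinity.required.nodeSelectorTerms[0].matchExpressions[0].values[0]}{"\n"}{end}')
  find_orphaned_pvs "$nodes" "$pvs"
}
```

Running `list_orphaned_pvs` after `kubectl delete node ${nodeName}` should list the PVs that still point at the deleted node, which are the ones that would trigger a replacement before the scale-down can proceed.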
This is not happening on GKE; I tested multiple times. scylla-operator stops doing anything and reports a TLS handshake error:
Test-id: 5c40dbc7-ef2b-4792-96cf-b17739c0fbab
db-cluster: https://cloudius-jenkins-test.s3.amazonaws.com/5c40dbc7-ef2b-4792-96cf-b17739c0fbab/20201204_202713/db-cluster-5c40dbc7.zip
Test-id: 65b0a4a9-232e-49ed-9198-820a3e4bc7d3
db-cluster: https://cloudius-jenkins-test.s3.amazonaws.com/65b0a4a9-232e-49ed-9198-820a3e4bc7d3/20201206_074304/db-cluster-65b0a4a9.zip
Events in scylla namespace:
Finalizers in PVC caused a race between statefulset controller and pvc provisioner. Pod spawned on next available node was missing PVC and manual intervention was needed. Removing finalizers from PVC prior to PVC and Pod deletion seems to help. Fixes #215
I managed to reproduce it locally twice in 20 runs on minikube and GKE.
First initiate PVC deletion and then clear finalizers to unblock PVC deletion. Fixes #215
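The ordering in that last change (start the deletion, then clear finalizers) can be reproduced manually with kubectl. The following is a minimal sketch under stated assumptions, not the Operator's code: the function name `force_delete_pvc` is hypothetical, and clearing finalizers bypasses PVC protection, so it is only appropriate when the backing node is already gone.

```shell
# Delete a PVC that is stuck behind its finalizers.
# $1: namespace, $2: PVC name
force_delete_pvc() {
  ns="$1"
  pvc="$2"
  # Step 1: request deletion without blocking. With finalizers present,
  # the PVC stays in Terminating state instead of disappearing.
  kubectl -n "$ns" delete pvc "$pvc" --wait=false || return 1
  # Step 2: clear the finalizers so the pending deletion can complete.
  kubectl -n "$ns" patch pvc "$pvc" --type=merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

A call would look like `force_delete_pvc scylla <pvc-name>`, where the PVC name is whatever `kubectl get pvc -n scylla` shows for the pod that lived on the lost node. Doing the steps in this order means the object is already marked for deletion when the finalizers are removed, so nothing can re-add them in between.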
Describe the bug
Can't reduce cluster if kubernetes node is gone
To Reproduce
Steps to reproduce the behavior:
kubectl apply -n scylla -f ./examples/eks/cluster.yaml
kubectl get pods -n scylla -o yaml | grep nodeName
kubectl delete node ${nodeName}
sed -r -i 's/([ \t]+)members: 3/\1members: 2/g' ./examples/eks/cluster.yaml
kubectl apply -n scylla -f ./examples/eks/cluster.yaml
kubectl --namespace=scylla wait --timeout=5m --all --for=condition=Ready pod
error: timed out waiting for the condition on pods/sct-cluster-us-east1-c-us-east1-2
Expected behavior
The node is removed
The next attempt to increase the number of members succeeds
Logs
operator.zip
Environment: