Node replacement gets stuck #291

Closed
dkropachev opened this issue Dec 9, 2020 · 7 comments · Fixed by #316
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@dkropachev
Contributor

Describe the bug
Node replacement gets stuck.
The target node does not come back to life.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Scylla Operator (s-o)
  2. Deploy a Scylla cluster of 3 nodes
  3. Drain the 3rd node:
$ kubectl drain <svc> --ignore-daemonsets --delete-local-data 
  4. Command s-o to replace it (a quick label check is sketched after the status output below):
$ kubectl -n scylla label svc simple-cluster-us-east-1-us-east-1a-2 scylla/replace=""
  5. Look at the status and wait:
$ kubectl get pods -n scylla
NAME                                    READY   STATUS    RESTARTS   AGE
simple-cluster-us-east-1-us-east-1a-0   2/2     Running   0          17m
simple-cluster-us-east-1-us-east-1a-1   2/2     Running   0          16m
simple-cluster-us-east-1-us-east-1a-2   0/2     Pending   0          11m
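
(For completeness, the replace label from step 4 can be double-checked directly on the member service; this check is an editor's sketch, not part of the original report:)

$ kubectl get svc -n scylla -l scylla/replace --show-labels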

Expected behavior
The node comes back to life.

Config Files and Logs

logs_and_configs.zip

Environment:

  • Platform: minikube
  • Kubernetes version: v1.19.1
  • Scylla version: 4.2.0
  • Scylla-operator version: 0bce43e
@dkropachev added the kind/bug label on Dec 9, 2020
@dkropachev
Contributor Author

In that state the node does not produce any logs.

Deleting the pod kicked things through, but Scylla failed to start:

kubectl delete pod -n scylla simple-cluster-us-east-1-us-east-1a-2

kubectl get pods -n scylla
NAME                                    READY   STATUS    RESTARTS   AGE
simple-cluster-us-east-1-us-east-1a-0   2/2     Running   0          29m
simple-cluster-us-east-1-us-east-1a-1   2/2     Running   0          27m
simple-cluster-us-east-1-us-east-1a-2   1/2     Running   0          4m32s

kubectl logs -n scylla simple-cluster-us-east-1-us-east-1a-2 scylla
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down database: waiting for background jobs...
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down database was successful
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down migration manager notifier
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down migration manager notifier was successful
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down prometheus API server
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down prometheus API server was successful
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down sighup
INFO  2020-12-09 13:08:51,695 [shard 0] init - Shutting down sighup was successful
ERROR 2020-12-09 13:08:51,695 [shard 0] init - Startup failed: std::runtime_error (Cannot replace_address 10.111.163.220 because it doesn't exist in gossip)
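
(To verify that error from the cluster's side, one can ask a live node whether it still knows the address at all; an editor's sketch reusing the first node and the container name from the commands above:)

$ kubectl exec -n scylla simple-cluster-us-east-1-us-east-1a-0 -c scylla -- \
    nodetool status | grep 10.111.163.220 || echo "10.111.163.220 not a known member"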

@zimnx added the priority/important-soon label on Dec 9, 2020
@zimnx
Collaborator

zimnx commented Dec 15, 2020

@dkropachev do you have Scylla logs saved anywhere?
This runtime error suggests that the IP address we were trying to replace doesn't exist in the gossip info, but I can see in the Scylla Operator logs that the address is correct and was bound to the last service in this cluster.

The stuck node is going to be fixed in #297
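
(The binding mentioned above can be confirmed straight from the service object; the jsonpath below is an editor's sketch that just prints the member service's ClusterIP:)

$ kubectl get svc -n scylla simple-cluster-us-east-1-us-east-1a-2 \
    -o jsonpath='{.spec.clusterIP}{"\n"}'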

@dkropachev
Contributor Author

Unfortunately I did not grab them; tomorrow I will spin up a test environment and get them.

@dkropachev
Contributor Author

Also, it is worth mentioning that there are many ways we can end up with a missing IP address in gossip; I think s-o should check whether the address is present in gossip before forcing Scylla to replace it.
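
(One possible form of that pre-check, sketched by the editor rather than taken from the operator: ask a live node's gossip state for the address before labelling the service for replacement. The actual automation would belong in s-o itself.)

$ kubectl exec -n scylla simple-cluster-us-east-1-us-east-1a-0 -c scylla -- \
    nodetool gossipinfo | grep -q '/10.111.163.220' \
    && echo "address present in gossip, replace can proceed" \
    || echo "address missing from gossip, replace_address would fail"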

@dkropachev
Contributor Author

I still see that in some cases node replacement does not end up with a fully functioning node. Latest example:

db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/db-cluster-9881ad6a.zip
monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/monitor-set-9881ad6a.zip
loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/loader-set-9881ad6a.zip
kubernetes  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/kubernetes-9881ad6a.zip
sct-runner  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/sct-runner-9881ad6a.zip

events:

61m         Normal    SuccessfulCreate       statefulset/sct-cluster-us-east1-b-us-east1                    create Claim data-sct-cluster-us-east1-b-us-east1-2 Pod sct-cluster-us-east1-b-us-east1-2 in StatefulSet sct-cluster-us-east1-b-us-east1 success
60m         Warning   Unhealthy              pod/sct-cluster-us-east1-b-us-east1-2                          Readiness probe failed: Get http://10.142.0.56:8080/readyz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
61m         Normal    Pulled                 pod/sct-cluster-us-east1-b-us-east1-2                          Container image "scylladb/scylla-manager-agent:2.2.0" already present on machine
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
61m         Normal    Synced                 scyllacluster/sct-cluster                                      Rack "us-east1" replaced "sct-cluster-us-east1-b-us-east1-2" node
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
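
(The repeated "persistentvolumeclaim ... not found" failures above point at the PVC lifecycle during replacement. A quick way to inspect the claim's deletion state and finalizers, reusing the claim name from the events and leaving the namespace as a placeholder, is:)

$ kubectl get pvc -n <namespace> data-sct-cluster-us-east1-b-us-east1-2 \
    -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'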

@zimnx
Collaborator

zimnx commented Dec 28, 2020

@dkropachev how often do you encounter this issue?

@dkropachev
Contributor Author

> @dkropachev how often do you encounter this issue?

Every second run of the related Jenkins job, i.e. 1 out of 12-16 replacements.

zimnx added a commit that referenced this issue Dec 29, 2020
First initiate PVC deletion and then clear finalizers
to unblock PVC deletion.

Fixes #291
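
(For anyone hitting this before the fix above lands: the manual equivalent of that ordering is roughly the two commands below. The PVC name is inferred from the data-<pod> pattern seen in the events, and clearing finalizers by hand should be treated as a last resort.)

$ kubectl delete pvc -n scylla data-simple-cluster-us-east-1-us-east-1a-2 --wait=false
$ kubectl patch pvc -n scylla data-simple-cluster-us-east-1-us-east-1a-2 \
    --type merge -p '{"metadata":{"finalizers":null}}'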