Node replacement gets stuck #291

Closed
dkropachev opened this issue Dec 9, 2020 · 7 comments · Fixed by #316
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@dkropachev
Contributor

Describe the bug
Node replacement gets stuck.
The target node does not come back to life.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy the Scylla Operator (s-o)
  2. Deploy a Scylla cluster of 3 nodes
  3. Drain the 3rd node:
$ kubectl drain <svc> --ignore-daemonsets --delete-local-data 
  4. Command s-o to replace it (a quick label check is sketched after the status output below):
$ kubectl -n scylla label svc simple-cluster-us-east-1-us-east-1a-2 scylla/replace=""
  5. Look at the status and wait:
$ kubectl get pods -n scylla
NAME                                    READY   STATUS    RESTARTS   AGE
simple-cluster-us-east-1-us-east-1a-0   2/2     Running   0          17m
simple-cluster-us-east-1-us-east-1a-1   2/2     Running   0          16m
simple-cluster-us-east-1-us-east-1a-2   0/2     Pending   0          11m
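
(For completeness, the replace label from step 4 can be double-checked directly on the member service; this check is an editor's sketch, not part of the original report:)

$ kubectl get svc -n scylla -l scylla/replace --show-labels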

Expected behavior
The node comes back to life.

Config Files and Logs

logs_and_configs.zip

Environment:

  • Platform: minikube
  • Kubernetes version: v1.19.1
  • Scylla version: 4.2.0
  • Scylla-operator version: 0bce43e
@dkropachev added the kind/bug label on Dec 9, 2020
@dkropachev
Contributor Author

In that state the node does not produce any logs.

Deleting the pod kicked things through, but Scylla failed to start:

kubectl delete pod -n scylla simple-cluster-us-east-1-us-east-1a-2

kubectl get pods -n scylla
NAME                                    READY   STATUS    RESTARTS   AGE
simple-cluster-us-east-1-us-east-1a-0   2/2     Running   0          29m
simple-cluster-us-east-1-us-east-1a-1   2/2     Running   0          27m
simple-cluster-us-east-1-us-east-1a-2   1/2     Running   0          4m32s

kubectl logs -n scylla simple-cluster-us-east-1-us-east-1a-2 scylla
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down database: waiting for background jobs...
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down database was successful
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down migration manager notifier
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down migration manager notifier was successful
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down prometheus API server
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down prometheus API server was successful
INFO  2020-12-09 13:08:51,694 [shard 0] init - Shutting down sighup
INFO  2020-12-09 13:08:51,695 [shard 0] init - Shutting down sighup was successful
ERROR 2020-12-09 13:08:51,695 [shard 0] init - Startup failed: std::runtime_error (Cannot replace_address 10.111.163.220 because it doesn't exist in gossip)
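
(To verify that error from the cluster's side, one can ask a live node whether it still knows the address at all; an editor's sketch reusing the first node and the container name from the commands above:)

$ kubectl exec -n scylla simple-cluster-us-east-1-us-east-1a-0 -c scylla -- \
    nodetool status | grep 10.111.163.220 || echo "10.111.163.220 not a known member"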

@zimnx added the priority/important-soon label on Dec 9, 2020
@zimnx
Collaborator

zimnx commented Dec 15, 2020

@dkropachev do you have Scylla logs saved anywhere?
This runtime error suggests that the IP address we were trying to replace doesn't exist in the gossip info, but I can see in the Scylla Operator logs that the address is correct and was bound to the last service in this cluster.

The stuck node is going to be fixed in #297
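
(The binding mentioned above can be confirmed straight from the service object; the jsonpath below is an editor's sketch that just prints the member service's ClusterIP:)

$ kubectl get svc -n scylla simple-cluster-us-east-1-us-east-1a-2 \
    -o jsonpath='{.spec.clusterIP}{"\n"}'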

@dkropachev
Contributor Author

Unfortunately I did not grab them; tomorrow I will spin up a test environment and get them.

@dkropachev
Contributor Author

Also, it is worth mentioning that there are many ways we can end up with a missing IP address in gossip; I think s-o should check whether the address is present in gossip before forcing Scylla to replace it.
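
(One possible form of that pre-check, sketched by the editor rather than taken from the operator: ask a live node's gossip state for the address before labelling the service for replacement. The actual automation would belong in s-o itself.)

$ kubectl exec -n scylla simple-cluster-us-east-1-us-east-1a-0 -c scylla -- \
    nodetool gossipinfo | grep -q '/10.111.163.220' \
    && echo "address present in gossip, replace can proceed" \
    || echo "address missing from gossip, replace_address would fail"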

@dkropachev
Contributor Author

I still see that in some cases node replacement does not end up with a fully functioning node. Latest example:

db-cluster  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/db-cluster-9881ad6a.zip
monitor-set | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/monitor-set-9881ad6a.zip
loader-set  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/loader-set-9881ad6a.zip
kubernetes  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/kubernetes-9881ad6a.zip
sct-runner  | https://cloudius-jenkins-test.s3.amazonaws.com/9881ad6a-d801-464f-a19f-19c1b1d26231/20201223_200155/sct-runner-9881ad6a.zip

events:

61m         Normal    SuccessfulCreate       statefulset/sct-cluster-us-east1-b-us-east1                    create Claim data-sct-cluster-us-east1-b-us-east1-2 Pod sct-cluster-us-east1-b-us-east1-2 in StatefulSet sct-cluster-us-east1-b-us-east1 success
60m         Warning   Unhealthy              pod/sct-cluster-us-east1-b-us-east1-2                          Readiness probe failed: Get http://10.142.0.56:8080/readyz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
61m         Normal    Pulled                 pod/sct-cluster-us-east1-b-us-east1-2                          Container image "scylladb/scylla-manager-agent:2.2.0" already present on machine
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
61m         Normal    Synced                 scyllacluster/sct-cluster                                      Rack "us-east1" replaced "sct-cluster-us-east1-b-us-east1-2" node
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
0s          Warning   FailedScheduling       pod/sct-cluster-us-east1-b-us-east1-2                          persistentvolumeclaim "data-sct-cluster-us-east1-b-us-east1-2" not found
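
(The repeated "persistentvolumeclaim ... not found" failures above point at the PVC lifecycle during replacement. A quick way to inspect the claim's deletion state and finalizers, reusing the claim name from the events and leaving the namespace as a placeholder, is:)

$ kubectl get pvc -n <namespace> data-sct-cluster-us-east1-b-us-east1-2 \
    -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'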

@zimnx
Collaborator

zimnx commented Dec 28, 2020

@dkropachev how often do you encounter this issue?

@dkropachev
Contributor Author

> @dkropachev how often do you encounter this issue?

Every second run of the related Jenkins job, i.e. 1 out of 12-16 replacements.

zimnx added a commit that referenced this issue Dec 29, 2020
First initiate PVC deletion and then clear finalizers
to unblock PVC deletion.

Fixes #291
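
(For anyone hitting this before the fix above lands: the manual equivalent of that ordering is roughly the two commands below. The PVC name is inferred from the data-<pod> pattern seen in the events, and clearing finalizers by hand should be treated as a last resort.)

$ kubectl delete pvc -n scylla data-simple-cluster-us-east-1-us-east-1a-2 --wait=false
$ kubectl patch pvc -n scylla data-simple-cluster-us-east-1-us-east-1a-2 \
    --type merge -p '{"metadata":{"finalizers":null}}'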