cluster-reset fails occasionally when the quorum is lost #2682

ShylajaDevadiga · 2022-03-26T00:18:27Z

Environmental Info:
RKE2 Version: v1.23.5-rc3+rke2r1, v1.22.8-rc4+rke2r1

Node(s) CPU architecture, OS, and Version:
OS used for testing: rocky linux 8.4

Cluster Configuration:
4 node cluster, 3 server 1 agent

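Describe the bug:
Running cluster-reset after quorum loss occasionally fails with a fatal etcd error about a missing wal directory.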

Steps To Reproduce:
Create a 3 node cluster, all servers
Stop two servers to simulate quorum loss
Stop rke2 service on first node
Run cluster-reset

sudo systemctl stop rke2-server
sudo rke2 server --cluster-reset

Expected behavior:
The console should show a cluster-reset message, indicating that the server has been reset

Actual behavior:
etcd exits with a fatal "failed to purge wal file" error: "open /var/lib/rancher/rke2/server/db/etcd-tmp/member/wal: no such file or directory"

Additional context / logs:

{"level":"info","ts":"2022-03-25T23:56:41.153Z","caller":"embed/etcd.go:276","msg":"now serving peer/client/metrics","local-member-id":"35a3a0e4dc08adcd","initial-advertise-peer-urls":["http://127.0.0.1:2400"],"listen-peer-urls":["http://127.0.0.1:2400"],"advertise-client-urls":["http://127.0.0.1:2399"],"listen-client-urls":["http://127.0.0.1:2399"],"listen-metrics-urls":[]}
{"level":"info","ts":"2022-03-25T23:56:41.153Z","caller":"etcdserver/server.go:744","msg":"starting initial election tick advance","election-ticks":10}
{"level":"fatal","ts":"2022-03-25T23:56:41.153Z","caller":"etcdserver/server.go:874","msg":"failed to purge wal file","error":"open /var/lib/rancher/rke2/server/db/etcd-tmp/member/wal: no such file or directory","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).purgeFile\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.1-k3s1/etcdserver/server.go:874\ngo.etcd.io/etcd/server/v3/etcdserver.(*EtcdServer).GoAttach.func1\n\t/go/pkg/mod/github.com/k3s-io/etcd/server/v3@v3.5.1-k3s1/etcdserver/server.go:2661"}
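
To confirm whether the directory named in the fatal log entry is actually missing when the failure occurs, a check along these lines could be run on the affected node. This is only a sketch and not part of the original report: the function name check_wal_dir is hypothetical, and the default path is copied from the error message above.

```shell
#!/bin/sh
# Hypothetical helper (not part of rke2): report whether the etcd WAL
# directory exists. The default path is taken from the fatal log entry.
check_wal_dir() {
    wal_dir="${1:-/var/lib/rancher/rke2/server/db/etcd-tmp/member/wal}"
    if [ -d "$wal_dir" ]; then
        echo "present: $wal_dir"
        return 0
    else
        echo "missing: $wal_dir"
        return 1
    fi
}
```

Calling check_wal_dir with no argument checks the path from the log. Reading under /var/lib/rancher may require root privileges.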

ShylajaDevadiga · 2022-03-29T22:27:30Z

Validated cluster-reset as well as cluster-reset-restore functionality on v1.23.5-rc4+rke2r1, v1.22.8-rc5+rke2r1, v1.21.11-rc5+rke2r1

ShylajaDevadiga added the kind/bug Something isn't working label Mar 26, 2022

ShylajaDevadiga added this to the v1.23.5+rke2r1 milestone Mar 26, 2022

ShylajaDevadiga assigned brandond Mar 26, 2022

ShylajaDevadiga added this to To Triage in Development [DEPRECATED] via automation Mar 26, 2022

ShylajaDevadiga self-assigned this Mar 29, 2022

ShylajaDevadiga moved this from To Triage to To Test in Development [DEPRECATED] Mar 29, 2022

ShylajaDevadiga closed this as completed Mar 29, 2022

Development [DEPRECATED] automation moved this from To Test to Done Issue / Merged PR Mar 29, 2022
