
[release-1.25] Add additional static pod cleanup during cluster reset #4726

Merged

Conversation

brandond
Contributor

@brandond commented Sep 1, 2023

Proposed Changes

Addresses an issue with hangs or crashes when starting servers after a cluster-reset, caused by etcd and/or the apiserver being restarted in an unexpected sequence.

Background is discussed at #4707 (comment):

  • Shut down the etcd static pod (and the kubelet, to keep the kubelet from restarting it) at the end of the cluster-reset process, so that etcd doesn't have to be restarted and reconfigured midway through the next start. Etcd is explicitly shut down at the end of the cluster-reset process on k3s; we just haven't wired up the context on RKE2.
  • Remove the apiserver static pod manifest during rke2 startup, so that the kubelet doesn't start it before it's been written with the current config - after etcd starts.
    (Still need to confirm that this doesn't do anything weird during normal restarts of the rke2 service.)
  • Use the absence of etcd db files on a node with etcd enabled as an indicator of cluster-reset, and force cleanup of the etcd and apiserver static pods early in startup. This prevents them from being restarted later, while the kubelet and embedded controllers are trying to talk to them. See the sketch after this list.
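
To make the third point concrete, here is a minimal sketch of the detection-and-cleanup logic, assuming rke2's default data directory layout. The paths, manifest file names, and helper functions (`etcdDBExists`, `removeStaticPods`) are illustrative assumptions, not the actual rke2 code:

```go
// Sketch only: treat a missing etcd database on a server where etcd is
// enabled as evidence of a just-completed cluster-reset, and delete the
// etcd and kube-apiserver static pod manifests before the kubelet can
// restart them with stale configuration.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dataDir is an assumed default; rke2 takes this from --data-dir.
const dataDir = "/var/lib/rancher/rke2"

// etcdDBExists reports whether the etcd member database is present on disk.
func etcdDBExists(dataDir string) bool {
	_, err := os.Stat(filepath.Join(dataDir, "server", "db", "etcd", "member"))
	return err == nil
}

// removeStaticPods deletes the named static pod manifests, ignoring any
// that are already absent.
func removeStaticPods(manifestDir string, names ...string) error {
	for _, name := range names {
		path := filepath.Join(manifestDir, name)
		err := os.Remove(path)
		switch {
		case err == nil:
			fmt.Printf("removed stale static pod manifest %s\n", path)
		case !os.IsNotExist(err):
			return fmt.Errorf("removing %s: %w", path, err)
		}
	}
	return nil
}

func main() {
	manifestDir := filepath.Join(dataDir, "agent", "pod-manifests")
	// On a server with etcd enabled (assumed here), a missing database means
	// the node was just reset. Clean up early so the kubelet and embedded
	// controllers don't end up talking to pods that were restarted with
	// pre-reset configuration.
	if !etcdDBExists(dataDir) {
		if err := removeStaticPods(manifestDir, "etcd.yaml", "kube-apiserver.yaml"); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

The key property is ordering: the manifests are gone before the kubelet scans the manifest directory, so it cannot restart the pods with stale configuration mid-startup.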

Also updates k3s.

Types of Changes

bugfix

Verification

  • See linked issue
  • In addition to the steps in the linked issue, you should also see several new messages at the end of the cluster-reset process:
    INFO[0067] Shutting down kubelet and etcd
    ERRO[0067] Kubelet exited: signal: killed
    INFO[0072] Managed etcd cluster membership has been reset, restart without --cluster-reset flag now. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
    
  • Confirm that there is no etcd process running after rke2 exits at the end of the cluster-reset.

Testing

Linked Issues

User-Facing Change

Further Comments

@brandond requested a review from a team as a code owner on September 1, 2023 17:38
Updates k3s: k3s-io/k3s@8d84d15...8fcbc2b

Signed-off-by: Brad Davidson <brad.davidson@rancher.com>
@brandond force-pushed the staticpod-sync-fix_release-1.25 branch from 9f829e1 to 77030f3 on September 1, 2023 18:33
@brandond merged commit 785512e into rancher:release-1.25 on Sep 1, 2023
1 check passed