Closed-loop monitoring of critical static pods #897

brandond · 2021-04-19T23:32:27Z

RKE2 currently only generates manifests for static pods on startup. If the configuration captured in the static pod manifest is not valid and the pods crashloop, RKE2 does not detect or handle this.

In K3s, many critical failures (such as failure of the etcd server to join the cluster as in #349) are detected via the embedded etcd server failing, which causes the K3s process to exit, be restarted by systemd, and retry the join operation from the beginning. In RKE2 these failures are masked by the indirection of the static pod executor model.

We should monitor critical static pods (such as the etcd server, apiserver, etc.) and exit if they exit, as we do in K3s. This might be easiest if we configure the pods not to be restarted and set up a goroutine to periodically check whether they are running.

dereknola · 2023-02-07T17:44:37Z

This was completed and validated by QA under k3s-io/k3s#5649.

brandond added this to the v1.21.1+rke2r1 milestone Apr 19, 2021

brandond added this to To Triage in Development [DEPRECATED] via automation Apr 19, 2021

brandond added the kind/bug and priority/important-soon labels Apr 19, 2021

davidnuzik moved this from To Triage to Next Up in Development [DEPRECATED] Apr 19, 2021

davidnuzik mentioned this issue Apr 20, 2021

Servers fail to join cluster if multiple nodes are joined concurrently or image import/pull takes more than 5 minutes #349

Closed

cjellick modified the milestones: v1.21.1+rke2r1, v1.21 - Backlog May 12, 2021

davidnuzik added the kind/internal label May 12, 2021

cjellick modified the milestones: v1.21 - Backlog, v1.22 - Backlog Jun 28, 2021

brandond mentioned this issue Sep 13, 2021

Set controller authorization-kubeconfig and authentication-kubeconfig k3s-io/k3s#4007

Merged

brandond mentioned this issue Sep 22, 2021

RKE2 multi master should be easy to install/friendly #1871

Closed

brandond mentioned this issue Oct 5, 2021

[docs] In HA installs, wait for the first node to be ready before joining others #895

Closed

brandond mentioned this issue Jun 1, 2022

rke2-server.service fails to activate during systemctl restart following rke2 secrets-encrypt reencrypt #3004

Closed

dereknola modified the milestones: v1.22 - Backlog, v1.24 - Backlog Jun 1, 2022

dereknola self-assigned this Jun 6, 2022

dereknola mentioned this issue Jun 6, 2022

Delay service readiness until after startuphooks have finished k3s-io/k3s#5649

Merged

dereknola moved this from Next Up to To Test in Development [DEPRECATED] Aug 4, 2022

dereknola moved this from To Test to Backlog in Development [DEPRECATED] Aug 4, 2022

caroline-suse-rancher modified the milestones: v1.24 - Backlog, Backlog Nov 16, 2022

dereknola closed this as completed Feb 7, 2023

Development [DEPRECATED] automation moved this from Backlog to Done Issue / Merged PR Feb 7, 2023
