Closed-loop monitoring of critical static pods #897

Closed
brandond opened this issue Apr 19, 2021 · 1 comment
Labels: kind/bug, kind/internal

brandond commented Apr 19, 2021

RKE2 currently generates static pod manifests only at startup. If the configuration captured in a static pod manifest is invalid and the pod crashloops, RKE2 neither detects nor handles the failure.

In K3s, many critical failures (such as the etcd server failing to join the cluster, as in #349) surface as a failure of the embedded etcd server. That causes the K3s process to exit, be restarted by systemd, and retry the join operation from the beginning. In RKE2 these failures are masked by the indirection of the static pod executor model.

We should monitor critical static pods (such as the etcd server, apiserver, etc.) and exit if they exit, as we do in K3s. This might be easiest if we configure the pods not to be restarted, and set up a goroutine to periodically check whether they are running?

@brandond brandond added this to the v1.21.1+rke2r1 milestone Apr 19, 2021
@brandond brandond added this to To Triage in Development [DEPRECATED] via automation Apr 19, 2021
@brandond brandond added kind/bug Something isn't working priority/important-soon labels Apr 19, 2021
@davidnuzik davidnuzik moved this from To Triage to Next Up in Development [DEPRECATED] Apr 19, 2021
@dereknola dereknola self-assigned this Jun 6, 2022
@dereknola dereknola moved this from Next Up to To Test in Development [DEPRECATED] Aug 4, 2022
@dereknola dereknola moved this from To Test to Backlog in Development [DEPRECATED] Aug 4, 2022
Development [DEPRECATED] automation moved this from Backlog to Done Issue / Merged PR Feb 7, 2023
@dereknola

This was completed and validated by QA under k3s-io/k3s#5649.
