How to deploy the controllers as DaemonSets, or at least redeploy in case of node failure? #52
Here is the CUE bit for timoni that I use: …
Is the Kubernetes control plane still working? I expect it to reschedule the pod on a different node. Maybe the toleration we set in Flux is too broad; I set it like this:
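A minimal sketch, assuming the broad form means the catch-all toleration that matches every taint (a reconstruction on my part, not the exact snippet):

```yaml
# Hypothetical pod-spec fragment reconstructing an overly broad
# toleration: with no key, operator Exists matches every taint,
# including node.kubernetes.io/unreachable:NoExecute, so the pod is
# never evicted from a node that has gone down.
tolerations:
  - operator: Exists
```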
It may well be that since etcd has no quorum, the control plane will no longer schedule pods anywhere. I suggest creating a cluster with 2 worker nodes, deploying Flux on one of the workers, making that node fail, and seeing if it gets rescheduled to the healthy node.
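One way to stand up that test topology, assuming kind is used (the tool and file name are my assumption; any cluster with two schedulable workers would do):

```yaml
# kind-config.yaml: one control-plane node plus two workers, so the
# Flux pod has a healthy node to move to when its worker is stopped.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker
```

Create the cluster with `kind create cluster --config kind-config.yaml`, deploy Flux onto one worker, then power that worker off and watch where the pod lands.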
I can still schedule pods.
If you describe the Flux pod, is there any hint in the events about some blocker to rescheduling?
Events show this: …
The full description: …
Hmm, why is …?
ReplicaSet: …
Deployment: …
Really odd, the ReplicaSet says …
I will see later today; I'll start over and redeploy the entire cluster, then test again and see if it has the same behaviour.
I guess if you delete the pod it will get rescheduled; this looks like some race condition in the Kubernetes scheduler, or the toleration makes it trip.
Correct :) A new pod was created and it's working fine.
Hmm, so it looks like it got stuck in Terminating, but why wasn't this status reflected in the ReplicaSet, and why didn't it time out? I wonder if this is some bug in Kubernetes.
The Terminating status only appeared after I ran: …
i.e. the Kubernetes dashboard and other pods also linger for some time in Terminating. Isn't the default some 15 min before those get removed by Kubernetes?
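For context on the timeout question: by default the DefaultTolerationSeconds admission plugin injects the tolerations below into any pod that does not already tolerate these taints, which is why ordinary pods get evicted from an unreachable node after about 5 minutes, not 15. A catch-all `operator: Exists` toleration already matches both taints, so such a pod gets no time bound and stays pinned to the dead node:

```yaml
# Tolerations injected by default (DefaultTolerationSeconds admission
# plugin) into pods that do not already tolerate these taints. Once a
# node has been not-ready or unreachable for 300 s, the pod is evicted
# and a replacement is scheduled on a healthy node.
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
```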
If you manage to reproduce this, it would be good to take snapshots of the Deployment and ReplicaSet and see what events are issued for them; I guess the earlier events expired, which is why there are none listed now.
If you can reproduce this, please add a toleration with kubectl edit like so, and retest:
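The exact edit is not shown here; judging by the follow-up about the parameter, a sketch of the idea, with `tolerationSeconds` bounding how long the pod rides out a failed node (the 60 s value is my assumption for a quick retest):

```yaml
# Hypothetical toleration for the Flux pod spec: keep tolerating the
# node-failure taints, but only for a bounded time, after which the
# pod is evicted and rescheduled onto a healthy node.
tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60   # assumed test value
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 60   # assumed test value
```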
Here are some attempts to log the events: …
And no events on the Deployment either.
I hope this helps. Now I'll edit Flux as you stated above and test again.
Still running fine. Tested sync with git and events.
ReplicaSet log:
Running on node "vmalmakms".
Now shutting down "vmalmakms"...
Old pod: …
No events on the Deployment; a new ReplicaSet is created: …
OK, so the `tolerationSeconds` parameter is what fixed it?
Hi, yes, without this parameter it just stays there forever in Running state. I would probably change it to ~5 min, the standard Kubernetes setting? Right now it reschedules way ahead of other pods.
Thanks @disi for all the tests. I have published the fix; rerunning … will apply it.
Original report:
In my testing, I created a cluster of three master nodes; all are untainted and can schedule normal pods.
Flux is only ever running on the node it was originally deployed on via timoni.
If this node goes down, the controllers are not deployed to other nodes.
`flux events` shows logs until the node went down.
The pods show as Running on the node that is down:
stream logs failed Get "https://10.0.2.22:10250/containerLogs/flux-system/flux-57bd866b6d-zbrfc/helm-controller?follow=true&sinceSeconds=300&tailLines=100&timestamps=true": dial tcp 10.0.2.22:10250: connect: …