RKE2 agent PANIC when control plane quorum has been lost for several hours #6130
Comments
It sounds like there's a race condition somewhere in the load-balancer retry logic, and that should be fixable. If you can run the agent with
There is a read lock held within the faulting function, so we must have missed holding the correct lock somewhere that's modifying the randomServers array. https://github.com/k3s-io/k3s/blob/v1.28.9%2Bk3s1/pkg/agent/loadbalancer/servers.go#L114-L135

edit: never mind, I see that this function holds a read lock on the list, but writes to the nextServerIndex field. Multiple goroutines could increment that at the same time, after its value has been clamped to the length of the array.
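To illustrate, here is a minimal sketch of the racy pattern described above; it is not the actual k3s code, just a struct shaped roughly like the loadbalancer (field names borrowed from the linked source). Because RLock admits many goroutines at once, two of them can both increment nextServerIndex after it was clamped, so the slice index can land at or past len(randomServers):

```go
package main

import (
	"fmt"
	"sync"
)

type loadBalancer struct {
	mutex           sync.RWMutex
	randomServers   []string
	nextServerIndex int
}

// nextServer reproduces the racy pattern: the read lock protects the slice
// from being swapped out, but nextServerIndex is mutated under it without
// any synchronization between concurrent readers.
func (lb *loadBalancer) nextServer() string {
	lb.mutex.RLock()
	defer lb.mutex.RUnlock()

	lb.nextServerIndex++ // unsynchronized write under a read lock
	if lb.nextServerIndex >= len(lb.randomServers) {
		lb.nextServerIndex = 0
	}
	// Another goroutine may increment the field between the clamp above and
	// this read, so the index can panic with "index out of range".
	return lb.randomServers[lb.nextServerIndex]
}

func main() {
	lb := &loadBalancer{randomServers: []string{"s1", "s2", "s3"}}
	var wg sync.WaitGroup
	for i := 0; i < 64; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 100000; j++ {
				lb.nextServer()
			}
		}()
	}
	wg.Wait()
	fmt.Println("completed without panicking this run; the race is timing-dependent")
}
```

Running this with `go run -race` flags the write to nextServerIndex immediately even on runs that don't panic. One possible fix sketch (not necessarily what the project will ship) is to take the full write lock around the index update, or to make the counter atomic and re-clamp it against the current slice length on every read.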
I'm currently unable to reproduce this after leaving multiple server nodes powered off for a 4-hour and then a 20-hour stretch. I can see the agent node correctly updating its load-balancer server lists without issue. I'm not saying I don't believe this happened, but I won't be able to reliably reproduce it and validate the fix.
$ kgn
$ sudo journalctl -u rke2-agent
// you can see a lot of restarts from the two nodes being down in the cluster
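Even without a live-cluster reproduction, a race like this can be validated deterministically with Go's race detector. Below is a hypothetical regression test sketch (the test name is mine, and it assumes the loadBalancer type from the sketch above lives in the same package); running it under `go test -race` reports the unsynchronized write on every run, whereas the panic itself is timing-dependent:

```go
package loadbalancer

import (
	"sync"
	"testing"
)

// TestNextServerIndexRace hammers nextServer from several goroutines so the
// race detector can observe the conflicting writes to nextServerIndex.
func TestNextServerIndexRace(t *testing.T) {
	lb := &loadBalancer{randomServers: []string{"s1", "s2", "s3"}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 1000; j++ {
				_ = lb.nextServer()
			}
		}()
	}
	wg.Wait()
}
```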
@sebastienmusso We've elected to close this because of our inability to reproduce it reliably. If you find that the behavior is not resolved on the releases that should hopefully be out in the next week or so, please feel free to open a new issue with what you find. Thank you for reporting your bug!
Environmental Info:
RKE2 Version:
v1.28.9+rke2r1
Node(s) CPU architecture, OS, and Version:
Linux 5.15.0-102-generic Ubuntu
Cluster Configuration:
3 servers
4 agents
Describe the bug:
We are testing how our application behaves when control plane quorum is lost.
We stopped 2 of the 3 control plane nodes and waited several hours.
At first, the ingress and the deployed applications were still available, but some hours later every connection hung and we could no longer reach the applications through the ingress.
On all agent nodes we got this Go panic:
With this panic, containerd and the kubelet, which are started by the rke2-agent process, were killed as well, and the containers became unavailable.
This panic happened on all agent nodes, but at different times (a few hours apart).
I know control plane quorum should not be lost for that long, but what if it is?
Steps To Reproduce:
Stop 2 of the 3 server nodes and let the cluster keep running
Expected behavior:
The rke2-agent process should continue running, or shut itself down cleanly with a clear error message
Actual behavior:
Panic from the rke2-agent process