After etcd test, external etcd cluster can't come up (3x etcd, 2x control-plane) #6193
Comments
Check the etcd pod logs on the two etcd nodes that you've brought back up, to see why they are not finding each other and coming up.
When I shut down ml-etcd-0 and ml-etcd-2, only ml-etcd-1 is still working. (The IP 192.168.0.5 is the IP of ml-master-0.) Logs from ml-etcd-1:
Logs from ml-master-0:
Are you using a load-balancer for the --server address, or did you just point all the nodes at a single server? Preferably you would be using a load-balancer... When building the cluster and using one of the servers as the --server address, it should look like this:
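For example, split-role join configs pointing --server at ml-etcd-0 might look roughly like the following (a minimal sketch; the token placeholder and exact option set are assumptions, not copied from this thread):

```yaml
# /etc/rancher/rke2/config.yaml on ml-etcd-1 and ml-etcd-2 (etcd-only nodes).
# ml-etcd-0 is assumed to be the bootstrap node and has no `server:` line itself.
server: https://ml-etcd-0:9345        # join through an etcd node, not a control-plane node
token: <cluster-token>                # shared cluster token (placeholder)
disable-apiserver: true               # this node runs only etcd
disable-controller-manager: true
disable-scheduler: true
```

```yaml
# /etc/rancher/rke2/config.yaml on ml-master-0 and ml-master-1 (control-plane-only nodes)
server: https://ml-etcd-0:9345        # also point at an etcd node
token: <cluster-token>
disable-etcd: true                    # no local etcd member on this node
```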
When bringing the cluster back up, you should ensure that ml-etcd-0 and one of ml-etcd-1 or ml-etcd-2 are up before starting ml-master-0 and ml-master-1. If you're using an external load-balancer as the --server address, make sure that it is actually health-checking the nodes; otherwise startup may fail due to the LB attempting to send connections to nodes that are not up.
You are right, I was wrong: when I installed the cluster I used server: ml-control-plane-0, not ml-etcd-0.
In the version with a load balancer, does ml-etcd-0 have to be online when restarting the cluster?
When bringing up a split-role cluster the etcd nodes need to come up first, and they need to point at an etcd node as the server. The control-plane nodes can come up after that, and should also point at an etcd node. The apiserver on control-plane nodes can't run without the datastore, for obvious reasons. If (for example) ml-etcd-0 is unavailable, you could point the node at one of the other etcd members instead.
If you had an LB or DNS alias this would be much easier; if you're picking a single node to join against, you just need to be sure that the node is actually available - if the original node is down, then pick a new one.
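As a concrete illustration of "pick a new one" (a sketch, assuming the default config file location and that ml-etcd-1 is still up): a node that needs to (re)join while ml-etcd-0 is down can be pointed at a surviving etcd member before the service is started.

```yaml
# /etc/rancher/rke2/config.yaml on a node that needs to (re)join while ml-etcd-0 is down.
# Only the server line changes; the rest of the config stays the same.
server: https://ml-etcd-1:9345   # any etcd member that is currently up
token: <cluster-token>
# after editing, restart the service, e.g. `systemctl restart rke2-server`
```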
Ok, thanks for the quick answer, I understand. When I tested the infrastructure I saw that when I power off two of the etcd instances, I can't restart the cluster without ml-etcd-0. With a load balancer, should I add all 5 instances (etcd + control-plane) as targets for ports 9345 and 6443?
Technically agents can join against any of the 5 servers, but since servers need to join an etcd node (or there needs to at least be one etcd node in the cluster at the time they join), I would probably point the LB at the 3 etcd nodes. The LB is just used at startup; agents reconnect directly to the servers without going through the LB once they are started.
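For example (a sketch with a made-up hostname): if a load balancer or DNS alias fronts only the three etcd nodes on port 9345, every joining node can use the same fixed registration address, and nothing has to be edited when a single etcd node is down.

```yaml
# /etc/rancher/rke2/config.yaml on any joining server or agent node.
# rke2-register.example.internal is a hypothetical LB/DNS name whose targets are
# ml-etcd-0, ml-etcd-1 and ml-etcd-2 on port 9345.
server: https://rke2-register.example.internal:9345
token: <cluster-token>
# The LB is only used at registration/startup; agents reconnect to the servers directly afterwards.
```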
I have created that architecture. Load balancer: targets etcd-0, etcd-1, etcd-2, port 9345. Now I am testing simulated server failures: I powered off etcd-0 and etcd-1 and after 1 minute turned on only etcd-1. The cluster came up automatically, but there were some issues: control-plane-1 did not come up (I had to restart the rke2 process) and etcd took very long (5 minutes) to make a connection. I think that a cool feature to add would be
If you're looking for a simpler architecture you might just consider having the 3 server nodes be etcd+control-plane (a config sketch for that layout follows this comment). Is there something in particular that you're getting out of splitting those up?
Did you also stop the control-plane nodes, or did you leave them running while the etcd nodes were down?
That might be possible at some point in the future, but you wouldn't be able to use any of the etcd snapshot management stuff built into rke2 or rancher...
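For comparison, the simpler combined-role layout mentioned above could look roughly like this (node names are placeholders; with no disable-* options, each server runs both etcd and the control plane):

```yaml
# /etc/rancher/rke2/config.yaml on the second and third server nodes
# (the first server bootstraps the cluster and needs no `server:` line).
server: https://ml-server-0:9345   # placeholder name for the first combined etcd+control-plane node
token: <cluster-token>
# no disable-apiserver / disable-etcd flags: each node runs etcd and the control plane
```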
I know, I have already tested this configuration. I am looking at external etcd because I'm already planning for what will happen when the cluster grows; starting with etcd + control-plane is fine.
The control plane was running without a restart.
Thanks for the help.
I have been testing my cluster configuration: 3x etcd, 2x control-plane, 1x worker.
For that I switched off 2 etcd instances and powered on 1 etcd instance (ml-etcd-0), but after that the etcd cluster can't come up.
Environmental Info:
RKE2 Version: v1.30.1+rke2r1
Node(s) CPU architecture, OS, and Version:
amd64, Ubuntu 24.04
Cluster Configuration:
I have this architecture:
ml-etcd-0 Ready etcd 22m v1.30.1+rke2r1
ml-etcd-1 Ready etcd 22m v1.30.1+rke2r1
ml-etcd-2 Ready etcd 22m v1.30.1+rke2r1
ml-master-0 Ready control-plane,master 22m v1.30.1+rke2r1
ml-master-1 Ready control-plane,master 20m v1.30.1+rke2r1
ml-worker-0 Ready <none> 19m v1.30.1+rke2r1
Describe the bug:
When I shut down the ml-etcd-0 and ml-etcd-1 instances and then start the ml-etcd-0 instance again, 2/3 instances are running, but the control plane is not working. The etcd cluster tries to connect to the other etcd instances but receives error 500.
Logs from ml-etcd-0
Expected behavior:
After shutting down 2 of the 3 etcd instances the cluster should not be working, but after starting one of them again (restoring 2/3 members, which is enough for quorum) the cluster should work again.