
After etcd test, external etcd cluster can't come up (3x etcd, 2x control-plane) #6193

Closed

gawsoftpl opened this issue Jun 13, 2024 · 11 comments

@gawsoftpl commented Jun 13, 2024

I have been testing my cluster configuration: 3x etcd, 2x control-plane, 1x worker.

For this test I switched off 2 etcd instances and then powered on 1 etcd instance (ml-etcd-0), but after that the etcd cluster cannot come up.

Environmental Info:
RKE2 Version: v1.30.1+rke2r1

**Node(s) CPU architecture, OS, and Version:**
amd64, Ubuntu 24.04

Cluster Configuration:
I have architecture:
ml-etcd-0 Ready etcd 22m v1.30.1+rke2r1
ml-etcd-1 Ready etcd 22m v1.30.1+rke2r1
ml-etcd-2 Ready etcd 22m v1.30.1+rke2r1
ml-master-0 Ready control-plane,master 22m v1.30.1+rke2r1
ml-master-1 Ready control-plane,master 20m v1.30.1+rke2r1
ml-worker-0 Ready 19m v1.30.1+rke2r1
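
A split-role layout like the one above is normally driven by /etc/rancher/rke2/config.yaml on each node. The fragments below are a minimal sketch assuming the documented disable-* options for dedicated etcd and control-plane servers; the actual config files are not shown in this issue, so the join address and token are placeholders.

```yaml
# Hypothetical /etc/rancher/rke2/config.yaml on a dedicated etcd node (ml-etcd-1, ml-etcd-2).
# ml-etcd-0, the first server, would omit the server: key.
server: https://<join-address>:9345   # placeholder; which node/LB this pointed at is discussed below
token: <cluster-join-token>           # placeholder
disable-apiserver: true               # etcd-only node: no control-plane components
disable-controller-manager: true
disable-scheduler: true
---
# Hypothetical /etc/rancher/rke2/config.yaml on a dedicated control-plane node (ml-master-0, ml-master-1):
server: https://<join-address>:9345
token: <cluster-join-token>
disable-etcd: true                    # control-plane-only node: no local etcd member
```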

Describe the bug:
When I shut down the ml-etcd-0 and ml-etcd-1 instances and then start ml-etcd-0 again, 2/3 etcd instances are running but the control plane is not working. The etcd cluster tries to connect to the other etcd instances but receives error 500.

Logs from ml-etcd-0

201797Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc004e39500/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
2024-06-13T16:15:05.209160+00:00 ml-etcd-0 rke2[1211]: time="2024-06-13T16:15:05Z" level=info msg="Waiting for apiserver addresses"
2024-06-13T16:15:05.209224+00:00 ml-etcd-0 rke2[1211]: time="2024-06-13T16:15:05Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
2024-06-13T16:15:05.574349+00:00 ml-etcd-0 rke2[1211]: time="2024-06-13T16:15:05Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6444/v1-rke2/readyz: 500 Internal Server Error"
2024-06-13T16:15:09.091456+00:00 ml-etcd-0 rke2[1211]: {"level":"warn","ts":"2024-06-13T16:15:09.090126Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc004dd2c40/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
2024-06-13T16:15:09.094295+00:00 ml-etcd-0 rke2[1211]: time="2024-06-13T16:15:09Z" level=warning msg="Failed to get apiserver address from etcd: context deadline exceeded"
2024-06-13T16:15:10.208173+00:00 ml-etcd-0 rke2[1211]: time="2024-06-13T16:15:10Z" level=info msg="Waiting for apiserver addresses"
2024-06-13T16:15:10.590325+00:00 ml-etcd-0 rke2[1211]: time="2024-06-13T16:15:10Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:6444/v1-rke2/readyz: 500 Internal Server Error"

Expected behavior:
After shutting down 2/3 etcd instances the cluster should not work, but after restarting one of the stopped etcd instances the cluster should work again.

@brandond (Member)

Check the etcd pod logs on the two etcd nodes that you've brought back up, to see why they are not finding each other and coming up.
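
A minimal sketch of how those etcd pod logs can be read directly on an RKE2 node while the apiserver is down, assuming the standard RKE2 install paths (the container ID below is a placeholder):

```sh
# Read the etcd static-pod logs via RKE2's bundled crictl (apiserver not required):
export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps -a --name etcd         # list etcd containers and their IDs
/var/lib/rancher/rke2/bin/crictl logs <etcd-container-id>  # placeholder ID from the command above

# The same output is also written to disk under /var/log/pods:
ls /var/log/pods/kube-system_etcd-*/etcd/
```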

@gawsoftpl (Author) commented Jun 13, 2024

When I shut down ml-etcd-0 and ml-etcd-2, only ml-etcd-1 keeps working.
I then power ml-etcd-0 and ml-etcd-2 back on, after which I get the logs below. It looks like etcd tries to connect to the control plane but this does not work because the etcd cluster is not synchronized; it looks like a deadlock.

IP 192.168.0.5 is the IP of ml-master-0.

Logs from ml-etcd-1:

2024-06-13T17:21:12.836098+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:12Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.0.5:9345: connect: connection refused"
2024-06-13T17:21:12.836407+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:12Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 192.168.0.5:9345: connect: connection refused" url="wss://192.168.0.5:9345/v1-rke2/connect"
2024-06-13T17:21:13.837427+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:13Z" level=info msg="Connecting to proxy" url="wss://192.168.0.5:9345/v1-rke2/connect"
2024-06-13T17:21:13.838095+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:13Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.0.5:9345: connect: connection refused"
2024-06-13T17:21:13.838226+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:13Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 192.168.0.5:9345: connect: connection refused" url="wss://192.168.0.5:9345/v1-rke2/connect"
2024-06-13T17:21:14.840411+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:14Z" level=info msg="Connecting to proxy" url="wss://192.168.0.5:9345/v1-rke2/connect"
2024-06-13T17:21:14.842075+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:14Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp 192.168.0.5:9345: connect: connection refused"
2024-06-13T17:21:14.842282+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:14Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp 192.168.0.5:9345: connect: connection refused" url="wss://192.168.0.5:9345/v1-rke2/connect"
2024-06-13T17:21:15.842771+00:00 ml-etcd-1 rke2[1878]: time="2024-06-13T17:21:15Z" level=info msg="Connecting to proxy" url="wss://192.168.0.5:9345/v1-rke2/connect"

Logs from ml-master-0:

2024-06-13T17:25:55.916197+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 1893 (containerd-shim) remains running after unit stopped.
2024-06-13T17:25:55.916278+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 1979 (containerd-shim) remains running after unit stopped.
2024-06-13T17:25:55.916361+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 1983 (containerd-shim) remains running after unit stopped.
2024-06-13T17:25:55.916415+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 2596 (containerd-shim) remains running after unit stopped.
2024-06-13T17:25:55.917282+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 4438 (containerd-shim) remains running after unit stopped.
2024-06-13T17:25:55.917407+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 4601 (containerd-shim) remains running after unit stopped.
2024-06-13T17:25:55.917479+00:00 ml-master-0 systemd[1]: Failed to start rke2-server.service - Rancher Kubernetes Engine v2 (server).
2024-06-13T17:26:01.159332+00:00 ml-master-0 systemd[1]: rke2-server.service: Scheduled restart job, restart counter is at 44.
2024-06-13T17:26:01.160608+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 1893 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.160902+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.161100+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 1979 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.161188+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.161267+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 1983 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.161337+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.161427+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 2596 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.161516+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.161641+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 4438 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.161743+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.161943+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 4601 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.161996+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.168936+00:00 ml-master-0 systemd[1]: Starting rke2-server.service - Rancher Kubernetes Engine v2 (server)...
2024-06-13T17:26:01.172465+00:00 ml-master-0 sh[10008]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
2024-06-13T17:26:01.232430+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 1893 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.232605+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.232737+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 1979 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.232913+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.233026+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 1983 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.233147+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.233223+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 2596 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.233360+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.233437+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 4438 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.233532+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.233627+00:00 ml-master-0 systemd[1]: rke2-server.service: Found left-over process 4601 (containerd-shim) in control group while starting unit. Ignoring.
2024-06-13T17:26:01.233764+00:00 ml-master-0 systemd[1]: rke2-server.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
2024-06-13T17:26:01.372384+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:01Z" level=warning msg="Unknown flag --cloud-provider found in config.yaml, skipping\n"
2024-06-13T17:26:01.372730+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:01Z" level=warning msg="not running in CIS mode"
2024-06-13T17:26:01.373480+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:01Z" level=info msg="Applying Pod Security Admission Configuration"
2024-06-13T17:26:01.373883+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:01Z" level=info msg="Starting rke2 v1.30.1+rke2r1 (e7f87c6dd56fdd76a7dab58900aeea8946b2c008)"
2024-06-13T17:26:01.429077+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:01Z" level=info msg="Managed etcd cluster not yet initialized"
2024-06-13T17:26:01.746558+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:01Z" level=info msg="Reconciling bootstrap data between datastore and disk"
2024-06-13T17:26:03.926695+00:00 ml-master-0 rke2[10014]: time="2024-06-13T17:26:03Z" level=fatal msg="starting kubernetes: preparing server: failed to get MemberList from server: Internal error occurred: failed to get etcd MemberList: context deadline exceeded"
2024-06-13T17:26:03.936086+00:00 ml-master-0 systemd[1]: rke2-server.service: Main process exited, code=exited, status=1/FAILURE
2024-06-13T17:26:03.967862+00:00 ml-master-0 systemd[1]: rke2-server.service: Failed with result 'exit-code'.
2024-06-13T17:26:03.968096+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 1893 (containerd-shim) remains running after unit stopped.
2024-06-13T17:26:03.968247+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 1979 (containerd-shim) remains running after unit stopped.
2024-06-13T17:26:03.968343+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 1983 (containerd-shim) remains running after unit stopped.
2024-06-13T17:26:03.968412+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 2596 (containerd-shim) remains running after unit stopped.
2024-06-13T17:26:03.968513+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 4438 (containerd-shim) remains running after unit stopped.
2024-06-13T17:26:03.968580+00:00 ml-master-0 systemd[1]: rke2-server.service: Unit process 4601 (containerd-shim) remains running after unit stopped.
2024-06-13T17:26:03.968939+00:00 ml-master-0 systemd[1]: Failed to start rke2-server.service - Rancher Kubernetes Engine v2 (server).

Logs from ml-etcd-2:

2024-06-13T17:31:15.068714+00:00 ml-etcd-2 rke2[817]: {"level":"warn","ts":"2024-06-13T17:31:15.066654Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0022bea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
2024-06-13T17:31:15.068890+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:15Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
2024-06-13T17:31:15.123848+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:15Z" level=info msg="Waiting for apiserver addresses"
2024-06-13T17:31:17.427257+00:00 ml-etcd-2 rke2[817]: {"level":"warn","ts":"2024-06-13T17:31:17.42237Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003e9afc0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
2024-06-13T17:31:17.432435+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:17Z" level=warning msg="Failed to get apiserver address from etcd: context deadline exceeded"
2024-06-13T17:31:20.130449+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:20Z" level=info msg="Waiting for apiserver addresses"
2024-06-13T17:31:22.237356+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:22Z" level=info msg="Waiting for etcd server to become available"
2024-06-13T17:31:22.425036+00:00 ml-etcd-2 rke2[817]: {"level":"warn","ts":"2024-06-13T17:31:22.423953Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc003e9ae00/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
2024-06-13T17:31:22.425592+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:22Z" level=warning msg="Failed to get apiserver address from etcd: context deadline exceeded"
2024-06-13T17:31:22.650523+00:00 ml-etcd-2 rke2[817]: {"level":"warn","ts":"2024-06-13T17:31:22.650223Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0022bea80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
2024-06-13T17:31:22.653336+00:00 ml-etcd-2 rke2[817]: time="2024-06-13T17:31:22Z" level=error msg="Sending HTTP 500 response to 192.168.0.5:57728: failed to get etcd MemberList: context deadline exceeded"

@brandond (Member) commented Jun 13, 2024

Are you using a load-balancer for the --server address, or did you just point all the nodes at a single server? Preferably you would be using a load-balancer...

When building the cluster and using one of the servers as the --server address, it should look like this:

  1. ml-etcd-0 (no --server address)
  2. ml-etcd-1 (--server=ml-etcd-0)
  3. ml-etcd-2 (--server=ml-etcd-0)
  4. ml-master-0 (--server=ml-etcd-0)
  5. ml-master-1 (--server=ml-etcd-0)

When bringing the cluster back up, you should ensure that ml-etcd-0 and one of ml-etcd-1 or ml-etcd-2 are up before starting ml-master-0 and ml-master-1.

If you're using an external load-balancer as the --server address, make sure that it is actually health checking the nodes, otherwise startup may fail due to the lb attempting to send connections to nodes that are not up.
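
In config.yaml terms, the join layout above would look roughly like the sketch below (hostname and token are placeholders; whether a plain hostname, an LB, or a DNS alias is used is exactly the choice being discussed):

```yaml
# Hypothetical /etc/rancher/rke2/config.yaml fragment on every node except ml-etcd-0
# (ml-etcd-0, the first server, simply omits the server: key):
server: https://ml-etcd-0:9345   # or the load-balancer / DNS alias address
token: <cluster-join-token>      # placeholder
```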

@gawsoftpl (Author)

> Are you using a load-balancer for the --server address, or did you just point all the nodes at a single server? Preferably you would be using a load-balancer...
>
> When building the cluster and using one of the servers as the --server address, it should look like this:
>
> 1. ml-etcd-0 (no --server address)
> 2. ml-etcd-1 (--server=ml-etcd-0)
> 3. ml-etcd-2 (--server=ml-etcd-0)
> 4. ml-master-0 (--server=ml-etcd-0)
> 5. ml-master-1 (--server=ml-etcd-0)
>
> When bringing the cluster back up, you should ensure that ml-etcd-0 and one of ml-etcd-1 or ml-etcd-2 are up before starting ml-master-0 and ml-master-1.
>
> If you're using an external load-balancer as the --server address, make sure that it is actually health checking the nodes, otherwise startup may fail due to the lb attempting to send connections to nodes that are not up.

You are right, I made a mistake: when I installed the cluster I used server: ml-control-plane-0, not ml-etcd-0.

  • In the first version of my deployment I created the cluster without a load balancer; this approach has one defect: without ml-etcd-0 I can't restart the cluster.

In the version with a load balancer, does ml-etcd-0 have to be online when restarting the cluster?

@brandond (Member) commented Jun 13, 2024

When bringing up a split-role cluster the etcd nodes need to come up first, and need to point at an etcd node as the server. The control-plane nodes can come up after that, and should also point at an etcd node. The apiserver on control-plane nodes can't run without the datastore, for obvious reasons.

If (for example) ml-etcd-0 is unavailable you could:

  1. remove the --server address from ml-etcd-1 and start it
  2. change the --server address on ml-etcd-2 to point at ml-etcd-1 and start it
  3. repeat on ml-master-0 and ml-master-1 (starting them with the new server address)

If you had an LB or DNS alias this would be much easier; if you're picking a single node to join against you just need to be sure that the node is actually available - if the original node is down, then pick a new one.
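
As a rough sketch of those three steps, assuming config.yaml holds the server: entry and standard RKE2 paths (commands are illustrative, not taken from this issue):

```sh
# 1. On ml-etcd-1: drop the server: entry so it starts against its own etcd data.
sed -i '/^server:/d' /etc/rancher/rke2/config.yaml
systemctl start rke2-server.service

# 2. On ml-etcd-2: re-point the join address at ml-etcd-1, then start.
sed -i 's|^server:.*|server: https://ml-etcd-1:9345|' /etc/rancher/rke2/config.yaml
systemctl start rke2-server.service

# 3. Repeat step 2 on ml-master-0 and ml-master-1 once the etcd members are healthy.
```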

@gawsoftpl (Author) commented Jun 13, 2024

> When bringing up a split-role cluster the etcd nodes need to come up first, and need to point at an etcd node as the server. The control-plane nodes can come up after that, and should also point at an etcd node. The apiserver on control-plane nodes can't run without the datastore, for obvious reasons.

OK, thanks for the quick answer, I understand. When I tested the infrastructure I saw that when I power off two of the etcd instances I can't restart the cluster without ml-etcd-0.

With a load balancer, should I add all 5 instances (etcd + control-plane) as targets for ports 9345 and 6443?

@brandond (Member)

Technically agents can join against any of the 5 servers, but since servers need to join an etcd node (or there needs to at least be one etcd node in the cluster at the time they join), I would probably point the LB at the 3 etcd nodes.

The LB is just used at startup, agents reconnect directly to the servers without going through the LB once they are started.
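
For what it's worth, a minimal sketch of such a load-balancer, using HAProxy purely as an example (the backend IPs are placeholders; only 192.168.0.5 / ml-master-0 appears in this issue):

```
# Hypothetical haproxy.cfg fragment: TCP passthrough for the RKE2 supervisor port
# with per-server health checks, so joins never go to a node that is down.
frontend rke2_supervisor
    bind *:9345
    mode tcp
    default_backend rke2_etcd_servers

backend rke2_etcd_servers
    mode tcp
    balance roundrobin
    server ml-etcd-0 10.0.0.10:9345 check   # placeholder IPs
    server ml-etcd-1 10.0.0.11:9345 check
    server ml-etcd-2 10.0.0.12:9345 check

# A similar frontend/backend pair on 6443 could be added if the kube-apiserver
# should also be reachable through the LB.
```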

@gawsoftpl (Author)

> Technically agents can join against any of the 5 servers, but since servers need to join an etcd node (or there needs to at least be one etcd node in the cluster at the time they join), I would probably point the LB at the 3 etcd nodes.
>
> The LB is just used at startup, agents reconnect directly to the servers without going through the LB once they are started.

I have created this architecture:

Load Balancer (targets: etcd-0, etcd-1, etcd-2; port: 9345)
etcd-0
etcd-1
etcd-2
control-plane-0
control-plane-1

Now I am testing simulated server failures: I powered off etcd-0 and etcd-1 and after 1 minute turned on only etcd-1. The cluster came up automatically, but there was one issue: control-plane-1 did not come up (I had to restart the rke2 process) and etcd took a long time (about 5 minutes) to establish a connection.
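
For reference, "restart the rke2 process" on the stuck control-plane node amounts to something like the following (standard RKE2 paths assumed; not taken verbatim from this issue):

```sh
systemctl restart rke2-server.service
journalctl -u rke2-server.service -f    # watch it rejoin

# Once the apiserver answers again, confirm node status with RKE2's bundled kubectl:
/var/lib/rancher/rke2/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
```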

I think a nice feature would be to add --datastore-endpoint, like in k3s, to connect to an external etcd cluster without bootstrapping the etcd cluster via rke2.

@brandond (Member) commented Jun 13, 2024

If you're looking for a simpler architecture you might just consider having the 3 server nodes be etcd+control-plane. Is there something in particular that you're getting out of splitting those up?

> there was one issue: control-plane-1 did not come up (I had to restart the rke2 process) and etcd took a long time (about 5 minutes) to establish a connection.

Did you also stop the control-plane nodes, or did you leave them running while the etcd nodes were down?

> I think a nice feature would be to add --datastore-endpoint, like in k3s, to connect to an external etcd cluster without bootstrapping the etcd cluster via rke2.

That might be possible at some point in the future, but you wouldn't be able to use any of the etcd snapshot management stuff built into rke2 or rancher...

@gawsoftpl (Author) commented Jun 13, 2024

> If you're looking for a simpler architecture you might just consider having the 3 server nodes be etcd+control-plane. Is there something in particular that you're getting out of splitting those up?

I know, I have already tested this configuration. I am looking at external etcd because I'm already planning for what will happen when the cluster grows; starting with etcd + control-plane is fine.

> > there was one issue: control-plane-1 did not come up (I had to restart the rke2 process) and etcd took a long time (about 5 minutes) to establish a connection.
>
> Did you also stop the control-plane nodes, or did you leave them running while the etcd nodes were down?

The control plane was left running without a restart.

> > I think a nice feature would be to add --datastore-endpoint, like in k3s, to connect to an external etcd cluster without bootstrapping the etcd cluster via rke2.
>
> That might be possible at some point in the future, but you wouldn't be able to use any of the etcd snapshot management stuff built into rke2 or rancher...

Thanks for the help.

@github-actions bot (Contributor)

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@github-actions bot closed this as not planned (stale) on Aug 13, 2024.