[docs] In HA installs, wait for the first node to be ready before joining others #895

rancher-max · 2021-04-19T18:25:56Z

Need to update documentation to call out the importance of waiting for the initial node to be running before joining other server nodes, due to the limitations in etcd learners.

Martin-Weiss · 2021-04-20T05:15:22Z

When updating the docs with this, please also add information on how to test, both locally and remotely, whether the first node is ready and etcd is not in a state where another secondary master is still in the process of joining.

massep88 · 2021-04-28T08:31:28Z

Maybe something interesting to add to the docs: while experiencing issues with simultaneous rke2 / etcd server starts, I was able to validate that, for example, a 1s wait between rke2 / etcd starts was enough.

teebeey · 2021-05-06T14:35:11Z

Is this only when joining the second master, or do we have to wait between all the masters joining?

brandond · 2021-05-06T18:24:55Z

Only one node can join the etcd cluster at a time, so all servers (we don't have masters) need to wait for previous nodes to complete their join before joining.
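
As an illustration of the one-at-a-time rule on releases affected by this limitation, here is a minimal sketch of a sequential join loop. The `start` and `is_ready` callbacks are hypothetical placeholders: this thread does not define a readiness check, so whatever "ready" means (for example, a health probe against the previous server) has to be supplied by the caller.

```python
import time
from typing import Callable, Iterable, List


def wait_until(check: Callable[[], bool],
               timeout: float = 300.0,
               interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll check() until it returns True or `timeout` seconds have elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def join_sequentially(nodes: Iterable[str],
                      start: Callable[[str], None],
                      is_ready: Callable[[str], bool],
                      **wait_kwargs) -> List[str]:
    """Start each server only after the previous one reports ready.

    `start` and `is_ready` are caller-supplied (hypothetical) hooks; this
    function only enforces the ordering.
    """
    joined: List[str] = []
    for node in nodes:
        start(node)
        if not wait_until(lambda: is_ready(node), **wait_kwargs):
            raise TimeoutError(f"{node} did not become ready in time")
        joined.append(node)
    return joined
```

The loop never starts the next server while the current one is still joining, which is the property described above; a fixed sleep between starts only approximates it.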

massep88 · 2021-05-07T07:07:19Z

Yes, and it would be great if this were included in the documentation.
To validate @brandond's comment above, I re-tested this with a specific check that etcd has launched on the previous master before moving on to the next one.

davidnuzik · 2021-06-14T23:23:30Z

@brandond why did you unassign @rancher-max ?
cc: @cjellick

brandond · 2021-06-14T23:26:32Z

When running through milestone items after @cjellick dropped and asked the rest of the team to finish moving things out of the v1.21.2+rke2r1 milestone, this issue came up and @rancher-max indicated that it was unclear why he was assigned an issue that's not ready for testing. If he's expected to write the documentation and put in the PR then someone needs to remind him.

davidnuzik · 2021-06-15T00:01:48Z

Oh okay gotcha. @rancher-max and I synced on this and he agreed he could do the docs work here previously. I'll reassign to him.

braunsonm · 2021-10-05T14:05:42Z

Can this limitation be handled by rke2 server itself rather than relying on the user to stagger the joining of master nodes? RKE2's binary could do a random exponential backoff and retry rather than require the delay.
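
To make the suggestion concrete, here is a minimal sketch of random exponential backoff ("full jitter"). It is not how RKE2 behaves or is implemented; `action` is a hypothetical stand-in for a join attempt, and the `base`, `cap`, and `attempts` defaults are arbitrary assumptions.

```python
import random
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base: float = 1.0, cap: float = 60.0, attempts: int = 8,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay per attempt; the ceiling doubles up to `cap`."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0, ceiling)


def retry_with_backoff(action: Callable[[], T],
                       sleep: Callable[[float], None] = time.sleep,
                       **backoff_kwargs) -> T:
    """Call action() until it succeeds, sleeping a jittered delay between tries."""
    delays = list(backoff_delays(**backoff_kwargs))
    last_error: Optional[Exception] = None
    for index, delay in enumerate(delays):
        try:
            return action()
        except Exception as err:  # broad on purpose: any failure triggers a retry
            last_error = err
            if index < len(delays) - 1:
                sleep(delay)  # no point sleeping after the final attempt
    raise RuntimeError("gave up after all retry attempts") from last_error
```

The randomness matters here: if several servers start at the same moment and all retry on the same schedule, they keep colliding, whereas jittered delays spread the attempts out.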

This means bootstrapping RKE2 clusters using tools like Terraform gets a little more painful. Typically you'd create a group of master nodes using something like a for_each or count meta-argument, like so:

resource "virtual_machine" "server_nodes" {
  for_each = {
    master1 = "1.1.1.1"
    master2 = "1.1.1.2"
    master3 = "1.1.1.3"
  }

  ....
}

The only way to truly force the first server node to be online before creating the others is to introduce dependencies between server nodes. This creates a lot of problems because if you need to recreate the 1st server node, all the dependencies need to be destroyed which destroys the whole server plane.

The other option, which is more of a hack, is to add a sleep in your cloud-init script in front of systemctl enable --now rke2-server, in the hope that the wait is long enough for the previous node to have joined. This fails a lot of the time.

brandond · 2021-10-05T18:26:07Z

You do need to have dependencies between nodes though - exactly one of your nodes must be started with --cluster-init (which is implicit on RKE2 when not providing --server); the remainder must be started with --server pointing at that first node, or at some other fixed registration endpoint that is backed by that node.
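
A rough sketch of that asymmetry, expressed as per-node config file contents rather than CLI flags. It assumes the `token` and `server` keys of RKE2's config file and the default supervisor port 9345; the function name, the sample IPs, and the `registration_host` parameter are made up for illustration, and the token is written as a plain unquoted YAML value.

```python
from typing import Dict, List, Optional

SUPERVISOR_PORT = 9345  # assumed default RKE2 supervisor/registration port


def render_server_configs(node_ips: List[str], token: str,
                          registration_host: Optional[str] = None) -> Dict[str, str]:
    """Return config text per node: only the first node omits `server`."""
    if not node_ips:
        raise ValueError("at least one server node is required")
    first = node_ips[0]
    # Joining nodes point at the first node, or at a fixed endpoint backed by it.
    endpoint = f"https://{registration_host or first}:{SUPERVISOR_PORT}"
    configs: Dict[str, str] = {}
    for ip in node_ips:
        lines = [f"token: {token}"]
        if ip != first:
            lines.append(f"server: {endpoint}")
        configs[ip] = "\n".join(lines) + "\n"
    return configs
```

For example, `render_server_configs(["1.1.1.1", "1.1.1.2", "1.1.1.3"], "example-token")` gives the first node a config with only a token, while the other two also get a `server` line pointing at `https://1.1.1.1:9345`.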

That said, we will probably add some retry behavior eventually so that joins work better; this is tracked under #897.

rancher-max · 2022-05-10T21:41:25Z

This doesn't really apply anymore: in all of the latest releases it is actually possible to run all the rke2-server processes at the same time, so I'm going to close this docs issue. See #349 for details on the fix.

rancher-max added the kind/documentation Improvements or additions to documentation label Apr 19, 2021

rancher-max mentioned this issue Apr 19, 2021

Servers fail to join cluster if multiple nodes are joined concurrently or image import/pull takes more than 5 minutes #349

Closed

davidnuzik added this to the v1.21.1+rke2r1 milestone Apr 20, 2021

davidnuzik added the priority/important-soon label Apr 20, 2021

brandond mentioned this issue May 8, 2021

Unable to start 3 servers at the same time, slaves servers are not able to wait and connect #962

Closed

davidnuzik modified the milestones: v1.21.1+rke2r1, v1.21.2+rke2r1 May 24, 2021

davidnuzik assigned rancher-max May 24, 2021

brandond modified the milestones: v1.21.2+rke2r1, v1.21.3+rke2r1 Jun 11, 2021

brandond unassigned rancher-max Jun 11, 2021

davidnuzik assigned rancher-max Jun 15, 2021

fapatel1 modified the milestones: v1.21.3+rke2r1, Documentation Backlog Jun 23, 2021

fapatel1 unassigned rancher-max Jun 23, 2021

rancher-max closed this as completed May 10, 2022
