
[BUG] Unable to restore rke2/k3s provisioned clusters from etcd snapshot if cluster is completely down #41080

Closed
Oats87 opened this issue Apr 5, 2023 · 11 comments
Assignees
Labels
area/provisioning-v2 (Provisioning issues that are specific to the provisioningv2 generating framework), internal, kind/bug (Issues that are defects reported by users or that we know have reached a real release), team/area2/sc, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone

Comments

@Oats87
Contributor

Oats87 commented Apr 5, 2023

Rancher Server Setup

  • Rancher version: v2.7-head
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: N/A
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): I tested with custom

Describe the bug
A user is unable to restore an etcd snapshot on a custom K3s/RKE2 provisioned downstream cluster when the controlplane/etcd are completely unavailable. The system-agent-install script gets stuck waiting for a machine plan secret to be assigned to the RKE bootstrap.
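
A quick way to confirm this symptom from the Rancher management (local) cluster is to check whether a machine plan secret was ever generated for the new machine. This is only a sketch: the fleet-default namespace and the -machine-plan name suffix are assumptions about a typical provisioning-v2 setup and may differ in your environment.

# List machine plan secrets for the downstream cluster; a missing entry for the
# new machine matches the "waiting for plan secret" hang (namespace and name
# suffix are assumptions).
kubectl get secrets -n fleet-default | grep machine-plan

# Inspect the rkebootstrap objects the plan secret should be tied to.
kubectl get rkebootstraps.rke.cattle.io -n fleet-default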

To Reproduce

  1. Create an RKE2 cluster (custom) and register a controlplane+etcd node and a worker node
  2. Take an etcd snapshot on the cluster and put it in a safe place
  3. Shut down the controlplane+etcd and worker node from your cluster
  4. Wait for the cluster to die
  5. Attempt to register 2 new nodes (controlplane+etcd, worker)

Result
The install script on the new nodes gets stuck waiting for a plan secret

Expected Result
You should be able to register new nodes

Screenshots

Additional context

SURE-6119

@Oats87 Oats87 added the kind/bug and area/provisioning-v2 labels Apr 5, 2023
@Oats87 Oats87 self-assigned this Apr 5, 2023
@Oats87
Contributor Author

Oats87 commented Apr 5, 2023

It seems that the rkebootstrap is not being reconciled properly (the machine plan is not generated for the rkebootstrap) because the cluster's control plane initialized condition is set to true after the cluster is first provisioned.
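
Assuming this refers to the ControlPlaneInitialized condition on the CAPI Cluster object (an assumption on my part), it can be inspected from the local cluster roughly like this; the namespace and cluster name are placeholders:

# Dump condition types and statuses for the downstream CAPI cluster object
# (fleet-default and <cluster-name> are placeholders).
kubectl get clusters.cluster.x-k8s.io -n fleet-default <cluster-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'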

@Oats87
Contributor Author

Oats87 commented Apr 26, 2023

Encountered rancher/rke2#4052 (comment) while attempting to fix this issue.

@Oats87
Contributor Author

Oats87 commented Apr 26, 2023

#41174
#40994
#41024
#41129
#41095

@Oats87
Contributor Author

Oats87 commented May 15, 2023

https://github.com/rancher/rancher/pull/41459/files#diff-347b85f4b27f0bc66ce4f2e3c4ca653d3f1c1ffb9e2b4f7c35814cf7881fd8a5R526 adds e2e tests to restore etcd on both machine provisioned and custom clusters when etcd is completely down

@Josh-Diamond
Contributor

Ticket #41080 - Test Results - ❌ blocked

Reproduced with HA Helm Rancher on v2.7.3:

  1. Fresh install of Rancher v2.7.3
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node
  3. Once active, take a snapshot
  4. Once snapshot is successfully captured, power down the nodes
  5. Once cluster "dies", register 2 new nodes - 1 etcd/cp and 1 wkr
  6. Reproduced - cluster stuck in "updating" state w/ cp/etcd node waiting for cluster agent to connect

Verified with HA Helm Rancher on v2.7-1eb478f3e77fde8a0328ed99cc3c39a1048442d2-head:

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node
  3. Once active, take a snapshot
  4. Once snapshot is successfully captured, power down the nodes
  5. Once cluster "dies", register 2 new nodes - 1 etcd/cp and 1 wkr
  6. ❌ Veri-FAILED - encountered #39689 ([BUG] 000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again) - BLOCKED:
curl: (28) Operation timed out after 60000 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60000 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again

@snasovich
Collaborator

#41459 (files) adds e2e tests to restore etcd on both machine provisioned and custom clusters when etcd is completely down

@Oats87 , is it correct to assume that for Custom RKE2 clusters the procedure to restore is the same as the one outlined for RKE1 here?
And then, could you please outline the procedure for restoring machine-provisioned clusters?

@Josh-Diamond
Contributor

Josh-Diamond commented May 22, 2023

Ticket #41080 - Test Results - ✅

Verified with HA Helm Rancher on v2.7-707d173dc87d0dd28759fdf1a32e4be2949f7183-head:

Scenario / Test Case / Result:
  1. Fresh install - Rancher v2.7-head; reproduce + "restore" cluster - ✅
  2. Upgrade - reproduce on Rancher v2.7.3 => upgrade to Rancher v2.7-head and "restore" cluster - ✅

Note:

  • OS ubuntu 22.04 was used for nodes in downstream RKE2 custom clusters

Scenario 1 - (Fresh install)

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the nodes
  6. Once cluster "dies", register 1 new node - all-roles, then delete the existing etcd/cp node
  7. Once active, restore to snapshot taken in step 4
  8. Observe restore worked by confirming workload deployed in step 3 is still seen in cluster
  9. Register 2 more nodes - 1 etcd/cp, and 1 worker
  10. Once active, delete the all-roles node created in step 6
  11. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

Scenario 2 - (Upgrade)

  1. Fresh install of Rancher v2.7.3
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the nodes
  6. Once cluster "dies", upgrade Rancher to v2.7-6ce75a0b57ad9e88cf135254a9b3b40082788eb6-head
  7. Register 1 new node (all-roles) to the cluster, then delete the existing etcd/cp node
  8. Once active, restore to snapshot taken in step 4
  9. Observe restore worked by confirming workload deployed in step 3 is still seen in cluster
  10. Register 2 more nodes - 1 etcd/cp, and 1 worker
  11. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

@Oats87
Contributor Author

Oats87 commented May 24, 2023

The procedure is similar between machine provisioned and custom clusters.

If you have a complete cluster failure, you must remove all etcd nodes/machines from your cluster before you can add a "new" etcd node for restore.

NOTE If you are using local snapshots, it is VERY important that you back up the snapshot you intend to restore, which lives in the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder on the etcd node you are going to remove. You can then copy that snapshot onto your new node into the same /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder. Furthermore, if using local snapshots and restoring to a new node, restoration cannot currently be done via the UI.
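
As a rough illustration of the note above (host names, the snapshot file name, and the rke2 path are placeholders; use the k3s path for K3s clusters), backing up and staging a local snapshot might look like:

# On the old etcd node, before removing it: copy the snapshot somewhere safe.
scp /var/lib/rancher/rke2/server/db/snapshots/<snapshot-name> admin@backup-host:/backups/

# Later, stage it on the new etcd node and move it into place as root.
scp /backups/<snapshot-name> admin@new-etcd-node:/tmp/
ssh admin@new-etcd-node 'sudo mkdir -p /var/lib/rancher/rke2/server/db/snapshots \
  && sudo mv /tmp/<snapshot-name> /var/lib/rancher/rke2/server/db/snapshots/'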

ANOTHER NOTE This procedure is only usable with Rancher >= v2.7.6 -- if you follow this procedure with an older version of Rancher, it will not work as expected.

  1. Remove all etcd nodes from your cluster. If using custom, you can remove the machine objects from the UI; if using machine provisioned, scale the pools that contain etcd nodes to zero. Initially, the nodes will hang in "deleting", but once all etcd nodes are deleting they will be removed together. This is because Rancher sees that every etcd node is deleting and "short circuits" the etcd safe-removal logic.
  2. Once all etcd nodes are removed, add a new etcd node that you are planning to restore from. Rancher will proceed to error out when processing the cluster and state that restoration from etcd snapshot is required.
  3. Restore from the etcd snapshot. If using S3, restoring from the UI is possible; if using a local snapshot, you can set cluster.rkeConfig.etcdSnapshotRestore.name to the file name of the snapshot on disk in /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ (a kubectl sketch of this follows below).
  4. Once restoration is successful, you can scale your etcd nodes back up to desired redundancy.
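
For step 3 with a local snapshot, a minimal sketch of setting that field directly on the provisioning cluster object with kubectl is below. The namespace, cluster name, and snapshot file name are placeholders, and the generation field is an assumption about how repeat restores are triggered; double-check the spec against your Rancher version.

# Trigger a restore by setting rkeConfig.etcdSnapshotRestore.name on the
# provisioning.cattle.io Cluster object (placeholders throughout).
kubectl -n fleet-default patch clusters.provisioning.cattle.io <cluster-name> \
  --type=merge \
  -p '{"spec":{"rkeConfig":{"etcdSnapshotRestore":{"name":"<snapshot-file-name>","generation":1}}}}'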

@Josh-Diamond
Contributor

Ticket #41080 - Additional Scenarios - Test Results - ✅

Verified with HA Helm Rancher on v2.7-fbca7c34e5aeae36b85d8b0e9af12a2f6f13e0ea-head:

Scenario / Test Case / Result:
  1. Fresh install - Downstream AWS RKE2 Node driver cluster - single-node: all roles - ✅
  2. Fresh install - Downstream AWS RKE2 Node driver cluster - 2 nodes, split roles: 1 etcd/cp + 1 wkr - ✅

Scenario 1 - (single-node: all roles)

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 AWS Node driver cluster w/ 1 all roles node w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the node
  6. Once cluster "dies", navigate to Cluster Management > Cluster Details > Machines and delete the node
  7. New node begins to provision (1 all roles) and Rancher informs user to restore from etcd snapshot - proceed w/ restoring to snapshot taken in step 4
  8. Once active, navigate to Cluster Explorer > Pods > All Namespaces and observe restore worked by confirming workload deployed in step 3 is present in the cluster
  9. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

Scenario 2 - (2 nodes, split-roles: 1 etcd/cp + 1 wkr)

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 AWS Node driver cluster w/ 2 nodes - [ 1 etcd/cp + 1 wkr ] - w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the nodes
  6. Once cluster "dies", navigate to Cluster Management > Cluster Details > Machines and delete both nodes
  7. New nodes begin to provision (1 etcd/cp and 1 wkr) and Rancher informs user to restore from etcd snapshot - proceed w/ restoring to snapshot taken in step 4
  8. Once active, navigate to Cluster Explorer > Pods > All Namespaces and observe restore worked by confirming workload deployed in step 3 is present in the cluster
  9. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

@kaioneuhauss

@Oats87 Is there a plan for this procedure to also work on Rancher Prime 2.7.3?

@snasovich
Copy link
Collaborator

@kaioneuhauss , code changes in linked PR(s) are necessary for the procedure so it will only work on the upcoming Q2 feature release (2.7.5, including Prime version).
