
[BUG] Unable to restore rke2/k3s provisioned clusters from etcd snapshot if cluster is completely down #41080

Closed
Oats87 opened this issue Apr 5, 2023 · 11 comments
Assignees
Labels
area/provisioning-v2 (Provisioning issues that are specific to the provisioningv2 generating framework), internal, kind/bug (Issues that are defects reported by users or that we know have reached a real release), team/area2/sc, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone

Comments

@Oats87
Contributor

Oats87 commented Apr 5, 2023

Rancher Server Setup

  • Rancher version: v2.7-head
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc):
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: N/A
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): I tested with custom

Describe the bug
A user is unable to restore an etcd snapshot on a custom K3s/RKE2 provisioned downstream cluster when the controlplane/etcd are completely unavailable. The system-agent-install script gets stuck waiting for a machine plan secret to be assigned to the RKE bootstrap.
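
A quick way to confirm this symptom from the Rancher management (local) cluster is to check whether a machine plan secret was ever generated for the new machine. This is only a sketch: the fleet-default namespace and the -machine-plan name suffix are assumptions about a typical provisioning-v2 setup and may differ in your environment.

# List machine plan secrets for the downstream cluster; a missing entry for the
# new machine matches the "waiting for plan secret" hang (namespace and name
# suffix are assumptions).
kubectl get secrets -n fleet-default | grep machine-plan

# Inspect the rkebootstrap objects the plan secret should be tied to.
kubectl get rkebootstraps.rke.cattle.io -n fleet-default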

To Reproduce

  1. Create an RKE2 cluster (custom) and register a controlplane+etcd node and a worker node
  2. Take an etcd snapshot on the cluster and put it in a safe place
  3. Shut down the controlplane+etcd and worker node from your cluster
  4. Wait for the cluster to die
  5. Attempt to register 2 new nodes (controlplane+etcd, worker)

Result
The install script on the new nodes gets stuck waiting for a plan secret

Expected Result
You should be able to register new nodes

Screenshots

Additional context

SURE-6119

@Oats87 Oats87 added the kind/bug and area/provisioning-v2 labels Apr 5, 2023
@Oats87 Oats87 self-assigned this Apr 5, 2023
@Oats87
Contributor Author

Oats87 commented Apr 5, 2023

It seems that the rkebootstrap is not being reconciled properly (the machine plan is not generated for the rkebootstrap) because the cluster's control plane initialized condition is set to true after the cluster is first provisioned.
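
Assuming this refers to the ControlPlaneInitialized condition on the CAPI Cluster object (an assumption on my part), it can be inspected from the local cluster roughly like this; the namespace and cluster name are placeholders:

# Dump condition types and statuses for the downstream CAPI cluster object
# (fleet-default and <cluster-name> are placeholders).
kubectl get clusters.cluster.x-k8s.io -n fleet-default <cluster-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'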

@Oats87
Contributor Author

Oats87 commented Apr 26, 2023

Encountered rancher/rke2#4052 (comment) while attempting to fix this issue.

@Oats87
Contributor Author

Oats87 commented Apr 26, 2023

#41174
#40994
#41024
#41129
#41095

@Oats87
Contributor Author

Oats87 commented May 15, 2023

https://github.com/rancher/rancher/pull/41459/files#diff-347b85f4b27f0bc66ce4f2e3c4ca653d3f1c1ffb9e2b4f7c35814cf7881fd8a5R526 adds e2e tests to restore etcd on both machine provisioned and custom clusters when etcd is completely down

@Josh-Diamond
Contributor

Ticket #41080 - Test Results - ❌ blocked

Reproduced with HA Helm Rancher on v2.7.3:

  1. Fresh install of Rancher v2.7.3
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node
  3. Once active, take a snapshot
  4. Once snapshot is successfully captured, power down the nodes
  5. Once cluster "dies", register 2 new nodes - 1 etcd/cp and 1 wkr
  6. Reproduced - cluster stuck in "updating" state w/ cp/etcd node waiting for cluster agent to connect

Verified with HA Helm Rancher on v2.7-1eb478f3e77fde8a0328ed99cc3c39a1048442d2-head:

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node
  3. Once active, take a snapshot
  4. Once snapshot is successfully captured, power down the nodes
  5. Once cluster "dies", register 2 new nodes - 1 etcd/cp and 1 wkr
  6. ❌ Veri-FAILED - encountered #39689 ([BUG] 000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again) - BLOCKED:
curl: (28) Operation timed out after 60000 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60001 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again
curl: (28) Operation timed out after 60000 milliseconds with 0 bytes received
[ERROR]  000 received while downloading Rancher connection information. Sleeping for 5 seconds and trying again

@snasovich
Collaborator

#41459 (files) adds e2e tests to restore etcd on both machine provisioned and custom clusters when etcd is completely down

@Oats87 , is it correct to assume that for Custom RKE2 clusters the procedure to restore is the same as the one outlined for RKE1 here?
And then, could you please outline the procedure for restoring machine-provisioned clusters?

@Josh-Diamond
Contributor

Josh-Diamond commented May 22, 2023

Ticket #41080 - Test Results - ✅

Verified with HA Helm Rancher on v2.7-707d173dc87d0dd28759fdf1a32e4be2949f7183-head:

Scenario / Test Case / Result:
  1. Fresh install - Rancher v2.7-head; reproduce + "restore" cluster - ✅
  2. Upgrade - reproduce on Rancher v2.7.3 => upgrade to Rancher v2.7-head and "restore" cluster - ✅

Note:

  • OS ubuntu 22.04 was used for nodes in downstream RKE2 custom clusters

Scenario 1 - (Fresh install)

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the nodes
  6. Once cluster "dies", register 1 new node - all-roles, then delete the existing etcd/cp node
  7. Once active, restore to snapshot taken in step 4
  8. Observe restore worked by confirming workload deployed in step 3 is still seen in cluster
  9. Register 2 more nodes - 1 etcd/cp, and 1 worker
  10. Once active, delete the all-roles node created in step 6
  11. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

Scenario 2 - (Upgrade)

  1. Fresh install of Rancher v2.7.3
  2. Provision a downstream RKE2 Custom cluster w/ 1 etcd/cp node and 1 wkr node w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the nodes
  6. Once cluster "dies", upgrade Rancher to v2.7-6ce75a0b57ad9e88cf135254a9b3b40082788eb6-head
  7. Register 1 new node (all-roles) to the cluster, then delete the existing etcd/cp node
  8. Once active, restore to snapshot taken in step 4
  9. Observe restore worked by confirming workload deployed in step 3 is still seen in cluster
  10. Register 2 more nodes - 1 etcd/cp, and 1 worker
  11. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

@Oats87
Contributor Author

Oats87 commented May 24, 2023

The procedure is similar between machine provisioned and custom clusters.

If you have a complete cluster failure, you must remove all etcd nodes/machines from your cluster before you can add a "new" etcd node for restore.

NOTE If you are using local snapshots, it is VERY important that you back up the snapshot you intend to restore, which lives in the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder on the etcd node you are going to remove. You can then copy that snapshot onto your new node into the same /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder. Furthermore, if using local snapshots and restoring to a new node, restoration cannot currently be done via the UI.
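
As a rough illustration of the note above (host names, the snapshot file name, and the rke2 path are placeholders; use the k3s path for K3s clusters), backing up and staging a local snapshot might look like:

# On the old etcd node, before removing it: copy the snapshot somewhere safe.
scp /var/lib/rancher/rke2/server/db/snapshots/<snapshot-name> admin@backup-host:/backups/

# Later, stage it on the new etcd node and move it into place as root.
scp /backups/<snapshot-name> admin@new-etcd-node:/tmp/
ssh admin@new-etcd-node 'sudo mkdir -p /var/lib/rancher/rke2/server/db/snapshots \
  && sudo mv /tmp/<snapshot-name> /var/lib/rancher/rke2/server/db/snapshots/'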

ANOTHER NOTE This procedure is only usable with Rancher >= v2.7.6 -- if you follow this procedure with an older version of Rancher, it will not work as expected.

  1. Remove all etcd nodes from your cluster. If using custom, you can remove the machine objects from the UI; if using machine provisioned, scale the pools that contain etcd nodes to zero. Initially, the nodes will hang in "deleting", but once all etcd nodes are deleting they will be removed together. This is because Rancher sees that every etcd node is deleting and "short circuits" the etcd safe-removal logic.
  2. Once all etcd nodes are removed, add a new etcd node that you are planning to restore from. Rancher will proceed to error out when processing the cluster and state that restoration from etcd snapshot is required.
  3. Restore from the etcd snapshot. If using S3, restoring from the UI is possible; if using a local snapshot, you can set cluster.rkeConfig.etcdSnapshotRestore.name to the file name of the snapshot on disk in /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ (a kubectl sketch of this follows below).
  4. Once restoration is successful, you can scale your etcd nodes back up to desired redundancy.
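
For step 3 with a local snapshot, a minimal sketch of setting that field directly on the provisioning cluster object with kubectl is below. The namespace, cluster name, and snapshot file name are placeholders, and the generation field is an assumption about how repeat restores are triggered; double-check the spec against your Rancher version.

# Trigger a restore by setting rkeConfig.etcdSnapshotRestore.name on the
# provisioning.cattle.io Cluster object (placeholders throughout).
kubectl -n fleet-default patch clusters.provisioning.cattle.io <cluster-name> \
  --type=merge \
  -p '{"spec":{"rkeConfig":{"etcdSnapshotRestore":{"name":"<snapshot-file-name>","generation":1}}}}'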

@Josh-Diamond
Contributor

Ticket #41080 - Additional Scenarios - Test Results - ✅

Verified with HA Helm Rancher on v2.7-fbca7c34e5aeae36b85d8b0e9af12a2f6f13e0ea-head:

Scenario / Test Case / Result:
  1. Fresh install - Downstream AWS RKE2 Node driver cluster - single-node: all roles - ✅
  2. Fresh install - Downstream AWS RKE2 Node driver cluster - 2 nodes, split roles: 1 etcd/cp + 1 wkr - ✅

Scenario 1 - (single-node: all roles)

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 AWS Node driver cluster w/ 1 all roles node w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the node
  6. Once cluster "dies", navigate to Cluster Management > Cluster Details > Machines and delete the node
  7. New node begins to provision (1 all roles) and Rancher informs user to restore from etcd snapshot - proceed w/ restoring to snapshot taken in step 4
  8. Once active, navigate to Cluster Explorer > Pods > All Namespaces and observe restore worked by confirming workload deployed in step 3 is present in the cluster
  9. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

Scenario 2 - (2 nodes, split-roles: 1 etcd/cp + 1 wkr)

  1. Fresh install of Rancher v2.7-head
  2. Provision a downstream RKE2 AWS Node driver cluster w/ 2 nodes - [ 1 etcd/cp + 1 wkr ] - w/ s3 snapshots configured
  3. Once active, deploy a workload in default namespace
  4. Take a snapshot (s3)
  5. Once snapshot is successfully captured, power down the nodes
  6. Once cluster "dies", navigate to Cluster Management > Cluster Details > Machines and delete both nodes
  7. New nodes begin to provision (1 etcd/cp and 1 wkr) and Rancher informs user to restore from etcd snapshot - proceed w/ restoring to snapshot taken in step 4
  8. Once active, navigate to Cluster Explorer > Pods > All Namespaces and observe restore worked by confirming workload deployed in step 3 is present in the cluster
  9. Verified - RKE2 cluster successfully restored from snapshot when cluster is down

@kaioneuhauss

@Oats87 Is there a plan for this procedure to also work on Rancher Prime 2.7.3?

@snasovich
Copy link
Collaborator

@kaioneuhauss , code changes in linked PR(s) are necessary for the procedure so it will only work on the upcoming Q2 feature release (2.7.5, including Prime version).
