
Cannot do two etcd restores in a row on the same host #353

Closed
cjellick opened this issue Sep 23, 2020 · 5 comments

Comments

@cjellick
Contributor

Environmental Info:
K3s Version:
1.19.1+k3s1 (actually it was a late RC that I used in a demo)

Node(s) CPU architecture, OS, and Version:
A medium/average-size DigitalOcean droplet running Ubuntu 20.04

Cluster Configuration:
3 masters using embedded etcd (though it should repro with just 1 master)

Describe the bug:
If you try to do two etcd restores from the same host, the second one will fail.
We apparently have a check that looks for /var/lib/rancher/k3s/server/db/etcd-old/ and refuses to do the restore if that directory is there, thinking that the db has "already" been restored.
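
Note: based on the error message quoted under "Actual behavior" below, a possible manual workaround (unverified here, and assuming the check only tests for the existence of that directory) is to move the leftover etcd-old directory aside before retrying the restore:

# sketch only: set aside the etcd-old directory left behind by the first restore
sudo systemctl stop k3s
sudo mv /var/lib/rancher/k3s/server/db/etcd-old /var/lib/rancher/k3s/server/db/etcd-old.bak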

Steps To Reproduce:
Warning: these steps are mostly from memory, so may not be perfect

  1. Bring k3s up in etcd mode (with more frequent backups so it is easier to test):
sudo mkdir -p /etc/rancher/k3s && \
sudo vi /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "*/2 * * * *"
cluster-init: true
  2. Wait for a backup to occur and then attempt to restore to it.
    2a. Stop k3s:
systemctl stop k3s # maybe it's systemctl stop k3s-server; I don't have the exact command anymore

2b. Perform the restore using just the k3s binary (a sketch for locating the snapshot file name follows these steps):

k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-1600268280

2c. Start k3s back up:

systemctl start k3s

2d. Observe the cluster come back up successfully:

kubectl get nodes
  3. Repeat step 2.
    This will break.
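
For step 2b, the snapshot file name can be found by listing the default snapshot directory (path assumed from the restore command above; the timestamp suffix will differ per cluster):

ls -lt /var/lib/rancher/k3s/server/db/snapshots/
# pass the full path of the desired snapshot to --cluster-reset-restore-path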

Expected behavior:
The cluster should restore cleanly the second time.

Actual behavior:
Cluster fails to restore the second time with this error:

INFO[2020-09-16T15:14:32.020933020Z] etcd already restored from a snapshot. Restart without --snapshot-restore-path flag. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
@cjellick cjellick transferred this issue from k3s-io/k3s Sep 23, 2020
@cjellick cjellick added this to the GA milestone Sep 23, 2020
@cjellick
Contributor Author

This was encountered in k3s, but I have moved the issue to rke2 because it should have the same problem here and I want it in the GA milestone.

@cjellick
Contributor Author

cc @rancher-max - this is the issue I was chatting with you about while prepping for the demo

@brandond
Member

brandond commented Sep 23, 2020

I thought that we had done this on purpose to prevent users from accidentally putting the restore command in their systemd unit and wedging it into a restore loop.

I think we shouldn't change this behavior, but maybe add an additional message instructing the user to delete the etcd-old directory if they really want to restore again.

Or do you think we should delete the etcd-old directory if it is found when starting without the restore flag? That might work.
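
For the first option, the extra log message could point the operator at something like the following (a sketch only, using the default k3s data dir; it assumes the operator really does intend a second restore):

# run on each etcd server node before retrying the restore
sudo systemctl stop k3s
sudo rm -rf /var/lib/rancher/k3s/server/db/etcd-old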

@cjellick
Contributor Author

I understand protecting against multiple accidental restores, but there needs to be some way to do more than one restore.

Telling the user in logs to delete a random etcd-old directory doesn't feel like a very elegant solution.

Restore is now tied to --cluster-reset, which exits at the end of resetting. Therefore, you cannot accidentally have a situation where k3s comes up in restore mode, runs functionally, and then restarts and restores again.
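
With that behavior, a second restore on the same host should just be the same sequence as the first (a sketch reusing the commands from the repro steps above; <snapshot-name> is a placeholder for whichever snapshot you want to restore):

sudo systemctl stop k3s
sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot-name>
sudo systemctl start k3s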

@ShylajaDevadiga
Contributor

Validated using k3s commit ID as well as rke2 beta23

k3s version v1.19.2+k3s-c3c98319
rke2 version v1.18.9-beta23+rke2

Following the above instructions, restoring multiple times from snapshots was successful.
The cluster is up and running.

@zube zube bot removed the [zube]: Done label Dec 31, 2020