
Cannot do two etcd restores in a row on the same host #353

Closed
cjellick opened this issue Sep 23, 2020 · 5 comments

Comments

@cjellick
Contributor

Environmental Info:
K3s Version:
1.19.1+k3s1 (actually it was a late RC that I used in a demo)

Node(s) CPU architecture, OS, and Version:
A medium/average-size DigitalOcean droplet running Ubuntu 20.04

Cluster Configuration:
3 masters using embedded etcd (though it should repro with just 1 master)

Describe the bug:
If you try to do two etcd restores from the same host, the second one will fail.
We apparently have a check that looks for /var/lib/rancher/k3s/server/db/etcd-old/ and refuses to do the restore if that directory is there, thinking that the db has "already" been restored.
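
Note: based on the error message quoted under "Actual behavior" below, a possible manual workaround (unverified here, and assuming the check only tests for the existence of that directory) is to move the leftover etcd-old directory aside before retrying the restore:

# sketch only: set aside the etcd-old directory left behind by the first restore
sudo systemctl stop k3s
sudo mv /var/lib/rancher/k3s/server/db/etcd-old /var/lib/rancher/k3s/server/db/etcd-old.bak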

Steps To Reproduce:
Warning: these steps are mostly from memory, so may not be perfect

  1. Bring k3s up in etcd mode (with more frequent backups so it is easier to test):
sudo mkdir -p /etc/rancher/k3s && \
sudo vi /etc/rancher/k3s/config.yaml
etcd-snapshot-schedule-cron: "*/2 * * * *"
cluster-init: true
  2. Wait for a backup to occur and then attempt to restore to it.
    2a. Stop k3s:
systemctl stop k3s # maybe it's systemctl stop k3s-server; I don't have the exact command anymore

2b. Perform the restore using just the k3s binary (a sketch for locating the snapshot file name follows these steps):

k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-1600268280

2c. Start k3s back up:

systemctl start k3s

2d. Observe the cluster come back up successfully:

kubectl get nodes
  3. Repeat step 2.
    This will break.
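
For step 2b, the snapshot file name can be found by listing the default snapshot directory (path assumed from the restore command above; the timestamp suffix will differ per cluster):

ls -lt /var/lib/rancher/k3s/server/db/snapshots/
# pass the full path of the desired snapshot to --cluster-reset-restore-path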

Expected behavior:
The cluster should restore cleanly the second time.

Actual behavior:
Cluster fails to restore the second time with this error:

INFO[2020-09-16T15:14:32.020933020Z] etcd already restored from a snapshot. Restart without --snapshot-restore-path flag. Backup and delete ${datadir}/server/db on each peer etcd server and rejoin the nodes
@cjellick cjellick transferred this issue from k3s-io/k3s Sep 23, 2020
@cjellick cjellick added this to the GA milestone Sep 23, 2020
@cjellick
Contributor Author

This was encountered in k3s, but I have moved the issue to rke2 because it should have the same problem here and I want it in the GA milestone.

@cjellick
Contributor Author

cc @rancher-max - this is the issue I was chatting with you about while prepping for the demo

@brandond
Member

brandond commented Sep 23, 2020

I thought that we had done this on purpose to prevent users from accidentally putting the restore command in their systemd unit and wedging it into a restore loop.

I think we shouldn't change this behavior, but maybe add an additional message instructing the user to delete the etcd-old directory if they really want to restore again.

Or do you think we should delete the etcd-old directory if it is found when starting without the restore flag? That might work.
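
For the first option, the extra log message could point the operator at something like the following (a sketch only, using the default k3s data dir; it assumes the operator really does intend a second restore):

# run on each etcd server node before retrying the restore
sudo systemctl stop k3s
sudo rm -rf /var/lib/rancher/k3s/server/db/etcd-old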

@cjellick
Contributor Author

I understand protecting against multiple accidental restores, but there needs to be some way to do more than one restore.

Telling the user in logs to delete a random etcd-old directory doesn't feel like a very elegant solution.

Restore is now tied to --cluster-reset, which exits at the end of resetting. Therefore, you cannot accidentally have a situation where k3s comes up in restore mode, runs functionally, and then restarts and restores again.
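
With that behavior, a second restore on the same host should just be the same sequence as the first (a sketch reusing the commands from the repro steps above; <snapshot-name> is a placeholder for whichever snapshot you want to restore):

sudo systemctl stop k3s
sudo k3s server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/k3s/server/db/snapshots/<snapshot-name>
sudo systemctl start k3s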

@ShylajaDevadiga
Contributor

Validated using k3s commit ID as well as rke2 beta23

k3s version v1.19.2+k3s-c3c98319
rke2 version v1.18.9-beta23+rke2

Following the above instructions, restoring multiple times from snapshots was successful.
The cluster is up and running.

@zube zube bot removed the [zube]: Done label Dec 31, 2020