
Removing etcd Node on 1.22 RKE1 Cluster kills etcd cluster #36874

Closed
Mario-F opened this issue Mar 14, 2022 · 9 comments

Comments

Mario-F commented Mar 14, 2022

Rancher Server Setup

  • Rancher version: v2.6.3-patch1
  • Installation option (Docker install/Helm Chart): docker

Information about the Cluster

  • Kubernetes version: 1.22.6
  • Cluster Type (Local/Downstream): Downstream
    • Custom RKE

User Information

  • What is the role of the user logged in? Cluster Owner

Describe the bug
After upgrading our 1.21 cluster to 1.22 it ran etcd v3.5.0 successfully, but a few days later we wanted to migrate our control plane (etcd/control) to new nodes (for performance reasons).
Adding the new etcd/control nodes caused no problems, but after deleting one old control node in Cluster Management, all remaining etcd nodes crashed with:
panic: unexpected removal of unknown remote

This is very likely introduced by etcd v3.5.0: etcd-io/etcd#13119

To Reproduce
Remove an etcd node from a custom RKE 1.22 cluster that was upgraded from 1.21.

Result
All remaining etcd instances should panic

Workaround
We worked around this problem by manually running etcd v3.5.2 with the same parameters as the etcd container on all nodes. This repairs the database, and afterwards v3.5.0 also runs normally again (until another etcd node is removed).


mrulke commented Apr 6, 2022

@Mario-F how did you run 3.5.2 manually?


Mario-F commented Apr 6, 2022

@mrulke I used this handy tool to generate a docker run command for the existing container: docker run --rm -v /var/run/docker.sock:/var/run/docker.sock:ro assaflavie/runlike etcd

Then I replaced the image version in that command, stopped the running etcd container (docker stop etcd), removed it (docker container rm etcd), and executed the modified docker run command.
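
Roughly, the full sequence was something like this (the placeholders in the last command stand for whatever runlike prints for your own container; nothing here is specific to one setup):

  # print a docker run command matching the existing etcd container
  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock:ro assaflavie/runlike etcd

  # stop and remove the old container
  docker stop etcd
  docker container rm etcd

  # re-run the printed command with only the image tag changed
  docker run -d --name etcd --restart=unless-stopped \
    <docker options from the runlike output> \
    rancher/mirrored-coreos-etcd:v3.5.2 \
    <etcd command and flags from the runlike output>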


mrulke commented Apr 6, 2022

Thanks @Mario-F
That got me going; then I updated the cluster.yaml with:

system_images:
  etcd: rancher/mirrored-coreos-etcd:v3.5.2

@terickson

So we had a similar problem. After we added an etcd node and tried to remove a node, our etcd cluster failed on Rancher 2.6.4. We tried @mrulke's fix, and that did seem to fix etcd for a time, but then the database appeared to get corrupted again. We have tried this on other clusters and have had the same issue. The clusters where we have not tried to remove etcd nodes have been very stable, so adding an etcd node and then removing one in this version of Rancher does seem to break etcd.


mrulke commented Apr 28, 2022

Yeah, I had the same issue as well, even with the fix in place. I'm in the process of rebuilding my lab.

@terickson

So we did some further research. This seems to be related to https://www.suse.com/support/kb/doc/?id=000020632, which in turn is related to etcd-io/etcd#13922. The long and short of it is that the etcd 3.5 branch can corrupt the database when a node is killed, and that corruption can get replicated to all nodes; this was not fixed until the 3.5.3 release. Rancher's 1.22 Kubernetes release uses the etcd 3.5 branch. It's not until Rancher's 1.22.9 release of Kubernetes that they use the 3.5.3 release of etcd; we have been running that for a couple of days and it seems to be fairly stable. The problem is that you currently have to be running v2.6.5-rc2 in order to get that release. v2.6.5 is supposed to become GA in the next couple of weeks, but until then you have to run a release candidate in order to get this version of Kubernetes.
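
If you manage the cluster config yourself with the RKE CLI, pinning that Kubernetes release would look roughly like this in cluster.yml (the exact version suffix below is only an example; check which v1.22.9 tag your Rancher/RKE release actually ships):

  # example only - substitute the v1.22.9 tag supported by your RKE/Rancher version
  kubernetes_version: "v1.22.9-rancher1-1"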


mrulke commented Apr 29, 2022

Thanks. Can I just update system_images in my YAML for a build now (even a new build), or do I have to run the RC version? I'm not sure what to put in the YAML for that, e.g.:

system_images:
  etcd: rancher/mirrored-coreos-etcd:v3.5.3

@github-actions

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.


Mario-F commented Jul 12, 2022

/remove-lifecycle stale
