
Removing etcd Node on 1.22 RKE1 Cluster kills etcd cluster #36874

Closed
Mario-F opened this issue Mar 14, 2022 · 9 comments

Comments

Mario-F commented Mar 14, 2022

Rancher Server Setup

  • Rancher version: v2.6.3-patch1
  • Installation option (Docker install/Helm Chart): docker

Information about the Cluster

  • Kubernetes version: 1.22.6
  • Cluster Type (Local/Downstream): Downstream
    • Custom RKE

User Information

  • What is the role of the user logged in? Cluster Owner

Describe the bug
After upgrading our 1.21 cluster to 1.22 it ran etcd v3.5.0 successfully, but a few days later we wanted to migrate our control plane (etcd/control) to new nodes (for performance reasons).
Adding the new etcd/control nodes caused no problems, but after deleting one old control node in Cluster Management, all remaining etcd nodes crashed with:
panic: unexpected removal of unknown remote

This is very likely introduced by etcd v3.5.0: etcd-io/etcd#13119

To Reproduce
Remove an etcd node from a custom RKE 1.22 cluster that was upgraded from 1.21.

Result
All remaining etcd instances should panic

Workaround
We worked around this problem by manually running etcd v3.5.2 with the same parameters as the etcd container on all nodes. This repairs the database, and afterwards v3.5.0 also runs normally again (until another etcd node is removed).


mrulke commented Apr 6, 2022

@Mario-F how did you run 3.5.2 manually?


Mario-F commented Apr 6, 2022

@mrulke I used this handy tool to generate a docker run command for the existing container: docker run --rm -v /var/run/docker.sock:/var/run/docker.sock:ro assaflavie/runlike etcd

Then I replaced the image version in that command, stopped the running etcd container (docker stop etcd), removed it (docker container rm etcd), and executed the modified docker run command.
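
Roughly, the full sequence was something like this (the placeholders in the last command stand for whatever runlike prints for your own container; nothing here is specific to one setup):

  # print a docker run command matching the existing etcd container
  docker run --rm -v /var/run/docker.sock:/var/run/docker.sock:ro assaflavie/runlike etcd

  # stop and remove the old container
  docker stop etcd
  docker container rm etcd

  # re-run the printed command with only the image tag changed
  docker run -d --name etcd --restart=unless-stopped \
    <docker options from the runlike output> \
    rancher/mirrored-coreos-etcd:v3.5.2 \
    <etcd command and flags from the runlike output>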


mrulke commented Apr 6, 2022

Thanks @Mario-F
That got me going; then I updated the cluster.yaml with:

system_images:
  etcd: rancher/mirrored-coreos-etcd:v3.5.2

@terickson

So we had a similar problem. After we added an etcd node and tried to remove a node, our etcd cluster failed on Rancher 2.6.4. We tried @mrulke's fix, and that did seem to fix etcd for a time, but then the database appeared to get corrupted again. We have tried this on other clusters and have had the same issue. The clusters where we have not tried to remove etcd nodes have been very stable, so adding an etcd node and then removing one in this version of Rancher does seem to break etcd.


mrulke commented Apr 28, 2022

Yeah, I had the same issue as well, even with the fix in place. I'm in the process of rebuilding my lab.

@terickson

So we did some further research. This seems to be related to https://www.suse.com/support/kb/doc/?id=000020632, which in turn is related to etcd-io/etcd#13922. The long and short of it is that the etcd 3.5 branch can corrupt the database when a node is killed, and that corruption can get replicated to all nodes; this was not fixed until the 3.5.3 release. Rancher's 1.22 Kubernetes release uses the etcd 3.5 branch. It's not until Rancher's 1.22.9 release of Kubernetes that they use the 3.5.3 release of etcd; we have been running that for a couple of days and it seems to be fairly stable. The problem is that you currently have to be running v2.6.5-rc2 in order to get that release. v2.6.5 is supposed to become GA in the next couple of weeks, but until then you have to run a release candidate in order to get this version of Kubernetes.
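
If you manage the cluster config yourself with the RKE CLI, pinning that Kubernetes release would look roughly like this in cluster.yml (the exact version suffix below is only an example; check which v1.22.9 tag your Rancher/RKE release actually ships):

  # example only - substitute the v1.22.9 tag supported by your RKE/Rancher version
  kubernetes_version: "v1.22.9-rancher1-1"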


mrulke commented Apr 29, 2022

Thanks. Can I just update system_images in my YAML for a build now (even a new build), or do I have to run the RC version? I'm not sure what to put in the YAML for that, e.g.:

system_images:
  etcd: rancher/mirrored-coreos-etcd:v3.5.3

@github-actions

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.


Mario-F commented Jul 12, 2022

/remove-lifecycle stale
