Removing etcd Node on 1.22 RKE1 Cluster kills etcd cluster #36874
Comments
@Mario-F how did you run 3.5.2 manually?
@mrulke I used this handy tool to create a docker run command, then replaced the version and stopped the running etcd.
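For anyone following along, a minimal sketch of recovering the running container's configuration so it can be re-launched under a different image tag; the container name `etcd` is the RKE1 default and is an assumption here:

```bash
# Recover the image and command line of the running etcd container
# ("etcd" is the default RKE1 container name).
docker inspect etcd --format '{{.Config.Image}}'
docker inspect etcd --format '{{json .Config.Entrypoint}}'
docker inspect etcd --format '{{json .Config.Cmd}}'

# Stop the crashing v3.5.0 container before starting the replacement.
docker stop etcd
```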
Thanks @Mario-F
So we had a similar problem. After we added an etcd node and then tried to remove a node on Rancher 2.6.4, our etcd cluster failed. We tried @mrulke's fix, and that did seem to fix etcd for a time, but then the database got corrupted again. We have tried this in other clusters and have had the same issue. The clusters that we have not tried to remove etcd nodes from have been very stable, so adding an etcd node and then removing one in this version of Rancher does seem to break etcd.
Yeah, I had the same issue as well, even with the check in place. I'm in the process of rebuilding my lab.
So we did some further research. This seems to be related to https://www.suse.com/support/kb/doc/?id=000020632, which in turn points to etcd-io/etcd#13922. The long and short of it is that, until the 3.5.3 release, the etcd 3.5 branch can corrupt the database when a node is killed, and that corruption can get replicated to all nodes. Rancher's 1.22 Kubernetes release uses the etcd 3.5 branch; it is not until Rancher's 1.22.9 release of Kubernetes that they use the 3.5.3 release of etcd. We have been using that for a couple of days and it seems fairly stable. The catch is that currently you have to be running v2.6.5-rc2 in order to get that release. v2.6.5 is supposed to become GA in the next couple of weeks, but until then you have to run a release candidate to get this version of Kubernetes.
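A quick way to confirm which etcd build a node is actually running (the container name and the preset `ETCDCTL_*` environment variables are RKE1 defaults, assumed here):

```bash
# Print the etcd server binary version inside the RKE1 etcd container.
docker exec etcd etcd --version

# endpoint status also reports the server version per member; the RKE1
# etcd container normally has ETCDCTL_* env vars preset, so no extra
# TLS flags should be needed (pass --cacert/--cert/--key otherwise).
docker exec etcd etcdctl endpoint status -w table
```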
Thanks
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
/remove-lifecycle stale
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
After upgrading our 1.21 cluster to 1.22 it used etcd v3.5.0 successfully, but a few days later we wanted to migrate our control plane (etcd/control) to new nodes (for performance reasons).
Adding the new etcd/control nodes caused no problems, but after deleting one old control node in Cluster Management, all remaining etcd nodes crashed with:
panic: unexpected removal of unknown remote
This is very likely introduced by etcd v3.5.0: etcd-io/etcd#13119
To Reproduce
Remove an etcd node from a custom RKE 1.22 cluster that was upgraded from 1.21.
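Under the hood, deleting the node amounts to an etcd member removal; a hedged etcdctl equivalent (the member ID below is illustrative, and the `ETCDCTL_*` env vars preset in the RKE1 container are assumed):

```bash
# List the current members to find the hex ID of the node to remove.
docker exec etcd etcdctl member list

# Removing the member is the operation that triggers the panic on the
# remaining v3.5.0 peers. The ID below is an illustrative placeholder.
docker exec etcd etcdctl member remove 8e9e05c52164694d
```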
Result
All remaining etcd instances should panic
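The panic can be confirmed in the container logs on the surviving nodes; the grep pattern is the panic message quoted above:

```bash
# etcd logs to stderr; search for the panic reported in this issue.
docker logs etcd 2>&1 | grep "unexpected removal of unknown remote"
```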
Workaround
We worked around this problem by manually running etcd v3.5.2 with the same parameters as the etcd container on all nodes; this repairs the database, and afterwards v3.5.0 also runs normally (until another etcd node is removed).
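A minimal sketch of that workaround, assuming the upstream `quay.io/coreos/etcd:v3.5.2` image and RKE1's default paths; the actual flags must be copied from the existing container (see the `docker inspect` commands earlier), not from this example:

```bash
# Stop the crashed v3.5.0 container so the repair run can take over
# its data directory and ports.
docker stop etcd

# Re-run etcd as v3.5.2 with the SAME flags the original container
# used. Every flag and mount below is an illustrative placeholder.
docker run -d --name etcd-fix --net=host \
  -v /var/lib/etcd:/var/lib/rancher/etcd \
  -v /etc/kubernetes:/etc/kubernetes \
  quay.io/coreos/etcd:v3.5.2 \
  /usr/local/bin/etcd \
    --name etcd-node1 \
    --data-dir /var/lib/rancher/etcd/ \
    --initial-cluster-state existing
    # ...plus the listen/advertise URLs and TLS flags copied from
    # the original container.

# Once the cluster is healthy again, stop the repair container and let
# Rancher's v3.5.0 container start; per the report it then runs
# normally until another etcd node is removed.
docker stop etcd-fix
```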