Nodes fail to join cluster during update to v1.22.3 #13118
Comments
@tobiasamft can you try downgrading containerd version to you can specify to config like
|
@zetaab I tried out different things:
The result is the same for all distributions and containerd versions: upgrade fails, new node can't join the cluster. Additional findings:
|
OK, then this sounds like a different problem. To me it looks like you have problems in either etcd or kube-apiserver. Check the logs under |
We had the following findings and don't know how to classify them:
|
I also experienced the etcd issue when upgrading a kops cluster running 1.22.4 (with kops 1.22.1) to 1.22.5 (with kops 1.22.2). I got stuck in the same failed-upgrade loop and had to forcefully downgrade using kops 1.22.1, killing off the control plane so it booted back up with etcd 3.5.0. |
CNI won't work when apiserver isn't working, and apiserver won't work if etcd isn't working. I think you need to do a bit of triaging to find the member that isn't healthy and eventually what it complains about |
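The triage order described above can be sketched as a small helper. This is only an illustration of the reasoning: the component names, the `DEPENDENCY_ORDER` list, and the `first_unhealthy` function are hypothetical, and the health flags would have to come from your own checks of the logs.

```python
# Dependency chain taken from the comment above:
# CNI needs the apiserver, and the apiserver needs etcd.
# The names here are illustrative, not kops identifiers.
DEPENDENCY_ORDER = ["etcd", "kube-apiserver", "cni"]


def first_unhealthy(health):
    """Return the lowest-level component that is not healthy, or None.

    `health` maps a component name to True (healthy) or False.
    A missing entry is treated as unhealthy.
    """
    for component in DEPENDENCY_ORDER:
        if not health.get(component, False):
            return component
    return None


if __name__ == "__main__":
    # Everything looks broken, but etcd is the place to start looking.
    print(first_unhealthy({"etcd": False, "kube-apiserver": False, "cni": False}))
```

The point of walking the chain from the bottom is that a CNI or apiserver failure is not worth investigating until etcd is confirmed healthy.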
@erismaster I encourage you to use kops 1.22.3, which comes with a newer version of etcd-manager that has some fixes for 3.5.x |
We had a dedicated look at the etcd logs which revealed the following:
|
Yeah. That PR means 3.5.1 will be used for upgrading 3.5 clusters. So weird it would want to use 3.5.0 for something. Unless kops sets this somewhere |
@olemarkus I am hitting this same issue with kops 1.22.3. It looks like I had a 3.5.0 etcd cluster installed and now it tries to use 3.5.1 but fails. |
cannot use
2022-01-25T17:13:24.238674833Z stdout F W0125 17:13:24.238586 5215 etcdserver.go:118] error running etcd: unknown etcd version v3.5.0: not found in [/opt/etcd-v3.5.0-linux-amd64]
either
I am now using the following to override the etcd-manager image
it looks like image
edit: I can confirm that modifying the etcd-manager image and defining the etcd version works. |
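When scanning etcd-manager logs for this failure, it can help to pull out the requested version and the install directories that were actually searched. The following is a minimal sketch that parses the single log line quoted above. The regular expression and the `parse_unknown_version` helper are hypothetical, and the sketch assumes that multiple directories, if present, are separated by whitespace inside the brackets.

```python
import re

# Log line copied from the comment above.
LOG_LINE = (
    "2022-01-25T17:13:24.238674833Z stdout F W0125 17:13:24.238586 5215 "
    "etcdserver.go:118] error running etcd: unknown etcd version v3.5.0: "
    "not found in [/opt/etcd-v3.5.0-linux-amd64]"
)

# Pattern written against the line above only; other etcd-manager
# versions may word this message differently.
PATTERN = re.compile(
    r"unknown etcd version v(?P<version>[\d.]+): "
    r"not found in \[(?P<paths>[^\]]*)\]"
)


def parse_unknown_version(line):
    """Return (requested_version, searched_paths), or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Assumption: directories are whitespace-separated inside the brackets.
    paths = match.group("paths").split()
    return match.group("version"), paths


if __name__ == "__main__":
    print(parse_unknown_version(LOG_LINE))
```

In the quoted line the requested version (3.5.0) and the only directory searched (`/opt/etcd-v3.5.0-linux-amd64`) carry the same version number, which is worth noting when comparing against the etcd-manager image in use.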
If you have a broken cluster, try setting etcd manager to Also see kubernetes-sigs/etcdadm#279 |
/kind blocks-next |
Given that there have been several cluster-breaking bugs from kops updates lately, is it safe to assume that there is no automated testing (or routine manual testing) of kops update cluster before a release is made? |
There are many. See https://testgrid.k8s.io/kops-misc But for various reasons they didn't catch this one. Plan on remedying that. |
@btalbot also, one problem is that there can be multiple ways to use kops. You can configure lots of different things, and those are not all tested automatically (it makes no sense to run 1000 different tests with different combinations). |
/kind bug
1. What kops version are you running? The command kops version will display this information.
Version 1.22.3 (git-241bfeba5931838fd32f2260aff41dd89a585fba)
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running, or provide the Kubernetes version specified as a kops flag.
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5", GitCommit:"aea7bbadd2fc0cd689de94a54e5b7b758869d691", GitTreeState:"clean", BuildDate:"2021-09-15T21:10:45Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.5", GitCommit:"5c99e2ac2ff9a3c549d9ca665e7bc05a3e18f07e", GitTreeState:"clean", BuildDate:"2021-12-16T08:32:32Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
Upgrade a (freshly created) v1.22.2 cluster to v1.22.3
5. What happened after the commands executed?
6. What did you expect to happen?
Proper cluster update without errors
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
Might be related to issue #13116.
Journalctl (partial) log of the instance which is not able to join the cluster (starting at first error E0118):