[2.2.4] etcd/controlplane master not registering on new clusters #21514

Closed
superxor opened this issue Jul 13, 2019 · 21 comments

@superxor

Fresh custom cluster. Rancher 2.2.4
k8s 1.13, flannel

Steps: create a new cluster, add a master node.

The master node can't register; one of the pods endlessly loops:

  • docker start kubelet
    Error response from daemon: {"message":"No such container: kubelet"}
    Error: failed to start containers: kubelet
  • sleep 2

etcd pod shows:

WARNING: 2019/07/13 10:19:35 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

@superxor superxor changed the title master not registering on new cluster v2.2.4 [2.2.4] master not registering on new clusters Jul 13, 2019
@slamdev

slamdev commented Jul 13, 2019

When k8s 1.14 is used, this error appears in the etcd logs:

2019-07-13 11:14:24.030026 I | embed: rejected connection from "10.3.1.20:40787" (error "EOF", ServerName "")
2019-07-13 11:14:32.866684 I | embed: rejected connection from "10.3.1.20:58550" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2019-07-13 11:14:37.882558 I | embed: rejected connection from "10.3.1.20:58552" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")
2019-07-13 11:14:42.905844 I | embed: rejected connection from "10.3.1.20:58554" (error "tls: failed to verify client's certificate: x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"kube-ca\")", ServerName "")

@superxor
Author

Can someone please check this? It's a show-stopper.

It's clearly an issue with Rancher: tls: bad certificate
I can reproduce it every time.

@superxor superxor changed the title [2.2.4] master not registering on new clusters [2.2.4] etcd/controlplane master not registering on new clusters Jul 24, 2019
@superxor
Author

related #20909

@superseb
Contributor

The error shown usually indicates nodes being re-used, with data left behind that clashes with the newly generated data.

If it's reproducible every time, please provide the exact steps you've taken so I can use them to reproduce on my end (see https://rancher.com/docs/rancher/v2.x/en/contributing/#bugs-issues-or-questions). Adding an etcd + controlplane node in a custom cluster (1.13 + flannel) does not reproduce it.
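
For reference, a quick way to check a node for leftover state before adding it; the paths here are assumptions based on Rancher/RKE defaults and the rest of this thread, not confirmed for this setup:

# run on the node before registering it; any hits suggest remnants of a previous cluster
docker ps -a | grep -E 'etcd|kubelet|kube-apiserver'
ls -la /etc/kubernetes/ssl /var/lib/etcd /opt/rke 2>/dev/null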

@superxor
Author

superxor commented Jul 24, 2019

@superseb I suspect the issue lies with the rancher-master. The etcd node is a brand-new VM with a fresh hostname (because of the famous hostname issue... when is that ever going to be fixed?).

I tried this 4-5 times now, making sure it's always a new VM with a new disk and a new, different hostname.

steps:

  1. Create a new cluster
  2. Spawn a new VM/node
  3. Run the docker command... (roughly the shape sketched below)
  4. See the etcd pod error about the TLS cert failure
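
For context, the node registration command copied from the Rancher UI roughly has this shape; the server URL, token, checksum and roles below are placeholders, not the actual values used here:

sudo docker run -d --privileged --restart=unless-stopped --net=host \
  -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run \
  rancher/rancher-agent:v2.2.4 \
  --server https://<rancher-url> --token <registration-token> --ca-checksum <checksum> \
  --etcd --controlplane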

Ubuntu 18.04 LTS
docker version
Client:
Version: 18.09.7
API version: 1.39
Go version: go1.10.8
Git commit: 2d0083d
Built: Thu Jun 27 17:56:23 2019
OS/Arch: linux/amd64
Experimental: false

@superseb
Contributor

When did it start to happen (as it doesn't happen on a fresh 2.2.4 Rancher)? What is in the log of the rancher container (possible leads to why it's failing)?

@superxor
Author

superxor commented Jul 24, 2019

Since the upgrade to 2.2.4.

I don't see anything on the Rancher master, nor in the other pods:

2019/07/24 09:12:29 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.60:44214: i/o timeout
2019/07/24 09:12:29 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.72:49762: i/o timeout
2019/07/24 09:12:30 [INFO] Handling backend connection request [c-q89w5:m-4c0602df0d65]
2019/07/24 09:12:30 [INFO] Handling backend connection request [c-q89w5:m-e5f8240f1893]
2019/07/24 09:12:30 [INFO] Handling backend connection request [c-5q2fz:m-623b23917cdf]
2019/07/24 09:12:30 [INFO] Handling backend connection request [c-5q2fz:m-f48ee2daad85]
2019/07/24 09:12:32 [INFO] Handling backend connection request [c-q89w5:m-b0676695ae93]
2019/07/24 09:12:32 [INFO] Handling backend connection request [c-5q2fz:m-25bd4ee7511c]
2019/07/24 09:12:33 [INFO] Handling backend connection request [c-5q2fz:m-a885899d4c85]
2019/07/24 09:12:34 [INFO] Handling backend connection request [c-5q2fz:m-2b5b7779d339]
2019/07/24 09:12:36 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.26:39580: i/o timeout
2019/07/24 09:12:36 [INFO] Handling backend connection request [c-5q2fz:m-e81840c53f86]

same time on the etcd pod:

2019-07-24 09:12:22.726490 W | auth: simple token is not cryptographically signed
2019-07-24 09:12:22.730662 I | etcdserver: starting server... [version: 3.2.24, cluster version: to_be_decided]
2019-07-24 09:12:22.731466 I | etcdserver/membership: added member fd1bb75040a79e2d [https://10.3.1.20:2380] to cluster 3bf383ae9c8f9b68
2019-07-24 09:12:22.731580 N | etcdserver/membership: set the initial cluster version to 3.2
2019-07-24 09:12:22.731638 I | etcdserver/api: enabled capabilities for version 3.2
2019-07-24 09:12:22.732407 I | embed: ClientTLS: cert = /etc/kubernetes/ssl/kube-etcd-10-3-1-20.pem, key = /etc/kubernetes/ssl/kube-etcd-10-3-1-20-key.pem, ca = , trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true
2019-07-24 09:12:30.221083 I | raft: fd1bb75040a79e2d is starting a new election at term 2
2019-07-24 09:12:30.221215 I | raft: fd1bb75040a79e2d became candidate at term 3
2019-07-24 09:12:30.221258 I | raft: fd1bb75040a79e2d received MsgVoteResp from fd1bb75040a79e2d at term 3
2019-07-24 09:12:30.221306 I | raft: fd1bb75040a79e2d became leader at term 3
2019-07-24 09:12:30.221323 I | raft: raft.node: fd1bb75040a79e2d elected leader fd1bb75040a79e2d at term 3
2019-07-24 09:12:30.221963 I | etcdserver: published {Name:etcd-slamdev-master-t13-20 ClientURLs:[https://10.3.1.20:2379 https://10.3.1.20:4001]} to cluster 3bf383ae9c8f9b68
2019-07-24 09:12:30.222173 I | embed: ready to serve client requests
2019-07-24 09:12:30.223095 I | embed: serving client requests on [::]:2379
WARNING: 2019/07/24 09:12:30 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

@superxor
Author

@superseb Where are the certs coming from? I assume the issue is the certs provided by the Rancher master? Maybe because we started out with 2.1 or something and upgrades did not fix some issue with them. That would explain why a fresh 2.2.4 setup does not have this problem.

@superseb
Contributor

When you create a new cluster and add the node, Rancher starts provisioning the cluster. Logging from this would be helpful, as it shows which part of the provisioning process succeeds and where it fails.

Also, agent logs would help, as I see connections to Rancher dropping (from the new node in the new cluster and possibly from other nodes in other clusters).

@superxor
Author

superxor commented Jul 24, 2019

Rancher master logs during cluster creation and node adding (I removed a few unrelated i/o timeout and warning lines; it was too much log spam):

2019/07/24 11:45:32 [INFO] Provisioning cluster [c-8gf44]
2019/07/24 11:45:32 [INFO] Creating cluster [c-8gf44]
2019/07/24 11:45:35 [INFO] Handling backend connection request [c-8gf44:m-91f491207258]
2019/07/24 11:45:35 [INFO] Handling backend connection request [c-5q2fz:m-3057e8ce2de9]
2019/07/24 11:45:36 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.86:37826: i/o timeout
2019/07/24 11:45:37 [INFO] Handling backend connection request [c-5q2fz:m-4b59508bd6b8]
2019/07/24 11:45:37 [INFO] kontainerdriver rancherkubernetesengine listening on address 127.0.0.1:10800
2019/07/24 11:45:37 [INFO] cluster [c-8gf44] provisioning: Initiating Kubernetes cluster
2019/07/24 11:45:37 [INFO] cluster [c-8gf44] provisioning: [dialer] Setup tunnel for host [10.3.1.20]
2019/07/24 11:45:37 [INFO] cluster [c-8gf44] provisioning: [state] Pulling image [rancher/rke-tools:v0.1.28] on host [10.3.1.20]
2019/07/24 11:45:42 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.37:43978: i/o timeout
2019/07/24 11:45:42 [INFO] Handling backend connection request [c-5q2fz:m-daaf4c2d51eb]
2019/07/24 11:45:42 [INFO] cluster [c-8gf44] provisioning: [state] Successfully pulled image [rancher/rke-tools:v0.1.28] on host [10.3.1.20]
2019/07/24 11:45:43 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.63:41068: i/o timeout
2019/07/24 11:45:43 [INFO] Handling backend connection request [c-5q2fz:m-41c1a245fd95]
2019/07/24 11:45:44 [INFO] cluster [c-8gf44] provisioning: [state] Successfully started [cluster-state-deployer] container on host [10.3.1.20]
2019/07/24 11:45:44 [INFO] Handling backend connection request [c-5q2fz:m-1afe65447b6f]
2019/07/24 11:45:44 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.60:51806: i/o timeout
2019/07/24 11:45:44 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating CA kubernetes certificates
2019/07/24 11:45:44 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.42:37558: i/o timeout
2019/07/24 11:45:44 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Kubernetes API server aggregation layer requestheader client CA certificates
2019/07/24 11:45:44 [INFO] Handling backend connection request [c-5q2fz:m-30da5bbfa251]
2019/07/24 11:45:45 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Kube Scheduler certificates
2019/07/24 11:45:45 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Kube Proxy certificates
2019/07/24 11:45:45 [INFO] Handling backend connection request [c-5q2fz:m-5fa522286aa8]
2019/07/24 11:45:45 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Node certificate
2019/07/24 11:45:45 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Kubernetes API server certificates
2019/07/24 11:45:45 [INFO] Handling backend connection request [c-5q2fz:m-fa0b42cd127d]
2019/07/24 11:45:45 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.119:34756: i/o timeout
2019/07/24 11:45:46 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating admin certificates and kubeconfig
2019/07/24 11:45:46 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Kubernetes API server proxy client certificates
2019/07/24 11:45:47 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.83:60424: i/o timeout
2019/07/24 11:45:47 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating etcd-10.3.1.20 certificate and key
2019/07/24 11:45:47 [INFO] cluster [c-8gf44] provisioning: [certificates] Generating Kube Controller certificates
2019/07/24 11:45:47 [INFO] Handling backend connection request [c-q89w5:m-ad044186556d]
2019/07/24 11:45:47 [INFO] cluster [c-8gf44] provisioning: Successfully Deployed state file at [management-state/rke/rke-119787230/cluster.rkestate]
2019/07/24 11:45:48 [INFO] cluster [c-8gf44] provisioning: Building Kubernetes cluster
2019/07/24 11:45:48 [INFO] cluster [c-8gf44] provisioning: [dialer] Setup tunnel for host [10.3.1.20]
2019/07/24 11:45:48 [INFO] cluster [c-8gf44] provisioning: [network] Deploying port listener containers
2019/07/24 11:45:48 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.62:43140: i/o timeout
2019/07/24 11:45:48 [INFO] cluster [c-8gf44] provisioning: [network] Successfully started [rke-etcd-port-listener] container on host [10.3.1.20]
2019/07/24 11:45:49 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.88:49832: i/o timeout
2019/07/24 11:45:49 [INFO] cluster [c-8gf44] provisioning: [network] Successfully started [rke-cp-port-listener] container on host [10.3.1.20]
2019/07/24 11:45:49 [INFO] cluster [c-8gf44] provisioning: [network] Port listener containers deployed successfully
2019/07/24 11:45:49 [INFO] cluster [c-8gf44] provisioning: [network] Running control plane -> etcd port checks
2019/07/24 11:45:50 [INFO] Handling backend connection request [c-5q2fz:m-132e9f7d86bc]
2019/07/24 11:45:50 [INFO] Handling backend connection request [c-q89w5:m-b0676695ae93]
2019/07/24 11:45:50 [INFO] Handling backend connection request [c-5q2fz:m-623b23917cdf]
2019/07/24 11:45:50 [INFO] cluster [c-8gf44] provisioning: [network] Successfully started [rke-port-checker] container on host [10.3.1.20]
2019/07/24 11:45:50 [INFO] cluster [c-8gf44] provisioning: [network] Running control plane -> worker port checks
2019/07/24 11:45:51 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.91:44942: i/o timeout
2019/07/24 11:45:51 [INFO] cluster [c-8gf44] provisioning: [network] Successfully started [rke-port-checker] container on host [10.3.1.20]
2019-07-24 11:45:51.343530 I | mvcc: store.index: compact 34577767
2019/07/24 11:45:51 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.38:35090: i/o timeout
2019/07/24 11:45:51 [INFO] cluster [c-8gf44] provisioning: [network] Running workers -> control plane port checks
2019-07-24 11:45:51.490473 I | mvcc: finished scheduled compaction at 34577767 (took 136.99168ms)
2019/07/24 11:45:51 [INFO] cluster [c-8gf44] provisioning: [network] Skipping kubeapi port check
2019/07/24 11:45:51 [INFO] cluster [c-8gf44] provisioning: [network] Removing port listener containers
2019/07/24 11:45:51 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.58:57148: i/o timeout
2019/07/24 11:45:51 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.24:37620: i/o timeout
2019/07/24 11:45:51 [INFO] cluster [c-8gf44] provisioning: [remove/rke-etcd-port-listener] Successfully removed container on host [10.3.1.20]
2019/07/24 11:45:52 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.98:57630: i/o timeout
2019/07/24 11:45:52 [INFO] cluster [c-8gf44] provisioning: [remove/rke-cp-port-listener] Successfully removed container on host [10.3.1.20]
2019/07/24 11:45:52 [INFO] cluster [c-8gf44] provisioning: [network] Port listener containers removed successfully
2019/07/24 11:45:52 [INFO] cluster [c-8gf44] provisioning: [certificates] Deploying kubernetes certificates to Cluster nodes

2019/07/24 11:45:58 [INFO] cluster [c-8gf44] provisioning: [reconcile] Rebuilding and updating local kube config
I0724 11:45:58.076641 7 http.go:110] HTTP2 has been explicitly disabled
2019/07/24 11:45:58 [INFO] cluster [c-8gf44] provisioning: Successfully Deployed local admin kubeconfig at [management-state/rke/rke-119787230/kube_config_cluster.yml]
2019/07/24 11:45:58 [INFO] Handling backend connection request [c-5q2fz:m-471f71f8e09b]
2019/07/24 11:45:58 [INFO] cluster [c-8gf44] provisioning: [certificates] Successfully deployed kubernetes certificates to Cluster nodes
2019/07/24 11:45:58 [INFO] cluster [c-8gf44] provisioning: [file-deploy] Deploying file [/etc/kubernetes/kube-api-authn-webhook.yaml] to node [10.3.1.20]
2019/07/24 11:45:59 [INFO] cluster [c-8gf44] provisioning: Successfully started [file-deployer] container on host [10.3.1.20]
2019/07/24 11:45:59 [INFO] cluster [c-8gf44] provisioning: Waiting for [file-deployer] container to exit on host [10.3.1.20]
2019/07/24 11:45:59 [INFO] cluster [c-8gf44] provisioning: Waiting for [file-deployer] container to exit on host [10.3.1.20]
2019/07/24 11:45:59 [INFO] cluster [c-8gf44] provisioning: Container [file-deployer] is still running on host [10.3.1.20]
2019/07/24 11:45:59 [INFO] Handling backend connection request [c-q89w5:m-4c0602df0d65]
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: Waiting for [file-deployer] container to exit on host [10.3.1.20]
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: [remove/file-deployer] Successfully removed container on host [10.3.1.20]
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: [/etc/kubernetes/kube-api-authn-webhook.yaml] Successfully deployed authentication webhook config Cluster nodes
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: [reconcile] Reconciling cluster state
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: [reconcile] This is newly generated cluster
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: Pre-pulling kubernetes images
2019/07/24 11:46:00 [INFO] cluster [c-8gf44] provisioning: [pre-deploy] Pulling image [rancher/hyperkube:v1.13.5-rancher1] on host [10.3.1.20]

2019/07/24 11:46:20 [INFO] cluster [c-8gf44] provisioning: [pre-deploy] Successfully pulled image [rancher/hyperkube:v1.13.5-rancher1] on host [10.3.1.20]
2019/07/24 11:46:20 [INFO] cluster [c-8gf44] provisioning: Kubernetes images pulled successfully
2019/07/24 11:46:20 [INFO] cluster [c-8gf44] provisioning: [etcd] Building up etcd plane..
2019/07/24 11:46:20 [INFO] cluster [c-8gf44] provisioning: [etcd] Pulling image [rancher/coreos-etcd:v3.2.24-rancher1] on host [10.3.1.20]
2019/07/24 11:46:20 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.122:41248: i/o timeout
2019/07/24 11:46:20 [INFO] Handling backend connection request [c-5q2fz:m-f48ee2daad85]
2019/07/24 11:46:23 [INFO] cluster [c-8gf44] provisioning: [etcd] Successfully pulled image [rancher/coreos-etcd:v3.2.24-rancher1] on host [10.3.1.20]
2019/07/24 11:46:24 [INFO] Handling backend connection request [c-5q2fz:m-471f71f8e09b]
2019/07/24 11:46:25 [INFO] Handling backend connection request [c-5q2fz:m-132e9f7d86bc]
2019/07/24 11:46:25 [INFO] Handling backend connection request [c-q89w5:m-4c0602df0d65]
2019/07/24 11:46:26 [INFO] cluster [c-8gf44] provisioning: [etcd] Successfully started [etcd] container on host [10.3.1.20]
2019/07/24 11:46:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.124:57546: i/o timeout
2019/07/24 11:46:26 [INFO] cluster [c-8gf44] provisioning: [etcd] Successfully started [rke-log-linker] container on host [10.3.1.20]
2019/07/24 11:46:26 [INFO] Handling backend connection request [c-q89w5:m-10f33b2daa15]
2019/07/24 11:46:26 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.43:56516: i/o timeout
2019/07/24 11:46:27 [INFO] cluster [c-8gf44] provisioning: [remove/rke-log-linker] Successfully removed container on host [10.3.1.20]
2019/07/24 11:46:27 [INFO] cluster [c-8gf44] provisioning: [etcd] Successfully started etcd plane.. Checking etcd cluster health

2019/07/24 11:46:40 [INFO] error in remotedialer server [400]: read tcp 172.17.0.3:443->10.3.1.119:37006: i/o timeout
2019/07/24 11:46:41 [INFO] Handling backend connection request [c-5q2fz:m-623b23917cdf]
2019/07/24 11:46:42 [ERROR] cluster [c-8gf44] provisioning: [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy
2019/07/24 11:46:42 [INFO] kontainerdriver rancherkubernetesengine stopped
2019/07/24 11:46:42 [ERROR] ClusterController c-8gf44 [cluster-provisioner-controller] failed with : [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy

Agent pods only show this:

INFO: Using resolv.conf: nameserver 10.3.3.254
INFO: https://rxxxxx/ping is accessible
INFO: rxxxx resolves to 10.3.1.1
time="2019-07-24T11:45:34Z" level=info msg="Rancher agent version v2.2.4 is starting"
time="2019-07-24T11:45:34Z" level=info msg="Option customConfig=map[address:10.3.1.20 internalAddress: roles:[] label:map[]]"
time="2019-07-24T11:45:34Z" level=info msg="Option etcd=false"
time="2019-07-24T11:45:34Z" level=info msg="Option controlPlane=false"
time="2019-07-24T11:45:34Z" level=info msg="Option worker=false"
time="2019-07-24T11:45:34Z" level=info msg="Option requestedHostname=slamdev-master-t14-20"
time="2019-07-24T11:45:34Z" level=info msg="Listening on /tmp/log.sock"
time="2019-07-24T11:45:34Z" level=info msg="Connecting to wss://rancher.adnz.co/v3/connect with token h9vt6wmf6kwv6s7m9qm257gwx8bv5wv5b8zq5qfs984mvmtk976lhw"
time="2019-07-24T11:45:34Z" level=info msg="Connecting to proxy" url="wss://rxxxxxxo/v3/connect"
time="2019-07-24T11:45:34Z" level=info msg="Starting plan monitor"

@superxor
Author

@superseb Same issue on 2.2.6 master and 2.2.6 agent. This is 100% related to Rancher upgrades from 2.1 or 2.2.

2019-07-25 09:11:20.258357 N | etcdserver/membership: set the initial cluster version to 3.2
2019-07-25 09:11:20.258443 I | etcdserver/api: enabled capabilities for version 3.2
WARNING: 2019/07/25 09:11:20 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

@superseb
Contributor

Can you share the output of:

openssl x509 -in /etc/kubernetes/ssl/kube-etcd-10-3-1-20.pem -text -noout
openssl x509 -in /etc/kubernetes/ssl/kube-ca.pem -text -noout
ls -la /etc/kubernetes/ssl
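
As a further check, whether the etcd certificate actually chains to that kube-ca can be verified on the node with openssl (same paths as above):

openssl verify -CAfile /etc/kubernetes/ssl/kube-ca.pem /etc/kubernetes/ssl/kube-etcd-10-3-1-20.pem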

@superxor
Author

Certificate:
Data:
Version: 3 (0x2)
Serial Number: 2119889748694189234 (0x1d6b5c7ea9e8c4b2)
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN = kube-ca
Validity
Not Before: Jul 24 11:52:40 2019 GMT
Not After : Jul 22 09:10:43 2029 GMT
Subject: CN = kube-etcd
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
RSA Public-Key: (2048 bit)
Modulus:
00:b2:3b:92:bc:f8:16:de:b6:fd:2c:c0:7d:ef:6d:
1a:94:e8:39:77:f8:0f:25:c1:bc:54:a8:5e:e8:7c:
e6:dd:d3:8f:ba:e8:bd:f8:cf:2e:f3:b9:5d:86:d1:
c6:98:4d:d1:27:d3:50:94:ec:0f:9b:1c:0e:e1:06:
d0:3a:5b:92:7c:73:60:2d:78:c6:9c:0f:4d:79:50:
7b:28:be:07:eb:77:a5:dc:f4:86:6f:b8:88:50:4b:
c5:be:66:fe:5b:2a:cd:58:79:ae:6b:31:e4:08:6c:
f7:95:03:53:50:e5:e2:2c:81:5d:66:1d:ca:8a:fb:
95:eb:27:3d:ae:82:15:a9:ba:b9:4e:36:0e:1a:c2:
6f:c3:d2:3d:65:00:c3:ec:b6:16:4f:35:27:1b:7f:
65:19:24:86:83:99:1b:a3:fd:cc:b8:30:3d:36:9a:
f3:52:d1:c5:ad:b9:42:e9:53:d2:bb:07:3d:f9:75:
02:80:85:a5:74:fe:31:f4:72:cc:37:e4:f7:5d:e6:
b1:0b:a5:98:fd:00:4e:ce:ac:14:aa:b2:92:bf:79:
b6:f3:90:b5:c2:46:f8:8e:c8:8e:cc:23:2b:76:43:
69:0b:19:5c:e4:86:27:00:6b:f5:da:b4:cd:c4:49:
f8:0e:59:74:6d:42:e3:e8:6e:34:35:e1:64:c1:f3:
bc:b3
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Client Authentication, TLS Web Server Authentication
X509v3 Subject Alternative Name:
DNS:slamdev-master-t17-20, DNS:localhost, DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster.local, IP Address:10.3.1.20, IP Address:127.0.0.1, IP Address:10.43.0.1
Signature Algorithm: sha256WithRSAEncryption
26:47:d8:ae:1c:89:41:cb:2f:72:c5:c4:24:91:8a:ad:26:18:
39:56:fe:5b:18:54:92:83:3c:54:30:a8:18:1b:ef:bc:22:74:
4b:38:79:e7:d9:26:18:01:4a:2a:0c:77:d3:ac:4b:5b:e9:7b:
93:f6:69:43:c2:48:d2:15:60:13:0d:31:b1:5d:3c:c4:c5:08:
45:d5:89:b7:e2:46:9a:ef:01:3a:42:96:32:aa:21:6f:4b:d7:
13:69:5a:dc:12:42:9c:77:eb:b6:0a:47:7f:03:49:12:c9:32:
e7:81:85:54:8a:8d:04:6d:bf:c0:9f:6a:e0:ba:7d:0b:33:dd:
fe:45:0c:3e:fc:6a:4c:4b:4e:44:49:43:b7:e6:cc:12:f6:af:
94:d2:6a:cb:d9:d5:26:4e:a3:be:3f:5e:45:cc:ea:17:54:94:
5c:4a:f7:d8:2b:15:51:1a:9c:9e:ac:8d:fc:30:1e:e2:98:f9:
00:cf:f3:15:3a:7e:be:6c:1a:14:1f:be:a4:a2:9d:d8:09:d0:
49:10:b0:9a:46:4b:ae:da:59:63:3d:52:56:47:9f:d2:4d:62:
e5:af:27:59:e8:52:64:24:fd:dc:6b:58:23:70:a5:46:a0:8d:
bb:db:65:2b:49:58:df:85:35:4f:c9:f5:51:06:bf:70:09:b3:
81:a7:72:12

the ca.pem

Certificate:
Data:
Version: 3 (0x2)
Serial Number: 0 (0x0)
Signature Algorithm: sha256WithRSAEncryption
Issuer: CN = kube-ca
Validity
Not Before: Jan 13 13:00:00 2019 GMT
Not After : Jan 10 13:00:00 2029 GMT
Subject: CN = kube-ca
Subject Public Key Info:
Public Key Algorithm: rsaEncryption
RSA Public-Key: (2048 bit)
Modulus:
00:eb:d1:3e:ef:3c:21:60:25:02:47:46:d6:bd:41:
44:93:da:3e:b5:1c:fc:d0:c5:d8:f4:fb:05:05:c9:
7b:95:9c:93:ea:4f:5d:a0:e9:61:58:b1:49:fb:09:
d7:0b:1c:84:4f:cc:e1:ab:96:dc:76:bb:12:3f:5a:
e8:d3:46:c6:95:a2:c9:42:2c:a9:37:b2:1d:44:79:
66:f2:73:ce:6b:0a:23:f1:0b:da:85:72:69:bd:23:
8d:f7:a9:17:bc:cd:ee:81:2f:44:be:20:ed:67:6f:
1c:86:5a:e4:8a:96:be:c1:9c:ed:76:2d:88:3c:e3:
3c:a4:67:96:4f:9a:e5:93:a0:f6:89:a3:ee:d6:79:
5f:f7:28:95:b2:29:93:fe:e3:35:8e:ed:54:49:a7:
20:da:66:89:16:ac:a6:02:d5:83:75:9f:6b:7e:11:
0d:d4:3e:0b:f6:2b:ff:1d:64:2f:a6:a3:44:50:77:
eb:24:1f:ff:e9:63:34:ac:77:d8:e2:f6:c5:c7:15:
c2:0f:32:9f:0a:f2:df:5f:87:58:4e:a1:90:10:35:
a4:fd:76:98:49:e6:f7:f9:e5:bf:a8:0c:32:38:cb:
4c:08:0c:7c:fa:2b:50:19:22:cc:b6:96:3b:dc:37:
70:a4:92:2b:a9:93:fc:06:b0:9d:88:60:17:2a:76:
76:db
Exponent: 65537 (0x10001)
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment, Certificate Sign
X509v3 Basic Constraints: critical
CA:TRUE
Signature Algorithm: sha256WithRSAEncryption
92:56:d0:0b:5d:29:81:94:fe:b0:31:8f:45:84:e3:2c:d0:6c:
82:34:f0:67:ad:98:9a:08:8a:da:66:67:b6:24:07:c0:6e:13:
e8:bf:1f:32:ab:25:39:0a:65:c9:a8:4b:2a:bd:d9:6a:01:79:
96:fa:98:33:f0:ba:4b:b4:64:53:ff:24:98:31:e1:99:71:51:
9c:9a:35:25:80:fb:59:01:05:ce:dd:bc:b8:b7:cf:6c:f4:f7:
43:13:b1:13:92:16:b8:c0:9a:c2:ad:62:01:21:bd:9b:81:44:
2d:91:ee:8f:dc:a6:b0:9a:d3:5d:47:e2:d9:ba:00:28:18:bc:
2b:c0:d3:08:11:4c:c4:d5:49:6a:26:25:ca:e3:3f:bc:17:a4:
67:a7:26:64:43:98:e9:d4:71:27:de:77:75:00:f7:46:6c:87:
9e:c3:67:ad:4f:f0:88:1a:a4:91:56:ad:ca:7e:fe:1c:44:fb:
42:7e:cb:21:22:04:0d:fd:08:28:cc:29:9c:66:0e:21:4f:55:
53:93:a8:cf:f8:02:cf:67:ed:0e:2b:71:59:98:b6:c9:b1:28:
d1:37:50:42:a0:41:16:99:85:47:15:d5:10:49:a5:8e:b1:34:
9c:98:85:9b:b2:8b:4a:da:c8:77:18:45:c7:da:a8:77:6f:c4:
fd:ba:eb:ab

root@slamdev-master-t17-20:~# ls -la /etc/kubernetes/ssl
total 112
drwx------ 2 root root 4096 Jul 25 09:10 .
drwxr-xr-x 3 root root 4096 Jul 25 12:41 ..
-rw-r--r-- 1 root root 1679 Jul 25 09:10 kube-apiserver-key.pem
-rw-r--r-- 1 root root 1675 Jul 25 09:10 kube-apiserver-proxy-client-key.pem
-rw-r--r-- 1 root root 1107 Jul 25 09:10 kube-apiserver-proxy-client.pem
-rw-r--r-- 1 root root 1679 Jul 25 09:10 kube-apiserver-requestheader-ca-key.pem
-rw-r--r-- 1 root root 1082 Jul 25 09:10 kube-apiserver-requestheader-ca.pem
-rw-r--r-- 1 root root 1273 Jul 25 09:10 kube-apiserver.pem
-rw-r--r-- 1 root root 1679 Jul 25 09:10 kube-ca-key.pem
-rw------- 1 root root 1017 May 31 04:36 kube-ca.pem
-rw-r--r-- 1 root root 1675 Jul 25 09:10 kube-controller-manager-key.pem
-rw-r--r-- 1 root root 1062 Jul 25 09:10 kube-controller-manager.pem
-rw-r--r-- 1 root root 1675 Jul 25 09:10 kube-etcd-10-3-1-20-key.pem
-rw-r--r-- 1 root root 1269 Jul 25 09:10 kube-etcd-10-3-1-20.pem
-rw------- 1 root root 1675 May 31 04:36 kube-node-key.pem
-rw------- 1 root root 1070 May 31 04:36 kube-node.pem
-rw------- 1 root root 1679 May 31 04:36 kube-proxy-key.pem
-rw------- 1 root root 1046 May 31 04:36 kube-proxy.pem
-rw-r--r-- 1 root root 1675 Jul 25 09:10 kube-scheduler-key.pem
-rw-r--r-- 1 root root 1050 Jul 25 09:10 kube-scheduler.pem
-rw-r--r-- 1 root root 1675 Jul 25 09:10 kube-service-account-token-key.pem
-rw-r--r-- 1 root root 1058 Jul 25 09:10 kube-service-account-token.pem
-rw-r--r-- 1 root root 517 Jul 25 09:10 kubecfg-kube-apiserver-proxy-client.yaml
-rw-r--r-- 1 root root 533 Jul 25 09:10 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw-r--r-- 1 root root 501 Jul 25 09:10 kubecfg-kube-controller-manager.yaml
-rw------- 1 root root 444 May 31 04:36 kubecfg-kube-node.yaml
-rw------- 1 root root 448 May 31 04:36 kubecfg-kube-proxy.yaml
-rw-r--r-- 1 root root 465 Jul 25 09:10 kubecfg-kube-scheduler.yaml

@superseb
Contributor

The timestamps of the certificates don't match. What is the output of ls -la /etc/kubernetes/ssl before you run the docker run command? This at least explains why you get the etcd error. Are you saying that this node has nothing in /etc/kubernetes/ssl and then gets all those files, with different modification times, placed on the node?
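
One way to confirm this on the node, sketched with the paths from this thread, is to snapshot the directory before registering and compare it after the failure:

ls -la /etc/kubernetes/ssl > /tmp/ssl-before.txt 2>&1   # before running the docker run command
# ...run the registration command and wait for the etcd error...
diff /tmp/ssl-before.txt <(ls -la /etc/kubernetes/ssl)   # pre-existing files keep their old timestamps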

@superxor
Author

@superseb You're right, our base image was polluted for unknown reasons. I forgot to check that.
Mea culpa

@Robbie-Cook

Deleting my /etc/kubernetes directory and running the docker run command again worked for me
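
Roughly, assuming the same default path (double-check what you delete on a node that hosts anything else):

sudo rm -rf /etc/kubernetes
# then re-run the node registration command from the Rancher UI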

@norcis

norcis commented Jan 18, 2020

/etc/kubernetes

Thank you! I spent all day before I found it!

I believe the Rancher team could improve the provisioning script to detect whether these files already exist and delete them or warn about them.
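
Something as small as this sketch would surface the problem; it is only an illustration, not part of Rancher's actual agent script:

# hypothetical pre-flight check before the agent writes new certificates
if [ -n "$(ls -A /etc/kubernetes/ssl 2>/dev/null)" ]; then
  echo "WARNING: /etc/kubernetes/ssl already contains files, likely from a previous cluster;" >&2
  echo "clean the node (or remove the old certificates) before registering it." >&2
  exit 1
fi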

@Chrislevi

rke cert rotate did the job for me.
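
For clusters managed directly with the RKE CLI (assuming a cluster.yml in the working directory), that looks like:

rke cert rotate --config cluster.yml   # regenerates service certificates signed by the existing kube-ca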

@MoulderUtes

Clearing /etc/kubernetes worked for me, thanks.

@tinohager

Clearing /opt/rke/etc/kubernetes/ssl (rm -r -f) worked for me.

@travis-south

I've encountered this issue on a fresh VM install using v2.5.8. I checked the ssl directory and everything was correct and up to date. I cleaned up all the folders and followed the steps here to clean the nodes, and still no luck.

The only thing that fixed it was to turn off Authorized Cluster Endpoint in the cluster config. I'm a newbie in k8s and I still don't know why that fixed it.
