
Kubernetes 1.11.1 nodes occasionally do not register internal IP address #860

Closed
twittyc opened this issue Aug 14, 2018 · 16 comments

@twittyc

twittyc commented Aug 14, 2018

Rancher versions:
rancher/server or rancher/rancher: 2.0.7
rancher/agent or rancher/rancher-agent: 2.0.6

Docker version: (docker version,docker info preferred)
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.6
Git commit: 9ee9f40
Built: Thu Apr 26 04:27:49 2018
OS/Arch: linux/amd64
Experimental: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack

Steps to Reproduce:
Deploy Kubernetes 1.11.1 cluster with RKE using the rke_config.yml

rke config:

addon_job_timeout: 30
authentication: 
  strategy: "x509"
ignore_docker_version: true

cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: {{ openstack_username }}
      password: {{ openstack_password }}
      auth-url: {{ openstack_auth_url }}
      tenant-id: {{ openstack_tenant_id }}
      domain-id: {{ openstack_domain_id }}
    block_storage:
      ignore-volume-az: false
 
ingress: 
  provider: "none"

kubernetes_version: 1.11.1

network: 
  plugin: "canal"

services: 
  etcd: 
    extra_args: 
      heartbeat-interval: 500
      election-timeout: 5000
    snapshot: false
  kubelet:
    extra_args:
      authentication-token-webhook: true
  kube_api: 
    pod_security_policy: false
    extra_args:
      requestheader-client-ca-file: "/etc/kubernetes/ssl/kube-ca.pem"
      requestheader-extra-headers-prefix: "X-Remote-Extra-"
      requestheader-group-headers: "X-Remote-Group"
      requestheader-username-headers: "X-Remote-User"
      proxy-client-cert-file: "/etc/kubernetes/ssl/kube-proxy.pem"
      proxy-client-key-file: "/etc/kubernetes/ssl/kube-proxy-key.pem"

ssh_agent_auth: false
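
For reference, a minimal sketch of how the cluster is brought up from the config above (it assumes the config is saved as rke_config.yml and that the kubeconfig file name follows RKE's default naming):

$ rke up --config rke_config.yml
$ kubectl --kubeconfig kube_config_rke_config.yml get nodes -o wide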

Results

~ $ kubectl get nodes -o wide                                                                                                                                 4339ms  Tue Aug 14 09:57:37 2018
NAME                                     STATUS    ROLES               AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0   Ready     controlplane,etcd   3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1   Ready     controlplane,etcd   3d        v1.11.1   10.144.6.137    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2   Ready     controlplane,etcd   3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2   Ready     worker              3d        v1.11.1   10.144.2.141    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0   Ready     worker              3d        v1.11.1   10.144.6.142    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2   Ready     worker              3d        v1.11.1   10.144.6.145    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0   Ready     worker              3d        v1.11.1   10.144.10.137   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2   Ready     worker              3d        v1.11.1   10.144.10.148   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

A restart of the kubelet container on the affected nodes resolves this issue.
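
For anyone hitting this, a quick way to spot affected nodes and apply that workaround (a sketch; it assumes the RKE-managed kubelet runs as a Docker container literally named kubelet, which is RKE's default):

# list each node together with its InternalIP, if one is registered
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# on each node that prints no address, restart the kubelet container so it re-registers its addresses
$ docker restart kubelet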

@HighwayofLife
Contributor

I just saw this in our 1.11 cluster yesterday as well. I haven't had a chance to investigate yet. It's causing the Key Vault FlexVolume to fail.

@superseb
Contributor

If a restart of the kubelet fixes this, can you provide the kubelet logs, @twittyc?

@twittyc
Author

twittyc commented Aug 14, 2018

I've added the full log file from one of the affected nodes to this gist:
https://gist.github.com/twittyc/1878c60d78979e92acb87bb5997b2777

@twittyc
Author

twittyc commented Aug 14, 2018

At some point all of our nodes in the cluster lost their internal IPs (the two with IPs had their kubelet containers restarted):

NAME                                     STATUS    ROLES               AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0   Ready     controlplane,etcd   4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1   Ready     controlplane,etcd   4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2   Ready     controlplane,etcd   4d        v1.11.1   10.144.10.134   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2   Ready     worker              4d        v1.11.1   10.144.2.141    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

Here are the kubelet logs from a node that regained its internal IP after a restart of the kubelet container:
https://gist.github.com/twittyc/40ea96e29c51a1c6018d2a045ae537c2

@moelsayed
Contributor

I am unable to reproduce this on a bare-metal setup:

$ kubectl get nodes -o wide
NAME      STATUS    ROLES                      AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
node-1    Ready     controlplane,etcd,worker   1d        v1.11.1   172.31.18.219   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-2    Ready     worker                     1d        v1.11.1   172.31.20.39    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-3    Ready     worker                     1d        v1.11.1   172.31.22.143   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-4    Ready     worker                     1d        v1.11.1   172.31.29.29    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

I will try with a cloud provider and see how it goes.

@stieler-it

stieler-it commented Sep 4, 2018

I think this is more an issue with Kubernetes and the OpenStack cloud provider, possibly combined with an unstable OpenStack API endpoint. Please also see my issue, which contains some log excerpts: kubernetes/cloud-provider-openstack#280

@HighwayofLife
Contributor

This isn't specific to the OpenStack cloud provider. We're using Azure and encountered the lost internal IPs as well.

@galal-hussein
Contributor

This problem is reported upstream in kubernetes/kubernetes#68270: the kubelet fails to get the node's address from the cloud provider and then fails to update the node status. I will keep this issue open until the upstream Kubernetes issue is resolved.
I can see the following log entries in @twittyc's gist:

cat kubelet-log.json |  grep -v "Volume not attached" | grep "node status" 
{"log":"E0813 20:38:56.118941   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T20:38:56.121316062Z"}
{"log":"E0813 23:50:34.361091   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T23:50:34.361316592Z"}
{"log":"W0814 01:13:16.823637   18989 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s\n","stream":"stderr","time":"2018-08-14T01:13:16.823960043Z"}
{"log":"E0814 07:01:54.730030   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-14T07:01:54.731184923Z"}

@galal-hussein galal-hussein removed this from the v0.1.10 milestone Sep 6, 2018
@alena1108 alena1108 added this to the v0.1.11 milestone Sep 21, 2018
@alena1108

Looks like the issue has been fixed in k8s 1.12, and there are currently requests to backport it to 1.11, but no confirmation yet that it will be backported.

@alena1108

The issue has been fixed in k8s v1.11.6; it will be resolved here once we make v1.11.6 available to rke and rancher.

@stieler-it

We upgraded to 1.11.5 a few days ago and the bug where nodes lose their internal IP has not occurred again.

@alena1108

alena1108 commented Dec 19, 2018

@stieler-it the fix is not part of 1.11.5; it is part of k8s v1.11.6. So I wonder whether the fact that the bug hasn't occurred since the upgrade to v1.11.5 on your setup is coincidental.
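
A quick way to confirm which kubelet version each node is actually running after an upgrade (a sketch using plain kubectl):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'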

@stieler-it

@alena1108 Ok, good to know. However, the bug used to appear pretty often (every few hours at least) and now hasn't appeared for about 6 days. Maybe something else mitigated the problem, or it is just luck so far. We'll see.

@sangeethah sangeethah assigned jiaqiluo and unassigned sangeethah Jan 4, 2019
@jiaqiluo
Member

jiaqiluo commented Jan 4, 2019

Validated # 1 on master 1/4
Steps:

  • Ran Rancher server version master 1/4 (single Install)
  • Ran test scripts on this cluster:
    • Ubuntu 16.04 docker 17.03.2-ce, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
  • Tests cover the following areas:
    • workload
    • dns
    • rbac
    • communication
    • ingress
    • secret
    • registry
    • service discovery

Validated # 2 on v2.1.5-rc3, which is derived from master 1/2
Steps:

  • Ran Rancher server version v2.1.5-rc3 (single Install)
  • Ran test scripts on three clusters:
    • Ubuntu 16.04 docker 17.03.2-ce, k8s 1.11.6
    • RancherOS 1.4.2 docker 17.03.2-ce, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
    • Rhel 7.5, native docker 1.13 with selinux enabled, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
  • Tests cover the following areas:
    • workload
    • dns
    • rbac
    • communication
    • ingress
    • secret
    • registry
    • service discovery

@jiaqiluo jiaqiluo closed this as completed Jan 4, 2019
@mrmason

mrmason commented Jan 8, 2019

Is it possible to get this fixed in the 1.6 branch too? The Kubernetes version there is k8s:v1.11.5-rancher1-1, which also has this problem.

@alena1108

@mrmason we are going to address it there as well; here is the corresponding issue: rancher/rancher#14600
