
Kubernetes 1.11.1 nodes occasionally do not register internal IP address #860

Closed
twittyc opened this issue Aug 14, 2018 · 16 comments

@twittyc

twittyc commented Aug 14, 2018

Rancher versions:
rancher/server or rancher/rancher: 2.0.7
rancher/agent or rancher/rancher-agent: 2.0.6

Docker version: (docker version,docker info preferred)
Server:
Engine:
Version: 18.03.1-ce
API version: 1.37 (minimum version 1.12)
Go version: go1.9.6
Git commit: 9ee9f40
Built: Thu Apr 26 04:27:49 2018
OS/Arch: linux/amd64
Experimental: false

Operating system and kernel: (cat /etc/os-release, uname -r preferred)
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.6.0
VERSION_ID=1800.6.0
BUILD_ID=2018-08-04-0323
PRETTY_NAME="Container Linux by CoreOS 1800.6.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO)
OpenStack

Steps to Reproduce:
Deploy Kubernetes 1.11.1 cluster with RKE using the rke_config.yml

rke config:

addon_job_timeout: 30
authentication: 
  strategy: "x509"
ignore_docker_version: true

cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: {{ openstack_username }}
      password: {{ openstack_password }}
      auth-url: {{ openstack_auth_url }}
      tenant-id: {{ openstack_tenant_id }}
      domain-id: {{ openstack_domain_id }}
    block_storage:
      ignore-volume-az: false
 
ingress: 
  provider: "none"

kubernetes_version: 1.11.1

network: 
  plugin: "canal"

services: 
  etcd: 
    extra_args: 
      heartbeat-interval: 500
      election-timeout: 5000
    snapshot: false
  kubelet:
    extra_args:
      authentication-token-webhook: true
  kube_api: 
    pod_security_policy: false
    extra_args:
      requestheader-client-ca-file: "/etc/kubernetes/ssl/kube-ca.pem"
      requestheader-extra-headers-prefix: "X-Remote-Extra-"
      requestheader-group-headers: "X-Remote-Group"
      requestheader-username-headers: "X-Remote-User"
      proxy-client-cert-file: "/etc/kubernetes/ssl/kube-proxy.pem"
      proxy-client-key-file: "/etc/kubernetes/ssl/kube-proxy-key.pem"

ssh_agent_auth: false
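
For reference, a minimal sketch of how the cluster is brought up from the config above (it assumes the config is saved as rke_config.yml and that the kubeconfig file name follows RKE's default naming):

$ rke up --config rke_config.yml
$ kubectl --kubeconfig kube_config_rke_config.yml get nodes -o wide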

Results

~ $ kubectl get nodes -o wide                                                                                                                                 4339ms  Tue Aug 14 09:57:37 2018
NAME                                     STATUS    ROLES               AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0   Ready     controlplane,etcd   3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1   Ready     controlplane,etcd   3d        v1.11.1   10.144.6.137    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2   Ready     controlplane,etcd   3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2   Ready     worker              3d        v1.11.1   10.144.2.141    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0   Ready     worker              3d        v1.11.1   10.144.6.142    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2   Ready     worker              3d        v1.11.1   10.144.6.145    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0   Ready     worker              3d        v1.11.1   10.144.10.137   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1   Ready     worker              3d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2   Ready     worker              3d        v1.11.1   10.144.10.148   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

A restart of the kubelet container on the affected nodes resolves this issue.
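
For anyone hitting this, a quick way to spot affected nodes and apply that workaround (a sketch; it assumes the RKE-managed kubelet runs as a Docker container literally named kubelet, which is RKE's default):

# list each node together with its InternalIP, if one is registered
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# on each node that prints no address, restart the kubelet container so it re-registers its addresses
$ docker restart kubelet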

@HighwayofLife
Contributor

I just saw this in our 1.11 cluster yesterday as well. I haven't had a chance to investigate yet. It's causing the Key Vault FlexVolume to fail.

@superseb
Contributor

If a restart of the kubelet fixes this, can you provide the kubelet logs, @twittyc?

@twittyc
Author

twittyc commented Aug 14, 2018

I've added the full log file from one of the affected nodes to this gist:
https://gist.github.com/twittyc/1878c60d78979e92acb87bb5997b2777

@twittyc
Author

twittyc commented Aug 14, 2018

At some point all of our nodes in the cluster lost their internal IPs (the two with IPs had their kubelet containers restarted):

NAME                                     STATUS    ROLES               AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
k8s-corp-prod-0-master-us-corp-kc-8a-0   Ready     controlplane,etcd   4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8b-1   Ready     controlplane,etcd   4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-master-us-corp-kc-8c-2   Ready     controlplane,etcd   4d        v1.11.1   10.144.10.134   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8a-2   Ready     worker              4d        v1.11.1   10.144.2.141    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8b-2   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-0   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-1   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
k8s-corp-prod-0-worker-us-corp-kc-8c-2   Ready     worker              4d        v1.11.1   <none>          <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

Here are the kubelet logs from a node that regained its internal IP after a restart of the kubelet container:
https://gist.github.com/twittyc/40ea96e29c51a1c6018d2a045ae537c2

@moelsayed
Contributor

I am unable to reproduce this on a bare-metal setup:

$ kubectl get nodes -o wide
NAME      STATUS    ROLES                      AGE       VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                        KERNEL-VERSION      CONTAINER-RUNTIME
node-1    Ready     controlplane,etcd,worker   1d        v1.11.1   172.31.18.219   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-2    Ready     worker                     1d        v1.11.1   172.31.20.39    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-3    Ready     worker                     1d        v1.11.1   172.31.22.143   <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1
node-4    Ready     worker                     1d        v1.11.1   172.31.29.29    <none>        Container Linux by CoreOS 1800.6.0 (Rhyolite)   4.14.59-coreos-r2   docker://18.3.1

I will try with a cloud provider and see how it goes.

@stieler-it

stieler-it commented Sep 4, 2018

I think this is more an issue with Kubernetes and the OpenStack cloud provider, possibly combined with an unstable OpenStack API endpoint. Please also see my issue, which contains some log excerpts: kubernetes/cloud-provider-openstack#280

@HighwayofLife
Contributor

This isn't specific to the OpenStack cloud provider. We're using Azure and encountered the lost internal IPs as well.

@galal-hussein
Contributor

This problem is reported upstream in kubernetes/kubernetes#68270: the kubelet fails to get the node's address from the cloud provider and then fails to update the node status. I will keep this issue open until the upstream Kubernetes issue is resolved.
I can see the following log entries in @twittyc's gist:

cat kubelet-log.json |  grep -v "Volume not attached" | grep "node status" 
{"log":"E0813 20:38:56.118941   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T20:38:56.121316062Z"}
{"log":"E0813 23:50:34.361091   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-13T23:50:34.361316592Z"}
{"log":"W0814 01:13:16.823637   18989 kubelet_node_status.go:1114] Failed to set some node status fields: failed to get node address from cloud provider: Timeout after 10s\n","stream":"stderr","time":"2018-08-14T01:13:16.823960043Z"}
{"log":"E0814 07:01:54.730030   18989 kubelet_node_status.go:391] Error updating node status, will retry: error getting node \"k8s-corp-prod-0-worker-us-corp-kc-8b-1\": Get https://127.0.0.1:6443/api/v1/nodes/k8s-corp-prod-0-worker-us-corp-kc-8b-1?resourceVersion=0\u0026timeout=10s: unexpected EOF\n","stream":"stderr","time":"2018-08-14T07:01:54.731184923Z"}

@galal-hussein galal-hussein removed this from the v0.1.10 milestone Sep 6, 2018
@alena1108 alena1108 added this to the v0.1.11 milestone Sep 21, 2018
@alena1108

Looks like the issue has been fixed in k8s 1.12, and there are currently requests to backport it to 1.11, but no confirmation yet that it will be backported.

@alena1108

The issue has been fixed in k8s v1.11.6; it will be resolved here once we make v1.11.6 available to rke and rancher.

@stieler-it

We upgraded to 1.11.5 a few days ago and the bug where nodes lose their internal IP has not occurred again.

@alena1108

alena1108 commented Dec 19, 2018

@stieler-it the fix is not part of 1.11.5; it is part of k8s v1.11.6. So I wonder whether the fact that the bug hasn't occurred since the upgrade to v1.11.5 on your setup is coincidental.
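
A quick way to confirm which kubelet version each node is actually running after an upgrade (a sketch using plain kubectl):

$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'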

@stieler-it

@alena1108 Ok, good to know. However, the bug used to appear pretty often (every few hours at least) and now hasn't appeared for about 6 days. Maybe something else mitigated the problem, or it is just luck so far. We'll see.

@sangeethah sangeethah assigned jiaqiluo and unassigned sangeethah Jan 4, 2019
@jiaqiluo
Member

jiaqiluo commented Jan 4, 2019

Validated # 1 on master 1/4
Steps:

  • Ran Rancher server version master 1/4 (single Install)
  • Ran test scripts on this cluster:
    • Ubuntu 16.04 docker 17.03.2-ce, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
  • Tests cover the following areas:
    • workload
    • dns
    • rbac
    • communication
    • ingress
    • secret
    • registry
    • service discovery

Validated # 2 on v2.1.5-rc3, which is derived from master 1/2
Steps:

  • Ran Rancher server version v2.1.5-rc3 (single Install)
  • Ran test scripts on three clusters:
    • Ubuntu 16.04 docker 17.03.2-ce, k8s 1.11.6
    • RancherOS 1.4.2 docker 17.03.2-ce, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
    • Rhel 7.5, native docker 1.13 with selinux enabled, k8s 1.11.6, 1 control plane, 1 etcd, and 3 workers
  • Tests cover the following areas:
    • workload
    • dns
    • rbac
    • communication
    • ingress
    • secret
    • registry
    • service discovery

@jiaqiluo jiaqiluo closed this as completed Jan 4, 2019
@mrmason

mrmason commented Jan 8, 2019

Is it possible to get this fixed in the 1.6 branch too? The Kubernetes version there is k8s:v1.11.5-rancher1-1, which also has this problem.

@alena1108

@mrmason we are going to address it there as well; here is the corresponding issue: rancher/rancher#14600
