
CNI (installed by RKE) CrashLoopBackOff on Ubuntu 22.04 nodes #3114

Closed
quadeare opened this issue Dec 1, 2022 · 11 comments

Comments


quadeare commented Dec 1, 2022

RKE version: 1.4.1 (same with 1.3.16)

Docker version: (docker version, docker info preferred)

Client: Docker Engine - Community
 Version:           20.10.21
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        baeda1f
 Built:             Tue Oct 25 18:01:58 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 17:59:49 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.10
  GitCommit:        770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


Linux chell 5.15.0-53-generic #59-Ubuntu SMP Mon Oct 17 18:53:30 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO): Bare-metal and Scaleway VM (tested on both)

cluster.yml file:

nodes:

    - address: **Baremetal host on 20.04**
      internal_address: 10.10.10.xx
      user: **
      role:
        - controlplane
        - etcd
        - worker
      port: 22

    - address: **VM host on 20.04** ==> OK, no issue
      internal_address: 10.10.10.xx
      user: **
      role:
        - worker
      port: 22

    - address: **VM host on 22.04** ==> **NOK**, issue with CNI
      internal_address: 10.10.10.xx
      user: **
      role:
        - worker
      port: 22

    - address: **Baremetal on 22.04** ==> **NOK**, issue with CNI
      internal_address: 10.10.10.xx
      user: **
      role:
        - worker
      port: 22

services:
  etcd:
    backup_config:
      interval_hours: 24
      retention: 12
      s3backupconfig:
        access_key: xxxx
        secret_key: xxx
        bucket_name: xxx
        region: "fr-par"
        folder: "etcd"
        endpoint: xxxx

  kubelet:
    extra_args:
      container-runtime: remote
      container-runtime-endpoint: 'unix:///run/containerd/containerd.sock'
      max-pods: 250
    extra_binds:
      - '/var/gitpod:/var/gitpod'
      - '/var/lib/containerd:/var/lib/containerd'


# Disable docker-cri
enable_cri_dockerd: false

# If set to true, RKE will not fail when unsupported Docker versions
# are found
ignore_docker_version: false

# Enable use of SSH agent to use SSH private keys with passphrase
# This requires the environment variable `SSH_AUTH_SOCK` configured, pointing
# to your SSH agent which has the private key added
ssh_agent_auth: true

# Set the name of the Kubernetes cluster  
cluster_name: xx-cluster

# Currently, the only authentication strategy supported is x509.
# You can optionally create additional SANs (hostnames or IPs) to
# add to the API server PKI certificate.
# This is useful if you want to use a load balancer for the
# control plane servers.
authentication:
    strategy: x509
    sans:
      - "rancher.xx.xx"
      - "rancher.xx.yy"


kubernetes_version: v1.24.6-rancher1-1

network:
  plugin: flannel
  options:
      flannel_iface: ztrta4f6bp

Steps to Reproduce: Install the RKE suite with the CNI of your choice.

Results:

The CNI pods go into CrashLoopBackOff ONLY on 22.04 nodes, without any error logs.

kube-system pod list:

> k get po -o wide
NAME                                       READY   STATUS      RESTARTS      AGE   IP             NODE              NOMINATED NODE   READINESS GATES
calico-kube-controllers-6c977d77bc-27l2j   1/1     Running     0             12h   10.42.2.114    20.04-NODE     <none>           <none>
coredns-64b95f5657-d7zn5                   1/1     Running     0             12h   10.42.2.151    20.04-NODE     <none>           <none>
coredns-autoscaler-d76d8479-wlqlx          1/1     Running     0             12h   10.42.2.130    20.04-NODE     <none>           <none>
kube-flannel-fnjfx                         2/2     Running     0             12h   10.10.10.4     20.04-NODE     <none>           <none>
kube-flannel-k8w95                         1/2     CrashLoopBackOff   6 (56s ago)   3m39s   10.10.10.221   22.04-NODE   <none>           <none>
metrics-server-9c47f6996-2p8xz             1/1     Running     0             12h   10.42.2.147    20.04-NODE     <none>           <none>
rke-coredns-addon-deploy-job-7qrzn         0/1     Completed   0             12h   10.10.10.4     20.04-NODE     <none>           <none>
rke-ingress-controller-deploy-job-vrt6g    0/1     Completed   0             12h   10.10.10.4     20.04-NODE     <none>           <none>
rke-metrics-addon-deploy-job-m9m6j         0/1     Completed   0             12h   10.10.10.4     20.04-NODE     <none>           <none>
rke-network-plugin-deploy-job-5vj6q        0/1     Completed   0             12h   10.10.10.4     20.04-NODE     <none>           <none>

CNI pod describe:

k describe po kube-flannel-k8w95
Name:                 kube-flannel-k8w95
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      flannel
Node:                 xx.xx.xx.xx/10.10.10.221
Start Time:           Thu, 01 Dec 2022 14:22:46 +0100
Labels:               controller-revision-hash=85bb99c5bc
                      k8s-app=flannel
                      pod-template-generation=1
                      tier=node
Annotations:          <none>
Status:               Running
IP:                   10.10.10.221
IPs:
  IP:           10.10.10.221
Controlled By:  DaemonSet/kube-flannel
Containers:
  install-cni:
    Container ID:  containerd://c30f417aa2ad42df1ecfa132e5b3a0287c6cee18b5c7654ebe2765c8186d5694
    Image:         rancher/flannel-cni:v0.3.0-rancher6
    Image ID:      docker.io/rancher/flannel-cni@sha256:23d921611903f6332cef666033924e0d92370548637d497549fe1121eb370feb
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 01 Dec 2022 14:25:42 +0100
      Finished:     Thu, 01 Dec 2022 14:27:08 +0100
    Ready:          False
    Restart Count:  3
    Environment:
      CNI_NETWORK_CONFIG:  <set to the key 'cni-conf.json' of config map 'kube-flannel-cfg'>  Optional: false
      CNI_CONF_NAME:       10-flannel.conflist
    Mounts:
      /host/etc/cni/net.d from cni (rw)
      /host/opt/cni/bin/ from host-cni-bin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kkq92 (ro)
  kube-flannel:
    Container ID:  containerd://012cd8316cfc1d487a41657d2df59f75075b1e70a5a4bad58b3a987d382e61f7
    Image:         rancher/mirrored-coreos-flannel:v0.15.1
    Image ID:      docker.io/rancher/mirrored-coreos-flannel@sha256:162f82315dbe939e457697281a8ef04d469edd52bb384a1405f78468ed6fe323
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/bin/flanneld
    Args:
      --ip-masq
      --kube-subnet-mgr
      --iface=ztrta4f6bp
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 01 Dec 2022 14:25:16 +0100
      Finished:     Thu, 01 Dec 2022 14:25:28 +0100
    Ready:          False
    Restart Count:  3
    Limits:
      cpu:     100m
      memory:  50Mi
    Requests:
      cpu:     100m
      memory:  50Mi
    Environment:
      POD_NAME:       kube-flannel-k8w95 (v1:metadata.name)
      POD_NAMESPACE:  kube-system (v1:metadata.namespace)
    Mounts:
      /etc/cni/net.d from cni (rw)
      /etc/kube-flannel/ from flannel-cfg (rw)
      /run from run (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kkq92 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:  
  cni:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  flannel-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-flannel-cfg
    Optional:  false
  host-cni-bin:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  kube-api-access-kkq92:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason          Age                    From               Message
  ----     ------          ----                   ----               -------
  Normal   Scheduled       4m43s                  default-scheduler  Successfully assigned kube-system/kube-flannel-k8w95 to xx.xx.xx.xx
  Normal   Pulling         4m40s                  kubelet            Pulling image "rancher/flannel-cni:v0.3.0-rancher6"
  Normal   Pulled          4m25s                  kubelet            Successfully pulled image "rancher/flannel-cni:v0.3.0-rancher6" in 14.576588607s
  Normal   Pulling         4m24s                  kubelet            Pulling image "rancher/mirrored-coreos-flannel:v0.15.1"
  Normal   Pulled          4m17s                  kubelet            Successfully pulled image "rancher/mirrored-coreos-flannel:v0.15.1" in 7.540987373s
  Normal   Created         3m45s (x2 over 4m25s)  kubelet            Created container install-cni
  Normal   Started         3m45s (x2 over 4m25s)  kubelet            Started container install-cni
  Normal   Pulled          3m45s                  kubelet            Container image "rancher/flannel-cni:v0.3.0-rancher6" already present on machine
  Normal   Killing         3m43s (x2 over 4m16s)  kubelet            Stopping container install-cni
  Normal   SandboxChanged  3m8s (x2 over 3m45s)   kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Created         3m5s (x3 over 4m17s)   kubelet            Created container kube-flannel
  Normal   Started         3m5s (x3 over 4m17s)   kubelet            Started container kube-flannel
  Normal   Killing         3m5s (x3 over 4m16s)   kubelet            Stopping container kube-flannel
  Normal   Pulled          3m5s (x2 over 3m45s)   kubelet            Container image "rancher/mirrored-coreos-flannel:v0.15.1" already present on machine
  Warning  BackOff         3m5s                   kubelet            Back-off restarting failed container

Flannel logs:

> k logs -f kube-flannel-k8w95 -c kube-flannel
I1201 13:28:15.160984       1 main.go:217] CLI flags config: {etcdEndpoints:http://127.0.0.1:4001,http://127.0.0.1:2379 etcdPrefix:/coreos.com/network etcdKeyfile: etcdCertfile: etcdCAFile: etcdUsername: etcdPassword: help:false version:false autoDetectIPv4:false autoDetectIPv6:false kubeSubnetMgr:true kubeApiUrl: kubeAnnotationPrefix:flannel.alpha.coreos.com kubeConfigFile: iface:[ztrta4f6bp] ifaceRegex:[] ipMasq:true subnetFile:/run/flannel/subnet.env subnetDir: publicIP: publicIPv6: subnetLeaseRenewMargin:60 healthzIP:0.0.0.0 healthzPort:0 charonExecutablePath: charonViciUri: iptablesResyncSeconds:5 iptablesForwardRules:true netConfPath:/etc/kube-flannel/net-conf.json setNodeNetworkUnavailable:true}
W1201 13:28:15.161159       1 client_config.go:608] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I1201 13:28:15.245174       1 kube.go:120] Waiting 10m0s for node controller to sync
I1201 13:28:15.360860       1 kube.go:378] Starting kube subnet manager
I1201 13:28:16.361051       1 kube.go:127] Node controller sync successful
I1201 13:28:16.361266       1 main.go:237] Created subnet manager: Kubernetes Subnet Manager - xx.xx.xx.xx
I1201 13:28:16.361369       1 main.go:240] Installing signal handlers
I1201 13:28:16.361683       1 main.go:459] Found network config - Backend type: vxlan
I1201 13:28:16.362468       1 main.go:698] Using interface with name ztrta4f6bp and address 10.10.10.221
I1201 13:28:16.362615       1 main.go:720] Defaulting external address to interface address (10.10.10.221)
I1201 13:28:16.362702       1 main.go:733] Defaulting external v6 address to interface address (<nil>)
I1201 13:28:16.362896       1 vxlan.go:137] VXLAN config: VNI=1 Port=8472 GBP=false Learning=false DirectRouting=false
I1201 13:28:16.363434       1 kube.go:339] Setting NodeNetworkUnavailable
I1201 13:28:16.385674       1 main.go:340] Setting up masking rules
I1201 13:28:16.747422       1 main.go:361] Changing default FORWARD chain policy to ACCEPT
I1201 13:28:16.805918       1 main.go:374] Wrote subnet file to /run/flannel/subnet.env
I1201 13:28:16.805929       1 main.go:378] Running backend.
I1201 13:28:16.805937       1 main.go:396] Waiting for all goroutines to exit
I1201 13:28:16.805971       1 vxlan_network.go:60] watching for new subnet leases
I1201 13:29:33.596767       1 main.go:443] shutdownHandler sent cancel signal...
I1201 13:29:33.596819       1 watch.go:39] context canceled, close receiver chan
I1201 13:29:33.596826       1 vxlan_network.go:75] evts chan closed
I1201 13:29:33.596833       1 main.go:399] Exiting cleanly...
@immanuelfodor

How is it possible to run k8s v1.24.X with enable_cri_dockerd: false? Wasn't Docker (dockershim) support removed in v1.24, so the cri flag should always be true then?


jroose commented Jan 8, 2023

Did you find a solution to this? I'm experiencing something similar on Kubernetes 1.25.2-00 with containerd 1.5.9 and 1.6.14, but I'm installing manually, rather than with Rancher.


jroose commented Jan 8, 2023

It turned out that, for me, flannel was being restarted because kube-proxy was restarting frequently. I finally determined that the root cause was that I needed to set SystemdCgroup in /etc/containerd/config.toml. This fixed it for me:

mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml # Reverts the containerd config to the default configuration
sed -i 's/SystemdCgroup \= false/SystemdCgroup \= true/g' /etc/containerd/config.toml
systemctl restart containerd

I'm not sure why that's not set by default on Ubuntu 22.04 (at least it's not on Arm64). Maybe it's fixed in more recent versions of containerd, but containerd doesn't seem to regenerate /etc/containerd, let alone the config.toml on reinstall.
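
A quick way to sanity-check the result (paths are the stock containerd defaults; adjust if yours differ):

grep -n 'SystemdCgroup' /etc/containerd/config.toml   # should now read SystemdCgroup = true
stat -fc %T /sys/fs/cgroup/                           # prints "cgroup2fs" on Ubuntu 22.04, which boots with cgroup v2

The systemd cgroup driver matters here because Ubuntu 22.04 defaults to cgroup v2, and a mismatch between runc's cgroup driver and systemd tends to show up as containers being killed and restarted.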

@quadeare (Author)

Hi,

Sorry for this very late response...

How is it possible to run k8s v1.24.X with enable_cri_dockerd: false? Wasn't Docker (dockershim) support removed in v1.24, so the cri flag should always be true then?

I enable CRI on my Docker daemon so I don't use dockershim: gitpod-io/gitpod#5410 (comment)
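
For context, the containerd that ships with the Docker packages has its CRI plugin disabled out of the box, so (as I understand the linked comment) enabling it is roughly:

# containerd.io installs /etc/containerd/config.toml with: disabled_plugins = ["cri"]
# remove "cri" from that list (or clear the list), then restart containerd
sudo sed -i 's/^disabled_plugins = \["cri"\]/disabled_plugins = []/' /etc/containerd/config.toml
sudo systemctl restart containerd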

Did you find a solution to this? I'm experiencing something similar on Kubernetes 1.25.2-00 with containerd 1.5.9 and 1.6.14, but I'm installing manually, rather than with Rancher.

Pffff, yeah... It seems to be an issue with kernels > 5.4 and the iptables implementation.
The CNI tries to initialize iptables rules and goes into CrashLoopBackOff.
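
A quick way to check which iptables backend a node is using (assuming the usual Debian/Ubuntu update-alternatives setup):

iptables --version                              # shows "(nf_tables)" or "(legacy)" in the version string
sudo update-alternatives --display iptables     # lists the available backends and the current selection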

I'll try to get more logs next time ;)

@immanuelfodor

I enable CRI on my Docker daemon so I don't use dockershim: gitpod-io/gitpod#5410 (comment)

Wow, so you actually use containerd with RKE1, which was thought not to be possible (#94), and people should move to RKE2. No cri_dockerd in use, and no Docker on the nodes, I'm amazed! And no issues so far?

Would this modification work for an existing cluster? Make the changes inside the VMs for containerd, then modify the kubelet params and do an rke up?

@immanuelfodor

Note: I suppose containerd was installed by Docker, so you actually have Docker installed, but I suppose Docker is not in use in your scenario. It would also be interesting to have only containerd installed, without Docker, but that's just a note. For an existing cluster, Docker could remain installed; I don't mind much, as long as the cluster actually uses containerd.

@quadeare (Author)

Sorry again for the delay...!

Wow, so you actually use containerd with RKE1, which was thought not to be possible (#94), and people should move to RKE2. No cri_dockerd in use, and no Docker on the nodes, I'm amazed! And no issues so far?

RKE1 is fully compatible with containerd and I don't have any issue with it! That's why I'm not moving my homelab to RKE2 at the moment. It's not an easy move, and RKE1 is really easy to use.

Would this modification work for an existing cluster? Make the changes inside the VMs for containerd, then modify the kubelet params and do an rke up?

Yes, of course! You will have some production downtime, but it is quite possible! This is what I did on my homelab a year ago and it is still working (upgrades included)!
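
Roughly the sequence (a sketch, assuming containerd is already configured on each node and the kubelet extra_args are the ones from the cluster.yml above):

# 1. on every node: enable containerd's CRI plugin (see the earlier comment) and restart it
sudo systemctl restart containerd

# 2. in cluster.yml: add the container-runtime / container-runtime-endpoint kubelet extra_args

# 3. re-run RKE against the updated cluster.yml
rke up --config cluster.yml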

Note: I suppose containerd was installed by Docker, so you actually have Docker installed, but I suppose Docker is not in use in your scenario. It would also be interesting to have only containerd installed, without Docker, but that's just a note. For an existing cluster, Docker could remain installed; I don't mind much, as long as the cluster actually uses containerd.

Docker CE is mandatory to run the k8s components started by RKE1, because RKE1 uses the docker-ce CLI to start those containers.

We don't care about RKE1 compatibility, because RKE1 never drives pods. Only the kubelet drives pods, and with RKE1 we can configure it to use the containerd embedded with docker-ce.

Here is an example from a new node in my homelab.

docker-ce container list output:

quadeare@atlas:~$ docker ps -a
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS      PORTS     NAMES
13576fa11308   rancher/hyperkube:v1.24.6-rancher1   "/opt/rke-tools/entr…"   10 days ago   Up 6 days             kube-proxy
53f2a41165ba   rancher/hyperkube:v1.24.6-rancher1   "/opt/rke-tools/entr…"   10 days ago   Up 6 days             kubelet
0cecd6700bce   rancher/rke-tools:v0.1.87            "/bin/bash"              10 days ago   Created               service-sidekick
7dbbcaf6c66d   rancher/rke-tools:v0.1.87            "nginx-proxy CP_HOST…"   10 days ago   Up 6 days             nginx-proxy

And the containerd output:

quadeare@atlas:~$ sudo ctr -n k8s.io c ls
CONTAINER                                                           IMAGE                                                               RUNTIME                  
062d4a48f99da36d3be5c8dd9bdad13cea80f716867d71cc086807b22ea076a3    registry.k8s.io/pause:3.6                                           io.containerd.runc.v2    
0ef744b6799e99f27eea0e9d14aad7ab9eb1e3a2167e376a609abab4e0f3b64f    registry.k8s.io/pause:3.6                                           io.containerd.runc.v2    
104ebeff985ec3c66a7ef2a1b76fb0746a176c7177bd6773243361ecdf6ce336    registry.k8s.io/pause:3.6                                           io.containerd.runc.v2    
136a2261945e2f30bfe6aa55482490f63cc10c351b4eabee09c1eef887dd7931    docker.io/longhornio/longhorn-manager:v1.2.4                        io.containerd.runc.v2    
1daa32452b74bdb333c1e3e485453da52df712c968c5316d8293de3de9810f39    docker.io/rancher/mirrored-calico-kube-controllers:v3.22.0          io.containerd.runc.v2    
1ddc5a1c2a83b432747b75abd36d4c66a1ca5eae21f6b187c36452fad4ef48cc    docker.io/rancher/mirrored-coredns-coredns:1.9.3                    io.containerd.runc.v2    
2054ff552c6a921faf3f66c7b1e2138349c4f029483cd46ffc329174327e78b7    docker.io/longhornio/longhorn-engine:v1.2.4                         io.containerd.runc.v2    
...
... and many other pods...
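
A quick check that the kubelet is really reporting containerd as its runtime:

kubectl get nodes -o wide
# the CONTAINER-RUNTIME column should show containerd://<version> instead of docker://<version>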

Well, you can continue using RKE1 like me, as long as RKE1 is maintained :p

@FlyingOnion

It seems to be an issue with kernels > 5.4 and the iptables implementation. The CNI tries to initialize iptables rules and goes into CrashLoopBackOff.

@quadeare
Hi. Can you provide more details about this? I'm new to this project. Sorry for any disturbance.

@RaceFPV

RaceFPV commented Mar 21, 2023

Bumping this issue as I have the same problem with Ubuntu 22.04 and flannel with RKE1. I tried `sudo update-alternatives --set iptables /usr/sbin/iptables-legacy`, which used to be the fix for this, but it no longer seems to work.
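
In case it helps anyone trying the legacy route, the switch is usually applied to the other xtables front-ends as well, followed by a reboot (or at least recreating the flannel pods), before concluding it doesn't help:

sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo reboot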

@github-actions (Contributor)

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@quadeare (Author)

It turned out that, for me, flannel was being restarted because kube-proxy was restarting frequently. I finally determined that the root cause was that I needed to set SystemdCgroup in /etc/containerd/config.toml. This fixed it for me:

mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml # Reverts the containerd config to the default configuration
sed -i 's/SystemdCgroup \= false/SystemdCgroup \= true/g' /etc/containerd/config.toml
systemctl restart containerd

I'm not sure why that's not set by default on Ubuntu 22.04 (at least it's not on Arm64). Maybe it's fixed in more recent versions of containerd, but containerd doesn't seem to regenerate /etc/containerd, let alone the config.toml on reinstall.

Sorry for this very very late response...

I just upgraded my homelab today and tested your workaround.
It works like a charm, thank you so much!

I'm now able to run RKE1 on Ubuntu 22.04 without any issue.

Have a nice day!
