
v3.21 operator failing to set up the pod network on a k3d cluster #1709

Closed
Glen-Tigera opened this issue Dec 22, 2021 · 13 comments

@Glen-Tigera
Member

Glen-Tigera commented Dec 22, 2021

When using the v3.21 operator to install Calico on a k3d cluster, the pod network fails to start. This bug report is the result of investigation done together with the k3d team at Rancher: k3d-io/k3d#898

Expected Behavior

The pod network should come up successfully in all namespaces, with all pods in the Running state.

Current Behavior

The calico-node pods run without issue, but other pods (coredns, metrics-server, calico-kube-controllers) are stuck in the ContainerCreating state.

$ kubectl get pods -A
NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE     IP           NODE                             NOMINATED NODE   READINESS GATES
tigera-operator   tigera-operator-7dc6bc5777-jqgj6           1/1     Running             0          6m36s   172.29.0.3   k3d-test-cluster-3-21-server-0   <none>           <none>
calico-system     calico-typha-786fc79b-hm5sr                1/1     Running             0          6m17s   172.29.0.3   k3d-test-cluster-3-21-server-0   <none>           <none>
calico-system     calico-kube-controllers-78cc777977-trgbz   0/1     ContainerCreating   0          6m17s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
kube-system       metrics-server-86cbb8457f-59s6k            0/1     ContainerCreating   0          6m36s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
kube-system       local-path-provisioner-5ff76fc89d-w7bf6    0/1     ContainerCreating   0          6m36s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
kube-system       coredns-7448499f4d-7rwx9                   0/1     ContainerCreating   0          6m36s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
calico-system     calico-node-99jc6                          1/1     Running             0          6m17s   172.29.0.3   k3d-test-cluster-3-21-server-0   <none>           <none>

When describing a stuck pod, I see this in its events:

$ kubectl describe pod/coredns-7448499f4d-7rwx9 -n kube-system

  Warning  FailedCreatePodSandBox  6s                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "51947047f29820ea93c486fe4c18f5a31e9c9c9418e859e320b8d3b2c43bd383": netplugin failed with no error message: fork/exec /opt/cni/bin/calico: no such file or directory

Based on the error above, I checked whether the calico binary exists at /opt/cni/bin/calico inside the node container, and it does:

glen@glen-tigera: $ docker exec -ti k3d-test-cluster-3-21-server-0 /bin/sh
/ # ls
bin  dev  etc  k3d  lib  opt  output  proc  run  sbin  sys  tmp  usr  var
/ # cd /opt/cni/bin/
/opt/cni/bin # ls -a
.  ..  bandwidth  calico  calico-ipam  flannel  host-local  install  loopback  portmap  tags.txt  tuning
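An exec that fails with "no such file or directory" on a file that clearly exists usually means the binary's ELF interpreter (dynamic loader) is missing, not the binary itself. A hedged way to check, assuming file and ldd are available on the host (the minimal node image may not ship them), is to copy the plugin out of the node container and inspect how it is linked:

# Copy the plugin from the k3d node container to the host and inspect it
docker cp k3d-test-cluster-3-21-server-0:/opt/cni/bin/calico /tmp/calico
file /tmp/calico   # reports "statically linked" or "dynamically linked, interpreter ..."
ldd /tmp/calico    # "not a dynamic executable" confirms a static build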

CNI config ConfigMap:

kubectl get cm cni-config -n calico-system -o yaml
apiVersion: v1
data:
  config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "datastore_type": "kubernetes",
          "mtu": 0,
          "nodename_file_optional": false,
          "log_level": "Info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "ipam": { "type": "calico-ipam", "assign_ipv4" : "true", "assign_ipv6" : "false"},
          "container_settings": {
              "allow_ip_forwarding": true
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "k8s_api_root":"https://10.43.0.1:443",
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        },
        {"type": "portmap", "snat": true, "capabilities": {"portMappings": true}}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2021-12-22T16:00:02Z"
  name: cni-config
  namespace: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: 90769081-24a2-440d-9666-a9c3b94ebd34
  resourceVersion: "635"
  uid: 609157c5-c43b-42d9-bb5b-7053a8673a49

Possible Solution

This only occurs with v3.21 of the operator. I tested prior versions of the operator and they set up the pod network successfully; see k3d-calico-operator-install-findings.txt. The issue likely lies in changes between the v3.20 and v3.21 operator.
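For reference, the prior-version checks were plain operator installs of the archived manifests; a sketch for v3.20 (assuming the same archive URL pattern as the v3.21 steps below, with the same containerIPForwarding tweak applied):

kubectl apply -f https://docs.projectcalico.org/archive/v3.20/manifests/tigera-operator.yaml
kubectl apply -f https://docs.projectcalico.org/archive/v3.20/manifests/custom-resources.yaml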

Steps to Reproduce (for bugs)

  1. Install K3D (v5.2.1 was used at the time of testing) - https://k3d.io/v5.2.1/
  2. Run the list of commands in order below
k3d cluster create "test-cluster-3-21" --k3s-arg "--flannel-backend=none@server:*" --k3s-arg "--no-deploy=traefik@server:*"
kubectl apply -f https://docs.projectcalico.org/archive/v3.21/manifests/tigera-operator.yaml
curl -L https://docs.projectcalico.org/archive/v3.21/manifests/custom-resources.yaml > k3d-custom-res.yaml
yq e '.spec.calicoNetwork.containerIPForwarding="Enabled"' -i k3d-custom-res.yaml
kubectl apply -f k3d-custom-res.yaml

This should install Calico through the operator on your k3d cluster with container IP forwarding enabled.

  1. Get all pods kubectl get pods -A
  2. Notice that the pods are stuck in the ContainerCreating state, so the pod network has failed to start (a quick verification sketch follows below).
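If the reproduction worked, the stuck pods should report the same sandbox error as shown above; a quick way to confirm (label and pod names are illustrative):

kubectl get pods -A -o wide
kubectl describe pod -n kube-system -l k8s-app=kube-dns | grep -A2 FailedCreatePodSandBox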

Context

Delivery engineering wants to be able to support k3d provisioning and install in Banzai to expand our E2E coverage of supported provisioners. This would also help our engineering teams test their features on a local k3d cluster, as it is much faster to set up.

Your Environment

OS: GNU/Linux
Kernel Version: 20.04.2-Ubuntu SMP
Kernel Release: 5.11.0-40-generic
Processor/HW Platform/Machine Architecture: x86_64

@Glen-Tigera
Member Author

K3D + Calico operator install summary:
3.15 ✔️
3.16 ✔️
3.17 ✔️
3.18 ✔️
3.19 ✔️
3.20 ✔️
3.21 ❌

k3d-calico-operator-install-findings.txt

@iwilltry42

Hi @Glen-Tigera, I got an e-mail from another k3d user saying that he's seeing the same issue on k3OS with Tigera Operator v3.21, while it works with v3.15.

@tmjd
Member

tmjd commented Jan 12, 2022

@Glen-Tigera did you compare the Installation resources of v3.20 and v3.21? At a minimum, the v3.21 Installation should have a NonPrivileged field set to Disabled; that field was not available in v3.20 because the only option was privileged.
Also, did you look at the calico-node DaemonSet? Your install-findings file shows that one calico-node was deployed and even Ready, so why weren't more calico-node pods at least being attempted? That suggests a scheduling problem that I would expect to be reported in the DaemonSet.

@Glen-Tigera
Member Author

Glen-Tigera commented Jan 12, 2022

Hey @tmjd, I just took a look. Yes, you're right: the v3.21 installation has nonPrivileged: Disabled as the default, while this field is not present in v3.20. There is also controlPlaneReplicas: 2 in the v3.21 installation.

Screenshot from 2022-01-12 18-32-11

I believe the k3d cluster create command creates one server node and no agents by default, which is why there's only one calico-node deployed here. The number of servers and agents on the cluster can be tuned, though, via command-line flags or a config file:
https://k3d.io/v5.2.2/usage/commands/k3d_cluster_create/#synopsis
https://k3d.io/v5.2.1/usage/configfile/
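For example (flag names per the k3d v5 docs linked above), a larger cluster for the same repro could be created with something like:

# 1 server + 2 agents, still with flannel disabled so Calico can be installed
k3d cluster create "test-cluster-3-21" --servers 1 --agents 2 \
  --k3s-arg "--flannel-backend=none@server:*" \
  --k3s-arg "--no-deploy=traefik@server:*"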

@tmjd
Member

tmjd commented Jan 14, 2022

Sorry, I got confused about the nodes; I'm not sure why I thought there should be more. I don't think my previous comment was very useful, except to confirm that nonPrivileged: Disabled is set, which means the installation isn't using the new nonPrivileged option. That is what I would have expected.

You're suggesting that the difference between the versions is in the operator, but it could very well be in calico-node. Could I suggest trying a v3.21 install, then putting the annotation unsupported.operator.tigera.io/ignore: "true" on the calico-node DaemonSet and switching the calico-node image to the v3.20 version, to see if the problem still exists? You could also try installing v3.20 and then switching the calico-node image to v3.21, but I'm less confident in version compatibility with that combo.

@Glen-Tigera
Member Author

Hey @tmjd, sorry, I've been busy with test plans the past few weeks so I couldn't address this until now. I provisioned a v3.21 calico-node first and then applied the annotation above. Then I tried changing the image field for calico-node to v3.20.0. It looks like the problem still exists, unless there's a better way to downgrade the node.

kubectl annotate daemonsets calico-node -n calico-system unsupported.operator.tigera.io/ignore="true"
daemonset.apps/calico-node annotated

After annotation, the network was the same:

NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-c4b9549c7-w2527            1/1     Running             0          5m46s
calico-system     calico-typha-8686dd5c79-798gg              1/1     Running             0          5m23s
calico-system     calico-typha-8686dd5c79-q7xx5              1/1     Running             0          5m32s
calico-system     calico-kube-controllers-7cd6f7b9f9-rpjkj   0/1     ContainerCreating   0          5m32s
kube-system       local-path-provisioner-5ff76fc89d-chc5f    0/1     ContainerCreating   0          7m
kube-system       coredns-7448499f4d-9sllb                   0/1     ContainerCreating   0          7m
kube-system       metrics-server-86cbb8457f-fcbrd            0/1     ContainerCreating   0          7m
calico-system     calico-node-fhkqf                          1/1     Running             0          5m32s
calico-system     calico-node-khmrj                          1/1     Running             0          5m32s
calico-system     calico-node-j2qqm                          1/1     Running             0          5m32s
calico-system     calico-node-5hv2d                          1/1     Running             0          5m32s

Then I edited the DaemonSet spec, changing the calico-node container image (.spec.template.spec.containers[].image) from
docker.io/calico/node:v3.21.4
to
docker.io/calico/node:v3.20.0
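(An equivalent one-liner, assuming the container in the DaemonSet is named calico-node:)

kubectl set image daemonset/calico-node -n calico-system \
  calico-node=docker.io/calico/node:v3.20.0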

After that, the DaemonSet terminated the v3.21.4 calico-node pods and created new ones that pulled v3.20.0. I waited a while and still wasn't able to see the remaining pods get healthy, so you might be right that it could be an issue in calico-node instead of the operator.

NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-c4b9549c7-w2527            1/1     Running             0          22m
calico-system     calico-typha-8686dd5c79-798gg              1/1     Running             0          22m
calico-system     calico-typha-8686dd5c79-q7xx5              1/1     Running             0          22m
kube-system       local-path-provisioner-5ff76fc89d-chc5f    0/1     ContainerCreating   0          23m
kube-system       coredns-7448499f4d-9sllb                   0/1     ContainerCreating   0          23m
kube-system       metrics-server-86cbb8457f-fcbrd            0/1     ContainerCreating   0          23m
calico-system     calico-kube-controllers-7cd6f7b9f9-9vhzz   0/1     ContainerCreating   0          5m41s
calico-system     calico-node-qvqm4                          1/1     Running             0          3m56s
calico-system     calico-node-wfs29                          1/1     Running             0          3m44s
calico-system     calico-node-xbdb9                          1/1     Running             0          3m22s
calico-system     calico-node-5ptfb                          1/1     Running             0          2m53s

@xinbinhuang

xinbinhuang commented Jan 31, 2022

Thanks for looking into this! I'm seeing the same issue on k3OS (provisioned as a Proxmox VM), and I think projectcalico/calico#5356 may be relevant here.

@xinbinhuang

xinbinhuang commented Jan 31, 2022

I've fixed the problem by replacing both of the init containers: cni and pod2daemon (the flexvolume driver). I had to rebuild the cni-plugin and flexvolume driver binaries with static link flags.

My setup is a k3s cluster running k3OS images (2 Raspberry Pis - arm64, plus 1 Proxmox VM - amd64) with Calico Operator v3.22. One thing I noticed is that both architectures needed a rebuilt pod2daemon, but only amd64 also needed a rebuilt cni.
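For anyone curious, a minimal sketch of what "static link flags" means for a Go binary like the CNI plugin; the real builds go through the Calico Makefiles, so the target path here is illustrative:

# CGO_ENABLED=0 yields a pure-Go, statically linked binary that needs no libc
# or ELF interpreter on the node image; build the arm64 variant with GOARCH=arm64.
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -ldflags '-s -w' -o bin/calico ./cmd/calico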

I've pushed these two images (cni and pod2daemon-flexvol) for both arm64 and amd64 in case you want to test them out on your end.

Cheers

@tmjd
Member

tmjd commented Jan 31, 2022

@Glen-Tigera could you try updating the cni plugin and flexvol container to v3.20.0 also?

@Glen-Tigera
Member Author

Glen-Tigera commented Jan 31, 2022

@xinbinhuang Thanks for looking into this too! 🙏 I appreciate you linking a related issue; it looks like a resolution is in progress.

@tmjd That worked. Once I changed the manifest to use the v3.20.0 cni and pod2daemon init containers, the pod network became functional after the DaemonSet re-created the calico-node pods. It is still functional with v3.22.0 of the calico-node image plus v3.20.0 of cni + pod2daemon, so the issue is isolated to the init containers.

Screenshot from 2022-01-31 15-36-53

Casey has a PR to fix pod2daemon; that may be the source of this issue.
projectcalico/calico#5515

@xinbinhuang

@tmjd while waiting for the upstream images to be fixed, is it possible to override the init container images during operator installation?

@tmjd
Member

tmjd commented Feb 14, 2022

One way to temporarily override the init containers is to use the "unsupported" annotation on the calico-node DaemonSet, but with that annotation added, the DaemonSet will no longer be updated by the operator.
You can see how to do that here: https://github.com/tigera/operator/blob/master/README.md#making-temporary-changes-to-components-the-operator-manages
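A hedged sketch of that workflow (the init container names below are assumed; double-check them with kubectl get ds calico-node -n calico-system -o yaml before editing):

# Stop the operator from reconciling the DaemonSet
kubectl annotate daemonset calico-node -n calico-system \
  unsupported.operator.tigera.io/ignore="true"
# Then edit the init container images in place, pointing the CNI and
# pod2daemon-flexvol init containers at the known-good v3.20.0 tags:
kubectl edit daemonset calico-node -n calico-system
#   spec.template.spec.initContainers[*].image:
#     docker.io/calico/cni:v3.20.0
#     docker.io/calico/pod2daemon-flexvol:v3.20.0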

@caseydavenport
Member

This isn't an operator issue - the cause of this was us switching to dynamically linked builds of some host binaries (CNI and pod2daemon flexvol).

These both have fixes that will be available in the next Calico release (v3.23).
