
v3.21 operator failing to set up the pod network on a k3d cluster #1709

Closed
Glen-Tigera opened this issue Dec 22, 2021 · 13 comments

@Glen-Tigera
Member

Glen-Tigera commented Dec 22, 2021

When using the v3.21 operator to install Calico on a k3d cluster, the pod network fails to start. This bug report is the result of investigation done together with the k3d team at Rancher: k3d-io/k3d#898

Expected Behavior

The pod network should come up successfully in all namespaces, with all pods in the Running state.

Current Behavior

The calico-node pods run without issue, but other pods (coredns, metrics-server, calico-kube-controllers) are stuck in the ContainerCreating state.

$ kubectl get pods -A
NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE     IP           NODE                             NOMINATED NODE   READINESS GATES
tigera-operator   tigera-operator-7dc6bc5777-jqgj6           1/1     Running             0          6m36s   172.29.0.3   k3d-test-cluster-3-21-server-0   <none>           <none>
calico-system     calico-typha-786fc79b-hm5sr                1/1     Running             0          6m17s   172.29.0.3   k3d-test-cluster-3-21-server-0   <none>           <none>
calico-system     calico-kube-controllers-78cc777977-trgbz   0/1     ContainerCreating   0          6m17s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
kube-system       metrics-server-86cbb8457f-59s6k            0/1     ContainerCreating   0          6m36s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
kube-system       local-path-provisioner-5ff76fc89d-w7bf6    0/1     ContainerCreating   0          6m36s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
kube-system       coredns-7448499f4d-7rwx9                   0/1     ContainerCreating   0          6m36s   <none>       k3d-test-cluster-3-21-server-0   <none>           <none>
calico-system     calico-node-99jc6                          1/1     Running             0          6m17s   172.29.0.3   k3d-test-cluster-3-21-server-0   <none>           <none>

When describing a stuck pod, I see this in its events:

$ kubectl describe pod/coredns-7448499f4d-7rwx9 -n kube-system

  Warning  FailedCreatePodSandBox  6s                     kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "51947047f29820ea93c486fe4c18f5a31e9c9c9418e859e320b8d3b2c43bd383": netplugin failed with no error message: fork/exec /opt/cni/bin/calico: no such file or directory

Based on the error above, I checked whether the calico binary exists at /opt/cni/bin/calico inside the node container, and it does:

glen@glen-tigera: $ docker exec -ti k3d-test-cluster-3-21-server-0 /bin/sh
/ # ls
bin  dev  etc  k3d  lib  opt  output  proc  run  sbin  sys  tmp  usr  var
/ # cd /opt/cni/bin/
/opt/cni/bin # ls -a
.  ..  bandwidth  calico  calico-ipam  flannel  host-local  install  loopback  portmap  tags.txt  tuning
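An exec that fails with "no such file or directory" on a file that clearly exists usually means the binary's ELF interpreter (dynamic loader) is missing, not the binary itself. A hedged way to check, assuming file and ldd are available on the host (the minimal node image may not ship them), is to copy the plugin out of the node container and inspect how it is linked:

# Copy the plugin from the k3d node container to the host and inspect it
docker cp k3d-test-cluster-3-21-server-0:/opt/cni/bin/calico /tmp/calico
file /tmp/calico   # reports "statically linked" or "dynamically linked, interpreter ..."
ldd /tmp/calico    # "not a dynamic executable" confirms a static build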

CNI config ConfigMap:

kubectl get cm cni-config -n calico-system -o yaml
apiVersion: v1
data:
  config: |-
    {
      "name": "k8s-pod-network",
      "cniVersion": "0.3.1",
      "plugins": [
        {
          "type": "calico",
          "datastore_type": "kubernetes",
          "mtu": 0,
          "nodename_file_optional": false,
          "log_level": "Info",
          "log_file_path": "/var/log/calico/cni/cni.log",
          "ipam": { "type": "calico-ipam", "assign_ipv4" : "true", "assign_ipv6" : "false"},
          "container_settings": {
              "allow_ip_forwarding": true
          },
          "policy": {
              "type": "k8s"
          },
          "kubernetes": {
              "k8s_api_root":"https://10.43.0.1:443",
              "kubeconfig": "__KUBECONFIG_FILEPATH__"
          }
        },
        {
          "type": "bandwidth",
          "capabilities": {"bandwidth": true}
        },
        {"type": "portmap", "snat": true, "capabilities": {"portMappings": true}}
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2021-12-22T16:00:02Z"
  name: cni-config
  namespace: calico-system
  ownerReferences:
  - apiVersion: operator.tigera.io/v1
    blockOwnerDeletion: true
    controller: true
    kind: Installation
    name: default
    uid: 90769081-24a2-440d-9666-a9c3b94ebd34
  resourceVersion: "635"
  uid: 609157c5-c43b-42d9-bb5b-7053a8673a49

Possible Solution

This only occurs with v3.21 of the operator. I tested prior versions of the operator and they set up the pod network successfully; see k3d-calico-operator-install-findings.txt. The issue likely lies in changes between the v3.20 and v3.21 operator.
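For reference, the prior-version checks were plain operator installs of the archived manifests; a sketch for v3.20 (assuming the same archive URL pattern as the v3.21 steps below, with the same containerIPForwarding tweak applied):

kubectl apply -f https://docs.projectcalico.org/archive/v3.20/manifests/tigera-operator.yaml
kubectl apply -f https://docs.projectcalico.org/archive/v3.20/manifests/custom-resources.yaml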

Steps to Reproduce (for bugs)

  1. Install K3D (v5.2.1 was used at the time of testing) - https://k3d.io/v5.2.1/
  2. Run the list of commands in order below
k3d cluster create "test-cluster-3-21" --k3s-arg "--flannel-backend=none@server:*" --k3s-arg "--no-deploy=traefik@server:*"
kubectl apply -f https://docs.projectcalico.org/archive/v3.21/manifests/tigera-operator.yaml
curl -L https://docs.projectcalico.org/archive/v3.21/manifests/custom-resources.yaml > k3d-custom-res.yaml
yq e '.spec.calicoNetwork.containerIPForwarding="Enabled"' -i k3d-custom-res.yaml
kubectl apply -f k3d-custom-res.yaml

This should install Calico through the operator on your k3d cluster with container IP forwarding enabled.

  1. Get all pods kubectl get pods -A
  2. Notice that the pods are stuck in the ContainerCreating state, so the pod network has failed to start (a quick verification sketch follows below).
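If the reproduction worked, the stuck pods should report the same sandbox error as shown above; a quick way to confirm (label and pod names are illustrative):

kubectl get pods -A -o wide
kubectl describe pod -n kube-system -l k8s-app=kube-dns | grep -A2 FailedCreatePodSandBox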

Context

Delivery engineering wants to be able to support k3d provisioning and install in Banzai to expand our E2E coverage of supported provisioners. This would also help our engineering teams test their features on a local k3d cluster, as it is much faster to set up.

Your Environment

OS: GNU/Linux
Kernel Version: 20.04.2-Ubuntu SMP
Kernel Release: 5.11.0-40-generic
Processor/HW Platform/Machine Architecture: x86_64

@Glen-Tigera
Member Author

K3D + Calico operator install summary:
3.15 ✔️
3.16 ✔️
3.17 ✔️
3.18 ✔️
3.19 ✔️
3.20 ✔️
3.21 ❌

k3d-calico-operator-install-findings.txt

@iwilltry42

Hi @Glen-Tigera, I got an e-mail from another k3d user saying that he's seeing the same issue on k3OS with Tigera Operator v3.21, while it works with v3.15.

@tmjd
Member

tmjd commented Jan 12, 2022

@Glen-Tigera did you compare the Installation resources of v3.20 and v3.21? At a minimum, the v3.21 Installation should have a NonPrivileged field set to Disabled; that field was not available in v3.20 because the only option was privileged.
Also, did you look at the calico-node DaemonSet? Your install-findings file shows that one calico-node was deployed and even Ready, so why weren't more calico-node pods at least being attempted? That suggests a scheduling problem that I would expect to be reported in the DaemonSet.

@Glen-Tigera
Member Author

Glen-Tigera commented Jan 12, 2022

Hey @tmjd, I just took a look. Yes, you're right: the v3.21 installation has nonPrivileged: Disabled as the default, while this field is not present in v3.20. There is also controlPlaneReplicas: 2 in the v3.21 installation.

Screenshot from 2022-01-12 18-32-11

I believe the k3d cluster create command creates one server node and no agents by default, which is why there's only one calico-node deployed here. The number of servers and agents on the cluster can be tuned, though, via command-line flags or a config file:
https://k3d.io/v5.2.2/usage/commands/k3d_cluster_create/#synopsis
https://k3d.io/v5.2.1/usage/configfile/
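For example (flag names per the k3d v5 docs linked above), a larger cluster for the same repro could be created with something like:

# 1 server + 2 agents, still with flannel disabled so Calico can be installed
k3d cluster create "test-cluster-3-21" --servers 1 --agents 2 \
  --k3s-arg "--flannel-backend=none@server:*" \
  --k3s-arg "--no-deploy=traefik@server:*"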

@tmjd
Member

tmjd commented Jan 14, 2022

Sorry, I got confused about the nodes; I'm not sure why I thought there should be more. I don't think my previous comment was very useful, except to confirm that nonPrivileged: Disabled is set, which means the installation isn't using the new nonPrivileged option. That is what I would have expected.

You're suggesting that the difference between the versions is in the operator, but it could very well be in calico-node. Could I suggest trying a v3.21 install, then putting the annotation unsupported.operator.tigera.io/ignore: "true" on the calico-node DaemonSet and switching the calico-node image to the v3.20 version, to see if the problem still exists? You could also try installing v3.20 and then switching the calico-node image to v3.21, but I'm less confident in version compatibility with that combo.

@Glen-Tigera
Member Author

Hey @tmjd, sorry, I've been busy with test plans the past few weeks so I couldn't address this until now. I provisioned a v3.21 calico-node first and then applied the annotation above. Then I tried changing the image field for calico-node to v3.20.0. It looks like the problem still exists, unless there's a better way to downgrade the node.

kubectl annotate daemonsets calico-node -n calico-system unsupported.operator.tigera.io/ignore="true"
daemonset.apps/calico-node annotated

After annotation, the network was the same:

NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-c4b9549c7-w2527            1/1     Running             0          5m46s
calico-system     calico-typha-8686dd5c79-798gg              1/1     Running             0          5m23s
calico-system     calico-typha-8686dd5c79-q7xx5              1/1     Running             0          5m32s
calico-system     calico-kube-controllers-7cd6f7b9f9-rpjkj   0/1     ContainerCreating   0          5m32s
kube-system       local-path-provisioner-5ff76fc89d-chc5f    0/1     ContainerCreating   0          7m
kube-system       coredns-7448499f4d-9sllb                   0/1     ContainerCreating   0          7m
kube-system       metrics-server-86cbb8457f-fcbrd            0/1     ContainerCreating   0          7m
calico-system     calico-node-fhkqf                          1/1     Running             0          5m32s
calico-system     calico-node-khmrj                          1/1     Running             0          5m32s
calico-system     calico-node-j2qqm                          1/1     Running             0          5m32s
calico-system     calico-node-5hv2d                          1/1     Running             0          5m32s

Then I edited the DaemonSet spec, changing the calico-node container image (.spec.template.spec.containers[].image) from
docker.io/calico/node:v3.21.4
to
docker.io/calico/node:v3.20.0
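(An equivalent one-liner, assuming the container in the DaemonSet is named calico-node:)

kubectl set image daemonset/calico-node -n calico-system \
  calico-node=docker.io/calico/node:v3.20.0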

After that, the DaemonSet terminated the v3.21.4 calico-node pods and created new ones that pulled v3.20.0. I waited a while and still wasn't able to see the remaining pods get healthy, so you might be right that it could be an issue in calico-node instead of the operator.

NAMESPACE         NAME                                       READY   STATUS              RESTARTS   AGE
tigera-operator   tigera-operator-c4b9549c7-w2527            1/1     Running             0          22m
calico-system     calico-typha-8686dd5c79-798gg              1/1     Running             0          22m
calico-system     calico-typha-8686dd5c79-q7xx5              1/1     Running             0          22m
kube-system       local-path-provisioner-5ff76fc89d-chc5f    0/1     ContainerCreating   0          23m
kube-system       coredns-7448499f4d-9sllb                   0/1     ContainerCreating   0          23m
kube-system       metrics-server-86cbb8457f-fcbrd            0/1     ContainerCreating   0          23m
calico-system     calico-kube-controllers-7cd6f7b9f9-9vhzz   0/1     ContainerCreating   0          5m41s
calico-system     calico-node-qvqm4                          1/1     Running             0          3m56s
calico-system     calico-node-wfs29                          1/1     Running             0          3m44s
calico-system     calico-node-xbdb9                          1/1     Running             0          3m22s
calico-system     calico-node-5ptfb                          1/1     Running             0          2m53s

@xinbinhuang

xinbinhuang commented Jan 31, 2022

Thanks for looking into this! I'm seeing the same issue on k3OS (provisioned as a Proxmox VM), and I think projectcalico/calico#5356 may be relevant here.

@xinbinhuang

xinbinhuang commented Jan 31, 2022

I've fixed the problem by replacing both of the init containers: cni and pod2daemon (the flexvolume driver). I had to rebuild the cni-plugin and flexvolume driver binaries with static link flags.

My setup is a k3s cluster running k3OS images (2 Raspberry Pis - arm64, plus 1 Proxmox VM - amd64) with Calico Operator v3.22. One thing I noticed is that both architectures needed a rebuilt pod2daemon, but only amd64 also needed a rebuilt cni.
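For anyone curious, a minimal sketch of what "static link flags" means for a Go binary like the CNI plugin; the real builds go through the Calico Makefiles, so the target path here is illustrative:

# CGO_ENABLED=0 yields a pure-Go, statically linked binary that needs no libc
# or ELF interpreter on the node image; build the arm64 variant with GOARCH=arm64.
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 \
  go build -ldflags '-s -w' -o bin/calico ./cmd/calico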

I've pushed these two images (cni and pod2daemon-flexvol) for both arm64 and amd64 in case you want to test them out on your end.

Cheers

@tmjd
Member

tmjd commented Jan 31, 2022

@Glen-Tigera could you try updating the cni plugin and flexvol container to v3.20.0 also?

@Glen-Tigera
Member Author

Glen-Tigera commented Jan 31, 2022

@xinbinhuang Thanks for looking into this too! 🙏 I appreciate you linking a related issue; it looks like a resolution is in progress.

@tmjd That worked. Once I changed the manifest to use the v3.20.0 cni and pod2daemon init containers, the pod network became functional after the DaemonSet re-created the calico-node pods. It is still functional with v3.22.0 of the calico-node image plus v3.20.0 of cni + pod2daemon, so the issue is isolated to the init containers.

Screenshot from 2022-01-31 15-36-53

Casey has a PR to fix pod2daemon; that may be the source of this issue.
projectcalico/calico#5515

@xinbinhuang

@tmjd while waiting for the upstream images to be fixed, is it possible to override the init container images during operator installation?

@tmjd
Member

tmjd commented Feb 14, 2022

One way to temporarily override the init containers is to use the "unsupported" annotation on the calico-node DaemonSet, but with that annotation added, the DaemonSet will no longer be updated by the operator.
You can see how to do that here: https://github.com/tigera/operator/blob/master/README.md#making-temporary-changes-to-components-the-operator-manages
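A hedged sketch of that workflow (the init container names below are assumed; double-check them with kubectl get ds calico-node -n calico-system -o yaml before editing):

# Stop the operator from reconciling the DaemonSet
kubectl annotate daemonset calico-node -n calico-system \
  unsupported.operator.tigera.io/ignore="true"
# Then edit the init container images in place, pointing the CNI and
# pod2daemon-flexvol init containers at the known-good v3.20.0 tags:
kubectl edit daemonset calico-node -n calico-system
#   spec.template.spec.initContainers[*].image:
#     docker.io/calico/cni:v3.20.0
#     docker.io/calico/pod2daemon-flexvol:v3.20.0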

@caseydavenport
Member

This isn't an operator issue - the cause of this was us switching to dynamically linked builds of some host binaries (CNI and pod2daemon flexvol).

These both have fixes that will be available in the next Calico release (v3.23).
