v3.21 operator failing to set up the pod network on a k3d cluster #1709
Comments
K3D + Calico operator install summary:
Hi @Glen-Tigera, I got an e-mail from another k3d user saying that he's seeing the same issue on k3OS with Tigera Operator v3.21, while it works with v3.15.
@Glen-Tigera did you compare the Installation resources of v3.20 and v3.21? At a minimum, the v3.21 Installation should have a NonPrivileged field set to Disabled; that field was not available in v3.20, where privileged was the only option.
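For reference, a sketch of how the two could be compared (the resource name default is the operator's usual default; the exact field casing is an assumption):
```
# Dump the operator's Installation resource on each version and diff them
kubectl get installation default -o yaml

# Under v3.21 the spec is expected to carry the new field, roughly:
#   spec:
#     nonPrivileged: Disabled
```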
Hey @tmjd, just took a look. Yes, you're right: I believe the v3.21 installation has the NonPrivileged field you describe.
Sorry, I got confused about the nodes; I'm not sure why I thought there should be more, so I don't think my previous comment was very useful. You're suggesting the difference between the versions is the operator, but it could very well be in calico-node. Could I suggest trying a v3.21 install, putting the annotation below on the calico-node daemonset, and then changing the calico-node image to v3.20.0?
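The annotation itself was lost in formatting; a sketch of the step, assuming the operator's "unsupported" escape-hatch annotation is what was meant:
```
# Stop the operator from reconciling the daemonset so manual edits stick
# (the annotation key is an assumption based on the operator's "unsupported"
#  mechanism mentioned later in this thread)
kubectl -n calico-system annotate daemonset calico-node \
  unsupported.operator.tigera.io/ignore="true"
```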
Hey @tmjd, sorry, I've been busy with test plans the past few weeks so I couldn't address this until now. I provisioned a v3.21 calico-node first and then applied the annotation above. Then I tried changing the image field to v3.20.0 for calico-node. It looks like the problem still exists, unless there's a better way to downgrade the node.
After applying the annotation, the network state was the same:
Then I edited the daemonset spec to make that change: after that, the daemonset terminated the v3.21.4 calico-node containers and created new ones which pulled v3.20.0. I waited for a minute and the remaining containers never became healthy, so you might be right that it could be an issue in calico-node instead of the operator.
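A sketch of that image edit, assuming the default calico/node image naming:
```
# Swap only the main calico-node container image; the daemonset must already
# carry the ignore annotation or the operator will revert the edit
kubectl -n calico-system set image daemonset/calico-node \
  calico-node=docker.io/calico/node:v3.20.0

# Watch the pods get re-created with the older image
kubectl -n calico-system rollout status daemonset/calico-node
```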
Thanks for looking into this! I'm seeing the same issue on k3OS (provisioned as a Proxmox VM), and I think projectcalico/calico#5356 may be relevant here.
I've fixed the problem by replacing both the init containers. My setup is a k3s cluster all running k3OS images (2 Pis on arm64 + 1 Proxmox VM on amd64) with Calico Operator v3.22. One thing I noticed is that both archs need the rebuild. I've pushed these two images (cni & pod2daemon-flexvol) for both. Cheers
@Glen-Tigera could you try updating the cni plugin and flexvol container to v3.20.0 also?
@xinbinhuang Thanks for looking into this too! 🙏 I appreciate you linking a related issue; there seems to be an ongoing resolution. @tmjd That worked. Once I changed the manifest to use v3.20.0 cni and pod2daemon init containers (sketched below), the pod network was functional once the daemonset re-created the calico-nodes. It is still functional with v3.22.0 of the calico-node image and v3.20.0 of cni + pod2daemon, so the issue is just in the init containers. Casey has a PR to fix pod2daemon; that may be the source of this issue.
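A sketch of that pinning step; the init-container names install-cni and flexvol-driver are assumptions based on the operator's usual calico-node pod spec:
```
# Pin both init containers back to v3.20.0 (container names are assumptions);
# if your kubectl version does not update init containers via `set image`,
# edit the daemonset spec directly instead
kubectl -n calico-system set image daemonset/calico-node \
  install-cni=docker.io/calico/cni:v3.20.0 \
  flexvol-driver=docker.io/calico/pod2daemon-flexvol:v3.20.0
```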
@tmjd while waiting for the upstream image to be fixed, is it possible to override the init containers' images during operator installation?
One way to temporarily override the init containers is to use the "unsupported" annotation on the calico-node daemonset, but with that annotation added, the daemonset will no longer be updated by the operator.
This isn't an operator issue: the cause was our switch to dynamically linked builds of some host binaries (CNI and pod2daemon flexvol). Both have fixes that will be available in the next Calico release (v3.23).
When using the v3.21 operator to install Calico on a k3d cluster, the pod network fails to start. This bug report is the result of investigations done with the k3d team at Rancher: k3d-io/k3d#898
Expected Behavior
The pod network should be up and running in all namespaces, with all pods in the Running state.
Current Behavior
The calico-node pods run without issue, but other pods are stuck in the ContainerCreating state (coredns, metrics, calico-kube-controllers).
When describing the stuck pods, I see this in their events:
Based on the error above, I checked /opt/cni/bin/calico to see if the calico binary existed in the container, which it does:
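Since k3d nodes are plain Docker containers, this check can also be run from the host; the node name below is illustrative:
```
# Confirm the Calico CNI binary was copied into the node's CNI bin dir
docker exec k3d-mycluster-server-0 ls -l /opt/cni/bin/calico

# Given the dynamic-linking cause noted in the comments above, listing the
# binary's library dependencies is also telling (if ldd exists in the image)
docker exec k3d-mycluster-server-0 ldd /opt/cni/bin/calico
```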
CNI config YAML:
Possible Solution
This is only occurring in v3.21 of the operator. I tested prior versions of the operator and they set up the pod network successfully; see k3d-calico-operator-install-findings.txt. The issue should lie in the changes to the operator between v3.20 and v3.21.
Steps to Reproduce (for bugs)
This installs Calico through the operator on your k3d cluster with IP forwarding enabled.
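A minimal sketch of such an install, assuming k3d v5 flag syntax and the v3.21 manifest paths on the Calico docs archive:
```
# Create a k3d cluster with the built-in flannel CNI disabled
# (k3d node containers inherit IP forwarding from the Docker host;
#  ensure net.ipv4.ip_forward=1 there)
k3d cluster create calico-test \
  --k3s-arg "--flannel-backend=none@server:*" \
  --k3s-arg "--disable-network-policy@server:*"

# Install the v3.21 Tigera operator and the default Installation resource
kubectl create -f https://docs.projectcalico.org/archive/v3.21/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/archive/v3.21/manifests/custom-resources.yaml
```
With v3.21, the non-calico pods then remain stuck in ContainerCreating, which the check below shows.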
kubectl get pods -A
Context
Delivery engineering wants to be able to support k3d provisioning and install in Banzai to expand our E2E coverage of supported provisioners. This would help our engineering team test their features on a local k3d cluster, as it is much faster to set up.
Your Environment
OS: GNU/Linux
Kernel Version: 20.04.2-Ubuntu SMP
Kernel Release: 5.11.0-40-generic
Processor/HW Platform/Machine Architecture: x86_64