/kind bug
What steps did you take and what happened:
I'm able to build a CAPZ cluster on top of Azure instances (no AKS) with Cilium as the CNI. I'm using a private VNet that is not managed by CAPZ. The cluster has 1 control-plane node and 3 worker nodes.
The problem is that the defined cluster CIDR is not reachable between nodes, e.g.:
Cluster CIDR is 10.88.0.0/16
POD1 (IP 10.88.1.64) on WORKER1 (IP 10.117.21.118) can't reach POD2 (IP 10.88.4.77) on WORKER2 (IP 10.117.21.109)
Communication is allowed through the NSG (Any-Any), and traffic within the same subnet should be open by default anyway.
The route table is created and reconciled by CAPZ and is properly populated.
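For completeness, the route table entries and NSG rules can be cross-checked with the Azure CLI; the resource group and resource names below are placeholders:

```sh
# List the routes reconciled into the node route table
# (expect one route per node pod CIDR with the node IP as next hop).
az network route-table route list \
  --resource-group <capz-resource-group> \
  --route-table-name <node-route-table> \
  --output table

# List the NSG rules on the node subnet to confirm the Any-Any allow rule.
az network nsg rule list \
  --resource-group <capz-resource-group> \
  --nsg-name <node-nsg> \
  --output table
```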
Cilium was added as a cluster addon.
The Cilium interfaces on the node are created properly:
4: cilium_net@cilium_host: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ea:eb:9d:aa:2b:2b brd ff:ff:ff:ff:ff:ff
inet6 fe80::e8eb:9dff:feaa:2b2b/64 scope link
valid_lft forever preferred_lft forever
5: cilium_host@cilium_net: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 7e:02:15:79:39:e2 brd ff:ff:ff:ff:ff:ff
inet 10.88.4.48/32 scope global cilium_host
valid_lft forever preferred_lft forever
inet6 fe80::7c02:15ff:fe79:39e2/64 scope link
valid_lft forever preferred_lft forever
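To narrow this down, it may help to check on WORKER1 which route the kernel actually selects for POD2's address (10.88.4.77 from the example above); a minimal check could look like this:

```sh
# On WORKER1: list routes covering the cluster CIDR and show which
# route/next hop is used for the remote pod IP.
ip route show | grep "10.88."
ip route get 10.88.4.77
```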
Cilium health status:
Name IP Node Endpoints
worker1-machinedeployment-9qr7m-qc56f (localhost):
Host connectivity to 10.117.21.118:
ICMP to stack: OK, RTT=125.694µs
HTTP to agent: OK, RTT=221.314µs
Endpoint connectivity to 10.88.1.64:
ICMP to stack: OK, RTT=151.943µs
HTTP to agent: OK, RTT=525.348µs
worker2-machinedeployment-9qr7m-72sfk:
Host connectivity to 10.117.21.109:
ICMP to stack: OK, RTT=719.321µs
HTTP to agent: OK, RTT=424.48µs
Endpoint connectivity to 10.88.4.77:
ICMP to stack: Connection timed out
HTTP to agent: Get "http://10.88.4.77:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
worker3-machinedeployment-9qr7m-7g6ln:
Host connectivity to 10.117.21.10:
ICMP to stack: OK, RTT=692.478µs
HTTP to agent: OK, RTT=418.877µs
Endpoint connectivity to 10.88.2.199:
ICMP to stack: Connection timed out
HTTP to agent: Get "http://10.88.2.199:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
master1-z9z9s:
Host connectivity to 10.117.20.100:
ICMP to stack: OK, RTT=4.251332ms
HTTP to agent: OK, RTT=491.527µs
Endpoint connectivity to 10.88.0.110:
ICMP to stack: Connection timed out
HTTP to agent: Get "http://10.88.0.110:4240/hello": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
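Since Hubble is enabled (see the configuration below), drops on the failing path can also be captured from the Cilium agent running on the sending node; the pod name below is a placeholder:

```sh
# Watch for datapath drops on the agent on WORKER1 (Cilium 1.16+ in-pod CLI).
kubectl -n kube-system exec <cilium-pod-on-worker1> -- cilium-dbg monitor --type drop

# Or filter Hubble flows towards the unreachable health endpoint.
kubectl -n kube-system exec <cilium-pod-on-worker1> -- hubble observe --verdict DROPPED --to-ip 10.88.4.77
```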
Cilium configuration:
identity-allocation-mode: crd
identity-heartbeat-timeout: 30m0s
identity-gc-interval: 15m0s
cilium-endpoint-gc-interval: 5m0s
nodes-gc-interval: 5m0s
debug: "true"
enable-policy: default
prometheus-serve-addr: :9962
controller-group-metrics: write-cni-file sync-host-ips sync-lb-maps-with-k8s-services
proxy-prometheus-port: "9964"
operator-prometheus-serve-addr: :9963
enable-metrics: "true"
enable-ipv4: "true"
enable-ipv6: "false"
custom-cni-conf: "false"
enable-bpf-clock-probe: "false"
enable-bpf-tproxy: "false"
monitor-aggregation: medium
monitor-aggregation-interval: 5s
monitor-aggregation-flags: all
bpf-map-dynamic-size-ratio: "0.0025"
enable-host-legacy-routing: "false"
bpf-policy-map-max: "16384"
bpf-lb-map-max: "65536"
bpf-lb-external-clusterip: "true"
bpf-events-drop-enabled: "true"
bpf-events-policy-verdict-enabled: "true"
bpf-events-trace-enabled: "true"
preallocate-bpf-maps: "false"
cluster-name: default
cluster-id: "0"
routing-mode: native
service-no-backend-response: reject
enable-l7-proxy: "true"
enable-ipv4-masquerade: "true"
enable-ipv4-big-tcp: "false"
enable-ipv6-big-tcp: "false"
enable-ipv6-masquerade: "true"
enable-tcx: "true"
datapath-mode: veth
enable-bpf-masquerade: "true"
enable-masquerade-to-route-source: "false"
enable-xt-socket-fallback: "true"
install-no-conntrack-iptables-rules: "false"
auto-direct-node-routes: "false"
direct-node-routes-skip-unreachable: "true"
direct-routing-skip-unreachable: "false"
enable-local-redirect-policy: "false"
ipv4-native-routing-cidr: 10.88.0.0/16
enable-runtime-device-detection: "true"
kube-proxy-replacement: "true"
kube-proxy-replacement-healthz-bind-address: 0.0.0.0:10256
bpf-lb-sock: "false"
bpf-lb-sock-terminate-pod-connections: "false"
enable-health-check-nodeport: "true"
enable-health-check-loadbalancer-ip: "false"
node-port-bind-protection: "true"
enable-auto-protect-node-port-range: "true"
bpf-lb-mode: dsr
bpf-lb-acceleration: disabled
enable-svc-source-range-check: "true"
enable-l2-neigh-discovery: "true"
arping-refresh-period: 30s
k8s-require-ipv4-pod-cidr: "false"
k8s-require-ipv6-pod-cidr: "false"
enable-k8s-networkpolicy: "true"
write-cni-conf-when-ready: /host/etc/cni/net.d/05-cilium.conflist
cni-exclusive: "true"
cni-log-file: /var/run/cilium/cilium-cni.log
enable-endpoint-health-checking: "true"
enable-health-checking: "true"
enable-well-known-identities: "false"
enable-node-selector-labels: "false"
synchronize-k8s-nodes: "true"
operator-api-serve-addr: 127.0.0.1:9234
enable-hubble: "true"
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-metrics-server: :9965
hubble-metrics-server-enable-tls: "false"
hubble-metrics: dns drop tcp flow icmp http
enable-hubble-open-metrics: "false"
hubble-export-file-max-size-mb: "10"
hubble-export-file-max-backups: "5"
hubble-listen-address: :4244
hubble-disable-tls: "true"
ipam: kubernetes
ipam-cilium-node-update-rate: 15s
cluster-pool-ipv4-cidr: 10.88.0.0/16
cluster-pool-ipv4-mask-size: "24"
egress-gateway-reconciliation-trigger-interval: 1s
enable-vtep: "false"
vtep-endpoint: ""
vtep-cidr: ""
vtep-mask: ""
vtep-mac: ""
procfs: /host/proc
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
enable-k8s-terminating-endpoint: "true"
enable-sctp: "false"
k8s-client-qps: "10"
k8s-client-burst: "20"
remove-cilium-node-taints: "true"
set-cilium-node-taints: "true"
set-cilium-is-up-condition: "true"
unmanaged-pod-watcher-interval: "15"
dnsproxy-enable-transparent-mode: "true"
dnsproxy-socket-linger-timeout: "10"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "50"
tofqdns-idle-connection-grace-period: 0s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-proxy-response-max-delay: 100ms
agent-not-ready-taint-key: node.cilium.io/agent-not-ready
mesh-auth-enabled: "true"
mesh-auth-queue-size: "1024"
mesh-auth-rotated-identities-queue-size: "1024"
mesh-auth-gc-interval: 5m0s
proxy-xff-num-trusted-hops-ingress: "0"
proxy-xff-num-trusted-hops-egress: "0"
proxy-connect-timeout: "2"
proxy-max-requests-per-connection: "0"
proxy-max-connection-duration-seconds: "0"
proxy-idle-timeout-seconds: "60"
external-envoy-proxy: "false"
envoy-base-id: "0"
envoy-keep-cap-netbindservice: "false"
max-connected-clusters: "255"
clustermesh-enable-endpoint-sync: "false"
clustermesh-enable-mcs-api: "false"
nat-map-stats-entries: "32"
nat-map-stats-interval: 30s
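For context on the configuration above: with routing-mode: native, auto-direct-node-routes: "false", and ipam: kubernetes, Cilium does not install cross-node routes itself, so reachability of a remote node's pod CIDR depends entirely on the underlying network (here, the Azure route table) routing each node's podCIDR to that node. One way to list the per-node pod CIDRs and compare them against the route table entries (plain kubectl, nothing CAPZ-specific):

```sh
# Pod CIDR allocated to each node by Kubernetes IPAM; every one of these
# CIDRs must be routed to its node by the VNet route table.
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD-CIDR:.spec.podCIDR

# Cilium's view of the same allocation (pod CIDRs live under .spec.ipam.podCIDRs).
kubectl get ciliumnodes -o custom-columns=NAME:.metadata.name,POD-CIDRS:.spec.ipam.podCIDRs
```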
What did you expect to happen:
Pod-to-pod communication within the cluster pod CIDR should work across nodes.
Environment:
- cluster-api-provider-azure version: v1.19.4
- Kubernetes version (kubectl version): 1.31.6
- OS (e.g. from /etc/os-release): Ubuntu 24
- Tested with Cilium v1.16.2 and also v1.17.4

Based on a discussion in the Slack channel: https://kubernetes.slack.com/archives/CEX9HENG7/p1748952223222099