
Node-local-dns doesn't work with cilium CNI on kops 1.29.0  #16597

@nikita-nazemtsev

Description


/kind bug

1. What kops version are you running? The command kops version will display
this information.

Client version: 1.29.0 (git-v1.29.0)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

1.28.7

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
Update Kops from 1.28.4 to 1.29.0, or create a new cluster using Kops 1.29.0 with Node Local DNS and Cilium CNI.

5. What happened after the commands executed?
Pods on updated nodes cannot reach the node-local-dns pods (see the check below).
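A quick way to see the symptom (the exact test pod and image are just an example; 169.254.20.10 is the node-local-dns listen address also seen in the cilium logs below) is to run a lookup against the node-local cache from a pod on an updated node, which times out:

kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 -- nslookup kubernetes.default.svc.cluster.local 169.254.20.10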

6. What did you expect to happen?
Pods can access node-local-dns pods.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: "2024-05-31T07:47:47Z"
  name: k8s.tmp-test.example
spec:
  additionalSans:
  - api-internal.k8s.tmp-test.example
  - api.internal.k8s.tmp-test.example
  api:
    loadBalancer:
      type: Internal
      useForInternalApi: true
  authentication: {}
  authorization:
    rbac: {}
  certManager:
    enabled: true
    managed: false
  cloudConfig:
    awsEBSCSIDriver:
      enabled: true
  cloudProvider: aws
  clusterAutoscaler:
    enabled: true
  configBase: s3://example/k8s.tmp-test.example
  containerd:
    registryMirrors:
      '*':
      - https://nexus-proxy.example.io
      docker.io:
      - https://nexus-proxy.example.io
      k8s.gcr.io:
      - https://nexus-proxy.example.io
      public.ecr.aws:
      - https://nexus-proxy.example.io
      quay.io:
      - https://nexus-proxy.example.io
      registry.example.io:
      - https://registry.example.io
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-1a
      name: a
    - instanceGroup: master-1b
      name: b
    - instanceGroup: master-1c
      name: c
    manager:
      backupRetentionDays: 90
      env:
      - name: ETCD_LISTEN_METRICS_URLS
        value: http://0.0.0.0:2379
      - name: ETCD_METRICS
        value: basic
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MAX_REQUEST_BYTES
        value: "1572864"
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-1a
      name: a
    - instanceGroup: master-1b
      name: b
    - instanceGroup: master-1c
      name: c
    manager:
      backupRetentionDays: 90
      env:
      - name: ETCD_MANAGER_HOURLY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MANAGER_DAILY_BACKUPS_RETENTION
        value: 1d
      - name: ETCD_MAX_REQUEST_BYTES
        value: "1572864"
    memoryRequest: 100Mi
    name: events
  fileAssets:
  - content: |
      apiVersion: audit.k8s.io/v1
      kind: Policy
      rules:
      - level: RequestResponse
        userGroups:
        - "/devops"
        - "/developers"
        - "/teamleads"
        - "/k8s-full"
        - "/sre"
        - "/support"
        - "/qa"
        - "system:serviceaccounts"
    name: audit-policy-config
    path: /etc/kubernetes/audit/policy-config.yaml
    roles:
    - ControlPlane
  iam:
    legacy: false
  kubeAPIServer:
    auditLogMaxAge: 10
    auditLogMaxBackups: 1
    auditLogMaxSize: 100
    auditLogPath: /var/log/kube-apiserver-audit.log
    auditPolicyFile: /etc/kubernetes/audit/policy-config.yaml
    oidcClientID: kubernetes
    oidcGroupsClaim: groups
    oidcIssuerURL: https://sso.example.io/auth/realms/example
    serviceAccountIssuer: https://api-internal.k8s.tmp-test.example
    serviceAccountJWKSURI: https://api-internal.k8s.tmp-test.example/openid/v1/jwks
  kubeDNS:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: kops.k8s.io/instancegroup
              operator: In
              values:
              - infra-nodes
    nodeLocalDNS:
      cpuRequest: 25m
      enabled: true
      memoryRequest: 5Mi
    provider: CoreDNS
    tolerations:
    - effect: NoSchedule
      key: dedicated/infra
      operator: Exists
  kubeProxy:
    enabled: false
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
    evictionHard: memory.available<7%,nodefs.available<3%,nodefs.inodesFree<5%,imagefs.available<10%,imagefs.inodesFree<5%
    evictionMaxPodGracePeriod: 30
    evictionSoft: memory.available<12%
    evictionSoftGracePeriod: memory.available=200s
  kubernetesApiAccess:
  - 10.170.0.0/16
  kubernetesVersion: 1.28.7
  masterPublicName: api.k8s.tmp-test.example
  metricsServer:
    enabled: false
    insecure: true
  networkCIDR: 10.170.0.0/16
  networkID: vpc-xxxx
  networking:
    cilium:
      enableNodePort: true
      enablePrometheusMetrics: true
      ipam: eni
  nodeProblemDetector:
    enabled: true
  nodeTerminationHandler:
    enableSQSTerminationDraining: false
    enabled: true
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - 10.170.0.0/16
  subnets:
  - cidr: 10.170.140.0/24
    name: kops-k8s-1a
    type: Private
    zone: eu-central-1a
  - cidr: 10.170.142.0/24
    name: kops-k8s-eni-1a
    type: Private
    zone: eu-central-1a
  - cidr: 10.170.141.0/24
    name: kops-k8s-utility-1a
    type: Utility
    zone: eu-central-1a
  - cidr: 10.170.143.0/24
    name: kops-k8s-1b
    type: Private
    zone: eu-central-1b
  - cidr: 10.170.145.0/24
    name: kops-k8s-eni-1b
    type: Private
    zone: eu-central-1b
  - cidr: 10.170.144.0/24
    name: kops-k8s-utility-1b
    type: Utility
    zone: eu-central-1b
  - cidr: 10.170.146.0/24
    name: kops-k8s-1c
    type: Private
    zone: eu-central-1c
  - cidr: 10.170.148.0/24
    name: kops-k8s-eni-1c
    type: Private
    zone: eu-central-1c
  - cidr: 10.170.147.0/24
    name: kops-k8s-utility-1c
    type: Utility
    zone: eu-central-1c
  topology:
    dns:
      type: Private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-05-31T07:47:48Z"
  labels:
    kops.k8s.io/cluster: k8s.tmp-test.example
  name: graviton-nodes
spec:
  autoscale: false
  image: ami-0192de4261c8ff06a
  machineType: t4g.small
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t4g.small
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
    spotInstancePools: 3
  nodeLabels:
    kops.k8s.io/instancegroup: graviton-nodes
  role: Node
  rootVolumeSize: 25
  subnets:
  - kops-k8s-1a
  - kops-k8s-eni-1a
  sysctlParameters:
  - net.netfilter.nf_conntrack_max = 1048576
  - net.core.netdev_max_backlog = 30000
  - net.core.rmem_max = 134217728
  - net.core.wmem_max = 134217728
  - net.ipv4.tcp_wmem = 4096 87380 67108864
  - net.ipv4.tcp_rmem = 4096 87380 67108864
  - net.ipv4.tcp_mem = 187143 249527 1874286
  - net.ipv4.tcp_max_syn_backlog = 8192
  - net.ipv4.ip_local_port_range = 10240 65535

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-05-31T07:47:48Z"
  labels:
    kops.k8s.io/cluster: k8s.tmp-test.example
  name: infra-nodes
spec:
  autoscale: false
  image: ami-035f7f826413ac489
  machineType: t3.small
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3.small
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
    spotInstancePools: 3
  nodeLabels:
    kops.k8s.io/instancegroup: infra-nodes
  role: Node
  rootVolumeSize: 25
  subnets:
  - kops-k8s-1a
  - kops-k8s-eni-1a
  sysctlParameters:
  - net.netfilter.nf_conntrack_max = 1048576
  - net.core.netdev_max_backlog = 30000
  - net.core.rmem_max = 134217728
  - net.core.wmem_max = 134217728
  - net.ipv4.tcp_wmem = 4096 12582912 16777216
  - net.ipv4.tcp_rmem = 4096 12582912 16777216
  - net.ipv4.tcp_mem = 187143 249527 1874286
  - net.ipv4.tcp_max_syn_backlog = 8192
  - net.ipv4.ip_local_port_range = 10240 65535

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-05-31T07:47:47Z"
  labels:
    kops.k8s.io/cluster: k8s.tmp-test.example
  name: master-1a
spec:
  image: ami-035f7f826413ac489
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-1a
  role: Master
  rootVolumeSize: 25
  subnets:
  - kops-k8s-1a
  - kops-k8s-eni-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-05-31T07:47:47Z"
  labels:
    kops.k8s.io/cluster: k8s.tmp-test.example
  name: master-1b
spec:
  image: ami-035f7f826413ac489
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-1b
  role: Master
  rootVolumeSize: 25
  subnets:
  - kops-k8s-1b
  - kops-k8s-eni-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-05-31T07:47:48Z"
  labels:
    kops.k8s.io/cluster: k8s.tmp-test.example
  name: master-1c
spec:
  image: ami-035f7f826413ac489
  machineType: t3.medium
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-1c
  role: Master
  rootVolumeSize: 25
  subnets:
  - kops-k8s-1c
  - kops-k8s-eni-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2024-05-31T07:47:48Z"
  labels:
    kops.k8s.io/cluster: k8s.tmp-test.example
  name: nodes
spec:
  image: ami-035f7f826413ac489
  machineType: t3.small
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3.small
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
    spotInstancePools: 3
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  rootVolumeSize: 25
  subnets:
  - kops-k8s-1a
  - kops-k8s-eni-1a
  sysctlParameters:
  - net.netfilter.nf_conntrack_max = 1048576
  - net.core.netdev_max_backlog = 30000
  - net.core.rmem_max = 134217728
  - net.core.wmem_max = 134217728
  - net.ipv4.tcp_wmem = 4096 87380 67108864
  - net.ipv4.tcp_rmem = 4096 87380 67108864
  - net.ipv4.tcp_mem = 187143 249527 1874286
  - net.ipv4.tcp_max_syn_backlog = 8192
  - net.ipv4.ip_local_port_range = 10240 65535

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else do we need to know?

We found a workaround that fixes this issue on a single node:
We noticed that the nodelocaldns interface is in a DOWN state on the nodes (the same can be observed on older kops versions, where node-local-dns works fine).
After bringing the interface up with
ip link set dev nodelocaldns up
the nodelocaldns interface comes up.
In the cilium-agent logs on this node we can then see:
time="2024-05-31T09:23:08Z" level=info msg="Node addresses updated" device=nodelocaldns node-addresses="169.254.20.10 (nodelocaldns)" subsys=node-address

After these actions, all pods on this node can access node-local-dns without any problems.
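Until there is a proper fix, one way to avoid running the command by hand on every node might be a kops hook that brings the interface up once node-local-dns has created it. This is only a sketch of the idea, not a tested fix; the unit name, ordering, and the poll loop are assumptions and should be checked against the kops hooks documentation before use:

  hooks:
  - name: nodelocaldns-link-up.service
    roles:
    - Node
    - Master
    useRawManifest: true
    manifest: |
      [Unit]
      Description=Bring the nodelocaldns dummy interface up (workaround, not a fix)
      After=kubelet.service
      [Service]
      Type=oneshot
      RemainAfterExit=true
      # Wait until node-local-dns has created the dummy interface, then bring it up.
      ExecStart=/bin/sh -c 'until ip link show nodelocaldns >/dev/null 2>&1; do sleep 5; done; ip link set dev nodelocaldns up'
      [Install]
      WantedBy=multi-user.target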
