Neutron metadata OVN crashes periodically #38

@cloudnull

Describe the bug
At seemingly random intervals, the Neutron OVN metadata agent crashes. Once it crashes, it enters a crash loop and never returns to a running state.

To Reproduce
I can't reproduce the error at this time.

Expected behavior
The Neutron OVN metadata agent shouldn't crash.

Kubernetes Version
Currently running v1.28.6. This issue has also been seen in v1.26.10.

POD Description

# kubectl --namespace openstack describe pod neutron-ovn-metadata-agent-default-nrmnk
Name:             neutron-ovn-metadata-agent-default-nrmnk
Namespace:        openstack
Priority:         0
Service Account:  neutron-ovn-metadata-agent
Node:             935822-compute03-ospcv2-dfw.openstack.local/172.28.232.123
Start Time:       Fri, 26 Jan 2024 02:43:15 +0000
Labels:           application=neutron
                  component=ovn-metadata-agent
                  controller-revision-hash=6b4fc69764
                  pod-template-generation=14
                  release_group=neutron
Annotations:      configmap-bin-hash: 3b17fadc4799090e9f5d65201d90080ae322cff710e2b448a8f9d2c92555a57d
                  configmap-etc-hash: 05ac75f51498078f0e5f26737411dca523197d952546c1c5d9926732d1d7d7a8
                  openstackhelm.openstack.org/release_uuid:
Status:           Running
IP:               172.28.232.123
IPs:
  IP:           172.28.232.123
Controlled By:  DaemonSet/neutron-ovn-metadata-agent-default
Init Containers:
  init:
    Container ID:  containerd://9e3aeb65d0f234b6e93298ec960c940a5cbe1a3b88c61ed2b5a05cf7d61b1ee7
    Image:         quay.io/airshipit/kubernetes-entrypoint:v1.0.0
    Image ID:      sha256:c092d0dada614fdae3920939c5a9683b2758288f23c2e3b425128653857d7520
    Port:          <none>
    Host Port:     <none>
    Command:
      kubernetes-entrypoint
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 26 Jan 2024 02:43:15 +0000
      Finished:     Fri, 26 Jan 2024 02:43:17 +0000
    Ready:          True
    Restart Count:  0
    Environment:
      POD_NAME:                    neutron-ovn-metadata-agent-default-nrmnk (v1:metadata.name)
      NAMESPACE:                   openstack (v1:metadata.namespace)
      INTERFACE_NAME:              eth0
      PATH:                        /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/
      DEPENDENCY_SERVICE:          openstack:nova-metadata,openstack:neutron-server
      DEPENDENCY_DAEMONSET:
      DEPENDENCY_CONTAINER:
      DEPENDENCY_POD_JSON:
      DEPENDENCY_CUSTOM_RESOURCE:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
  neutron-metadata-agent-init:
    Container ID:  containerd://646c5854dc030c26473f090cc62a3b85e33badd985990b90addb67ec6098c383
    Image:         docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy
    Image ID:      docker.io/openstackhelm/neutron@sha256:b6f3dcfe8ffe051ed2280857365ebfd51220e8d0ef8c4ef9f9f8f59ddf1a0823
    Port:          <none>
    Host Port:     <none>
    Command:
      /tmp/neutron-metadata-agent-init.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 26 Jan 2024 02:43:18 +0000
      Finished:     Fri, 26 Jan 2024 02:43:18 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:     100m
      memory:  64Mi
    Environment:
      NEUTRON_USER_UID:  42424
    Mounts:
      /etc/neutron/neutron.conf from neutron-etc (ro,path="neutron.conf")
      /tmp from pod-tmp (rw)
      /tmp/neutron-metadata-agent-init.sh from neutron-bin (ro,path="neutron-metadata-agent-init.sh")
      /var/lib/neutron/openstack-helm from socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
  ovn-neutron-init:
    Container ID:  containerd://177e528248223bd182edb2c6be832085a1164eb2f5a9ca018319a1227fa8b249
    Image:         docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy
    Image ID:      docker.io/openstackhelm/neutron@sha256:b6f3dcfe8ffe051ed2280857365ebfd51220e8d0ef8c4ef9f9f8f59ddf1a0823
    Port:          <none>
    Host Port:     <none>
    Command:
      /tmp/neutron-ovn-init.sh
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 26 Jan 2024 02:43:19 +0000
      Finished:     Fri, 26 Jan 2024 02:43:19 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:        100m
      memory:     64Mi
    Environment:  <none>
    Mounts:
      /tmp from pod-tmp (rw)
      /tmp/neutron-ovn-init.sh from neutron-bin (ro,path="neutron-ovn-init.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
Containers:
  neutron-ovn-metadata-agent:
    Container ID:  containerd://dd4814d89b8446c913e22a8121062bae3d49206fd71fb89d6c9806c7e611140b
    Image:         docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy
    Image ID:      docker.io/openstackhelm/neutron@sha256:b6f3dcfe8ffe051ed2280857365ebfd51220e8d0ef8c4ef9f9f8f59ddf1a0823
    Port:          <none>
    Host Port:     <none>
    Command:
      /tmp/neutron-ovn-metadata-agent.sh
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 26 Jan 2024 03:17:01 +0000
      Finished:     Fri, 26 Jan 2024 03:17:14 +0000
    Ready:          False
    Restart Count:  11
    Limits:
      cpu:     2
      memory:  4Gi
    Requests:
      cpu:      100m
      memory:   64Mi
    Liveness:   exec [python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/ovn_metadata_agent.ini --liveness-probe] delay=120s timeout=580s period=600s #success=1 #failure=3
    Readiness:  exec [python /tmp/health-probe.py --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/ovn_metadata_agent.ini] delay=30s timeout=185s period=190s #success=1 #failure=3
    Environment:
      RPC_PROBE_TIMEOUT:  60
      RPC_PROBE_RETRIES:  2
    Mounts:
      /etc/neutron/logging.conf from neutron-etc (ro,path="logging.conf")
      /etc/neutron/neutron.conf from neutron-etc (ro,path="neutron.conf")
      /etc/neutron/ovn_metadata_agent.ini from neutron-etc (ro,path="ovn_metadata_agent.ini")
      /etc/neutron/plugins/ml2/ml2_conf.ini from neutron-etc (ro,path="ml2_conf.ini")
      /etc/neutron/rootwrap.conf from neutron-etc (ro,path="rootwrap.conf")
      /etc/neutron/rootwrap.d/debug.filters from neutron-etc (ro,path="debug.filters")
      /etc/neutron/rootwrap.d/dhcp.filters from neutron-etc (ro,path="dhcp.filters")
      /etc/neutron/rootwrap.d/dibbler.filters from neutron-etc (ro,path="dibbler.filters")
      /etc/neutron/rootwrap.d/ebtables.filters from neutron-etc (ro,path="ebtables.filters")
      /etc/neutron/rootwrap.d/ipset-firewall.filters from neutron-etc (ro,path="ipset-firewall.filters")
      /etc/neutron/rootwrap.d/iptables-firewall.filters from neutron-etc (ro,path="iptables-firewall.filters")
      /etc/neutron/rootwrap.d/l3.filters from neutron-etc (ro,path="l3.filters")
      /etc/neutron/rootwrap.d/linuxbridge-plugin.filters from neutron-etc (ro,path="linuxbridge-plugin.filters")
      /etc/neutron/rootwrap.d/netns-cleanup.filters from neutron-etc (ro,path="netns-cleanup.filters")
      /etc/neutron/rootwrap.d/openvswitch-plugin.filters from neutron-etc (ro,path="openvswitch-plugin.filters")
      /etc/neutron/rootwrap.d/privsep.filters from neutron-etc (ro,path="privsep.filters")
      /etc/sudoers.d/kolla_neutron_sudoers from neutron-etc (ro,path="neutron_sudoers")
      /run from run (rw)
      /run/netns from host-run-netns (rw)
      /tmp from pod-tmp (rw)
      /tmp/health-probe.py from neutron-bin (ro,path="health-probe.py")
      /tmp/neutron-ovn-metadata-agent.sh from neutron-bin (ro,path="neutron-ovn-metadata-agent.sh")
      /var/lib/neutron from pod-var-neutron (rw)
      /var/lib/neutron/openstack-helm from socket (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-vqjjt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  pod-tmp:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  pod-var-neutron:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  run:
    Type:          HostPath (bare host directory volume)
    Path:          /run
    HostPathType:
  neutron-bin:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      neutron-bin
    Optional:  false
  neutron-etc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  neutron-ovn-metadata-agent-default
    Optional:    false
  socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/neutron/openstack-helm
    HostPathType:
  host-run-netns:
    Type:          HostPath (bare host directory volume)
    Path:          /run/netns
    HostPathType:
  kube-api-access-vqjjt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              openstack-network-node=enabled
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  35m                  default-scheduler  Successfully assigned openstack/neutron-ovn-metadata-agent-default-nrmnk to 935822-compute03-ospcv2-dfw.openstack.local
  Normal   Pulled     35m                  kubelet            Container image "quay.io/airshipit/kubernetes-entrypoint:v1.0.0" already present on machine
  Normal   Created    35m                  kubelet            Created container init
  Normal   Started    35m                  kubelet            Started container init
  Normal   Pulled     35m                  kubelet            Container image "docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy" already present on machine
  Normal   Created    35m                  kubelet            Created container neutron-metadata-agent-init
  Normal   Started    35m                  kubelet            Started container neutron-metadata-agent-init
  Normal   Pulled     35m                  kubelet            Container image "docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy" already present on machine
  Normal   Created    35m                  kubelet            Created container ovn-neutron-init
  Normal   Started    35m                  kubelet            Started container ovn-neutron-init
  Normal   Pulled     34m (x4 over 35m)    kubelet            Container image "docker.io/openstackhelm/neutron:2023.1-ubuntu_jammy" already present on machine
  Normal   Created    34m (x4 over 35m)    kubelet            Created container neutron-ovn-metadata-agent
  Normal   Started    34m (x4 over 35m)    kubelet            Started container neutron-ovn-metadata-agent
  Warning  BackOff    35s (x151 over 35m)  kubelet            Back-off restarting failed container neutron-ovn-metadata-agent in pod neutron-ovn-metadata-agent-default-nrmnk_openstack(b12830e7-4c00-4562-a035-a31aab324ea3)

POD Logs

2024-01-26 03:17:06.670 321 INFO neutron.common.config [-] Logging enabled!
2024-01-26 03:17:06.670 321 INFO neutron.common.config [-] /var/lib/openstack/bin/neutron-ovn-metadata-agent version 22.1.1.dev14
2024-01-26 03:17:06.842 321 INFO neutron.agent.ovn.metadata.ovsdb [-] Getting OvsdbSbOvnIdl for MetadataAgent with retry
2024-01-26 03:17:07.288 330 INFO neutron.agent.ovn.metadata.ovsdb [-] Getting OvsdbSbOvnIdl for MetadataAgent with retry
2024-01-26 03:17:07.292 329 INFO neutron.agent.ovn.metadata.ovsdb [-] Getting OvsdbSbOvnIdl for MetadataAgent with retry
2024-01-26 03:17:13.761 321 INFO neutron.agent.ovn.metadata.agent [-] Cleaning up ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 namespace which is not needed anymore
2024-01-26 03:17:14.186 321 CRITICAL neutron [-] Unhandled error: OSError: [Errno 22] failed to open netns
2024-01-26 03:17:14.186 321 ERROR neutron Traceback (most recent call last):
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/bin/neutron-ovn-metadata-agent", line 8, in <module>
2024-01-26 03:17:14.186 321 ERROR neutron     sys.exit(main())
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/cmd/eventlet/agents/ovn_metadata.py", line 24, in main
2024-01-26 03:17:14.186 321 ERROR neutron     metadata_agent.main()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata_agent.py", line 42, in main
2024-01-26 03:17:14.186 321 ERROR neutron     agt.start()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 334, in start
2024-01-26 03:17:14.186 321 ERROR neutron     self.sync()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 65, in wrapped
2024-01-26 03:17:14.186 321 ERROR neutron     return f(*args, **kwargs)
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 407, in sync
2024-01-26 03:17:14.186 321 ERROR neutron     self.teardown_datapath(self._get_datapath_name(ns))
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/ovn/metadata/agent.py", line 454, in teardown_datapath
2024-01-26 03:17:14.186 321 ERROR neutron     ip.garbage_collect_namespace()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/linux/ip_lib.py", line 267, in garbage_collect_namespace
2024-01-26 03:17:14.186 321 ERROR neutron     if self.namespace_is_empty():
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/linux/ip_lib.py", line 262, in namespace_is_empty
2024-01-26 03:17:14.186 321 ERROR neutron     return not self.get_devices()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/agent/linux/ip_lib.py", line 179, in get_devices
2024-01-26 03:17:14.186 321 ERROR neutron     devices = privileged.get_device_names(self.namespace)
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 642, in get_device_names
2024-01-26 03:17:14.186 321 ERROR neutron     in get_link_devices(namespace, **kwargs)]
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 333, in wrapped_f
2024-01-26 03:17:14.186 321 ERROR neutron     return self(f, *args, **kw)
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 423, in __call__
2024-01-26 03:17:14.186 321 ERROR neutron     do = self.iter(retry_state=retry_state)
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 360, in iter
2024-01-26 03:17:14.186 321 ERROR neutron     return fut.result()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
2024-01-26 03:17:14.186 321 ERROR neutron     return self.__get_result()
2024-01-26 03:17:14.186 321 ERROR neutron   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
2024-01-26 03:17:14.186 321 ERROR neutron     raise self._exception
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/tenacity/__init__.py", line 426, in __call__
2024-01-26 03:17:14.186 321 ERROR neutron     result = fn(*args, **kwargs)
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_privsep/priv_context.py", line 271, in _wrap
2024-01-26 03:17:14.186 321 ERROR neutron     return self.channel.remote_call(name, args, kwargs,
2024-01-26 03:17:14.186 321 ERROR neutron   File "/var/lib/openstack/lib/python3.10/site-packages/oslo_privsep/daemon.py", line 215, in remote_call
2024-01-26 03:17:14.186 321 ERROR neutron     raise exc_type(*result[2])
2024-01-26 03:17:14.186 321 ERROR neutron OSError: [Errno 22] failed to open netns
2024-01-26 03:17:14.186 321 ERROR neutron

Additional context
The relevant log entry is here.

2024-01-26 03:17:13.761 321 INFO neutron.agent.ovn.metadata.agent [-] Cleaning up ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 namespace which is not needed anymore
2024-01-26 03:17:14.186 321 CRITICAL neutron [-] Unhandled error: OSError: [Errno 22] failed to open netns

It seems that the Neutron metadata agent is attempting to clean up ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 and failing to do so. The problem is easy to fix by hand: log in to the offending node, 935822-compute03-ospcv2-dfw.openstack.local/172.28.232.123, and delete the erroneous namespace. However, the metadata agent should be able to do this automatically.

Resolving the issue with Ansible.

ansible -m shell -a 'ip netns delete ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361' 935822-compute03-ospcv2-dfw.openstack.local --inventory /etc/genestack/inventory --become
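If more than one namespace needs the same treatment, the ad-hoc command above can be generated rather than typed. The following Python sketch only builds the argument list; it uses the same flags and inventory path as the command above and does not run anything. The function name build_cleanup_command and the name validation are assumptions made for this example.

```python
import re
import shlex

# Only accept names of the form ovnmeta-<uuid>, so nothing unexpected
# ends up inside the shell command passed to Ansible.
_OVNMETA_RE = re.compile(r"ovnmeta-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def build_cleanup_command(namespace, host, inventory="/etc/genestack/inventory"):
    """Return the ansible argv that deletes `namespace` on `host`."""
    if not _OVNMETA_RE.fullmatch(namespace):
        raise ValueError(f"unexpected namespace name: {namespace!r}")
    return [
        "ansible", "-m", "shell",
        "-a", f"ip netns delete {namespace}",
        host,
        "--inventory", inventory,
        "--become",
    ]


cmd = build_cleanup_command(
    "ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361",
    "935822-compute03-ospcv2-dfw.openstack.local",
)
# shlex.join quotes the arguments so the line can be pasted into a shell.
print(shlex.join(cmd))
```

The returned list can also be passed directly to subprocess.run, which avoids a second layer of shell quoting on the control host.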

Host netns

When investigating the host, we find the following output.

root@935822-compute03-ospcv2-dfw:~# ip netns
Error: Peer netns reference is invalid.
Error: Peer netns reference is invalid.
ovnmeta-a60825e1-66ff-479c-92c4-2120b092f75c (id: 15)
ovnmeta-2887f499-3ca1-4507-8f93-7585e6a13f63 (id: 14)
cni-0adaaeb7-0666-ee24-ce5e-fb5bc70c4f21 (id: 6)
cni-46d2ca93-5b2a-1574-ecfb-1c1e03d85ed4 (id: 4)
cni-c48dde5f-15ed-c672-35f4-a7b7bd361222 (id: 3)
cni-b595cd77-4a34-89fc-d2ff-8b55e8a76b23 (id: 2)
ovnmeta-970d930a-f047-4a4b-977e-d9a2fe2fe00d (id: 5)
Error: Peer netns reference is invalid.
ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361
Error: Peer netns reference is invalid.
ovnmeta-691b7a69-1142-4d5b-8395-026a4c676aab
ovnmeta-7ceb1594-e791-4ce2-a218-761e59bb1169 (id: 11)
ovnmeta-1b686933-cc0d-4744-9642-cbbf5e50bed8 (id: 10)
ovnmeta-639ba975-890a-4bc6-a230-3b6548ce6089 (id: 9)
ovnmeta-9b291755-a9b6-4abe-a5a7-6f75a7f6db16 (id: 8)
ovnmeta-0beb4d29-1244-431f-85e8-dd0a5ef1ac88 (id: 7)
cni-08e4d4a9-723a-4959-bcf4-b73d5859acf5 (id: 0)

The namespace ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361 has no id, and ip netns prints "Error: Peer netns reference is invalid." just before it. The same is true of ovnmeta-691b7a69-1142-4d5b-8395-026a4c676aab.
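The check done by eye above can be sketched in Python. This is a rough helper, assuming the ip netns output has been captured as text; the function name ovnmeta_namespaces_without_id is hypothetical. A missing id on its own does not prove a namespace is broken, so treat the result as a list of candidates to inspect, not a verdict.

```python
def ovnmeta_namespaces_without_id(ip_netns_output):
    """Return ovnmeta-* namespaces that are listed without an "(id: N)" suffix."""
    missing = []
    for line in ip_netns_output.splitlines():
        line = line.strip()
        if not line.startswith("ovnmeta-"):
            continue  # skips cni-* namespaces, "Error: ..." lines and blanks
        if "(id:" not in line:
            missing.append(line.split()[0])
    return missing


# A shortened copy of the output shown above.
SAMPLE = """\
Error: Peer netns reference is invalid.
ovnmeta-a60825e1-66ff-479c-92c4-2120b092f75c (id: 15)
cni-0adaaeb7-0666-ee24-ce5e-fb5bc70c4f21 (id: 6)
Error: Peer netns reference is invalid.
ovnmeta-f992a46b-8dae-4eef-b80f-2d980466a361
Error: Peer netns reference is invalid.
ovnmeta-691b7a69-1142-4d5b-8395-026a4c676aab
ovnmeta-7ceb1594-e791-4ce2-a218-761e59bb1169 (id: 11)
"""

candidates = ovnmeta_namespaces_without_id(SAMPLE)
print(candidates)
```

On the sample, the helper returns the two namespaces that appear without an id in the host output, including the one the agent fails to clean up.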
