
AAP failed/stuck job due to pod networking problem #187

Closed
DanielFroehlich opened this issue Jul 1, 2024 · 12 comments
Assignees
Labels
bug Something isn't working cluster/isar BareMetal COE Cluter

Comments

@DanielFroehlich

I am trying to run the job template "stormshift-update-template-vms" on ISAR AAP. The job fails: the automation-job pod in namespace "ansible-automation-platform" is stuck in state "ContainerCreating".

The event log shows these error messages:

addLogicalPort failed for ansible-automation-platform/automation-job-252-jcjdd: failed to assign pod addresses for pod default/ansible-automation-platform/automation-job-252-jcjdd on switch: ucs57, err: range is full

and

failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_automation-job-252-jcjdd_ansible-automation-platform_fbcce7b4-1feb-48d0-8067-21a2f69ab074_0(000b3c811b66882860ad874f24cbf77dafeca43b201ede10c99c6748000a1b5d): error adding pod ansible-automation-platform_automation-job-252-jcjdd to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:000b3c811b66882860ad874f24cbf77dafeca43b201ede10c99c6748000a1b5d Netns:/var/run/netns/b5ad18e1-7e53-4ac6-95ff-01737e7ae193 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=ansible-automation-platform;K8S_POD_NAME=automation-job-252-

@rbo, can you please advise?
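The "range is full" part suggests the per-node pod subnet on ucs57 has no free addresses left. A quick sanity check, assuming the usual OVN-Kubernetes node annotation (ucs57 is the node named in the event above):

# Show the pod CIDR that OVN-Kubernetes assigned to the node
$ oc get node ucs57 -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/node-subnets}{"\n"}'

# Count the pods scheduled on the node, each of which can hold an address
$ oc get pods -A -o wide --field-selector spec.nodeName=ucs57 | wc -l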

@DanielFroehlich DanielFroehlich added bug Something isn't working cluster/isar BareMetal COE Cluter labels Jul 1, 2024
@DanielFroehlich
Author

Same issue when creating a new VM: the virt-launcher pod is stuck too, for example in namespace "stormshift-microshift":

https://console-openshift-console.apps.isar.coe.muc.redhat.com/k8s/ns/stormshift-microshift/pods/virt-launcher-ushift08-cw2ww

@rbo rbo self-assigned this Jul 2, 2024
@rbo
Member

rbo commented Jul 2, 2024

Looks like we have a general problem with the ucs56/ucs57 nodes:

 oc get pods -A -o wide | grep -v Completed | grep -v Running 
NAMESPACE                                          NAME                                                              READY   STATUS              RESTARTS         AGE     IP              NODE    NOMINATED NODE   READINESS GATES
openshift-cnv                                      centos-7-image-cron-7a375378-28660012-xlf6s                       0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-cnv                                      centos-stream8-image-cron-2da55196-28660012-bccdn                 0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-cnv                                      centos-stream9-image-cron-3832a6ff-28660012-8bbtp                 0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-cnv                                      fedora-image-cron-2336cc39-28660012-5rkn7                         0/1     ContainerCreating   0                3d13h   <none>          ucs57   <none>           <none>
openshift-image-registry                           image-pruner-28660320-82h68                                       0/1     ContainerCreating   0                3d8h    <none>          ucs57   <none>           <none>
openshift-marketplace                              acm-custom-registry-bn72q                                         0/1     ContainerCreating   0                3d16h   <none>          ucs57   <none>           <none>
openshift-marketplace                              multiclusterengine-catalog-cqrdr                                  0/1     ContainerCreating   0                3d17h   <none>          ucs57   <none>           <none>
openshift-pipelines                                tekton-resource-pruner-r27k7-28660800-28xb6                       0/1     ContainerCreating   0                3d      <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   certified-operators-catalog-8d57f86d6-2fktc                       0/1     ContainerCreating   0                5h37m   <none>          ucs56   <none>           <none>
rbohne-hcp-rhods                                   community-operators-catalog-b4c8fddf8-4fqws                       0/1     ContainerCreating   0                3h7m    <none>          ucs56   <none>           <none>
rbohne-hcp-rhods                                   importer-prime-aae4a260-a506-4616-921f-78c117be02a0               0/2     Init:0/1            0                13m     <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   importer-prime-abbd9cb3-d101-4a90-93fa-97b4bc0280d5               0/2     Init:0/1            0                15m     <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28660527-l2rn8                               0/1     ContainerCreating   0                3d4h    <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28661967-xb9fx                               0/1     ContainerCreating   0                2d4h    <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28663407-7ng7c                               0/1     ContainerCreating   0                28h     <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   olm-collect-profiles-28664847-8dbhq                               0/1     ContainerCreating   0                4h50m   <none>          ucs57   <none>           <none>
rbohne-hcp-rhods                                   redhat-marketplace-catalog-7977bb8dd7-t8bzj                       0/1     ContainerCreating   0                11h     <none>          ucs56   <none>           <none>
rbohne-hcp-rhods                                   redhat-operators-catalog-6f6575d9c4-l7lq5                         0/1     ContainerCreating   0                10h     <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        certified-operators-catalog-75fbf8f964-rwgq7                      0/1     ContainerCreating   0                5h26m   <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        community-operators-catalog-6d5c96fdd8-lgcgn                      0/1     ContainerCreating   0                176m    <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28660503-s4pch                               0/1     ContainerCreating   0                3d5h    <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28661943-9wj98                               0/1     ContainerCreating   0                2d5h    <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28663383-jqjhr                               0/1     ContainerCreating   0                29h     <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28664823-svrlq                               0/1     ContainerCreating   0                5h14m   <none>          ucs57   <none>           <none>
rbohne-hcp-sendling-ingress                        redhat-marketplace-catalog-769b96bb8c-ldzzx                       0/1     ContainerCreating   0                11h     <none>          ucs56   <none>           <none>
rbohne-hcp-sendling-ingress                        redhat-operators-catalog-6ffbd47bb6-7l9vc                         0/1     ContainerCreating   0                9h      <none>          ucs56   <none>           <none>
stormshift-microshift                              virt-launcher-ushift08-cw2ww                                      0/1     ContainerCreating   0                19h     <none>          ucs57   <none>           1/1
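If the address range really is exhausted, the node's logical switch in the OVN northbound DB should show a port for nearly every address in the subnet. A sketch for inspecting it, assuming the nbdb container runs inside the ovnkube-node pod on that node:

# Find the ovnkube-node pod running on ucs57
$ POD=$(oc -n openshift-ovn-kubernetes get pod -l app=ovnkube-node --field-selector spec.nodeName=ucs57 -o name)

# Show the switch's subnet config and count its logical switch ports
$ oc -n openshift-ovn-kubernetes exec "$POD" -c nbdb -- ovn-nbctl list logical_switch ucs57
$ oc -n openshift-ovn-kubernetes exec "$POD" -c nbdb -- ovn-nbctl lsp-list ucs57 | wc -l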

@rbo
Member

rbo commented Jul 2, 2024

(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_virt-launcher-ushift08-cw2ww_stormshift-microshift_6e292806-0748-46dc-b90e-8b8767e0c409_0(12a7dca9a96acfe3a633aec9fbc5d7093acd6cff51a85e839960b8e19d1a8a79): error adding pod stormshift-microshift_virt-launcher-ushift08-cw2ww to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:12a7dca9a96acfe3a633aec9fbc5d7093acd6cff51a85e839960b8e19d1a8a79 Netns:/var/run/netns/4fa0ce5f-0c5f-447e-826d-30ddd55763f7 IfName:eth0

=> https://access.redhat.com/solutions/7042208 (an old KCS article) did not help...

@rbo
Member

rbo commented Jul 2, 2024

Let's try restarting the OVN-Kubernetes pods, per https://hackmd.io/@mjace/H1fJuv5Ap?utm_source=preview-mode&utm_medium=rec (the commands below run in the openshift-ovn-kubernetes namespace):

$ oc get pods -o wide
NAME                                     READY   STATUS    RESTARTS      AGE   IP            NODE    NOMINATED NODE   READINESS GATES
ovnkube-control-plane-6c569d8d4b-5fc4q   2/2     Running   1 (36d ago)   73d   10.32.96.5    inf5    <none>           <none>
ovnkube-control-plane-6c569d8d4b-df5n9   2/2     Running   0             73d   10.32.96.4    inf4    <none>           <none>
ovnkube-node-5fnmg                       8/8     Running   17            73d   10.32.96.8    inf8    <none>           <none>
ovnkube-node-bf9kn                       8/8     Running   9 (73d ago)   73d   10.32.96.4    inf4    <none>           <none>
ovnkube-node-kftrz                       8/8     Running   9 (73d ago)   73d   10.32.96.6    inf6    <none>           <none>
ovnkube-node-nb8fs                       8/8     Running   8             73d   10.32.96.44   inf44   <none>           <none>
ovnkube-node-tx28h                       8/8     Running   8             73d   10.32.96.57   ucs57   <none>           <none>
ovnkube-node-vkdv2                       8/8     Running   9 (73d ago)   73d   10.32.96.5    inf5    <none>           <none>
ovnkube-node-vn27s                       8/8     Running   16            73d   10.32.96.7    inf7    <none>           <none>
ovnkube-node-wzfnp                       8/8     Running   8             73d   10.32.96.56   ucs56   <none>           <none>

$ oc delete pods ovnkube-control-plane-6c569d8d4b-5fc4q ovnkube-control-plane-6c569d8d4b-df5n9 ovnkube-node-tx28h ovnkube-node-wzfnp
pod "ovnkube-control-plane-6c569d8d4b-5fc4q" deleted
pod "ovnkube-control-plane-6c569d8d4b-df5n9" deleted
pod "ovnkube-node-tx28h" deleted
pod "ovnkube-node-wzfnp" deleted
$ oc get pods -o wide
NAME                                     READY   STATUS    RESTARTS      AGE   IP            NODE    NOMINATED NODE   READINESS GATES
ovnkube-control-plane-6c569d8d4b-fxvqd   2/2     Running   0             61s   10.32.96.6    inf6    <none>           <none>
ovnkube-control-plane-6c569d8d4b-j2j6g   2/2     Running   0             61s   10.32.96.5    inf5    <none>           <none>
ovnkube-node-5fnmg                       8/8     Running   17            73d   10.32.96.8    inf8    <none>           <none>
ovnkube-node-bf9kn                       8/8     Running   9 (73d ago)   73d   10.32.96.4    inf4    <none>           <none>
ovnkube-node-dzgff                       8/8     Running   0             30s   10.32.96.56   ucs56   <none>           <none>
ovnkube-node-kftrz                       8/8     Running   9 (73d ago)   73d   10.32.96.6    inf6    <none>           <none>
ovnkube-node-nb8fs                       8/8     Running   8             73d   10.32.96.44   inf44   <none>           <none>
ovnkube-node-vdvmf                       8/8     Running   0             30s   10.32.96.57   ucs57   <none>           <none>
ovnkube-node-vkdv2                       8/8     Running   9 (73d ago)   73d   10.32.96.5    inf5    <none>           <none>
ovnkube-node-vn27s                       8/8     Running   16            73d   10.32.96.7    inf7    <none>           <none>
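For reference, the same restart can be scripted with selectors instead of hand-picked pod names; the app labels below are assumptions about how the OVN pods are labeled:

$ oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-control-plane
$ oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node --field-selector spec.nodeName=ucs56
$ oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node --field-selector spec.nodeName=ucs57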

@rbo
Member

rbo commented Jul 2, 2024

Solved

$ oc get pods -A -o wide | grep -v Completed | grep -v Running 
NAMESPACE                                          NAME                                                              READY   STATUS              RESTARTS         AGE     IP              NODE    NOMINATED NODE   READINESS GATES
rbohne-hcp-rhods                                   virt-launcher-rhods-4e9414fe-qdpg2-mm5rs                          0/1     ContainerCreating   0                4s      <none>          ucs56   <none>           1/1
rbohne-hcp-sendling-ingress                        olm-collect-profiles-28664823-svrlq                               0/1     Error               0                5h37m   10.130.8.10     ucs57   <none>           <none>

The pods above are from my HCP playground; we can ignore them for now.

@rbo rbo closed this as completed Jul 2, 2024
@DanielFroehlich
Author

Same problem again today with ucs56 - trying the workaround....

@DanielFroehlich
Author

...by deleting the ovnkube-control-plane pods AND the ovnkube-node pods on ucs56 and ucs57.
Now the cluster is in a really strange state: the console is not working and the API/control plane is degraded.
@rbo, HELP! Please!

@DanielFroehlich
Author

Feels like OVN-Kubernetes is in an inconsistent state, e.g. this event in openshift-console when trying to restart the console:

"4m42s Warning ErrorUpdatingResource pod/downloads-54777dd798-vxmhz addLogicalPort failed for openshift-console/downloads-54777dd798-vxmhz: timed out waiting for logical switch in logical switch cache "ucs57" subnet: error getting logical switch ucs57: switch not in logical switch cache"

@DanielFroehlich
Author

Trying to drain and reboot UCS56....
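Roughly, with the usual flags for a node running DaemonSets and emptyDir pods:

$ oc adm drain ucs56 --ignore-daemonsets --delete-emptydir-data --force
# Reboot the node, for example:
$ oc debug node/ucs56 -- chroot /host systemctl reboot
# Once it is back up:
$ oc adm uncordon ucs56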

@DanielFroehlich
Author

... that helped; the cluster looks way better now. I also needed to disable and re-enable the CNV console plugin.
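The console plugin toggle can also be done from the CLI; the enabled plugins live in the Console operator config ("kubevirt-plugin" as the CNV plugin name is an assumption here):

# Show the currently enabled console plugins
$ oc get consoles.operator.openshift.io cluster -o jsonpath='{.spec.plugins}{"\n"}'

# Remove kubevirt-plugin from spec.plugins and save; then add it back the same way
$ oc edit consoles.operator.openshift.io cluster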

@DanielFroehlich
Author

Still wondering what the root cause is/was. Do we need to reboot nodes regularly? Closing for now.
