
[IBM Z]: Test test_check_pods_status_after_node_failure failing even though the replacement pods are in Running state #6689

Closed
abdulkandathil opened this issue Nov 30, 2022 · 42 comments · Fixed by #6839

@abdulkandathil

The test fails while checking the status of pods in the Terminating state, even though the replacement pods are already in the Running state. Is this expected behavior?

Test: tests/manage/z_cluster/nodes/test_check_pods_status_after_node_failure.py::TestCheckPodsAfterNodeFailure::test_check_pods_status_after_node_failure

Error:
The pods being checked are in the Terminating state and replacement pods are already available.

16:10:38 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
16:10:38 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-54459799sc5b9 -n openshift-storage -o yaml
16:10:39 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
16:10:39 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mgr-a-5cf998b95c-s4w5h -n openshift-storage -o yaml
16:10:39 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
16:10:39 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-a-5464cb8bc5-w2qmr -n openshift-storage -o yaml

----

>           raise NotFoundError(f"Node name not found for the pod {pod_obj.name}")
E           ocs_ci.ocs.exceptions.NotFoundError: Node name not found for the pod rook-ceph-mon-a-5464cb8bc5-w2qmr

Mustgather and full logs: https://drive.google.com/file/d/1z0jgGfWm9Eddku470TUPAbiNLbQfWoun/view?usp=share_link

@Sravikaz
Contributor

Sravikaz commented Dec 7, 2022

Also observing the same issue in tier4b test cases

@sudeeshjohn

Observed this in IBM Power as well

@ebenahar
Contributor

@Shrivaibavi @OdedViner any idea if this happens also with our test executions on x86?

@OdedViner
Contributor

I don't know this test... @yitzhak12 Do you know this failure?

@yitzhak12
Contributor

I think this is a general issue in the function get_node_pods, at the line https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/node.py#L1414: if pod.get_pod_node(p).name == node_name:.
When get_pod_node is called there, it raises the error NotFoundError(f"Node name not found for the pod {pod_obj.name}").
So one way to fix it is to catch the NotFoundError exception in the function get_node_pods.
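One possible shape for that fix - just a rough sketch under the assumption that get_node_pods iterates over candidate pods and compares their node names, not the actual repository code - would be to skip pods whose node can no longer be resolved:

from ocs_ci.ocs.exceptions import NotFoundError
from ocs_ci.ocs.resources import pod


def get_node_pods(node_name, pods_to_search=None):
    """Return the pods running on node_name, skipping pods whose node
    can no longer be resolved (e.g. pods that are already terminating)."""
    pods_to_search = pods_to_search or pod.get_all_pods()
    node_pods = []
    for p in pods_to_search:
        try:
            if pod.get_pod_node(p).name == node_name:
                node_pods.append(p)
        except NotFoundError:
            # The pod has no node assigned or was already deleted - ignore it.
            continue
    return node_pods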

@Sravikaz
Contributor

Sravikaz commented Jan 4, 2023

We are still facing this issue with the latest ocs-ci changes

@ebenahar
Contributor

ebenahar commented Jan 4, 2023

@yitzhak12 could you please take a look?

@yitzhak12
Contributor

Can you send me a link to the relevant test failure?

@Sravikaz
Contributor

Sravikaz commented Jan 4, 2023

@yitzhak12 : Ocs-ci keeps checking for the older pod name after the replacement of the pod. Attaching the log of the test case.
tier4b_noobaa_sts_host_node_failure_noobaa-db-pg-True.zip

@yitzhak12
Contributor

In the error you described, we don't find the pod noobaa_operator_pod. From what I see in the test, we stopped the node associated with the noobaa_operator_pod, so it may be expected that the pod gets deleted.
Let me know if that sounds reasonable - then we need to change the test accordingly.

@Sravikaz
Contributor

Sravikaz commented Jan 4, 2023

@yitzhak12 : yes, the test stopped the worker node, and the noobaa operator pod (noobaa-operator-f9f48cbff-lpn6f) got drained and got a new name after replacement. However, the ocs-ci test keeps checking for the older pod name (noobaa-operator-f9f48cbff-lpn6f).

10:47:19 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod  -n openshift-storage --selector=noobaa-db=postgres -o yaml
10:47:19 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod noobaa-db-pg-0 -n openshift-storage -o yaml
10:47:19 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:20 - MainThread - tests.manage.mcg.test_host_node_failure - INFO  - noobaa-db-pg-0 is running on worker-1.ocpm4202001.lnxero1.boe
10:47:20 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod  -n openshift-storage --selector=noobaa-operator=deployment -o yaml
10:47:20 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod noobaa-operator-f9f48cbff-lpn6f -n openshift-storage -o yaml
10:47:20 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:21 - MainThread - tests.manage.mcg.test_host_node_failure - INFO  - noobaa-operator-f9f48cbff-lpn6f is running on worker-1.ocpm4202001.lnxero1.boe
10:47:21 - MainThread - tests.manage.mcg.test_host_node_failure - INFO  - noobaa-db-pg-0 and noobaa-operator-f9f48cbff-lpn6f are running on same node.
10:47:21 - MainThread - tests.manage.mcg.test_host_node_failure - INFO  - Stopping worker-1.ocpm4202001.lnxero1.boe where noobaa-db-pg-0 is hosted
10:47:21 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@127.0.0.1 ssh core@172.23.233.107 sudo systemctl stop kubelet.service -f
10:47:21 - MainThread - ocs_ci.ocs.node - INFO  - Waiting for nodes ['worker-1.ocpm4202001.lnxero1.boe'] to reach status NotReady
10:47:21 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:21 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:21 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:22 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:22 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - WARNING  - Command stderr: Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts.

10:47:22 - ThreadPoolExecutor-3_0 - ocs_ci.utility.service - INFO  - Result of shutdown CompletedProcess(args=['ssh', '-o', 'StrictHostKeyChecking=no', '-o', 'UserKnownHostsFile=/dev/null', 'root@127.0.0.1', 'ssh', 'core@172.23.233.107', 'sudo', 'systemctl', 'stop', 'kubelet.service', '-f'], returncode=0, stdout=b'', stderr=b"Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts.\r\n"). Checking if service kubelet went down.
10:47:22 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@127.0.0.1 ssh core@172.23.233.107 sudo systemctl is-active kubelet.service
10:47:23 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - WARNING  - Command stderr: Warning: Permanently added '127.0.0.1' (ECDSA) to the list of known hosts.

10:47:23 - ThreadPoolExecutor-3_0 - ocs_ci.utility.service - INFO  - Action succeeded.
10:47:23 - ThreadPoolExecutor-3_0 - ocs_ci.ocs.node - INFO  - Waiting for nodes ['worker-1.ocpm4202001.lnxero1.boe'] to reach status NotReady
10:47:23 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:24 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:24 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:24 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:25 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:25 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:25 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:26 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:27 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:27 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:28 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:28 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:29 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:29 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:29 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:30 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:31 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:31 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:32 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:32 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:33 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:34 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:34 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:34 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:35 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:35 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:35 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:36 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:37 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:38 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:38 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:38 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:39 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:39 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:39 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:40 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:41 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:42 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:42 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:42 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:43 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:43 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:43 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:44 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:45 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:46 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:46 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:46 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:47 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:48 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:48 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:48 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:49 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:50 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:50 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:50 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:51 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:52 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:52 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:52 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:53 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:54 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:54 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:54 - MainThread - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:55 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:56 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:56 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:56 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Going to sleep for 3 seconds before next iteration
10:47:57 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:58 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:47:58 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
10:47:59 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get node  -o yaml
10:47:59 - MainThread - ocs_ci.ocs.node - INFO  - Node worker-1.ocpm4202001.lnxero1.boe reached status NotReady
10:47:59 - MainThread - ocs_ci.ocs.node - INFO  - The following nodes reached status NotReady: ['worker-1.ocpm4202001.lnxero1.boe']
10:47:59 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage delete Pod noobaa-operator-f9f48cbff-lpn6f --grace-period=0 --force
10:48:00 - MainThread - ocs_ci.utility.utils - WARNING  - Command stderr: Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (NotFound): pods "noobaa-operator-f9f48cbff-lpn6f" not found

10:48:00 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node worker-1.ocpm4202001.lnxero1.boe
10:48:00 - ThreadPoolExecutor-3_0 - ocs_ci.utility.utils - INFO  - Executing command: oc get Node  -o yaml
FAILED

@yitzhak12
Contributor

Okay, I see.

@AaruniAggarwal
Contributor

We are also still facing this issue on IBM Power as well.

@yitzhak12
Contributor

yitzhak12 commented Jan 5, 2023

@Sravikaz, in the logs you showed me above, it waits for the worker node to reach NotReady status - which was successful. Then we have the problematic error when trying to delete the old pod noobaa-operator-f9f48cbff-lpn6f:

10:47:59 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage delete Pod noobaa-operator-f9f48cbff-lpn6f --grace-period=0 --force
10:48:00 - MainThread - ocs_ci.utility.utils - WARNING - Command stderr: Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (NotFound): pods "noobaa-operator-f9f48cbff-lpn6f" not found

This error is different from the one described in the first comment #6689 (comment). Either way, we need to get the new noobaa operator pod, or ignore the error when we don't find the pod (because the pod has already been deleted).

@yitzhak12
Contributor

@AaruniAggarwal, Can you share the test logs? Which test is failing?

@Sravikaz
Contributor

Sravikaz commented Jan 5, 2023

@yitzhak12 : The deletion of the pod is successful, and the new noobaa pod is back in the Running state during the test execution. However, the ocs-ci test still fails, as it keeps checking for the older pod name.

# oc get po -n openshift-storage
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-addons-controller-manager-657c5f6655-5br5x                    2/2     Running     0             24h
csi-cephfsplugin-47sbv                                            2/2     Running     0             23h
csi-cephfsplugin-b5z2l                                            2/2     Running     0             23h
csi-cephfsplugin-provisioner-784cb55c5c-7lzpc                     5/5     Running     0             24h
csi-cephfsplugin-provisioner-784cb55c5c-nhg72                     5/5     Running     0             24h
csi-cephfsplugin-vfz4d                                            2/2     Running     0             24h
csi-rbdplugin-2nmnv                                               3/3     Running     0             24h
csi-rbdplugin-596x2                                               3/3     Running     0             23h
csi-rbdplugin-jqcxq                                               3/3     Running     0             23h
csi-rbdplugin-provisioner-5f4cb9d4fb-9qnp8                        6/6     Running     0             24h
csi-rbdplugin-provisioner-5f4cb9d4fb-trzcz                        6/6     Running     0             24h
noobaa-core-0                                                     1/1     Running     0             24h
noobaa-db-pg-0                                                    1/1     Running     0             24h
noobaa-endpoint-6cc7fcb54d-7mvd4                                  1/1     Running     0             24h
noobaa-operator-7b865c8f5-s887v                                   1/1     Running     1 (18h ago)   24h
ocs-metrics-exporter-74c99bf9db-k79ld                             1/1     Running     0             24h
ocs-operator-8646fc67cf-c82tf                                     1/1     Running     0             24h
odf-console-7cd5c55456-mh88g                                      1/1     Running     0             24h
odf-operator-controller-manager-66cc84c8b8-xdf7l                  2/2     Running     0             24h
rook-ceph-crashcollector-worker-0.ocpm4202001.lnxero1.boe-xvrxm   1/1     Running     0             23h
rook-ceph-crashcollector-worker-1.ocpm4202001.lnxero1.boe-x7w49   1/1     Running     0             23h
rook-ceph-crashcollector-worker-2.ocpm4202001.lnxero1.boe-whwjm   1/1     Running     0             23h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-54d5fb774nhqp   2/2     Running     0             23h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6c7564bf99fmh   2/2     Running     0             23h
rook-ceph-mgr-a-6dc79f85b7-vp7fs                                  2/2     Running     0             23h
rook-ceph-mon-a-675fcd65f9-ln5bn                                  2/2     Running     0             23h
rook-ceph-mon-b-f57f596b7-4wmmh                                   2/2     Running     0             23h
rook-ceph-mon-c-858f6699f5-46mlp                                  2/2     Running     0             23h
rook-ceph-operator-75c7d856b9-t5h9d                               1/1     Running     0             24h
rook-ceph-osd-0-67fcddd648-qswwn                                  2/2     Running     0             23h
rook-ceph-osd-1-59c8d697b6-559bc                                  2/2     Running     0             23h
rook-ceph-osd-2-8666477d9b-b9gjk                                  2/2     Running     0             23h
rook-ceph-osd-prepare-747267076e4decf7c7088a8293a396c3-c5d52      0/1     Completed   0             43h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-7cf5858qvw7m   2/2     Running     0             23h
rook-ceph-tools-75b8549fb9-tz48t                                  1/1     Running     0             24h

@yitzhak12
Contributor

Yes, so we need to create a new PR and fix the line noobaa_operator_pod.delete(force=True) where we delete the pod, so that it checks whether the pod has already been deleted. This should fix the error you described, unless I missed something...
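As a rough sketch of that change (not the merged PR; the exception handling shown here is an assumption), the deletion could simply tolerate the pod already being gone:

from ocs_ci.ocs.exceptions import CommandFailed

try:
    noobaa_operator_pod.delete(force=True)
except CommandFailed as ex:
    # "Error from server (NotFound)" means the pod was already removed
    # after the node failure, which is acceptable here.
    if "NotFound" not in str(ex):
        raise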

@AaruniAggarwal
Contributor

AaruniAggarwal commented Jan 5, 2023

@AaruniAggarwal, Can you share the test logs? Which test is failing?

I ran the following test case and it is failing for a similar reason to the one Sravika mentioned, i.e. the pods being checked are in the Terminating state and the replacement pods are already available:

tests/manage/z_cluster/nodes/test_check_pods_status_after_node_failure.py::TestCheckPodsAfterNodeFailure::test_check_pods_status_after_node_failure

@yitzhak12
Contributor

Okay. But I need to see the logs - the root cause may be different. BTW do you mean the test test_check_pods_status_after_node_failure or the test Sravikaz mentioned test_noobaa_sts_host_node_failure?

@AaruniAggarwal
Contributor

Both the following test cases are failing:

  • tests/manage/z_cluster/nodes/test_check_pods_status_after_node_failure.py::TestCheckPodsAfterNodeFailure::test_check_pods_status_after_node_failure

  • tests/manage/mcg/test_host_node_failure.py::TestNoobaaSTSHostNodeFailure::test_noobaa_sts_host_node_failure[noobaa-db-pg-True]

logfile:
test-check-pod-status-after-node-failure.log

test-noobaa-sts-host-node-failure-true.log

@yitzhak12
Contributor

Thanks @AaruniAggarwal for sharing the logs.
In the second log, of the test test_noobaa_sts_host_node_failure, I see the same issue that @Sravikaz mentioned. I will raise a new PR soon to fix it.

In the first log, of the test test_check_pods_status_after_node_failure, I see that the error occurs because, after shutting down the worker node, some of the pods were still in a Terminating state. This is a product bug related to Ceph. From what I know, none of the pods should be stuck in a Terminating state - instead, they should be removed from the cluster until a new worker node comes up. @am-agrawa @ebenahar @keesturam @prsurve are you familiar with this issue?

@am-agrawa
Contributor

test_check_pods_status_after_node_failure

My understanding is that the existing pods on that node will go to the Terminating state, then get deleted and come up on a new node (to maintain the replica count), provided the node can handle the CPU/memory, etc. needed to schedule those Pods.

  1. Is the timeout used to check whether the Pod gets deleted after reaching the Terminating state sufficient? Sometimes it takes longer than anticipated.
  2. If the answer is yes, then I would expect the new pods to get scheduled on another node and be in the Running/Completed state. Otherwise it's a bug.

And I am not aware of any open bugs for this case.

@yitzhak12
Contributor

Yes, the timeout is 600 seconds - it should be enough. I tested it with vSphere 4.12, and it passed.
Here the test ran with OCS4-12-Downstream-OCP4-12-POWERVS-UPI-1AZ-RHCOS-LSO-3M-3W-tier4a. Maybe it behaves differently with that configuration?

@yitzhak12
Contributor

In both tests, there are only two worker nodes after shutting down one worker node. But here, the pods are stuck in a Terminating state. And with vSphere, the pods have been removed and new pods come up in a Pending state (as expected).

@am-agrawa
Contributor

In both tests, there are only two worker nodes after shutting down one worker node. But here, the pods are stuck in a Terminating state. And with vSphere, the pods have been removed and new pods come up in a Pending state (as expected).

Since it works on vSphere, it certainly has something to do with the platform it's failing on, which is IBM Z.

@yitzhak12
Contributor

Oh, okay. Do you think we should raise a bug about the problem?

@am-agrawa
Contributor

Let's open one and see what dev has to say, but I doubt it's a bug.

@yitzhak12
Contributor

If it's the expected behavior with IBM Z, we can change the condition in the test accordingly.

@am-agrawa
Contributor

am-agrawa commented Jan 9, 2023

Not sure why it's behaving differently on IBM Z, let's go with a bug for now.

@ebenahar ebenahar reopened this Jan 9, 2023
@AaruniAggarwal
Contributor

AaruniAggarwal commented Jan 9, 2023

I re-ran the test test_check_pods_status_after_node_failure and, while the test was running, I monitored the nodes and pods in the openshift-storage namespace.

(venv) [root@rdr-site-lon06-bastion-0 ocs-ci]# oc get nodes
NAME                              STATUS     ROLES                  AGE    VERSION
lon06-master-0.rdr-site.ibm.com   Ready      control-plane,master   3d3h   v1.25.4+77bec7a
lon06-master-1.rdr-site.ibm.com   Ready      control-plane,master   3d3h   v1.25.4+77bec7a
lon06-master-2.rdr-site.ibm.com   Ready      control-plane,master   3d3h   v1.25.4+77bec7a
lon06-worker-0.rdr-site.ibm.com   Ready      worker                 3d3h   v1.25.4+77bec7a
lon06-worker-1.rdr-site.ibm.com   NotReady   worker                 3d3h   v1.25.4+77bec7a
lon06-worker-2.rdr-site.ibm.com   Ready      worker                 3d3h   v1.25.4+77bec7a

(venv) [root@rdr-site-lon06-bastion-0 ocs-ci]# oc get pods -o wide
NAME                                                              READY   STATUS        RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
csi-addons-controller-manager-ff9cbdb85-cv7cz                     2/2     Running       0          8h      10.129.2.172    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
csi-cephfsplugin-48hrj                                            2/2     Running       2          2d23h   192.168.0.170   lon06-worker-0.rdr-site.ibm.com   <none>           <none>
csi-cephfsplugin-f24vb                                            2/2     Running       2          2d23h   192.168.0.26    lon06-worker-1.rdr-site.ibm.com   <none>           <none>
csi-cephfsplugin-nm5c5                                            2/2     Running       0          2d23h   192.168.0.51    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-8547ff775c-74n7s                     5/5     Running       0          6h27m   10.128.2.133    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-8547ff775c-kcxnv                     5/5     Running       0          9h      10.129.2.165    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
csi-rbdplugin-bvvs8                                               3/3     Running       0          2d23h   192.168.0.51    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-56ddf95444-6g6h2                        6/6     Running       0          9h      10.129.2.164    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-56ddf95444-kxddk                        6/6     Running       0          6h27m   10.128.2.132    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
csi-rbdplugin-qt85t                                               3/3     Running       3          2d23h   192.168.0.170   lon06-worker-0.rdr-site.ibm.com   <none>           <none>
csi-rbdplugin-v5trq                                               3/3     Running       3          2d23h   192.168.0.26    lon06-worker-1.rdr-site.ibm.com   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          6h32m   10.128.2.123    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
noobaa-db-pg-0                                                    1/1     Running       0          6h32m   10.128.2.121    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
noobaa-endpoint-7c8d5679c4-dgtcc                                  1/1     Running       0          6h32m   10.128.2.122    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
noobaa-operator-698f8b864-8p92p                                   1/1     Running       0          6h32m   10.128.2.120    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
ocs-metrics-exporter-6cb87d944f-cdmsk                             1/1     Running       0          8h      10.129.2.169    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
ocs-operator-697df75746-p4wz6                                     1/1     Running       0          6h27m   10.128.2.138    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
odf-console-6874b779dc-gjdcx                                      1/1     Running       0          6h27m   10.128.2.131    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
odf-operator-controller-manager-549999b555-qh69q                  2/2     Running       0          6h27m   10.128.2.128    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-crashcollector-lon06-worker-0.rdr-site.ibm.com-6dxl89   1/1     Running       0          6h32m   10.128.2.126    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-crashcollector-lon06-worker-1.rdr-site.ibm.com-69k7dh   0/1     Pending       0          9m34s   <none>          <none>                            <none>           <none>
rook-ceph-crashcollector-lon06-worker-1.rdr-site.ibm.com-6qnx4p   1/1     Terminating   0          6h22m   10.131.0.28     lon06-worker-1.rdr-site.ibm.com   <none>           <none>
rook-ceph-crashcollector-lon06-worker-2.rdr-site.ibm.com-5wr6df   1/1     Running       0          9h      10.129.2.160    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58758574jx9zm   2/2     Running       0          9h      10.129.2.157    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dc65cdbfnkk2   2/2     Running       0          6h32m   10.128.2.125    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-mgr-a-55654896fb-h99wm                                  2/2     Running       0          9h      10.129.2.159    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-mon-a-856f7d5784-65crq                                  2/2     Running       0          10h     10.129.2.152    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-mon-c-6b97765555-pwcv7                                  0/2     Pending       0          9m34s   <none>          <none>                            <none>           <none>
rook-ceph-mon-c-6b97765555-rq646                                  2/2     Terminating   0          6h27m   10.131.0.29     lon06-worker-1.rdr-site.ibm.com   <none>           <none>
rook-ceph-mon-d-db9dfc74f-qxpv6                                   2/2     Running       0          6h52m   10.128.2.116    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-operator-65c7df8664-f9htk                               1/1     Running       0          6h27m   10.128.2.137    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-osd-0-579764f59-nmpbm                                   0/2     Pending       0          9m34s   <none>          <none>                            <none>           <none>
rook-ceph-osd-0-579764f59-phqqk                                   2/2     Terminating   0          6h27m   10.131.0.24     lon06-worker-1.rdr-site.ibm.com   <none>           <none>
rook-ceph-osd-1-59868c476c-fmt5b                                  2/2     Running       0          6h52m   10.128.2.113    lon06-worker-0.rdr-site.ibm.com   <none>           <none>
rook-ceph-osd-2-5c94d44ffc-r7zrt                                  2/2     Running       0          10h     10.129.2.145    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5cf748dtx4c6   2/2     Running       0          9h      10.129.2.158    lon06-worker-2.rdr-site.ibm.com   <none>           <none>
rook-ceph-tools-8fb899c6d-lnw9n                                   1/1     Running       0          6h27m   10.128.2.127    lon06-worker-0.rdr-site.ibm.com   <none>           <none>

New pods got created and are in the Pending state, which is expected as the node (worker-1) is still in the NotReady state.

Not sure why the test case checked the pod in the Terminating state. Found the following in the log:

11:55:34 - MainThread - ocs_ci.ocs.resources.pod - ERROR  - The pod rook-ceph-crashcollector-lon06-worker-1.rdr-site.ibm.com-6qnx4p is in Terminating state. Expected = Running
11:55:34 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-crashcollector-lon06-worker-2.rdr-site.ibm.com-5wr6df -n openshift-storage
11:55:34 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-crashcollector-lon06-worker-2.rdr-site.ibm.com-5wr6df -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58758574jx9zm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58758574jx9zm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dc65cdbfnkk2 -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dc65cdbfnkk2 -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mgr-a-55654896fb-h99wm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mgr-a-55654896fb-h99wm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-a-856f7d5784-65crq -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-a-856f7d5784-65crq -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-c-6b97765555-rq646 -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-c-6b97765555-rq646 -n openshift-storage
11:55:36 - MainThread - ocs_ci.ocs.resources.pod - ERROR  - The pod rook-ceph-mon-c-6b97765555-rq646 is in Terminating state. Expected = Running
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-d-db9dfc74f-qxpv6 -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-d-db9dfc74f-qxpv6 -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-operator-65c7df8664-f9htk -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-operator-65c7df8664-f9htk -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-0-579764f59-phqqk -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-0-579764f59-phqqk -n openshift-storage
11:55:36 - MainThread - ocs_ci.ocs.resources.pod - ERROR  - The pod rook-ceph-osd-0-579764f59-phqqk is in Terminating state. Expected = Running
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-1-59868c476c-fmt5b -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-1-59868c476c-fmt5b -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-2-5c94d44ffc-r7zrt -n openshift-storage
11:55:37 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-2-5c94d44ffc-r7zrt -n openshift-storage
11:55:37 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5cf748dtx4c6 -n openshift-storage
11:55:37 - MainThread - ocs_ci.utility.utils - INFO  - Executing command: oc -n openshift-storage get Pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5cf748dtx4c6 -n openshift-storage
11:55:37 - MainThread - ocs_ci.ocs.resources.pod - WARNING  - Not all the pods reached status running after 600 seconds
11:55:37 - MainThread - ocs_ci.ocs.utils - INFO  - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.12 will be used.

I tried running the test case after increasing the timeout from 600 to 900, but it didn't help either.

@am-agrawa
Contributor


I guess there are a lot of things happening here:

  1. If a node goes down, the pods running on that node should be re-created, but the older pods should get deleted once they move into the Terminating state. That is not happening at the moment, so it could be a bug.
  2. Assuming the node is untainted (saying this without looking at the code line by line), the new Pods are expected to be created on the same node, worker-1 in this case, but worker-1 is not ready, so the Pods will be stuck in Pending, which is expected.
    But the code is expecting the same old pod to reach the Running state, which is absolutely incorrect. A pod, once re-created, will have a new suffix, new IP, etc., so the test will never pass, as the condition will never be met.
    We should also fix the code to check the status of the newly created pods on that node, and not the older ones, under the condition that the node has already reached the Ready state, and not before that (see the sketch after this list).
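A rough sketch of that idea (hypothetical, not the actual test code; the label selector is taken from the logs earlier in this thread and the helper usage is an assumption): look up the replacement pod by its label selector once the node has settled, instead of tracking it by the old name.

from ocs_ci.helpers.helpers import wait_for_resource_state
from ocs_ci.ocs import constants
from ocs_ci.ocs.resources.pod import Pod, get_pods_having_label

# Find the replacement pod by its deployment label rather than the old name.
pod_dicts = get_pods_having_label(
    label="noobaa-operator=deployment",  # selector seen in the logs above
    namespace="openshift-storage",
)
new_operator_pod = Pod(**pod_dicts[0])

# Wait for the newly created pod (new suffix, new IP) to reach Running.
wait_for_resource_state(new_operator_pod, constants.STATUS_RUNNING, timeout=600)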

11:55:34 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-crashcollector-lon06-worker-2.rdr-site.ibm.com-5wr6df -n openshift-storage
11:55:34 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-crashcollector-lon06-worker-2.rdr-site.ibm.com-5wr6df -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58758574jx9zm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-58758574jx9zm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dc65cdbfnkk2 -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7dc65cdbfnkk2 -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mgr-a-55654896fb-h99wm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mgr-a-55654896fb-h99wm -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-a-856f7d5784-65crq -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-a-856f7d5784-65crq -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-c-6b97765555-rq646 -n openshift-storage
11:55:35 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-c-6b97765555-rq646 -n openshift-storage
11:55:36 - MainThread - ocs_ci.ocs.resources.pod - ERROR - The pod rook-ceph-mon-c-6b97765555-rq646 is in Terminating state. Expected = Running

Same here

11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-d-db9dfc74f-qxpv6 -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-mon-d-db9dfc74f-qxpv6 -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-operator-65c7df8664-f9htk -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-operator-65c7df8664-f9htk -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-0-579764f59-phqqk -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-0-579764f59-phqqk -n openshift-storage
11:55:36 - MainThread - ocs_ci.ocs.resources.pod - ERROR - The pod rook-ceph-osd-0-579764f59-phqqk is in Terminating state. Expected = Running

Same here

11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-1-59868c476c-fmt5b -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-1-59868c476c-fmt5b -n openshift-storage
11:55:36 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-2-5c94d44ffc-r7zrt -n openshift-storage
11:55:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-osd-2-5c94d44ffc-r7zrt -n openshift-storage
11:55:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5cf748dtx4c6 -n openshift-storage
11:55:37 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage get Pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-5cf748dtx4c6 -n openshift-storage
11:55:37 - MainThread - ocs_ci.ocs.resources.pod - WARNING - Not all the pods reached status running after 600 seconds
11:55:37 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.12 will be used.


I tried running the test case after increasing the timeout from 600 to 900, but it didn't help either.

Increasing the timeout would not help. But I am wondering how it is passing on VMware. @AaruniAggarwal could you please run it on VMware or any other platform, debug it in the same way, and share your observations?
I am totally confused for now.

@AaruniAggarwal
Contributor

Aman, I don't have access to any other platform, so I can't run this test anywhere else. I believe this test is also failing on IBM Z, as Abdul created this issue and works on IBM Z (s390x); he also mentioned that the test case is looking for pods that are in the Terminating state.

@yitzhak12
Contributor

But the code is expecting the same old pod to reach the Running state, which is absolutely incorrect. A pod, once re-created, will have a new suffix, new IP, etc., so the test will never pass, as the condition will never be met.

It is not expecting the pods to be Running. It expects that the old pods will either be removed or be in a Running state. The test already passed with vSphere 4.12, as you can see here, for example: https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/19374/consoleFull

@am-agrawa
Contributor

But the code is expecting the same old pod to reach the Running state, which is absolutely incorrect. A pod, once re-created, will have a new suffix, new IP, etc., so the test will never pass, as the condition will never be met.

It is not expecting the pods to be Running. It expects that the old pods will either be removed or be in a Running state.

I checked the logs of the VMware run, and it looks fine to me. But a log message like "Expected = Running" when the pod is in Terminating is what confused me. Do you think we should change it?

The test already passed with vSphere 4.12, as you can see here for example https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/19374/consoleFull

All the checks in this run look good to me, so now it's clear that it has something to do with IBM Z. It could be a platform-specific bug, or some code fix may be needed on our end (but I don't know what).

@yitzhak12
Contributor

Yes, it could be clearer. Maybe we need to add a comment or a log message.
Anyway, the function wait_for_pods_to_be_running(pod_names=rook_ceph_pod_names_not_in_node, timeout=timeout, sleep=30) is a "safe" function - it checks that the pods are Running but ignores the case of a pod-not-found error. So if the pods have been removed, that is also fine.
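Roughly, the "safe" behavior described above amounts to something like the following illustrative sketch (not the actual wait_for_pods_to_be_running implementation; the helper name pods_running_or_gone is made up):

from ocs_ci.ocs import constants
from ocs_ci.ocs.exceptions import CommandFailed
from ocs_ci.ocs.ocp import OCP


def pods_running_or_gone(pod_names, namespace="openshift-storage"):
    """Return True if every pod in pod_names is Running or already deleted."""
    pod_ocp = OCP(kind=constants.POD, namespace=namespace)
    for name in pod_names:
        try:
            # Plain "oc get Pod <name>" output includes the STATUS column.
            output = pod_ocp.get(resource_name=name, out_yaml_format=False)
        except CommandFailed as ex:
            if "not found" in str(ex).lower():
                # The old pod has already been removed - treat that as OK.
                continue
            raise
        if constants.STATUS_RUNNING not in output:
            return False
    return True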

@yitzhak12
Contributor

I raised a new bug about the issue: https://bugzilla.redhat.com/show_bug.cgi?id=2159757.
@AaruniAggarwal Can you send me the relevant versions (OCP, ODF, cluster version, Ceph, and rook)? Or you can add them in a comment to the BZ I raised.

@AaruniAggarwal
Contributor

Apologies, I missed the comment. I have added relevant versions in the BZ itself.
Thanks, Itzhak.

@yitzhak12
Contributor

@AaruniAggarwal @Sravikaz the PR #6912 is merged. So now the issue with the test test_noobaa_sts_host_node_failure should be resolved (we may need to use the master branch, not stable).
Let me know if it works fine.

@AaruniAggarwal
Contributor

AaruniAggarwal commented Jan 18, 2023

After incorporating the changes, I executed the test test_noobaa_sts_host_node_failure and it failed again. Attaching the log file:
test-noobaa-sts-host-node-failure-true-1.log

Status of nodes and pods while executing the test case:

(venv) [root@rdr-abhi-syd05-bastion-0 ocs-ci]# oc get nodes
NAME                              STATUS     ROLES                  AGE    VERSION
syd05-master-0.rdr-abhi.ibm.com   Ready      control-plane,master   5d9h   v1.25.4+77bec7a
syd05-master-1.rdr-abhi.ibm.com   Ready      control-plane,master   5d9h   v1.25.4+77bec7a
syd05-master-2.rdr-abhi.ibm.com   Ready      control-plane,master   5d9h   v1.25.4+77bec7a
syd05-worker-0.rdr-abhi.ibm.com   Ready      worker                 5d8h   v1.25.4+77bec7a
syd05-worker-1.rdr-abhi.ibm.com   NotReady   worker                 5d8h   v1.25.4+77bec7a
syd05-worker-2.rdr-abhi.ibm.com   Ready      worker                 5d8h   v1.25.4+77bec7a

(venv) [root@rdr-abhi-syd05-bastion-0 ocs-ci]# oc get pods -n openshift-storage -o wide
NAME                                                              READY   STATUS        RESTARTS   AGE     IP              NODE                              NOMINATED NODE   READINESS GATES
csi-addons-controller-manager-6d97b7f5c5-x2rmg                    2/2     Running       0          2d8h    10.129.2.36     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
csi-cephfsplugin-d9f57                                            2/2     Running       0          2d8h    192.168.0.20    syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
csi-cephfsplugin-ljvpj                                            2/2     Running       0          2d8h    192.168.0.167   syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
csi-cephfsplugin-nwl7m                                            2/2     Running       0          2d8h    192.168.0.80    syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-669c4bb954-cmvxd                     5/5     Running       0          3m36s   10.129.2.56     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-669c4bb954-f45lq                     5/5     Running       0          2d8h    10.131.0.97     syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
csi-cephfsplugin-provisioner-669c4bb954-zvmtj                     5/5     Terminating   0          2d8h    10.128.2.39     syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
csi-rbdplugin-6nxtz                                               3/3     Running       0          2d8h    192.168.0.20    syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
csi-rbdplugin-ddmpl                                               3/3     Running       0          2d8h    192.168.0.80    syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-697f7df5df-22gd7                        6/6     Running       0          3m36s   10.131.1.102    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-697f7df5df-6c6lt                        6/6     Running       0          2d8h    10.129.2.37     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
csi-rbdplugin-provisioner-697f7df5df-n7vgp                        6/6     Terminating   0          2d8h    10.128.2.38     syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
csi-rbdplugin-s9fdz                                               3/3     Running       0          2d8h    192.168.0.167   syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
noobaa-core-0                                                     1/1     Running       0          8m42s   10.129.2.54     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
noobaa-db-pg-0                                                    0/1     Init:0/2      0          8m42s   <none>          syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
noobaa-endpoint-55bcb989f5-2zklx                                  1/1     Running       0          8m42s   10.131.1.99     syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
noobaa-operator-7dbbd8d898-kqw57                                  1/1     Running       0          8m40s   10.131.1.100    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
ocs-metrics-exporter-7db4b969f8-4ph4l                             1/1     Running       0          2d8h    10.131.0.96     syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
ocs-operator-595585fbf8-m7vn6                                     1/1     Running       0          2d8h    10.131.0.95     syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
odf-console-74687584d9-rlwwc                                      1/1     Running       0          2d8h    10.129.2.33     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
odf-operator-controller-manager-5c4bfdbfbc-mpkbc                  2/2     Running       0          2d8h    10.129.2.32     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-crashcollector-syd05-worker-0.rdr-abhi.ibm.com-5zgh4h   1/1     Running       0          2d8h    10.131.0.105    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-crashcollector-syd05-worker-1.rdr-abhi.ibm.com-bbbg28   1/1     Terminating   0          2d8h    10.128.2.42     syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-crashcollector-syd05-worker-1.rdr-abhi.ibm.com-bvjtrl   0/1     Pending       0          3m36s   <none>          <none>                            <none>           <none>
rook-ceph-crashcollector-syd05-worker-2.rdr-abhi.ibm.com-5mgg4q   1/1     Running       0          2d8h    10.129.2.48     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-56c48c57jmzxk   2/2     Running       0          2d8h    10.129.2.47     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-5d454fdc7hts7   2/2     Running       0          2d8h    10.131.0.104    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-mgr-a-74f9d4675f-9kqcj                                  2/2     Running       0          19h     10.131.1.15     syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-mon-a-7ff4cb9994-p45vp                                  0/2     Pending       0          3m36s   <none>          <none>                            <none>           <none>
rook-ceph-mon-a-7ff4cb9994-rrbdq                                  2/2     Terminating   0          19h     10.128.2.210    syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-mon-b-584c786ff8-zjqr8                                  2/2     Running       0          19h     10.131.1.16     syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-mon-c-7d9b7d85cc-q49m6                                  2/2     Running       0          2d8h    10.129.2.39     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-operator-5cfcc7b6c6-5smsp                               1/1     Terminating   0          2d8h    10.128.2.37     syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-operator-5cfcc7b6c6-vhnfl                               1/1     Running       0          3m36s   10.131.1.104    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-osd-0-56948d4698-6dpcv                                  2/2     Running       0          2d8h    10.129.2.43     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-osd-1-7595b76dc7-kf7kz                                  2/2     Running       0          2d8h    10.131.0.103    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-osd-2-7fb49bfb9-2dt8s                                   0/2     Pending       0          3m36s   <none>          <none>                            <none>           <none>
rook-ceph-osd-2-7fb49bfb9-l9tcq                                   2/2     Terminating   0          2d8h    10.128.2.44     syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-osd-prepare-8d657d9e4bd24036dece59a32a55dde5-wnfq7      0/1     Completed     0          2d8h    10.129.2.42     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-osd-prepare-98b176343a88a00f79a4c108c624e0e3-947tf      0/1     Completed     0          2d8h    10.131.0.102    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-856f7ffbwgkh   2/2     Running       0          2d8h    10.129.2.49     syd05-worker-2.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-tools-56646999f5-jnppj                                  1/1     Terminating   0          2d8h    10.128.2.45     syd05-worker-1.rdr-abhi.ibm.com   <none>           <none>
rook-ceph-tools-56646999f5-xfvkd                                  1/1     Running       0          3m36s   10.131.1.101    syd05-worker-0.rdr-abhi.ibm.com   <none>           <none>

Once the test case finished and the node reached the Ready state, all the pods came into the Running state.

[root@rdr-abhi-syd05-bastion-0 ~]# oc get pods -n openshift-storage |grep noobaa
NAME                                                              READY   STATUS      RESTARTS   AGE
noobaa-core-0                                                     1/1     Running     0          64m
noobaa-db-pg-0                                                    1/1     Running     0          64m
noobaa-endpoint-55bcb989f5-2zklx                                  1/1     Running     0          64m
noobaa-operator-7dbbd8d898-kqw57                                  1/1     Running     0          64m
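
For reference, a minimal sketch (again using the kubernetes Python client with hypothetical names, not ocs-ci code) of waiting for the node's Ready condition before re-checking the pods, matching the recovery described above:

import time
from kubernetes import client, config

def wait_for_node_ready(node_name, timeout=900, sleep=30):
    # Poll the node's Ready condition until it reports "True".
    config.load_kube_config()
    v1 = client.CoreV1Api()
    end = time.time() + timeout
    while time.time() < end:
        node = v1.read_node(node_name)
        for cond in node.status.conditions or []:
            if cond.type == "Ready" and cond.status == "True":
                return True
        time.sleep(sleep)
    return False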

@yitzhak12
Copy link
Contributor

Okay, so the previous problem is resolved. That is the good news :)
I have opened a new issue for the problem you described above: #6951.
@AaruniAggarwal @Sravikaz please follow it there.

@yitzhak12
Copy link
Contributor

I am closing this issue, as the original problem has been fixed.
