[IBM Z] : Test test_check_pods_status_after_node_failure failing even though the replacement pods are in Running state
#6689
Comments
|
Also observing the same issue in tier4b test cases |
|
Observed this in IBM Power as well |
|
@Shrivaibavi @OdedViner any idea if this happens also with our test executions on x86? |
|
I don't know this test... @yitzhak12 Do you know this failure? |
|
I think this is a general issue in the function |
|
We are still facing this issue with the latest ocs-ci changes |
|
@yitzhak12 could you please take a look? |
|
Can you send me a link to the relevant test failure? |
|
@yitzhak12 : Ocs-ci keeps checking for the older pod name after the replacement of the pod. Attaching the log of the test case. |
|
In the error you described, we don't find the pod |
|
@yitzhak12 : yes, the test stopped the worker node and the noobaa pod (noobaa-operator-f9f48cbff-lpn6f) got drained and got a new name after replacement. However, the ocs-ci test keeps checking for the older noobaa pod name (noobaa-operator-f9f48cbff-lpn6f) |
|
Okay, I see. |
|
We are still facing this issue on IBM Power as well. |
|
@Sravikaz, in the logs you showed me above, it waits for the worker node to reach NotReady status - which was successful. Then we have the problematic error when trying to delete the old pod.
This error is different from the one described in the first comment #6689 (comment). Anyway, we need to get the new noobaa operator pod or ignore the error when we don't find the pod (because the pod has already been deleted) |
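For illustration, a minimal sketch of that idea using the generic kubernetes Python client, not ocs-ci's actual helpers; the openshift-storage namespace and the noobaa-operator label selector are assumptions:

```python
# Rough sketch, not ocs-ci code: look up the current noobaa-operator pod by
# label selector instead of a cached pod name, and ignore "not found" when
# deleting, since the pod may already have been removed by the drain.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

NAMESPACE = "openshift-storage"                 # assumed ODF namespace
LABEL_SELECTOR = "noobaa-operator=deployment"   # assumed noobaa-operator pod label


def get_noobaa_operator_pods(core_v1):
    """Return the live noobaa-operator pods, resolved by label, not by name."""
    return core_v1.list_namespaced_pod(
        NAMESPACE, label_selector=LABEL_SELECTOR
    ).items


def delete_pod_ignore_missing(core_v1, pod_name):
    """Delete a pod, ignoring the error if it has already been removed."""
    try:
        core_v1.delete_namespaced_pod(pod_name, NAMESPACE)
    except ApiException as exc:
        if exc.status != 404:
            raise  # only "not found" is acceptable here


if __name__ == "__main__":
    config.load_kube_config()
    v1 = client.CoreV1Api()
    for pod in get_noobaa_operator_pods(v1):
        print(pod.metadata.name, pod.status.phase)
```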
|
@AaruniAggarwal, Can you share the test logs? Which test is failing? |
|
@yitzhak12 : The deletion of the pod is successful and the new noobaa pod is back in the Running state during the test execution. However, the ocs-ci test would still fail, as it keeps checking for the older pod name. |
|
Yes, so we need to create a new PR and fix the line
I ran the following test case and it is failing for a similar reason to the one Sravika mentioned, i.e. the pods being checked are in the Terminating state and replacement pods are already available: tests/manage/z_cluster/nodes/test_check_pods_status_after_node_failure.py::TestCheckPodsAfterNodeFailure::test_check_pods_status_after_node_failure |
|
Okay. But I need to see the logs - the root cause may be different. BTW do you mean the test |
|
Both the following test cases are failing:
|
|
Thanks @AaruniAggarwal for sharing the logs. In the first logs of the test
My understanding is that the existing pods on that node will go to the Terminating state, then get deleted and come up on a new node (to maintain the replica count) if that node can handle the required CPU/memory, etc. needed to schedule those pods.
And I am not aware of any open bugs for this case. |
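As a small illustration of that behaviour, a sketch (plain kubernetes Python client; the namespace and node name are placeholders) that lists the pods scheduled on the failed worker and shows whether they are Terminating, i.e. have a deletion timestamp set while their phase may still read Running:

```python
# Minimal sketch: list the pods scheduled on a given node and show whether
# they are being terminated. A pod in the "Terminating" state has
# metadata.deletion_timestamp set while status.phase can still be Running.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE_NAME = "worker-1"            # placeholder: the failed worker node
NAMESPACE = "openshift-storage"   # placeholder: namespace under test

pods = v1.list_namespaced_pod(
    NAMESPACE, field_selector=f"spec.nodeName={NODE_NAME}"
)
for pod in pods.items:
    terminating = pod.metadata.deletion_timestamp is not None
    print(pod.metadata.name, pod.status.phase, "(Terminating)" if terminating else "")
```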
|
Yes, the timeout is 600 seconds - it should be enough. I tested it with vSphere 4.12, and it passed. |
|
In both tests, there are only two worker nodes after shutting down one worker node. But here, the pods are stuck in a Terminating state. And with vSphere, the pods have been removed, and new pods come up in a Pending state (as expected).
As it's working with vSphere, it certainly has something to do with the platform it's failing on, which is IBM Z. |
|
Oh, okay. Do you think we should raise a bug about the problem? |
|
Let's open one to see what dev has to offer. But I doubt it's a bug. |
|
If it's the expected behavior with IBM Z, we can change the condition in the test accordingly. |
|
Not sure why it's behaving differently on IBM Z; let's go with a bug for now. |
|
I re-ran the test. New pods got created which are in the Pending state, which is expected as the node (worker-1) is still in the NotReady state. Not sure why the test case checked the pod in the Terminating state. Found the following in the log: I tried running the test case after increasing the timeout from 600 to 900, but it didn't help either.
I guess there are a lot of things happening here:
Same here
Same here
Increasing the timeout would not help. But I am wondering how it is passing on VMware. @AaruniAggarwal Could you please run it on VMware or any other platform, debug it in the same way, and share your observations? |
|
Aman, I don't have access to any other platform, so I can't run this test anywhere else. I believe this test is also failing for IBM Z, as Abdul created this issue and he works on IBM Z (s390x); he also mentioned that the test case is looking for the pods which are in the Terminating state.
It is not expecting the pods to be Running. It expects that the old pods will be removed or will be in a Running state. The test already passed with vSphere 4.12, as you can see here for example https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/19374/consoleFull
I checked the logs of the VMware run, and it looks fine to me. But a log message like The test already passed with vSphere 4.12, as you can see here for example https://ocs4-jenkins-csb-odf-qe.apps.ocp-c1.prod.psi.redhat.com/job/qe-deploy-ocs-cluster/19374/consoleFull All the checks in this run look good to me, so now it's clear that it has something to do with IBM Z. It could be a platform-specific bug, or some code fix would be needed at our end (but I don't know what). |
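A rough sketch of the wait condition described above (plain kubernetes client, hypothetical helper name, not the actual ocs-ci implementation): the check succeeds once every old pod is either gone or back in a Running state:

```python
# Hedged sketch, not the ocs-ci implementation: wait until each of the old
# pods is either deleted or Running (and not being terminated).
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def old_pods_removed_or_running(core_v1, namespace, old_pod_names, timeout=600):
    """Return True once every old pod is removed or Running, False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        pending = []
        for name in old_pod_names:
            try:
                pod = core_v1.read_namespaced_pod(name, namespace)
            except ApiException as exc:
                if exc.status == 404:
                    continue  # pod removed - acceptable
                raise
            # Still present: acceptable only if Running and not being deleted
            if pod.status.phase != "Running" or pod.metadata.deletion_timestamp:
                pending.append(name)
        if not pending:
            return True
        time.sleep(10)
    return False


if __name__ == "__main__":
    config.load_kube_config()
    v1 = client.CoreV1Api()
    old_pods = ["noobaa-operator-f9f48cbff-lpn6f"]  # example old pod name from this issue
    print(old_pods_removed_or_running(v1, "openshift-storage", old_pods))
```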
|
Yes, it could be clearer. Maybe we need to add a comment or a log message. |
|
I raised a new bug about the issue: https://bugzilla.redhat.com/show_bug.cgi?id=2159757. |
|
Apologies, I missed the comment. I have added relevant versions in the BZ itself. |
|
@AaruniAggarwal @Sravikaz the PR #6912 is merged. So now the issue with the test |
|
After incorporating the changes, I executed the test. Status of nodes and pods while executing the test case: Once the test case finished and the node reached the Ready state, all the pods came into the Running state. |
|
Okay. So the previous problem is resolved. That is good news :) |
|
I am closing this issue, as the original problem has been fixed. |
Test fails checking the status of pods in the Terminating state even though replacement pods are in the Running state. Is this expected behavior?
Test: tests/manage/z_cluster/nodes/test_check_pods_status_after_node_failure.py::TestCheckPodsAfterNodeFailure::test_check_pods_status_after_node_failure
Error: The pods being checked are in the Terminating state and replacement pods are already available.
Must-gather and full logs: https://drive.google.com/file/d/1z0jgGfWm9Eddku470TUPAbiNLbQfWoun/view?usp=share_link