
Correct CRD operator delete predicate, increase test timeout #8715

Merged
merged 2 commits into strimzi:main from issue-8526-timeouts on Jan 9, 2024

Conversation

@MikeEdgar (Contributor) commented on Jun 21, 2023

Type of change

Select the type of your PR

  • Bugfix
  • Enhancement / new feature
  • Refactoring
  • Documentation

Description

Fixes #8526

I am unable to confirm the fix on macOS/aarch64 myself, however.

  • Correct the CRD operator's delete waitFor predicate to complete when the resource being deleted is missing (a minimal sketch of the idea follows this list).
  • Adjust test mocks to account for the above change.
  • Increase the timeout for a test dealing with updates to a CR following it being deleted. The amount of time for the orphan finalizer to be handled and the resource fully deleted seems to vary significantly between runs. Increasing the timeout to 1 minute appears to be sufficient to allow it to succeed consistently.
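
Below is a minimal, hypothetical sketch of the idea behind the predicate fix (the class, method, and types here are illustrative, not the actual Strimzi CrdOperator code): the wait for a delete is treated as complete as soon as a fresh read of the resource returns null, i.e. the resource is missing, instead of waiting on a condition that can never be observed once the object is gone.

import io.fabric8.kubernetes.api.model.HasMetadata;
import io.vertx.core.Future;
import io.vertx.core.Promise;
import io.vertx.core.Vertx;
import java.util.function.Supplier;

class DeletionWait {
    // Hypothetical helper, not the real operator method: poll until the getter
    // no longer returns the resource, which means the delete has finished.
    static Future<Void> waitForDeletion(Vertx vertx, Supplier<HasMetadata> getter, long pollIntervalMs) {
        Promise<Void> promise = Promise.promise();
        vertx.setPeriodic(pollIntervalMs, timerId -> {
            if (getter.get() == null) {      // missing resource => deletion complete
                vertx.cancelTimer(timerId);
                promise.tryComplete();
            }
        });
        return promise.future();
    }
}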

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@MikeEdgar (Contributor, Author)

@gracegrimwood, would you be willing to try building this PR in your environment to confirm it fixes your issue?

@gracegrimwood

Hi Mike, the good news is that it seems to be failing a bit less! (Might be coincidence, might not, but I'm counting it as a win 😁)
I'm seeing occasional TimeoutExceptions now, rather than the almost constant exceptions I was seeing before. It seems that some of the integration tests in operator-common still time out when deleting resources, and I'm still seeing io.vertx.core.VertxException: Thread blocked warnings, but they're far less frequent. I've also seen failures in cluster-operator when creating Cluster Roles (which results in everything failing from that point on). Some examples of failures seen following your change are below:

TimeoutException on io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT
[INFO] Running io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT
[2023-06-22 13:30:39,981] INFO Client use Namespace: strimzipodset-crd-it-namespace (io.strimzi.test.k8s.KubeClusterResource:82)
[2023-06-22 13:30:41,182] INFO Using cluster: minikube (io.strimzi.test.k8s.cluster.KubeCluster:92)
[2023-06-22 13:30:41,249] WARN Namespace strimzipodset-crd-it-namespace is already created, going to delete it (io.strimzi.operator.common.operator.resource.AbstractCustomResourceOperatorIT:88)
[2023-06-22 13:30:41,289] INFO Creating namespace: strimzipodset-crd-it-namespace (io.strimzi.operator.common.operator.resource.AbstractCustomResourceOperatorIT:93)
[2023-06-22 13:30:41,327] INFO Creating CRD (io.strimzi.operator.common.operator.resource.AbstractCustomResourceOperatorIT:97)
[2023-06-22 13:30:41,327] INFO Creating resources /Users/ggrimwoo/workspace/temp/strimzi-kafka-operator/operator-common/../packaging/install/cluster-operator/042-Crd-strimzipodset.yaml in Namespace strimzipodset-crd-it-namespace (io.strimzi.test.k8s.KubeClusterResource:263)
[2023-06-22 13:30:41,636] INFO Created CRD (io.strimzi.operator.common.operator.resource.AbstractCustomResourceOperatorIT:100)
[2023-06-22 13:30:41,637] INFO Creating resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:207)
[2023-06-22 13:30:41,655] WARN The client is using resource type 'strimzipodsets' with unstable version 'v1beta2' (io.fabric8.kubernetes.client.dsl.internal.VersionUsageUtils:60)
[2023-06-22 13:30:43,713] INFO Saving resource with status change prior to deletion (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:212)
[2023-06-22 13:30:43,725] INFO Deleting resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:214)
[2023-06-22 13:31:11,673] INFO Creating resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:169)
[2023-06-22 13:31:11,705] INFO Updating resource status (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:177)
[2023-06-22 13:31:11,721] INFO Reconciliation #0(test) kind(namespace/name): Status of StrimziPodSet my-test-resource-562149180 in namespace strimzipodset-crd-it-namespace has been updated (io.strimzi.operator.common.operator.resource.CrdOperator:130)
[2023-06-22 13:31:11,727] INFO Deleting resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:186)
[2023-06-22 13:31:16,793] INFO Creating resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:151)
[2023-06-22 13:31:16,804] INFO Deleting resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:155)
[2023-06-22 13:31:17,143] INFO Wait for confirmed deletion (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:219)
[2023-06-22 13:31:17,157] INFO Updating resource with new status - should fail (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:223)
[2023-06-22 13:31:17,824] INFO Creating resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:248)
[2023-06-22 13:31:17,841] INFO Updating resource (mocking an update due to some other reason) (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:255)
[2023-06-22 13:31:17,851] INFO Updating resource status after underlying resource has changed (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:258)
[2023-06-22 13:31:17,854] INFO Failed as expected (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:262)
[2023-06-22 13:31:17,854] INFO Deleting resource (io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT:270)
[2023-06-22 13:31:18,890] WARN Deleting namespace strimzipodset-crd-it-namespace after tests run (io.strimzi.operator.common.operator.resource.AbstractCustomResourceOperatorIT:110)
[2023-06-22 13:31:24,358] INFO Deleting CRD (io.strimzi.operator.common.operator.resource.AbstractCustomResourceOperatorIT:113)
[2023-06-22 13:31:24,358] INFO Deleting resources /Users/ggrimwoo/workspace/temp/strimzi-kafka-operator/operator-common/../packaging/install/cluster-operator/042-Crd-strimzipodset.yaml (io.strimzi.test.k8s.KubeClusterResource:301)
[ERROR] Tests run: 4, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 44.536 s <<< FAILURE! - in null
[ERROR] io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT.testUpdateStatusAfterResourceDeletedThrowsKubernetesClientException(VertxTestContext)  Time elapsed: 30.028 s  <<< ERROR!
java.util.concurrent.TimeoutException: 
The test execution timed out. Make sure your asynchronous code includes calls to either VertxTestContext#completeNow(), VertxTestContext#failNow() or Checkpoint#flag()

Unsatisfied checkpoints diagnostics:
-> checkpoint at io.strimzi.operator.common.operator.resource.StrimziPodSetCrdOperatorIT.testUpdateStatusAfterResourceDeletedThrowsKubernetesClientException(StrimziPodSetCrdOperatorIT.java:200)
	at io.vertx.junit5.VertxExtension.joinActiveTestContexts(VertxExtension.java:204)
	at io.vertx.junit5.VertxExtension.afterTestExecution(VertxExtension.java:164)
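
For context, here is a minimal, hypothetical example of the vertx-junit5 pattern these ITs rely on (not the actual test code): the test only passes once its checkpoint is flagged, so if the async delete-and-wait chain never completes, for example because the wait predicate can never become true, the extension fails the test with exactly this TimeoutException.

import io.vertx.core.Future;
import io.vertx.junit5.Checkpoint;
import io.vertx.junit5.VertxExtension;
import io.vertx.junit5.VertxTestContext;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(VertxExtension.class)
class CheckpointSketchTest {

    @Test
    void deletionFlagsCheckpoint(VertxTestContext context) {
        Checkpoint deleted = context.checkpoint();
        // If someAsyncDelete() never completes, deleted.flag() is never called and
        // the test times out with an "Unsatisfied checkpoints" diagnostic.
        someAsyncDelete().onComplete(context.succeeding(v -> deleted.flag()));
    }

    private Future<Void> someAsyncDelete() {
        return Future.succeededFuture(); // stand-in for the real delete-and-wait operation
    }
}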
TimeoutException on io.strimzi.operator.cluster.operator.assembly.KafkaConnectMigrationTest

(This one was actually quite long; the full test output is in this gist.)

2023-06-22 13:48:11 INFO  AbstractOperator:510 - Reconciliation #777(test-trigger) Kafka(my-namespace/my-cluster): reconciled
2023-06-22 13:48:11 INFO  StrimziPodSetController:592 - Requesting the StrimziPodSet controller to stop
2023-06-22 13:48:11 INFO  StrimziPodSetController:575 - Stopping StrimziPodSet controller
2023-06-22 13:48:11 INFO  InformerUtils:63 - Stopping informers
2023-06-22 13:48:11 INFO  InformerUtils:51 - StrimziPodSet informer stopped
2023-06-22 13:48:11 INFO  InformerUtils:51 - Pod informer stopped
2023-06-22 13:48:11 INFO  InformerUtils:51 - Kafka informer stopped
2023-06-22 13:48:11 INFO  InformerUtils:51 - KafkaConnect informer stopped
2023-06-22 13:48:11 INFO  InformerUtils:51 - KafkaMirrorMaker2 informer stopped
[ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 52.024 s <<< FAILURE! - in io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperatorWithPoolsMockTest
[ERROR] io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperatorWithPoolsMockTest.testReconcileAndRemovePool(VertxTestContext)  Time elapsed: 30.069 s  <<< ERROR!
java.util.concurrent.TimeoutException: 
The test execution timed out. Make sure your asynchronous code includes calls to either VertxTestContext#completeNow(), VertxTestContext#failNow() or Checkpoint#flag()

Unsatisfied checkpoints diagnostics:
-> checkpoint at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperatorWithPoolsMockTest.testReconcileAndRemovePool(KafkaAssemblyOperatorWithPoolsMockTest.java:627)
	at io.vertx.junit5.VertxExtension.joinActiveTestContexts(VertxExtension.java:204)
	at io.vertx.junit5.VertxExtension.afterTestExecution(VertxExtension.java:164)
Failure to create Cluster Roles in cluster-operator tests

(Full test output is in this gist.)

[INFO] Running io.strimzi.operator.cluster.MainIT
2023-06-22 14:28:24 WARN  DnsServerAddressStreamProviders:70 - Can not find io.netty.resolver.dns.macos.MacOSDnsServerAddressStreamProvider in the classpath, fallback to system defaults. This may result in incorrect DNS resolutions on MacOS. Check whether you have a dependency on 'io.netty:netty-resolver-dns-native-macos'
2023-06-22 14:28:25 INFO  KubeCluster:92 - Using cluster: minikube
2023-06-22 14:28:25 INFO  Main:274 - Creating cluster role strimzi-kafka-broker
2023-06-22 14:28:25 INFO  Main:274 - Creating cluster role strimzi-kafka-client
2023-06-22 14:28:25 INFO  Main:274 - Creating cluster role strimzi-cluster-operator-global
2023-06-22 14:28:25 INFO  Main:274 - Creating cluster role strimzi-cluster-operator-namespaced
2023-06-22 14:28:25 INFO  Main:274 - Creating cluster role strimzi-entity-operator
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.293 s <<< FAILURE! - in null
[ERROR] io.strimzi.operator.cluster.MainIT.testCreateClusterRolesCreatesClusterRoles(VertxTestContext)  Time elapsed: 1.221 s  <<< FAILURE!
java.lang.AssertionError: io.vertx.core.impl.NoStackTraceThrowable: Failed to create Cluster Roles.
	at io.vertx.junit5.VertxExtension.joinActiveTestContexts(VertxExtension.java:193)
	at io.vertx.junit5.VertxExtension.afterTestExecution(VertxExtension.java:164)

@MikeEdgar (Contributor, Author)

@gracegrimwood, the second commit should resolve the timeout with StrimziPodSetCrdOperatorIT and also adds a log statement to capture the issue creating the cluster roles. The blocked thread messages can be ignored for now; those are addressed in a separate PR.

@gracegrimwood

Indeed, that seems to have resolved that issue. I'm now seeing a lot of this instead:

[ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 51.141 s <<< FAILURE! - in io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperatorWithPoolsMockTest
[ERROR] io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperatorWithPoolsMockTest.testReconcileAndRemovePool(VertxTestContext)  Time elapsed: 30.091 s  <<< ERROR!
java.util.concurrent.TimeoutException: 
The test execution timed out. Make sure your asynchronous code includes calls to either VertxTestContext#completeNow(), VertxTestContext#failNow() or Checkpoint#flag()

Unsatisfied checkpoints diagnostics:
-> checkpoint at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperatorWithPoolsMockTest.testReconcileAndRemovePool(KafkaAssemblyOperatorWithPoolsMockTest.java:627)
	at io.vertx.junit5.VertxExtension.joinActiveTestContexts(VertxExtension.java:204)
	at io.vertx.junit5.VertxExtension.afterTestExecution(VertxExtension.java:164)

A gist with the full output of 3 attempts at make java_build with the new change is here.

@MikeEdgar (Contributor, Author)

It seems as though the timeout in KafkaAssemblyOperatorWithPoolsMockTest.testReconcileAndRemovePool is always associated with a crash of the MockWebServer with a ConcurrentModificationException. Downgrading kubernetes-server-mock to 6.4.1 (the version before fabric8io/kubernetes-client#4945) seems to fix it, but that version has the performance problems noted in fabric8io/kubernetes-client#4924. I suspect the performance fix should also have taken a read lock on the watch operation; I will try that and see if the exception is resolved. A sketch of the locking pattern follows the stack trace below.

2023-06-23 13:54:00  okhttp3.mockwebserver.MockWebServer  SEVERE  MockWebServer[54552] connection from /127.0.0.1 crashed 
java.util.ConcurrentModificationException
	at java.base/java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:756)
	at java.base/java.util.LinkedHashMap$LinkedEntryIterator.next(LinkedHashMap.java:788)
	at java.base/java.util.LinkedHashMap$LinkedEntryIterator.next(LinkedHashMap.java:786)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
	at io.fabric8.kubernetes.client.server.mock.KubernetesCrudDispatcher.lambda$handleWatch$6(KubernetesCrudDispatcher.java:298)
	at io.fabric8.kubernetes.client.server.mock.WatchEventsListener.onOpen(WatchEventsListener.java:63)
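
A minimal sketch of the general read/write-lock pattern being referred to (illustrative only; this is not the fabric8 mock-server code or the actual patch in fabric8io/kubernetes-client#5280): writers that mutate the shared map take the write lock, while the watch path iterates the entries under the read lock, so the iteration cannot observe a concurrent structural modification.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BiConsumer;

class WatchableStore {
    private final Map<String, String> items = new LinkedHashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void put(String key, String value) {
        lock.writeLock().lock();
        try {
            items.put(key, value);          // structural modification under the write lock
        } finally {
            lock.writeLock().unlock();
        }
    }

    void forEachForWatch(BiConsumer<String, String> consumer) {
        lock.readLock().lock();
        try {
            // Iterating under the read lock avoids ConcurrentModificationException
            items.forEach(consumer);
        } finally {
            lock.readLock().unlock();
        }
    }
}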

@MikeEdgar (Contributor, Author)

I am no longer seeing the crash (or the timeout) with additional locking in the mock server [1]. Were you also still getting errors creating the cluster roles? I didn't spot anything in the gist log you linked, @gracegrimwood.

[1] fabric8io/kubernetes-client#5280

@gracegrimwood

I don't seem to be, no. I've also run it again a couple of times this morning and haven't had it occur. 😃

@MikeEdgar (Contributor, Author)

Once #8895 is merged I'll rebase this and we can see if anything remains.

@MikeEdgar (Contributor, Author)

Tests are clean locally for me (no flakes). The latest push includes #8895, which will be removed once that is merged.

@scholzj (Member) commented on Jul 27, 2023

@MikeEdgar #8895 was merged. So you can now rebase this I guess.

@MikeEdgar (Contributor, Author)

I see there are several flakes in the CI test run, unfortunately. @scholzj, what are your thoughts on having a way for TestUtils#waitFor to vary the timeout based on Runtime#availableProcessors? IMO it would be good to avoid tolerating flaky tests, but perhaps the medicine is worse than the disease 😄

@scholzj (Member) commented on Jul 28, 2023

> I see there are several flakes in the CI test run, unfortunately. @scholzj, what are your thoughts on having a way for TestUtils#waitFor to vary the timeout based on Runtime#availableProcessors? IMO it would be good to avoid tolerating flaky tests, but perhaps the medicine is worse than the disease 😄

What would the timeouts be? If it helps, why not. But I'm not sure why it would help. These tests should IMHO run one by one and not in parallel. So are they really unable to execute in time on one CPU?

@scholzj added this to the 0.37.0 milestone on Jul 28, 2023
@MikeEdgar (Contributor, Author)

> What would the timeouts be? If it helps, why not. But I'm not sure why it would help. These tests should IMHO run one by one and not in parallel. So are they really unable to execute in time on one CPU?

Although the tests are running serially, there are background processes like minikube and the mock server competing for CPU cycles. The theory is that with fewer available cores there is a potential for the async code to take longer to be scheduled and complete.
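
A sketch of the kind of CPU-based adjustment being discussed (the method name and the reference core count are made up for illustration; this change was ultimately dropped from the PR, as the later comments show): the base timeout is scaled up when fewer cores are available and is never reduced below its original value.

class TimeoutAdjustment {
    static long adjustedTimeoutMs(long baseTimeoutMs) {
        // Assume a "reference" machine with 4 cores; fewer cores stretch the timeout.
        int cores = Runtime.getRuntime().availableProcessors();
        double factor = Math.max(1.0, 4.0 / cores);
        return (long) (baseTimeoutMs * factor);
    }
}

For example, a 30-second base timeout becomes 60 seconds on a 2-core runner and stays at 30 seconds on 4 or more cores.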

@scholzj (Member) commented on Jul 31, 2023

> What would the timeouts be? If it helps, why not. But I'm not sure why it would help. These tests should IMHO run one by one and not in parallel. So are they really unable to execute in time on one CPU?

> Although the tests are running serially, there are background processes like minikube and the mock server competing for CPU cycles. The theory is that with fewer available cores there is a potential for the async code to take longer to be scheduled and complete.

Well, I think you would need to share more details on how you would see it and what the dynamic timeouts would be. The waitFor method is IMHO used in many different parts of the code base. In some places, it might make sense to adjust it like this. In others - e.g. the STs, which already wait for many minutes today - I do not think this is really a solution.

@scholzj modified the milestones: 0.37.0, 0.38.0 on Aug 31, 2023
@scholzj modified the milestones: 0.38.0, 0.39.0 on Oct 11, 2023
@scholzj modified the milestones: 0.39.0, 0.40.0 on Dec 13, 2023
Signed-off-by: Michael Edgar <medgar@redhat.com>
Signed-off-by: Michael Edgar <medgar@redhat.com>
@MikeEdgar (Contributor, Author)

Sorry to have left this hanging around for so long. @scholzj, if you see value in the PR apart from the third commit with the TestUtils changes, I can remove that commit and this could move forward. The remaining changes are timeout increases while waiting for (potentially) slow CR deletes, and two logic fixes in the CrdOperator classes, which currently wait for the wrong condition on a delete operation.

@scholzj (Member) left a comment

I left some questions. TBH, I do not remember why we never got back to it. If it helps, we should try to complete it.

@scholzj (Member) commented on Jan 8, 2024

@MikeEdgar For what it's worth, it seems to pass on the CI here as well as on my Arm-based Mac. It is always hard to say if it works better or worse, but we can give it a try. So I would proceed with it.

@MikeEdgar (Contributor, Author)

> I do not remember why we never got back to it

I allowed my attention to be fully focused elsewhere 😄

> it seems to pass on the CI here as well as on my Arm-based Mac. It is always hard to say if it works better or worse, but we can give it a try. So I would proceed with it.

At one point there were no flakes listed in the test logs in Azure, but I see some out there now that I don't think would be helped by the TestUtils change anyway. I would be fine to leave the CPU adjustment change in place and go ahead with it. The only concern is that tests waiting for a condition that will truly never occur would take longer to time out, but the result of the test would be the same either way. There is at least a chance the condition becomes true within the extended wait period.

@MikeEdgar marked this pull request as ready for review on January 8, 2024
@scholzj (Member) commented on Jan 8, 2024

Looking at it a bit more in an IDE and thinking a bit more about it ... I would probably remove it.

I think the CPU part is a bit questionable. In the IT tests related to Kubernetes - is it really the local CPU that is the issue? Or the CPU of the Kube cluster, which might be different (both in the case of a remote Kube cluster and when you run it locally but don't give it the full CPU)? The waitFor also seems to be used all over the system tests. There it will be either the CPU of the target Kube cluster that matters, or no CPU at all, as sometimes you are limited by the various timeouts rather than by how quickly the CPU does things.

So if we decide to keep it, I think you should use it in a separate method that is called only from selected places such as the IT tests, and not from places such as the system tests. But maybe just removing it would be a better start?

@scholzj (Member) commented on Jan 8, 2024

@ppatierno @im-konge WDYT ^^^?

@MikeEdgar (Contributor, Author)

> is it really the local CPU that is the issue? Or the CPU of the Kube cluster, which might be different (both in the case of a remote Kube cluster and when you run it locally but don't give it the full CPU)?

Yeah, that's a good point. I was limiting my thinking on this to the minikube scenario with full CPU (which I think is the case in Azure, right?). As you said, better to start with removing it here and possibly adding a separate method for use in individual tests where necessary.

@MikeEdgar (Contributor, Author)

Removed the changes to TestUtils. For reference they are here: MikeEdgar@36ae567

@scholzj (Member) commented on Jan 8, 2024

> Yeah, that's a good point. I was limiting my thinking on this to the minikube scenario with full CPU (which I think is the case in Azure, right?).

It is Minikube on Azure with 2 CPU cores (that said, from my subjective experience, the actual performance of the 2 cores differs quite heavily between runs).

@im-konge (Member) commented on Jan 8, 2024

Regarding the STs: we have just a few situations where we actually wait out the waitFor timeout, but if I understood the code correctly, it would multiply the default timeout by the "adjustment", which could end up in a situation where we wait considerably longer than expected. (Or the tests could wait longer on some resource-condition timeout, so we could have pipelines that are stopped due to a timeout just because they didn't finish in time.)

In the ITs the timeouts are relatively small, but in the STs we have timeouts of around 14 minutes, so if that were multiplied we could get 28 minutes or so (that's for resource readiness), which is something we don't want.

So if you would like to incorporate it, I would maybe create some kind of encapsulation that does not hit the STs.

@MikeEdgar (Contributor, Author)

The other thing to consider is whether or not this should close #8526. The flaky tests can be a bit of a moving target and some still remain that this PR doesn't resolve.

@scholzj (Member) commented on Jan 8, 2024

> The other thing to consider is whether or not this should close #8526. The flaky tests can be a bit of a moving target and some still remain that this PR doesn't resolve.

I think closing it is fine. We can always open a new one 😉.

@scholzj (Member) left a comment

Thanks @MikeEdgar

@im-konge (Member) left a comment

LGTM, thanks :)

@ppatierno (Member) left a comment

Sorry to be late to the party, I have no further comments. LGTM, thanks!

@scholzj merged commit d796434 into strimzi:main on Jan 9, 2024
13 checks passed
@MikeEdgar deleted the issue-8526-timeouts branch on January 9, 2024
Successfully merging this pull request may close these issues.

[Bug]: Local build fails during java test phase on MacOS/aarch64