
Fix flaky kafka cluster example #4549

Merged
merged 12 commits into from
Oct 7, 2021

Conversation

@rnorth (Member) commented Oct 6, 2021

Fixes #4479
I believe the original issue comes down to the GitHub Actions runners simply not being powerful enough to launch 3 instances of Kafka in parallel within reasonable startup time limits.

  • Increase the startup time limit and start containers in series. The total startup time in series seems to be about the same as in parallel, which I think confirms my suspicion that the failure was resource-constrained (we observed ~27s startup time for parallel start vs ~12+7+7s for serial start).
  • Add logback to the kafka-cluster example for better, timestamped log output.
  • Change KafkaContainer to emit the full exec stdout/stderr/exit code in case of a kafka-configs failure.
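For the timestamped-log-output bullet, a minimal logback.xml along these lines produces the HH:mm:ss.SSS-prefixed lines seen in the log excerpts further down; this exact file content is a sketch, not necessarily what the PR added:

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- Millisecond-precision timestamps, matching 08:57:31.423-style output -->
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT" />
  </root>
</configuration>
```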

@rnorth (Member, Author) commented Oct 7, 2021

Our observations so far: the kafka-configs process is exiting with exit code 137, which means it is receiving SIGKILL. @bsideup tried upgrading the image to one with a cgroup-aware JDK version, in case a GHA change was causing containers to be OOM-killed. It doesn't look like that has helped.
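For reference, exit code 137 is 128 + 9, i.e. death by signal 9 (SIGKILL) under the usual Unix convention, which is consistent with either an OOM kill or a forcible container stop. A tiny sketch of that decoding (class and method names here are illustrative, not from the PR):

```java
// Decode a process exit code per the common Unix convention:
// codes >= 128 mean the process was killed by signal (code - 128).
public class ExitCodeDecoder {

    static int signalFromExitCode(int exitCode) {
        return exitCode >= 128 ? exitCode - 128 : 0; // 0 = not signal-related
    }

    public static void main(String[] args) {
        // 137 - 128 = 9, and signal 9 is SIGKILL
        System.out.println("exit 137 => signal " + signalFromExitCode(137));
    }
}
```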

I'm still considering the hypothesis that startup is simply taking too long, since it involves pulling the image and then starting N instances of Kafka in parallel (25s+ just to start the container looks really long). If deepStart is timing out, perhaps it is killing the containers (and, in the process, killing kafka-configs).

@rnorth (Member, Author) commented Oct 7, 2021

> I'm still considering the hypothesis that startup is just taking too long, since it involves pulling the image and then starting N instances of Kafka in parallel (25s+ just to start the container looks really long). If deepStart is timing out, perhaps it's killing the containers (and by coincidence, killing kafka-configs).

I think this is it:

Kafka is started at 08:57:31.423:

    08:57:31.423 INFO  🐳 [confluentinc/cp-kafka:6.2.1] - Pulling docker image: confluentinc/cp-kafka:6.2.1. Please be patient; this may take some time but only needs to be done once.

First failure logged at 08:58:01.917:

    08:58:01.917 ERROR 🐳 [confluentinc/cp-kafka:6.2.1] - Could not start container
    java.lang.IllegalStateException: Container.ExecResult(exitCode=137, stdout=, stderr=)
    	at org.testcontainers.containers.KafkaContainer.containerIsStarted(KafkaContainer.java:121)
    	at org.testcontainers.containers.GenericContainer.containerIsStarted(GenericContainer.java:687)
    	at org.testcontainers.containers.GenericContainer.tryStart(GenericContainer.java:503)
    	at org.testcontainers.containers.GenericContainer.lambda$doStart$0(GenericContainer.java:330)
    	at org.rnorth.ducttape.unreliables.Unreliables.retryUntilSuccess(Unreliables.java:81)
    	at org.testcontainers.containers.GenericContainer.doStart(GenericContainer.java:328)
    	at org.testcontainers.containers.GenericContainer.start(GenericContainer.java:316)
    	at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:719)
    	at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701)
    	at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    	at java.lang.Thread.run(Thread.java:748)

This lines up with the 30s timeout applied in CONTAINER_RUNNING_TIMEOUT_SEC.
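As a sanity check on the two timestamps above: the failure is logged roughly 30.5 s after the image pull began, just past a 30 s limit. The arithmetic, using the log timestamps and nothing Testcontainers-specific:

```java
import java.time.Duration;
import java.time.LocalTime;

// Elapsed time between the "Pulling docker image" log line and the
// "Could not start container" error, taken from the log output above.
public class TimeoutCheck {

    static Duration elapsed() {
        LocalTime start = LocalTime.parse("08:57:31.423");
        LocalTime failure = LocalTime.parse("08:58:01.917");
        return Duration.between(start, failure);
    }

    public static void main(String[] args) {
        // ~30.5 s elapsed, just over a 30 s timeout
        System.out.println("elapsed ms: " + elapsed().toMillis());
    }
}
```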

@kiview (Member) commented Oct 7, 2021

I would approve this, once out of draft state 🙂

@rnorth changed the title from "Debugging for #4479" to "Fix flaky kafka cluster example" on Oct 7, 2021
@rnorth (Member, Author) commented Oct 7, 2021

I have tidied up and updated the PR description.

@rnorth rnorth marked this pull request as ready for review October 7, 2021 10:42
…KafkaContainerCluster.java

Co-authored-by: Kevin Wittek <kiview@users.noreply.github.com>
@rnorth rnorth merged commit dace7e4 into master Oct 7, 2021
@rnorth rnorth deleted the 4479-debug branch October 7, 2021 11:15
@kiview kiview added this to the next milestone Oct 13, 2021

Successfully merging this pull request may close these issues.

Flaky test: kafka-cluster example
3 participants