
Operator (KafkaRoller) exception handling may lead to unexpected broker restart #7121

Open
k-wall opened this issue Jul 27, 2022 · 3 comments

@k-wall
Contributor

k-wall commented Jul 27, 2022

Describe the bug

We saw a situation where the Strimzi Kafka Operator restarted Kafka brokers inappropriately. Specifically, the Strimzi pod had run out of disk space in /tmp. The KafkaRoller mishandled the resulting exception and restarted broker(s).

This disrupted the applications using the Kafka instance, and the disruption continued until an SRE intervened.

The evidence of this condition in the logs was an unexpected:

2022-07-22 22:48:18.011,2022-07-22 22:48:18 INFO  PodOperator:68 - Reconciliation #58826(timer) Kafka(kafka-xxxxxxxxx/prod-xxxxxx): Rolling pod zzzzz-xxxxx-kafka-0

Looking at the code:

https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java#L473

we see that the exception handling conflates exceptions arising from 'local' issues, where a broker restart is undesirable, with other exceptions where a restart is reasonable.
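A minimal sketch of the distinction being argued for here. None of these names come from KafkaRoller; they are hypothetical and only illustrate classifying a failure as operator-local (e.g. a wrapped `IOException` from a full /tmp, which should fail the reconciliation) versus broker-side (which may justify a restart):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration, not the actual KafkaRoller code: decide whether a
// failure during reconciliation warrants rolling the broker pod, or whether it
// is a local operator-side problem (like "No space left on device") that a
// restart cannot fix and should instead fail the reconciliation.
public class RestartDecision {

    enum Action { RESTART_BROKER, FAIL_RECONCILIATION }

    static Action classify(Throwable t) {
        // Walk the cause chain: a RuntimeException wrapping an IOException
        // (as in the stack traces above) indicates a local I/O problem.
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException || c instanceof UncheckedIOException) {
                return Action.FAIL_RECONCILIATION;
            }
        }
        // Anything else (e.g. broker unresponsive) may justify a restart.
        return Action.RESTART_BROKER;
    }

    public static void main(String[] args) {
        Throwable local = new RuntimeException(new IOException("No space left on device"));
        Throwable remote = new RuntimeException("broker not ready");
        System.out.println(classify(local));   // FAIL_RECONCILIATION
        System.out.println(classify(remote));  // RESTART_BROKER
    }
}
```

The point is only that the decision to roll a pod should inspect the failure's origin rather than treating every exception in the restart path identically.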

The code on main has been refactored but I believe the same issue remains:

We also saw the ZooKeeper stage of the reconciliation fail, but its exception handling did not cause an operand restart.

2022-07-22 22:34:11.405,java.lang.RuntimeException: java.io.IOException: No space left on device
2022-07-22 22:34:11.405,        at io.strimzi.operator.common.Util.createFileTrustStore(Util.java:271) ~[io.strimzi.operator-common-0.26.0.managedsvc-redhat-00016.jar:0.26.0.managedsvc-redhat-00016]
2022-07-22 22:34:11.405,        at io.strimzi.operator.cluster.operator.resource.ZookeeperScaler.lambda$getClientConfig$12(ZookeeperScaler.java:282) ~[io.strimzi.cluster-operator-0.26.0.managedsvc-redhat-00016.jar:0.26.0.managedsvc-redhat-00

To Reproduce

It is tricky to reproduce, as the issue is non-deterministic. It depends on the number of clusters being managed.

Expected behavior

The broker should not be restarted. I would expect strimzi_reconciliations_failed_total to count the failed reconciliation.

Environment (please complete the following information):

  • Strimzi version: 0.26
  • Installation method: OLM
  • Kubernetes cluster: OpenShift 4.10.15
  • Infrastructure: AWS

YAML files and logs

Additional context

@k-wall k-wall added the bug label Jul 27, 2022
@k-wall k-wall changed the title Operator exceptions may lead to unexpected broker restarr] Operator exception handling may lead to unexpected broker restart Jul 27, 2022
@k-wall k-wall changed the title Operator exception handling may lead to unexpected broker restart Operator (KafkaRoller) exception handling may lead to unexpected broker restart Jul 27, 2022
@Gnosnay

Gnosnay commented Aug 2, 2022

Same here.

Strimzi version: 0.27.1
When the /tmp of the Strimzi cluster operator is full, it raises "No space left on device" and fails to schedule the ZooKeeper pods.

2022-08-02 03:08:41 ERROR AbstractOperator:247 - Reconciliation #239(watch) Kafka(kfk-ns-1659409714/no-space-test): createOrUpdate failed
java.lang.RuntimeException: java.io.IOException: No space left on device
        at io.strimzi.operator.cluster.model.Ca.generateCaKeyAndCert(Ca.java:972) ~[io.strimzi.operator-common-0.27.1.jar:0.27.1]
        at io.strimzi.operator.cluster.model.Ca.createRenewOrReplace(Ca.java:531) ~[io.strimzi.operator-common-0.27.1.jar:0.27.1]
        at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator$ReconciliationState.lambda$reconcileCas$6(KafkaAssemblyOperator.java:651) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
        at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.71.Final.jar:4.1.71.Final]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62) ~[?:?]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
        at sun.nio.ch.IOUtil.write(IOUtil.java:79) ~[?:?]
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280) ~[?:?]
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:74) ~[?:?]
        at java.nio.channels.Channels.writeFully(Channels.java:97) ~[?:?]
        at java.nio.channels.Channels$1.write(Channels.java:172) ~[?:?]
        at java.io.InputStream.transferTo(InputStream.java:705) ~[?:?]
        at java.nio.file.Files.copy(Files.java:3078) ~[?:?]
        at io.strimzi.certs.OpenSslCertManager.createDefaultConfig(OpenSslCertManager.java:105) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.buildConfigFile(OpenSslCertManager.java:118) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.generateCaCert(OpenSslCertManager.java:259) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.generateRootCaCert(OpenSslCertManager.java:168) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.generateSelfSignedCert(OpenSslCertManager.java:147) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.operator.cluster.model.Ca.generateCaKeyAndCert(Ca.java:950) ~[io.strimzi.operator-common-0.27.1.jar:0.27.1]
        ... 10 more

@Gnosnay

Gnosnay commented Aug 2, 2022

[strimzi@strimzi-cluster-operator-c56b8b9b7-mn85c tmp]$ df .
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs               1024  1024         0 100% /tmp

@scholzj
Member

scholzj commented Nov 3, 2022

Triaged on the community call on 3rd November: the first issue from the description seems like a bug and should be fixed.

@Yansongsongsong The issue you described seems unrelated to this one. If you still have it, please open a separate issue for it.
