
Operator (KafkaRoller) exception handling may lead to unexpected broker restart #7121

Open
k-wall opened this issue Jul 27, 2022 · 3 comments

@k-wall
Contributor

k-wall commented Jul 27, 2022

Describe the bug

We saw a situation where the Strimzi Kafka Operator restarted Kafka brokers inappropriately. Specifically, the Strimzi pod had run out of disk space in /tmp. The KafkaRoller mishandled the resulting exception and restarted broker(s).

This disrupted the applications using the Kafka instance, and the disruption continued until an SRE intervened.

The evidence of this condition in the logs was an unexpected:

2022-07-22 22:48:18.011,2022-07-22 22:48:18 INFO  PodOperator:68 - Reconciliation #58826(timer) Kafka(kafka-xxxxxxxxx/prod-xxxxxx): Rolling pod zzzzz-xxxxx-kafka-0

Looking at the code:

https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java#L473

we see that the exception handling conflates exceptions arising from 'local' issues, where a broker restart is undesirable, with other exceptions where a restart is reasonable.
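A minimal sketch of the distinction being argued for here. None of these names come from KafkaRoller; they are hypothetical and only illustrate classifying a failure as operator-local (e.g. a wrapped `IOException` from a full /tmp, which should fail the reconciliation) versus broker-side (which may justify a restart):

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration, not the actual KafkaRoller code: decide whether a
// failure during reconciliation warrants rolling the broker pod, or whether it
// is a local operator-side problem (like "No space left on device") that a
// restart cannot fix and should instead fail the reconciliation.
public class RestartDecision {

    enum Action { RESTART_BROKER, FAIL_RECONCILIATION }

    static Action classify(Throwable t) {
        // Walk the cause chain: a RuntimeException wrapping an IOException
        // (as in the stack traces above) indicates a local I/O problem.
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof IOException || c instanceof UncheckedIOException) {
                return Action.FAIL_RECONCILIATION;
            }
        }
        // Anything else (e.g. broker unresponsive) may justify a restart.
        return Action.RESTART_BROKER;
    }

    public static void main(String[] args) {
        Throwable local = new RuntimeException(new IOException("No space left on device"));
        Throwable remote = new RuntimeException("broker not ready");
        System.out.println(classify(local));   // FAIL_RECONCILIATION
        System.out.println(classify(remote));  // RESTART_BROKER
    }
}
```

The point is only that the decision to roll a pod should inspect the failure's origin rather than treating every exception in the restart path identically.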

The code on main has been refactored but I believe the same issue remains:

We also saw the ZooKeeper stage of the reconciliation fail, but its exception handling did not cause an operand restart.

2022-07-22 22:34:11.405,java.lang.RuntimeException: java.io.IOException: No space left on device
2022-07-22 22:34:11.405,        at io.strimzi.operator.common.Util.createFileTrustStore(Util.java:271) ~[io.strimzi.operator-common-0.26.0.managedsvc-redhat-00016.jar:0.26.0.managedsvc-redhat-00016]
2022-07-22 22:34:11.405,        at io.strimzi.operator.cluster.operator.resource.ZookeeperScaler.lambda$getClientConfig$12(ZookeeperScaler.java:282) ~[io.strimzi.cluster-operator-0.26.0.managedsvc-redhat-00016.jar:0.26.0.managedsvc-redhat-00

To Reproduce

It is tricky to reproduce, as the issue is non-deterministic. It depends on the number of clusters being managed.

Expected behavior

The broker should not be restarted. I would expect strimzi_reconciliations_failed_total to count the failed reconciliation.

Environment (please complete the following information):

  • Strimzi version: 0.26
  • Installation method: OLM
  • Kubernetes cluster: OpenShift 4.10.15
  • Infrastructure: AWS

YAML files and logs

Additional context

@k-wall k-wall added the bug label Jul 27, 2022
@k-wall k-wall changed the title Operator exceptions may lead to unexpected broker restarr] Operator exception handling may lead to unexpected broker restart Jul 27, 2022
@k-wall k-wall changed the title Operator exception handling may lead to unexpected broker restart Operator (KafkaRoller) exception handling may lead to unexpected broker restart Jul 27, 2022
@Gnosnay

Gnosnay commented Aug 2, 2022

Same here.

Strimzi version: 0.27.1
When the /tmp of the Strimzi cluster operator is full, it raises "No space left on device" and fails to schedule the ZooKeeper pods.

2022-08-02 03:08:41 ERROR AbstractOperator:247 - Reconciliation #239(watch) Kafka(kfk-ns-1659409714/no-space-test): createOrUpdate failed
java.lang.RuntimeException: java.io.IOException: No space left on device
        at io.strimzi.operator.cluster.model.Ca.generateCaKeyAndCert(Ca.java:972) ~[io.strimzi.operator-common-0.27.1.jar:0.27.1]
        at io.strimzi.operator.cluster.model.Ca.createRenewOrReplace(Ca.java:531) ~[io.strimzi.operator-common-0.27.1.jar:0.27.1]
        at io.strimzi.operator.cluster.operator.assembly.KafkaAssemblyOperator$ReconciliationState.lambda$reconcileCas$6(KafkaAssemblyOperator.java:651) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
        at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:159) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at io.vertx.core.impl.ContextImpl.lambda$executeBlocking$1(ContextImpl.java:157) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at io.vertx.core.impl.TaskQueue.run(TaskQueue.java:76) ~[io.vertx.vertx-core-4.2.1.jar:4.2.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.71.Final.jar:4.1.71.Final]
        at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.IOException: No space left on device
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[?:?]
        at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:62) ~[?:?]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:113) ~[?:?]
        at sun.nio.ch.IOUtil.write(IOUtil.java:79) ~[?:?]
        at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:280) ~[?:?]
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:74) ~[?:?]
        at java.nio.channels.Channels.writeFully(Channels.java:97) ~[?:?]
        at java.nio.channels.Channels$1.write(Channels.java:172) ~[?:?]
        at java.io.InputStream.transferTo(InputStream.java:705) ~[?:?]
        at java.nio.file.Files.copy(Files.java:3078) ~[?:?]
        at io.strimzi.certs.OpenSslCertManager.createDefaultConfig(OpenSslCertManager.java:105) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.buildConfigFile(OpenSslCertManager.java:118) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.generateCaCert(OpenSslCertManager.java:259) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.generateRootCaCert(OpenSslCertManager.java:168) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.certs.OpenSslCertManager.generateSelfSignedCert(OpenSslCertManager.java:147) ~[io.strimzi.certificate-manager-0.27.1.jar:0.27.1]
        at io.strimzi.operator.cluster.model.Ca.generateCaKeyAndCert(Ca.java:950) ~[io.strimzi.operator-common-0.27.1.jar:0.27.1]
        ... 10 more

@Gnosnay

Gnosnay commented Aug 2, 2022

[strimzi@strimzi-cluster-operator-c56b8b9b7-mn85c tmp]$ df .
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs               1024  1024         0 100% /tmp

@scholzj
Member

scholzj commented Nov 3, 2022

Triaged on the community call on 3rd November: the first issue from the description seems like a bug and should be fixed.

@Yansongsongsong The issue you described seems unrelated to this one. If you still have it, please open a separate issue for it.
