Operator (KafkaRoller) exception handling may lead to unexpected broker restart #7121
Comments
Same for me. Strimzi version: 0.27.1
Triaged on the community call on 3rd November: The first issue from the description seems like a bug and should be fixed. @Yansongsongsong The issue you described seems unrelated to this one. If you still have it, please open a separate issue for it.
Describe the bug
We saw a situation where the Strimzi Kafka Operator restarted Kafka brokers when it should not have. Specifically, the Strimzi pod had run out of disk space in `/tmp`. The KafkaRoller handled the resulting exception inappropriately by causing broker(s) to be restarted. This disrupted the applications using the Kafka instance, and the disruption continued until an SRE intervened.
The evidence in the logs of this condition was an unexpected:
Looking at the code:
https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.26.x/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java#L473
we see that the exception handling conflates exceptions arising from 'local' issues, where a broker restart is undesirable, with other exceptions where a restart is reasonable.
The code on main has been refactored but I believe the same issue remains:
strimzi-kafka-operator/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java
Line 466 in 2450eb4
We also saw the ZooKeeper stage of the reconciliation fail, but its exception handling did not cause an operand restart.
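To illustrate the distinction described above, here is a minimal sketch of exception handling that separates operator-local failures from broker-side ones. It is not the KafkaRoller code: `RestartDecisionSketch`, `BrokerUnreachableException`, `Action`, and `decide` are hypothetical names, the choice of which failure justifies a restart is an assumption made for illustration, and the counter is only a stand-in for a metric such as `strimzi_reconciliations_failed_total`.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

public class RestartDecisionSketch {

    /** Hypothetical exception type: the operator could not reach or query the broker. */
    static class BrokerUnreachableException extends Exception {
        BrokerUnreachableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Hypothetical outcome of handling an error seen while checking a broker. */
    enum Action { RESTART_BROKER, FAIL_RECONCILIATION }

    // Stand-in for a failure metric; not the operator's real metrics API.
    static final AtomicLong failedReconciliations = new AtomicLong();

    /**
     * Assumption: only errors that point at the broker itself justify a restart.
     * Anything else (for example an IOException caused by a full /tmp in the
     * operator pod) fails the reconciliation and leaves the broker alone.
     */
    static Action decide(Throwable error) {
        if (error instanceof BrokerUnreachableException) {
            return Action.RESTART_BROKER;
        }
        failedReconciliations.incrementAndGet();
        return Action.FAIL_RECONCILIATION;
    }

    public static void main(String[] args) {
        Throwable local = new IOException("No space left on device");
        Throwable remote = new BrokerUnreachableException("connection refused", null);

        System.out.println("local failure  -> " + decide(local));
        System.out.println("broker failure -> " + decide(remote));
        System.out.println("failed reconciliations counted: " + failedReconciliations.get());
    }
}
```

The design choice in this sketch is to restart only on an explicit allow-list of exception types, so an unexpected failure inside the operator defaults to a failed reconciliation rather than a broker restart.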
To Reproduce
It is tricky to reproduce as the issue is non-deterministic. It depends on the number of clusters being managed.
Expected behavior
The broker should not be restarted. I would expect `strimzi_reconciliations_failed_total` to count the failed reconciliation.
Environment (please complete the following information):
YAML files and logs
Additional context