Fix KR issues #3721
Conversation
We need a test for this (ideally a regression test in the ST). And probably some test coverage in the KafkaRollerTest too.
cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java
We also need to fix the logging part of the problem mentioned in #3710.
Force-pushed from 2a8f11a to 70a75e1
@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
✔️ Test Summary ✔️ TEST_PROFILE: regression
if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
    log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
    restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
if (isCrashlooping(pod)) {
Out of interest why was it necessary to add it here? Just so that we don't spend ~7 minutes before we restart it?
I am not saying it was necessary, but without it the timeout was still 20 minutes. And it makes sense: if we want to roll a pod which is crashlooping, it is probably not possible to connect to it, so checking this at the beginning of the algorithm seems to be a good idea.
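For context, a rough sketch of the control flow being discussed. `isCrashlooping`, `restartAndAwaitReadiness` and `maybeDynamicUpdateBrokerConfig` are names visible in the diff above; the method shape and everything else here is an assumption for illustration, not the merged code:

```java
// Sketch only: per-pod roll logic with the early crash-loop short-circuit.
private void restartIfNecessary(Pod pod, int podId, RestartPlan restartPlan) throws Exception {
    if (isCrashlooping(pod)) {
        // A crashlooping broker will never accept an admin-client connection,
        // so restart it immediately instead of waiting for the connection
        // attempt to time out.
        log.debug("{}: Pod {} is crashlooping, restarting it without trying to connect", reconciliation, podId);
        restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
        return;
    }
    // Healthy-looking pod: prefer a dynamic broker-config update and only
    // restart when that is not possible, as in the diff above.
    if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
        log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
        restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
    }
}
```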
@stanlyDoge I was looking into failures here and for now all are connected to
Thanks for confirmation
Thanks Standa.
if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
    log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
    restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
if (isCrashlooping(pod)) {
What if the pod is not crash looping but creating or pending because of some other misconfiguration (missing volumes, badly configured resources, ...)? Does the rolling update work fine in that case even with this check looking just for crash-looping pods?
How did you address this? This still seems to do something only when the pod is crashlooping, but not when it gets stuck in other states such as Pending, ContainerCreating, ImagePullBackoff etc.
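One illustrative way to generalize the check beyond crash-looping; the names and reason strings here are assumptions rather than the merged code, and `podWaitingBecauseOfReason` is the helper shown further down:

```java
import java.util.Arrays;
import java.util.List;

import io.fabric8.kubernetes.api.model.Pod;

// Illustrative only: treat the pod as "stuck" when it is waiting for any of
// these reasons, or is stuck in the Pending phase, not just when it is
// crashlooping.
private static final List<String> STUCK_REASONS =
        Arrays.asList("CrashLoopBackOff", "ImagePullBackOff", "ContainerCreating");

boolean isPodStuck(Pod pod) {
    boolean stuckInPending = pod.getStatus() != null
            && "Pending".equals(pod.getStatus().getPhase());
    return stuckInPending
            || STUCK_REASONS.stream().anyMatch(reason -> podWaitingBecauseOfReason(pod, reason));
}
```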
@@ -122,6 +125,30 @@ void testKafkaTopicRFLowerThanMinInSyncReplicas() {
    assertThat(StatefulSetUtils.ssSnapshot(kafkaName), is(not(kafkaPods)));
}

@Test
void testKafkaPodCrashLooping() throws InterruptedException {
Does this actually work? Why should it decide to roll the pod? In your test the configuration is up to date so I would not expect it to roll the pod. I think the test should do the following:
- Deploy the Kafka cluster with UseParNewGC
- Wait for it to crash loop
- Remove UseParNewGC from the config
- Wait for it to roll
- Wait for it to be ready
With the test as it is written right now, I would not expect the pods to be rolled at all since there was no change to anything. It should not roll them just because they are crashlooping ... but if it needs to roll them and they are crashlooping it should not wait for a successful connection to the broker.
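A hypothetical shape for that flow; every helper name below is a placeholder for illustration, not a real Strimzi systemtest utility:

```java
@Test
void testKafkaPodCrashLooping() {
    // 1. Deploy a Kafka cluster with a JVM option that modern JVMs reject,
    //    so the broker process exits on startup.
    deployKafkaWithJvmOption(CLUSTER_NAME, "-XX:+UseParNewGC");

    // 2. The pods never become ready and end up in CrashLoopBackOff.
    waitForKafkaPodsInCrashLoop(CLUSTER_NAME);

    // 3. Remove the bad option from the Kafka CR. The spec has now changed,
    //    so the operator has a real reason to roll the pods.
    removeKafkaJvmOption(CLUSTER_NAME, "-XX:+UseParNewGC");

    // 4. The roller should restart the crashlooping pods without waiting for
    //    an admin connection, and the cluster should become ready again.
    waitForKafkaRollingUpdate(CLUSTER_NAME);
    waitForKafkaReady(CLUSTER_NAME);
}
```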
I see your points. Let me retest it properly.
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Force-pushed from 6c35a6c to 9224dad
@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
❗ Systemtests Failed (no tests results are present) ❗
✔️ Test Summary ✔️ TEST_PROFILE: regression
Signed-off-by: Stanislav Knot <sknot@redhat.com>
@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST
@tombentley I think this changed a bit since your approval ... maybe it would be good if you could at least quickly look through it?
❌ Test Summary ❌ TEST_PROFILE: regression Re-run command:
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
/azp run acceptance
Azure Pipelines successfully started running 1 pipeline(s).
A few nits here, but I'm happy with the overall change to the error handling.
cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java
private boolean podWaitingBecauseOfReason(Pod pod, String reason) {
    if (pod != null && pod.getStatus() != null) {
        List<ContainerStatus> kafkaContainerStatus = pod.getStatus().getContainerStatuses().stream()
                .filter(containerStatus -> containerStatus.getName().equals("kafka"))
                .collect(Collectors.toList());
        if (kafkaContainerStatus.size() > 0) {
If you're just interested in getting the first reason you should use findFirst, rather than collecting into a List. But is it correct that the container status we're looking for will always be the first in the list?
from the tests, yes
That doesn't really mean that there does not exist a real-world circumstance where there's some other status which appears first, just that we've not run into such a circumstance yet.
What do you suggest?
I guess you can go through the list and check if it contains the conditions instead of checking just the first one?
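Concretely, that rewrite might look like this; a sketch against the fabric8 Pod model, not necessarily the final code:

```java
import io.fabric8.kubernetes.api.model.Pod;

// Sketch: scan all container statuses with anyMatch instead of collecting
// into a List and inspecting only the first element.
private boolean podWaitingBecauseOfReason(Pod pod, String reason) {
    if (pod == null || pod.getStatus() == null
            || pod.getStatus().getContainerStatuses() == null) {
        return false;
    }
    return pod.getStatus().getContainerStatuses().stream()
            .filter(cs -> "kafka".equals(cs.getName()))
            .anyMatch(cs -> cs.getState() != null
                    && cs.getState().getWaiting() != null
                    && reason.equals(cs.getState().getWaiting().getReason()));
}
```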
cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java
Signed-off-by: Stanislav Knot <sknot@redhat.com>
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Type of change
Description
Fixes #3710
All credits to @tombentley
Set timeouts on the AdminClient;
Check whether the pod is in the CrashLoopBackOff state
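For the AdminClient timeouts, the standard Kafka client configs are the relevant knobs; the concrete values and bootstrap address below are illustrative, not necessarily what this PR sets:

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class AdminClientTimeoutExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");
        // Fail an individual request after 30s instead of hanging indefinitely.
        props.put(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        // Overall upper bound for blocking admin API calls.
        props.put(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, "30000");
        try (AdminClient admin = AdminClient.create(props)) {
            // With these settings, calls against an unreachable broker fail
            // after ~30s instead of blocking the roller for much longer.
            admin.describeCluster();
        }
    }
}
```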
Checklist