
Fix KR issues #3721

Merged
merged 9 commits into from Oct 8, 2020

Conversation

sknot-rh
Member

@sknot-rh sknot-rh commented Sep 29, 2020

Signed-off-by: Stanislav Knot <sknot@redhat.com>

Type of change

  • Bugfix

Description

Fixes #3710
All credit to @tombentley.
Set timeouts on the AdminClient;
check whether the pod is in CrashLoopBackOff.
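
A minimal sketch of the timeout part, assuming the standard Kafka Admin client configuration properties (the bootstrap address variable and the 30-second value are placeholders, not taken from this PR):

import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

Properties config = new Properties();
config.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapHostnames);
// Bound each request and each API call so an unreachable broker fails fast
// instead of blocking the roller for the whole operation timeout.
config.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
config.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, "30000");
Admin adminClient = Admin.create(config);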

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@sknot-rh sknot-rh marked this pull request as draft September 29, 2020 14:40
Member

@tombentley tombentley left a comment

We need a test for this (ideally a regression test in the ST). And probably some test coverage in the KafkaRollerTest too.

@tombentley
Member

We also need to fix the logging part of the problem mentioned in #3710.

@sknot-rh
Member Author

sknot-rh commented Oct 1, 2020

@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST

@Frawless
Member

Frawless commented Oct 1, 2020

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: regression
EXCLUDED_GROUPS: networkpolicies,flaky
TEST_CASE: io.strimzi.systemtest.rollingupdate.KafkaRollerST
TOTAL: 1
PASS: 1
FAIL: 0
SKIP: 0
BUILD_NUMBER: 1588
BUILD_ENV: oc cluster up

if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
if (isCrashlooping(pod)) {
Member

Out of interest, why was it necessary to add it here? Just so that we don't spend ~7 minutes before we restart it?

Member Author

I am not saying it was strictly necessary, but without it the timeout was still 20 minutes. And it makes sense: if we want to roll a pod which is crash-looping, it is probably not possible to connect to it, so checking this at the beginning of the algorithm seems like a good idea.
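
As a rough sketch of that reasoning (method and variable names are the ones from the diff above; the exact placement inside KafkaRoller may differ):

// Sketch only: a crash-looping broker cannot be connected to, so skip the AdminClient
// connection attempt and force the restart immediately instead of waiting for a timeout.
if (isCrashlooping(pod)) {
    log.debug("{}: Pod {} is crash-looping, restarting it without connecting to the broker", reconciliation, podId);
    restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
    return;
}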

@Frawless
Member

Frawless commented Oct 1, 2020

@stanlyDoge I was looking into the failures here and for now all of them are connected to testHostAliases, which should be skipped - #3737.

@sknot-rh
Member Author

sknot-rh commented Oct 1, 2020

@stanlyDoge I was looking into the failures here and for now all of them are connected to testHostAliases, which should be skipped - #3737.

Thanks for the confirmation.

Member

@tombentley tombentley left a comment

Thanks Standa.

@sknot-rh sknot-rh marked this pull request as ready for review October 2, 2020 07:52
if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
if (isCrashlooping(pod)) {
Member

What if the pod is not crash-looping but creating or pending because of some other misconfiguration (missing volumes, badly configured resources, ...)? Does the rolling update work fine in that case even with this check looking just for crash-looping pods?

Member

How did you address this? This still seems to do something only when the pod is crash-looping, but not when it gets stuck in other states such as Pending, ContainerCreating, ImagePullBackOff, etc.
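
Purely as an illustration of that concern (none of the following is in the PR's diff; java.util.Arrays and java.util.List assumed): the waiting-state check could be widened to cover other stuck reasons, while Pending would need a separate pod-phase check because it is a pod phase rather than a container waiting reason.

// Illustrative only: treat several container waiting reasons as "stuck", not just CrashLoopBackOff.
private static final List<String> STUCK_WAITING_REASONS =
        Arrays.asList("CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "ContainerCreating");

private boolean isStuck(Pod pod) {
    return STUCK_WAITING_REASONS.stream()
            .anyMatch(reason -> podWaitingBecauseOfReason(pod, reason));
}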

@@ -122,6 +125,30 @@ void testKafkaTopicRFLowerThanMinInSyncReplicas() {
assertThat(StatefulSetUtils.ssSnapshot(kafkaName), is(not(kafkaPods)));
}

@Test
void testKafkaPodCrashLooping() throws InterruptedException {
Member

Does this actually work? Why should it decide to roll the pod? In your test the configuration is up to date, so I would not expect it to roll the pod. I think the test should do the following:

  1. Deploy the Kafka cluster with UseParNewGC
  2. Wait for it to crash loop
  3. Remove UseParNewGC from the config
  4. Wait for it to roll
  5. Wait for it to be ready

With the test as it is written right now, I would not expect the pods to be rolled at all since there was no change to anything. It should not roll them just because they are crash-looping ... but if it needs to roll them and they are crash-looping, it should not wait for a successful connection to the broker.
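
A rough outline of that flow (only StatefulSetUtils.ssSnapshot and the final assertion come from the diff in this PR; the waiting steps are left as comments because the concrete test utilities are not shown here, and the snapshot is assumed to be the pod-name-to-revision map used by the existing test):

// Outline of the suggested regression test, not the final implementation.
// 1. Deploy the Kafka cluster with a JVM option that makes the broker crash, e.g. -XX:+UseParNewGC on a JDK 11 image.
// 2. Wait until the kafka pods report CrashLoopBackOff, then take a snapshot of the pods.
Map<String, String> kafkaPods = StatefulSetUtils.ssSnapshot(kafkaName);
// 3. Remove UseParNewGC from the Kafka CR so the operator has a real reason to roll the pods.
// 4. Wait for the rolling update to replace every pod in the snapshot.
// 5. Wait for the Kafka cluster to become ready again, then verify the pods were rolled.
assertThat(StatefulSetUtils.ssSnapshot(kafkaName), is(not(kafkaPods)));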

Member Author

I see your points. Let me retest it properly.

Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
@sknot-rh
Member Author

sknot-rh commented Oct 6, 2020

@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST

@tombentley
Member

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@strimzi-ci

Systemtests Failed (no tests results are present)

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: regression
EXCLUDED_GROUPS: networkpolicies,flaky
TEST_CASE: io.strimzi.systemtest.rollingupdate.KafkaRollerST
TOTAL: 3
PASS: 3
FAIL: 0
SKIP: 0
BUILD_NUMBER: 1601
BUILD_ENV: oc cluster up

Signed-off-by: Stanislav Knot <sknot@redhat.com>
@sknot-rh
Member Author

sknot-rh commented Oct 7, 2020

@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST

Signed-off-by: Stanislav Knot <sknot@redhat.com>
@scholzj
Member

scholzj commented Oct 7, 2020

@tombentley I think this changed a bit since your approval ... maybe it would be good if you could at least quickly look through it?

@strimzi-ci

❌ Test Summary ❌

TEST_PROFILE: regression
EXCLUDED_GROUPS: networkpolicies,flaky
TEST_CASE: io.strimzi.systemtest.rollingupdate.KafkaRollerST
TOTAL: 1
PASS: 0
FAIL: 1
SKIP: 0
BUILD_NUMBER: 1603
BUILD_ENV: oc cluster up

Re-run command:
@strimzi-ci run tests false profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST#testKafkaTopicRFLowerThanMinInSyncReplicas

@scholzj
Member

scholzj commented Oct 7, 2020

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Member

scholzj commented Oct 7, 2020

/azp run acceptance

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@tombentley tombentley left a comment

A few nits here, but I'm happy with the overall change to the error handling.

private boolean podWaitingBecauseOfReason(Pod pod, String reason) {
    if (pod != null && pod.getStatus() != null) {
        List<ContainerStatus> kafkaContainerStatus = pod.getStatus().getContainerStatuses().stream()
                .filter(containerStatus -> containerStatus.getName().equals("kafka"))
                .collect(Collectors.toList());
        if (kafkaContainerStatus.size() > 0) {
Member

If you're just interested in getting the first reason, you should use findFirst rather than collecting into a List. But is it correct that the container status we're looking for will always be the first in the list?

Member Author

From the tests, yes.

Member

That doesn't really mean that there does not exist a real-world circumstance where there's some other status which appears first, just that we've not run into such a circumstance yet.

Member Author

What do you suggest?

Member

I guess you can go through the whole list and check whether it contains the condition, instead of checking just the first entry?
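
A minimal sketch of that suggestion, checking every matching container status rather than relying on the order of the list (fabric8 Pod model assumed):

private boolean podWaitingBecauseOfReason(Pod pod, String reason) {
    if (pod == null || pod.getStatus() == null || pod.getStatus().getContainerStatuses() == null) {
        return false;
    }
    // Walk all container statuses named "kafka" instead of only the first match,
    // so the result does not depend on the ordering of the status list.
    return pod.getStatus().getContainerStatuses().stream()
            .filter(containerStatus -> "kafka".equals(containerStatus.getName()))
            .anyMatch(containerStatus -> containerStatus.getState() != null
                    && containerStatus.getState().getWaiting() != null
                    && reason.equals(containerStatus.getState().getWaiting().getReason()));
}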

Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
@scholzj
Member

scholzj commented Oct 8, 2020

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Successfully merging this pull request may close these issues.

[Bug] Rolling update never finishes when the pods are not ready
5 participants