
Fix KR issues #3721

Merged
merged 9 commits into from Oct 8, 2020

Conversation

sknot-rh
Member

@sknot-rh sknot-rh commented Sep 29, 2020

Signed-off-by: Stanislav Knot <sknot@redhat.com>

Type of change

  • Bugfix

Description

Fixes #3710
All credit to @tombentley.
Set timeouts on the AdminClient;
check whether the pod is in CrashLoopBackOff.
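
A minimal sketch of the timeout part, assuming the standard Kafka Admin client configuration properties (the bootstrap address variable and the 30-second value are placeholders, not taken from this PR):

import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

Properties config = new Properties();
config.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapHostnames);
// Bound each request and each API call so an unreachable broker fails fast
// instead of blocking the roller for the whole operation timeout.
config.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
config.setProperty(AdminClientConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, "30000");
Admin adminClient = Admin.create(config);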

Checklist

  • Write tests
  • Make sure all tests pass
  • Update documentation
  • Check RBAC rights for Kubernetes / OpenShift roles
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md
  • Supply screenshots for visual changes, such as Grafana dashboards

@sknot-rh sknot-rh marked this pull request as draft September 29, 2020 14:40
Member

@tombentley tombentley left a comment

We need a test for this (ideally a regression test in the ST). And probably some test coverage in the KafkaRollerTest too.

@tombentley
Member

We also need to fix the logging part of the problem mentioned in #3710.

@sknot-rh
Member Author

sknot-rh commented Oct 1, 2020

@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST

@Frawless
Member

Frawless commented Oct 1, 2020

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: regression
EXCLUDED_GROUPS: networkpolicies,flaky
TEST_CASE: io.strimzi.systemtest.rollingupdate.KafkaRollerST
TOTAL: 1
PASS: 1
FAIL: 0
SKIP: 0
BUILD_NUMBER: 1588
BUILD_ENV: oc cluster up

if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
if (isCrashlooping(pod)) {
Member

Out of interest, why was it necessary to add it here? Just so that we don't spend ~7 minutes before we restart it?

Member Author

I am not saying it was strictly necessary, but without it the timeout was still 20 minutes. And it makes sense: if we want to roll a pod which is crash-looping, it is probably not possible to connect to it, so checking this at the beginning of the algorithm seems like a good idea.
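
As a rough sketch of that reasoning (method and variable names are the ones from the diff above; the exact placement inside KafkaRoller may differ):

// Sketch only: a crash-looping broker cannot be connected to, so skip the AdminClient
// connection attempt and force the restart immediately instead of waiting for a timeout.
if (isCrashlooping(pod)) {
    log.debug("{}: Pod {} is crash-looping, restarting it without connecting to the broker", reconciliation, podId);
    restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
    return;
}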

@Frawless
Member

Frawless commented Oct 1, 2020

@stanlyDoge I was looking into the failures here and for now all of them are connected to testHostAliases, which should be skipped - #3737.

@sknot-rh
Member Author

sknot-rh commented Oct 1, 2020

@stanlyDoge I was looking into the failures here and for now all of them are connected to testHostAliases, which should be skipped - #3737.

Thanks for the confirmation.

Member

@tombentley tombentley left a comment

Thanks Standa.

@sknot-rh sknot-rh marked this pull request as ready for review October 2, 2020 07:52
if (!maybeDynamicUpdateBrokerConfig(podId, restartPlan)) {
log.debug("{}: Pod {} can be rolled now", reconciliation, podId);
restartAndAwaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS);
if (isCrashlooping(pod)) {
Member

What if the pod is not crash-looping but creating or pending because of some other misconfiguration (missing volumes, badly configured resources, ...)? Does the rolling update work fine in that case even with this check looking just for crash-looping pods?

Member

How did you address this? This still seems to do something only when the pod is crash-looping, but not when it gets stuck in other states such as Pending, ContainerCreating, ImagePullBackOff, etc.
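
Purely as an illustration of that concern (none of the following is in the PR's diff; java.util.Arrays and java.util.List assumed): the waiting-state check could be widened to cover other stuck reasons, while Pending would need a separate pod-phase check because it is a pod phase rather than a container waiting reason.

// Illustrative only: treat several container waiting reasons as "stuck", not just CrashLoopBackOff.
private static final List<String> STUCK_WAITING_REASONS =
        Arrays.asList("CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "ContainerCreating");

private boolean isStuck(Pod pod) {
    return STUCK_WAITING_REASONS.stream()
            .anyMatch(reason -> podWaitingBecauseOfReason(pod, reason));
}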

@@ -122,6 +125,30 @@ void testKafkaTopicRFLowerThanMinInSyncReplicas() {
assertThat(StatefulSetUtils.ssSnapshot(kafkaName), is(not(kafkaPods)));
}

@Test
void testKafkaPodCrashLooping() throws InterruptedException {
Member

Does this actually work? Why should it decide to roll the pod? In your test the configuration is up to date, so I would not expect it to roll the pod. I think the test should do the following:

  1. Deploy the Kafka cluster with UseParNewGC
  2. Wait for it to crash loop
  3. Remove UseParNewGC from the config
  4. Wait for it to roll
  5. Wait for it to be ready

With the test as it is written right now, I would not expect the pods to be rolled at all since there was no change to anything. It should not roll them just because they are crash-looping ... but if it needs to roll them and they are crash-looping, it should not wait for a successful connection to the broker.
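
A rough outline of that flow (only StatefulSetUtils.ssSnapshot and the final assertion come from the diff in this PR; the waiting steps are left as comments because the concrete test utilities are not shown here, and the snapshot is assumed to be the pod-name-to-revision map used by the existing test):

// Outline of the suggested regression test, not the final implementation.
// 1. Deploy the Kafka cluster with a JVM option that makes the broker crash, e.g. -XX:+UseParNewGC on a JDK 11 image.
// 2. Wait until the kafka pods report CrashLoopBackOff, then take a snapshot of the pods.
Map<String, String> kafkaPods = StatefulSetUtils.ssSnapshot(kafkaName);
// 3. Remove UseParNewGC from the Kafka CR so the operator has a real reason to roll the pods.
// 4. Wait for the rolling update to replace every pod in the snapshot.
// 5. Wait for the Kafka cluster to become ready again, then verify the pods were rolled.
assertThat(StatefulSetUtils.ssSnapshot(kafkaName), is(not(kafkaPods)));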

Member Author

I see your points. Let me retest it properly.

Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
@sknot-rh
Member Author

sknot-rh commented Oct 6, 2020

@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST

@tombentley
Member

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@strimzi-ci

Systemtests Failed (no tests results are present)

@strimzi-ci

✔️ Test Summary ✔️

TEST_PROFILE: regression
EXCLUDED_GROUPS: networkpolicies,flaky
TEST_CASE: io.strimzi.systemtest.rollingupdate.KafkaRollerST
TOTAL: 3
PASS: 3
FAIL: 0
SKIP: 0
BUILD_NUMBER: 1601
BUILD_ENV: oc cluster up

Signed-off-by: Stanislav Knot <sknot@redhat.com>
@sknot-rh
Member Author

sknot-rh commented Oct 7, 2020

@strimzi-ci run tests profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST

Signed-off-by: Stanislav Knot <sknot@redhat.com>
@scholzj
Member

scholzj commented Oct 7, 2020

@tombentley I think this changed a bit since your approval ... maybe it would be good if you could at least quickly look through it?

@strimzi-ci

❌ Test Summary ❌

TEST_PROFILE: regression
EXCLUDED_GROUPS: networkpolicies,flaky
TEST_CASE: io.strimzi.systemtest.rollingupdate.KafkaRollerST
TOTAL: 1
PASS: 0
FAIL: 1
SKIP: 0
BUILD_NUMBER: 1603
BUILD_ENV: oc cluster up

Re-run command:
@strimzi-ci run tests false profile=regression testcase=io.strimzi.systemtest.rollingupdate.KafkaRollerST#testKafkaTopicRFLowerThanMinInSyncReplicas

@scholzj
Member

scholzj commented Oct 7, 2020

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Member

scholzj commented Oct 7, 2020

/azp run acceptance

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Member

@tombentley tombentley left a comment

A few nits here, but I'm happy with the overall change to the error handling.

private boolean podWaitingBecauseOfReason(Pod pod, String reason) {
    if (pod != null && pod.getStatus() != null) {
        List<ContainerStatus> kafkaContainerStatus = pod.getStatus().getContainerStatuses().stream()
                .filter(containerStatus -> containerStatus.getName().equals("kafka"))
                .collect(Collectors.toList());
        if (kafkaContainerStatus.size() > 0) {
Member

If you're just interested in getting the first reason, you should use findFirst rather than collecting into a List. But is it correct that the container status we're looking for will always be the first in the list?

Member Author

From the tests, yes.

Member

That doesn't really mean that there does not exist a real-world circumstance where there's some other status which appears first, just that we've not run into such a circumstance yet.

Member Author

What do you suggest?

Member

I guess you can go through the whole list and check whether it contains the condition, instead of checking just the first entry?
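
A minimal sketch of that suggestion, checking every matching container status rather than relying on the order of the list (fabric8 Pod model assumed):

private boolean podWaitingBecauseOfReason(Pod pod, String reason) {
    if (pod == null || pod.getStatus() == null || pod.getStatus().getContainerStatuses() == null) {
        return false;
    }
    // Walk all container statuses named "kafka" instead of only the first match,
    // so the result does not depend on the ordering of the status list.
    return pod.getStatus().getContainerStatuses().stream()
            .filter(containerStatus -> "kafka".equals(containerStatus.getName()))
            .anyMatch(containerStatus -> containerStatus.getState() != null
                    && containerStatus.getState().getWaiting() != null
                    && reason.equals(containerStatus.getState().getWaiting().getReason()));
}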

Signed-off-by: Stanislav Knot <sknot@redhat.com>
Signed-off-by: Stanislav Knot <sknot@redhat.com>
@scholzj
Member

scholzj commented Oct 8, 2020

/azp run regression

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Successfully merging this pull request may close these issues.

[Bug] Rolling update never finishes when the pods are not ready
5 participants