TO loop #1617

tombentley · 2019-05-13T11:16:49Z

Type of change

Bugfix

Description

It was occasionally observed that the TO could get stuck performing reconciliations forever when topics were created or modified while the TO was starting up. The logging was insufficient to prove conclusively what the problem was, but I realised that the timed reconciliation code wasn't performing all of its work with the per-topic lock held, which could certainly allow a race between the initial reconciliation and watch-based reconciliation.

The commits in this PR do three things:

Improves the logging
Changes the old home-brew mutual exclusion mechanism (InFlight) to use Vert.x locks, removing some rather intricate code that was only lightly tested in favour of something more easily reasoned about.
Changes the scope of the locks used for timed reconciliation to include the "reads" as well as the writes. This should mean that the timed reconciliation code runs mutually exclusively with the watch-based reconciliations.

Checklist

Make sure all tests pass
Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally

During initial and timed reconciliations we were only taking the topic lock after some of the remote resources had been fetched. This created a race between the timed reconciliation and the reconciliation of any topic touched immediately after the TO was started, so the TO could get into a state where reconciliations of a given topic proceeded endlessly. The topic lock is now acquired before reading *any* of the remote resources required for a reconciliation.
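To illustrate the change in lock scope described above, here is a minimal sketch in plain JDK Java. It is not the TO's code: the class `LockedReconcileSketch`, the in-memory `remoteState` map, and the blocking `ReentrantLock` are all assumptions made for illustration (the PR itself uses Vert.x locks, which are acquired asynchronously). The point is only that the read happens after the per-topic lock is taken, so a concurrent reconciliation of the same topic cannot act on a stale read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.IntUnaryOperator;

final class LockedReconcileSketch {
    // One lock per topic name (hypothetical; the PR uses Vert.x locks, not JDK locks).
    private final Map<String, ReentrantLock> topicLocks = new ConcurrentHashMap<>();
    // In-memory stand-in for the remote resources read during a reconciliation.
    private final Map<String, Integer> remoteState = new ConcurrentHashMap<>();

    /** Reads and then updates the state of one topic, all with the topic lock held. */
    int reconcile(String topicName, IntUnaryOperator change) {
        ReentrantLock lock = topicLocks.computeIfAbsent(topicName, name -> new ReentrantLock());
        lock.lock(); // acquired BEFORE any "remote" read
        try {
            int current = remoteState.getOrDefault(topicName, 0); // the "read"
            int updated = change.applyAsInt(current);
            remoteState.put(topicName, updated);                  // the "write"
            return updated;
        } finally {
            lock.unlock();
        }
    }

    /** Returns the current state for a topic, or 0 if it has never been reconciled. */
    int currentState(String topicName) {
        return remoteState.getOrDefault(topicName, 0);
    }
}
```

If the read were moved above `lock.lock()`, two callers could both read the same value and one update would be lost, which is the kind of interleaving the commit message describes.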

tombentley · 2019-05-20T10:06:30Z

@strimzi-ci run tests

ppatierno

Just a couple of nits. Did you test this one somehow?

ppatierno · 2019-05-20T15:02:58Z

topic-operator/src/main/java/io/strimzi/operator/topic/Session.java

-                    HttpServer healthServer = this.healthServer;
-                    if (healthServer != null) {
-                        healthServer.close();
+                    long timeoutMs = Math.max(1, deadline - System.currentTimeMillis());


Why this adjustment? Did you notice any strange deadline - System.currentTimeMillis() value?

I observed an exception caused by a negative timeout, and I know from experience that a 0 argument also causes an exception.
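The clamp in the diff above can be sketched in isolation as follows. The class and method names (`TimeoutClamp`, `remainingTimeoutMs`) are hypothetical and the current time is passed in as a parameter purely so the behaviour is easy to check; only the `Math.max(1, deadline - now)` expression comes from the PR.

```java
final class TimeoutClamp {
    /**
     * Time remaining until the deadline, in milliseconds, but never less than 1.
     * Once the deadline has passed (or is exactly now) the subtraction yields a
     * negative or zero value, so it is clamped to the smallest positive timeout.
     */
    static long remainingTimeoutMs(long deadlineMs, long nowMs) {
        return Math.max(1, deadlineMs - nowMs);
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 50;
        // Prints a value between 1 and 50 depending on how much time has elapsed.
        System.out.println(remainingTimeoutMs(deadline, System.currentTimeMillis()));
    }
}
```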

ppatierno · 2019-05-20T15:11:06Z

topic-operator/src/main/java/io/strimzi/operator/topic/TopicOperator.java

+     * When the given {@code action} is complete it must complete its argument future,
+     * which will complete the returned future
+     */
+    public <T> Future<T> acquireLock(TopicName key, Handler<Future<T>> action) {


About the name: it doesn't just acquire a lock, it also adds the action to the queue. Maybe just enqueueAction? It's clear from the code that it acquires a lock to do that, so maybe that doesn't need to be reflected in the name?

I agree acquireLock is a terrible name, but I couldn't think of a better one. How about executeWithTopicLockHeld, which does exactly what it says, even though it's a bit long. Wdyt?
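For readers unfamiliar with the contract being named here, the following is a rough sketch of an "execute with the topic lock held" method using only JDK types. It is an assumption-laden illustration, not the PR's implementation: `CompletableFuture` and `Consumer` stand in for the Vert.x `Future` and `Handler` types in the real signature, a `String` stands in for `TopicName`, and the class name `TopicLocks` is invented. As in the Javadoc quoted above, the action must complete its argument future, which completes the returned future and lets the next queued action for the same topic run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

final class TopicLocks {
    // For each topic, a signal that completes when the most recently queued action has finished.
    private final ConcurrentHashMap<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();

    <T> CompletableFuture<T> executeWithTopicLockHeld(String topicName,
                                                      Consumer<CompletableFuture<T>> action) {
        CompletableFuture<T> result = new CompletableFuture<>();
        // Completes normally once the action has finished, whether it succeeded or failed.
        CompletableFuture<Void> released = result.handle((value, error) -> null);
        // Atomically become the new tail of this topic's queue and learn who was ahead of us.
        CompletableFuture<Void> previous = tails.put(topicName, released);
        if (previous == null) {
            previous = CompletableFuture.completedFuture(null);
        }
        // Run the action only after the previous holder has released the lock.
        previous.thenRun(() -> {
            try {
                action.accept(result);
            } catch (RuntimeException e) {
                result.completeExceptionally(e); // release the lock even if the action throws
            }
        });
        // Drop the map entry if nobody queued up behind this action.
        released.thenRun(() -> tails.remove(topicName, released));
        return result;
    }
}
```

In this sketch a queued action runs on whichever thread completes the previous action's future; actions for different topics do not wait on each other.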

strimzi-ci · 2019-05-20T15:53:51Z

Test Failures

testMirrorMakerTlsAuthenticated in io.strimzi.systemtest.MirrorMakerST
testMirrorMakerTlsScramSha in io.strimzi.systemtest.MirrorMakerST
testMirrorMaker in io.strimzi.systemtest.MirrorMakerST

tombentley · 2019-05-21T13:51:41Z

Testing was mostly via the TopicOperatorIT, which uses a real Kafka cluster (thanks to the Debezium test API) and a real Kubernetes cluster for the CR, but the TO itself is not running inside a container.

tombentley · 2019-05-21T13:55:37Z

Note: The Jenkins failures are unrelated to this PR.

tombentley added 4 commits May 2, 2019 19:02

Fix/improve logging around ZK watches

10e6245

Rewrite InFlight to use vertx locks, rather than homebrew queues

b0c5f0a

Inline InFlight into TopicOperator

6948aab

tombentley requested a review from ppatierno May 13, 2019 11:16

tombentley changed the title ~~To loop~~ TO loop May 13, 2019

ppatierno approved these changes May 20, 2019

Rename method

d1db6e4

tombentley merged commit 875bd0d into master May 23, 2019

tombentley deleted the TO-loop branch May 23, 2019 17:42

tombentley added this to the 0.12.0 milestone May 23, 2019
