Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TO loop #1617

Merged
merged 5 commits into from May 23, 2019
Merged

TO loop #1617

merged 5 commits into from May 23, 2019

Conversation

tombentley
Copy link
Member

Type of change

  • Bugfix

Description

It was occasionally observed that the TO could get stuck performing reconciliations forever when topics were created/modified simultaneously with TO startup. The logging was insufficient to prove conclusively what the problem was, but I realised that the timed reconciliation code wasn't performing all of its work with the per-topic lock held, which could certainly provide for an initial race between the initial reconciliation and watch-based reconciliation.

The commits in this PR do three things:

  1. Improves the logging
  2. Changes the old home-brew mutual exclusion mechanism (InFlight) to using Vertx locks, removing some rather intricate code that was only lightly tested in favour of something more easily reasoned about.
  3. Changes the scope of the locks used for timed reconciliation to include the "reads" as well as the writes. This should mean that the timed reconciliation code will run exclusively from the watch-based reconciliations.

Checklist

  • Make sure all tests pass
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally

During initial and timed reconciliations we were only taking the topic lock
after some of the remote resources had been fetched.
This meant there was a race between the timed reconciliation and any topics
touched immediately after the TO was started.
This meant it was possible for the TO to get into a state where
reconciliations of a given topic proceeded endlessly.

The topic lock is now acquired before reading *any* of the remote resources
required for a reconciliation.
@tombentley tombentley requested a review from ppatierno May 13, 2019 11:16
@tombentley tombentley changed the title To loop TO loop May 13, 2019
@tombentley
Copy link
Member Author

@strimzi-ci run tests

Copy link
Member

@ppatierno ppatierno left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of nits. Did you test this one somehow?

HttpServer healthServer = this.healthServer;
if (healthServer != null) {
healthServer.close();
long timeoutMs = Math.max(1, deadline - System.currentTimeMillis());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this adjustment? did you notice any strange deadline - System.currentTimeMillis() value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I observed an exception caused by a negative timeout, and I know from experience that a 0 argument also causes an exception.

* When the given {@code action} is complete it must complete its argument future,
* which will complete the returned future
*/
public <T> Future<T> acquireLock(TopicName key, Handler<Future<T>> action) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

about the name, it doesn't just acquire a lock but even add the action to do in the queue. Maybe just an enqueueAction ? From the code it's clear it acquire a lock for doing that maybe not need to reflect in the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree acquireLock is a terrible name, but I couldn't think of a better one. How about executeWithTopicLockHeld, which does exactly what it says, even though it's a bit long. Wdyt?

@strimzi-ci
Copy link

Test Failures

  • testMirrorMakerTlsAuthenticated in io.strimzi.systemtest.MirrorMakerST
  • testMirrorMakerTlsScramSha in io.strimzi.systemtest.MirrorMakerST
  • testMirrorMaker in io.strimzi.systemtest.MirrorMakerST

@tombentley
Copy link
Member Author

Testing was mostly via the TopicOperatorIT, which uses a real Kafka cluster (thanks to the Debezium test API) and a real Kubernetes cluster for the CR, but the TO itself is not running inside a container.

@tombentley
Copy link
Member Author

Note: The Jenkins failures are unrelated to this PR.

@tombentley tombentley merged commit 875bd0d into master May 23, 2019
@tombentley tombentley deleted the TO-loop branch May 23, 2019 17:42
@tombentley tombentley added this to the 0.12.0 milestone May 23, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants