New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TO loop #1617
TO loop #1617
Conversation
During initial and timed reconciliations we were only taking the topic lock after some of the remote resources had been fetched. This meant there was a race between the timed reconciliation and any topics touched immediately after the TO was started. This meant it was possible for the TO to get into a state where reconciliations of a given topic proceeded endlessly. The topic lock is now acquired before reading *any* of the remote resources required for a reconciliation.
@strimzi-ci run tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of nits. Did you test this one somehow?
HttpServer healthServer = this.healthServer; | ||
if (healthServer != null) { | ||
healthServer.close(); | ||
long timeoutMs = Math.max(1, deadline - System.currentTimeMillis()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this adjustment? did you notice any strange deadline - System.currentTimeMillis()
value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I observed an exception caused by a negative timeout, and I know from experience that a 0 argument also causes an exception.
* When the given {@code action} is complete it must complete its argument future, | ||
* which will complete the returned future | ||
*/ | ||
public <T> Future<T> acquireLock(TopicName key, Handler<Future<T>> action) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
about the name, it doesn't just acquire a lock but even add the action to do in the queue. Maybe just an enqueueAction
? From the code it's clear it acquire a lock for doing that maybe not need to reflect in the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree acquireLock
is a terrible name, but I couldn't think of a better one. How about executeWithTopicLockHeld
, which does exactly what it says, even though it's a bit long. Wdyt?
Test Failures
|
Testing was mostly via the |
Note: The Jenkins failures are unrelated to this PR. |
Type of change
Description
It was occasionally observed that the TO could get stuck performing reconciliations forever when topics were created/modified simultaneously with TO startup. The logging was insufficient to prove conclusively what the problem was, but I realised that the timed reconciliation code wasn't performing all of its work with the per-topic lock held, which could certainly provide for an initial race between the initial reconciliation and watch-based reconciliation.
The commits in this PR do three things:
InFlight
) to using Vertx locks, removing some rather intricate code that was only lightly tested in favour of something more easily reasoned about.Checklist