New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Connect's dynamic log level changes #9752
Conversation
The loggers endpoint now supports the scope parameter, which can be set to cluster in order to change level in all nodes at once. This change simply adds this parameter that works since Kafka 3.7, and causes no harm to previous versions. This should close strimzi#9067. Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
I guess the test failure is related to this? |
Interestingly enough when running the whole test suite I get 500, running just the single test gives me another error but with 204. |
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Test fixed. With cluster wide logging level update the response code changes to 204 (no content). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, it is not that simple ... how will this work with Kafka 3.6? I assume the new parameter will be ignored which is fine, but it will give you still HTTP 200, or? So you need to check for both / or check the whole range of HTTP OK codes, or?
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Right, we still need to support both. Thanks. |
@fvaleri I see the test failing because the 500 error. |
Yeah, that is what I pointed out -> the test alone works fine with 204, but when the whole test suite is run it gets 500. |
I'll look into that, but it works fine on my end. Do you also see 500 on your local env? |
@fvaleri I did saw 500 locally yesterday when running the whole |
Ok, was able to reproduce locally, but it doesn't happen all the time. There is some flakiness here. |
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
That was too easy, indeed. The problem is that, unlike the "worker" scope, the "cluster" scope request writes the logging level configuration into the internal -config topic, in order to propagate this information to the other workers. To do this, it uses a KafkaBasedLog instance, which includes a Kafka producer initialized to null on instance creation. This producer is created lazily when the service starts. When we start the test Connect instance, services like the ConfigBackingStore, which uses the KafkaBasedLog, are started asynchronously on a different thread. This means that, when our cluster wide log level configuration request arrives, the producer may not have been created yet, which results in the error "IllegalStateException: This KafkaBasedLog was created in read-only mode and does not support write operations". This is fixed by using the Connect.isRunning() method to wait for all services initialization before running the tests. I also found a second source of flakiness, this time on task restart test. This is due to "ConnectRestException: Cannot complete request momentarily due to no known leader URL, likely because a rebalance was underway". This may happens when there is a rebalance in progress, as the error suggests, and confirmed by observing the log. The workaround here is to only use one worker node for this test, which I think is fine as the focus is on the REST API here. After these changes, I ran KafkaConnectApiIT and KafkaConnectorIT many times without issues. Let me know if you also see the same. |
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM assuming the tests pass.
The loggers endpoint now supports the scope parameter, which can be set to cluster in order to change level in all nodes at once. This change simply adds this parameter that works since Kafka 3.7, and causes no harm to previous versions.
This should close #9067.