Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Connect's dynamic log level changes #9752

Merged
merged 4 commits into from Mar 4, 2024

Conversation

fvaleri
Copy link
Contributor

@fvaleri fvaleri commented Feb 28, 2024

The loggers endpoint now supports the scope parameter, which can be set to cluster in order to change level in all nodes at once. This change simply adds this parameter that works since Kafka 3.7, and causes no harm to previous versions.

This should close #9067.

The loggers endpoint now supports the scope parameter, which can be set to cluster in order to change level in all nodes at once.
This change simply adds this parameter that works since Kafka 3.7, and causes no harm to previous versions.

This should close strimzi#9067.

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
@fvaleri fvaleri added this to the 0.40.0 milestone Feb 28, 2024
@scholzj
Copy link
Member

scholzj commented Feb 28, 2024

I guess the test failure is related to this?

@scholzj
Copy link
Member

scholzj commented Feb 28, 2024

Interestingly enough when running the whole test suite I get 500, running just the single test gives me another error but with 204.

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
@fvaleri
Copy link
Contributor Author

fvaleri commented Feb 29, 2024

Test fixed. With cluster wide logging level update the response code changes to 204 (no content).

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, it is not that simple ... how will this work with Kafka 3.6? I assume the new parameter will be ignored which is fine, but it will give you still HTTP 200, or? So you need to check for both / or check the whole range of HTTP OK codes, or?

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
@fvaleri
Copy link
Contributor Author

fvaleri commented Feb 29, 2024

So you need to check for both / or check the whole range of HTTP OK codes, or?

Right, we still need to support both. Thanks.

@ppatierno
Copy link
Member

@fvaleri I see the test failing because the 500 error.

@scholzj
Copy link
Member

scholzj commented Feb 29, 2024

Yeah, that is what I pointed out -> the test alone works fine with 204, but when the whole test suite is run it gets 500.

@fvaleri
Copy link
Contributor Author

fvaleri commented Feb 29, 2024

I'll look into that, but it works fine on my end. Do you also see 500 on your local env?

@scholzj
Copy link
Member

scholzj commented Feb 29, 2024

@fvaleri I did saw 500 locally yesterday when running the whole KafkaConnectApiIT suit from IDE. And 204 when running only the single KafkaConnectApiIT.testChangeLoggers test.

@fvaleri
Copy link
Contributor Author

fvaleri commented Mar 1, 2024

Ok, was able to reproduce locally, but it doesn't happen all the time. There is some flakiness here.

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
@fvaleri
Copy link
Contributor Author

fvaleri commented Mar 1, 2024

That was too easy, indeed.


The problem is that, unlike the "worker" scope, the "cluster" scope request writes the logging level configuration into the internal -config topic, in order to propagate this information to the other workers. To do this, it uses a KafkaBasedLog instance, which includes a Kafka producer initialized to null on instance creation. This producer is created lazily when the service starts.

When we start the test Connect instance, services like the ConfigBackingStore, which uses the KafkaBasedLog, are started asynchronously on a different thread. This means that, when our cluster wide log level configuration request arrives, the producer may not have been created yet, which results in the error "IllegalStateException: This KafkaBasedLog was created in read-only mode and does not support write operations".

This is fixed by using the Connect.isRunning() method to wait for all services initialization before running the tests.


I also found a second source of flakiness, this time on task restart test. This is due to "ConnectRestException: Cannot complete request momentarily due to no known leader URL, likely because a rebalance was underway".

This may happens when there is a rebalance in progress, as the error suggests, and confirmed by observing the log. The workaround here is to only use one worker node for this test, which I think is fine as the focus is on the REST API here.


After these changes, I ran KafkaConnectApiIT and KafkaConnectorIT many times without issues. Let me know if you also see the same.

@scholzj
Copy link
Member

scholzj commented Mar 1, 2024

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming the tests pass.

@scholzj scholzj merged commit f17b8d3 into strimzi:main Mar 4, 2024
21 checks passed
@fvaleri fvaleri deleted the fix-connect-log branch March 4, 2024 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Connect's dynamic log level changes are only applied to one worker node
3 participants