Fix the UTO upgrade issue reported in #9470. #9474

Merged
merged 1 commit into from Dec 18, 2023

Conversation

Contributor

@fvaleri fvaleri commented Dec 18, 2023

This issue is caused by stale metadata on one or more brokers after a cluster restart (there is no risk of data loss). Using the reproducer, we can see that the UTO fails at 14:27:39 with UnknownTopicOrPartitionException (retriable), while one of the brokers first learns about my-topic at 14:27:44. The failure triggers the topic creation logic, which then fails with TopicExistsException because the topic already exists.

UTO log:

2023-12-17 14:27:39,55262 TRACE [kafka-admin-client-thread | strimzi-topic-operator-a93c1635-76c3-4c9f-b61f-68c1a6ac98c3] BatchingTopicController:754 - Admin.describeTopics([__strimzi_store_topic, strimzi.cruisecontrol.partitionmetricsamples, __strimzi-topic-operator-kstreams-topic-store-changelog, timer-topic, connect-cluster-status, strimzi.cruisecontrol.modeltrainingsamples, strimzi.cruisecontrol.metrics, my-topic, __consumer_offsets, connect-cluster-offsets]) failed with java.util.concurrent.CompletionException: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.

Broker log:

2023-12-17 14:27:44,209 TRACE [Broker id=1000] Cached leader info UpdateMetadataPartitionState(topicName='my-topic', partitionIndex=0, controllerEpoch=1, leader=2000, leaderEpoch=4, isr=[1001, 2000, 2001], zkVersion=7, replicas=[2000, 2001, 1001], offlineReplicas=[]) for partition my-topic-0 in response to UpdateMetadata request sent by controller 1001 epoch 2 with correlation id 0 (state.change.logger) [control-plane-kafka-request-handler-0]

I'm proposing to catch and ignore the TopicExistsException, which is also what the BTO does. If the topic was created by a third party before the UTO, the next reconciliation will try to revert any configuration drift.
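
To illustrate the idea, here is a minimal sketch of the catch-and-ignore pattern. It is not the code from this PR: the class name, the `Outcome` enum, and the `createIgnoringExists` helper are hypothetical, and `TopicExistsException` is a local stand-in for `org.apache.kafka.common.errors.TopicExistsException` so that the example compiles without the Kafka client on the classpath. The `CompletableFuture` argument stands for the result of an asynchronous create call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

final class TopicCreateSketch {

    // Assumption: local stand-in for org.apache.kafka.common.errors.TopicExistsException.
    static final class TopicExistsException extends RuntimeException {
        TopicExistsException(String message) {
            super(message);
        }
    }

    // Hypothetical result type, used here only to make the outcome observable.
    enum Outcome { CREATED, ALREADY_EXISTS }

    // Waits for the create result and treats "topic already exists" as success.
    // Any other failure is rethrown unchanged so that normal error handling applies.
    static Outcome createIgnoringExists(CompletableFuture<Void> createResult) {
        try {
            createResult.join();
            return Outcome.CREATED;
        } catch (CompletionException e) {
            if (e.getCause() instanceof TopicExistsException) {
                // The topic is already there (for example, the earlier describe
                // call hit a broker with stale metadata), so there is nothing to create.
                return Outcome.ALREADY_EXISTS;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Void> exists = new CompletableFuture<>();
        exists.completeExceptionally(new TopicExistsException("Topic 'my-topic' already exists."));
        System.out.println(createIgnoringExists(exists));
    }
}
```

In this sketch the already-existing topic is reported as a distinct outcome instead of being silently dropped, which leaves room for a later reconciliation to compare its configuration with the desired state.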

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
Member

@scholzj scholzj left a comment

Thanks @fvaleri

@scholzj scholzj added this to the 0.40.0 milestone Dec 18, 2023
Member

scholzj commented Dec 18, 2023

Note: This should be cherry-picked for 0.39 in case we do another RC.

Member

scholzj commented Dec 18, 2023

/azp run regression

Azure Pipelines successfully started running 1 pipeline(s).

Member

@ppatierno ppatierno left a comment

LGTM

@scholzj scholzj merged commit 391179f into strimzi:main Dec 18, 2023
21 checks passed
scholzj pushed a commit that referenced this pull request Dec 18, 2023
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
@scholzj scholzj modified the milestones: 0.40.0, 0.39.0 Dec 18, 2023
@fvaleri fvaleri deleted the fix-uto-upgrade branch December 18, 2023 19:13
Successfully merging this pull request may close these issues.

[UTO] Unidirectional Topic Operator does seem to cause disruption when upgrading from 0.38 with BTO
3 participants