[UTO] Unidirectional Topic Operator does seem to cause disruption when upgrading from 0.38 with BTO #9470

Closed
scholzj opened this issue Dec 15, 2023 · 1 comment · Fixed by #9474

Comments

@scholzj
Member

scholzj commented Dec 15, 2023

The Unidirectional Topic Operator (enabled by default in 0.39) seems to cause disruption when upgrading from 0.38 (with BTO enabled by default).

It seems to be reproducible with the following steps (4 times out of 4 tries):

  1. Deploy Strimzi 0.38 and enable KafkaNodePools
  2. Deploy the Kafka cluster with 3 ZooKeeper nodes, 2 node pools with 2 brokers each, and Cruise Control (CC) enabled (I used this YAML -> this is my usual YAML and nothing was modified specifically for this test, so I am not sure whether any option in it contributes to this)
  3. Once the Kafka cluster is running, deploy test clients:
    kubectl apply -f https://raw.githubusercontent.com/scholzj/kafka-test-apps/main/deployment-ssl-client-auth.yaml
    
  4. Deploy Kafka Connect (I used this YAML -> again, this is my usual YAML and nothing was modified specifically for this test, so I am not sure whether any option in it contributes to this; to reuse it, replace the Docker image and the push secret with your own)
  5. Once it is all running, upgrade to 0.39 and observe the upgrade

Expected behavior:

  • Everything upgrades fine, and no custom resource transitions to an error state

Actual behavior:

  • Some KafkaTopic resources transition to an error state until they recover a few minutes later.
  • When observing the topics, I see this:
    k get kt -w
    NAME                                                                                               CLUSTER      PARTITIONS   REPLICATION FACTOR   READY
    connect-cluster-configs                                                                            my-cluster   1            3                    True
    connect-cluster-offsets                                                                            my-cluster   25           3                    True
    connect-cluster-status                                                                             my-cluster   5            3                    True
    consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a                                        my-cluster   50           3                    True
    kafka-test-apps                                                                                    my-cluster   3            3                    True
    strimzi-store-topic---effb8e3e057afce1ecf67c3f5d8e4e3ff177fc55                                     my-cluster   1            3                    True
    strimzi-topic-operator-kstreams-topic-store-changelog---b75e702040b99be8a9263134de3507fc0cc4017b   my-cluster   1            3                    True
    strimzi.cruisecontrol.metrics                                                                      my-cluster   1            3                    True
    strimzi.cruisecontrol.modeltrainingsamples                                                         my-cluster   32           2                    True
    strimzi.cruisecontrol.partitionmetricsamples                                                       my-cluster   32           2                    True
    timer-topic                                                                                        my-cluster   1            3                    True
    
    ^^^ Original topics created by BTO from Strimzi 0.38
    
    connect-cluster-configs                                                                            my-cluster   1            3                    True
    connect-cluster-configs                                                                            my-cluster   1            3                    False
    timer-topic                                                                                        my-cluster   1            3                    True
    strimzi.cruisecontrol.partitionmetricsamples                                                       my-cluster   32           2                    True
    strimzi.cruisecontrol.modeltrainingsamples                                                         my-cluster   32           2                    True
    strimzi.cruisecontrol.metrics                                                                      my-cluster   1            3                    True
    strimzi-topic-operator-kstreams-topic-store-changelog---b75e702040b99be8a9263134de3507fc0cc4017b   my-cluster   1            3                    True
    strimzi-store-topic---effb8e3e057afce1ecf67c3f5d8e4e3ff177fc55                                     my-cluster   1            3                    True
    kafka-test-apps                                                                                    my-cluster   3            3                    True
    consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a                                        my-cluster   50           3                    True
    connect-cluster-status                                                                             my-cluster   5            3                    True
    connect-cluster-offsets                                                                            my-cluster   25           3                    True
    connect-cluster-status                                                                             my-cluster   5            3                    False
    timer-topic                                                                                        my-cluster   1            3                    False
    consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a                                        my-cluster   50           3                    False
    strimzi-topic-operator-kstreams-topic-store-changelog---b75e702040b99be8a9263134de3507fc0cc4017b   my-cluster   1            3                    False
    kafka-test-apps                                                                                    my-cluster   3            3                    False
    connect-cluster-offsets                                                                            my-cluster   25           3                    False
    strimzi.cruisecontrol.metrics                                                                      my-cluster   1            3                    False
    strimzi-store-topic---effb8e3e057afce1ecf67c3f5d8e4e3ff177fc55                                     my-cluster   1            3                    False
    strimzi.cruisecontrol.partitionmetricsamples                                                       my-cluster   32           2                    False
    strimzi.cruisecontrol.modeltrainingsamples                                                         my-cluster   32           2                    False
    
    ^^^ Topics updated almost immediately after the UTO from 0.39 starts. It seems to first add the finalizer and then set the resource to error.
    
    timer-topic                                                                                        my-cluster   1            3                    True
    connect-cluster-configs                                                                            my-cluster   1            3                    True
    connect-cluster-status                                                                             my-cluster   5            3                    True
    consumer-offsets---84e7a678d08f4bd226872e5cdd4eb527fadc1c6a                                        my-cluster   50           3                    True
    strimzi-store-topic---effb8e3e057afce1ecf67c3f5d8e4e3ff177fc55                                     my-cluster   1            3                    True
    strimzi.cruisecontrol.metrics                                                                      my-cluster   1            3                    True
    strimzi.cruisecontrol.modeltrainingsamples                                                         my-cluster   32           2                    True
    connect-cluster-offsets                                                                            my-cluster   25           3                    True
    kafka-test-apps                                                                                    my-cluster   3            3                    True
    strimzi.cruisecontrol.partitionmetricsamples                                                       my-cluster   32           2                    True
    strimzi-topic-operator-kstreams-topic-store-changelog---b75e702040b99be8a9263134de3507fc0cc4017b   my-cluster   1            3                    True
    
    ^^^ Fixed after a few (2?) minutes, probably by the next periodic reconciliation?
    
  • The watch of the topics with full YAMLs can be found here: https://gist.github.com/scholzj/06b6745b2b58f311f2b553d6a4e6ccf4#file-watch-topics-txt
  • The DEBUG log from UTO is here: https://gist.githubusercontent.com/scholzj/06b6745b2b58f311f2b553d6a4e6ccf4/raw/d7e88e679b57d447dbaa072fb07aa415364894c2/UTO.log
  • It is not clear whether this has any impact on the topics in Kafka or only on the KafkaTopic resources. In one case, I saw some issues with a topic in the Connect cluster, but the other times I reproduced it, everything seemed fine. So I am not sure whether that was a coincidence.
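
To make the flapping in the watch output above easier to see, here is a minimal sketch that collapses `kubectl get kt -w` rows into a per-topic READY history. It assumes the five-column layout shown above (NAME, CLUSTER, PARTITIONS, REPLICATION FACTOR, READY) and skips the header and the `^^^` annotation lines. The function names `ready_transitions` and `flapped` are hypothetical helpers and not part of Strimzi or kubectl.

```python
from collections import defaultdict


def ready_transitions(watch_output):
    """Collapse watch rows into per-topic READY histories.

    Consecutive rows with the same READY value for a topic are merged,
    so a topic that went True -> False -> True yields [True, False, True].
    """
    history = defaultdict(list)
    for line in watch_output.splitlines():
        fields = line.split()
        # Data rows have exactly five whitespace-separated columns and end
        # in True/False; the header row and "^^^" notes do not match this.
        if len(fields) != 5 or fields[-1] not in ("True", "False"):
            continue
        name, ready = fields[0], fields[-1] == "True"
        if not history[name] or history[name][-1] != ready:
            history[name].append(ready)
    return dict(history)


def flapped(history):
    """Return the topics that left READY=True and later came back."""
    return sorted(
        name for name, states in history.items()
        if states == [True, False, True]
    )
```

Feeding it the saved watch output (for example, the text from the gist above) gives a quick list of the KafkaTopic resources that went to error and then recovered. Rows with an empty READY column would be skipped by this sketch.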
@scholzj
Member Author

scholzj commented Dec 15, 2023

This is possibly a blocker for 0.39.0 release! At least unless we confirm that it is only a cosmetic issue in UTO without any negative impact. In that case it would be ugly, but perhaps not blocking.

@scholzj scholzj changed the title [UTO] Unidirectional Topic Operator does seem to cause disruption when ugrading from 0.38 with BTO [UTO] Unidirectional Topic Operator does seem to cause disruption when upgrading from 0.38 with BTO Dec 15, 2023
fvaleri added a commit to fvaleri/strimzi-kafka-operator that referenced this issue Dec 18, 2023
This issue is caused by stale metadata of one or more brokers after restarting the cluster (no risk of data loss).
Using the reproducer, we can see that the UTO fails at 14:27:39 with UnknownTopicOrPartitionException (retriable), while one of the brokers first knows about my-topic at 14:27:44.
This triggers topic creation logic which fails with TopicExistsException.

UTO log:
2023-12-17 14:27:39,55262 TRACE [kafka-admin-client-thread | strimzi-topic-operator-a93c1635-76c3-4c9f-b61f-68c1a6ac98c3] BatchingTopicController:754 - Admin.describeTopics([__strimzi_store_topic, strimzi.cruisecontrol.partitionmetricsamples, __strimzi-topic-operator-kstreams-topic-store-changelog, timer-topic, connect-cluster-status, strimzi.cruisecontrol.modeltrainingsamples, strimzi.cruisecontrol.metrics, my-topic, __consumer_offsets, connect-cluster-offsets]) failed with java.util.concurrent.CompletionException: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.

Broker log:
2023-12-17 14:27:44,209 TRACE [Broker id=1000] Cached leader info UpdateMetadataPartitionState(topicName='my-topic', partitionIndex=0, controllerEpoch=1, leader=2000, leaderEpoch=4, isr=[1001, 2000, 2001], zkVersion=7, replicas=[2000, 2001, 1001], offlineReplicas=[]) for partition my-topic-0 in response to UpdateMetadata request sent by controller 1001 epoch 2 with correlation id 0 (state.change.logger) [control-plane-kafka-request-handler-0]

I'm proposing to catch and ignore the TopicExistsException, which is also what BTO does.
If the topic was created by a third party before the UTO, the next reconciliation will try to revert any configuration drift.

Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
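
The failure sequence and the proposed fix described in the commit message above can be sketched roughly as follows. This is not the UTO's code (the operator is written in Java and uses the Kafka Admin client); it is a toy model of the control flow only. `FakeCluster`, `reconcile`, and the two exception classes are hypothetical stand-ins for a broker with stale metadata, the reconciliation step, and Kafka's UnknownTopicOrPartitionException and TopicExistsException.

```python
class UnknownTopicOrPartition(Exception):
    """Stand-in for Kafka's retriable UnknownTopicOrPartitionException."""


class TopicExists(Exception):
    """Stand-in for Kafka's TopicExistsException."""


class FakeCluster:
    """Toy cluster where the answering broker may have stale metadata."""

    def __init__(self, topics, stale=()):
        self.topics = set(topics)  # topics that really exist
        self.stale = set(stale)    # topics the answering broker has not learned about yet

    def describe(self, name):
        if name not in self.topics or name in self.stale:
            raise UnknownTopicOrPartition(name)
        return {"name": name}

    def create(self, name):
        if name in self.topics:
            raise TopicExists(name)
        self.topics.add(name)


def reconcile(cluster, name, ignore_exists=True):
    """Return "Ready" or "NotReady" for one topic reconciliation."""
    try:
        cluster.describe(name)
        return "Ready"
    except UnknownTopicOrPartition:
        pass  # the topic looks missing, so fall through to creation
    try:
        cluster.create(name)
        return "Ready"
    except TopicExists:
        # With the proposed fix, an existing topic is not an error; any
        # configuration drift is left to the next reconciliation.
        return "Ready" if ignore_exists else "NotReady"
```

With `ignore_exists=False` the sketch reproduces the reported behavior (an existing topic behind stale metadata ends up NotReady); with the default it models the proposed fix.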
@scholzj scholzj linked a pull request Dec 18, 2023 that will close this issue
scholzj pushed a commit that referenced this issue Dec 18, 2023
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>
scholzj pushed a commit that referenced this issue Dec 18, 2023
Signed-off-by: Federico Valeri <fedevaleri@gmail.com>