Attempt to delete topic is reversed by Topic Operator #1157

Closed
chris-vest opened this issue Dec 12, 2018 · 24 comments · Fixed by #3982

@chris-vest
Contributor

chris-vest commented Dec 12, 2018

Strimzi Version: 0.9.0-rc1

I want to delete a topic, but cannot seem to do so either with the CLI tools or by deleting the KafkaTopic CRD. The topic operator insists on recreating it.

When deleting the KafkaTopic CRD for the topic in question:

[2018-12-12 12:48:13,612] INFO <opicOperator:559> [oop-thread-0] All three topics are identical
[2018-12-12 12:48:13,613] INFO <TopicWatcher:40> [oop-thread-0] Success processing KafkaTopic watch event ADDED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:48:15,831] INFO <TopicWatcher:37> [ault.svc/...] KafkaTopic watch received event DELETED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:48:15,842] INFO <opicOperator:419> [oop-thread-0] Reconciling topic nutmeg-sit, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
[2018-12-12 12:48:15,842] INFO <opicOperator:319> [oop-thread-0] Deleting topic 'nutmeg-sit'
[2018-12-12 12:48:15,863] INFO <TopicWatcher:40> [oop-thread-0] Success processing KafkaTopic watch event DELETED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:48:15,891] INFO <opicsWatcher:79> [oop-thread-1] Deleted topics: [nutmeg-sit]
[2018-12-12 12:48:15,898] INFO <opicOperator:419> [oop-thread-0] Reconciling topic null, k8sTopic:null, kafkaTopic:null, privateTopic:null
[2018-12-12 12:48:15,965] INFO <opicsWatcher:94> [oop-thread-1] Created topics: [nutmeg-sit]
[2018-12-12 12:48:16,186] INFO <opicOperator:419> [oop-thread-0] Reconciling topic nutmeg-sit, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
[2018-12-12 12:48:16,194] INFO <TopicWatcher:37> [ault.svc/...] KafkaTopic watch received event ADDED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:48:16,207] INFO <opicOperator:419> [oop-thread-1] Reconciling topic nutmeg-sit, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2018-12-12 12:48:16,207] INFO <opicOperator:559> [oop-thread-1] All three topics are identical
[2018-12-12 12:48:16,207] INFO <TopicWatcher:40> [oop-thread-1] Success processing KafkaTopic watch event ADDED on resource nutmeg-sit with labels {strimzi.io/cluster=c}

When I use the kafka-topics.sh CLI:
[2018-12-12 12:49:40,902] INFO <TopicWatcher:37> [ault.svc/...] KafkaTopic watch received event DELETED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:49:40,907] INFO <TopicWatcher:40> [oop-thread-1] Success processing KafkaTopic watch event MODIFIED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:49:40,909] INFO <opicOperator:419> [oop-thread-1] Reconciling topic null, k8sTopic:null, kafkaTopic:null, privateTopic:null
[2018-12-12 12:49:40,909] INFO <TopicWatcher:40> [oop-thread-1] Success processing KafkaTopic watch event DELETED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:49:41,327] INFO <opicsWatcher:94> [oop-thread-0] Created topics: [nutmeg-sit]
[2018-12-12 12:49:41,371] INFO <opicOperator:955> [oop-thread-1] Starting periodic reconciliation
[2018-12-12 12:49:41,387] INFO <opicOperator:419> [oop-thread-1] Reconciling topic _schemas, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2018-12-12 12:49:41,388] INFO <opicOperator:559> [oop-thread-1] All three topics are identical
[2018-12-12 12:49:41,388] INFO <opicOperator:365> [oop-thread-1] Success reconciling KafkaTopic strimzi/schemas---251ea2dcbbf0dc3858aa5637c32a54d26159edc1
[2018-12-12 12:49:41,397] INFO <opicOperator:419> [oop-thread-0] Reconciling topic _schemas, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2018-12-12 12:49:41,397] INFO <opicOperator:559> [oop-thread-0] All three topics are identical
[2018-12-12 12:49:41,397] INFO <opicOperator:365> [oop-thread-0] Success reconciling KafkaTopic strimzi/schemas---251ea2dcbbf0dc3858aa5637c32a54d26159edc1
[2018-12-12 12:49:41,546] INFO <opicOperator:419> [oop-thread-1] Reconciling topic nutmeg-sit, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
[2018-12-12 12:49:41,558] INFO <TopicWatcher:37> [ault.svc/...] KafkaTopic watch received event ADDED on resource nutmeg-sit with labels {strimzi.io/cluster=c}
[2018-12-12 12:49:41,565] INFO <opicOperator:419> [oop-thread-0] Reconciling topic nutmeg-sit, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
[2018-12-12 12:49:41,565] INFO <opicOperator:319> [oop-thread-0] Deleting topic 'nutmeg-sit'
[2018-12-12 12:49:41,586] INFO <opicOperator:365> [oop-thread-0] Success reconciling KafkaTopic null
[2018-12-12 12:49:41,592] INFO <opicOperator:419> [oop-thread-0] Reconciling topic nutmeg-sit, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2018-12-12 12:49:41,599] ERROR <TopicWatcher:49> [oop-thread-0] Failure processing KafkaTopic watch event ADDED on resource nutmeg-sit with labels {strimzi.io/cluster=c}: Topic 'nutmeg-sit' already exists. org.apache.kafka.common.errors.TopicExistsException: Topic 'nutmeg-sit' already exists.
[2018-12-12 12:49:41,602] WARN <opicOperator:96> [oop-thread-0] Failure processing KafkaTopic watch event ADDED on resource nutmeg-sit with labels {strimzi.io/cluster=c}: Topic 'nutmeg-sit' already exists.
[2018-12-12 12:49:41,609] INFO <opicsWatcher:79> [oop-thread-0] Deleted topics: [nutmeg-sit]
[2018-12-12 12:49:41,616] INFO <opicOperator:419> [oop-thread-1] Reconciling topic nutmeg-sit, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2018-12-12 12:49:41,628] INFO <opicsWatcher:94> [oop-thread-0] Created topics: [nutmeg-sit]
[2018-12-12 12:49:41,637] INFO <opicOperator:192> [oop-thread-1] Created topic 'nutmeg-sit' for KafkaTopic 'nutmeg-sit'
[2018-12-12 12:49:41,651] INFO <opicOperator:419> [oop-thread-0] Reconciling topic nutmeg-sit, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2018-12-12 12:49:41,651] INFO <opicOperator:559> [oop-thread-0] All three topics are identical

The topic in question is nutmeg-sit. Attempting to delete the topic using the CLI fails, as does deleting the KafkaTopic CRD, and doing both at the same time also fails.

@chris-vest chris-vest changed the title Attempt to delete topic is reversed by TopiC Operator Attempt to delete topic is reversed by Topic Operator Dec 12, 2018
@sknot-rh
Member

Hello. How did you create the topic? I did a quick test by applying examples/topic/kafka-topic.yaml with 0.9.0-rc1 and the topic stays deleted.

@chris-vest
Contributor Author

@stanlyDoge The topic was initially created using Kafka Connect with the Debezium MySQL connector. The Connect cluster had been shut down in order to ensure that it was in fact the topic operator which was recreating the topics. As expected, with the Connect cluster removed, deleting the topic from one of the brokers using the kafka-topics.sh tool, or deleting the KafkaTopic CRD for the topic in question, resulted in the topic being recreated by the topic operator.

@bvalluri

bvalluri commented Mar 29, 2019

We are facing the same issue as well. The deleted topic is recreated during reconciliation. Below is the log for that. The topic is created by our own microservice.

2019-03-29 11:02:19 INFO K8sTopicWatcher:37 - KafkaTopic watch received event DELETED on resource job.request with labels {strimzi.io/kind=topic}
2019-03-29 11:02:19 INFO TopicOperator:376 - Reconciling topic job.request, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
2019-03-29 11:02:19 INFO TopicOperator:321 - Deleting topic 'job.request'
2019-03-29 11:02:19 INFO K8sTopicWatcher:40 - Success processing KafkaTopic watch event DELETED on resource job.request with labels {strimzi.io/kind=topic}
2019-03-29 11:02:19 INFO ZkTopicsWatcher:79 - Deleted topics: [job.request]
2019-03-29 11:02:19 INFO TopicOperator:376 - Reconciling topic null, k8sTopic:null, kafkaTopic:null, privateTopic:null
2019-03-29 11:03:27 INFO ZkTopicsWatcher:94 - Created topics: [job.request]
2019-03-29 11:03:27 INFO TopicOperator:376 - Reconciling topic job.request, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
2019-03-29 11:03:27 INFO K8sTopicWatcher:37 - KafkaTopic watch received event ADDED on resource job.request with labels {strimzi.io/kind=topic}
2019-03-29 11:03:27 INFO TopicOperator:376 - Reconciling topic job.request, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
2019-03-29 11:03:27 INFO TopicOperator:516 - All three topics are identical
2019-03-29 11:03:27 INFO K8sTopicWatcher:40 - Success processing KafkaTopic watch event ADDED on resource job.request with labels {strimzi.io/kind=topic}

@sumitmundhada

sumitmundhada commented Oct 15, 2020

I am also facing the same issue. We have a Helm chart that creates 45 KafkaTopic resources. When uninstalling the chart, Strimzi does not delete one topic. It is a random topic every time. In the topic operator pod, I see this message for the topic. (We have three replicas for all of our topics.)
All three topics are identical

Here are the steps to reproduce:

  1. helm install <chart-name> <chart-location>
  2. Get a shell in the broker pod and validate that the topics are created: bin/kafka-topics.sh --list --bootstrap-server=localhost:9092
  3. helm uninstall <chart-name>
  4. Using the shell from step 2, validate that the topics are indeed deleted.

In my case one topic is not deleted. I am using Helm 3. Also, the chart I am deploying is an umbrella Helm chart, where the topics are created in one of the subcharts. The umbrella Helm chart is installed in a different namespace than the kafka-topics Helm chart.
As far as I can tell, this happens when the number of topics is large.

@scholzj
Member

scholzj commented Oct 15, 2020

@sumitmundhada I think you will need to provide some more details. There are several things which can happen:

  • Your Helm chart does not delete all the resources
  • The topic is deleted but some application re-creates it (or it is auto-recreated if you didn't disable auto-creation)

But without the full logs it is not clear what happened.

@sumitmundhada

sumitmundhada commented Oct 16, 2020

@scholzj We are currently getting started with Strimzi.

  1. Currently the umbrella Helm chart has only one subchart, the chart for creating Kafka topics.
  2. I am confident that there is no application that is creating that topic. Also, the topic that does not get deleted is never the same. Sometimes it is just one; at times there are two topics that remain. The auto-recreation property is the same for all the topics (I am not sure how to set that; the topic configurations I am providing are very simple, like partitions, replicas and retention milliseconds).

What logs can I provide you with?

@scholzj
Member

scholzj commented Oct 16, 2020

Ok, well ... the logs should confirm it then. I think for this just the logs from the Topic Operator container are needed (it is one of the containers inside the Entity Operator pod).

@scholzj
Member

scholzj commented Oct 19, 2020

@stanlyDoge @tombentley @ppatierno You are the TO experts. Can you have a look?

@scholzj
Member

scholzj commented Oct 19, 2020

@sumitmundhada To me it looks like the client recreates it. If you look at these logs for example:

[2020-10-19 11:43:13,169 INFO <opicsWatcher:80> [calhost:2181 Deleted topics: [prefix-1-topic-6
[2020-10-19 11:43:13,176 INFO <opicOperator:536> [oop-thread-0 3625893|/brokers/topics -prefix-1-topic-6: Reconciling topic null, k8sTopic:null, kafkaTopic:null, privateTopic:null
[2020-10-19 11:43:13,616 INFO <opicsWatcher:96> [calhost:2181 Created topics: [prefix-1-topic-6
[2020-10-19 11:43:15,038 INFO <opicOperator:536> [oop-thread-0 3625894|/brokers/topics +prefix-1-topic-6: Reconciling topic prefix-1-topic-6, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null

IMHO it suggests the topic was deleted by the TO but then recreated in Kafka (notice the kafkaTopic:nonnull on the last line).

@sumitmundhada

To me it looks like the client recreates it.

Can you explain this a little more, please? (I am actually new to Kafka)

I currently do not have any consumer or producer working with any of these topics. Also, deleting them via kubectl always works; the topics are not recreated in that case. This behaviour is observed only when creating/deleting them via Helm. (Additional info: we have at least 700 other topics in that cluster, and we are using Helm 3.)

@scholzj
Member

scholzj commented Oct 19, 2020

Kafka by default auto-creates topics when consumers / producers try to use them. This can be disabled in the Kafka configuration (in .spec.kafka.config using the option auto.create.topics.enable: "false"). So my assumption from the log is that they get auto-recreated by the clients. If that is not the case, maybe the others will find something else then ... they understand the topic operator more than I do.
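If in doubt, the effective broker setting can be read back over the Admin API. A minimal sketch, assuming a plain Java Kafka Admin client (the bootstrap address and broker id below are placeholders, not taken from this thread):

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class CheckAutoCreate {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; point this at one of your brokers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Placeholder broker id "0"; use a real broker id from your cluster.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Set.of(broker)).all().get().get(broker);
            // Prints "true" if the broker will silently recreate topics that clients touch.
            System.out.println("auto.create.topics.enable = "
                    + config.get("auto.create.topics.enable").value());
        }
    }
}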

@sknot-rh
Member

I've tried to create, send/receive, and delete multiple topics using kubectl but I was not able to get this faulty behaviour. Maybe we should wait for @tombentley to see what he thinks.

@sumitmundhada

I've tried to create, send/receive, and delete multiple topics using kubectl but I was not able to get this faulty behaviour. Maybe we should wait for @tombentley to see what he thinks.

Yes, I have seen the same behaviour. The issue only occurs when creating them via the Helm chart. I am using the helm install/uninstall commands, to be specific.

@scholzj
Member

scholzj commented Oct 19, 2020

Well, the operator does not know about Helm. It is only watching the custom resources. So I would expect that either your Helm chart does something special to delete the topics (but there are not really that many ways to delete a resource) or it is more about the timing (e.g. some race condition).

@sumitmundhada

Somehow my previous comment with the logs got deleted. So posting the logs again.
strimzi-logs.txt

@tombentley
Member

@sumitmundhada would it be possible to reproduce with the TO logging at TRACE level?

@sumitmundhada

@tombentley Attaching a slightly bigger log file with the log level set to TRACE for the topic operator. I tried to replicate the same Helm chart on a Strimzi cluster deployed on my local minikube cluster. The topics created via the Helm chart were named prefix-2-topic1, prefix-2-topic2 and so on up to prefix-2-topic-45. The four topics that did not get deleted when I uninstalled the chart were prefix-2-topic20, prefix-2-topic26, prefix-2-topic32, and prefix-2-topic35. Please let me know if I have not correctly captured the logs.

strimzi-trace-logs.log

@sumitmundhada

sumitmundhada commented Oct 21, 2020

One more thing that I noticed was that when the topics were initially created they had 10 partitions. But the 4 topics mentioned above, which were recreated after uninstalling the chart, had just 1 partition.

@scholzj
Member

scholzj commented Oct 21, 2020

One more thing that I noticed was that when the topics were initially created they had 10 partitions. But the 4 topics mentioned above, which were recreated after uninstalling the chart, had just 1 partition.

That normally suggests auto-creation with default settings ... just saying ;-)

@tombentley
Member

@sumitmundhada thanks for the logs. They do seem to show something going on with prefix-2-topic20 at least.

  1. The Kube watch event for the deletion of the KafkaTopic doesn't appear in the logs, but obviously it was deleted.

  2. The TO tries to delete the topic in Kafka, but this times out.

    2020-10-21 18:31:00,523 ERROR <TopicWatcher:56> [oop-thread-1 2184|kube -prefix-2-topic20|387563: Failure processing KafkaTopic watch event DELETED on resource prefix-2-topic20 with labels {app.kubernetes.io/managed-by=Helm, strimzi.io/cluster=kafka, tenant-id=prefix-2}: Not deleted yet
      name: "prefix-2-topic20"
      selfLink: "/apis/kafka.strimzi.io/v1beta1/namespaces/kafka/kafkatopics/prefix-2-topic20"
    , message=Failure processing KafkaTopic watch event DELETED on resource prefix-2-topic20 with labels {app.kubernetes.io/managed-by=Helm, strimzi.io/cluster=kafka, tenant-id=prefix-2}: Not deleted yet)
    

    This exits the reconciliation without deleting the topic metadata from the topic store. (Aside: the timeout itself, when a lot of deletions are being done at the same time, isn't completely surprising. Deletion of large numbers of topics is something which doesn't scale well due to the Kafka controller's single-threaded model.)

  3. The Kafka controller eventually starts processing the deletion and the topic's znode under /brokers/topics gets removed, which the operator sees:

    [2020-10-21 18:31:00,541 DEBUG <opicOperator:471> [oop-thread-0 2244|/brokers/topics -prefix-2-topic20: Executing action onTopicDeleted on topic prefix-2-topic20
    
  4. The TO doesn't find the KafkaTopic and assumes that the topic doesn't exist in Kafka, so the reconciliation cleans up the topic metadata left over from the previous aborted reconciliation:

    [2020-10-21 18:31:00,613 INFO  <opicOperator:536> [oop-thread-0 2244|/brokers/topics -prefix-2-topic20: Reconciling topic prefix-2-topic20, k8sTopic:null, kafkaTopic:null, privateTopic:nonnull
    [2020-10-21 18:31:00,613 DEBUG <opicOperator:577> [oop-thread-0 2244|/brokers/topics -prefix-2-topic20: KafkaTopic deleted in k8s and topic deleted in kafka => delete from topicStore
    
  5. Somehow (I don't know how; it's not in the logs) the TO has observed the "creation" of the topic's znode under /brokers/topics

    [2020-10-21 18:31:00,654 DEBUG <opicOperator:471> [oop-thread-1 2246|/brokers/topics +prefix-2-topic20: Executing action onTopicCreated on topic prefix-2-topic20
    

    and this event is now processed. It succeeds in getting the topic's description from a Kafka broker (I assume the broker hasn't yet received an UpdateMetadataRequest), so the TO proceeds to create a KafkaTopic for it

    [2020-10-21 18:31:00,927 INFO  <opicOperator:536> [oop-thread-1 2246|/brokers/topics +prefix-2-topic20: Reconciling topic prefix-2-topic20, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
    

I put "creation" in scare quotes because the logging of how we track creation and deletion is a little weird:

[2020-10-21 18:31:00,566 DEBUG <opicsWatcher:72> [calhost:2181 znode /brokers/topics now has children [prefix-2-topic37, prefix-2-topic35, prefix-2-topic32, prefix-2-topic26
, prefix-2-topic20, previous children [prefix-2-topic39, prefix-2-topic37, prefix-2-topic35, prefix-2-topic32, prefix-2-topic26, prefix-2-topic20

note the lack of ] in the lists.

We could wait longer in step 2, but we can't wait forever. So a subsequent reconciliation needs to be able to recover. Maybe in step 3 we should hold off processing the event until the Admin client says the topic doesn't exist. That code would need to be resilient to races itself though.
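A minimal sketch of that idea, assuming a plain Kafka Admin client (the class and method names here are illustrative, not the operator's actual code): poll describeTopics() for a bounded number of attempts and only treat the topic as gone once the call fails with UnknownTopicOrPartitionException. As noted above, metadata served by a non-controller broker can lag, so a real implementation would still need to tolerate races.

import java.util.Set;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.errors.UnknownTopicOrPartitionException;

class WaitForTopicDeletion {
    /** Returns true once the Admin client no longer sees the topic, false if we give up. */
    static boolean waitUntilGone(Admin admin, String topic, int maxAttempts, long backoffMs)
            throws InterruptedException, ExecutionException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                admin.describeTopics(Set.of(topic)).all().get();
                Thread.sleep(backoffMs); // metadata still shows the topic; back off and retry
            } catch (ExecutionException e) {
                if (e.getCause() instanceof UnknownTopicOrPartitionException) {
                    return true; // the topic is no longer visible via the Admin client
                }
                throw e; // some other failure; let the reconciliation handle it
            }
        }
        return false; // still visible after maxAttempts; leave recovery to a later reconciliation
    }
}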

@sumitmundhada

I would be lying if I said I understood 100% of this, but what would be the next steps in this case? Will this issue be sufficient to track this bug?

@sumitmundhada

@tombentley @scholzj Do you see this as a prominent issue that needs a fix?

@scholzj
Member

scholzj commented Nov 10, 2020

@sumitmundhada I do not really know enough about the Topic Operator ... so I personally do not dare to touch it :-o

@tombentley
Member

I do see this as an issue which needs to be addressed, but I don't have the time right now to look at it. It's on my to-do list.

tombentley added a commit to tombentley/barnabas that referenced this issue Nov 19, 2020
Signed-off-by: Tom Bentley <tbentley@redhat.com>
tombentley added a commit that referenced this issue Nov 26, 2020
Try to fix #1157 and possibly other issues where the TO behaves incorrectly wrt creation or deletion.

When we observe the deletion of a topic via ZK don't assume the topic deletion has propagated to observations of the topic via the Admin client. Instead wait for the Admin client to confirm that the topic doesn't exist before proceeding with the reconciliation. Using a null result from `Kafka.topicMetadata()` to imply non-existence was prone to a race where we queried a non-controller broker for the metadata and that broker had not yet processed an UPDATE_METADATA request from the controller (e.g. the topic was recently created or deleted). By adding an explicit check for existence (handled by the controller because it's a `createTopics(validateOnly=true)`) we should have more consistent responses.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
tombentley added a commit that referenced this issue Dec 10, 2020
Try to fix #1157 and possibly other issues where the TO behaves incorrectly wrt creation or deletion.

When we observe the deletion of a topic via ZK don't assume the topic deletion has propagated to observations of the topic via the Admin client. Instead wait for the Admin client to confirm that the topic doesn't exist before proceeding with the reconciliation. Using a null result from `Kafka.topicMetadata()` to imply non-existence was prone to a race where we queried a non-controller broker for the metadata and that broker had not yet processed an UPDATE_METADATA request from the controller (e.g. the topic was recently created or deleted). By adding an explicit check for existence (handled by the controller because it's a `createTopics(validateOnly=true)`) we should have more consistent responses.

Signed-off-by: Tom Bentley <tbentley@redhat.com>
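As a rough sketch of the existence check the commit message describes (illustrative only, not the actual Strimzi code): a createTopics request with validateOnly(true) is handled by the controller, so it can serve as a more authoritative existence check than topic metadata fetched from an arbitrary broker.

import java.util.Set;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.CreateTopicsOptions;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.errors.TopicExistsException;

class ControllerExistenceCheck {
    /** Asks the controller whether the topic exists, via a validate-only create request. */
    static boolean topicExists(Admin admin, String topicName)
            throws InterruptedException, ExecutionException {
        NewTopic probe = new NewTopic(topicName, 1, (short) 1); // never actually created
        CreateTopicsOptions validateOnly = new CreateTopicsOptions().validateOnly(true);
        try {
            admin.createTopics(Set.of(probe), validateOnly).all().get();
            return false; // validation passed, so the controller has no such topic
        } catch (ExecutionException e) {
            if (e.getCause() instanceof TopicExistsException) {
                return true; // the controller still knows about the topic
            }
            throw e; // some other failure; rethrow for the caller to handle
        }
    }
}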