
[Bug] race when creating new topics with TopicOperator #1775

Closed
joekohlsdorf opened this issue Jul 4, 2019 · 14 comments

@joekohlsdorf

Describe the bug

When creating a large number of topics at once with the Topic Operator, I consistently see a race where some of them are recognized as unmanaged by the Topic Operator and their Kubernetes resource definitions get renamed to include a hash.

To Reproduce

Create 500 KafkaTopic resource definitions, then kubectl apply them all at once.
I used a fresh cluster with no load for this test.
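
For reference, a minimal sketch of that kind of bulk creation (the cluster name my-cluster, the kafka namespace, and the partition/replica values are placeholders, not taken from this report):

```
# Sketch only: generate 500 KafkaTopic manifests and apply them in one go.
# Cluster name, namespace and spec values below are placeholders.
for i in $(seq 1 500); do
  cat <<EOF
---
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: testtopic${i}
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 1
  replicas: 1
EOF
done > topics.yaml

kubectl apply -n kafka -f topics.yaml
```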

Expected behavior

I expect all Kafka topics to be created without getting their Kubernetes resource definitions renamed.

Environment (please complete the following information):
Strimzi version: 0.12.1 (same behavior on 0.11.1)
Installation method: YAML files
Kubernetes cluster: 1.11
Infrastructure: AWS, installed with kops

YAML files and logs

Log for a single topic where I saw this behavior:

[2019-07-04 19:32:22,911] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:30:52,150] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:29:21,479] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:27:50,792] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:26:27,559] INFO <TopicWatcher:39> [oop-thread-1] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: Success processing event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:27,559] INFO <opicOperator:599> [oop-thread-1] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: All three topics are identical
[2019-07-04 19:26:27,559] INFO <opicOperator:460> [oop-thread-1] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:26,704] INFO <TopicWatcher:39> [oop-thread-1] 1090|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585344: Success processing event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,683] INFO <TopicWatcher:36> [0.64.0.1/...] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,674] INFO <opicOperator:460> [oop-thread-1] 1090|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585344: Reconciling topic testtopic112, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
[2019-07-04 19:26:26,335] INFO <TopicWatcher:39> [oop-thread-1] 1088|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585342: Success processing event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,044] INFO <TopicWatcher:36> [0.64.0.1/...] 1090|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585344: event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,037] INFO <TopicWatcher:36> [0.64.0.1/...] 1088|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585342: event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,008] INFO <TopicWatcher:39> [oop-thread-1] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: Success processing event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:25,947] INFO <opicOperator:210> [oop-thread-1] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: Created topic 'testtopic112' for KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23'
[2019-07-04 19:26:25,586] INFO <opicOperator:460> [oop-thread-1] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2019-07-04 19:26:25,093] INFO <TopicWatcher:39> [oop-thread-1] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: Success processing event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:24,510] INFO <opicOperator:343> [oop-thread-1] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: Deleting topic 'testtopic112'
[2019-07-04 19:26:24,510] INFO <opicOperator:460> [oop-thread-1] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: Reconciling topic testtopic112, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:24,412] INFO <TopicWatcher:39> [oop-thread-1] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: Success processing event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:24,412] INFO <opicOperator:599> [oop-thread-1] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: All three topics are identical
[2019-07-04 19:26:24,411] INFO <opicOperator:460> [oop-thread-1] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:24,014] INFO <TopicWatcher:36> [0.64.0.1/...] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,312] INFO <TopicWatcher:36> [0.64.0.1/...] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,305] INFO <TopicWatcher:36> [0.64.0.1/...] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,292] INFO <TopicWatcher:39> [oop-thread-1] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: Success processing event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,285] INFO <opicOperator:210> [oop-thread-1] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: Created topic 'testtopic112' for KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23'
[2019-07-04 19:26:23,048] INFO <opicOperator:460> [oop-thread-1] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2019-07-04 19:26:21,952] INFO <TopicWatcher:39> [oop-thread-1] 560|kube -testtopic112|113585006: Success processing event DELETED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:21,549] INFO <opicOperator:343> [oop-thread-1] 560|kube -testtopic112|113585006: Deleting topic 'testtopic112'
[2019-07-04 19:26:21,549] INFO <opicOperator:460> [oop-thread-1] 560|kube -testtopic112|113585006: Reconciling topic testtopic112, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:20,744] WARN <opicOperator:106> [oop-thread-1] Failure processing KafkaTopic watch event MODIFIED on resource testtopic112 with labels {strimzi.io/cluster=cdc}: Topic 'testtopic112' is already managed via KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23' it cannot also be managed via the KafkaTopic 'testtopic112'
io.strimzi.operator.topic.OperatorException: Topic 'testtopic112' is already managed via KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23' it cannot also be managed via the KafkaTopic 'testtopic112'
[2019-07-04 19:26:20,743] ERROR <TopicWatcher:48> [oop-thread-1] 558|kube =testtopic112|113585005: Failure processing KafkaTopic watch event MODIFIED on resource testtopic112 with labels {strimzi.io/cluster=cdc}: Topic 'testtopic112' is already managed via KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23' it cannot also be managed via the KafkaTopic 'testtopic112'
[2019-07-04 19:26:20,742] INFO <opicOperator:460> [oop-thread-1] 558|kube =testtopic112|113585005: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:20,197] INFO <TopicWatcher:36> [0.64.0.1/...] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:18,386] INFO <opicOperator:1239> [oop-thread-0] 133|initial kube testtopic112|113584140: Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112
[2019-07-04 19:26:17,576] INFO <TopicWatcher:36> [0.64.0.1/...] 560|kube -testtopic112|113585006: event DELETED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:17,568] INFO <TopicWatcher:36> [0.64.0.1/...] 558|kube =testtopic112|113585005: event MODIFIED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:17,512] INFO <opicOperator:460> [oop-thread-0] 133|initial kube testtopic112|113584140: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:nonnull
[2019-07-04 19:26:07,656] INFO <TopicWatcher:39> [oop-thread-1] 7|kube +testtopic112|113584140: Success processing event ADDED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:07,643] INFO <opicOperator:210> [oop-thread-1] 7|kube +testtopic112|113584140: Created topic 'testtopic112' for KafkaTopic 'testtopic112'
[2019-07-04 19:26:06,002] INFO <opicOperator:460> [oop-thread-1] 7|kube +testtopic112|113584140: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2019-07-04 19:26:05,658] INFO <TopicWatcher:36> [0.64.0.1/...] 7|kube +testtopic112|113584140: event ADDED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
@tombentley
Member

Note: the above log lines are in reverse order; read the log from the bottom up.

@tombentley
Member

@joekohlsdorf thanks for this bug report. I've not been able to reproduce this yet. Is there any chance you can reproduce this with the Topic Operator logging at DEBUG? You can configure it easily in the Kafka resource, like this:

kind: Kafka
spec:
  # ...
  entityOperator:
    topicOperator:
      logging: 
        type: inline
        loggers:
          rootLogger.level: DEBUG

It seems that you created the topics immediately after/while the TO was starting up. Are you able to reproduce this when the TO has been running for a while? (I'm wondering whether this is a race between the reconciliation we do at start-up and the event-based reconciliation, or something which can happen purely event-driven.)

@joekohlsdorf
Author

Here is a debug log of this; due to the amount of log produced, I only extracted lines which contain the topic name. The log is in reverse order.
https://gist.github.com/joekohlsdorf/983f65238691e86ddd5210d75c66b404

In the case of the issue shown in the logs, the topics were created right after starting up the Topic Operator; I deployed the manifests for cluster and topic creation together.

I can also reproduce this when the Topic Operator has been running for a while; sometimes I have to delete and redeploy topics multiple times to correct the issue.

@joekohlsdorf
Author

In the original bug report I changed the topic name in the debug log for privacy reasons.

While debugging a related issue I came to the conclusion that this could happen if topic names contain characters which cannot be represented in K8s resource names. The redacted topic names from the log contained underscores.
I hope this helps in further analyzing the issue.
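
For illustration only: the renamed resources in the log above follow what looks like a "sanitized name plus hash of the topic name" pattern (e.g. testtopic112---b1ffe0f6...). A rough sketch of that kind of mapping, not Strimzi's exact algorithm:

```
# Rough sketch: map a Kafka topic name that is not a valid Kubernetes resource
# name (e.g. one containing underscores) to a unique resource name.
# Illustration only; the sanitization rules and the hash are assumptions.
topic='test_topic_112'
# Lower-case and replace characters not allowed in K8s resource names.
sanitized=$(printf '%s' "$topic" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9.-' '-')
# Append a hash of the original name so distinct topics map to distinct resources.
hash=$(printf '%s' "$topic" | sha1sum | cut -d ' ' -f 1)
echo "${sanitized}---${hash}"
```

If the original KafkaTopic and an operator-generated resource with the hashed name both end up pointing at the same topic, that would line up with the "already managed via KafkaTopic" error in the log above.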

@tombentley
Member

Further info on reproducing this:

I spun up a test cluster in Minikube, created 30 topics with underscores in their names, and started rolling Kafka by changing the version in its spec. It seems to be independent of what change you make; rolling the cluster is enough.
This old issue I reported a long time ago might be related: #1775
In Minikube only 1 of my 30 test topics was affected, and it took some attempts to trigger the issue. The other cluster runs with EBS-backed volumes, so I guess rolling takes longer and the issue is more likely to be triggered; there, all topics with underscores in their names were affected.
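
For completeness, one way to trigger such a roll from the command line (cluster name, namespace and target version are placeholders; per the above, any spec change that rolls the brokers seems to be enough):

```
# Placeholder example: roll the Kafka cluster by changing spec.kafka.version.
# Cluster name, namespace and version are assumptions, not from this report.
kubectl -n kafka patch kafka my-cluster --type merge \
  -p '{"spec":{"kafka":{"version":"2.6.0"}}}'
```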

@forsberg

I'm hitting this or something similar. I have a Helm chart that creates multiple (about 14) topics, and in my setup (3-node Kafka cluster on EKS using st1-backed volumes, 3 ZooKeeper nodes) the behaviour is triggered most of the time I do a fresh installation of this Helm chart.

All of the topics get created, but most of them have no status and are not actually available in the Kafka cluster.

The topics are available neither under /strimzi/topics nor under /brokers/topics in ZK.

Restarting the Topic Operator doesn't help; if I restart, the only mention of the bad topics in the debug log is "Ignoring initial event for KafkaTopic topicname during initial reconcile".

If I add a topicName to the KafkaTopic spec, the Topic Operator adds an event to the KafkaTopic saying that it can't be renamed.

What works is to remove one KafkaTopic at a time, then rerun the Helm chart installation so it gets recreated.
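
Roughly, that workaround looks like this (topic, namespace, release and chart names here are placeholders):

```
# Sketch of the workaround described above; all names are placeholders.
kubectl -n kafka delete kafkatopic energydata-review-example
helm upgrade --install my-topics ./my-topics-chart -n kafka
```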

I'll try to get back with some logs. Let me know what other information would be useful.

@tombentley
Member

@forsberg logs would be super useful.

@forsberg

What's interesting is that at the time the Helm chart is installed, creating the 5 topics, there is absolutely no mention of their names in the Topic Operator's log (at level DEBUG). So there's not much to report, log-wise.

The 5 topics all got the same metadata.creationTimestamp: 2020-10-20T10:05:18Z. The 5 topics I created were:

batmoduledata-review-test-aioka-wcn73q
energydata-review-test-aioka-wcn73q
powerdata-review-test-aioka-wcn73q
pvstringdata-review-test-aioka-wcn73q
systemdata-review-test-aioka-wcn73q
uidata-review-test-aioka-wcn73q

What I did have, however, was a number of (other, unrelated) topics where I had tried modifying the replica count, which led to this kind of message:

[2020-10-20 11:14:36,777] ERROR <opicOperator:1376> [oop-thread-1] Error reconciling KafkaTopic kafka/pvstringdata-staging
io.strimzi.operator.topic.ReplicationFactorChangeException: Changing 'spec.replicas' is not supported. This KafkaTopic's 'spec.replicas' should be reverted to 1 and then the replication should be changed directly in Kafka.
        at io.strimzi.operator.topic.TopicOperator.update3Way(TopicOperator.java:700) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.strimzi.operator.topic.TopicOperator.reconcile(TopicOperator.java:597) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.strimzi.operator.topic.TopicOperator.lambda$getKafkaAndReconcile$36(TopicOperator.java:1371) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.vertx.core.Future.lambda$compose$3(Future.java:368) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.Future.lambda$recover$6(Future.java:495) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.Future.lambda$map$4(Future.java:416) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:176) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.doComplete(CompositeFutureImpl.java:252) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.tryComplete(CompositeFutureImpl.java:227) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.tryComplete(CompositeFutureImpl.java:232) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.complete(CompositeFutureImpl.java:199) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.lambda$all$0(CompositeFutureImpl.java:39) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.strimzi.operator.topic.KafkaImpl.lambda$mapFuture$7(KafkaImpl.java:168) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [io.netty.netty-transport-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
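
Reverting those unrelated topics is what the exception asks for; a sketch of that fix, with the namespace and topic name taken from the error above:

```
# Sketch: set spec.replicas back to 1 on the affected KafkaTopic, as the
# exception message suggests.
kubectl -n kafka patch kafkatopic pvstringdata-staging \
  --type merge -p '{"spec":{"replicas":1}}'
```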

It may be a coincidence, but after fixing the unrelated KafkaTopics to set spec.replicas=1, the log starts mentioning the 5 new topics created by the Helm chart:

[2020-10-20 11:38:42,466] DEBUG <TopicWatcher:72> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q: Concurrent modification in kube: new version 74006145
[2020-10-20 11:38:42,466] DEBUG <opicOperator:1253> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q|74006145: Topic batmoduledata-review-test-aioka-wcn73q exists in Kubernetes, but not Kafka
[2020-10-20 11:38:42,467] DEBUG <opicOperator:458> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q|74006145: Queuing action reconcile-with-kube on topic batmoduledata-review-test-aioka-wcn73q
[2020-10-20 11:38:42,467] DEBUG <opicOperator:461> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q|74006145: Adding first waiter reconcile-with-kube

And after this, my 5 KafkaTopics all got a status section and were available in Kafka. So that's weird.

Summing up the timeline:

  • 2020-10-20T10:05:18Z - the 5 new KafkaTopics are created. No mention in the log, not created in Kafka.
  • 2020-10-20 10:22:27 - I retrieve topic-operator.log
  • At approximately 11:38, I updated the topics which had an incorrect spec.replicas.
  • 2020-10-20 11:38:42 - this is the lastTransitionTime of the KafkaTopic pvstringdata-staging after it has been fixed.
  • 2020-10-20 11:42:31 - I retrieve topic-operator-after-fix.log. I screwed up a bit here, so topic-operator.log + topic-operator-after-fix.log is not a complete history, but no actions were taken during the gap, so I don't think it matters.

topic-operator.log
topic-operator-after-fix.log

@scholzj
Member

scholzj commented Oct 20, 2020

I know that the replication exception breaks other updates to the same topic. But I do not think it should impact other topics.

@stanlyDoge @ppatierno @tombentley Can you check this? You are the experts on TO.

@matthewi

To add further data to this, I'm seeing what looks to be a related issue where the Topic Operator fails to materialize some topics in Kafka. On a container restart with DEBUG on, I see the "ignoring" messages; those are the ONLY indication that the operator is even aware of the topic resources in k8s. How long the operator has been running makes no difference.

Modifying a KafkaTopic resource sometimes produces log messages. I'll try to reproduce this and see if I can get a clean run. Despite the log messages, nothing seems to change in the Kafka cluster; the topic does not materialize.

At least in the current cluster where I am observing this, if you delete the KafkaTopic object and recreate it with a spec that includes topicName, replicas, and partitions, the Topic Operator appears to actually function. It is my understanding that those should be optional, since replicas and partitions can be inherited from the cluster definition and topicName should only be used in select cases.

@matthewi

matthewi commented Nov 2, 2020

❯ cat /tmp/kt3.yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
    labels:
        strimzi.io/cluster: test
    name: matt20201030
 11:25:22❯ kubectl -n test-kafka apply -f /tmp/kt3.yaml
kafkatopic.kafka.strimzi.io/matt20201030 created

$ stern -n test-kafka test-  -t -i "matt202010"
+ test-entity-operator-85865799d5-2c5ln › tls-sidecar
+ test-entity-operator-85865799d5-2c5ln › user-operator
+ test-entity-operator-85865799d5-2c5ln › topic-operator
+ test-zookeeper-2 › zookeeper
+ test-kafka-0 › tls-sidecar
+ test-kafka-0 › kafka
+ test-schemaregistry-76d4c7cb86-7dc5n › test-schemaregistry
+ test-zookeeper-1 › zookeeper
+ test-kafka-1 › kafka
+ test-kafka-1 › tls-sidecar
+ test-kafka-2 › kafka
+ test-kafka-2 › tls-sidecar
+ test-zookeeper-0 › zookeeper
- test-zookeeper-0
+ test-zookeeper-0 › zookeeper
- test-zookeeper-2
+ test-zookeeper-2 › zookeeper
- test-zookeeper-1
+ test-zookeeper-1 › zookeeper

Nothing gets logged and the topic doesn't materialize in ZooKeeper or Kafka. There are no events logged:

❯ kubectl -n test-kafka describe kt matt20201030
Name:         matt20201030
Namespace:    test-kafka
Labels:       strimzi.io/cluster=test
Annotations:  <none>
API Version:  kafka.strimzi.io/v1beta1
Kind:         KafkaTopic
Metadata:
  Creation Timestamp:  2020-11-02T17:25:22Z
  Generation:          1
  Resource Version:    20566303
  Self Link:           /apis/kafka.strimzi.io/v1beta1/namespaces/test-kafka/kafkatopics/matt20201030
  UID:                 ebc82b09-abc5-4c33-b577-cbc39937b8ef
Events:                <none>
11:31:46❯ cat /tmp/kt2.yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
    labels:
        strimzi.io/cluster: test
    name: matt20201029
spec:
  topicName: matt20201029
  partitions: 8
  replicas: 3
❯ kubectl create -n test-kafka -f /tmp/kt2.yaml
kafkatopic.kafka.strimzi.io/matt20201029 created


and logs:

```
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.167649700Z [2020-11-02 17:31:46,167] INFO  <TopicWatcher:43> [10.0.0.1/...] 136398|kube +matt20201029|20568057: event ADDED on resource matt20201029 generation=1, labels={strimzi.io/cluster=test}
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.167691000Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:458> [10.0.0.1/...] 136398|kube +matt20201029|20568057: Queuing action onResourceEvent on topic matt20201029
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.167697300Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:461> [10.0.0.1/...] 136398|kube +matt20201029|20568057: Adding first waiter onResourceEvent
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.168027900Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:470> [oop-thread-1] 136398|kube +matt20201029|20568057: Lock acquired
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.168119500Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:471> [oop-thread-1] 136398|kube +matt20201029|20568057: Executing action onResourceEvent on topic matt20201029
```

This behavior seems to be very consistent. I'm currently on Strimzi 0.19 with Kafka 2.5.

@tombentley
Member

Has anyone reproduced this using Strimzi 0.21?

@sknot-rh
Member

I have not tried yet; I can do it later.

@scholzj
Member

scholzj commented Mar 4, 2021

@sknot-rh tried to reproduce this without success. So it was probably fixed by one of the other PRs in Strimzi 0.22. Closing.
