
[Bug] race when creating new topics with TopicOperator #1775

Closed
joekohlsdorf opened this issue Jul 4, 2019 · 14 comments

@joekohlsdorf

Describe the bug

When creating a large number of topics at once with the Topic Operator, I consistently see a race where some of them are recognized as unmanaged by the Topic Operator and their Kubernetes resource definitions get renamed to include a hash.

To Reproduce

Create 500 KafkaTopic resource definitions, then kubectl apply them all at once.
I used a fresh cluster with no load for this test.
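
For reference, a minimal sketch of that kind of bulk creation (the cluster name my-cluster, the kafka namespace, and the partition/replica values are placeholders, not taken from this report):

```
# Sketch only: generate 500 KafkaTopic manifests and apply them in one go.
# Cluster name, namespace and spec values below are placeholders.
for i in $(seq 1 500); do
  cat <<EOF
---
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
  name: testtopic${i}
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 1
  replicas: 1
EOF
done > topics.yaml

kubectl apply -n kafka -f topics.yaml
```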

Expected behavior

I expect all Kafka topics to be created without getting their Kubernetes resource definitions renamed.

Environment (please complete the following information):
Strimzi version: 0.12.1 (same behavior on 0.11.1)
Installation method: YAML files
Kubernetes cluster: 1.11
Infrastructure: AWS, installed with kops

YAML files and logs

Log for a single topic where I saw this behavior:

[2019-07-04 19:32:22,911] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:30:52,150] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:29:21,479] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:27:50,792] INFO <opicOperator:1182> [oop-thread-0] Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23
[2019-07-04 19:26:27,559] INFO <TopicWatcher:39> [oop-thread-1] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: Success processing event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:27,559] INFO <opicOperator:599> [oop-thread-1] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: All three topics are identical
[2019-07-04 19:26:27,559] INFO <opicOperator:460> [oop-thread-1] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:26,704] INFO <TopicWatcher:39> [oop-thread-1] 1090|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585344: Success processing event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,683] INFO <TopicWatcher:36> [0.64.0.1/...] 1130|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585374: event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,674] INFO <opicOperator:460> [oop-thread-1] 1090|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585344: Reconciling topic testtopic112, k8sTopic:null, kafkaTopic:nonnull, privateTopic:null
[2019-07-04 19:26:26,335] INFO <TopicWatcher:39> [oop-thread-1] 1088|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585342: Success processing event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,044] INFO <TopicWatcher:36> [0.64.0.1/...] 1090|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585344: event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,037] INFO <TopicWatcher:36> [0.64.0.1/...] 1088|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585342: event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:26,008] INFO <TopicWatcher:39> [oop-thread-1] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: Success processing event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:25,947] INFO <opicOperator:210> [oop-thread-1] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: Created topic 'testtopic112' for KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23'
[2019-07-04 19:26:25,586] INFO <opicOperator:460> [oop-thread-1] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2019-07-04 19:26:25,093] INFO <TopicWatcher:39> [oop-thread-1] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: Success processing event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:24,510] INFO <opicOperator:343> [oop-thread-1] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: Deleting topic 'testtopic112'
[2019-07-04 19:26:24,510] INFO <opicOperator:460> [oop-thread-1] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: Reconciling topic testtopic112, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:24,412] INFO <TopicWatcher:39> [oop-thread-1] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: Success processing event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:24,412] INFO <opicOperator:599> [oop-thread-1] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: All three topics are identical
[2019-07-04 19:26:24,411] INFO <opicOperator:460> [oop-thread-1] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:24,014] INFO <TopicWatcher:36> [0.64.0.1/...] 1018|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585293: event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,312] INFO <TopicWatcher:36> [0.64.0.1/...] 978|kube -testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585262: event DELETED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,305] INFO <TopicWatcher:36> [0.64.0.1/...] 977|kube =testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585261: event MODIFIED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,292] INFO <TopicWatcher:39> [oop-thread-1] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: Success processing event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:23,285] INFO <opicOperator:210> [oop-thread-1] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: Created topic 'testtopic112' for KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23'
[2019-07-04 19:26:23,048] INFO <opicOperator:460> [oop-thread-1] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2019-07-04 19:26:21,952] INFO <TopicWatcher:39> [oop-thread-1] 560|kube -testtopic112|113585006: Success processing event DELETED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:21,549] INFO <opicOperator:343> [oop-thread-1] 560|kube -testtopic112|113585006: Deleting topic 'testtopic112'
[2019-07-04 19:26:21,549] INFO <opicOperator:460> [oop-thread-1] 560|kube -testtopic112|113585006: Reconciling topic testtopic112, k8sTopic:null, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:20,744] WARN <opicOperator:106> [oop-thread-1] Failure processing KafkaTopic watch event MODIFIED on resource testtopic112 with labels {strimzi.io/cluster=cdc}: Topic 'testtopic112' is already managed via KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23' it cannot also be managed via the KafkaTopic 'testtopic112'
io.strimzi.operator.topic.OperatorException: Topic 'testtopic112' is already managed via KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23' it cannot also be managed via the KafkaTopic 'testtopic112'
[2019-07-04 19:26:20,743] ERROR <TopicWatcher:48> [oop-thread-1] 558|kube =testtopic112|113585005: Failure processing KafkaTopic watch event MODIFIED on resource testtopic112 with labels {strimzi.io/cluster=cdc}: Topic 'testtopic112' is already managed via KafkaTopic 'testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23' it cannot also be managed via the KafkaTopic 'testtopic112'
[2019-07-04 19:26:20,742] INFO <opicOperator:460> [oop-thread-1] 558|kube =testtopic112|113585005: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:nonnull, privateTopic:nonnull
[2019-07-04 19:26:20,197] INFO <TopicWatcher:36> [0.64.0.1/...] 914|kube +testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23|113585218: event ADDED on resource testtopic112---b1ffe0f66cc65c6370fe3ff96df4bc33dcd2de23 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:18,386] INFO <opicOperator:1239> [oop-thread-0] 133|initial kube testtopic112|113584140: Success reconciling KafkaTopic cornershop-qa-kafka-cdc/testtopic112
[2019-07-04 19:26:17,576] INFO <TopicWatcher:36> [0.64.0.1/...] 560|kube -testtopic112|113585006: event DELETED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:17,568] INFO <TopicWatcher:36> [0.64.0.1/...] 558|kube =testtopic112|113585005: event MODIFIED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:17,512] INFO <opicOperator:460> [oop-thread-0] 133|initial kube testtopic112|113584140: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:nonnull
[2019-07-04 19:26:07,656] INFO <TopicWatcher:39> [oop-thread-1] 7|kube +testtopic112|113584140: Success processing event ADDED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
[2019-07-04 19:26:07,643] INFO <opicOperator:210> [oop-thread-1] 7|kube +testtopic112|113584140: Created topic 'testtopic112' for KafkaTopic 'testtopic112'
[2019-07-04 19:26:06,002] INFO <opicOperator:460> [oop-thread-1] 7|kube +testtopic112|113584140: Reconciling topic testtopic112, k8sTopic:nonnull, kafkaTopic:null, privateTopic:null
[2019-07-04 19:26:05,658] INFO <TopicWatcher:36> [0.64.0.1/...] 7|kube +testtopic112|113584140: event ADDED on resource testtopic112 with labels {strimzi.io/cluster=cdc}
@tombentley
Member

Note: the above log lines are in reverse order; read the log from the bottom up.

@tombentley
Member

@joekohlsdorf thanks for this bug report. I've not been able to reproduce this yet. Is there any chance you can reproduce this with the Topic Operator logging at DEBUG? You can configure it easily in the Kafka resource, like this:

kind: Kafka
spec:
  # ...
  entityOperator:
    topicOperator:
      logging: 
        type: inline
        loggers:
          rootLogger.level: DEBUG

It seems that you created the topics immediately after/while the TO was starting up. Are you able to reproduce this when the TO has been running for a while? (I'm wondering whether this is a race between the reconciliation we do at start-up and the event-based reconciliation, or something which can happen purely event-driven.)

@joekohlsdorf
Author

Here is a debug log of this; due to the amount of log produced, I only extracted lines which contain the topic name. The log is in reverse order.
https://gist.github.com/joekohlsdorf/983f65238691e86ddd5210d75c66b404

In the case of the issue shown in the logs, the topics were created right after starting up the Topic Operator; I deployed the manifests for cluster and topic creation together.

I can also reproduce this when the Topic Operator has been running for a while; sometimes I have to delete and redeploy topics multiple times to correct the issue.

@joekohlsdorf
Author

In the original bug report I changed the topic name in the debug log for privacy reasons.

While debugging a related issue I came to the conclusion that this could happen if topic names contain characters which cannot be represented in K8s resource names. The redacted topic names from the log contained underscores.
I hope this helps in further analyzing the issue.
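
For illustration only: the renamed resources in the log above follow what looks like a "sanitized name plus hash of the topic name" pattern (e.g. testtopic112---b1ffe0f6...). A rough sketch of that kind of mapping, not Strimzi's exact algorithm:

```
# Rough sketch: map a Kafka topic name that is not a valid Kubernetes resource
# name (e.g. one containing underscores) to a unique resource name.
# Illustration only; the sanitization rules and the hash are assumptions.
topic='test_topic_112'
# Lower-case and replace characters not allowed in K8s resource names.
sanitized=$(printf '%s' "$topic" | tr '[:upper:]' '[:lower:]' | tr -c 'a-z0-9.-' '-')
# Append a hash of the original name so distinct topics map to distinct resources.
hash=$(printf '%s' "$topic" | sha1sum | cut -d ' ' -f 1)
echo "${sanitized}---${hash}"
```

If the original KafkaTopic and an operator-generated resource with the hashed name both end up pointing at the same topic, that would line up with the "already managed via KafkaTopic" error in the log above.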

@tombentley
Member

Further info on reproducing this:

I spun up a test cluster in Minikube, created 30 topics with underscores in their names, and started rolling Kafka by changing the version in its spec. It seems to be independent of what change you make; rolling the cluster is enough.
This old issue I reported a long time ago might be related: #1775
In Minikube only 1 of my 30 test topics was affected, and it took some attempts to trigger the issue. The other cluster runs with EBS-backed volumes, so I guess rolling takes longer and the issue is more likely to be triggered; there, all topics with underscores in their names were affected.
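
For completeness, one way to trigger such a roll from the command line (cluster name, namespace and target version are placeholders; per the above, any spec change that rolls the brokers seems to be enough):

```
# Placeholder example: roll the Kafka cluster by changing spec.kafka.version.
# Cluster name, namespace and version are assumptions, not from this report.
kubectl -n kafka patch kafka my-cluster --type merge \
  -p '{"spec":{"kafka":{"version":"2.6.0"}}}'
```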

@forsberg

I'm hitting this or something similar. I have a Helm chart that creates multiple (about 14) topics, and in my setup (3-node Kafka cluster on EKS using st1-backed volumes, 3 ZooKeeper nodes) the behaviour is triggered most of the time I do a fresh installation of this Helm chart.

All of the topics get created, but most of them have no status and are not actually available in the Kafka cluster.

The topics are available neither under /strimzi/topics nor under /brokers/topics in ZK.

Restarting the Topic Operator doesn't help; if I restart, the only mention of the bad topics in the debug log is "Ignoring initial event for KafkaTopic topicname during initial reconcile".

If I add a topicName to the KafkaTopic spec, the Topic Operator adds an event to the KafkaTopic saying that it can't be renamed.

What works is to remove one KafkaTopic at a time, then rerun the Helm chart installation so it gets recreated.
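
Roughly, that workaround looks like this (topic, namespace, release and chart names here are placeholders):

```
# Sketch of the workaround described above; all names are placeholders.
kubectl -n kafka delete kafkatopic energydata-review-example
helm upgrade --install my-topics ./my-topics-chart -n kafka
```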

I'll try to get back with some logs. Let me know what other information would be useful.

@tombentley
Member

@forsberg logs would be super useful.

@forsberg

What's interesting is that at the time the Helm chart is installed, creating the 5 topics, there is absolutely no mention of their names in the Topic Operator's log (at level DEBUG). So there's not much to report, log-wise.

The 5 topics all got the same metadata.creationTimestamp: 2020-10-20T10:05:18Z. The 5 topics I created were:

batmoduledata-review-test-aioka-wcn73q
energydata-review-test-aioka-wcn73q
powerdata-review-test-aioka-wcn73q
pvstringdata-review-test-aioka-wcn73q
systemdata-review-test-aioka-wcn73q
uidata-review-test-aioka-wcn73q

What I did have, however, was a number of (other, unrelated) topics where I had tried modifying the replica count, which led to this kind of message:

[2020-10-20 11:14:36,777] ERROR <opicOperator:1376> [oop-thread-1] Error reconciling KafkaTopic kafka/pvstringdata-staging
io.strimzi.operator.topic.ReplicationFactorChangeException: Changing 'spec.replicas' is not supported. This KafkaTopic's 'spec.replicas' should be reverted to 1 and then the replication should be changed directly in Kafka.
        at io.strimzi.operator.topic.TopicOperator.update3Way(TopicOperator.java:700) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.strimzi.operator.topic.TopicOperator.reconcile(TopicOperator.java:597) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.strimzi.operator.topic.TopicOperator.lambda$getKafkaAndReconcile$36(TopicOperator.java:1371) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.vertx.core.Future.lambda$compose$3(Future.java:368) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.Future.lambda$recover$6(Future.java:495) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.Future.lambda$map$4(Future.java:416) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:176) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.doComplete(CompositeFutureImpl.java:252) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.tryComplete(CompositeFutureImpl.java:227) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.tryComplete(CompositeFutureImpl.java:232) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.complete(CompositeFutureImpl.java:199) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.CompositeFutureImpl.lambda$all$0(CompositeFutureImpl.java:39) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.strimzi.operator.topic.KafkaImpl.lambda$mapFuture$7(KafkaImpl.java:168) ~[io.strimzi.topic-operator-0.19.0.jar:0.19.0]
        at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [io.netty.netty-transport-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.50.Final.jar:4.1.50.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
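
Reverting those unrelated topics is what the exception asks for; a sketch of that fix, with the namespace and topic name taken from the error above:

```
# Sketch: set spec.replicas back to 1 on the affected KafkaTopic, as the
# exception message suggests.
kubectl -n kafka patch kafkatopic pvstringdata-staging \
  --type merge -p '{"spec":{"replicas":1}}'
```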

It may be a coincidence, but after fixing the unrelated KafkaTopics to set spec.replicas=1, the log starts mentioning the 5 new topics created by the Helm chart:

[2020-10-20 11:38:42,466] DEBUG <TopicWatcher:72> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q: Concurrent modification in kube: new version 74006145
[2020-10-20 11:38:42,466] DEBUG <opicOperator:1253> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q|74006145: Topic batmoduledata-review-test-aioka-wcn73q exists in Kubernetes, but not Kafka
[2020-10-20 11:38:42,467] DEBUG <opicOperator:458> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q|74006145: Queuing action reconcile-with-kube on topic batmoduledata-review-test-aioka-wcn73q
[2020-10-20 11:38:42,467] DEBUG <opicOperator:461> [oop-thread-1] 342626|periodic kube batmoduledata-review-test-aioka-wcn73q|74006145: Adding first waiter reconcile-with-kube

And after this, my 5 KafkaTopics all got a status section and were available in Kafka. So that's weird.

Summing up the timeline:

  • 2020-10-20T10:05:18Z - the 5 new KafkaTopics are created. No mention in the log, not created in Kafka.
  • 2020-10-20 10:22:27 - I retrieve topic-operator.log
  • At approximately 11:38, I updated the topics which had an incorrect spec.replicas.
  • 2020-10-20 11:38:42 - this is the lastTransitionTime of the KafkaTopic pvstringdata-staging after it has been fixed.
  • 2020-10-20 11:42:31 - I retrieve topic-operator-after-fix.log. I screwed up a bit here, so topic-operator.log + topic-operator-after-fix.log is not a complete history, but no actions were taken during the gap, so I don't think it matters.

topic-operator.log
topic-operator-after-fix.log

@scholzj
Member

scholzj commented Oct 20, 2020

I know that the replication exception breaks other updates to the same topic. But I do not think it should impact other topics.

@stanlyDoge @ppatierno @tombentley Can you check this? You are the experts on TO.

@matthewi

To add further data to this, I'm seeing what looks to be a related issue where the Topic Operator fails to materialize some topics in Kafka. On a container restart with DEBUG on, I see the "ignoring" messages; those are the ONLY indication that the operator is even aware of the topic resources in k8s. How long the operator has been running makes no difference.

Modifying a KafkaTopic resource sometimes produces log messages. I'll try to reproduce this and see if I can get a clean run. Despite the log messages, nothing seems to change in the Kafka cluster; the topic does not materialize.

At least in the current cluster where I am observing this, if you delete the KafkaTopic object and recreate it with a spec that includes topicName, replicas, and partitions, the Topic Operator appears to actually function. It is my understanding that those should be optional, since replicas and partitions can be inherited from the cluster definition and topicName should only be used in select cases.

@matthewi

matthewi commented Nov 2, 2020

❯ cat /tmp/kt3.yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
    labels:
        strimzi.io/cluster: test
    name: matt20201030
 11:25:22❯ kubectl -n test-kafka apply -f /tmp/kt3.yaml
kafkatopic.kafka.strimzi.io/matt20201030 created

$ stern -n test-kafka test-  -t -i "matt202010"
+ test-entity-operator-85865799d5-2c5ln › tls-sidecar
+ test-entity-operator-85865799d5-2c5ln › user-operator
+ test-entity-operator-85865799d5-2c5ln › topic-operator
+ test-zookeeper-2 › zookeeper
+ test-kafka-0 › tls-sidecar
+ test-kafka-0 › kafka
+ test-schemaregistry-76d4c7cb86-7dc5n › test-schemaregistry
+ test-zookeeper-1 › zookeeper
+ test-kafka-1 › kafka
+ test-kafka-1 › tls-sidecar
+ test-kafka-2 › kafka
+ test-kafka-2 › tls-sidecar
+ test-zookeeper-0 › zookeeper
- test-zookeeper-0
+ test-zookeeper-0 › zookeeper
- test-zookeeper-2
+ test-zookeeper-2 › zookeeper
- test-zookeeper-1
+ test-zookeeper-1 › zookeeper

Nothing gets logged and the topic doesn't materialize in ZooKeeper or Kafka. There are no events logged:

❯ kubectl -n test-kafka describe kt matt20201030
Name:         matt20201030
Namespace:    test-kafka
Labels:       strimzi.io/cluster=test
Annotations:  <none>
API Version:  kafka.strimzi.io/v1beta1
Kind:         KafkaTopic
Metadata:
  Creation Timestamp:  2020-11-02T17:25:22Z
  Generation:          1
  Resource Version:    20566303
  Self Link:           /apis/kafka.strimzi.io/v1beta1/namespaces/test-kafka/kafkatopics/matt20201030
  UID:                 ebc82b09-abc5-4c33-b577-cbc39937b8ef
Events:                <none>
11:31:46❯ cat /tmp/kt2.yaml
apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaTopic
metadata:
    labels:
        strimzi.io/cluster: test
    name: matt20201029
spec:
  topicName: matt20201029
  partitions: 8
  replicas: 3
❯ kubectl create -n test-kafka -f /tmp/kt2.yaml
kafkatopic.kafka.strimzi.io/matt20201029 created


and logs:

```
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.167649700Z [2020-11-02 17:31:46,167] INFO  <TopicWatcher:43> [10.0.0.1/...] 136398|kube +matt20201029|20568057: event ADDED on resource matt20201029 generation=1, labels={strimzi.io/cluster=test}
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.167691000Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:458> [10.0.0.1/...] 136398|kube +matt20201029|20568057: Queuing action onResourceEvent on topic matt20201029
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.167697300Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:461> [10.0.0.1/...] 136398|kube +matt20201029|20568057: Adding first waiter onResourceEvent
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.168027900Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:470> [oop-thread-1] 136398|kube +matt20201029|20568057: Lock acquired
test-entity-operator-85865799d5-2c5ln topic-operator 2020-11-02T17:31:46.168119500Z [2020-11-02 17:31:46,167] DEBUG <opicOperator:471> [oop-thread-1] 136398|kube +matt20201029|20568057: Executing action onResourceEvent on topic matt20201029
```

This behavior seems to be very consistent. I'm currently on Strimzi 0.19 with Kafka 2.5.

@tombentley
Member

Has anyone reproduced this using Strimzi 0.21?

@sknot-rh
Member

I have not tried yet; I can do it later.

@scholzj
Member

scholzj commented Mar 4, 2021

@sknot-rh tried to reproduce this without success. So it was probably fixed by one of the other PRs in Strimzi 0.22. Closing.
