
NPE when incorrect topologySpreadConstraints are applied on a fresh cluster #7007

Closed
larvinloy opened this issue Jun 30, 2022 · 5 comments

@larvinloy

Describe the bug
I was trying out topologySpreadConstraints in our testing environment. I added the following configuration for the broker:

topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAyway
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
              - kafka.broker
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAyway
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
              - kafka.zookeeper

The second item in the constraints array was a mistake and was supposed to be applied to zookeeper. Nonetheless, this is what I observed:

  1. Update on an existing Kafka cluster:
  • No effect
  • But I haven't tested scaling out brokers; I suspect it would fail.
  2. Creating a new Kafka cluster with this config:
  • The operator throws an NPE
  • I suspect it has to do with this line where it gets the pod info using the fabric8 client, but the metadata is missing since it's looking up the wrong label? (See the sketch below.)
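
For illustration, a minimal standalone sketch (not the actual KafkaRoller code) of the suspected failure mode: the fabric8 client simply returns null when asked for a pod that does not exist, and anything that then dereferences the Pod fails with an NPE. The namespace and pod name below are placeholders, and the fabric8 6.x KubernetesClientBuilder is assumed.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class MissingPodNpeDemo {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Placeholder names; the point is that this pod does not exist.
            Pod pod = client.pods()
                    .inNamespace("kafka")
                    .withName("kafka-cluster-kafka-0")
                    .get();                        // returns null for a missing pod

            // Dereferencing the null Pod is the kind of call that surfaces as the NPE.
            System.out.println(pod.getMetadata().getUid());   // NullPointerException
        }
    }
}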

NPE stacktrace:

2022-06-27 06:44:46 ERROR AbstractOperator:247 - Reconciliation #18(timer) Kafka(kafka/kafka-cluster): createOrUpdate failed
java.lang.NullPointerException: null
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:637) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:364) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:277) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]

This was no longer an issue once the zookeeper topology constraint was removed from the broker config, but I think we should handle this error better than throwing an NPE.

larvinloy added the bug label on Jun 30, 2022
@scholzj
Member

scholzj commented Jun 30, 2022

Can you please share the full Kafka CR to reproduce this? Without the full operator log, it is also quite hard to understand what exactly happened. Are you saying that by using this topologySpreadConstraints configuration you get the error right in the first reconciliation?

@larvinloy
Author

larvinloy commented Jun 30, 2022

Are you saying that by using this topologySpreadConstraints configuration you get the error right in the first reconciliation?

Yes

Debug level logs leading up to the NPE:
extract-2022-06-30T17_58_15.818Z.csv

Kafka CR:
cr.log

Note: Some sensitive info has been redacted from CR/logs

@scholzj
Member

scholzj commented Jun 30, 2022

Great, thanks. I will have a look.

@scholzj
Member

scholzj commented Jun 30, 2022

This still seems to be a problem in the current main branch when StatefulSets are used; just the stacktrace has changed a bit:

2022-06-30 22:13:06 INFO  KafkaRoller:291 - Reconciliation #1(watch) Kafka(myproject/my-cluster): Could not roll pod {my-cluster-kafka-2}, giving up after 10 attempts. Total delay between attempts 127750ms
java.lang.NullPointerException: null
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:634) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:338) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:276) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]

When StrimziPodSets are used, this particular Kafka CR also ends up with other errors. But even then, other situations might trigger this.

This is caused by the pod being non-existent. When the pod does not exist, we for sure don't need to reconfigure or restart it. So we should probably just wait for readiness? When the pod does not exist, it will anyway fail sooner or later on the readiness check. Another alternative might be to kill the reconciliation right away and throw some error (a rough sketch of that option follows below). Also, before it runs into the rolling part, it tries to connect for 5 minutes to collect the configuration, which also seems unnecessary.
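
A rough sketch of the fail-fast option, purely for discussion (the class, method, and message below are hypothetical, not the actual Strimzi fix): check for the missing pod up front and fail the reconciliation with a descriptive error instead of letting a null Pod propagate into an NPE.

import io.fabric8.kubernetes.api.model.Pod;

// Hypothetical guard, not the actual Strimzi code: turn a missing pod into a
// descriptive error instead of a NullPointerException further down the roller.
final class PodPreconditions {
    static Pod requireExistingPod(Pod pod, String podName) {
        if (pod == null) {
            throw new IllegalStateException(
                    "Pod " + podName + " does not exist, so it cannot be reconfigured or restarted. "
                    + "Check the pod template (e.g. topologySpreadConstraints) for values that "
                    + "prevent the pod from being created or scheduled.");
        }
        return pod;
    }
}

If such a check ran before any attempt to connect or to restart the pod, it would also sidestep the five-minute connection wait mentioned above.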

@scholzj
Member

scholzj commented Jun 30, 2022

Might be best to fix this after #6663 is merged to avoid conflicts. The NPEs are not nice, but technically this would end in an error anyway, just a different one. So I don't think it matters that much whether it is fixed now or in a few days.
