
NPE when incorrect topologySpreadConstraints are applied on a fresh cluster #7007

Closed
larvinloy opened this issue Jun 30, 2022 · 5 comments

@larvinloy

Describe the bug
I was trying out topologySpreadConstraints in our testing environment. I added the following configuration for the broker:

topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAyway
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
              - kafka.broker
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAyway
        labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
              - kafka.zookeeper

The second item in the constraints array was a mistake and was supposed to be applied to zookeeper. Nonetheless, this is what I observed:

  1. Update on an existing Kafka cluster:
  • No effect
  • But I haven't tested scaling out brokers; I suspect it would fail.
  2. Creating a new Kafka cluster with this config:
  • The operator throws an NPE
  • I suspect it has to do with this line where it gets the pod info using the fabric8 client, but the metadata is missing since it's looking up the wrong label? (See the sketch below.)
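
For illustration, a minimal standalone sketch (not the actual KafkaRoller code) of the suspected failure mode: the fabric8 client simply returns null when asked for a pod that does not exist, and anything that then dereferences the Pod fails with an NPE. The namespace and pod name below are placeholders, and the fabric8 6.x KubernetesClientBuilder is assumed.

import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class MissingPodNpeDemo {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Placeholder names; the point is that this pod does not exist.
            Pod pod = client.pods()
                    .inNamespace("kafka")
                    .withName("kafka-cluster-kafka-0")
                    .get();                        // returns null for a missing pod

            // Dereferencing the null Pod is the kind of call that surfaces as the NPE.
            System.out.println(pod.getMetadata().getUid());   // NullPointerException
        }
    }
}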

NPE stacktrace:

2022-06-27 06:44:46 ERROR AbstractOperator:247 - Reconciliation #18(timer) Kafka(kafka/kafka-cluster): createOrUpdate failed
java.lang.NullPointerException: null
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:637) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:364) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:277) ~[io.strimzi.cluster-operator-0.27.1.jar:0.27.1]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]

This was no longer an issue once the zookeeper topology constraint was removed from the broker config, but I think we should handle this error better than throwing an NPE.

larvinloy added the bug label on Jun 30, 2022
@scholzj
Member

scholzj commented Jun 30, 2022

Can you please share the full Kafka CR to reproduce this? Without the full operator log, it is also quite hard to understand what exactly happened. Are you saying that by using this topologySpreadConstraints configuration you get the error right in the first reconciliation?

@larvinloy
Author

larvinloy commented Jun 30, 2022

Are you saying that by using this topologySpreadConstraints configuration you get the error right in the first reconciliation?

Yes

Debug level logs leading up to the NPE:
extract-2022-06-30T17_58_15.818Z.csv

Kafka CR:
cr.log

Note: Some sensitive info has been redacted from CR/logs

@scholzj
Member

scholzj commented Jun 30, 2022

Great, thanks. I will have a look.

@scholzj
Member

scholzj commented Jun 30, 2022

This still seems to be a problem in the current main branch when StatefulSets are used; just the stacktrace has changed a bit:

2022-06-30 22:13:06 INFO  KafkaRoller:291 - Reconciliation #1(watch) Kafka(myproject/my-cluster): Could not roll pod {my-cluster-kafka-2}, giving up after 10 attempts. Total delay between attempts 127750ms
java.lang.NullPointerException: null
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:634) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:338) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:276) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]

When StrimziPodSets are used, this particular Kafka CR also ends up with other errors. But even then, other situations might trigger this.

This is caused by the pod being non-existent. When the pod does not exist, we for sure don't need to reconfigure or restart it. So we should probably just wait for readiness? When the pod does not exist, it will anyway fail sooner or later on the readiness check. Another alternative might be to kill the reconciliation right away and throw some error (a rough sketch of that option follows below). Also, before it runs into the rolling part, it tries to connect for 5 minutes to collect the configuration, which also seems unnecessary.
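
A rough sketch of the fail-fast option, purely for discussion (the class, method, and message below are hypothetical, not the actual Strimzi fix): check for the missing pod up front and fail the reconciliation with a descriptive error instead of letting a null Pod propagate into an NPE.

import io.fabric8.kubernetes.api.model.Pod;

// Hypothetical guard, not the actual Strimzi code: turn a missing pod into a
// descriptive error instead of a NullPointerException further down the roller.
final class PodPreconditions {
    static Pod requireExistingPod(Pod pod, String podName) {
        if (pod == null) {
            throw new IllegalStateException(
                    "Pod " + podName + " does not exist, so it cannot be reconfigured or restarted. "
                    + "Check the pod template (e.g. topologySpreadConstraints) for values that "
                    + "prevent the pod from being created or scheduled.");
        }
        return pod;
    }
}

If such a check ran before any attempt to connect or to restart the pod, it would also sidestep the five-minute connection wait mentioned above.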

@scholzj
Member

scholzj commented Jun 30, 2022

Might be best to fix this after #6663 is merged to avoid conflicts. The NPEs are not nice, but technically this would end in an error anyway, just a different one. So I don't think it matters that much whether it is fixed now or in a few days.
