Takes too long for KafkaUser to be ready #5691

Closed
GarimaBathla opened this issue Oct 9, 2021 · 44 comments · Fixed by #7583

@GarimaBathla

Describe the bug
When creating a few Kafka users - say 1-5 - it works just fine: the KafkaUsers reach Ready status in no time. But when I try to create KafkaUsers in bulk - say 50 users at a time - some users can take almost 10 minutes to be "Ready". This means clients cannot publish/consume messages using these KafkaUsers, and the wait can get very long as the number of KafkaUsers on a Kafka cluster increases.

To Reproduce
Steps to reproduce the behavior:

  1. You need one Kafka cluster and one topic already created.

  2. Create 50 KafkaUsers on the Strimzi Kafka cluster at the same time via a script (I create them through the Kubernetes API).

  3. Within a minute, try to use all of these users to publish/read, and you will get Topic Authorization Failed and Group Authorization Failed errors, because the users are not yet "Ready".

  4. Create the 'KafkaUser' custom resources using the YAML below:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: cog-reader
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Read
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe
      - resource:
          type: group
          name: cog-group
          patternType: literal
        operation: Read
  5. Authorization is simple; authentication is SASL_SSL/SCRAM-SHA-512.
  6. Apply 50 such users in parallel to the Strimzi cluster.
  7. Check the logs of the user-operator; they are cluttered with TimeoutExceptions:
2021-10-09 06:55:40 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=describeAcls, deadlineMs=1633762539991, tries=1, nextAllowedTryMs=1633762540099) timed out at 1633762539999 after 1 attempt(s)
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeAcls
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
2021-10-09 06:55:40 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=alterUserScramCredentials, deadlineMs=1633762539992, tries=1, nextAllowedTryMs=1633762540099) timed out at 1633762539999 after 1 attempt(s)
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeClientQuotas, deadlineMs=1633758992264, tries=1, nextAllowedTryMs=1633758992365) timed out at 1633758992265 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
2021-10-09 05:56:32 DEBUG StatusDiff:46 - Status differs: {"op":"add","path":"/conditions/0/reason","value":"TimeoutException"}
2021-10-09 05:56:32 DEBUG StatusDiff:48 - Desired Status path /conditions/0/reason has value "TimeoutException"
org.apache.kafka.common.errors.TimeoutException: Call(callName=describeClientQuotas, deadlineMs=1633758992264, tries=1, nextAllowedTryMs=1633758992365) timed out at 1633758992265 after 1 attempt(s)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas

java.lang.Exception: TimeoutException: Timed out waiting to send the call. Call: describeClientQuotas
        at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$TimeoutProcessor.handleTimeouts(KafkaAdminClient.java:912) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.timeoutCallsToSend(KafkaAdminClient.java:993) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1301) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
        at java.lang.Thread.run(Thread.java:829) [?:?]

Expected behavior
Creating a KafkaUser with ACLs should be a straightforward call, and it should behave deterministically whether 5 users or 50 users are created.

Environment (please complete the following information):

  • Strimzi version: 0.25.0

  • Installation method:

    • kubectl get ns kafka || kubectl create ns kafka
    • helm repo add strimzi https://strimzi.io/charts/
    • helm upgrade --install strimzi-kafka strimzi/strimzi-kafka-operator --version="0.25.0" -n kafka
  • Kubernetes cluster: EKS 1.18

  • Infrastructure: Amazon EKS

YAML files and logs
Kafka cluster YAML

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: pod-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 2.8.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: external
        authentication:
          type: scram-sha-512
        port: 9094
        tls: true
        type: loadbalancer
    authorization:
      type: simple
    config:
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      log.message.format.version: "2.8"
    storage:
      type: persistent-claim
      size: 50Gi
      deleteClaim: true
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 50Gi
      deleteClaim: true
  entityOperator:
    userOperator: 
      reconciliationIntervalSeconds: 157680000
      logging:
        type: inline
        loggers:
          rootLogger.level: DEBUG      
    topicOperator: 
      reconciliationIntervalSeconds: 157680000
      logging:
        type: inline
        loggers:
          rootLogger.level: DEBUG      

Topic yaml

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: pod-notifications
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  partitions: 1
  replicas: 3

Publisher user (create multiple - this is just an example)

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: pod-agent-writer
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Write
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe

Consumer user (create multiple - this is just an example)

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: cog-reader
  labels:
    strimzi.io/cluster: pod-kafka-cluster
spec:
  authentication:
    type: scram-sha-512
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Read
      - resource:
          type: topic
          name: pod-notifications
          patternType: literal
        operation: Describe
      - resource:
          type: group
          name: cog-group
          patternType: literal
        operation: Read

Additional context
Use case: I need to create Kafka users/topics on demand, so I have a microservice that invokes the Kubernetes API, which in turn creates Strimzi custom resources. The Topic Operator works fine and topics reach Ready status, but KafkaUsers can take too long to be Ready, and this causes some components that rely on this microservice to fail.
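
For illustration, here is a hypothetical sketch of what that bulk-creation path can look like, assuming the fabric8 Kubernetes client; the class name, user names, and the manifest template are illustrative, not the actual service code:

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class BulkUserCreator {
    // illustrative KafkaUser manifest template; %s is replaced with the user name
    private static final String TEMPLATE =
            "apiVersion: kafka.strimzi.io/v1beta2\n"
            + "kind: KafkaUser\n"
            + "metadata:\n"
            + "  name: %s\n"
            + "  labels:\n"
            + "    strimzi.io/cluster: pod-kafka-cluster\n"
            + "spec:\n"
            + "  authentication:\n"
            + "    type: scram-sha-512\n";

    public static void main(String[] args) {
        try (KubernetesClient client = new DefaultKubernetesClient()) {
            for (int i = 0; i < 50; i++) {
                String yaml = String.format(TEMPLATE, "cog-reader-" + i);
                // every create fires a watch event at the User Operator immediately,
                // so 50 creations produce 50 near-simultaneous reconciliations
                client.load(new ByteArrayInputStream(yaml.getBytes(StandardCharsets.UTF_8)))
                      .inNamespace("kafka")
                      .createOrReplace();
            }
        }
    }
}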

Earlier, periodic reconciliation was causing too many issues, so I turned it off. Now, as the number of users has increased, all those timeout errors are back. Could it be that the Kafka Admin client is already busy and ignoring all the calls?

Any help or pointers here would be appreciated.

@GarimaBathla
Author

I see the below exception in the broker logs

2021-10-09 18:23:06,696 INFO [ReplicaFetcher replicaId=1, leaderId=0, fetcherId=0] Error sending fetch request (sessionId=1689750346, epoch=70118) to node 0: (org.apache.kafka.clients.FetchSessionHandler) [ReplicaFetcherThread-0-0]
java.io.IOException: Client was shutdown before response was read
        at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:109)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:110)
        at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:217)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:317)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:141)
        at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:140)
        at scala.Option.foreach(Option.scala:437)
        at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:140)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:123)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

Is it possible the admin client in the user-operator is timing out too soon, before the broker can actually respond?

@sknot-rh
Member

sknot-rh commented Oct 11, 2021

I just tried with your configuration and the users were ready within seconds. I will do more testing to see if I can get the same error as you.
Edit:
With 500 KafkaUsers I am getting these exceptions too.

@GarimaBathla
Author

I wonder if this has something to do with the JVM memory footprint of the user-operator, or Kafka's memory footprint? I have used the default settings and never changed them.

In my case it happens with a very low number of users, which is a concern.

What version of Strimzi are you using?

@scholzj
Member

scholzj commented Oct 11, 2021

There are no default settings. You have to configure the memory and CPU requests in the Kafka CR for your use case. Without that, it might fall back to some LimitRange settings or have no settings at all, which means it could get anywhere from all of a node's resources to none, depending on your cluster's situation.

@GarimaBathla
Author

Thanks @scholzj (updating here for completeness). Could you share a sample YAML showing how I can set CPU/memory for the Kafka cluster? I will try to experiment with it. Alternatively, if @sknot-rh could share the Kafka configuration that works for 500 users within seconds, it would help me fine-tune my Kafka cluster settings.

@GarimaBathla
Author

#4471 - I don't know if my issue is related to this, but I am following that thread, as I see my cluster also getting OOM-killed now and then.

@scholzj
Member

scholzj commented Oct 12, 2021

It is described in the docs: you just add the resources block with the appropriate settings to the YAML. This is the resource I use for development, which should give you an idea:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  labels:
    app: my-cluster
spec:
  kafka:
    replicas: 3
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 2Gi
        cpu: "1"
    jvmOptions:
      -Xms: 1024m
      -Xmx: 1024m
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
        authentication:
          type: scram-sha-512
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
    authorization:
      type: simple
    config:
      auto.create.topics.enable: "false"
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
  zookeeper:
    replicas: 3
    resources:
      requests:
        memory: 1Gi
        cpu: "0.3"
      limits:
        memory: 1Gi
        cpu: "0.5"
    jvmOptions:
      -Xms: 512m
      -Xmx: 512m
    storage:
      type: persistent-claim
      size: 100Gi
  entityOperator:
    topicOperator:
      resources:
        requests:
          memory: 256Mi
          cpu: "0.2"
        limits:
          memory: 256Mi
          cpu: "0.5"
    userOperator:
      resources:
        requests:
          memory: 512Mi
          cpu: "0.2"
        limits:
          memory: 512Mi
          cpu: "0.5"
  kafkaExporter:
    resources:
      requests:
        memory: 256Mi
        cpu: "0.1"
      limits:
        memory: 256Mi
        cpu: "0.5"

You will need to adjust these for your use-cases.

@sknot-rh
Member

@GarimaBathla I did further testing with 500 KafkaUsers and found that each KafkaUser requires 5 AdminClient calls per reconciliation. That means a high number of KafkaUsers is basically DDoSing the Kafka brokers with AdminClient requests. I tried increasing the number of Kafka brokers (3 -> 7), which improved the situation a bit, but there are still KafkaUsers which are not Ready after a long time. I think we should try to batch the requests to reduce the brokers' load (@tombentley's idea).
What is your use case for creating such a high number of KafkaUsers?
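
To make that fan-out concrete, here is a hedged sketch of roughly what those per-user round-trips look like with the plain Kafka Admin API; the principal names and bootstrap address are illustrative, and the actual operator code differs:

import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntryFilter;
import org.apache.kafka.common.acl.AclBindingFilter;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.quota.ClientQuotaFilter;
import org.apache.kafka.common.quota.ClientQuotaFilterComponent;
import org.apache.kafka.common.resource.ResourcePatternFilter;

public class PerUserFanOut {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "pod-kafka-cluster-kafka-bootstrap:9092");
        try (Admin admin = Admin.create(p)) {
            // the operator checks both possible principals for every KafkaUser
            for (String principal : List.of("User:CN=cog-reader", "User:cog-reader")) {
                // one DESCRIBE_ACLS round-trip per principal
                admin.describeAcls(new AclBindingFilter(
                        ResourcePatternFilter.ANY,
                        new AccessControlEntryFilter(principal, null,
                                AclOperation.ANY, AclPermissionType.ANY)))
                     .values().get();
                // one DESCRIBE_CLIENT_QUOTAS round-trip per principal
                admin.describeClientQuotas(ClientQuotaFilter.contains(List.of(
                        ClientQuotaFilterComponent.ofEntity("user",
                                principal.substring("User:".length())))))
                     .entities().get();
            }
            // plus the SCRAM credential call(s): roughly 5 requests per user,
            // so 50 users arriving at once means ~250 broker requests
        }
    }
}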

@GarimaBathla
Author

@sknot-rh - my use case is creating KafkaUsers on demand; my application supports multiple clients on the same Kafka infrastructure. For each client we need to set up topics/users with ACLs so that each client can receive its own set of notifications based on its actions - think of supporting multi-tenant DB applications.

Again, in my use case the load on the brokers is pretty low - we send say 70 notifications/hr on a topic, and I might have 60 topics per cluster. But for each topic I have many consumers, say 6 users per tenant who consume from that topic.

My need is not 500, but say 300 total Kafka users, and I want to create 30 at a time.

@sknot-rh - I am surprised you only saw this problem with 500 KafkaUsers and not a lower number. Could you share your Kafka cluster configuration? As in the thread above, I think I have 2 problems: 1. I don't allocate enough resources to my Kafka cluster - it has the default Strimzi settings; 2. I create on-demand resources that put load on the brokers.

So I want to try your configuration/version as well. Thanks for your help.

@sknot-rh
Member

I used the same configuration as you posted above. I think it is very flaky.

@tombentley
Member

I think we should try to batch the requests to reduce the brokers' load (@tombentley's idea).

A good first step would be to explore avoiding reconciling all the CRs at the same time (on the periodic reconciliation), instead spreading them evenly over the reconciliation interval. That would help with periodic reconciliations. It wouldn't help with the event-based reconciliations that happen when a whole bunch of CRs get created, deleted or modified at the same time. For those we'd need batching.
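
A minimal sketch of the spreading idea, assuming a hypothetical reconcile() per resource (not Strimzi code): give each CR its own offset inside the interval, so the brokers see a steady trickle instead of a burst.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SpreadReconciler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // schedule each CR at an even offset within the interval instead of all at once
    public void schedulePeriodic(List<String> users, long intervalMs) {
        long slot = intervalMs / Math.max(1, users.size());
        for (int i = 0; i < users.size(); i++) {
            String name = users.get(i);
            scheduler.scheduleAtFixedRate(() -> reconcile(name),
                    i * slot, intervalMs, TimeUnit.MILLISECONDS);
        }
    }

    private void reconcile(String user) {
        // the per-user AdminClient calls would happen here
    }
}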

@sknot-rh
Member

I tried batching for quotas and I was able to get 340 ready KafkaUsers. For comparison, without batching I can get ~80 ready KafkaUsers.
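
For context, the Kafka Admin API already accepts quota alterations in bulk, so one direction for batching looks roughly like this sketch; this is not the PoC's actual code, and the quota key and value are illustrative:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.quota.ClientQuotaAlteration;
import org.apache.kafka.common.quota.ClientQuotaEntity;

public class QuotaBatcher {
    // one ALTER_CLIENT_QUOTAS request for the whole batch instead of one per user
    static void applyQuotas(Admin admin, List<String> userNames) throws Exception {
        Collection<ClientQuotaAlteration> batch = new ArrayList<>();
        for (String user : userNames) {
            ClientQuotaEntity entity =
                    new ClientQuotaEntity(Map.of(ClientQuotaEntity.USER, user));
            batch.add(new ClientQuotaAlteration(entity, List.of(
                    new ClientQuotaAlteration.Op("producer_byte_rate", 1048576.0))));
        }
        admin.alterClientQuotas(batch).all().get();
    }
}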

@GarimaBathla
Author

@sknot-rh could you share more? How can I batch?

@tombentley for my use case I have almost turned off periodic reconciliation altogether - I set it to 10 years. But if batching works, I can test by turning it back on.

@GarimaBathla
Author

I tested by increasing the Kafka resources (CPU/memory) from the Strimzi default settings to 500m/1Gi, and also increased the Strimzi user operator and topic operator to the values @scholzj listed in the example above. None of these changes made any difference to the amount of time it takes to create the custom resources.

I serialized the creation of KafkaUsers to one at a time, and it turns out the creation of a single KafkaUser takes 6 seconds (6 seconds to reach Ready status); but if I create these users in parallel, say 10 KafkaUsers at a time, a single user can take around 25 seconds.

So this is not a resource issue, but indeed a batching problem, as described above.

And if I turn on reconciliation, the user-operator goes bonkers and is never able to get out of trouble; it is stuck in a timeout-exception loop.

@sknot-rh - could you please advise a path forward? Would it be easy to do the batching of users and give me a new patch of Strimzi?

I truly appreciate all the help you all have provided.

here are the logs:

LOGS when creating single user at a time.

2021-10-13 00:43:49 INFO  OperatorWatcher:38 - Reconciliation #2803(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): KafkaUser sys-ktest11-cogmanager-reader in namespace kafka was ADDED
2021-10-13 00:43:49 INFO  AbstractOperator:219 - Reconciliation #2803(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): KafkaUser sys-ktest11-cogmanager-reader will be checked for creation or modification

2021-10-13 00:43:53 INFO  OperatorWatcher:38 - Reconciliation #2804(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): KafkaUser sys-ktest11-cogmanager-reader in namespace kafka was MODIFIED
2021-10-13 00:43:53 INFO  CrdOperator:113 - Reconciliation #2803(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): Status of KafkaUser sys-ktest11-cogmanager-reader in namespace kafka has been updated
2021-10-13 00:43:53 INFO  AbstractOperator:466 - Reconciliation #2803(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): reconciled
2021-10-13 00:43:53 INFO  AbstractOperator:219 - Reconciliation #2804(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): KafkaUser sys-ktest11-cogmanager-reader will be checked for creation or modification
2021-10-13 00:43:54 INFO  OperatorWatcher:38 - Reconciliation #2805(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): KafkaUser sys-ktest11-pod-agent-writer in namespace kafka was ADDED
2021-10-13 00:43:54 INFO  AbstractOperator:219 - Reconciliation #2805(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): KafkaUser sys-ktest11-pod-agent-writer will be checked for creation or modification
2021-10-13 00:43:57 INFO  AbstractOperator:466 - Reconciliation #2804(watch) KafkaUser(kafka/sys-ktest11-cogmanager-reader): reconciled
2021-10-13 00:44:00 INFO  OperatorWatcher:38 - Reconciliation #2806(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): KafkaUser sys-ktest11-pod-agent-writer in namespace kafka was MODIFIED
2021-10-13 00:44:00 INFO  CrdOperator:113 - Reconciliation #2805(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): Status of KafkaUser sys-ktest11-pod-agent-writer in namespace kafka has been updated
2021-10-13 00:44:00 INFO  AbstractOperator:466 - Reconciliation #2805(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): reconciled
2021-10-13 00:44:00 INFO  AbstractOperator:219 - Reconciliation #2806(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): KafkaUser sys-ktest11-pod-agent-writer will be checked for creation or modification
2021-10-13 00:44:02 INFO  OperatorWatcher:38 - Reconciliation #2807(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): KafkaUser acc-ktest11-password-rotation-writer in namespace kafka was ADDED
2021-10-13 00:44:02 INFO  AbstractOperator:219 - Reconciliation #2807(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): KafkaUser acc-ktest11-password-rotation-writer will be checked for creation or modification
2021-10-13 00:44:04 INFO  AbstractOperator:466 - Reconciliation #2806(watch) KafkaUser(kafka/sys-ktest11-pod-agent-writer): reconciled
2021-10-13 00:44:08 INFO  OperatorWatcher:38 - Reconciliation #2808(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): KafkaUser acc-ktest11-password-rotation-writer in namespace kafka was MODIFIED
2021-10-13 00:44:08 INFO  CrdOperator:113 - Reconciliation #2807(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): Status of KafkaUser acc-ktest11-password-rotation-writer in namespace kafka has been updated
2021-10-13 00:44:08 INFO  AbstractOperator:466 - Reconciliation #2807(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): reconciled
2021-10-13 00:44:08 INFO  AbstractOperator:219 - Reconciliation #2808(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): KafkaUser acc-ktest11-password-rotation-writer will be checked for creation or modification
2021-10-13 00:44:09 INFO  OperatorWatcher:38 - Reconciliation #2809(watch) KafkaUser(kafka/sys-ktest11-password-reader): KafkaUser sys-ktest11-password-reader in namespace kafka was ADDED
2021-10-13 00:44:09 INFO  AbstractOperator:219 - Reconciliation #2809(watch) KafkaUser(kafka/sys-ktest11-password-reader): KafkaUser sys-ktest11-password-reader will be checked for creation or modification
2021-10-13 00:44:12 INFO  AbstractOperator:466 - Reconciliation #2808(watch) KafkaUser(kafka/acc-ktest11-password-rotation-writer): reconciled
2021-10-13 00:44:16 INFO  OperatorWatcher:38 - Reconciliation #2810(watch) KafkaUser(kafka/sys-ktest11-password-reader): KafkaUser sys-ktest11-password-reader in namespace kafka was MODIFIED
2021-10-13 00:44:16 INFO  CrdOperator:113 - Reconciliation #2809(watch) KafkaUser(kafka/sys-ktest11-password-reader): Status of KafkaUser sys-ktest11-password-reader in namespace kafka has been updated
2021-10-13 00:44:16 INFO  AbstractOperator:466 - Reconciliation #2809(watch) KafkaUser(kafka/sys-ktest11-password-reader): reconciled
2021-10-13 00:44:16 INFO  AbstractOperator:219 - Reconciliation #2810(watch) KafkaUser(kafka/sys-ktest11-password-reader): KafkaUser sys-ktest11-password-reader will be checked for creation or modification
2021-10-13 00:44:17 INFO  OperatorWatcher:38 - Reconciliation #2811(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): KafkaUser sys-ktest12-cogmanager-reader in namespace kafka was ADDED
2021-10-13 00:44:17 INFO  AbstractOperator:219 - Reconciliation #2811(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): KafkaUser sys-ktest12-cogmanager-reader will be checked for creation or modification
2021-10-13 00:44:19 INFO  AbstractOperator:466 - Reconciliation #2810(watch) KafkaUser(kafka/sys-ktest11-password-reader): reconciled
2021-10-13 00:44:23 INFO  OperatorWatcher:38 - Reconciliation #2812(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): KafkaUser sys-ktest12-cogmanager-reader in namespace kafka was MODIFIED
2021-10-13 00:44:23 INFO  CrdOperator:113 - Reconciliation #2811(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): Status of KafkaUser sys-ktest12-cogmanager-reader in namespace kafka has been updated
2021-10-13 00:44:23 INFO  AbstractOperator:466 - Reconciliation #2811(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): reconciled
2021-10-13 00:44:23 INFO  AbstractOperator:219 - Reconciliation #2812(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): KafkaUser sys-ktest12-cogmanager-reader will be checked for creation or modification
2021-10-13 00:44:23 INFO  OperatorWatcher:38 - Reconciliation #2813(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): KafkaUser sys-ktest12-pod-agent-writer in namespace kafka was ADDED
2021-10-13 00:44:23 INFO  AbstractOperator:219 - Reconciliation #2813(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): KafkaUser sys-ktest12-pod-agent-writer will be checked for creation or modification
2021-10-13 00:44:26 INFO  AbstractOperator:466 - Reconciliation #2812(watch) KafkaUser(kafka/sys-ktest12-cogmanager-reader): reconciled
2021-10-13 00:44:29 INFO  CrdOperator:113 - Reconciliation #2813(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): Status of KafkaUser sys-ktest12-pod-agent-writer in namespace kafka has been updated
2021-10-13 00:44:29 INFO  OperatorWatcher:38 - Reconciliation #2814(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): KafkaUser sys-ktest12-pod-agent-writer in namespace kafka was MODIFIED
2021-10-13 00:44:29 INFO  AbstractOperator:466 - Reconciliation #2813(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): reconciled
2021-10-13 00:44:29 INFO  AbstractOperator:219 - Reconciliation #2814(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): KafkaUser sys-ktest12-pod-agent-writer will be checked for creation or modification
2021-10-13 00:44:30 INFO  OperatorWatcher:38 - Reconciliation #2815(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): KafkaUser acc-ktest12-password-rotation-writer in namespace kafka was ADDED
2021-10-13 00:44:30 INFO  AbstractOperator:219 - Reconciliation #2815(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): KafkaUser acc-ktest12-password-rotation-writer will be checked for creation or modification
2021-10-13 00:44:35 INFO  AbstractOperator:466 - Reconciliation #2814(watch) KafkaUser(kafka/sys-ktest12-pod-agent-writer): reconciled
2021-10-13 00:44:36 INFO  OperatorWatcher:38 - Reconciliation #2816(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): KafkaUser acc-ktest12-password-rotation-writer in namespace kafka was MODIFIED
2021-10-13 00:44:36 INFO  CrdOperator:113 - Reconciliation #2815(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): Status of KafkaUser acc-ktest12-password-rotation-writer in namespace kafka has been updated
2021-10-13 00:44:36 INFO  AbstractOperator:466 - Reconciliation #2815(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): reconciled
2021-10-13 00:44:36 INFO  AbstractOperator:219 - Reconciliation #2816(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): KafkaUser acc-ktest12-password-rotation-writer will be checked for creation or modification
2021-10-13 00:44:37 INFO  OperatorWatcher:38 - Reconciliation #2817(watch) KafkaUser(kafka/sys-ktest12-password-reader): KafkaUser sys-ktest12-password-reader in namespace kafka was ADDED
2021-10-13 00:44:37 INFO  AbstractOperator:219 - Reconciliation #2817(watch) KafkaUser(kafka/sys-ktest12-password-reader): KafkaUser sys-ktest12-password-reader will be checked for creation or modification
2021-10-13 00:44:39 INFO  AbstractOperator:466 - Reconciliation #2816(watch) KafkaUser(kafka/acc-ktest12-password-rotation-writer): reconciled
2021-10-13 00:44:43 INFO  OperatorWatcher:38 - Reconciliation #2818(watch) KafkaUser(kafka/sys-ktest12-password-reader): KafkaUser sys-ktest12-password-reader in namespace kafka was MODIFIED
2021-10-13 00:44:43 INFO  CrdOperator:113 - Reconciliation #2817(watch) KafkaUser(kafka/sys-ktest12-password-reader): Status of KafkaUser sys-ktest12-password-reader in namespace kafka has been updated
2021-10-13 00:44:43 INFO  AbstractOperator:466 - Reconciliation #2817(watch) KafkaUser(kafka/sys-ktest12-password-reader): reconciled
2021-10-13 00:44:43 INFO  AbstractOperator:219 - Reconciliation #2818(watch) KafkaUser(kafka/sys-ktest12-password-reader): KafkaUser sys-ktest12-password-reader will be checked for creation or modification
2021-10-13 00:44:45 INFO  OperatorWatcher:38 - Reconciliation #2819(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): KafkaUser sys-ktest13-cogmanager-reader in namespace kafka was ADDED
2021-10-13 00:44:45 INFO  AbstractOperator:219 - Reconciliation #2819(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): KafkaUser sys-ktest13-cogmanager-reader will be checked for creation or modification
2021-10-13 00:44:50 INFO  AbstractOperator:466 - Reconciliation #2818(watch) KafkaUser(kafka/sys-ktest12-password-reader): reconciled
2021-10-13 00:44:51 INFO  OperatorWatcher:38 - Reconciliation #2820(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): KafkaUser sys-ktest13-cogmanager-reader in namespace kafka was MODIFIED
2021-10-13 00:44:51 INFO  CrdOperator:113 - Reconciliation #2819(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): Status of KafkaUser sys-ktest13-cogmanager-reader in namespace kafka has been updated
2021-10-13 00:44:51 INFO  AbstractOperator:466 - Reconciliation #2819(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): reconciled
2021-10-13 00:44:51 INFO  AbstractOperator:219 - Reconciliation #2820(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): KafkaUser sys-ktest13-cogmanager-reader will be checked for creation or modification
2021-10-13 00:44:51 INFO  OperatorWatcher:38 - Reconciliation #2821(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): KafkaUser sys-ktest13-pod-agent-writer in namespace kafka was ADDED
2021-10-13 00:44:51 INFO  AbstractOperator:219 - Reconciliation #2821(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): KafkaUser sys-ktest13-pod-agent-writer will be checked for creation or modification
2021-10-13 00:44:56 INFO  AbstractOperator:466 - Reconciliation #2820(watch) KafkaUser(kafka/sys-ktest13-cogmanager-reader): reconciled
2021-10-13 00:44:58 INFO  OperatorWatcher:38 - Reconciliation #2822(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): KafkaUser sys-ktest13-pod-agent-writer in namespace kafka was MODIFIED
2021-10-13 00:44:58 INFO  CrdOperator:113 - Reconciliation #2821(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): Status of KafkaUser sys-ktest13-pod-agent-writer in namespace kafka has been updated
2021-10-13 00:44:58 INFO  AbstractOperator:466 - Reconciliation #2821(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): reconciled
2021-10-13 00:44:58 INFO  AbstractOperator:219 - Reconciliation #2822(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): KafkaUser sys-ktest13-pod-agent-writer will be checked for creation or modification
2021-10-13 00:45:00 INFO  OperatorWatcher:38 - Reconciliation #2823(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): KafkaUser acc-ktest13-password-rotation-writer in namespace kafka was ADDED
2021-10-13 00:45:00 INFO  AbstractOperator:219 - Reconciliation #2823(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): KafkaUser acc-ktest13-password-rotation-writer will be checked for creation or modification
2021-10-13 00:45:01 INFO  AbstractOperator:466 - Reconciliation #2822(watch) KafkaUser(kafka/sys-ktest13-pod-agent-writer): reconciled
2021-10-13 00:45:06 INFO  OperatorWatcher:38 - Reconciliation #2824(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): KafkaUser acc-ktest13-password-rotation-writer in namespace kafka was MODIFIED
2021-10-13 00:45:06 INFO  CrdOperator:113 - Reconciliation #2823(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): Status of KafkaUser acc-ktest13-password-rotation-writer in namespace kafka has been updated
2021-10-13 00:45:06 INFO  AbstractOperator:466 - Reconciliation #2823(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): reconciled
2021-10-13 00:45:06 INFO  AbstractOperator:219 - Reconciliation #2824(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): KafkaUser acc-ktest13-password-rotation-writer will be checked for creation or modification
2021-10-13 00:45:07 INFO  OperatorWatcher:38 - Reconciliation #2825(watch) KafkaUser(kafka/sys-ktest13-password-reader): KafkaUser sys-ktest13-password-reader in namespace kafka was ADDED
2021-10-13 00:45:07 INFO  AbstractOperator:219 - Reconciliation #2825(watch) KafkaUser(kafka/sys-ktest13-password-reader): KafkaUser sys-ktest13-password-reader will be checked for creation or modification
2021-10-13 00:45:10 INFO  AbstractOperator:466 - Reconciliation #2824(watch) KafkaUser(kafka/acc-ktest13-password-rotation-writer): reconciled
2021-10-13 00:45:14 INFO  OperatorWatcher:38 - Reconciliation #2826(watch) KafkaUser(kafka/sys-ktest13-password-reader): KafkaUser sys-ktest13-password-reader in namespace kafka was MODIFIED
2021-10-13 00:45:14 INFO  CrdOperator:113 - Reconciliation #2825(watch) KafkaUser(kafka/sys-ktest13-password-reader): Status of KafkaUser sys-ktest13-password-reader in namespace kafka has been updated
2021-10-13 00:45:14 INFO  AbstractOperator:466 - Reconciliation #2825(watch) KafkaUser(kafka/sys-ktest13-password-reader): reconciled
2021-10-13 00:45:14 INFO  AbstractOperator:219 - Reconciliation #2826(watch) KafkaUser(kafka/sys-ktest13-password-reader): KafkaUser sys-ktest13-password-reader will be checked for creation or modification
2021-10-13 00:45:15 INFO  OperatorWatcher:38 - Reconciliation #2827(watch) KafkaUser(kafka/sys-ktest14-cogmanager-reader): KafkaUser sys-ktest14-cogmanager-reader in namespace kafka was ADDED
2021-10-13 00:45:15 INFO  AbstractOperator:219 - Reconciliation #2827(watch) KafkaUser(kafka/sys-ktest14-cogmanager-reader): KafkaUser sys-ktest14-cogmanager-reader will be checked for creation or modification
2021-10-13 00:45:18 INFO  AbstractOperator:466 - Reconciliation #2826(watch) KafkaUser(kafka/sys-ktest13-password-reader): reconciled

@scholzj
Member

scholzj commented Oct 13, 2021

I tested by increasing the Kafka resources (CPU/memory) from the Strimzi default settings to 500m/1Gi, and also increased the Strimzi user operator and topic operator to the values @scholzj listed in the example above. None of these changes made any difference to the amount of time it takes to create the custom resources.

Well, that was an example of how to configure the resources. As it clearly said there, you need to adjust the actual values. The exact resources there are the ones I use for development, as mentioned, so nowhere near 300 users. That said, I have never had any issues using it with a single-digit number of users.

@sknot-rh
Member

I created a PoC. The merging of the reconciliation results for the TLS and SCRAM user needs to be improved: sknot-rh@8719e87#diff-9c42d639637a67deb0200524f18b23daf141adeddc47f4d8d04145c76ed97995R61

@GarimaBathla
Author

GarimaBathla commented Oct 13, 2021

@scholzj - thanks! Yes, I understood that was just a dev setting. I don't create 300 users at a time, only say 30 at a time. I also changed the way I create users, forcing a single KafkaUser to be created at a time instead of 30 at once, and it takes 6 seconds per user to become Ready (logs were attached). So 6 seconds per user * 30 = 180 seconds ~ 3 minutes. That made me realize it is not a resource issue but a batching issue; if we could create all 30 users within a few seconds, that would be ideal.

I have actually increased the memory of the user-operator to

      reconciliationIntervalSeconds: 157680000
      resources:
        limits:
          cpu: "0.5"
          memory: 1024Mi
        requests:
          cpu: "0.5"
          memory: 1024Mi

but I got the same results as with the default settings.

@sknot-rh - thank you! How can I go about testing this fix in my environment? I use Helm to install Strimzi - can this be shared as a patch via the Helm repo?

@GarimaBathla
Author

@sknot-rh - if a patch release is not possible, could you please share when this can be fixed in Strimzi?

@yxiang92128

yxiang92128 commented Nov 10, 2021

To add a data point: I created 1100 users with a 30-second sleep in between in a script, and only 14 of them reached "Ready" status; the rest, after an overnight wait, were still stuck in "NotReady" status.
"status": {
"conditions": [
{
"lastTransitionTime": "2021-11-10T00:45:53.514612Z",
"message": "Call(callName=alterUserScramCredentials, deadlineMs=1636505138052, tries=1, nextAllowedTryMs=1636505138209) timed out at 1636505138109 after 1 attempt(s)",
"reason": "TimeoutException",
"status": "True",
"type": "NotReady"
}
],

The question is that there seems to be a "TimeoutException" occurring. Is there a config parameter we can adjust to get the new users unstuck and into "Ready" status?

There seems to be a limit on the number of users that can be created, beyond which the broker starts reporting TimeoutException for newly created users. I think I am seeing issues starting from 200 or so users.

Thanks.
@GarimaBathla @sknot-rh

@sknot-rh
Member

Yeah, creating the users with some delay does not help, IMO. The issue is that the AdminClient sends numberOfUsers*5 requests (200 users means 1000 requests) to the brokers on each reconciliation. So beyond a certain number of users, the brokers are basically DDoS-ed and only a few of the users get reconciled and thus become Ready in the status.
With my patch above I was able to get 400 users ready; however, we need to devise a better way to batch the users' reconciliation.

@vutkin

vutkin commented Feb 1, 2022

Hi all, is there any solution? I have the same problem with only 36 users. What is going on?
Strimzi operator version: 0.25.0

@GarimaBathla
Author

@sknot-rh @scholzj -

I am still struggling with this problem. I have implemented some workarounds in my code so that I don't do any parallelization and create one KafkaUser at a time, and even then it can sometimes take 40-50 seconds for a KafkaUser to reach Ready status.

I noticed today that for each KafkaUser the user-operator makes a few calls, and one of them is describeClientQuotas - in my case this is a redundant call, as I have not enabled any quotas at all.

Is it possible for the user-operator to not make some of these calls every time?

2022-02-19 13:06:37 DEBUG KafkaAdminClient:815 - [AdminClient clientId=adminclient-1] Call(callName=describeClientQuotas, deadlineMs=1645275997049, tries=2, nextAllowedTryMs=1645275997166) timed out at 1645275997066 after 2 attempt(s)
java.lang.Exception: DisconnectException: Cancelled describeClientQuotas request with correlation id 5374 due to node 0 being disconnected
at org.apache.kafka.clients.admin.KafkaAdminClient$Call.failWithTimeout(KafkaAdminClient.java:816) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
at org.apache.kafka.clients.admin.KafkaAdminClient$Call.fail(KafkaAdminClient.java:789) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.handleResponses(KafkaAdminClient.java:1183) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.processRequests(KafkaAdminClient.java:1341) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
at org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.run(KafkaAdminClient.java:1264) [org.apache.kafka.kafka-clients-2.8.0.jar:?]
at java.lang.Thread.run(Thread.java:829) [?:?]

@scholzj
Member

scholzj commented Feb 19, 2022

When your user has no quotas set, the operator has to make sure there are no quotas configured for it in Kafka either. So these calls are required.

@GarimaBathla
Author

GarimaBathla commented Feb 19, 2022

Well, when the operator is making sure no quotas are configured in Kafka either, there is an assumption that changes are made to Kafka without the operator's knowledge. The KafkaUser Kubernetes custom resource should be enough for the user-operator to find out whether there is any quota, and only if one is configured make the describeClientQuotas call.

Does that not sound right?

I am impacted because in my case it is an unnecessary call being made, slowing things down and having a cascading effect on future KafkaUsers.

@scholzj
Member

scholzj commented Feb 19, 2022

there is an assumption that changes are made to Kafka without the operator's knowledge

Correct, that assumption is there and it is a valid assumption. The operator has to ensure that such changes do not persist.

But it is not the only reason for this call. The operator itself is stateless -> it depends only on the state of the custom resources. So it does not know whether the user never had quotas or whether the quotas were just removed.

So no, it is not an unnecessary call.

@GarimaBathla
Author

I see - the custom resource might have changed (current version vs. previous version), so the operator is checking against Kafka.

I wish this bug would be resolved soon. Is there any timeline you can share? It is really slowing down the system; my latest calls to create 2 KafkaUsers took 2 minutes.

@scholzj
Member

scholzj commented Feb 19, 2022

I don't think we have any timeline. But it is on our TODO list.

@GarimaBathla
Author

GarimaBathla commented Feb 21, 2022

@scholzj, @sknot-rh, @tombentley -

I am looking at the user-operator logs and the code, and I see the behavior below:

There are 2 sets of duplicate calls being made to the Kafka broker via the admin client when a KafkaUser is added to the system.

KafkaUserOperator.createOrUpdate:
1. describeClientQuotas - invoked for both the TLS user and the SCRAM user: the kafkaUserQuotasOperator.reconcile method is invoked twice, and the 2 executions independently check whether the user has any quotas via the exists method inside KafkaUserQuotasOperator.java:

boolean exists(Reconciliation reconciliation, String username) throws Exception {
    return describeUserQuotas(reconciliation, username) != null;
}

kafkaUserQuotasOperator.reconcile(reconciliation, KafkaUserModel.getTlsUserName(userName), finalTlsQuotas),
kafkaUserQuotasOperator.reconcile(reconciliation, KafkaUserModel.getScramUserName(userName), finalScramOrNoneQuotas))),

2. describeAcls - invoked for both the TLS user and the SCRAM user, though a user can only be TLS or SCRAM, not both. Inside the SimpleAclOperator.reconcile method, both executions invoke describeAcls.

And I think this adds up to the 4 calls.

I don't know why the AdminClient is not able to connect to the Kafka brokers and times out, but if the same calls are made with the exact same parameters, I think they can be merged.
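
One hedged sketch of the merge idea (illustrative only, not the operator's code): fetch all ACL bindings in a single DESCRIBE_ACLS round-trip and partition them by principal on the operator side, instead of one broker call per principal.

import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclBindingFilter;

public class MergedAclFetch {
    // one DESCRIBE_ACLS round-trip for everything, partitioned locally,
    // instead of one broker call per principal per user
    static Map<String, List<AclBinding>> aclsByPrincipal(Admin admin) throws Exception {
        Collection<AclBinding> all = admin.describeAcls(AclBindingFilter.ANY).values().get();
        return all.stream()
                .collect(Collectors.groupingBy(binding -> binding.entry().principal()));
    }
}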

@GarimaBathla
Author

@scholzj - is there a way for the user-operator configuration to override this value?

p.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "10000");

AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG is currently set to 10 seconds.
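
For reference, this is how request.timeout.ms is set when building a standalone Admin client; as the next reply notes, the User Operator did not expose this setting at the time, and the bootstrap address here is illustrative:

import java.util.Properties;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class AdminTimeout {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "pod-kafka-cluster-kafka-bootstrap:9092");
        // raise the per-request deadline beyond the 10 s discussed above
        p.setProperty(AdminClientConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        try (Admin admin = Admin.create(p)) {
            // issue describeAcls/describeClientQuotas calls here
        }
    }
}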

@scholzj
Member

scholzj commented Feb 21, 2022

I don't think you can configure the Admin client. As for the calls for TLS / SCRAM users, this is needed to handle changes to the authentication type.

@GarimaBathla
Author

GarimaBathla commented Feb 21, 2022

Regarding the duplicate calls: both reconciliations make 2 describe calls each. Although they handle 2 different auth methods, the calls are essentially the same in 2 different paths, adding 2 extra calls. They could be merged so that the reconciliations of quotas and ACLs for TLS and SCRAM auth make a single joint call and reuse the results.

@scholzj
Member

scholzj commented Feb 21, 2022

They are not the same calls if they are for different users or for a different type of resource, are they?

@GarimaBathla
Author

2022-02-21 06:45:36 DEBUG NetworkClient:522 - [AdminClient clientId=adminclient-1] Sending DESCRIBE_ACLS request with header RequestHeader(apiKey=DESCRIBE_ACLS, apiVersion=2, clientId=adminclient-1, correlationId=84) and timeout 10000 to node 1: DescribeAclsRequestData(resourceTypeFilter=1, resourceNameFilter=null, patternTypeFilter=1, principalFilter='User:CN=sys-gbtest1242-password-reader', hostFilter=null, operation=1, permissionType=1)

2022-02-21 06:45:36 DEBUG NetworkClient:522 - [AdminClient clientId=adminclient-1] Sending DESCRIBE_ACLS request with header RequestHeader(apiKey=DESCRIBE_ACLS, apiVersion=2, clientId=adminclient-1, correlationId=85) and timeout 10000 to node 1: DescribeAclsRequestData(resourceTypeFilter=1, resourceNameFilter=null, patternTypeFilter=1, principalFilter='User:sys-gbtest1242-password-reader', hostFilter=null, operation=1, permissionType=1)

Hi Jakub - you are right; these calls seem to be the same, but they are actually not:

the principal filter is different in the 2 calls, principalFilter='User:sys-gbtest1242-password-reader' vs principalFilter='User:CN=sys-gbtest1242-password-reader'.

In the Kafka configuration, is it possible to have TLS auth as well as SCRAM auth via Strimzi?

@scholzj
Member

scholzj commented Feb 21, 2022

In the Kafka configuration, is it possible to have TLS auth as well as SCRAM auth via Strimzi?

Yes, it is.

@GarimaBathla
Author

@scholzj - I wanted to update here that this is actually no longer a major issue for us.

We did some investigation on our side, and what we found was that all the Kafka brokers were running on a single worker node; as a result the CPU utilization shot up to 99%, which resulted in extreme slowness of the user operator over time.

This is resolved by configuring pod anti-affinity inside the Kafka configuration, like below:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
spec:
  kafka:
    # ...
    template:
      pod:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                    - key: strimzi.io/name
                      operator: In
                      values:
                        - CLUSTER-NAME-kafka
                topologyKey: "kubernetes.io/hostname"

I am guessing this is the root cause for every repro out there, and pod anti-affinity needs to be set. It is important for the Kafka brokers to run on different nodes.

thank you to you and your team for all your help.

@GarimaBathla
Author

Oh, and now I can create 100 users in 2 minutes, though I still make sure to create a single user at a time, still don't do any parallel processing, and have turned off periodic reconciliation, as in my case I never change the users.

@ctron

ctron commented Jun 15, 2022

As already mentioned here: #6944 (comment)

Peeking at the source code of the operator, it looks like all reconciliation requests are started as futures and then processed in parallel, which might cause a huge spike in requests and maybe some overload.

It looks to me like the reconciliation process should limit the number of concurrent reconcile operations, or finish them one by one.

It might take longer, but it would be more stable and not fail once the number of resources/users gets "too big".
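
A minimal sketch of that suggestion, assuming a hypothetical per-user reconcile() rather than the operator's actual code: cap the number of in-flight reconciliations with a fixed-size pool, so bursts queue up instead of hammering the brokers.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BoundedReconciler {
    public void reconcileAll(List<String> users) throws InterruptedException {
        // at most 4 reconciliations in flight; the rest wait in the queue
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (String user : users) {
            pool.submit(() -> reconcile(user));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
    }

    private void reconcile(String user) {
        // the per-user AdminClient calls would happen here
    }
}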

ctron added a commit to ctron/strimzi-kafka-operator that referenced this issue Jun 15, 2022
This change performs the full/periodic reconciliation
of the user resources sequentially. This prevents the
operator from spiking a lot of requests and constantly
failing user resources.

closes strimzi#5691

Signed-off-by: Jens Reimann <jreimann@redhat.com>
ctron added a commit to ctron/strimzi-kafka-operator that referenced this issue Jun 17, 2022
This change performs the full/periodic reconciliation
of the user resources sequentially. This prevents the
operator from spiking a lot of requests and constantly
failing user resources.

closes strimzi#5691

Signed-off-by: Jens Reimann <jreimann@redhat.com>
@ctron

ctron commented Jun 17, 2022

I published an image which contains the fix from #6945 on quay.io: https://quay.io/repository/ctronstrimzi/operator?tab=tags

You can activate this for Strimzi 0.28 by tweaking the Kafka object:

spec:
  entityOperator:
    userOperator:
      image: 'quay.io/ctronstrimzi/operator:0.28.0-1'

So far, this has fixed the issue for us for the periodic reconciliations (not for the event-based reconciliations triggered by the watches when resources are (re-)created).

@scholzj
Member

scholzj commented Aug 2, 2022

Triaged on 2.8.2022: other issues caused by this and closed as duplicates are #5887 and #5790. We are now working on this, and some fixes might be part of the Strimzi 0.31 release.

@scholzj
Member

scholzj commented Sep 8, 2022

I made some progress on this in https://github.com/scholzj/strimzi-kafka-operator/tree/refactor-user-operator-to-follow-the-controller-model, but it did not make it into 0.31. Hopefully it will for 0.32.

scholzj added a commit to scholzj/strimzi-kafka-operator that referenced this issue Nov 15, 2022
@scholzj
Member

scholzj commented Nov 16, 2022

This should be fixed in Strimzi 0.33. The expected release date is currently around the second half of December or early January.

@chary1112004

We faced the same issue in Strimzi 0.34.0, even though the number of users is small (< 20 users):

java.util.concurrent.CompletionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: alterUserScramCredentials

@scholzj
Member

scholzj commented Nov 14, 2023

@chary1112004 You should probably start a new discussion for it (https://github.com/strimzi/strimzi-kafka-operator/discussions) where you share your full configurations and logs.

@chary1112004

@scholzj thank you. I created a new discussion at https://github.com/orgs/strimzi/discussions/9346
