
Performance Issue with SCRAM-SHA Credentials Cache at 4800+ KafkaUsers #10021

Closed
see-quick opened this issue Apr 24, 2024 · 5 comments

@see-quick
Member

Related problem

The SCRAM-SHA credentials cache in the User Operator fails to maintain readiness under high user load scenarios. This issue becomes apparent when the number of Kafka users managed by the User Operator approaches 4800. At this point, the system throws a RuntimeException indicating that the SCRAM-SHA Credentials Cache is not ready, which prevents the User Operator from processing further user reconciliations effectively.

Suggested solution

I would like to enhance the cache initialization and error-handling mechanisms to ensure that the cache can handle high loads without becoming unready. Possible improvements could include:

  • Implementing an exponential back-off retry mechanism during cache loading to handle transient failures more gracefully (see the sketch after this list).
  • Adjusting the refresh interval dynamically based on system load and error rates to ensure the cache remains up-to-date without overwhelming the system.
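
A minimal sketch of the back-off idea, assuming a hypothetical loadCache() call that performs one full load of the SCRAM-SHA credentials (this is illustrative only, not the actual User Operator code):

import java.util.concurrent.TimeUnit;

// Illustrative only: the class name, loadCache() and the constants are
// hypothetical placeholders, not the real User Operator API.
final class ScramCacheLoader {
    private static final int MAX_ATTEMPTS = 5;
    private static final long INITIAL_BACKOFF_MS = 500;

    void loadWithBackoff() throws InterruptedException {
        long backoffMs = INITIAL_BACKOFF_MS;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                loadCache();     // one full load of the SCRAM-SHA credentials
                return;          // cache is ready
            } catch (RuntimeException e) {
                if (attempt == MAX_ATTEMPTS) {
                    throw e;     // give up and surface the original failure
                }
                TimeUnit.MILLISECONDS.sleep(backoffMs);
                backoffMs *= 2;  // exponential back-off before the next attempt
            }
        }
    }

    private void loadCache() {
        // placeholder for the real cache-loading call against the Kafka Admin API
    }
}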

Alternatives

No response

Additional context

  • The problem typically manifests at around 4800 users but may vary slightly based on cluster performance and specific configurations.
  • The current caching mechanism is critical for reducing Kafka broker load by minimizing unnecessary calls to the Kafka Admin API.
  • This issue has been observed consistently in environments with high user churn and in stress-testing scenarios designed to simulate peak loads.
  • Moreover, changes to the UO configuration do not have any impact on how many KafkaUsers the UO can handle, as the following results show:
Use Case: capacityDefaultUseCase
+------------+--------------------------+---------------------+---------------------------------+-------------------------------+----------------------+------------------------------+---------------------------------+--------------------------------------+------------------------------------+-------------------------------------+----------------------------------+
| Experiment | IN: OPERATION_TIMEOUT_MS | IN: WORK_QUEUE_SIZE | IN: CONTROLLER_THREAD_POOL_SIZE | IN: CACHE_REFRESH_INTERVAL_MS | IN: BATCH_QUEUE_SIZE | IN: BATCH_MAXIMUM_BLOCK_SIZE | IN: BATCH_MAXIMUM_BLOCK_TIME_MS | IN: USER_OPERATIONS_THREAD_POOL_SIZE | OUT: Successful KafkaUsers Created | jvm_memory_used_megabytes_total.txt | system_load_average_per_core.txt |
| 1          | 300000                   | 10000               | 50                              | 15000                         | 1024                 | 100                          | 100                             | 50                                   | 4800                               | 716.524856                          | 0.23875                          |
| 2          | 300000                   | 10000               | 100                             | 20000                         | 2048                 | 200                          | 50                              | 100                                  | 4800                               | 877.894032                          | 0.2825                           |
| 3          | 300000                   | 10000               | 75                              | 15000                         | 1500                 | 150                          | 75                              | 75                                   | 4800                               | 548.884896                          | 0.2825                           |
| 4          | 300000                   | 10000               | 100                             | 30000                         | 4096                 | 300                          | 100                             | 100                                  | 4800                               | 638.626608                          | 0.243125                         |
| 5          | 300000                   | 10000               | 50                              | 10000                         | 512                  | 50                           | 25                              | 50                                   | 4800                               | 725.024336                          | 0.2375                           |
+------------+--------------------------+---------------------+---------------------------------+-------------------------------+----------------------+------------------------------+---------------------------------+--------------------------------------+------------------------------------+-------------------------------------+----------------------------------+

Logs and error messages related to the issue: [1]

[1] - https://gist.github.com/see-quick/d85dece01a711b5be9f2507a30e0124b

@scholzj
Member

scholzj commented Apr 24, 2024

Please share the full log and not just a snippet.

@see-quick
Member Author

see-quick commented Apr 24, 2024

Give me a moment, I will share more logs :)

Update: Here is the log from the UserOperator container [1]

[1] - https://drive.google.com/file/d/1juNl65yfXlQankK_t7eQttvoCopyjqri/view?usp=sharing

@see-quick
Member Author

see-quick commented Apr 24, 2024

So based on the suggestion from @scholzj (i.e., to try increasing the length of the username), we found out that the problem might be a LIMIT on the size of the Kafka Admin API request. We found out that for KafkaUsers:

  • with a length of 22 chars, it can handle over 4800 users
  • with a length of 44 chars, it can handle over 4100 users
    ....

An interesting fact is that the relationship is not linear (i.e., 44-char names still allow more than 2400 users to be created), so there might be some compression involved....
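
As a quick back-of-the-envelope check of that non-linearity (the observed numbers are from above; the linear model itself is just an assumption for illustration):

// If the failing request grew linearly with the total username bytes,
// doubling the name length from 22 to 44 chars should roughly halve the
// number of users that fit, i.e. ~2400 instead of the observed ~4100.
public class LinearityCheck {
    public static void main(String[] args) {
        int usersAt22Chars = 4800;                          // observed limit with 22-char names
        int predictedAt44Chars = usersAt22Chars * 22 / 44;  // 2400 under a purely linear model
        int observedAt44Chars = 4100;                       // observed limit with 44-char names
        System.out.printf("linear model predicts %d, observed %d%n", predictedAt44Chars, observedAt44Chars);
    }
}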

Here is another log from the User Operator, but with 44-char usernames [1].

The following can be seen in the ZK pods, which could be related to large requests in the ZK <-> Kafka communication:

2024-04-24 14:58:18,783 WARN Closing connection to /10.128.15.132:59788 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-7-4]
java.io.IOException: Len error 1048654
	at org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:521)
	at org.apache.zookeeper.server.NettyServerCnxn.processQueuedBuffer(NettyServerCnxn.java:405)
	at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.userEventTriggered(NettyServerCnxnFactory.java:324)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:398)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:376)
	at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:368)
	at io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
	at io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:387)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:400)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:376)
	at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:368)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:396)
	at io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:61)
	at io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:381)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-24 14:58:18,783 DEBUG close called for session id: 0x30049a59a2
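
If I read the rejected length correctly, 1048654 is just above ZooKeeper's default jute.maxbuffer of 0xfffff (1048575) bytes, which would explain why the connection is closed. Roughly, the server-side check behaves like this (an illustration, not the exact NettyServerCnxn code):

import java.io.IOException;

// Illustration of ZooKeeper's packet-length check: packets whose length field
// exceeds jute.maxbuffer (default 0xfffff = 1048575 bytes) are rejected with
// "Len error <len>" and the connection is closed.
public class LenErrorCheck {
    public static void main(String[] args) throws IOException {
        int juteMaxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff); // 1048575 by default
        int packetLen = 1048654;                                           // value from the log above
        if (packetLen < 0 || packetLen > juteMaxBuffer) {
            throw new IOException("Len error " + packetLen);
        }
    }
}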

and in Kafka logs one could see:

2024-04-24 15:09:18,292 INFO channel is told closing (org.apache.zookeeper.ClientCnxnSocketNetty) [nioEventLoopGroup-3-1]
2024-04-24 15:09:18,292 WARN Session 0x20049a87f010001 for server cluster-f32bb7bf-zookeeper-client/172.30.106.133:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn) [main-SendThread(cluster-f32bb7bf-zookeeper-client:2181)]
EndOfStreamException: channel for sessionid 0x20049a87f010001 is lost
	at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:286)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [/kafka-acl-changes-event-process-thread]
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Connected. (kafka.zookeeper.ZooKeeperClient) [/kafka-acl-changes-event-process-thread]
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [data-plane-kafka-request-handler-7]
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Connected. (kafka.zookeeper.ZooKeeperClient) [data-plane-kafka-request-handler-7]

but the Pods are up and in the Ready state:

cluster-f32bb7bf-b-c2961f8f-0                       1/1     Running   0               50m
cluster-f32bb7bf-b-c2961f8f-1                       1/1     Running   0               50m
cluster-f32bb7bf-b-c2961f8f-2                       1/1     Running   0               50m
cluster-f32bb7bf-entity-operator-76f48ccf76-khfp8   2/2     Running   0               50m
cluster-f32bb7bf-kafka-exporter-8b5f8586-rx9zm      1/1     Running   0               50m
cluster-f32bb7bf-scraper-8549f7f979-lpz5x           1/1     Running   0               51m
cluster-f32bb7bf-zookeeper-0                        1/1     Running   0               51m
cluster-f32bb7bf-zookeeper-1                        1/1     Running   0               51m
cluster-f32bb7bf-zookeeper-2                        1/1     Running   0               51m
...

[1] - https://drive.google.com/file/d/1kAj27sWi79kGs96i3vCWegpfhrz_cBJV/view?usp=sharing

@see-quick
Member Author

I will not be able to attend today's triage, but to summarize: this problem occurs in ZK-based clusters. On KRaft clusters the limit is around 19,300 KafkaUsers [1] with an identical environment and machine flavour.

[1] - #10018

@scholzj
Member

scholzj commented May 2, 2024

Triaged on the community call on 2.5.2024: There might be some ZooKeeper tuning that could help. But given that ZooKeeper is being removed, it does not seem worth a deeper investigation or changes in Kafka / ZooKeeper, should any be needed. This should be closed.

@scholzj closed this as not planned on May 2, 2024