
Performance Issue with SCRAM-SHA Credentials Cache at 4800+ KafkaUsers #10021

Closed
see-quick opened this issue Apr 24, 2024 · 5 comments

@see-quick
Member

Related problem

The SCRAM-SHA credentials cache in the User Operator fails to maintain readiness under high user load scenarios. This issue becomes apparent when the number of Kafka users managed by the User Operator approaches 4800. At this point, the system throws a RuntimeException indicating that the SCRAM-SHA Credentials Cache is not ready, which prevents the User Operator from processing further user reconciliations effectively.

Suggested solution

I would like to enhance the cache initialization and error-handling mechanisms to ensure that the cache can handle high loads without becoming unready. Possible improvements could include:

  • Implementing an exponential back-off retry mechanism during cache loading to handle transient failures more gracefully (see the sketch after this list).
  • Adjusting the refresh interval dynamically based on system load and error rates to ensure the cache remains up-to-date without overwhelming the system.
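
A minimal sketch of the back-off idea, assuming a hypothetical loadCache() call that performs one full load of the SCRAM-SHA credentials (this is illustrative only, not the actual User Operator code):

import java.util.concurrent.TimeUnit;

// Illustrative only: the class name, loadCache() and the constants are
// hypothetical placeholders, not the real User Operator API.
final class ScramCacheLoader {
    private static final int MAX_ATTEMPTS = 5;
    private static final long INITIAL_BACKOFF_MS = 500;

    void loadWithBackoff() throws InterruptedException {
        long backoffMs = INITIAL_BACKOFF_MS;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                loadCache();     // one full load of the SCRAM-SHA credentials
                return;          // cache is ready
            } catch (RuntimeException e) {
                if (attempt == MAX_ATTEMPTS) {
                    throw e;     // give up and surface the original failure
                }
                TimeUnit.MILLISECONDS.sleep(backoffMs);
                backoffMs *= 2;  // exponential back-off before the next attempt
            }
        }
    }

    private void loadCache() {
        // placeholder for the real cache-loading call against the Kafka Admin API
    }
}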

Alternatives

No response

Additional context

  • The problem typically manifests at around 4800 users but may vary slightly based on cluster performance and specific configurations.
  • The current caching mechanism is critical for reducing Kafka broker load by minimizing unnecessary calls to the Kafka Admin API.
  • This issue has been observed consistently in environments with high user churn and in stress-testing scenarios designed to simulate peak loads.
  • Moreover, changes to the UO configuration do not have any impact on how many KafkaUsers the UO can handle, as the following results show:
Use Case: capacityDefaultUseCase
+------------+--------------------------+---------------------+---------------------------------+-------------------------------+----------------------+------------------------------+---------------------------------+--------------------------------------+------------------------------------+-------------------------------------+----------------------------------+
| Experiment | IN: OPERATION_TIMEOUT_MS | IN: WORK_QUEUE_SIZE | IN: CONTROLLER_THREAD_POOL_SIZE | IN: CACHE_REFRESH_INTERVAL_MS | IN: BATCH_QUEUE_SIZE | IN: BATCH_MAXIMUM_BLOCK_SIZE | IN: BATCH_MAXIMUM_BLOCK_TIME_MS | IN: USER_OPERATIONS_THREAD_POOL_SIZE | OUT: Successful KafkaUsers Created | jvm_memory_used_megabytes_total.txt | system_load_average_per_core.txt |
| 1          | 300000                   | 10000               | 50                              | 15000                         | 1024                 | 100                          | 100                             | 50                                   | 4800                               | 716.524856                          | 0.23875                          |
| 2          | 300000                   | 10000               | 100                             | 20000                         | 2048                 | 200                          | 50                              | 100                                  | 4800                               | 877.894032                          | 0.2825                           |
| 3          | 300000                   | 10000               | 75                              | 15000                         | 1500                 | 150                          | 75                              | 75                                   | 4800                               | 548.884896                          | 0.2825                           |
| 4          | 300000                   | 10000               | 100                             | 30000                         | 4096                 | 300                          | 100                             | 100                                  | 4800                               | 638.626608                          | 0.243125                         |
| 5          | 300000                   | 10000               | 50                              | 10000                         | 512                  | 50                           | 25                              | 50                                   | 4800                               | 725.024336                          | 0.2375                           |
+------------+--------------------------+---------------------+---------------------------------+-------------------------------+----------------------+------------------------------+---------------------------------+--------------------------------------+------------------------------------+-------------------------------------+----------------------------------+

Logs and error messages related to the issue: [1]

[1] - https://gist.github.com/see-quick/d85dece01a711b5be9f2507a30e0124b

@scholzj
Member

scholzj commented Apr 24, 2024

Please share the full log and not just a snippet.

@see-quick
Member Author

see-quick commented Apr 24, 2024

Give me a moment, I will share more logs :)

Update: Here is the log from the UserOperator container [1]

[1] - https://drive.google.com/file/d/1juNl65yfXlQankK_t7eQttvoCopyjqri/view?usp=sharing

@see-quick
Member Author

see-quick commented Apr 24, 2024

So based on the suggestion from @scholzj (i.e., to try increasing the length of the username), we found out that the problem might be a LIMIT on the size of the Kafka Admin API request. We found out that for KafkaUsers:

  • with a length of 22 chars, it can handle over 4800 users
  • with a length of 44 chars, it can handle over 4100 users
    ....

An interesting fact is that the relationship is not linear (i.e., 44-char names still allow more than 2400 users to be created), so there might be some compression involved....
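
As a quick back-of-the-envelope check of that non-linearity (the observed numbers are from above; the linear model itself is just an assumption for illustration):

// If the failing request grew linearly with the total username bytes,
// doubling the name length from 22 to 44 chars should roughly halve the
// number of users that fit, i.e. ~2400 instead of the observed ~4100.
public class LinearityCheck {
    public static void main(String[] args) {
        int usersAt22Chars = 4800;                          // observed limit with 22-char names
        int predictedAt44Chars = usersAt22Chars * 22 / 44;  // 2400 under a purely linear model
        int observedAt44Chars = 4100;                       // observed limit with 44-char names
        System.out.printf("linear model predicts %d, observed %d%n", predictedAt44Chars, observedAt44Chars);
    }
}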

Here is another log from the User Operator, but with 44-char usernames [1].

The following can be seen in the ZK pods, which could be related to large requests in the ZK <-> Kafka communication:

2024-04-24 14:58:18,783 WARN Closing connection to /10.128.15.132:59788 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-7-4]
java.io.IOException: Len error 1048654
	at org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:521)
	at org.apache.zookeeper.server.NettyServerCnxn.processQueuedBuffer(NettyServerCnxn.java:405)
	at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.userEventTriggered(NettyServerCnxnFactory.java:324)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:398)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:376)
	at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:368)
	at io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
	at io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:387)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:400)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:376)
	at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:368)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
	at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:396)
	at io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:61)
	at io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:381)
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-24 14:58:18,783 DEBUG close called for session id: 0x30049a59a2
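
If I read the rejected length correctly, 1048654 is just above ZooKeeper's default jute.maxbuffer of 0xfffff (1048575) bytes, which would explain why the connection is closed. Roughly, the server-side check behaves like this (an illustration, not the exact NettyServerCnxn code):

import java.io.IOException;

// Illustration of ZooKeeper's packet-length check: packets whose length field
// exceeds jute.maxbuffer (default 0xfffff = 1048575 bytes) are rejected with
// "Len error <len>" and the connection is closed.
public class LenErrorCheck {
    public static void main(String[] args) throws IOException {
        int juteMaxBuffer = Integer.getInteger("jute.maxbuffer", 0xfffff); // 1048575 by default
        int packetLen = 1048654;                                           // value from the log above
        if (packetLen < 0 || packetLen > juteMaxBuffer) {
            throw new IOException("Len error " + packetLen);
        }
    }
}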

and in Kafka logs one could see:

2024-04-24 15:09:18,292 INFO channel is told closing (org.apache.zookeeper.ClientCnxnSocketNetty) [nioEventLoopGroup-3-1]
2024-04-24 15:09:18,292 WARN Session 0x20049a87f010001 for server cluster-f32bb7bf-zookeeper-client/172.30.106.133:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. (org.apache.zookeeper.ClientCnxn) [main-SendThread(cluster-f32bb7bf-zookeeper-client:2181)]
EndOfStreamException: channel for sessionid 0x20049a87f010001 is lost
	at org.apache.zookeeper.ClientCnxnSocketNetty.doTransport(ClientCnxnSocketNetty.java:286)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1289)
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [/kafka-acl-changes-event-process-thread]
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Connected. (kafka.zookeeper.ZooKeeperClient) [/kafka-acl-changes-event-process-thread]
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Waiting until connected. (kafka.zookeeper.ZooKeeperClient) [data-plane-kafka-request-handler-7]
2024-04-24 15:09:18,293 INFO [ZooKeeperClient ACL authorizer] Connected. (kafka.zookeeper.ZooKeeperClient) [data-plane-kafka-request-handler-7]

but the Pods are up and in the Ready state:

cluster-f32bb7bf-b-c2961f8f-0                       1/1     Running   0               50m
cluster-f32bb7bf-b-c2961f8f-1                       1/1     Running   0               50m
cluster-f32bb7bf-b-c2961f8f-2                       1/1     Running   0               50m
cluster-f32bb7bf-entity-operator-76f48ccf76-khfp8   2/2     Running   0               50m
cluster-f32bb7bf-kafka-exporter-8b5f8586-rx9zm      1/1     Running   0               50m
cluster-f32bb7bf-scraper-8549f7f979-lpz5x           1/1     Running   0               51m
cluster-f32bb7bf-zookeeper-0                        1/1     Running   0               51m
cluster-f32bb7bf-zookeeper-1                        1/1     Running   0               51m
cluster-f32bb7bf-zookeeper-2                        1/1     Running   0               51m
...

[1] - https://drive.google.com/file/d/1kAj27sWi79kGs96i3vCWegpfhrz_cBJV/view?usp=sharing

@see-quick
Member Author

I will not be able to attend today's triage, but to summarize: this problem occurs in ZK-based clusters. On KRaft clusters the limit is around 19,300 KafkaUsers [1] with an identical environment and machine flavour.

[1] - #10018

@scholzj
Member

scholzj commented May 2, 2024

Triaged on the community call on 2.5.2024: There might be some ZooKeeper tuning that could help. But given that ZooKeeper is being removed, it does not seem worth a deeper investigation or changes in Kafka / ZooKeeper, should any be needed. This should be closed.

@scholzj closed this as not planned on May 2, 2024