Performance Issue with SCRAM-SHA Credentials Cache at 4800+ KafkaUsers #10021
Comments
Please share the full log and not just a snippet.
Give me a moment, I will share more logs :)
Update: Here is the log from the UserOperator container [1]
[1] - https://drive.google.com/file/d/1juNl65yfXlQankK_t7eQttvoCopyjqri/view?usp=sharing
So based on the suggestion from @scholzj (i.e., to try increasing the length of the username), we found out that it might be a problem with a size LIMIT on the ZooKeeper side.

An interesting fact is that it's not linear (i.e., 44 chars does allow more than 2400 users to be created), so there might be some compression involved. Here is another log from the User Operator, but with 44-char usernames [1]. The following can be found in the ZK pods, which could be related to a large request in the ZK <-> Kafka communication:

2024-04-24 14:58:18,783 WARN Closing connection to /10.128.15.132:59788 (org.apache.zookeeper.server.NettyServerCnxn) [nioEventLoopGroup-7-4]
java.io.IOException: Len error 1048654
at org.apache.zookeeper.server.NettyServerCnxn.receiveMessage(NettyServerCnxn.java:521)
at org.apache.zookeeper.server.NettyServerCnxn.processQueuedBuffer(NettyServerCnxn.java:405)
at org.apache.zookeeper.server.NettyServerCnxnFactory$CnxnChannelHandler.userEventTriggered(NettyServerCnxnFactory.java:324)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:398)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:376)
at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:368)
at io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:117)
at io.netty.handler.codec.ByteToMessageDecoder.userEventTriggered(ByteToMessageDecoder.java:387)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:400)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:376)
at io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:368)
at io.netty.channel.DefaultChannelPipeline$HeadContext.userEventTriggered(DefaultChannelPipeline.java:1428)
at io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:396)
at io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:61)
at io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:381)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-24 14:58:18,783 DEBUG close called for session id: 0x30049a59a2

One could see related messages in the Kafka logs as well, but the Pods are up and in Ready state:

cluster-f32bb7bf-b-c2961f8f-0 1/1 Running 0 50m
cluster-f32bb7bf-b-c2961f8f-1 1/1 Running 0 50m
cluster-f32bb7bf-b-c2961f8f-2 1/1 Running 0 50m
cluster-f32bb7bf-entity-operator-76f48ccf76-khfp8 2/2 Running 0 50m
cluster-f32bb7bf-kafka-exporter-8b5f8586-rx9zm 1/1 Running 0 50m
cluster-f32bb7bf-scraper-8549f7f979-lpz5x 1/1 Running 0 51m
cluster-f32bb7bf-zookeeper-0 1/1 Running 0 51m
cluster-f32bb7bf-zookeeper-1 1/1 Running 0 51m
cluster-f32bb7bf-zookeeper-2 1/1 Running 0 51m
...
[1] - https://drive.google.com/file/d/1kAj27sWi79kGs96i3vCWegpfhrz_cBJV/view?usp=sharing
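For reference, the "Len error 1048654" above looks consistent with ZooKeeper rejecting a single request that exceeds its jute.maxbuffer limit, which defaults to 0xfffff (1048575) bytes, just under 1 MiB. A minimal, purely illustrative comparison (not ZooKeeper's actual code):

// Assumption: the default jute.maxbuffer of 0xfffff bytes applies in this cluster.
public class JuteMaxBufferCheck {
    private static final int DEFAULT_JUTE_MAX_BUFFER = 0xfffff; // 1048575 bytes, just under 1 MiB

    public static void main(String[] args) {
        int observedRequestLen = 1048654; // the "Len error 1048654" from the ZK WARN log above
        boolean overLimit = observedRequestLen > DEFAULT_JUTE_MAX_BUFFER;
        System.out.printf("request=%d bytes, limit=%d bytes, overLimit=%b%n",
                observedRequestLen, DEFAULT_JUTE_MAX_BUFFER, overLimit); // prints overLimit=true
    }
}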
I will not be able to attend today's triage, but to summarize: this problem occurs in ZK-based clusters. On KRaft clusters the limit is around 19,300 KafkaUsers [1] with an identical environment and the same machine flavour.
[1] - #10018
Triaged on the community call on 2.5.2024: There might be some ZooKeeper tuning that could help. But given that ZooKeeper is being removed, it does not seem worth a deeper investigation or changes in Kafka / ZooKeeper, should any be needed. This should be closed.
Related problem
The SCRAM-SHA credentials cache in the User Operator fails to maintain readiness under high user load scenarios. This issue becomes apparent when the number of Kafka users managed by the User Operator approaches 4800. At this point, the system throws a RuntimeException indicating that the SCRAM-SHA Credentials Cache is not ready, which prevents the User Operator from processing further user reconciliations effectively.

Suggested solution
I would like to enhance the cache initialization and error-handling mechanisms to ensure that the cache can handle high loads without becoming unready. One possible direction is sketched below.
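As a rough illustration of the direction (the class and method names below are hypothetical, not the actual User Operator code), the cache could wait for its initial load with a bounded timeout and let the caller requeue the reconciliation instead of throwing a RuntimeException as soon as the cache is not ready:

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: let callers wait (with a bound) for the SCRAM-SHA credentials
// cache to finish its initial load instead of failing the reconciliation immediately.
class ScramCredentialsCacheSketch {
    private final Map<String, String> credentials = new ConcurrentHashMap<>();
    private volatile boolean ready = false;

    // Called from a background loader once the initial snapshot (e.g. fetched via the
    // Kafka Admin API) is in place; the loading itself is out of scope for this sketch.
    void markReady(Map<String, String> initialSnapshot) {
        credentials.putAll(initialSnapshot);
        ready = true;
    }

    // Bounded wait: poll until the cache is ready or the timeout elapses, so that a slow
    // initial load of thousands of users does not blow up ongoing reconciliations.
    boolean awaitReady(Duration timeout) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!ready) {
            if (System.nanoTime() >= deadline) {
                return false; // caller can requeue the reconciliation instead of throwing
            }
            Thread.sleep(100);
        }
        return true;
    }
}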
Alternatives
No response
Additional context
The current caching mechanism is critical for reducing Kafka broker load by minimizing unnecessary calls to the Kafka Admin API. This issue effectively limits how many KafkaUsers the UO can handle. Logs and error messages related to the issue: [1]
[1] - https://gist.github.com/see-quick/d85dece01a711b5be9f2507a30e0124b
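For context, the kind of call the cache is meant to amortize looks roughly like the following (standard Kafka Admin API; the bootstrap address is only a placeholder for this sketch):

import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.UserScramCredentialsDescription;

// Sketch of the bulk call the cache is meant to avoid repeating for every KafkaUser:
// one Admin API round-trip describing all SCRAM credentials in the cluster.
public class ScramSnapshotExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // With ~4800 users, this snapshot is what the cache would hold between reconciliations.
            Map<String, UserScramCredentialsDescription> users =
                    admin.describeUserScramCredentials().all().get();
            System.out.println("Users with SCRAM credentials: " + users.size());
        }
    }
}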