WIP/RFC: Improving the performance of fetching topics and consumer groups #134
Conversation
Thanks @ankon for this PR. Maybe the related perf issue #55 about caching could help you with response time? I'll give you my feedback soon.
True, I do change how that's done -- but the change shouldn't be externally visible; it merely reorders calls so that one avoids querying Kafka too often, and rather uses bulk queries. Especially the
Caching is certainly an interesting approach as well, and I'll try to have a look at that cache. Still, these would be complementary: one improves the performance of the querying itself, and the caching reduces the number of times you'd have to query in the first place, I suppose.
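To illustrate the reordering idea from the comment above -- replacing N per-item round trips with one bulk query -- here is a minimal Python sketch. This is not KafkaHQ code (the project is Java); the `describe_*` functions are hypothetical stand-ins for per-topic vs. batched Kafka admin calls.

```python
# Sketch: turning N per-topic lookups into one bulk lookup.
# Both fetchers are fakes standing in for Kafka admin round trips.

def describe_one(name):
    # Hypothetical per-topic call: one round trip per topic.
    return {"name": name, "partitions": 1}

def describe_many(names):
    # Hypothetical bulk call: one round trip for the whole batch.
    return {n: {"name": n, "partitions": 1} for n in names}

def slow_listing(names):
    return [describe_one(n) for n in names]      # N round trips

def fast_listing(names):
    details = describe_many(names)               # 1 round trip
    return [details[n] for n in names]

topics = ["orders", "payments", "audit"]
assert slow_listing(topics) == fast_listing(topics)
```

The externally visible result is identical; only the number of round trips to the broker changes, which is why the change "shouldn't be externally visible".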
Hello @ankon. I've had a quick look, but I will need to look deeper. My first observation: it seems that your changes "break" pagination on the topic list:
To get more debug output, try running this:

```shell
curl -i -X POST -H "Content-Type: application/json" \
  -d '{ "configuredLevel": "TRACE" }' \
  http://localhost:8080/loggers/org.kafkahq
```

or change logback.xml. The most important part is that only the topics on the current page are loaded: on a big cluster, describing topic offsets can take very long, and pagination is there to avoid loading all the topic information. Tell me what you think, and thanks for the work 👍
Any progress on this PR? We have a similar performance issue with a cluster that has many topics and consumer groups. We are really interested in these enhancements.
Force-pushed from ac5776e to bd78d59 (compare)
Thanks for taking a look!
Fair point, I likely lost that by joining things together. Fixed now with an explicit sorting step, and while at it I rebased the branch onto the current HEAD of dev.
Right, sensible. I've now configured my own instance with a low enough page size and started to have a look. Reading through, I need to think a bit further and read more of the code here first; it's quite possible that the batching approach helps a lot in my situation (many groups, few topics) and would produce the complete opposite effect in other cases.

@ftardif Sorry for the delays, this is right now "next to" my day job :) I do try to keep the branch compiling though, and for local testing you could check out this branch and deploy it into your own Docker registry (or use the produced kafkahq jar). Feedback on whether this improves the world for you would be very much appreciated!

```shell
npm install && npm run build && ./gradlew test shadowJar && \
  cp build/libs/kafkahq-*.jar docker/app/kafkahq.jar && \
  docker build . -t kafkahq:$(git describe --abbrev --tags --dirty)
```
I've now opted for the most "direct" approach: list the topics (or consumer groups) first, page over that list, and then look up the content from Kafka for only that page. Depending on how your groups/topics are distributed over each other this may be more or less "good", so another round of review/testing would definitely be appreciated.
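The "list first, page, then describe only that page" approach can be sketched as follows. This is an illustrative Python model, not the Java implementation: the topic names, `list_topic_names`, and `describe_topics` are made up, standing in for a cheap name-listing call and an expensive describe call.

```python
# Sketch: page over cheap names first, only describe the current page.

def list_topic_names():
    # Cheap metadata-only call (stand-in for a Kafka listTopics call).
    return ["audit", "orders", "payments", "users", "logs"]

def describe_topics(names):
    # Expensive call (offsets, configs, ...), now limited to one page.
    return [{"name": n} for n in names]

def topic_page(page, page_size):
    names = sorted(list_topic_names())   # explicit sort keeps pagination stable
    start = page * page_size
    return describe_topics(names[start:start + page_size])

assert [t["name"] for t in topic_page(0, 2)] == ["audit", "logs"]
assert [t["name"] for t in topic_page(1, 2)] == ["orders", "payments"]
```

The explicit sort before slicing is what restores the stable pagination order that the earlier review comment pointed out was lost.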
Thanks @ankon. To be honest, I'll try to explain what my recent research made me understand, especially from #55, #137 and some other reports. Here is the history:
But some reports like #137 show the main trouble in the core design of these pages:
So the next actions IMO:
I've tried for a long time to avoid this, but since the consumer offset query can lead to a timeout in a lot of cases, there is IMO no way around it. Since this is a big piece of work, I'll try to review the latest additions on your PR next week to see if they bring improvements that can be useful in the short term, before the big refactoring to move everything async. Thanks for your work, and let's keep in touch next week 👍
Hey @here!
https://github.com/tchiotludo/kafkahq#pagination with the option. Tell me what you think and whether it helps in your case.
I'll try to rebase the branch onto that. I'm a bit suspicious here, as "more threads, more better" hasn't always been true in my experience :) FWIW, here is what I did to make KafkaHQ somewhat usable: I disabled the fetching of consumer groups completely (ea6ef98), as in my case this data isn't really needed.
I've seen a better experience on my side. I've also seen your last modification; it seems to be exactly what I stated in my previous comment: moving to async will do the trick. I just need to find the time.
Interested in this, our cluster has 20 topics and it takes 10 seconds to load the topic list page (pagination = 10 records, so not even all are loaded).
@jorgheymans even with the latest version?
@tchiotludo yes that's with the latest version, we just started using kafkahq :) |
FWIW, we also don't need consumer group fetching; group lag is tracked via Prometheus metrics. So here is what I was thinking of as possibilities:
WDYT?
To be honest, I really think consumer group lag is a major feature of KafkaHQ; I personally use it every day. This feature slows down the application for now, and the only solution IMO is to go async. The async work will take more time, but it will resolve all the performance issues (#55), since consumer groups are not the only thing that takes time on the topic list (offsets also take time) -- removing consumer groups is only a quick patch that will not resolve all the performance issues. Another option is the experimental cache: #55 (comment)
OK, I understand it's an important feature; I was just looking for a quick way out, I guess :-) Also, in the case where you have, say, 100 consumer groups for a topic, would it clutter the table, or is it graceful enough to handle this? Consumer groups accumulate over time, with applications / console consumers / kafkacat etc.
Good point. That's never happened for me, with something like 4-5 consumer groups per topic and console-consumer groups being deleted. I'll keep it in mind when I do the async work.
Having consumer group information would certainly be nice for us ... to some extent. We have a multi-tenant solution, with a couple of topics per tenant. Multiple services (>40!) work on these topics, and each of these services creates a unique consumer group *for every instance*. On our test environment I just checked: we have maybe 5 tenants, and right now almost 4000 consumer groups. Most of these groups are "dead", and their lag would be irrelevant (it would roughly indicate how long these groups have been dead :D).

Now, this may be fairly unique, and it may be "wrong" in terms of best practices. We're considering changes that would reduce the number of groups, but realistically the number will always be a lot higher than just 5 per topic. At that point the UI of KafkaHQ isn't able to show the groups anyway, other than by cutting some off -- and there isn't really a good heuristic to work out which ones to cut: dropping the ones with a high lag would be fine for us, but would obviously fail to show the ones with a high lag *not* caused by the group being dead. :)

I think a switch to disable this actually works quite well in this case. I have now been able to *use* KafkaHQ for diagnosing some support questions, and I love it even without the consumer groups :)
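The heuristic problem described above -- which of thousands of mostly-dead groups to show -- can be made concrete with a small sketch. This is hypothetical Python (KafkaHQ is Java); the state names mirror Kafka's ConsumerGroupState values, but the group data is invented. Filtering on state rather than lag avoids exactly the trap mentioned: a dead group and a genuinely lagging group can have the same lag number.

```python
# Sketch: filter groups by state, not by lag, before rendering.
# A dead group with huge "lag" and a live lagging group look the same
# if you only sort by lag; the state distinguishes them.

groups = [
    {"id": "svc-a-1",            "state": "Stable", "lag": 12},
    {"id": "svc-a-old-instance", "state": "Empty",  "lag": 40_000},
    {"id": "console-consumer-9", "state": "Dead",   "lag": 0},
    {"id": "svc-b-2",            "state": "Stable", "lag": 9_000},
]

def active_groups(groups):
    return [g for g in groups if g["state"] not in ("Empty", "Dead")]

assert [g["id"] for g in active_groups(groups)] == ["svc-a-1", "svc-b-2"]
```

Note the live group `svc-b-2` with real lag survives the filter, while the dead group with a larger lag number is dropped -- which a pure lag cutoff would get wrong.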
We have a cluster with hundreds of topics and hundreds of consumer groups. Latency due to calculating consumer lag in HQ is the #1 pain for the dev team. I would very much agree with a toggle switch to keep the topic and consumer pages leaner, and only join in the consumer lag information once we double-click on either a specific topic or a consumer group.

Thanks,
Frederic Tardif
As I said, this is not the only thing that is slow; removing consumer groups will not save you from the others (topic offsets are the main source of slowness). Async is the only solution.

PRs are welcome, guys 😄
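The "go async" idea discussed throughout this thread can be modeled in a few lines: instead of fetching the slow per-topic data (offsets, consumer lag) one topic at a time, fire all the lookups concurrently. This is an illustrative Python sketch with a simulated slow fetcher, not the eventual Java implementation; the 0.1 s sleep stands in for a Kafka round trip.

```python
# Sketch: fetch slow per-topic data concurrently instead of serially.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_offsets(topic):
    time.sleep(0.1)              # simulated slow Kafka round trip
    return (topic, 100)          # made-up offset value

topics = [f"topic-{i}" for i in range(8)]

start = time.monotonic()
with ThreadPoolExecutor(max_workers=8) as pool:
    offsets = dict(pool.map(fetch_offsets, topics))
elapsed = time.monotonic() - start

serial_estimate = 0.1 * len(topics)   # what a one-by-one loop would cost
assert len(offsets) == 8
assert elapsed < serial_estimate      # the 8 round trips overlap
```

The earlier caveat still applies ("more threads, more better" isn't always true): the win here comes from overlapping network waits, and a real implementation would need to bound concurrency against the broker.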
@tchiotludo I would provide a PR, but #153 :/
This fixes problems with the missing encoding of the path parameter, and at the same time avoids doing string manipulation on things whose nature we already know ("URL path segments"). URI.js already does a good job of combining these as needed.
Sure, we can do that -- after all, that is somewhat the point of this PR :) See #159; note that I only compile-tested that version.
Hey @ankon, I just pushed a new dev version that will, I hope, fix a lot of the performance issues. First of all, sorry: I started with a completely different approach than yours, and after seeing some good things in this PR I tried to merge it, but there were too many conflicts, so I "stole" some parts of this PR -- really sorry for that. Maybe you can try the dev version to see if it fits your needs, and perhaps open a new PR with any additions I missed? I know this is hard work for you, so it's as you wish, and sorry for having taken some parts without merging them 🙏
Thanks for your work!
No problem at all: This PR wasn't really meant for merging as-is, but more for discussion -- and I think we did have that :)
I actually just deployed it, as I wanted to try an "unmodified version" to see whether we would still need additional patches and whether it would make sense to start rebasing/merging. Turns out: we don't; version 10ba9ad seems to work just fine in our environment. We did keep the consumer groups disabled, because we don't need them anyway (see above). So I think it might be best to close this PR, and open a new one if there are new/more/etc. things. WDYT?
Thanks @ankon, really glad to hear that this version is OK for you :) I need to take some more time to make all the time-consuming operations async (which would allow bringing the consumer-group columns back), but that will take longer. So, as you said, I will close this one; feel free to open a new one if you have more optimizations in mind. Thanks 👍
* implemented sse search on topic details
* changed base url
I found KafkaHQ quite nice -- but it is way too slow, and times out even with a "small" local setup of our environment.
Looking a bit at the trace logging and the code I found a few places that definitely could be improved:
The linked commits "help" (my local setup now loads the `/:cluster/topic` page nice and fast), but it still times out in my test production setup with a few hundred consumer groups. Ultimately I want this to work on my production setup, which has a similar number of consumer groups but considerably more topics distributed over those groups.