WIP/RFC: Improving the performance of fetching topics and consumer groups #134

Closed
wants to merge 16 commits

Conversation

ankon
Contributor

ankon commented Oct 22, 2019

I found KafkaHQ quite nice -- but it is way too slow and times out even with a "small" local setup of our environment.

Looking a bit at the trace logging and the code I found a few places that definitely could be improved:

  • TopicRepository was fetching all topics and then filtering them
  • ConsumerGroupRepository was similar, and additionally did the same when looking up offsets

The linked commits "help" (my local setup is now nice and fast to load the /:cluster/topic page), but it still times out in my test production setup with a few hundred consumer groups. Ultimately I do want this to work on my production setup, which has a similar number of consumer groups, but considerably more topics distributed over these groups.

@tchiotludo
Owner

Thanks @ankon for this PR.
Will have a look soon. It could be a good optimization, but I need to look at it more closely, since you change how you get all consumer groups for a topic.
As far as I can see, there is no other way to get all the groups subscribed to a topic than listing all consumer groups.
Maybe this PR will make me think of a better way.

Maybe the related performance issue #55, about caching, could help you with response time?

Will give you my feedback soon

@ankon
Contributor Author

ankon commented Oct 22, 2019

[...] change how you get all consumer groups for a topic.

True, I do change how that is done -- but the change shouldn't be externally visible; it merely reorders calls to avoid querying Kafka too often, using bulk queries instead.

Especially ConsumerGroupRepository::findByName was a problem, because it would first find all consumer groups and then do a query for each of them, which produces the set of consumed topics for that group as a side effect. It would then query information about these topics inside that loop, which requires creating and closing a Kafka consumer (~3s for us, depending on heartbeat times). In our case the consumer groups actually share the same topics in many cases, so the refactoring uses that to first build a complete list of all topics used by any consumer group, and then do a single query for all of them.
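For illustration, here is a rough sketch of that batching idea against the plain Kafka AdminClient API (class and method names are hypothetical, not the actual repository code): collect the committed offsets of every group first, then describe all referenced topics in a single call.

import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

class BulkTopicLookup {
    // Sketch: gather the topics referenced by any group's committed offsets,
    // then describe them all with one bulk query instead of one per group.
    static Map<String, TopicDescription> describeTopicsUsedByGroups(
            AdminClient adminClient, Collection<String> groupIds) throws Exception {
        Set<String> allTopics = new HashSet<>();
        for (String groupId : groupIds) {
            Map<TopicPartition, OffsetAndMetadata> offsets = adminClient
                .listConsumerGroupOffsets(groupId)
                .partitionsToOffsetAndMetadata()
                .get();
            offsets.keySet().forEach(tp -> allTopics.add(tp.topic()));
        }
        // Single bulk lookup for every topic used by any group.
        return adminClient.describeTopics(allTopics).all().get();
    }
}

The expensive per-group work then happens once for the whole set of topics rather than once per consumer group.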

Maybe the related performance issue #55, about caching, could help you with response time?

Caching is certainly an interesting approach as well, and I'll try to have a look at that cache. Still, these would be complementary: one improves the performance of the querying itself, and the caching reduces the number of times you'd have to query in the first place, I suppose.

@tchiotludo
Owner

Hello @ankon.

I've had a quick look, but I will need to look at it more deeply.

My first observation: it seems that your changes are "breaking" pagination on the topic list:

  • To be more precise, the order is not the same (natural sort order on the dev branch, seemingly random on your branch).
  • Also, .map(CompletableFuture::completedFuture) breaks the lazy loading of the pagination: some topics are loaded that are not displayed on the current page.

To get more debug output, try running this:

curl -i -X POST -H "Content-Type: application/json" \
       -d '{ "configuredLevel": "TRACE" }' \
       http://localhost:8080/loggers/org.kafkahq

or change logback.xml.
And change the page size with kafkahq.topic.page-size in order to see pagination in action.

The most important part is to have only the topics on the current page loaded; on a big cluster, describing topic offsets can take very long, and pagination is there to avoid loading all the topic information.

Tell me what you think, and thanks for the work 👍

@ftardif

ftardif commented Nov 1, 2019

Any progress on this PR? We have a similar performance issue with a cluster that has many topics and consumer groups. We are really interested in these enhancements.

@ankon
Contributor Author

ankon commented Nov 1, 2019

Thanks for taking a look!

To be more precise, the order is not the same (natural sort order on the dev branch, seemingly random on your branch).

Fair point, I likely lost that by joining things together. I've fixed it with an explicit sorting step now, and while I was at it, rebased the branch onto the current HEAD of dev.

The most important part is to have only the topics on the current page loaded; on a big cluster, describing topic offsets can take very long, and pagination is there to avoid loading all the topic information.

Right, sensible. I have now configured my own instance with a low enough page-size and started to have a look. Reading through org.kafkahq.utils.CompletablePaged, it seems that pagination so far works in quite a roundabout way: the TopicController creates a bunch of futures for all matching topics, and CompletablePaged then cancels all but the needed futures. By moving towards more "batching" of requests I did indeed make the paging less efficient in how it reduces work (to the point where we have all results ready before even looking at the pages), but at the same time I also avoid most of the roundtrip work that the paging would have cancelled.

I need to think a bit further and read through more of the code here first; it's quite possible that the batching approach helps a lot in my situation (many groups, few topics) and would produce the complete opposite effect in other cases.

@ftardif Sorry for the delays, this is right now "next to" my day job :) I do try to keep the branch compiling though, and for local testing you could try checking out this branch and deploying it into your own docker registry (or using the produced kafkahq jar). Feedback on whether this improves the world for you would be very much appreciated!

npm install && npm run build && ./gradlew test shadowJar && cp build/libs/kafkahq-*.jar docker/app/kafkahq.jar && docker build . -t kafkahq:$(git describe --abbrev --tags --dirty)

@ankon
Contributor Author

ankon commented Nov 1, 2019

I need to think a bit further and read through more of the code here first; it's quite possible that the batching approach helps a lot in my situation (many groups, few topics) and would produce the complete opposite effect in other cases.

I've now opted for the most "direct" approach: list the topics (or consumer groups) first, page over that, and then look up the content from Kafka only for that page. Depending on how your groups/topics distribute over each other this may be more or less "good", so another round of review/testing would definitely be appreciated.
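Roughly, the idea looks like the following AdminClient sketch (illustrative names, not the actual PR code): listing topic names is cheap, so list and sort them all, slice out the requested page, and only run the expensive describe for that page.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

class PagedTopicLookup {
    // Sketch: page over the (cheap) list of names, describe only the current page.
    static Map<String, TopicDescription> describePage(
            AdminClient adminClient, int page, int pageSize) throws Exception {
        List<String> names = new ArrayList<>(adminClient.listTopics().names().get());
        Collections.sort(names); // keep the natural sort order the UI expects

        int from = Math.min(page * pageSize, names.size());
        int to = Math.min(from + pageSize, names.size());
        List<String> currentPage = names.subList(from, to);

        // Only the topics on the requested page are described (the expensive part).
        return adminClient.describeTopics(currentPage).all().get();
    }
}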

@tchiotludo
Owner

Thanks @ankon.

To be honest, I'll try to explain what my recent research, especially around #55, #137 and some other reports, has let me understand.
I've never been able to reproduce the latency on my side, mostly because my Kafka cluster is local (on the local network).
I've had some chances to look at KafkaHQ on Confluent Cloud and Aiven (over the internet) with more topics & consumer groups.

Here is the history:

  • Before, the TopicRepository was doing exactly what you are doing (sending a batch query with all the topics & consumer groups at once, with no pagination at that time); the result was better performance than now on a small cluster.
  • After some reports, I saw that people with a lot of topics / consumer groups were hitting real latency, so I did the version with CompletableFuture: only the current page is loaded, but the batch fetch was removed, and I saw better performance on my local computer. What I just realized is that the default executor is the ForkJoinPool (with 1 worker per CPU). Since most people run KafkaHQ in Docker with limited CPU (mostly 1, from reports), that completely ruins the use of CompletableFuture.allOf: everything is done sequentially, not in parallel like I hoped.
    • One solution here would be to provide a configurable custom executor pool to allow more parallel queries (I'm still thinking about providing this option as a quick & dirty solution); see the sketch after this list.
  • I'll try to add some cache on #55 (Loading topics page takes time with a big number of partitions / topics), but it's clearly something I didn't want at the start of KafkaHQ: Yahoo Kafka Manager is not real time, with a lot of caching, and that is really frustrating when developing new features based on Kafka.
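To illustrate the executor point above -- a sketch only, not KafkaHQ code, and describe() is a hypothetical stand-in for a blocking Kafka call: with the common ForkJoinPool on a 1-CPU container the supplyAsync tasks effectively run one after another, while a dedicated pool keeps the blocking round-trips in flight concurrently.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class ParallelDescribe {
    // Sketch: run the blocking calls on a dedicated pool so CompletableFuture.allOf
    // actually overlaps them, even when only one CPU is available to the container.
    static List<String> fetchAll(List<String> topics, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = topics.stream()
                .map(topic -> CompletableFuture.supplyAsync(() -> describe(topic), pool))
                .collect(Collectors.toList());
            // Wait for all of them; with the dedicated pool they run concurrently.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for a slow, blocking Kafka describe call.
    static String describe(String topic) {
        return topic + ": described";
    }
}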

But some reports like #137 show the main trouble in the conception of these pages:

  • Getting the offsets of a topic can take very long, especially on a big or unstable cluster.
  • Similarly, displaying the consumer groups (with lag) on the same page can be slow; a rough sketch of what computing lag involves follows below.
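For context, computing a group's lag needs both the group's committed offsets and the partitions' end offsets, each a broker round-trip, roughly like this sketch (hypothetical names, using the AdminClient plus a KafkaConsumer for end offsets); doing that per group on the topic list adds up quickly.

import java.util.Map;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

class GroupLag {
    // Sketch: lag = latest partition offset minus the group's committed offset.
    static Map<TopicPartition, Long> lag(AdminClient adminClient,
                                         KafkaConsumer<?, ?> consumer,
                                         String groupId) throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed = adminClient
            .listConsumerGroupOffsets(groupId)
            .partitionsToOffsetAndMetadata()
            .get();

        // End offsets come from a consumer, another round-trip per group/page.
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

        return committed.entrySet().stream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> endOffsets.get(e.getKey()) - e.getValue().offset()));
    }
}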

So the next actions, IMO:

  • Move a lot of things to async ajax calls, especially offsets / log dir size and consumer groups on the topic page.
  • Move the list of topic lag to async on the consumer group page.
  • Let people customize the number of parallel threads used to complete the CompletableFutures on a page.

I've tried for a long time to avoid this, but since the consumer offset query can lead to a timeout in a lot of cases, there is no way IMO to avoid it.

Since this is a big piece of work, I'll try to review the latest additions to your PR next week to see whether they contain improvements that could be useful in the short term, before the big refactoring to move everything to async.

Thanks for your work and keep in touch next week 👍

@tchiotludo
Owner

Hey @here!
The new release, version 0.10.0, includes the first easy part:

Let people customize the number of parallel threads used to complete the CompletableFutures on a page.

https://github.com/tchiotludo/kafkahq#pagination with the option kafkahq.pagination.threads.

Tell me what you think and whether it helps in your case.

@ankon
Contributor Author

ankon commented Nov 19, 2019

I'll try to rebase the branch onto that.

I'm a bit suspicious here, as "more threads, more better" hasn't always been true in my experience :)

FWIW, what I did now to make KafkaHQ somewhat usable: I disabled the fetching of consumer groups completely (ea6ef98), as in my case this data isn't really needed.

@tchiotludo
Owner

I've seen a better experience on my side.
But it always depends on the Kafka server response time.

I've seen your last modification; it seems to be exactly what I stated in my previous comment: moving to async will do the trick. I just need to find the time.

@jorgheymans
Contributor

jorgheymans commented Nov 21, 2019

Interested in this: our cluster has 20 topics and it takes 10 seconds to load the topic list page (pagination = 10 records, so not even all of them are loaded).

@tchiotludo
Owner

@jorgheymans even with the latest version?

@jorgheymans
Contributor

@tchiotludo yes that's with the latest version, we just started using kafkahq :)

@jorgheymans
Contributor

FWIW, we also don't need consumer group fetching; group lag is tracked using Prometheus metrics. So what I was thinking of as possibilities:

  • Have a config toggle that allows sites to disable it if they have no use for it. This should be easy and not impact existing users.
  • Instead of showing consumer groups for all topics by default, have a button or something in that row cell that, upon click, fetches the group info only for that topic. I don't think there is a need for a global overview of all consumer groups for all topics, and if there is, it should be made a different screen altogether instead of packing it all into the topic list.

WDYT ?

@tchiotludo
Owner

To be honest, I really think the consumer group lag is a major feature of KafkaHQ. I personally use it every day.

This feature slows down the application for now, and the only solution is to go async, IMO.
It will help both kinds of people, those who need this information and those who don't, since the application will be very quick to display the main information.

This async feature will take more time, but it will resolve all the performance issues (#55), since consumer groups are not the only thing that takes time on the topic list (offsets also take time); removing consumer groups is only a quick patch that will not resolve all the performance issues.

Another option is the experimental cache: #55 (comment)
It will speed up the whole KafkaHQ experience, but it can lead to some bugs.
You can give it a try if you want; it's available in the latest version.

@jorgheymans
Contributor

OK, I understand it's an important feature. I was just looking for a quick way out, I guess :-)

Also, in the case where you have, let's say, 100 consumer groups for a topic, would it clutter the table, or is it graceful enough to handle this? Consumer groups accumulate over time, with applications / console consumers / kafkacat etc.

@tchiotludo
Owner

Good point. That has never happened for me, with like 4-5 consumer groups per topic & console-consumer groups being deleted. I will keep it in mind when I do the async work.

@ankon
Contributor Author

ankon commented Nov 25, 2019

Having consumer group information would certainly be nice for us ... to some extent.

We have a multi-tenant solution, with a couple of topics per tenant. Multiple services (>40!) work over these topics, and every instance of each of these services creates a unique consumer group. On our test environment I just checked: we have maybe 5 tenants, and right now almost 4000 consumer groups. Most of these groups are "dead", and their lag would be irrelevant (it would roughly indicate how long these groups have been dead :D).

Now, this may be fairly unique, and it may be "wrong" in terms of best practices. We're considering changes that would reduce the number of groups, but realistically the number will always be a lot higher than just 5 per topic. At that point the UI of KafkaHQ isn't able to show the groups anyway, except by cutting some off -- and there isn't really a good heuristic to work out which ones to cut: dropping the ones with a high lag would be fine for us, but it would obviously fail to show the ones whose high lag is not caused by the group being dead. :)

I think a switch to disable this actually works quite well in this case. I have now been able to use KafkaHQ for diagnosing some support questions, and I love it even without the consumer groups :)

@ftardif

ftardif commented Nov 25, 2019 via email

@tchiotludo
Owner

As I said, this is not the only thing that is slow; removing consumer groups will not save you from the other things (like topic offsets, which are the main reason for the slowness).

Async is the only solution:

  • removing all the offset fetching / consumer groups from the main page.
  • maybe allowing a cache only on this part (which is not needed in real time).

PRs are welcome guys 😄

@jorgheymans
Contributor

@tchiotludo I would provide a PR, but #153 :/

This fixes problems with the missing encoding of the path parameter, and at the same time avoids doing
string manipulation on things we already know the shape of ("URL path segments"). URI.js already does a good
job of combining these as needed.
@jorgheymans
Contributor

@ankon maybe ea6ef98 could be part of a separate PR so we can contribute to this change only?

@ankon
Contributor Author

ankon commented Nov 28, 2019

Sure, we can do that -- after all, that is kind of the point of this PR :)

See #159; note that I only compile-tested that version.

tchiotludo added a commit that referenced this pull request Dec 7, 2019
Reintroduce the RequestScope
Remove the CompletablePaged and try to fetch data per list
Remove acls from models and delegate to controller
Update to micronaut 1.27

use a lot of PR #134 for inspiration
close #156
relate to #55
tchiotludo added a commit that referenced this pull request Dec 8, 2019
Reintroduce the RequestScope
Remove the CompletablePaged and try to fetch data per list
Remove acls from models and delegate to controller
Update to micronaut 1.27

use a lot of PR #134 for inspiration
close #156
relate to #55
@tchiotludo
Owner

Hey @ankon, I just pushed a new dev version that will, I hope, fix a lot of the performance issues.

So first, sorry: I started with a completely different approach than yours, and then saw some good things in this PR, but when I tried to merge it there were too many conflicts, so I "stole" some parts of this PR; really sorry for that.

Maybe you can try the dev version to see if it fits your needs, and maybe open a new PR with any additions I missed?

I know this was hard work for you, so it's as you wish, and sorry to have stolen some parts without merging them 🙏

@ankon
Contributor Author

ankon commented Dec 9, 2019

Thanks for your work!

So first, sorry: I started with a completely different approach than yours, and then saw some good things in this PR, but when I tried to merge it there were too many conflicts, so I "stole" some parts of this PR; really sorry for that.

No problem at all: This PR wasn't really meant for merging as-is, but more for discussion -- and I think we did have that :)

Maybe you can try the dev version to see if it fits your needs, and maybe open a new PR with any additions I missed?

I actually just deployed it, as I wanted to try an "unmodified version" to see whether we would still need additional patches and whether it would make sense to start rebasing/merging. Turns out we don't: version 10ba9ad seems to work just fine in our environment. We did still keep the consumer groups disabled, because we don't need them anyway (see above).

So I think maybe best could be to close this PR, and if there is new/more/etc things open a new one. WDYT?

@tchiotludo
Owner

Thanks @ankon, really glad to hear that this version is OK for you :)
It means that we are on the right path and the first part of the optimization is working.
I'll release a new version really soon 🎉

I need to take some more time to go async for all time-consuming operations (allowing the removal of the consumer group columns), but that will take longer.
I'll wait for more community feedback on the current version.

So, like you said, I will close this one; feel free to open a new one if you have more optimizations in mind, thanks 👍

tchiotludo closed this Dec 9, 2019
ghost pushed a commit that referenced this pull request Jun 19, 2020
* implemented sse search on topic details

* changed base url