WIP/RFC: Improving the performance of fetching topics and consumer groups #134

Closed
wants to merge 16 commits

Conversation

ankon
Contributor

ankon commented Oct 22, 2019

I found KafkaHQ quite nice -- but it is way too slow and times out even with a "small" local setup of our environment.

Looking a bit at the trace logging and the code I found a few places that definitely could be improved:

  • TopicRepository was fetching all topics and then filtering them
  • ConsumerGroupRepository was similar, and additionally did the same when looking up offsets

The linked commits "help" (my local setup is now nice and fast to load the /:cluster/topic page), but it still times out in my test production setup with a few hundred consumer groups. Ultimately I do want this to work on my production setup, which has a similar number of consumer groups, but considerably more topics distributed over these groups.

@tchiotludo
Owner

Thanks @ankon for this PR.
Will have a look soon. It could be a good optimization, but I need to look at it more closely, since you change how you get all consumer groups for a topic.
As far as I can see, there is no other way to get all the groups subscribed to a topic than listing all consumer groups.
Maybe this PR will make me think of a better way.

Maybe the related performance issue #55, about caching, could help you with response time?

Will give you my feedback soon

@ankon
Contributor Author

ankon commented Oct 22, 2019

[...] change how you get all consumer groups for a topic.

True, I do change how that is done -- but the change shouldn't be externally visible; it merely reorders calls to avoid querying Kafka too often, using bulk queries instead.

Especially ConsumerGroupRepository::findByName was a problem, because it would first find all consumer groups and then do a query for each of them, which produces the set of consumed topics for that group as a side effect. It would then query information about these topics inside that loop, which requires creating and closing a Kafka consumer (~3s for us, depending on heartbeat times). In our case the consumer groups actually share the same topics in many cases, so the refactoring uses that to first build a complete list of all topics used by any consumer group, and then do a single query for all of them.
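For illustration, here is a rough sketch of that batching idea against the plain Kafka AdminClient API (class and method names are hypothetical, not the actual repository code): collect the committed offsets of every group first, then describe all referenced topics in a single call.

import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

class BulkTopicLookup {
    // Sketch: gather the topics referenced by any group's committed offsets,
    // then describe them all with one bulk query instead of one per group.
    static Map<String, TopicDescription> describeTopicsUsedByGroups(
            AdminClient adminClient, Collection<String> groupIds) throws Exception {
        Set<String> allTopics = new HashSet<>();
        for (String groupId : groupIds) {
            Map<TopicPartition, OffsetAndMetadata> offsets = adminClient
                .listConsumerGroupOffsets(groupId)
                .partitionsToOffsetAndMetadata()
                .get();
            offsets.keySet().forEach(tp -> allTopics.add(tp.topic()));
        }
        // Single bulk lookup for every topic used by any group.
        return adminClient.describeTopics(allTopics).all().get();
    }
}

The expensive per-group work then happens once for the whole set of topics rather than once per consumer group.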

Maybe the related performance issue #55, about caching, could help you with response time?

Caching is certainly an interesting approach as well, and I'll try to have a look at that cache. Still, these would be complementary: one improves the performance of the querying itself, and the caching reduces the number of times you'd have to query in the first place, I suppose.

@tchiotludo
Owner

Hello @ankon.

I've had a quick look, but I will need to look at it more deeply.

My first observation: it seems that your changes are "breaking" pagination on the topic list:

  • To be more precise, the order is not the same (natural sort order on the dev branch, seemingly random on your branch).
  • Also, .map(CompletableFuture::completedFuture) breaks the lazy loading of the pagination: some topics are loaded that are not displayed on the current page.

To get more debug output, try running this:

curl -i -X POST -H "Content-Type: application/json" \
       -d '{ "configuredLevel": "TRACE" }' \
       http://localhost:8080/loggers/org.kafkahq

or change logback.xml.
And change the page size with kafkahq.topic.page-size in order to see pagination in action.

The most important part is to have only the topics on the current page loaded; on a big cluster, describing topic offsets can take very long, and pagination is there to avoid loading all the topic information.

Tell me what you think, and thanks for the work 👍

@ftardif

ftardif commented Nov 1, 2019

Any progress on this PR? We have a similar performance issue with a cluster that has many topics and consumer groups. We are really interested in these enhancements.

@ankon
Contributor Author

ankon commented Nov 1, 2019

Thanks for taking a look!

To be more precise, the order is not the same (natural sort order on the dev branch, seemingly random on your branch).

Fair point, I likely lost that by joining things together. I've fixed it with an explicit sorting step now, and while I was at it, rebased the branch onto the current HEAD of dev.

The most important part is to have only the topics on the current page loaded; on a big cluster, describing topic offsets can take very long, and pagination is there to avoid loading all the topic information.

Right, sensible. I have now configured my own instance with a low enough page-size and started to have a look. Reading through org.kafkahq.utils.CompletablePaged, it seems that pagination so far works in quite a roundabout way: the TopicController creates a bunch of futures for all matching topics, and CompletablePaged then cancels all but the needed futures. By moving towards more "batching" of requests I did indeed make the paging less efficient in how it reduces work (to the point where we have all results ready before even looking at the pages), but at the same time I also avoid most of the roundtrip work that the paging would have cancelled.

I need to think a bit further and read through more of the code here first; it's quite possible that the batching approach helps a lot in my situation (many groups, few topics) and would produce the complete opposite effect in other cases.

@ftardif Sorry for the delays, this is right now "next to" my day job :) I do try to keep the branch compiling though, and for local testing you could try checking out this branch and deploying it into your own docker registry (or using the produced kafkahq jar). Feedback on whether this improves the world for you would be very much appreciated!

npm install && npm run build && ./gradlew test shadowJar && cp build/libs/kafkahq-*.jar docker/app/kafkahq.jar && docker build . -t kafkahq:$(git describe --abbrev --tags --dirty)

@ankon
Contributor Author

ankon commented Nov 1, 2019

I need to think a bit further and read through more of the code here first; it's quite possible that the batching approach helps a lot in my situation (many groups, few topics) and would produce the complete opposite effect in other cases.

I've now opted for the most "direct" approach: list the topics (or consumer groups) first, page over that, and then look up the content from Kafka only for that page. Depending on how your groups/topics distribute over each other this may be more or less "good", so another round of review/testing would definitely be appreciated.
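Roughly, the idea looks like the following AdminClient sketch (illustrative names, not the actual PR code): listing topic names is cheap, so list and sort them all, slice out the requested page, and only run the expensive describe for that page.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;

class PagedTopicLookup {
    // Sketch: page over the (cheap) list of names, describe only the current page.
    static Map<String, TopicDescription> describePage(
            AdminClient adminClient, int page, int pageSize) throws Exception {
        List<String> names = new ArrayList<>(adminClient.listTopics().names().get());
        Collections.sort(names); // keep the natural sort order the UI expects

        int from = Math.min(page * pageSize, names.size());
        int to = Math.min(from + pageSize, names.size());
        List<String> currentPage = names.subList(from, to);

        // Only the topics on the requested page are described (the expensive part).
        return adminClient.describeTopics(currentPage).all().get();
    }
}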

@tchiotludo
Owner

Thanks @ankon.

To be honest, I'll try to explain what my recent research, especially around #55, #137 and some other reports, has let me understand.
I've never been able to reproduce the latency on my side, mostly because my Kafka cluster is local (on the local network).
I've had some chances to look at KafkaHQ on Confluent Cloud and Aiven (over the internet) with more topics & consumer groups.

Here is the history:

  • Before, the TopicRepository was doing exactly what you are doing (sending a batch query with all the topics & consumer groups at once, with no pagination at that time); the result was better performance than now on a small cluster.
  • After some reports, I saw that people with a lot of topics / consumer groups were hitting real latency, so I did the version with CompletableFuture: only the current page is loaded, but the batch fetch was removed, and I saw better performance on my local computer. What I just realized is that the default executor is the ForkJoinPool (with 1 worker per CPU). Since most people run KafkaHQ in Docker with limited CPU (mostly 1, from reports), that completely ruins the use of CompletableFuture.allOf: everything is done sequentially, not in parallel like I hoped.
    • One solution here would be to provide a configurable custom executor pool to allow more parallel queries (I'm still thinking about providing this option as a quick & dirty solution); see the sketch after this list.
  • I'll try to add some cache on #55 (Loading topics page takes time with a big number of partitions / topics), but it's clearly something I didn't want at the start of KafkaHQ: Yahoo Kafka Manager is not real time, with a lot of caching, and that is really frustrating when developing new features based on Kafka.
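To illustrate the executor point above -- a sketch only, not KafkaHQ code, and describe() is a hypothetical stand-in for a blocking Kafka call: with the common ForkJoinPool on a 1-CPU container the supplyAsync tasks effectively run one after another, while a dedicated pool keeps the blocking round-trips in flight concurrently.

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

class ParallelDescribe {
    // Sketch: run the blocking calls on a dedicated pool so CompletableFuture.allOf
    // actually overlaps them, even when only one CPU is available to the container.
    static List<String> fetchAll(List<String> topics, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<String>> futures = topics.stream()
                .map(topic -> CompletableFuture.supplyAsync(() -> describe(topic), pool))
                .collect(Collectors.toList());
            // Wait for all of them; with the dedicated pool they run concurrently.
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
            return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            pool.shutdown();
        }
    }

    // Hypothetical stand-in for a slow, blocking Kafka describe call.
    static String describe(String topic) {
        return topic + ": described";
    }
}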

But some reports like #137 show the main trouble in the conception of these pages:

  • Getting the offsets of a topic can take very long, especially on a big or unstable cluster.
  • Similarly, displaying the consumer groups (with lag) on the same page can be slow; a rough sketch of what computing lag involves follows below.
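For context, computing a group's lag needs both the group's committed offsets and the partitions' end offsets, each a broker round-trip, roughly like this sketch (hypothetical names, using the AdminClient plus a KafkaConsumer for end offsets); doing that per group on the topic list adds up quickly.

import java.util.Map;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

class GroupLag {
    // Sketch: lag = latest partition offset minus the group's committed offset.
    static Map<TopicPartition, Long> lag(AdminClient adminClient,
                                         KafkaConsumer<?, ?> consumer,
                                         String groupId) throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed = adminClient
            .listConsumerGroupOffsets(groupId)
            .partitionsToOffsetAndMetadata()
            .get();

        // End offsets come from a consumer, another round-trip per group/page.
        Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());

        return committed.entrySet().stream()
            .collect(Collectors.toMap(
                Map.Entry::getKey,
                e -> endOffsets.get(e.getKey()) - e.getValue().offset()));
    }
}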

So the next actions, IMO:

  • Move a lot of things to async ajax calls, especially offsets / log dir size and consumer groups on the topic page.
  • Move the list of topic lag to async on the consumer group page.
  • Let people customize the number of parallel threads used to complete the CompletableFutures on a page.

I've tried for a long time to avoid this, but since the consumer offset query can lead to a timeout in a lot of cases, there is no way IMO to avoid it.

Since this is a big piece of work, I'll try to review the latest additions to your PR next week to see whether they contain improvements that could be useful in the short term, before the big refactoring to move everything to async.

Thanks for your work and keep in touch next week 👍

@tchiotludo
Owner

Hey @here!
The new release, version 0.10.0, includes the first easy part:

Let people customize the number of parallel threads used to complete the CompletableFutures on a page.

https://github.com/tchiotludo/kafkahq#pagination with the option kafkahq.pagination.threads.

Tell me what you think and whether it helps in your case.

@ankon
Contributor Author

ankon commented Nov 19, 2019

I'll try to rebase the branch onto that.

I'm a bit suspicious here, as "more threads, more better" hasn't always been true in my experience :)

FWIW, what I did now to make KafkaHQ somewhat usable: I disabled the fetching of consumer groups completely (ea6ef98), as in my case this data isn't really needed.

@tchiotludo
Owner

I've seen a better experience on my side.
But it always depends on the Kafka server response time.

I've seen your last modification; it seems to be exactly what I stated in my previous comment: moving to async will do the trick. I just need to find the time.

@jorgheymans
Contributor

jorgheymans commented Nov 21, 2019

Interested in this: our cluster has 20 topics and it takes 10 seconds to load the topic list page (pagination = 10 records, so not even all of them are loaded).

@tchiotludo
Owner

@jorgheymans even with the latest version?

@jorgheymans
Contributor

@tchiotludo yes that's with the latest version, we just started using kafkahq :)

@jorgheymans
Contributor

FWIW, we also don't need consumer group fetching; group lag is tracked using Prometheus metrics. So what I was thinking of as possibilities:

  • Have a config toggle that allows sites to disable it if they have no use for it. This should be easy and not impact existing users.
  • Instead of showing consumer groups for all topics by default, have a button or something in that row cell that, upon click, fetches the group info only for that topic. I don't think there is a need for a global overview of all consumer groups for all topics, and if there is, it should be made a different screen altogether instead of packing it all into the topic list.

WDYT ?

@tchiotludo
Owner

To be honest, I really think the consumer group lag is a major feature of KafkaHQ. I personally use it every day.

This feature slows down the application for now, and the only solution is to go async, IMO.
It will help both kinds of people, those who need this information and those who don't, since the application will be very quick to display the main information.

This async feature will take more time, but it will resolve all the performance issues (#55), since consumer groups are not the only thing that takes time on the topic list (offsets also take time); removing consumer groups is only a quick patch that will not resolve all the performance issues.

Another option is the experimental cache: #55 (comment)
It will speed up the whole KafkaHQ experience, but it can lead to some bugs.
You can give it a try if you want; it's available in the latest version.

@jorgheymans
Contributor

OK, I understand it's an important feature. I was just looking for a quick way out, I guess :-)

Also, in the case where you have, let's say, 100 consumer groups for a topic, would it clutter the table, or is it graceful enough to handle this? Consumer groups accumulate over time, with applications / console consumers / kafkacat etc.

@tchiotludo
Owner

Good point. That has never happened for me, with like 4-5 consumer groups per topic & console-consumer groups being deleted. I will keep it in mind when I do the async work.

@ankon
Contributor Author

ankon commented Nov 25, 2019

Having consumer group information would certainly be nice for us ... to some extent.

We have a multi-tenant solution, with a couple of topics per tenant. Multiple services (>40!) work over these topics, and every instance of each of these services creates a unique consumer group. On our test environment I just checked: we have maybe 5 tenants, and right now almost 4000 consumer groups. Most of these groups are "dead", and their lag would be irrelevant (it would roughly indicate how long these groups have been dead :D).

Now, this may be fairly unique, and it may be "wrong" in terms of best practices. We're considering changes that would reduce the number of groups, but realistically the number will always be a lot higher than just 5 per topic. At that point the UI of KafkaHQ isn't able to show the groups anyway, except by cutting some off -- and there isn't really a good heuristic to work out which ones to cut: dropping the ones with a high lag would be fine for us, but it would obviously fail to show the ones whose high lag is not caused by the group being dead. :)

I think a switch to disable this actually works quite well in this case. I have now been able to use KafkaHQ for diagnosing some support questions, and I love it even without the consumer groups :)

@ftardif

ftardif commented Nov 25, 2019 via email

@tchiotludo
Owner

As I said, this is not the only thing that is slow; removing consumer groups will not save you from the other things (like topic offsets, which are the main reason for the slowness).

Async is the only solution:

  • removing all the offset fetching / consumer groups from the main page.
  • maybe allowing a cache only on this part (which is not needed in real time).

PRs are welcome guys 😄

@jorgheymans
Contributor

@tchiotludo I would provide a PR, but #153 :/

This fixes problems with the missing encoding of the path parameter, and at the same time avoids doing
string manipulation on things we already know the shape of ("URL path segments"). URI.js already does a good
job of combining these as needed.
@jorgheymans
Contributor

@ankon maybe ea6ef98 could be part of a separate PR so we can contribute to this change only?

@ankon
Contributor Author

ankon commented Nov 28, 2019

Sure, we can do that -- after all, that is kind of the point of this PR :)

See #159; note that I only compile-tested that version.

tchiotludo added a commit that referenced this pull request Dec 7, 2019
Reintroduce the RequestScope
Remove the CompletablePaged and try to fetch data per list
Remove acls from models and delegate to controller
Update to micronaut 1.27

use a lot of PR #134 for inspiration
close #156
relate to #55
tchiotludo added a commit that referenced this pull request Dec 8, 2019
Reintroduce the RequestScope
Remove the CompletablePaged and try to fetch data per list
Remove acls from models and delegate to controller
Update to micronaut 1.27

use a lot of PR #134 for inspiration
close #156
relate to #55
@tchiotludo
Owner

Hey @ankon, I just pushed a new dev version that will, I hope, fix a lot of the performance issues.

So first, sorry: I started with a completely different approach than yours, and then saw some good things in this PR, but when I tried to merge it there were too many conflicts, so I "stole" some parts of this PR; really sorry for that.

Maybe you can try the dev version to see if it fits your needs, and maybe open a new PR with any additions I missed?

I know this was hard work for you, so it's as you wish, and sorry to have stolen some parts without merging them 🙏

@ankon
Contributor Author

ankon commented Dec 9, 2019

Thanks for your work!

So first, sorry: I started with a completely different approach than yours, and then saw some good things in this PR, but when I tried to merge it there were too many conflicts, so I "stole" some parts of this PR; really sorry for that.

No problem at all: This PR wasn't really meant for merging as-is, but more for discussion -- and I think we did have that :)

Maybe you can try the dev version to see if it fits your needs, and maybe open a new PR with any additions I missed?

I actually just deployed it, as I wanted to try an "unmodified version" to see whether we would still need additional patches and whether it would make sense to start rebasing/merging. Turns out we don't: version 10ba9ad seems to work just fine in our environment. We did still keep the consumer groups disabled, because we don't need them anyway (see above).

So I think maybe best could be to close this PR, and if there is new/more/etc things open a new one. WDYT?

@tchiotludo
Owner

Thanks @ankon, really glad to hear that this version is OK for you :)
It means that we are on the right path and the first part of the optimization is working.
I'll release a new version really soon 🎉

I need to take some more time to go async for all time-consuming operations (allowing the removal of the consumer group columns), but that will take longer.
I'll wait for more community feedback on the current version.

So, like you said, I will close this one; feel free to open a new one if you have more optimizations in mind, thanks 👍

tchiotludo closed this Dec 9, 2019
ghost pushed a commit that referenced this pull request Jun 19, 2020
* implemented sse search on topic details

* changed base url