
2.3.x increases latency for consumed messages #903

Closed
myazinn opened this issue Jun 5, 2023 · 13 comments

@myazinn
Contributor

myazinn commented Jun 5, 2023

This issue is basically a copy of this comment, as discussed in Discord.

We monitor topic lags (the difference between a message's timestamp and the moment our service starts processing it). Here's what we've got with several ZIO and zio-kafka versions (different colours are different topic-partitions):
[image: topic lag per topic-partition]
Left part (small lags): ZIO 2.0.11 + zio-kafka 2.1.3. Middle part (huge lags for some topic-partitions): ZIO 2.0.14 + zio-kafka 2.3.1. Right side (small lags again, after the red tooltip): ZIO 2.0.14 + zio-kafka 2.2.

2.3.1 seems to also consume more CPU (different colours - different pods)
[image: CPU usage per pod]

Memory consumption is roughly the same, but 2.3.1 puts much more pressure on GC (ZGC) for some reason
[image: GC pressure]

@guizmaii
Member

guizmaii commented Jun 6, 2023

Thanks @myazinn!

We made a change (here; not yet released, it will be released in 2.3.2) that lets you disable an internal optimisation which I suspect is the origin of your issue.

Would it be possible for you to test whether it fixes your issue, please?
For that you'll need to use a snapshot version of zio-kafka.

To do that, you'll have to:

  1. add this to your build.sbt:
     ThisBuild / resolvers ++= Resolver.sonatypeOssRepos("snapshots")
  2. change the zio-kafka version to 2.3.1+28-c25aaa34-SNAPSHOT
  3. disable the "optimistic resume" internal optimisation by adding this to your ConsumerSettings (a fuller sketch follows below):
     val settings =
       ConsumerSettings(
         ...
         enableOptimisticResume = false
       )

I'm interested in seeing what latencies your graphs register when this optimisation is disabled.
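
Putting the three steps together, a minimal sketch (the broker address and group id below are placeholders, and it assumes the snapshot's ConsumerSettings case class exposes the enableOptimisticResume field described above):

// build.sbt (sketch)
ThisBuild / resolvers ++= Resolver.sonatypeOssRepos("snapshots")
libraryDependencies += "dev.zio" %% "zio-kafka" % "2.3.1+28-c25aaa34-SNAPSHOT"

// application code (sketch)
import zio.kafka.consumer.ConsumerSettings

val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092")) // placeholder broker address
    .withGroupId("my-group")               // placeholder group id
    .copy(enableOptimisticResume = false)  // disable the "optimistic resume" optimisation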

@myazinn
Contributor Author

myazinn commented Jun 6, 2023

Hi @guizmaii
You are right, enableOptimisticResume = false does solve a problem.
[image: topic lag across zio-kafka versions]
Left part: zio-kafka 2.2.
Middle part (after a small spike at 20:22): zio-kafka 2.3.1+28-c25aaa34-SNAPSHOT with enableOptimisticResume = false.
Right part (huge lags after 20:35): zio-kafka 2.3.1+28-c25aaa34-SNAPSHOT with enableOptimisticResume = true. Everything else was the same as in the middle part; switching the flag was the only change.
Lags kept growing, so I rolled the change back. The left and middle parts (2.2 and 2.3.1+28-c25aaa34-SNAPSHOT with enableOptimisticResume = false, respectively) had the same lags.

@guizmaii
Member

guizmaii commented Jun 7, 2023

Thank you so much for this @myazinn! Super helpful! ❤️

You are right, enableOptimisticResume = false does solve a problem.

When you say "does solve a problem", does that mean you have more problems with zio-kafka?

@erikvanoosten Did you expect such an issue with the optimistic resume optimisation?

@myazinn
Contributor Author

myazinn commented Jun 7, 2023

When you say "does solve a problem", does that mean you have more problems with zio-kafka?

No, it's just my bad English. It does solve THE problem 😄

@erikvanoosten
Collaborator

erikvanoosten commented Jun 7, 2023

@erikvanoosten Did you expect such an issue with the optimistic resume optimisation?

I already knew that the Java client can stick to a partition for a very long time. However, I had never realized that very efficient consumers could be affected by it once we do optimistic resume.

@erikvanoosten
Collaborator

@myazinn What are the consumer properties? It could be that tweaking the consumer properties can help as well.

@erikvanoosten
Collaborator

See also discussion here #844 (comment).

@myazinn
Contributor Author

myazinn commented Jun 12, 2023

@erikvanoosten
Apologies for making you wait. Here are my settings:

  consumer {
    enable.auto.commit = false
    receive.buffer.bytes = 1048576
    group.id = "my-group"
    send.buffer.bytes = 1048576
    max.partition.fetch.bytes = 1048576
    fetch.max.bytes = 52428800
    max.poll.interval.ms = 300000
    max.poll.records = 500
    fetch.max.wait.ms = 100
    default.api.timeout.ms = 60000
    auto.offset.reset = "latest"
    partition.assignment.strategy = "org.apache.kafka.clients.consumer.RoundRobinAssignor"
  }
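
For reference, a sketch of how a block like the one above maps onto zio-kafka's ConsumerSettings builder (the broker address is a placeholder, and loading the values from HOCON is left out):

import zio.kafka.consumer.ConsumerSettings

// Sketch: mirrors the HOCON block above; the broker address is a placeholder.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092"))
    .withGroupId("my-group")
    .withProperties(
      "enable.auto.commit"            -> "false",
      "receive.buffer.bytes"          -> "1048576",
      "send.buffer.bytes"             -> "1048576",
      "max.partition.fetch.bytes"     -> "1048576",
      "fetch.max.bytes"               -> "52428800",
      "max.poll.interval.ms"          -> "300000",
      "max.poll.records"              -> "500",
      "fetch.max.wait.ms"             -> "100",
      "default.api.timeout.ms"        -> "60000",
      "auto.offset.reset"             -> "latest",
      "partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RoundRobinAssignor"
    )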

@erikvanoosten
Collaborator

erikvanoosten commented Jun 24, 2023

receive.buffer.bytes = 1048576
That is a gigantic receive buffer: 1GiB! The default is just 32KiB.

send.buffer.bytes = 1048576
Also, much larger than the default (128KiB).

Same for max.partition.fetch.bytes and fetch.max.bytes. These values are all way too high.

Especially max.partition.fetch.bytes of 50 GiB (no, not a typo) is a good candidate for causing high latency. The kafka consumer basically tries to download the entire topic before going to the next.

I really recommend you stay a lot closer to the default values as documented on https://kafka.apache.org/documentation/#consumerconfigs.

@guizmaii
Member

guizmaii commented Jun 24, 2023

I'm closing the issue as I think we've fixed it.

@myazinn Feel free to ask us to reopen it if you feel we didn't address your issue.
Thanks for your report and help 🙂

@myazinn
Contributor Author

myazinn commented Jun 27, 2023

Hi @erikvanoosten. Thank you for the reply :) Just to clarify a few things:

receive.buffer.bytes = 1048576
That is a gigantic receive buffer: 1GiB! The default is just 32KiB.

But it's not 1 GiB 🤔 1048576 bytes = 1_048_576 bytes = 1 MiB.

Especially max.partition.fetch.bytes of 50 GiB (no, not a typo)

I believe you meant fetch.max.bytes, so it is a typo 😄. Anyway, 52428800 bytes = 52_428_800 bytes = 50 MiB. It is larger than the default value, but it's not that large.
I tried using default values for almost everything and it actually made latency worse. I believe I could play with them and find a reasonable balance, but that would take a lot of time, and the current values work fine with prefetch disabled.
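
For completeness, a quick arithmetic check of the two values in question:

// Unit check for the values discussed above.
val MiB = 1024 * 1024
assert(1048576  == 1 * MiB)  // receive.buffer.bytes: 1 MiB, not 1 GiB
assert(52428800 == 50 * MiB) // fetch.max.bytes: 50 MiB, not 50 GiB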

@erikvanoosten
Collaborator

I believe you meant fetch.max.bytes

You are right @myazinn, it must have been too late when I wrote that. My apologies.

50 MiB is still a lot for fetch.max.bytes though. I still recommend you stay close to the default value to reduce latency.

@myazinn
Contributor Author

myazinn commented Jun 27, 2023

No problem, thanks for the advice :) Tomorrow I'll try reducing this parameter and will check how it goes.
