
2.3.x increases latency for consumed messages #903

Closed
myazinn opened this issue Jun 5, 2023 · 13 comments

@myazinn
Contributor

myazinn commented Jun 5, 2023

This issue is basically a copy of this comment, as discussed in Discord.

We monitor topic lags (the difference between a message's timestamp and the moment our service starts processing it). Here's what we've got with several ZIO and zio-kafka versions (different colours are different topic-partitions):
[image: topic lag per topic-partition]
Left part (small lags): ZIO 2.0.11 + zio-kafka 2.1.3. Middle part (huge lags for some topic-partitions): ZIO 2.0.14 + zio-kafka 2.3.1. Right side (small lags again, after the red tooltip): ZIO 2.0.14 + zio-kafka 2.2.

2.3.1 seems to also consume more CPU (different colours - different pods)
[image: CPU usage per pod]

Memory consumption is roughly the same, but 2.3.1 puts much more pressure on GC (ZGC) for some reason
[image: GC pressure]

@guizmaii
Member

guizmaii commented Jun 6, 2023

Thanks @myazinn!

We made a change (here; not yet released, it will be released in 2.3.2) that lets you disable an internal optimisation which I suspect is the origin of your issue.

Would it be possible for you to test whether it fixes your issue, please?
For that you'll need to use a snapshot version of zio-kafka.

To do that, you'll have to:

  1. add this to your build.sbt:
     ThisBuild / resolvers ++= Resolver.sonatypeOssRepos("snapshots")
  2. change the zio-kafka version to 2.3.1+28-c25aaa34-SNAPSHOT
  3. disable the "optimistic resume" internal optimisation by adding this to your ConsumerSettings (a fuller sketch follows below):
     val settings =
       ConsumerSettings(
         ...
         enableOptimisticResume = false
       )

I'm interested in seeing what latencies your graphs register when this optimisation is disabled.
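
Putting the three steps together, a minimal sketch (the broker address and group id below are placeholders, and it assumes the snapshot's ConsumerSettings case class exposes the enableOptimisticResume field described above):

// build.sbt (sketch)
ThisBuild / resolvers ++= Resolver.sonatypeOssRepos("snapshots")
libraryDependencies += "dev.zio" %% "zio-kafka" % "2.3.1+28-c25aaa34-SNAPSHOT"

// application code (sketch)
import zio.kafka.consumer.ConsumerSettings

val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092")) // placeholder broker address
    .withGroupId("my-group")               // placeholder group id
    .copy(enableOptimisticResume = false)  // disable the "optimistic resume" optimisation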

@myazinn
Contributor Author

myazinn commented Jun 6, 2023

Hi @guizmaii
You are right, enableOptimisticResume = false does solve a problem.
[image: topic lag across zio-kafka versions]
Left part: zio-kafka 2.2.
Middle part (after a small spike at 20:22): zio-kafka 2.3.1+28-c25aaa34-SNAPSHOT with enableOptimisticResume = false.
Right part (huge lags after 20:35): zio-kafka 2.3.1+28-c25aaa34-SNAPSHOT with enableOptimisticResume = true. Everything else was the same as in the middle part; switching the flag was the only change.
Lags kept growing, so I rolled the change back. The left and middle parts (2.2 and 2.3.1+28-c25aaa34-SNAPSHOT with enableOptimisticResume = false, respectively) had the same lags.

@guizmaii
Member

guizmaii commented Jun 7, 2023

Thank you so much for this @myazinn! Super helpful! ❤️

You are right, enableOptimisticResume = false does solve a problem.

When you say "does solve a problem", does that mean you have more problems with zio-kafka?

@erikvanoosten Did you expect such an issue with the optimistic resume optimisation?

@myazinn
Contributor Author

myazinn commented Jun 7, 2023

When you say "does solve a problem", does that mean you have more problems with zio-kafka?

No, it's just my bad English. It does solve THE problem 😄

@erikvanoosten
Collaborator

erikvanoosten commented Jun 7, 2023

@erikvanoosten Did you expect such an issue with the optimistic resume optimisation?

I already knew that the Java client can stick to a partition for a very long time. However, I had never realized that very efficient consumers could be affected by it once we do optimistic resume.

@erikvanoosten
Collaborator

@myazinn What are the consumer properties? It could be that tweaking the consumer properties can help as well.

@erikvanoosten
Collaborator

See also discussion here #844 (comment).

@myazinn
Contributor Author

myazinn commented Jun 12, 2023

@erikvanoosten
Apologies for making you wait. Here are my settings:

  consumer {
    enable.auto.commit = false
    receive.buffer.bytes = 1048576
    group.id = "my-group"
    send.buffer.bytes = 1048576
    max.partition.fetch.bytes = 1048576
    fetch.max.bytes = 52428800
    max.poll.interval.ms = 300000
    max.poll.records = 500
    fetch.max.wait.ms = 100
    default.api.timeout.ms = 60000
    auto.offset.reset = "latest"
    partition.assignment.strategy = "org.apache.kafka.clients.consumer.RoundRobinAssignor"
  }
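
For reference, a sketch of how a block like the one above maps onto zio-kafka's ConsumerSettings builder (the broker address is a placeholder, and loading the values from HOCON is left out):

import zio.kafka.consumer.ConsumerSettings

// Sketch: mirrors the HOCON block above; the broker address is a placeholder.
val settings: ConsumerSettings =
  ConsumerSettings(List("localhost:9092"))
    .withGroupId("my-group")
    .withProperties(
      "enable.auto.commit"            -> "false",
      "receive.buffer.bytes"          -> "1048576",
      "send.buffer.bytes"             -> "1048576",
      "max.partition.fetch.bytes"     -> "1048576",
      "fetch.max.bytes"               -> "52428800",
      "max.poll.interval.ms"          -> "300000",
      "max.poll.records"              -> "500",
      "fetch.max.wait.ms"             -> "100",
      "default.api.timeout.ms"        -> "60000",
      "auto.offset.reset"             -> "latest",
      "partition.assignment.strategy" -> "org.apache.kafka.clients.consumer.RoundRobinAssignor"
    )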

@erikvanoosten
Collaborator

erikvanoosten commented Jun 24, 2023

receive.buffer.bytes = 1048576
That is a gigantic receive buffer: 1GiB! The default is just 32KiB.

send.buffer.bytes = 1048576
Also, much larger than the default (128KiB).

Same for max.partition.fetch.bytes and fetch.max.bytes. These values are all way too high.

Especially max.partition.fetch.bytes of 50 GiB (no, not a typo) is a good candidate for causing high latency. The kafka consumer basically tries to download the entire topic before going to the next.

I really recommend you stay a lot closer to the default values as documented on https://kafka.apache.org/documentation/#consumerconfigs.

@guizmaii
Member

guizmaii commented Jun 24, 2023

I'm closing the issue as I think we've fixed it.

@myazinn Feel free to ask us to reopen it if you feel we didn't address your issue.
Thanks for your report and help 🙂

@myazinn
Contributor Author

myazinn commented Jun 27, 2023

Hi @erikvanoosten. Thank you for the reply :) Just to clarify a few things:

receive.buffer.bytes = 1048576
That is a gigantic receive buffer: 1GiB! The default is just 32KiB.

But it's not 1 GiB 🤔 1048576 bytes = 1_048_576 bytes = 1 MiB.

Especially max.partition.fetch.bytes of 50 GiB (no, not a typo)

I believe you meant fetch.max.bytes, so it is a typo 😄. Anyway, 52428800 bytes = 52_428_800 bytes = 50 MiB. It is larger than the default value, but it's not that large.
I tried using default values for almost everything and it actually made latency worse. I believe I could play with them and find a reasonable balance, but that would take a lot of time, and the current values work fine with prefetch disabled.
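
For completeness, a quick arithmetic check of the two values in question:

// Unit check for the values discussed above.
val MiB = 1024 * 1024
assert(1048576  == 1 * MiB)  // receive.buffer.bytes: 1 MiB, not 1 GiB
assert(52428800 == 50 * MiB) // fetch.max.bytes: 50 MiB, not 50 GiB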

@erikvanoosten
Collaborator

I believe you meant fetch.max.bytes

You are right @myazinn, it must have been too late when I wrote that. My apologies.

50 MiB is still a lot for fetch.max.bytes though. I still recommend you stay close to the default value to reduce latency.

@myazinn
Contributor Author

myazinn commented Jun 27, 2023

No problem, thanks for the advice :) Tomorrow I'll try reducing this parameter and will check how it goes.
