Change the kafka_latency_fetch_latency metric #17720

Merged: 2 commits merged into redpanda-data:dev on Apr 22, 2024

Conversation

ballard26 (Contributor) commented Apr 10, 2024

The `kafka_latency_fetch_latency` metric originally measured the time it took to complete one fetch poll. A fetch poll would create a fetch plan and then execute it in parallel on every shard. On a given shard, `fetch_ntps_in_parallel` accounted for the majority of the plan's execution time.

Since fetches are no longer implemented by polling, there is no exactly equivalent measurement that can be assigned to the metric.

This commit instead records, on every shard, the duration of the first call to `fetch_ntps_in_parallel`. This first call takes as long as it would have during a fetch poll, so the resulting measurement should be close to the duration of a fetch poll.
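
For illustration only, a minimal sketch of the "time only the first call" idea in plain C++. This is not Redpanda's actual (Seastar-based, asynchronous) code; `shard_fetch_state`, `record_fetch_latency`, and `run_fetch_step` are hypothetical names invented for this example:

```cpp
#include <chrono>
#include <functional>
#include <iostream>

// Hypothetical per-shard state tracking whether the first
// fetch_ntps_in_parallel call of the current fetch was already recorded.
struct shard_fetch_state {
    bool first_call_recorded = false;
};

// Stand-in for the real metrics sink (a histogram in practice).
void record_fetch_latency(std::chrono::microseconds d) {
    std::cout << "kafka_latency_fetch_latency sample: " << d.count() << " us\n";
}

// Time the call, but only record the first invocation per fetch: that
// first call takes roughly as long as a full fetch poll used to.
void run_fetch_step(shard_fetch_state& st,
                    const std::function<void()>& fetch_ntps_in_parallel) {
    const auto start = std::chrono::steady_clock::now();
    fetch_ntps_in_parallel();
    if (!st.first_call_recorded) {
        st.first_call_recorded = true;
        record_fetch_latency(
            std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start));
    }
}

int main() {
    shard_fetch_state st;
    run_fetch_step(st, [] { /* initial fetch work: recorded */ });
    run_fetch_step(st, [] { /* later re-poll: timed but not recorded */ });
}
```

In the real implementation this happens asynchronously per shard; the sketch only conveys where the single sample per fetch would be taken.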

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • The `kafka_latency_fetch_latency` metric now measures the time the first call to `fetch_ntps_in_parallel` takes on each shard.

StephanDollberg (Member) left a comment:

I guess the latency will overall still go up because we are not taking all the empty polls into account now?

ballard26 (Contributor, Author) replied:

> I guess the latency will overall still go up because we are not taking all the empty polls into account now?

Good point. I guess the question is whether subsequent polls skew the distribution towards empty polls or not. My guess is that they do, since empirically we tend to poll more than once on fetches that required polling.

Not sure how to get around this though. I could measure every call to `fetch_ntps_in_parallel`, but I think that would make the metric less useful.
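
To make the skew concrete, a small stand-alone illustration (the durations are made up, not measurements from this PR): if every `fetch_ntps_in_parallel` call were recorded, the fast empty re-polls would dominate the percentiles, whereas recording only the first call keeps the sample representative of the initial fetch work:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Median of a sample set (durations in milliseconds).
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];
}

int main() {
    // One hypothetical fetch: the first call finds data (5 ms), then
    // several empty re-polls (~0.1 ms each) before the fetch completes.
    std::vector<double> every_call = {5.0, 0.1, 0.1, 0.1, 0.1};
    std::vector<double> first_call_only = {5.0};

    std::cout << "median, every call recorded:      " << median(every_call) << " ms\n";
    std::cout << "median, first call only recorded: " << median(first_call_only) << " ms\n";
    // Recording every call drags the median down to the empty-poll cost.
}
```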

StephanDollberg (Member) commented Apr 12, 2024

Yeah, it's tricky. Maybe have a quick look at a classic high-throughput OMB scenario and also a high-polling one and see how it changes?

travisdowns (Member) left a comment:

The code looks fine. If we can't find something reasonable, we can just delete this metric (though we will have to update the dashboards).

Whichever way we go, please update the metric's description in the source to reflect the new reality.

ballard26 (Contributor, Author) commented:

I measured this new metric's value against the old one in the graphs below. The drop in both graphs marks the point where I changed the cluster config to use the non-polling fetch impl instead of the polling fetch impl. The non-polling fetch impl uses the new measure for the metric introduced in this PR, while the polling one uses the existing measure.

An OMB test was running with a fairly common workload: 50 MB/s in/out, 10 producers/consumers, on a 3x i3en.xlarge cluster.

[graphs: kafka_latency_fetch_latency before and after switching from the polling to the non-polling fetch impl]

piyushredpanda added this to the 24.1.1-GA milestone on Apr 21, 2024
ballard26 merged commit 1b0f186 into redpanda-data:dev on Apr 22, 2024
18 of 19 checks passed
vbotbuildovich (Collaborator) commented:

/backport v23.3.x
