Change the `kafka_latency_fetch_latency` metric #17720
Conversation
The `kafka_latency_fetch_latency` metric originally measured the time it took to complete one fetch poll. A fetch poll would create a fetch plan, then execute it in parallel on every shard; on a given shard, `fetch_ntps_in_parallel` accounted for the majority of the plan's execution time. Since fetches are no longer implemented by polling, there is no exactly equivalent measurement to assign to the metric. This commit instead records, on every shard, the duration of the first call to `fetch_ntps_in_parallel`. This first call takes as long as it would have during a fetch poll, so the resulting measurement should be close to the duration of a fetch poll.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47588#018ec5e6-fdcd-4cbe-b9d3-590b33dc235f
I guess the latency will overall still go up because we are not taking all the empty polls into account now?
Good point, I guess the question is whether subsequent polls skew the distribution towards empty polls or not. My guess would be that they do, since empirically we tend to poll more than once on fetches that required polling. Not sure how to get around this though. I could measure every call to
Yeah it's tricky, maybe have a quick look at a classic high-throughput OMB scenario and also a high-polling one and see how it changes?
The code looks fine. If we can't find something reasonable we can just delete this metric (though we will have to update the dashboards).
Whichever way we go, please update the description of the metric in the source to reflect the new reality.
Measured this new metric's value against the old one in the graphs below. The drop in both graphs shows the point where I changed the cluster config to use the non-polling fetch impl instead of the polling fetch impl. The non-polling fetch impl uses the new measure for the metric introduced in this PR, while the polling one uses the existing measure. An OMB test was running with a pretty common workload: 50MB/s in/out, 10 producers/consumers, on a 3x i3en.xlarge cluster.
/backport v23.3.x
Backports Required
Release Notes
Improvements
Changed the `kafka_latency_fetch_latency` metric to measure the time the first `fetch_ntps_in_parallel` call takes.