Change the kafka_latency_fetch_latency metric #17720

Merged: 2 commits merged into redpanda-data:dev on Apr 22, 2024

Conversation

ballard26 (Contributor) commented Apr 10, 2024

The `kafka_latency_fetch_latency` metric originally measured the time it took to complete one fetch poll. A fetch poll would create a fetch plan and then execute it in parallel on every shard. On a given shard, `fetch_ntps_in_parallel` accounted for the majority of the plan's execution time.

Since fetches are no longer implemented by polling, there is no exactly equivalent measurement that can be assigned to the metric.

This commit instead records, on every shard, the duration of the first call to `fetch_ntps_in_parallel`. This first call takes as long as it would have during a fetch poll, so the resulting measurement should be close to the duration of a fetch poll.
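
For illustration only, a minimal sketch of the "time only the first call" idea in plain C++. This is not Redpanda's actual (Seastar-based, asynchronous) code; `shard_fetch_state`, `record_fetch_latency`, and `run_fetch_step` are hypothetical names invented for this example:

```cpp
#include <chrono>
#include <functional>
#include <iostream>

// Hypothetical per-shard state tracking whether the first
// fetch_ntps_in_parallel call of the current fetch was already recorded.
struct shard_fetch_state {
    bool first_call_recorded = false;
};

// Stand-in for the real metrics sink (a histogram in practice).
void record_fetch_latency(std::chrono::microseconds d) {
    std::cout << "kafka_latency_fetch_latency sample: " << d.count() << " us\n";
}

// Time the call, but only record the first invocation per fetch: that
// first call takes roughly as long as a full fetch poll used to.
void run_fetch_step(shard_fetch_state& st,
                    const std::function<void()>& fetch_ntps_in_parallel) {
    const auto start = std::chrono::steady_clock::now();
    fetch_ntps_in_parallel();
    if (!st.first_call_recorded) {
        st.first_call_recorded = true;
        record_fetch_latency(
            std::chrono::duration_cast<std::chrono::microseconds>(
                std::chrono::steady_clock::now() - start));
    }
}

int main() {
    shard_fetch_state st;
    run_fetch_step(st, [] { /* initial fetch work: recorded */ });
    run_fetch_step(st, [] { /* later re-poll: timed but not recorded */ });
}
```

In the real implementation this happens asynchronously per shard; the sketch only conveys where the single sample per fetch would be taken.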

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Improvements

  • The `kafka_latency_fetch_latency` metric now measures the time the first call to `fetch_ntps_in_parallel` takes on each shard.

StephanDollberg (Member) left a comment:

I guess the latency will overall still go up because we are not taking all the empty polls into account now?

ballard26 (Contributor, Author) replied:

> I guess the latency will overall still go up because we are not taking all the empty polls into account now?

Good point. I guess the question is whether subsequent polls skew the distribution towards empty polls or not. My guess is that they do, since empirically we tend to poll more than once on fetches that required polling.

Not sure how to get around this though. I could measure every call to `fetch_ntps_in_parallel`, but I think that would make the metric less useful.
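
To make the skew concrete, a small stand-alone illustration (the durations are made up, not measurements from this PR): if every `fetch_ntps_in_parallel` call were recorded, the fast empty re-polls would dominate the percentiles, whereas recording only the first call keeps the sample representative of the initial fetch work:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Median of a sample set (durations in milliseconds).
double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];
}

int main() {
    // One hypothetical fetch: the first call finds data (5 ms), then
    // several empty re-polls (~0.1 ms each) before the fetch completes.
    std::vector<double> every_call = {5.0, 0.1, 0.1, 0.1, 0.1};
    std::vector<double> first_call_only = {5.0};

    std::cout << "median, every call recorded:      " << median(every_call) << " ms\n";
    std::cout << "median, first call only recorded: " << median(first_call_only) << " ms\n";
    // Recording every call drags the median down to the empty-poll cost.
}
```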

StephanDollberg (Member) commented Apr 12, 2024

Yeah, it's tricky. Maybe have a quick look at a classic high-throughput OMB scenario and also a high-polling one and see how it changes?

travisdowns (Member) left a comment:

The code looks fine. If we can't find something reasonable, we can just delete this metric (though we will have to update the dashboards).

Whichever way we go, please update the metric's description in the source to reflect the new reality.

ballard26 (Contributor, Author) commented:

I measured this new metric's value against the old one in the graphs below. The drop in both graphs marks the point where I changed the cluster config to use the non-polling fetch impl instead of the polling fetch impl. The non-polling fetch impl uses the new measure for the metric introduced in this PR, while the polling one uses the existing measure.

An OMB test was running with a fairly common workload: 50 MB/s in/out, 10 producers/consumers, on a 3x i3en.xlarge cluster.

[graphs: kafka_latency_fetch_latency before and after switching from the polling to the non-polling fetch impl]

piyushredpanda added this to the 24.1.1-GA milestone on Apr 21, 2024
ballard26 merged commit 1b0f186 into redpanda-data:dev on Apr 22, 2024
18 of 19 checks passed
vbotbuildovich (Collaborator) commented:

/backport v23.3.x
