Collect consumer metrics #1143

Merged
merged 28 commits into from
Jan 21, 2024
@erikvanoosten (Collaborator) commented Dec 24, 2023

Collect metrics for the consumer using the zio-metrics API. This allows any zio-metrics backend to access and process the observed values.

By default no tags are added, but this can be configured via the new method ConsumerSettings.withMetricsLabels.

The following metrics are collected (kudos to @svroonland for most of the ideas):

  • Poll metrics: poll count (counter), number of records per poll (histogram), poll latency (histogram).
  • Partition stream metrics: queue size per partition (histogram), total queue size per consumer (histogram), number of polls for which records are idle in the queue (histogram).
  • The number of partitions that are paused/resumed (gauge).
  • Rebalance metrics: currently assigned partitions count (gauge), assigned/revoked/lost partitions (counter).
  • Commit metrics: commit count (counter), commit latency (histogram). These metrics measure commit requests issued through zio-kafka's API.
  • Aggregated commit metrics: commit count (counter), commit latency (histogram), commit size (number of offsets per commit) (histogram). After every poll zio-kafka combines all outstanding commit requests into 1 aggregated commit. These metrics are for the aggregated commits.
  • Number of entries in the command and commit queues (histogram).
  • Subscription state: 1 for subscribed, 0 for unsubscribed (gauge).
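
To illustrate the aggregated commit metrics, here is a minimal sketch of how several outstanding commit requests could be folded into one aggregated commit. It is not zio-kafka's actual code: the `TopicPartition` case class and the `aggregate` function are hypothetical stand-ins, and keeping the highest offset per partition is an assumption about what "combining" means. The size of the resulting map corresponds to the "commit size" metric (number of offsets per commit).

```scala
object CommitAggregationSketch {
  // Hypothetical stand-in for Kafka's TopicPartition, to keep the sketch self-contained.
  final case class TopicPartition(topic: String, partition: Int)

  // Folds all outstanding commit requests into one map.
  // Assumption: when a partition occurs more than once, the highest offset wins.
  def aggregate(requests: Seq[Map[TopicPartition, Long]]): Map[TopicPartition, Long] =
    requests.foldLeft(Map.empty[TopicPartition, Long]) { (acc, request) =>
      request.foldLeft(acc) { case (merged, (tp, offset)) =>
        merged.updated(tp, math.max(offset, merged.getOrElse(tp, offset)))
      }
    }
}
```

With two requests that both touch the same partition, the aggregated commit holds one offset for that partition, so the commit size is the number of distinct partitions.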

Like the zio-metrics API, we follow Prometheus conventions. This means that:

  • durations are expressed in seconds,
  • counters can only increase,
  • metric names use snake_case and end in the unit where possible.
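
As a small illustration of the first and last convention, the sketch below converts a measured duration to fractional seconds before it would be observed. The helper `toSeconds` and the example metric name are hypothetical and not taken from this PR.

```scala
import java.time.Duration

object SecondsSketch {
  // Hypothetical metric name that follows the convention: snake_case, ending in the unit.
  val examplePollLatencyName: String = "example_poll_latency_seconds"

  // Converts a duration to fractional seconds, the unit Prometheus expects for durations.
  def toSeconds(d: Duration): Double = d.toNanos.toDouble / 1e9
}
```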

The histograms each use 10 buckets. To reach a decent range while keeping sufficient accuracy at the low end, most bucket boundaries use an exponential series based on 𝑒.
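
A minimal sketch of such an exponential series, assuming each boundary is the previous one multiplied by 𝑒. The function name `exponentialBoundaries` and the start value used in the usage note are illustrative assumptions, not the values chosen in this PR.

```scala
object BucketSketch {
  // Returns `count` boundaries: start, start * e, start * e^2, ...
  // Assumption: a pure geometric series with ratio e; the PR may round or adjust values.
  def exponentialBoundaries(start: Double, count: Int): Vector[Double] =
    Vector.tabulate(count)(i => start * math.exp(i.toDouble))
}
```

For example, `BucketSketch.exponentialBoundaries(0.01, 10)` starts at 0.01 and grows by a factor of about 2.718 per step, so ten boundaries span roughly four orders of magnitude while staying fine-grained at the low end.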

The following metric ideas were also raised, but these are kept for future work:

  • histogram of number of records returned in a poll, tagged per topic-partition,
  • number of records ignored (see PollResult),
  • number of in-flight records (last fetched offset - last committed offset), per partition or perhaps just the raw fetched and committed offsets per partition.
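
The arithmetic behind the last idea can be sketched as follows. This only shows the subtraction; it does not address how the offsets would be tracked inside the consumer. The `inFlight` function and the use of plain partition numbers as keys are hypothetical simplifications.

```scala
object InFlightSketch {
  // In-flight records per partition: last fetched offset minus last committed offset.
  // Assumption: partitions without a known committed offset are left out of the result.
  def inFlight(lastFetched: Map[Int, Long], lastCommitted: Map[Int, Long]): Map[Int, Long] =
    lastFetched.collect {
      case (partition, fetched) if lastCommitted.contains(partition) =>
        partition -> (fetched - lastCommitted(partition))
    }
}
```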

- add metric `allQueueSizeHistogram`
- better metrics descriptions
- better bucket boundaries
- observe metrics in the background
@svroonland (Collaborator) left a comment

Nice, we should definitely have this!

I added suggestions for additional metrics. Let's think about the histogram dimension.

As last fallback use hash code of ConsumerSettings instead of random value.
@erikvanoosten (Collaborator, Author) commented Jan 6, 2024

@svroonland I would like to stop here and merge the PR. The following metrics from your list were not implemented:

  • histogram of number of records returned in a poll, tagged per topic-partition
    Tagging this per partition generates too many metrics. Therefore, this is only available as a histogram for all partitions combined.
  • number of records ignored (see PollResult)
    We only ignore records just before a seek when doing manual offsets. I am not sure this is an interesting metric.
  • number of in-flight records (last fetched offset - last committed offset), per partition or perhaps just the raw fetched and committed offsets per partition
    An interesting idea, but the first variant is not easy to implement and the second generates too many metrics.

Your review and ideas are welcome and valued as always.

@svroonland (Collaborator) commented

Of course, we can always implement the other metrics as follow-ups; consider them more as ideas. Will have a look!

@svroonland (Collaborator) left a comment

Great additions. More things to consider in the comments.

Latency of aggregated commits no longer includes the lead time from commit request to start of commit.

Also: use unit in metric name as recommended by Prometheus guide.
maxPollInterval: Duration,
commitTimeout: Duration,

Good to have fewer parameters here.

@svroonland (Collaborator) left a comment

One tiny thing to think about, but otherwise looks great!

@erikvanoosten erikvanoosten merged commit 7b2093e into master Jan 21, 2024
14 checks passed
@erikvanoosten erikvanoosten deleted the consumer-metrics branch January 21, 2024 07:10