Add scylla_io_queue_consumption plots #2088

Closed
xemul opened this issue Oct 18, 2023 · 5 comments · Fixed by #2116

xemul commented Oct 18, 2023

There's a convenient metric called scylla_io_queue_consumption which shows the "fraction" of the maximum (configured with io-properties) disk bandwidth+iops consumed by an individual class. It's extremely useful when checking query latencies -- the metric shows if the delays are due to disk being full or there's still room to be utilized. It's also good to compare individual classes' consumption fractions with each other.

The reported metric is of the counter type. The resulting rate is in the [0.0; 1.0] range, so it's better to convert it into percent rather than reporting plain decimals. It's important to sum the numbers by the iogroup label, otherwise the reported number will make little sense. Summing or splitting on a per-shard basis doesn't make any sense either.

It's good to have the different IO classes (the class label) shown on the same plot, not on separate ones as is done for the per-class latency/queue-length/bandwidth/etc. plots. It's also good to have the ability to sum up the classes and show the consumption for the whole instance.

Example (per class for a single instance): [screenshot of a per-class consumption plot attached in the original issue]
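A minimal PromQL sketch of such a panel query, assuming a Prometheus data source and the class, iogroup, shard and mountpoint labels discussed here; the $node variable and the mountpoint value are illustrative, not taken from the issue:

```promql
# Per-class disk consumption for a single instance, as a percentage.
# rate() of the counter yields a [0.0; 1.0] fraction, multiplied by 100;
# the sum collapses the per-shard series inside each iogroup.
sum by (class, iogroup) (
  rate(scylla_io_queue_consumption{instance="$node", mountpoint="/var/lib/scylla"}[1m])
) * 100
```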

xemul added the enhancement (New feature or request) label on Oct 18, 2023
mykaul commented Oct 18, 2023

"are due to disk being full" - you mean disk queues?

xemul commented Oct 18, 2023

Well, not quite. Let me explain it differently.

If the IO latency is high, it's almost always coupled with long IO scheduler queues. The question then is -- why are the scheduler queues that long? Shouldn't the scheduler dispatch more requests into the disk instead of keeping them in its queues? This metric helps answer that.

If it is close to 1.0, the answer is -- the scheduler sees that the disk capacity is exhausted (the configured capacity; seastar doesn't estimate the capacity at runtime) and it has to hold requests in the queue for longer. In that case the next question to answer is -- why does scylla put that many requests into the IO queue.

If this metric shows low numbers, say -- all classes sum up to 50% per iogroup, then the scheduler indeed mustn't keep requests in its queues, and the scheduler itself contributes to the high latency. For example, we discovered scylladb/seastar#1641 as one of the problems some time ago; had we had this metric earlier, it would have been instantly obvious.
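A sketch of a query for this kind of check, summing all classes per iogroup (the $node variable is illustrative):

```promql
# Total consumption across all classes, per iogroup, in percent.
# Values near 100 mean the configured disk capacity is exhausted and the
# scheduler has to queue requests; much lower values combined with high
# latency point at the scheduler itself (e.g. scylladb/seastar#1641).
sum by (iogroup) (
  rate(scylla_io_queue_consumption{instance="$node"}[1m])
) * 100
```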

xemul commented Oct 18, 2023

BTW, comparing classes' consumption to each other is also of great interest. We expect that the query class with its 1k shares consumes all it wants and that the compaction class with its ~100 shares doesn't dominate. This metric will show whether our expectations are met.
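For illustration, a query restricted to two such classes could look like the sketch below; the class label values are assumptions, use whatever names the cluster actually reports:

```promql
# Compare selected classes' consumption side by side (percent), to verify
# that compaction (~100 shares) does not dominate the query class (1k shares).
sum by (class, iogroup) (
  rate(scylla_io_queue_consumption{instance="$node", class=~"query|compaction"}[1m])
) * 100
```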

amnonh commented Oct 19, 2023

@xemul in your example query you've ignored the mountpoint. I've checked a random cluster and saw that the mountpoint could be the scylla directory, or none. Does it make sense to just sum over all mountpoints like you did in your example?

What would be a good dashboard for this panel, Detailed? Advanced?

@xemul
Copy link
Author

xemul commented Oct 19, 2023

> @xemul in your example query you've ignored the mountpoint. I've checked a random cluster and saw that the mountpoint could be the scylla directory, or none. Does it make sense to just sum over all mountpoints like you did in your example?

It would work, because consumption for "none" is always zero :) But technically, this is incorrect and should be separated

> What would be a good dashboard for this panel, Detailed? Advanced?

Good question. On one hand it fits naturally into Advanced, next to the other IO-related (and sched-related) metrics. But on the other hand, the grouping for this metric is different. As I wrote, I'd like the ability to see the different classes on one plot next to each other. Ideally -- as a stacked plot, because these numbers are fractions of 100% and it's natural to stack them on top of each other. And the total consumption too. With a stacked plot it would come naturally; with a non-stacked one -- probably just another plot line with an "all" label. In any case -- it can be a single plot with a "disk consumption for foo mountpoint" title, which matches the "OS metrics" dashboard :D
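A sketch of the extra "all" series for a non-stacked plot, one line per mountpoint; it follows the note above that the "none" pseudo-mountpoint should be kept separate (the label values and the $node variable are assumptions):

```promql
# Total consumption summed over all classes, one series per mountpoint,
# in percent -- the "all" line accompanying the per-class plot.
# The "none" pseudo-mountpoint is excluded as discussed above.
sum by (iogroup, mountpoint) (
  rate(scylla_io_queue_consumption{instance="$node", mountpoint!="none"}[1m])
) * 100
```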

@amnonh amnonh added this to the Monitoring 4.6 milestone Nov 2, 2023
amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Nov 9, 2023
This patch adds panels for io queue consumption. It's part of a new row that combines all classes on the same graph.

Fixes scylladb#2088

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Nov 20, 2023
This patch adds panels for io queue consumption.
It will add a panel per iogroup.

Fixes scylladb#2088

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Nov 21, 2023
This patch adds panels for io queue consumption.

Fixes scylladb#2088

Signed-off-by: Amnon Heiman <amnon@scylladb.com>