Add scylla_io_queue_consumption plots #2088

Closed
xemul opened this issue Oct 18, 2023 · 5 comments · Fixed by #2116

xemul commented Oct 18, 2023

There's a convenient metric called scylla_io_queue_consumption which shows the "fraction" of the maximum (configured with io-properties) disk bandwidth+iops consumed by an individual class. It's extremely useful when checking query latencies -- the metric shows if the delays are due to disk being full or there's still room to be utilized. It's also good to compare individual classes' consumption fractions with each other.

The reported metric is of the counter type. The resulting rate is in the [0.0; 1.0] range, so it's better to convert it into percent rather than reporting plain decimals. It's important to sum the numbers by the iogroup label, otherwise the reported number will make little sense. Summing or splitting on a per-shard basis doesn't make any sense either.

It's good to have the different IO classes (the class label) shown on the same plot, not on separate ones as is done for the per-class latency/queue-length/bandwidth/etc. plots. It's also good to have the ability to sum up the classes and show the consumption for the whole instance.

Example (per class for a single instance): [screenshot of a per-class consumption plot attached in the original issue]
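A minimal PromQL sketch of such a panel query, assuming a Prometheus data source and the class, iogroup, shard and mountpoint labels discussed here; the $node variable and the mountpoint value are illustrative, not taken from the issue:

```promql
# Per-class disk consumption for a single instance, as a percentage.
# rate() of the counter yields a [0.0; 1.0] fraction, multiplied by 100;
# the sum collapses the per-shard series inside each iogroup.
sum by (class, iogroup) (
  rate(scylla_io_queue_consumption{instance="$node", mountpoint="/var/lib/scylla"}[1m])
) * 100
```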

xemul added the enhancement (New feature or request) label on Oct 18, 2023
mykaul commented Oct 18, 2023

"are due to disk being full" - you mean disk queues?

xemul commented Oct 18, 2023

Well, not quite. Let me explain it differently.

If the IO latency is high, it's almost always coupled with long IO scheduler queues. The question then is -- why are the scheduler queues that long? Shouldn't the scheduler dispatch more requests into the disk instead of keeping them in its queues? This metric helps answer that.

If it is close to 1.0, the answer is -- the scheduler sees that the disk capacity is exhausted (the configured capacity; seastar doesn't estimate the capacity at runtime) and it has to hold requests in the queue for longer. In that case the next question to answer is -- why does scylla put that many requests into the IO queue.

If this metric shows low numbers, say -- all classes sum up to 50% per iogroup, then the scheduler indeed mustn't keep requests in its queues, and the scheduler itself contributes to the high latency. For example, we discovered scylladb/seastar#1641 as one of the problems some time ago; had we had this metric earlier, it would have been instantly obvious.
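A sketch of a query for this kind of check, summing all classes per iogroup (the $node variable is illustrative):

```promql
# Total consumption across all classes, per iogroup, in percent.
# Values near 100 mean the configured disk capacity is exhausted and the
# scheduler has to queue requests; much lower values combined with high
# latency point at the scheduler itself (e.g. scylladb/seastar#1641).
sum by (iogroup) (
  rate(scylla_io_queue_consumption{instance="$node"}[1m])
) * 100
```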

xemul commented Oct 18, 2023

BTW, comparing classes' consumption to each other is also of great interest. We expect that the query class with its 1k shares consumes all it wants and that the compaction class with its ~100 shares doesn't dominate. This metric will show whether our expectations are met.
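For illustration, a query restricted to two such classes could look like the sketch below; the class label values are assumptions, use whatever names the cluster actually reports:

```promql
# Compare selected classes' consumption side by side (percent), to verify
# that compaction (~100 shares) does not dominate the query class (1k shares).
sum by (class, iogroup) (
  rate(scylla_io_queue_consumption{instance="$node", class=~"query|compaction"}[1m])
) * 100
```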

amnonh commented Oct 19, 2023

@xemul in your example query you've ignored the mountpoint. I've checked a random cluster and saw that the mountpoint could be the scylla directory, or none. Does it make sense to just sum over all mountpoints like you did in your example?

What would be a good dashboard for this panel, Detailed? Advanced?

@xemul
Copy link
Author

xemul commented Oct 19, 2023

> @xemul in your example query you've ignored the mountpoint. I've checked a random cluster and saw that the mountpoint could be the scylla directory, or none. Does it make sense to just sum over all mountpoints like you did in your example?

It would work, because consumption for "none" is always zero :) But technically, this is incorrect and should be separated

> What would be a good dashboard for this panel, Detailed? Advanced?

Good question. On one hand it fits naturally into Advanced, next to the other IO-related (and sched-related) metrics. But on the other hand, the grouping for this metric is different. As I wrote, I'd like the ability to see the different classes on one plot next to each other. Ideally -- as a stacked plot, because these numbers are fractions of 100% and it's natural to stack them on top of each other. And the total consumption too. With a stacked plot it would come naturally; with a non-stacked one -- probably just another plot line with an "all" label. In any case -- it can be a single plot with a "disk consumption for foo mountpoint" title, which matches the "OS metrics" dashboard :D
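A sketch of the extra "all" series for a non-stacked plot, one line per mountpoint; it follows the note above that the "none" pseudo-mountpoint should be kept separate (the label values and the $node variable are assumptions):

```promql
# Total consumption summed over all classes, one series per mountpoint,
# in percent -- the "all" line accompanying the per-class plot.
# The "none" pseudo-mountpoint is excluded as discussed above.
sum by (iogroup, mountpoint) (
  rate(scylla_io_queue_consumption{instance="$node", mountpoint!="none"}[1m])
) * 100
```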

@amnonh amnonh added this to the Monitoring 4.6 milestone Nov 2, 2023
amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Nov 9, 2023
This patch adds panels for io queue consumption. It's part of a new row that combines all classes on the same graph.

Fixes scylladb#2088

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Nov 20, 2023
This patch adds panels for io queue consumption.
It will add a panel per iogroup.

Fixes scylladb#2088

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
amnonh added a commit to amnonh/scylla-grafana-monitoring that referenced this issue Nov 21, 2023
This patch adds panels for io queue consumption.

Fixes scylladb#2088

Signed-off-by: Amnon Heiman <amnon@scylladb.com>