Add "IO tick violation time" metrics #1492

Open
wants to merge 3 commits into master

Conversation

xemul
Contributor

@xemul xemul commented Feb 10, 2023

The IO scheduler math is built around the assumption that the reactor polls it once per io-latency-goal or more frequently. If the reactor fails to do so, the disk may become underloaded and IO queues would accumulate over time. Another nasty side effect is that rare polls in seastar's non-preemptive model result in phantom jams in the IO request flow.

The new metric is supposed to help observe when this effect takes place.

refs: #1311

This value is known by the time disk_config is prepared and can be
assigned per-shard inside the allocate_io_queues callback.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
IO queues expect the reactor to poll them at least once every "IO latency
goal". If the reactor fails to do so, the disk queues may starve for
concurrency, which is not what the scheduler expects.

The new metric shows how much time the queues spent unpolled _above_ the
expected latency goal.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Since it's a global metric, it will come in a separate section

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
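
For context, a minimal sketch of how such a global counter could be exposed through seastar's metrics API (the group name, metric name, and the surrounding struct are illustrative only, not necessarily what this PR uses):

#include <seastar/core/metrics.hh>
#include <chrono>

namespace sm = seastar::metrics;

struct io_tick_metrics {
    // time accumulated whenever a poll arrives later than the IO latency goal
    std::chrono::duration<double> violation_time{0};
    sm::metric_groups metrics;

    void register_metrics() {
        metrics.add_group("reactor", {
            sm::make_counter("io_tick_violation_time_sec",
                [this] { return violation_time.count(); },
                sm::description("Total time IO queues spent unpolled past the IO latency goal")),
        });
    }
};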
@xemul
Contributor Author

xemul commented Mar 14, 2023

@avikivity , ping

// accumulate how long the poll interval overshot the IO latency goal
auto delta = now - _last_poll;
_last_poll = now;
if (delta > _r._io_latency_goal) {
    _r._io_latency_goal_violation_time += delta - _r._io_latency_goal;
}
Member

I don't think this is accurate. If we poll at exactly latency_goal intervals, the disk would be underloaded. We stuff the disk with enough requests for the calculated io_latency_goal, but some requests can't be executed in parallel, and we'll io-stall for the time it takes to deliver the last responses.

What we really need is a histogram of io poll times, but we know the problem with that.

Contributor Author

> If we poll at exactly latency_goal intervals, the disk would be underloaded

The scheduler puts 1.5 times more requests into the disk than it expects to complete within the latency goal, so it won't be underloaded.

> We stuff the disk with enough requests for the calculated io_latency_goal, but some requests can't be executed in parallel, and we'll io-stall for the time it takes to deliver the last responses.

Why stall? The scheduler puts requests in batches, but frees the resource as requests complete and puts more data back into the disk as the resource gets released.

> What we really need is a histogram of io poll times, but we know the problem with that.

Have Amnon's recent improvements to histograms made enough progress to reconsider that?
Another option -- a runtime switch to report (or not) the histograms. Like on-demand tracing on/off: I don't think we'll lose much information if we turn on the CPU poll rate and IO poll rate histograms for just 5 minutes.
Collecting them shouldn't be a big deal. Or is it?

Member

If the execution time of a request is (request_cost / disk_capacity_per_sec) (as defined by our model), then this is accurate. But in practice, the execution time is (request_cost / disk_capacity_per_sec) * some_random_variable, whose probability distribution is centered around 1 but also has outliers >> 1.

If the tail of the probability distribution has low overall probability, then I don't think it has much effect, but if it's high, it may stall the queue. For such disks we need a higher latency goal.

Member

I think for the long term we need a simulator to test the algorithm on.
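
To make that concern concrete, here is a toy Monte-Carlo sketch (not seastar code; the numbers and the lognormal tail are made up) of the model from the earlier comment, where a request's execution time is (request_cost / disk_capacity_per_sec) times a random factor centered around 1, and a dispatched batch is only done when its slowest request completes:

#include <algorithm>
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::lognormal_distribution<double> factor(0.0, 0.5);  // centered near 1, rare outliers >> 1
    const double latency_goal = 750e-6;                    // 750 us
    const double ideal_service = 250e-6;                   // request_cost / disk_capacity_per_sec
    const int batches = 100000;
    const int per_batch = 3;                               // scheduler over-dispatches ~1.5x
    int violations = 0;
    double violation_time = 0;
    for (int b = 0; b < batches; b++) {
        double batch_end = 0;
        for (int i = 0; i < per_batch; i++) {
            // the batch is done only when the slowest of its requests completes
            batch_end = std::max(batch_end, ideal_service * factor(rng));
        }
        if (batch_end > latency_goal) {
            violations++;
            violation_time += batch_end - latency_goal;
        }
    }
    std::printf("batches over the goal: %d of %d, total overshoot: %.3f ms\n",
                violations, batches, violation_time * 1e3);
}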

Member

Here's the probability distribution from some random disk:

@usecs[271581194, R]: 
[0]                    1 |                                                    |
[1]                    6 |                                                    |
[2, 4)                16 |                                                    |
[4, 8)                27 |                                                    |
[8, 16)               56 |                                                    |
[16, 32)             104 |                                                    |
[32, 64)             190 |                                                    |
[64, 128)            671 |@@                                                  |
[128, 256)         12892 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)          3861 |@@@@@@@@@@@@@@@                                     |
[512, 1K)            123 |                                                    |
[1K, 2K)              11 |                                                    |
[2K, 4K)               1 |                                                    |

Requests that take more than the io latency goal will starve the disk for some time. But they're pretty rare.

Member

ping @amnonh see question above

Contributor

@amnonh amnonh Apr 11, 2023

We have multiple ways of handling a distribution that are already in place and a few more that are coming.

  1. Summaries. They would give some data, but I think the quantiles will not help; the distribution is more important in this case.
  2. Aggregated histograms. Would a single histogram per node suffice? In Scylladb/PR#13293, depending on the configuration, Scylla exposes aggregated or non-aggregated metrics.
  3. Low-resolution histograms. In @avikivity's example, a histogram with a coefficient of 2 where fewer than 10 buckets were used was enough; we can use a histogram with a minimal number of buckets that would still expose enough of the distribution (see the sketch below).

Future enhancements would add enabling and disabling metrics at run time, and even further down the line better histogram support, but that would take longer.
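
A minimal sketch of what option 3 could look like (plain C++, not seastar's metrics API; the struct and bucket layout are only an illustration): a power-of-two bucketed histogram where roughly ten buckets are enough to show the shape of a poll-time distribution like the one above.

#include <array>
#include <cstdint>

struct coarse_histogram {
    // bucket i roughly covers [2^i, 2^(i+1)) microseconds; the last bucket is open-ended
    static constexpr int nr_buckets = 10;
    std::array<uint64_t, nr_buckets> buckets{};

    void add(uint64_t usecs) {
        int b = 0;
        while (usecs >= 2 && b < nr_buckets - 1) {
            usecs >>= 1;
            b++;
        }
        buckets[b]++;
    }
};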

@bhalevy bhalevy mentioned this pull request Apr 11, 2023
avikivity added a commit that referenced this pull request Oct 17, 2023
…m Pavel Emelyanov

There are three places where the IO dispatch loop is throttled

  * self-throttling with token bucket according to math model
  * per-shard one-tick threshold
  * 2-bucket approach when tokens are replenished only after they are released from disk

This PR removes the last one, because it leads to self-slowdown in case of reactor stalls. This back-link was introduced to catch the case when the disk suddenly slows down, so as to stop dispatching and not overload it with requests, but effectively this back-link measures not the real disk dispatch rate, but the disk+kernel+reactor dispatch rate. Although the "kernel" part is tiny, the reactor part can grow large, triggering the self-slowdown effect.

Here's some math.

Let's assume that at some point the scheduler dispatched N_d requests. It means that it was able to grab N_d tokens over a duration T_d, so the dispatch rate is R_d = N_d/T_d. The requests are to be completed by the reactor on its next tick. Let's assume it takes T_c time until the reactor gets there and completes N_c requests. The completion rate is thus R_c = N_c/T_c. Obviously, N_c <= N_d, because the kernel cannot complete more requests than were queued.

If the reactor experiences a stall during the completion tick, T_c > T_d, and since N_c <= N_d, consequently N_d/T_d > N_c/T_c. If the reactor doesn't stall, the number of requests that will complete is N_c = N_d/T_d * T_c, because this is how the dispatch rate is defined. This is equivalent to N_c/T_c = N_d/T_d.

Finally: R_d >= R_c, i.e. the dispatch rate is equal to or greater than the completion rate, where the "equal" part is less likely and holds only if the reactor runs like clockwork and doesn't stall.
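
(For a made-up example with concrete numbers: if N_d = 100 requests are dispatched over T_d = 1ms, then R_d = 100k req/s; if the reactor then stalls so that the completion tick takes T_c = 2ms while completing all N_c = 100 of them, then R_c = 50k req/s, i.e. half of R_d even though no request was lost.)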

The mentioned back-link makes sure that R_d <= R_c; coupled with the stalls (even small ones), this drives R_d down each tick, causing R_c to go down as well, and so on.

The removed fuse is replaced with a flow monitor based on the dispatch-to-completion ratio. Normally, the number of requests dispatched over a certain duration divided by the number of requests completed over the same duration must be 1.0; otherwise that would mean that requests accumulate in the disk. However, this ratio cannot hold immediately, and in the longer run it tends to be slightly greater than 1.0, because even if the reactor polls the kernel for IO completions more often, it won't get back more requests than were dispatched, yet even a small delay in polling makes Nr_completed / duration smaller because of the larger denominator.

Having said that, the new backlink is based on the flow ratio. When the "average" value of the dispatched/completed ratio exceeds some threshold (configurable, 1.5 by default), the "cost" of individual requests increases, thus reducing the dispatch rate.

The main difference from the current implementation is that the new backlink is not "immediate". The averaging is an exponential moving average filter with 100ms updates and a 0.95 smoothing factor. The current backlink is immediate in the sense that a delay in delivering a completion immediately slows down the next tick's dispatch, thus accumulating spontaneous reactor micro-stalls.
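
A rough sketch of that averaging, with illustrative names (this is not the actual io_queue code): every ~100ms the monitor takes the dispatched/completed counts for the window, folds the ratio into an exponential moving average with a 0.95 smoothing factor, and the caller inflates per-request cost once the average exceeds the 1.5 threshold.

#include <cstdint>

struct flow_monitor {
    static constexpr double smoothing = 0.95;
    static constexpr double threshold = 1.5;   // configurable in the real implementation
    double ema = 1.0;
    uint64_t prev_dispatched = 0;
    uint64_t prev_completed = 0;

    // called every ~100ms with the totals accumulated so far; returns the averaged flow ratio
    double update(uint64_t total_dispatched, uint64_t total_completed) {
        uint64_t d = total_dispatched - prev_dispatched;
        uint64_t c = total_completed - prev_completed;
        prev_dispatched = total_dispatched;
        prev_completed = total_completed;
        if (c > 0) {
            double ratio = double(d) / double(c);
            ema = smoothing * ema + (1.0 - smoothing) * ratio;
        }
        return ema;   // when ema > threshold, the caller scales up request cost
    }
};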

This can be reproduced by the test introduced in #1724. It's not (yet) in the PR, but making the token release loop artificially release ~1% more tokens fixes this case, which also supports the theory that the reduced completion rate is the culprit. BTW, it cannot be the fix, because the ... over-release factor is not constant and it's hard to calculate.

fixes: #1641
refs: #1311
refs: #1492 (*) in fact, _this_ is the metric that correlates with the flow ratio growing above 1.0, but this metric is sort of a look at quota violation from the IO angle
refs: #1774 this PR has attached metrics screenshots demonstrating the effect on a stressed scylla

Closes #1766

* github.com:scylladb/seastar:
  doc: Add document describing all the math behind IO scheduler
  io_queue: Add flow-rate based self slowdown backlink
  io_queue: Make main throttler uncapped
  io_queue: Add queue-wide metrics
  io_queue: Introduce "flow monitor"
  io_queue: Count total number of dispatched and completed requests so far
  io_queue: Introduce io_group::io_latency_goal()
graphcareful pushed a commit to graphcareful/seastar that referenced this pull request Mar 20, 2024
…m Pavel Emelyanov
