Implement cross-shard consumption fairness #2294

Open
xemul wants to merge 7 commits into master from br-io-queue-cross-shard-fairness

Conversation

@xemul (Contributor) commented Jun 13, 2024

The current IO queue design assumes that the IO workload is very uniform across shards, in the sense that all classes are more-or-less equally loaded on every shard. In reality that's not true, and some shards can easily get more requests in one of their classes than the others.

This leads to a very nasty consequence. Consider a corner case -- two classes, A and B, with shares 100 and 1000 respectively, and two shards. Class A is active on shard-0 only, while class B is active on shard-1 only. We expect that they share the disk bandwidth capacity in a 1:10 proportion, but in reality it's going to be 1:1, because the cross-shard queue doesn't preempt (it doesn't, because the load is expected to be even on all shards).

The solution here is to apply the cross-class fairness approach at the cross-shard level. For that, each fair_queue accumulates the total cost of the requests it has dispatched (cost = request.capacity / class.shares) and, on every poll, checks its accumulator against those of the other shards. If the local value is "somewhat ahead of" the others, the poll is skipped until later.
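
Below is a minimal, self-contained C++ sketch of that idea. It is only an illustration under assumed names (group_balance, shard_queue, may_dispatch, the tolerance value), not the actual seastar fair_queue/fair_group API: each shard adds cost = capacity / shares to its own slot in a group-shared vector and skips its dispatch poll while it is too far ahead of the least-advanced other shard.

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <vector>

    // One accumulator per shard; each slot is written by its owning shard only.
    struct group_balance {
        std::vector<std::atomic<uint64_t>> per_shard;
        explicit group_balance(unsigned nr_shards) : per_shard(nr_shards) {}
    };

    class shard_queue {
        group_balance& _group;
        unsigned _shard_id;
        uint64_t _tolerance; // how far ahead of the others this shard may run
    public:
        shard_queue(group_balance& g, unsigned id, uint64_t tolerance)
            : _group(g), _shard_id(id), _tolerance(tolerance) {}

        // Called when a request is dispatched: accumulate its shares-weighted cost.
        void account_dispatch(uint64_t capacity, uint64_t shares) {
            uint64_t cost = (capacity + shares - 1) / shares; // div-roundup
            _group.per_shard[_shard_id].fetch_add(cost, std::memory_order_relaxed);
        }

        // Called at the start of a poll: may this shard dispatch right now?
        bool may_dispatch() const {
            uint64_t mine = _group.per_shard[_shard_id].load(std::memory_order_relaxed);
            uint64_t min_other = UINT64_MAX;
            // scan the other shards' accumulators (cheap, but not cache-friendly)
            for (unsigned i = 0; i < _group.per_shard.size(); i++) {
                if (i != _shard_id) {
                    min_other = std::min(min_other,
                            _group.per_shard[i].load(std::memory_order_relaxed));
                }
            }
            if (min_other == UINT64_MAX) {
                return true; // single shard -- nothing to compare against
            }
            // skip this poll if we ran "somewhat ahead of" the slowest shard
            return mine <= min_other + _tolerance;
        }
    };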

refs: #1430
fixes: #1083

Tested with io_tester on the 2-shard setup from #2289:

- name: shard_0
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 100
  shards:
  - '0'
  type: randread
- name: shard_1
  shard_info:
    parallelism: 32
    reqsize: 4kB
    shares: 800
  shards:
  - '1'
  type: randread

The result is

---
- shard: 0
  shard_0:
    throughput: 89627.8828  # kB/s
    IOPS: 22407.0703
    latencies:  # usec
      average: 1427
      p0.5: 1627
      p0.95: 1653
      p0.99: 1674
      p0.999: 1776
      max: 73161
    stats:
      total_requests: 672223
      io_queue_total_exec_sec: 38.740494826001402
      io_queue_total_delay_sec: 920.08525591999751
      io_queue_total_operations: 672255
      io_queue_starvation_time_sec: 0.86344176700005526
      io_queue_consumption: 3.7791568234562876
      io_queue_adjusted_consumption: 0.038457549512386321
- shard: 1
  shard_1:
    throughput: 621912.375  # kB/s
    IOPS: 155478.797
    latencies:  # usec
      average: 205
      p0.5: 199
      p0.95: 227
      p0.99: 271
      p0.999: 391
      max: 1470
    stats:
      total_requests: 4664398
      io_queue_total_exec_sec: 278.93417872600429
      io_queue_total_delay_sec: 665.67777301317381
      io_queue_total_operations: 4664430
      io_queue_starvation_time_sec: 5.5724000000000004e-05
      io_queue_consumption: 26.221615996956825
      io_queue_adjusted_consumption: 0.03324427980184555
...

@avikivity (Member) commented:

> This leads to a very nasty consequence. Consider a corner case -- two classes, A and B, with shares 100 and 1000 respectively, and two shards. Class A is active on shard-0 only, while class B is active on shard-1 only. We expect that they share the disk bandwidth capacity in a 1:10 proportion, but in reality it's going to be 1:1, because the cross-shard queue doesn't preempt (it doesn't, because the load is expected to be even on all shards).

I don't mean to ignore the problem, but I think "very nasty" is an exaggeration. A/0 could have expected 0.44 share of the capacity (1000/2200) and got 0.5. B/1 could expect 0.04 share of the capacity, and got 0.5. So neither has anything to complain about.

I suppose with a larger number of shards and only a small number of sg/shard combinations active the problem is more interesting. Especially when the disk isn't too fast.

@avikivity (Member) commented:

Looks reasonable (I don't claim to understand it deeply).

How can we assure this doesn't cause regressions?

@xemul (Contributor, Author) commented Jun 13, 2024

> I don't mean to ignore the problem, but I think "very nasty" is an exaggeration.

Probably

> A/0 could have expected 0.44 share of the capacity (1000/2200) and got 0.5. B/1 could expect 0.04 share of the capacity, and got 0.5. So neither has anything to complain about.

😄

Not quite. Provided A is the query class and B is the compaction class, 0.44 reads vs 0.04 writes is not the same as 0.5 reads vs 0.5 writes -- judging from the diskplorer plots, only i4i can handle the latter.

> How can we assure this doesn't cause regressions?

I will :( Currently, if one shard gets stuck, other shards just continue moving. With this patch, they will all get stuck.

@avikivity (Member) commented:

> > I don't mean to ignore the problem, but I think "very nasty" is an exaggeration.
>
> Probably
>
> > A/0 could have expected 0.44 share of the capacity (1000/2200) and got 0.5. B/1 could expect 0.04 share of the capacity, and got 0.5. So neither has anything to complain about.
>
> 😄
>
> Not quite. Provided A is the query class and B is the compaction class, 0.44 reads vs 0.04 writes is not the same as 0.5 reads vs 0.5 writes -- judging from the diskplorer plots, only i4i can handle the latter.

But isn't the 0.5 normalized? So if the disk can do 6GB/s read and 1GB/s write, 0.5r / 0.5w = 3GB/s reads and 0.5GB/s writes, which should work according to the model.

@xemul (Contributor, Author) commented Jun 17, 2024

> So if the disk can do 6GB/s read and 1GB/s write, 0.5r / 0.5w = 3GB/s reads and 0.5GB/s writes, which should work according to the model.

Yes, from the pure model perspective it's all OK, but in reality some plots are convex, so the 0.5/0.5 point happens to be "more purple" (larger read latency), while 0.4/0.04 is "more cyan" (smaller read latency)

[image: model plot]

@xemul force-pushed the br-io-queue-cross-shard-fairness branch from 207c14a to 715d4a0 on June 17, 2024 08:33
@xemul (Contributor, Author) commented Jun 17, 2024

upd:

  • scale tokens 8 times to avoid overflows
  • fix fair_queue tests (hopefully) to account for the fact that the fair queue does time-based dispatching

@xemul force-pushed the br-io-queue-cross-shard-fairness branch from 715d4a0 to a4bc3e0 on June 17, 2024 08:59
@xemul (Contributor, Author) commented Jun 17, 2024

upd:

  • a leftover unused variable caused the compilation to fail

@xemul (Contributor, Author) commented Jun 17, 2024

CI fails due to #2296

@xemul (Contributor, Author) commented Jun 17, 2024

upd:

  • add a manual (quite long) test to check cross-shard fairness

@xemul force-pushed the br-io-queue-cross-shard-fairness branch from 26a8d08 to a2b8ec6 on June 18, 2024 14:53
@xemul (Contributor, Author) commented Jun 18, 2024

upd:

  • rebased to check CI passes

The value is used to convert request tokens from a floating-point number to
some "sane" integer. Currently, on a ~200k IOPS, ~2GB/s disk it produces the
following token values for requests of different sizes:

    512:    150k
    1024:   160k
    2048:   170k
    ...
    131072: 1.4M

These values are pretty huge and, when accumulated even in a 64-bit counter,
can overflow it on a timescale of months. The current code sort of accounts
for that by checking for the overflow and resetting the counters, but in the
future counters will need to be reset on different shards as well, and that's
going to be problematic.

This patch reduces the factor 8 times, so that the costs are now

    512:      19k
    1024:     20k
    2048:     21k
    ...
    131072:  170k

That's much more friendly to accumulating counters (the overflow now happens
on a timescale of years, which is pretty comfortable). Reducing the factor
even further is problematic, and here's why.

In order to provide cross-class fairness, the costs are divided by the class
shares for accumulation. Given a class with 1000 shares, a 512-byte request
would become indistinguishable from a 1k one with a smaller factor. That
said, even with the new factor it's worth taking more care when dividing the
cost by the shares and using div-roundup math.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
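
As a concrete, hedged illustration of the rounding argument above (the cost() helper below is hypothetical and merely mirrors the "capacity divided by shares, div-roundup" rule from the commit message; the token values are the ones quoted above):

    #include <cassert>
    #include <cstdint>

    static uint64_t cost(uint64_t tokens, uint64_t shares) {
        return (tokens + shares - 1) / shares; // div-roundup
    }

    int main() {
        const uint64_t shares = 1000;
        // with the reduced factor, a 512-byte request costs ~19k tokens and a
        // 1k request ~20k -- still distinguishable after dividing by the shares
        assert(cost(19000, shares) == 19);
        assert(cost(20000, shares) == 20);
        // had the factor been reduced much further (say ~2.4k vs ~2.5k tokens),
        // plain floor division would collapse the two requests to the same cost
        assert(2400 / shares == 2500 / shares);
        return 0;
    }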
The current fair queue tests try to make the queue submit requests in an
extremely controllable way -- one by one. However, the fair queue nowadays
is driven by a rated token bucket and is very sensitive to time and
durations. It's better to teach the tests to accept the fact that they
cannot control fair-queue request submission at per-request granularity and
to tune their accounting instead.

The change affects two places.

Main loop. Before the change it called fair_queue::dispatch_requests()
as many times as the number of requests the test case wants to pass, then
performed the necessary checks. Now the method is called indefinitely,
and the handler only processes the requested number of requests; the
rest are ignored.

Drain. Before the change it called dispatch_requests() in a loop until it
stopped returning anything. Now it's called in a loop until the fair queue
explicitly reports that it's empty.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
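
For illustration, a rough sketch of the two adjusted loops follows; the dispatch_requests() callback shape and the empty() accessor are assumptions here, not quotes of the actual test code:

    // Main loop: poll repeatedly and let the handler count how many of the
    // dispatched requests the test cares about; extra dispatches are ignored.
    template <typename Queue>
    void poll_until(Queue& fq, unsigned wanted) {
        unsigned seen = 0;
        while (seen < wanted) {
            fq.dispatch_requests([&] (auto&) {
                if (seen < wanted) {
                    seen++;
                }
            });
        }
    }

    // Drain: keep polling until the queue itself reports that it is empty,
    // instead of stopping after the first call that dispatches nothing.
    template <typename Queue>
    void drain(Queue& fq) {
        while (!fq.empty()) {
            fq.dispatch_requests([] (auto&) { /* drop */ });
        }
    }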
For convenience

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
On each shard, classes compete with each other by accumulating the sum of
the request costs that have been dispatched from them so far. The cost is
the request capacity divided by the class shares. The dispatch loop then
selects the class with the smallest accumulated value, thus providing
shares-aware fairness -- the larger the shares value, the slower the
accumulator grows and the more requests are picked from the class for
dispatch.

This patch implements a similar approach across shards. For that, each
shard accumulates the dispatched cost from all of its classes. The IO group
keeps track of a vector of accumulated costs, one per shard. When a shard
wants to dispatch, it first checks whether it has run too far ahead of all
the other shards, and if it has, it skips the dispatch loop.

Corner case -- when a queue gets drained, it "withdraws" itself from the
other shards' decisions by advancing its group counter to infinity.
Accordingly, when the queue becomes active again, it may forward its
accumulator so as not to get too large an advantage over the other shards.

When scheduling classes, a shard has exclusive access to them and uses a
logarithmic-complexity heap to pick the one with the smallest consumption
counter. Cross-shard balancing cannot afford that. Instead, each shard
manipulates its own counter only, and to compare it with the other shards'
it scans the whole vector, which is not very cache-friendly and is
race-prone.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
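
The drained/resumed corner case could look roughly like this (a hedged sketch with made-up function names, reusing the idea of a group-wide vector of per-shard accumulators from the description above):

    #include <algorithm>
    #include <atomic>
    #include <cstdint>
    #include <limits>
    #include <vector>

    constexpr uint64_t idle_marker = std::numeric_limits<uint64_t>::max();

    // A drained queue parks its counter at "infinity" so that it never makes
    // the other shards look "too far ahead".
    void on_queue_drained(std::vector<std::atomic<uint64_t>>& per_shard, unsigned me) {
        per_shard[me].store(idle_marker, std::memory_order_relaxed);
    }

    // On resume, jump forward to the least-advanced active shard so that the
    // returning queue does not start with a huge accumulated advantage.
    void on_queue_resumed(std::vector<std::atomic<uint64_t>>& per_shard, unsigned me) {
        uint64_t min_active = idle_marker;
        for (unsigned i = 0; i < per_shard.size(); i++) {
            if (i == me) {
                continue;
            }
            uint64_t v = per_shard[i].load(std::memory_order_relaxed);
            if (v != idle_marker) {
                min_active = std::min(min_active, v);
            }
        }
        // if every other shard is idle too, just restart from zero
        per_shard[me].store(min_active == idle_marker ? 0 : min_active,
                            std::memory_order_relaxed);
    }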
The value is used to limit the number of requests one shard is allowed to
dispatch in one poll. This is to prevent it from consuming the whole
capacity in one go and to let other shards get their portion.

Group-wide balancing (previous patch) made this fuse obsolete.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Looking at the group balance counters is not very lightweight and is
better avoided when possible. For that, when balance is achieved, arm a
timer for a quiescent period and check again only after it expires. When
the group is not balanced, check the balance more frequently.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
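
A hedged sketch of that amortization (hypothetical names, not the actual seastar implementation): once the shards look balanced, the comparatively expensive group scan is skipped for a quiescent period instead of being repeated on every poll.

    #include <chrono>

    class balance_checker {
        using clock = std::chrono::steady_clock;
        clock::time_point _next_check = clock::time_point::min();
        std::chrono::milliseconds _quiescent_period;
    public:
        explicit balance_checker(std::chrono::milliseconds quiescent)
            : _quiescent_period(quiescent) {}

        // balanced_fn performs the actual scan of the other shards' counters
        template <typename Fn>
        bool may_dispatch(Fn&& balanced_fn) {
            auto now = clock::now();
            if (now < _next_check) {
                return true;  // inside the quiet window: assume still balanced
            }
            if (balanced_fn()) {
                _next_check = now + _quiescent_period;  // arm the quiet window
                return true;
            }
            return false;     // unbalanced: keep checking on every poll
        }
    };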
It's pretty long, so it is not meant for automatic execution.

2-shard tests:

{'shard_0': {'iops': 88204.3828, 'shares': 100}, 'shard_1': {'iops': 89686.5156, 'shares': 100}}
IOPS ratio 1.02, expected 1.0, deviation 1%

{'shard_0': {'iops': 60321.3125, 'shares': 100}, 'shard_1': {'iops': 117566.406, 'shares': 200}}
IOPS ratio 1.95, expected 2.0, deviation 2%

{'shard_0': {'iops': 37326.2422, 'shares': 100}, 'shard_1': {'iops': 140555.062, 'shares': 400}}
IOPS ratio 3.77, expected 4.0, deviation 5%

{'shard_0': {'iops': 21547.6152, 'shares': 100}, 'shard_1': {'iops': 156309.891, 'shares': 800}}
IOPS ratio 7.25, expected 8.0, deviation 9%

3-shard tests:

{'shard_0': {'iops': 45211.9336, 'shares': 100}, 'shard_1': {'iops': 45211.9766, 'shares': 100}, 'shard_2': {'iops': 87412.9453, 'shares': 200}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 1.93, expected 2.0, deviation 3%

{'shard_0': {'iops': 30992.2188, 'shares': 100}, 'shard_1': {'iops': 30992.2812, 'shares': 100}, 'shard_2': {'iops': 115887.609, 'shares': 400}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 3.74, expected 4.0, deviation 6%

{'shard_0': {'iops': 19279.6348, 'shares': 100}, 'shard_1': {'iops': 19279.6934, 'shares': 100}, 'shard_2': {'iops': 139316.828, 'shares': 800}}
shard-1 IOPS ratio 1.0, expected 1.0, deviation 0%
shard-2 IOPS ratio 7.23, expected 8.0, deviation 9%

{'shard_0': {'iops': 26505.9082, 'shares': 100}, 'shard_1': {'iops': 53011.9922, 'shares': 200}, 'shard_2': {'iops': 98369.4453, 'shares': 400}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 3.71, expected 4.0, deviation 7%

{'shard_0': {'iops': 17461.8145, 'shares': 100}, 'shard_1': {'iops': 34923.8438, 'shares': 200}, 'shard_2': {'iops': 125470.43, 'shares': 800}}
shard-1 IOPS ratio 2.0, expected 2.0, deviation 0%
shard-2 IOPS ratio 7.19, expected 8.0, deviation 10%

{'shard_0': {'iops': 14812.3037, 'shares': 100}, 'shard_1': {'iops': 58262, 'shares': 400}, 'shard_2': {'iops': 104794.633, 'shares': 800}}
shard-1 IOPS ratio 3.93, expected 4.0, deviation 1%
shard-2 IOPS ratio 7.07, expected 8.0, deviation 11%

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
@xemul force-pushed the br-io-queue-cross-shard-fairness branch from a2b8ec6 to 3ef9817 on June 20, 2024 15:32
@xemul changed the title from "[RFC] Implement cross-shard consumption fairness" to "Implement cross-shard consumption fairness" on Jun 20, 2024
@xemul (Contributor, Author) commented Jun 20, 2024

upd:

  • added a (manual) test with two shards running at different RPS, to show that cross-shard balancing doesn't break those two (spoiler: no breakage 🎉)
  • added amortization for the group's balance-counter scan

@avikivity (Member) commented:

> > So if the disk can do 6GB/s read and 1GB/s write, 0.5r / 0.5w = 3GB/s reads and 0.5GB/s writes, which should work according to the model.
>
> Yes, from the pure model perspective it's all OK, but in reality some plots are convex, so the 0.5/0.5 point happens to be "more purple" (larger read latency), while 0.4/0.04 is "more cyan" (smaller read latency)

Then we should either adjust the model (and teach iotune to learn it) or add a safety factor that pushes the entire line down.

We should match the scheduler to the model and the model to the disk. If we skip a step we'll get lost.

@xemul (Contributor, Author) commented Jun 21, 2024

> Then we should either adjust the model (and teach iotune to learn it) or add a safety factor that pushes the entire line down.

It's already there -- the rate_factor parameter in io-properties.yaml is the K parameter from the model equation:

     * Throttling formula:
     *
     *    Bw/Bw_max + Br/Br_max + Ow/Ow_max + Or/Or_max <= K

> We should match the scheduler to the model and the model to the disk. If we skip a step we'll get lost.

I don't disagree. My "not quite" observation wasn't meant as a justification for this PR, as the PR doesn't touch the model.

@xemul (Contributor, Author) commented Jun 21, 2024

Updated the PR description so that it no longer claims to fix #1430, as the patch still doesn't help high-prio requests advance in the queue in the case of a workload that is symmetrical across shards -- shards forward their accumulators equally, giving no advantage to each other.

However, the workload from #1430 clearly shows ~10% lower tail latency for the high-prio interactive class (4 shards).

On master:

    IOPS: 244.480713
      p0.5: 318
      p0.95: 532
      p0.99: 1873
      p0.999: 4863
    IOPS: 252.025391
      p0.5: 318
      p0.95: 511
      p0.99: 1882
      p0.999: 4186
    IOPS: 250.350845
      p0.5: 319
      p0.95: 537
      p0.99: 2584
      p0.999: 4942
    IOPS: 245.260681
      p0.5: 315
      p0.95: 555
      p0.99: 2164
      p0.999: 4563

With this PR:

    IOPS: 245.461426
      p0.5: 323
      p0.95: 488
      p0.99: 1555
      p0.999: 3950
    IOPS: 244.498688
      p0.5: 320
      p0.95: 526
      p0.99: 1562
      p0.999: 3649
    IOPS: 250.401276
      p0.5: 321
      p0.95: 501
      p0.99: 2013
      p0.999: 4083
    IOPS: 247.366104
      p0.5: 318
      p0.95: 513
      p0.99: 2093
      p0.999: 4005

Possible explanation -- the high-prio class is low-rate (250 RPS) and its queue is mostly empty, so when one shard emits a request, it forwards its accumulator, reducing its chance of dispatching requests in the next tick, so the sink has more room for the other shards.

Still, that's not the true preemption that #1430 implies.

Successfully merging this pull request may close these issues:

  • IO capacity balancing is not well balanced