rusty: Implement NUMA-aware load balancing #178

Merged: Byte-Lab merged 15 commits into main from multi_numa_rusty on Mar 12, 2024
Conversation

Byte-Lab (Contributor) commented Mar 8, 2024

Right now, scx_rusty has no notion of domains spanning NUMA nodes, and it makes
no distinction between NUMA nodes when making load balancing or work stealing
decisions. This can cause problems on multi-NUMA machines, as load balancing and
work stealing across NUMA nodes have a significantly different cost than doing
so across L3 cache boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:

  1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
    between domains across those NUMA node boundaries. We require a load
    imbalance of at least 17% in order to move load at this stage. The ratio of
    load imbalance we attempt to adjust (50%) and the maximum amount of load
    we're allowed to push out of a domain (50%) are still the same as when
    balancing between domains inside a NUMA node, but this is easy to tune with
    the current setup (see the sketch after this list).

  2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
    between the domains within each NUMA node. The cost function here is the
    same as what it has been thus far: we require at least a 5% imbalance in
    order to trigger load balancing.
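
The knobs above are easiest to see side by side. Here is a minimal sketch of the
thresholds as described, using hypothetical names rather than scx_rusty's actual
identifiers:

```rust
// Knobs for the two balancing passes described above. All names are
// illustrative, not the identifiers scx_rusty actually uses.
const XNUMA_IMBAL_MIN: f64 = 0.17; // pass 1: across NUMA nodes
const DOM_IMBAL_MIN: f64 = 0.05;   // pass 2: domains within a node
const IMBAL_ADJ_RATIO: f64 = 0.50; // fraction of the imbalance we try to correct
const PUSH_MAX_RATIO: f64 = 0.50;  // cap on load pushed out of a single domain

/// How much load to migrate away from a node or domain whose load exceeds
/// `avg`, for the given pass. Returns 0.0 if the imbalance is below the
/// pass's threshold.
fn xfer_amount(load: f64, avg: f64, cross_numa: bool) -> f64 {
    let min_ratio = if cross_numa { XNUMA_IMBAL_MIN } else { DOM_IMBAL_MIN };
    let imbal = load - avg;
    if imbal <= avg * min_ratio {
        return 0.0;
    }
    (imbal * IMBAL_ADJ_RATIO).min(load * PUSH_MAX_RATIO)
}
```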

There are a few additional changes / improvements to load balancing in this
commit:

  1. NUMA nodes and domains are now ordered according to their load by using
    SortedVec objects. We were previously using a BTreeMap keyed by load, but
    that was suboptimal because it doesn't allow duplicate keys (see the sketch
    after this list).

  2. We're no longer exporting load balancing statistics as a vector of data such
    as load sums, averages, and imbalances. This is instead all encapsulated in
    the load balancing hierarchy we set up in lb.load_balance(). These statistics
    are not yet exported, but they will be in a subsequent commit.
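
To illustrate the duplicate-key problem from point 1 with a small, self-contained
example (fixed-point integer loads and plain std types here; scx_rusty's actual
types differ):

```rust
use std::collections::BTreeMap;

fn main() {
    // A BTreeMap keyed by load silently collapses two domains that happen
    // to have identical load into a single entry.
    let mut by_load: BTreeMap<u64, usize> = BTreeMap::new();
    by_load.insert(1024, 0); // domain 0, load 1024
    by_load.insert(1024, 1); // domain 1, same load: overwrites domain 0
    assert_eq!(by_load.len(), 1);

    // A load-sorted Vec (what a SortedVec-style container maintains for us)
    // keeps both entries.
    let mut sorted: Vec<(u64, usize)> = vec![(1024, 0)];
    let pos = sorted.partition_point(|&(load, _)| load <= 1024);
    sorted.insert(pos, (1024, 1));
    assert_eq!(sorted.len(), 2);
}
```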

One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

  1. make allyesconfig
  2. make -j $(nproc) built-in.a
  3. make -j clean
  4. goto 2

Runtime

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

A further experiment that could be interesting is tuning the threshold at
which we perform cross-NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
was very close, though rusty NUMA was a bit lower.

Byte-Lab requested review from arighi and htejun on March 8, 2024 at 18:33
rusty.rs is growing a bit unwieldy. We're going to want to update its load
balancing logic somewhat significantly to account for multi-NUMA and other cost
functions, so let's start cleaning the code up so that things are more
logically segmented and easier to work with.

To start, we move the Tuner and DomainGroup/Domain objects into their own
modules.

Signed-off-by: David Vernet <void@manifault.com>
More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into
its own file. It will soon be extended quite a bit to support multi-NUMA and
other multivariate LB cost functions, so it's time to clean things up and split
it out.

Signed-off-by: David Vernet <void@manifault.com>
Let's just query self.tuner.fully_utilized directly and save a few lines of
code.

Signed-off-by: David Vernet <void@manifault.com>
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and it makes
no distinction between NUMA nodes when making load balancing or work stealing
decisions. This can cause problems on multi-NUMA machines, as load balancing and
work stealing across NUMA nodes have a significantly different cost than doing
so across L3 cache boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:

1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
   between domains across those NUMA node boundaries. We require a load
   imbalance of at least 17% in order to move load at this stage. The ratio of
   load imbalance we attempt to adjust (50%) and the maximum amount of load
   we're allowed to push out of a domain (50%) are still the same as when
   balancing between domains inside a NUMA node, but this is easy to tune with
   the current setup.

2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
   between the domains within each NUMA node. The cost function here is the
   same as what it has been thus far: we require at least a 5% imbalance in
   order to trigger load balancing.

There are a few additional changes / improvements to load balancing in this
commit:

1. NUMA nodes and domains are now ordered according to their load by using
   SortedVec objects. We were previously using a BTreeMap keyed by load, but
   that was suboptimal because it doesn't allow duplicate keys.

2. We're no longer exporting load balancing statistics as a vector of data such
   as load sums, averages, and imbalances. This is instead all encapsulated in
   the load balancing hierarchy we set up in lb.load_balance(). These statistics
   are not yet exported, but they will be in a subsequent commit.

One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile
--------

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util
--------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs
---------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

A further experiment that could be interesting is tuning the threshold at
which we perform cross-NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
was very close, though rusty NUMA was a bit lower.

Signed-off-by: David Vernet <void@manifault.com>
The cpumask print formatter doesn't look great in its current form, which uses
the BitVec formatter under the hood:

[INFO] NUMA[00 32:<[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]>]
[INFO]         DOM[00] 32:<[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]>
[INFO]         DOM[01] 32:<[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]>

Let's just iterate over the mask and manually format the string using the
binary formatter over the slice of u64's, which renders like this:

[INFO] NUMA[00] 0b11111111111111111111111111111111]
[INFO]         DOM[00] 0b00000000111111110000000011111111
[INFO]         DOM[01] 0b11111111000000001111111100000000
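
A minimal sketch of this kind of manual formatting, assuming the mask is stored
as u64 words with bit 0 of the first word being CPU 0 (a hypothetical helper,
not the scheduler's exact code):

```rust
/// Render a cpumask stored as u64 words as a binary string like the output
/// above. Assumes nr_cpus <= 64 * words.len().
fn format_cpumask(words: &[u64], nr_cpus: usize) -> String {
    let mut bits = String::new();
    // Highest word first so CPU 0 ends up as the rightmost bit.
    for word in words.iter().rev() {
        bits.push_str(&format!("{:064b}", word));
    }
    // Keep only the low nr_cpus bits.
    format!("0b{}", &bits[bits.len() - nr_cpus..])
}

fn main() {
    // Two 8-CPU domains interleaved across a 32-CPU node, as in the example.
    println!("{}", format_cpumask(&[0x00ff00ff], 32));
    // => 0b00000000111111110000000011111111
}
```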

Signed-off-by: David Vernet <void@manifault.com>
scx_rusty currently pushes tasks to idle cores if the direct greedy threshold
is exceeded, even if the core is on a remote NUMA node. This behavior is
probably not desired in most scenarios. The worst that will happen if a task is
pushed to an idle core in the same node is some L3 cache miss traffic, but for
multiple NUMA nodes, it could cause the task to have its working set span
multiple nodes.

Let's disable direct greedy work stealing across NUMA nodes. A future commit
will add a flag that's disabled by default, and lets users turn this on if
they really want to encourage work conservation.

Signed-off-by: David Vernet <void@manifault.com>
Users may want to toggle whether tasks can be temporarily sent to idle CPUs on
remote NUMA nodes. By default, we want it to be disabled as a task spanning
multiple NUMA nodes will end up having its working set spanning both nodes,
which is probably not desirable. However, in case a workload really wants to
encourage work conservation, let's add a flag that allows them to toggle it.

Signed-off-by: David Vernet <void@manifault.com>
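
As a sketch of what such an opt-in looks like with clap's derive API (the flag
name below is illustrative, not necessarily the exact option scx_rusty exposes):

```rust
use clap::Parser;

/// Illustrative subset of a scheduler's command-line options.
#[derive(Debug, Parser)]
struct Opts {
    /// Allow direct greedy placement onto idle CPUs in remote NUMA nodes.
    /// Off by default so a task's working set stays on one node; turn it on
    /// only if work conservation matters more than locality.
    #[clap(long)]
    direct_greedy_numa: bool,
}

fn main() {
    let opts = Opts::parse();
    println!("cross-NUMA direct greedy: {}", opts.direct_greedy_numa);
}
```
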
The scx_rusty load balancer no longer exports statistics such as domain load
averages, load sums, etc. Now that we're also balancing by NUMA, we'll need a
way to hierarchically illustrate load balancing statistics. This patch adds
support for that.

Signed-off-by: David Vernet <void@manifault.com>

updating stats printing

Signed-off-by: David Vernet <void@manifault.com>
We removed the debug!() output that was previously present in main.rs. Let's
add more debug!() output that helps debug the current LB hierarchy.

Signed-off-by: David Vernet <void@manifault.com>
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.

Signed-off-by: David Vernet <void@manifault.com>
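
A sketch of the keying change (type and field names are hypothetical): index
cores by (node id, core id) instead of core id alone, so core IDs that repeat
across sockets no longer collide.

```rust
use std::collections::BTreeMap;

struct Core {
    cpus: Vec<usize>,
}

/// Cores keyed by (node_id, core_id): a core_id reused on another NUMA
/// node/socket maps to a distinct entry rather than overwriting this one.
fn add_cpu(
    cores: &mut BTreeMap<(usize, usize), Core>,
    node_id: usize,
    core_id: usize,
    cpu: usize,
) {
    cores
        .entry((node_id, core_id))
        .or_insert_with(|| Core { cpus: Vec::new() })
        .cpus
        .push(cpu);
}
```
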
Byte-Lab force-pushed the multi_numa_rusty branch 2 times, most recently from e481eea to 8a8438f on March 12, 2024 at 03:51
In scx_rusty, a CPU that is going to go idle will attempt to steal tasks
from remote domains when its domain has no tasks to run, and a remote
domain has at least greedy_threshold enqueued tasks. This stealing is
temporary, but of course has a cost in that the CPU that's stealing the
task may cause it to suffer from cache misses, or in the case of
multi-node machines, remote NUMA accesses and working sets split across
multiple domains.

Given the higher cost of cross-NUMA work stealing, let's add a separate flag
that lets users tune the threshold for doing cross-NUMA greedy task
stealing.

Signed-off-by: David Vernet <void@manifault.com>
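
A sketch of the resulting steal check, with hypothetical names: the stricter
cross-NUMA threshold only applies when the victim domain sits on a different
node than the idling CPU.

```rust
/// Decide whether an about-to-idle CPU may greedily steal from a remote
/// domain that currently has `nr_queued` runnable tasks.
fn can_greedy_steal(
    nr_queued: u64,
    same_numa_node: bool,
    greedy_threshold: u64,
    greedy_threshold_x_numa: u64,
) -> bool {
    let thresh = if same_numa_node {
        greedy_threshold
    } else {
        greedy_threshold_x_numa
    };
    // In this sketch, a threshold of 0 is treated as "never steal" for
    // that class of domain.
    thresh > 0 && nr_queued >= thresh
}
```
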
Tejun pointed out that a possible issue exists in the current
implementation, wherein if you have two NUMA nodes that are imbalanced,
but their domains are internally balanced, we'll fail to migrate between
them if all nodes are in the balanced_nodes list.

To address this, let's just use a single global list for all types of
domains, and do checking internally for imbalances. The reason it was
done this way in the first place was to allow me to mutably iterate over
both vectors in a nested loop. The way around that is to just use loop
{} and push/pop domains from the list.

We could do the same thing for the NUMA nodes themselves, which are also
in 3 separate lists in the LoadBalancer. We'll do that in a subsequent
commit.

Signed-off-by: David Vernet <void@manifault.com>
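
A sketch of the loop/push/pop pattern with hypothetical types and numbers:
popping the push candidate out of the single sorted list sidesteps the need for
two overlapping mutable borrows into it.

```rust
#[derive(Debug)]
struct Domain {
    load: f64,
}

/// Repeatedly move half of the largest imbalance from the most-loaded to the
/// least-loaded domain until the imbalance falls below `threshold` (a
/// fraction of the average load `avg`).
fn balance(doms: &mut Vec<Domain>, avg: f64, threshold: f64) {
    loop {
        // Keep ascending load order so the candidates sit at the ends.
        doms.sort_by(|a, b| a.load.partial_cmp(&b.load).unwrap());
        if doms.len() < 2 {
            break;
        }
        let imbal = doms.last().unwrap().load - avg;
        if imbal <= avg * threshold {
            break;
        }
        // Pop the push domain off the list so we can mutate it and the pull
        // domain without holding two mutable borrows into `doms` at once.
        let mut push = doms.pop().unwrap();
        let pull = doms.first_mut().unwrap();
        let xfer = imbal * 0.5;
        push.load -= xfer;
        pull.load += xfer;
        doms.push(push);
    }
}

fn main() {
    let mut doms = vec![Domain { load: 130.0 }, Domain { load: 70.0 }];
    balance(&mut doms, 100.0, 0.05);
    println!("{:?}", doms); // => loads 96.25 and 103.75, within 5% of avg
}
```
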
As Tejun pointed out in review, the disadvantage of using
push/pull/balanced lists is that if the domains inside the nodes are
balanced, we won't be able to push load between them. I'd originally
done it that way both as an optimization and to allow me to iterate
over the lists of pushable and pullable domains mutably. That was
addressed in the prior commit, but the nodes themselves were still
put into 3 buckets.

I think this is generally just a cleaner way of doing things, so let's
just collapse the nodes into a flat list as well. This prevents us from
having to coalesce the lists, std::mem::swap them, etc.

Signed-off-by: David Vernet <void@manifault.com>
Byte-Lab (Contributor, Author) commented Mar 12, 2024

@htejun / @arighi Addressed your feedback items. I also added another flag to enable tuning a cross-NUMA greedy task stealing threshold. Longer term I want to make scx_rusty automatically adjust these settings, but for now it's a knob that users can tune while running experiments.

Fixing alignment, moving a couple bail! calls around, and adding a
missing break from move_between_nodes() that lets us bail out of a loop
early.

Signed-off-by: David Vernet <void@manifault.com>
Given the complexity of migrating load between nodes (we're doing four
nested loops), we should add a comment explaining what we're doing. This
commit does that. In addition, we use a VecDeque to store (and then
restore) push nodes and push domains so that we can re-add them to their
respective lists in load-sorted order rather than reverse-load-sorted
order. This allows us to avoid having to do unnecessary right-shifts
every time a push object is re-added to its containing list.

Signed-off-by: David Vernet <void@manifault.com>
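
A sketch of the VecDeque bookkeeping, simplified to bare load values: push
candidates come off the back of an ascending, load-sorted list, so stashing
them with push_front() keeps the stash ascending and lets us re-append them
with a single extend(), with no right-shifts.

```rust
use std::collections::VecDeque;

/// `sorted_loads` is kept in ascending order; push candidates are popped
/// off the back (highest load first).
fn drain_and_restore(sorted_loads: &mut Vec<u64>) {
    let mut stash: VecDeque<u64> = VecDeque::new();
    while let Some(load) = sorted_loads.pop() {
        // ... attempt to push load out of this entry here ...
        // Front-insertion keeps the stash in ascending load order.
        stash.push_front(load);
    }
    // Re-add in load-sorted order with a single append.
    sorted_loads.extend(stash);
}

fn main() {
    let mut loads = vec![10, 20, 30];
    drain_and_restore(&mut loads);
    assert_eq!(loads, vec![10, 20, 30]); // order preserved without re-sorting
}
```
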
Byte-Lab (Contributor, Author)

Ok, cleaned the code up a bit more and added some comments to explain what we're doing. I'm going to merge this as is so we can iterate on future changes.

Byte-Lab merged commit 91cb5ce into main on Mar 12, 2024
1 check passed
Byte-Lab deleted the multi_numa_rusty branch on March 12, 2024 at 20:50