rusty: Implement NUMA-aware load balancing #178

Merged: Byte-Lab merged 15 commits into main from multi_numa_rusty on Mar 12, 2024
Conversation

Byte-Lab (Contributor) commented Mar 8, 2024

Right now, scx_rusty has no notion of domains spanning NUMA nodes, and it makes
no distinction between NUMA nodes when making load balancing or work stealing
decisions. This can cause problems on multi-NUMA machines, as load balancing and
work stealing across NUMA nodes have a significantly different cost than doing
so across L3 cache boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:

  1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
    between domains across those NUMA node boundaries. We require a load
    imbalance of at least 17% in order to move load at this stage. The ratio of
    load imbalance we attempt to adjust (50%) and the maximum amount of load
    we're allowed to push out of a domain (50%) are still the same as when
    balancing between domains inside a NUMA node, but this is easy to tune with
    the current setup (see the sketch after this list).

  2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
    between the domains within each NUMA node. The cost function here is the
    same as what it has been thus far: we require at least a 5% imbalance in
    order to trigger load balancing.
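
The knobs above are easiest to see side by side. Here is a minimal sketch of the
thresholds as described, using hypothetical names rather than scx_rusty's actual
identifiers:

```rust
// Knobs for the two balancing passes described above. All names are
// illustrative, not the identifiers scx_rusty actually uses.
const XNUMA_IMBAL_MIN: f64 = 0.17; // pass 1: across NUMA nodes
const DOM_IMBAL_MIN: f64 = 0.05;   // pass 2: domains within a node
const IMBAL_ADJ_RATIO: f64 = 0.50; // fraction of the imbalance we try to correct
const PUSH_MAX_RATIO: f64 = 0.50;  // cap on load pushed out of a single domain

/// How much load to migrate away from a node or domain whose load exceeds
/// `avg`, for the given pass. Returns 0.0 if the imbalance is below the
/// pass's threshold.
fn xfer_amount(load: f64, avg: f64, cross_numa: bool) -> f64 {
    let min_ratio = if cross_numa { XNUMA_IMBAL_MIN } else { DOM_IMBAL_MIN };
    let imbal = load - avg;
    if imbal <= avg * min_ratio {
        return 0.0;
    }
    (imbal * IMBAL_ADJ_RATIO).min(load * PUSH_MAX_RATIO)
}
```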

There are a few additional changes / improvements to load balancing in this
commit:

  1. NUMA nodes and domains are now ordered according to their load by using
    SortedVec objects. We were previously using a BTreeMap keyed by load, but
    that was suboptimal because it doesn't allow duplicate keys (see the sketch
    after this list).

  2. We're no longer exporting load balancing statistics as a vector of data such
    as load sums, averages, and imbalances. This is instead all encapsulated in
    the load balancing hierarchy we set up in lb.load_balance(). These statistics
    are not yet exported, but they will be in a subsequent commit.
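
To illustrate the duplicate-key problem from point 1 with a small, self-contained
example (fixed-point integer loads and plain std types here; scx_rusty's actual
types differ):

```rust
use std::collections::BTreeMap;

fn main() {
    // A BTreeMap keyed by load silently collapses two domains that happen
    // to have identical load into a single entry.
    let mut by_load: BTreeMap<u64, usize> = BTreeMap::new();
    by_load.insert(1024, 0); // domain 0, load 1024
    by_load.insert(1024, 1); // domain 1, same load: overwrites domain 0
    assert_eq!(by_load.len(), 1);

    // A load-sorted Vec (what a SortedVec-style container maintains for us)
    // keeps both entries.
    let mut sorted: Vec<(u64, usize)> = vec![(1024, 0)];
    let pos = sorted.partition_point(|&(load, _)| load <= 1024);
    sorted.insert(pos, (1024, 1));
    assert_eq!(sorted.len(), 2);
}
```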

One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

  1. make allyesconfig
  2. make -j $(nproc) built-in.a
  3. make -j clean
  4. goto 2

Runtime

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

A further experiment that could be interesting is tuning the threshold at
which we perform cross-NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
was very close, though rusty NUMA was a bit lower.

Byte-Lab requested review from arighi and htejun on March 8, 2024 at 18:33
rusty.rs is growing a bit unwieldy. We're going to want to update its load
balancing logic somewhat significantly to account for multi-NUMA and other cost
functions, so let's start cleaning the code up so that things are more
logically segmented and easier to work with.

To start, we move the Tuner and DomainGroup/Domain objects into their own
modules.

Signed-off-by: David Vernet <void@manifault.com>
More cleanup of scx_rusty. Let's move the LoadBalancer out of rusty.rs and into
its own file. It will soon be extended quite a bit to support multi-NUMA and
other multivariate LB cost functions, so it's time to clean things up and split
it out.

Signed-off-by: David Vernet <void@manifault.com>
Let's just query self.tuner.fully_utilized directly and save a few lines of
code.

Signed-off-by: David Vernet <void@manifault.com>
Right now, scx_rusty has no notion of domains spanning NUMA nodes, and it makes
no distinction between NUMA nodes when making load balancing or work stealing
decisions. This can cause problems on multi-NUMA machines, as load balancing and
work stealing across NUMA nodes have a significantly different cost than doing
so across L3 cache boundaries.

In order to better support multi-NUMA machines, this commit adds another layer
to the rusty load balancer, which balances across NUMA nodes using a different
cost function from balancing across domains. Load balancing now takes place
over the span of two passes:

1. In the first pass, we fix imbalances across NUMA nodes by moving tasks
   between domains across those NUMA node boundaries. We require a load
   imbalance of at least 17% in order to move load at this stage. The ratio of
   load imbalance we attempt to adjust (50%) and the maximum amount of load
   we're allowed to push out of a domain (50%) are still the same as when
   balancing between domains inside a NUMA node, but this is easy to tune with
   the current setup.

2. Once we've balanced across NUMA nodes, we iterate over all nodes and balance
   between the domains within each NUMA node. The cost function here is the
   same as what it has been thus far: we require at least a 5% imbalance in
   order to trigger load balancing.

There are a few additional changes / improvements to load balancing in this
commit:

1. NUMA nodes and domains are now ordered according to their load by using
   SortedVec objects. We were previously using a BTreeMap keyed by load, but
   that was suboptimal because it doesn't allow duplicate keys.

2. We're no longer exporting load balancing statistics as a vector of data such
   as load sums, averages, and imbalances. This is instead all encapsulated in
   the load balancing hierarchy we set up in lb.load_balance(). These statistics
   are not yet exported, but they will be in a subsequent commit.

One of the issues with this commit is that it does introduce some
almost-identical logic that somehow begs to be deduplicated. For example, when
we balance between NUMA nodes, the logic for iterating over push nodes and
pushing to pull nodes is very similar to the logic of iterating over push
domains and pull domains when balancing within a node. It may be that this can
be improved.

The following are some benchmarks run on an Intel Xeon Gold 6138 (2 x 40 core
processor):

kcompile
--------

On Commit a27648c74210 ("afs: Fix setting of mtime when creating a
file/dir/symlink"):

1. make allyesconfig
2. make -j $(nproc) built-in.a
3. make -j clean
4. goto 2

Runtime
-------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 566.085s  | -.6%     |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.72431   | -24.9%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 562.688s  | 563.209s  | -.092%   |
---------o-----------o-----------o----------o
Variance | 0.54387   | 0.42038   | 29.38%   |
---------o-----------o-----------o----------o

scx_rusty with NUMA awareness clearly beats CFS, but only barely beats
scx_rusty without it. This isn't necessarily super surprising given that
this is kcompile, which has very poor front-end CPU locality. Further
experimentation with toggling the cost function for performing
migrations may improve this further.

CPU util
--------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7551.67%  | 1.11%    |
---------o-----------o-----------o----------o
Variance | 165.35714 | 158.3333  | 4.436%   |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 7654.25%  | 7641.57%  | 0.1659%  |
---------o-----------o-----------o----------o
Variance | 165.35714 | 1230.619  | -86.5%   |
---------o-----------o-----------o----------o

As expected, CPU util is quite a bit higher with scx_rusty than it is
with CFS. Further experiments that could be interesting are always
enabling direct-greedy stealing between domains within a NUMA node, and
then comparing rusty NUMA and rusty ORIG. rusty NUMA prevents stealing
between NUMA nodes, so this would show whether the locality introduced
by NUMA awareness appropriately offsets the loss of work conservation.

Major PFs
---------

         o-----------o-----------o----------o
         | scx_rusty |     CFS   |   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 3950      | 36.566%  |
---------o-----------o-----------o----------o
Variance | 6975.5    | 5986.333  | 16.5237% |
---------o-----------o-----------o----------o

         o-----------o-----------o----------o
         | rusty NUMA| rusty ORIG|   Delta  |
---------o-----------o-----------o----------o
Mean     | 5332      | 5336.5    | -.084%   |
---------o-----------o-----------o----------o
Variance | 6975.5    | 955.5     | 630.03%  |
---------o-----------o-----------o----------o

Also as expected, major page faults are far higher with scx_rusty
than with CFS. This is expected even with NUMA awareness, given that
scx_rusty is still less sticky than CFS.

A further experiment that could be interesting is tuning the threshold at
which we perform cross-NUMA migrations to try and keep this value even
lower. The rate of major page faults between rusty NUMA and rusty ORIG
was very close, though rusty NUMA was a bit lower.

Signed-off-by: David Vernet <void@manifault.com>
The cpumask print formatter doesn't look great in its current form, which uses
the BitVec formatter under the hood:

[INFO] NUMA[00 32:<[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]>]
[INFO]         DOM[00] 32:<[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]>
[INFO]         DOM[01] 32:<[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]>

Let's just iterate over the mask and manually format the string using the
binary formatter over the slice of u64's, which renders like this:

[INFO] NUMA[00] 0b11111111111111111111111111111111]
[INFO]         DOM[00] 0b00000000111111110000000011111111
[INFO]         DOM[01] 0b11111111000000001111111100000000
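
A minimal sketch of this kind of manual formatting, assuming the mask is stored
as u64 words with bit 0 of the first word being CPU 0 (a hypothetical helper,
not the scheduler's exact code):

```rust
/// Render a cpumask stored as u64 words as a binary string like the output
/// above. Assumes nr_cpus <= 64 * words.len().
fn format_cpumask(words: &[u64], nr_cpus: usize) -> String {
    let mut bits = String::new();
    // Highest word first so CPU 0 ends up as the rightmost bit.
    for word in words.iter().rev() {
        bits.push_str(&format!("{:064b}", word));
    }
    // Keep only the low nr_cpus bits.
    format!("0b{}", &bits[bits.len() - nr_cpus..])
}

fn main() {
    // Two 8-CPU domains interleaved across a 32-CPU node, as in the example.
    println!("{}", format_cpumask(&[0x00ff00ff], 32));
    // => 0b00000000111111110000000011111111
}
```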

Signed-off-by: David Vernet <void@manifault.com>
scx_rusty currently pushes tasks to idle cores if the direct greedy threshold
is exceeded, even if the core is on a remote NUMA node. This behavior is
probably not desired in most scenarios. The worst that will happen if a task is
pushed to an idle core in the same node is some L3 cache miss traffic, but for
multiple NUMA nodes, it could cause the task to have its working set span
multiple nodes.

Let's disable direct greedy work stealing across NUMA nodes. A future commit
will add a flag that's disabled by default, and lets users turn this on if
they really want to encourage work conservation.

Signed-off-by: David Vernet <void@manifault.com>
Users may want to toggle whether tasks can be temporarily sent to idle CPUs on
remote NUMA nodes. By default, we want it to be disabled as a task spanning
multiple NUMA nodes will end up having its working set spanning both nodes,
which is probably not desirable. However, in case a workload really wants to
encourage work conservation, let's add a flag that allows them to toggle it.

Signed-off-by: David Vernet <void@manifault.com>
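
As a sketch of what such an opt-in looks like with clap's derive API (the flag
name below is illustrative, not necessarily the exact option scx_rusty exposes):

```rust
use clap::Parser;

/// Illustrative subset of a scheduler's command-line options.
#[derive(Debug, Parser)]
struct Opts {
    /// Allow direct greedy placement onto idle CPUs in remote NUMA nodes.
    /// Off by default so a task's working set stays on one node; turn it on
    /// only if work conservation matters more than locality.
    #[clap(long)]
    direct_greedy_numa: bool,
}

fn main() {
    let opts = Opts::parse();
    println!("cross-NUMA direct greedy: {}", opts.direct_greedy_numa);
}
```
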
The scx_rusty load balancer no longer exports statistics such as domain load
averages, load sums, etc. Now that we're also balancing by NUMA, we'll need a
way to hierarchically illustrate load balancing statistics. This patch adds
support for that.

Signed-off-by: David Vernet <void@manifault.com>

updating stats printing

Signed-off-by: David Vernet <void@manifault.com>
We removed the debug!() output that was previously present in main.rs. Let's
add more debug!() output that helps debug the current LB hierarchy.

Signed-off-by: David Vernet <void@manifault.com>
The current topology.rs crate assumes that all cores have unique core
IDs in a system. This need not be the case, such as in certain Intel
Xeon processors which reuse core IDs in different NUMA nodes. Let's
update the crate to assume unique core IDs only per socket.

Signed-off-by: David Vernet <void@manifault.com>
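
A sketch of the keying change (type and field names are hypothetical): index
cores by (node id, core id) instead of core id alone, so core IDs that repeat
across sockets no longer collide.

```rust
use std::collections::BTreeMap;

struct Core {
    cpus: Vec<usize>,
}

/// Cores keyed by (node_id, core_id): a core_id reused on another NUMA
/// node/socket maps to a distinct entry rather than overwriting this one.
fn add_cpu(
    cores: &mut BTreeMap<(usize, usize), Core>,
    node_id: usize,
    core_id: usize,
    cpu: usize,
) {
    cores
        .entry((node_id, core_id))
        .or_insert_with(|| Core { cpus: Vec::new() })
        .cpus
        .push(cpu);
}
```
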
Byte-Lab force-pushed the multi_numa_rusty branch 2 times, most recently from e481eea to 8a8438f on March 12, 2024 at 03:51
In scx_rusty, a CPU that is going to go idle will attempt to steal tasks
from remote domains when its domain has no tasks to run, and a remote
domain has at least greedy_threshold enqueued tasks. This stealing is
temporary, but of course has a cost in that the CPU that's stealing the
task may cause it to suffer from cache misses, or in the case of
multi-node machines, remote NUMA accesses and working sets split across
multiple domains.

Given the higher cost of cross-NUMA work stealing, let's add a separate flag
that lets users tune the threshold for doing cross-NUMA greedy task
stealing.

Signed-off-by: David Vernet <void@manifault.com>
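
A sketch of the resulting steal check, with hypothetical names: the stricter
cross-NUMA threshold only applies when the victim domain sits on a different
node than the idling CPU.

```rust
/// Decide whether an about-to-idle CPU may greedily steal from a remote
/// domain that currently has `nr_queued` runnable tasks.
fn can_greedy_steal(
    nr_queued: u64,
    same_numa_node: bool,
    greedy_threshold: u64,
    greedy_threshold_x_numa: u64,
) -> bool {
    let thresh = if same_numa_node {
        greedy_threshold
    } else {
        greedy_threshold_x_numa
    };
    // In this sketch, a threshold of 0 is treated as "never steal" for
    // that class of domain.
    thresh > 0 && nr_queued >= thresh
}
```
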
Tejun pointed out that a possible issue exists in the current
implementation, wherein if you have two NUMA nodes that are imbalanced,
but their domains are internally balanced, we'll fail to migrate between
them if all nodes are in the balanced_nodes list.

To address this, let's just use a single global list for all types of
domains, and do checking internally for imbalances. The reason it was
done this way in the first place was to allow me to mutably iterate over
both vectors in a nested loop. The way around that is to just use loop
{} and push/pop domains from the list.

We could do the same thing for the NUMA nodes themselves, which are also
in 3 separate lists in the LoadBalancer. We'll do that in a subsequent
commit.

Signed-off-by: David Vernet <void@manifault.com>
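
A sketch of the loop/push/pop pattern with hypothetical types and numbers:
popping the push candidate out of the single sorted list sidesteps the need for
two overlapping mutable borrows into it.

```rust
#[derive(Debug)]
struct Domain {
    load: f64,
}

/// Repeatedly move half of the largest imbalance from the most-loaded to the
/// least-loaded domain until the imbalance falls below `threshold` (a
/// fraction of the average load `avg`).
fn balance(doms: &mut Vec<Domain>, avg: f64, threshold: f64) {
    loop {
        // Keep ascending load order so the candidates sit at the ends.
        doms.sort_by(|a, b| a.load.partial_cmp(&b.load).unwrap());
        if doms.len() < 2 {
            break;
        }
        let imbal = doms.last().unwrap().load - avg;
        if imbal <= avg * threshold {
            break;
        }
        // Pop the push domain off the list so we can mutate it and the pull
        // domain without holding two mutable borrows into `doms` at once.
        let mut push = doms.pop().unwrap();
        let pull = doms.first_mut().unwrap();
        let xfer = imbal * 0.5;
        push.load -= xfer;
        pull.load += xfer;
        doms.push(push);
    }
}

fn main() {
    let mut doms = vec![Domain { load: 130.0 }, Domain { load: 70.0 }];
    balance(&mut doms, 100.0, 0.05);
    println!("{:?}", doms); // => loads 96.25 and 103.75, within 5% of avg
}
```
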
As Tejun pointed out in review, the disadvantage of using
push/pull/balanced lists is that if the domains inside the nodes are
balanced, we won't be able to push load between them. I'd originally
done it that way both as an optimization and to allow me to iterate
over the lists of pushable and pullable domains mutably. That was
addressed in the prior commit, but the nodes themselves were still
put into 3 buckets.

I think this is generally just a cleaner way of doing things, so let's
just collapse the nodes into a flat list as well. This prevents us from
having to coalesce the lists, std::mem::swap them, etc.

Signed-off-by: David Vernet <void@manifault.com>
Byte-Lab (Contributor, Author) commented Mar 12, 2024

@htejun / @arighi Addressed your feedback items. I also added another flag to enable tuning a cross-NUMA greedy task stealing threshold. Longer term I want to make scx_rusty automatically adjust these settings, but for now it's a knob that users can tune while running experiments.

Fixing alignment, moving a couple bail! calls around, and adding a
missing break from move_between_nodes() that lets us bail out of a loop
early.

Signed-off-by: David Vernet <void@manifault.com>
Given the complexity of migrating load between nodes (we're doing four
nested loops), we should add a comment explaining what we're doing. This
commit does that. In addition, we use a VecDeque to store (and then
restore) push nodes and push domains so that we can re-add them to their
respective lists in load-sorted order rather than reverse-load-sorted
order. This allows us to avoid having to do unnecessary right-shifts
every time a push object is re-added to its containing list.

Signed-off-by: David Vernet <void@manifault.com>
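
A sketch of the VecDeque bookkeeping, simplified to bare load values: push
candidates come off the back of an ascending, load-sorted list, so stashing
them with push_front() keeps the stash ascending and lets us re-append them
with a single extend(), with no right-shifts.

```rust
use std::collections::VecDeque;

/// `sorted_loads` is kept in ascending order; push candidates are popped
/// off the back (highest load first).
fn drain_and_restore(sorted_loads: &mut Vec<u64>) {
    let mut stash: VecDeque<u64> = VecDeque::new();
    while let Some(load) = sorted_loads.pop() {
        // ... attempt to push load out of this entry here ...
        // Front-insertion keeps the stash in ascending load order.
        stash.push_front(load);
    }
    // Re-add in load-sorted order with a single append.
    sorted_loads.extend(stash);
}

fn main() {
    let mut loads = vec![10, 20, 30];
    drain_and_restore(&mut loads);
    assert_eq!(loads, vec![10, 20, 30]); // order preserved without re-sorting
}
```
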
Byte-Lab (Contributor, Author)

Ok, cleaned the code up a bit more and added some comments to explain what we're doing. I'm going to merge this as is so we can iterate on future changes.

Byte-Lab merged commit 91cb5ce into main on Mar 12, 2024
1 check passed
Byte-Lab deleted the multi_numa_rusty branch on March 12, 2024 at 20:50