
Use MPMC bounded queue for group freelist #1472

Merged
merged 13 commits into snabbco:master on Jul 4, 2022

Conversation

@eugeneia (Member) commented on Mar 7, 2022

This addresses the group freelist contention issue by employing a smarter data structure for the group freelist: an MPMC bounded queue.
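The description only names the data structure, so here is a minimal C11 sketch of a bounded MPMC queue in the style of Dmitry Vyukov's well-known design, which is the usual shape for this kind of structure. This is illustration only, not the Snabb implementation, and the Snabb code may differ in detail.

```c
/* Illustrative bounded MPMC queue (Vyukov-style), C11 atomics.
 * Not the Snabb code; a sketch of the data structure the PR refers to.
 * Capacity must be a power of two. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct cell {
    atomic_size_t sequence;   /* per-slot ticket coordinating producers/consumers */
    void *data;               /* the enqueued item, e.g. a packet pointer */
};

struct mpmc_queue {
    size_t mask;              /* capacity - 1 */
    struct cell *cells;
    atomic_size_t enqueue_pos;
    atomic_size_t dequeue_pos;
};

static bool mpmc_init(struct mpmc_queue *q, size_t capacity)
{
    if (capacity < 2 || (capacity & (capacity - 1)) != 0) return false;
    q->cells = calloc(capacity, sizeof(struct cell));
    if (q->cells == NULL) return false;
    q->mask = capacity - 1;
    for (size_t i = 0; i < capacity; i++)
        atomic_store_explicit(&q->cells[i].sequence, i, memory_order_relaxed);
    atomic_store_explicit(&q->enqueue_pos, 0, memory_order_relaxed);
    atomic_store_explicit(&q->dequeue_pos, 0, memory_order_relaxed);
    return true;
}

/* Non-blocking: touches at most one slot, returns false if full. */
static bool mpmc_enqueue(struct mpmc_queue *q, void *data)
{
    size_t pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
    for (;;) {
        struct cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->sequence, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is free for this ticket: claim it, then publish. */
            if (atomic_compare_exchange_weak_explicit(&q->enqueue_pos, &pos, pos + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed)) {
                cell->data = data;
                atomic_store_explicit(&cell->sequence, pos + 1, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;                 /* queue full */
        } else {
            pos = atomic_load_explicit(&q->enqueue_pos, memory_order_relaxed);
        }
    }
}

/* Non-blocking: returns false if empty. */
static bool mpmc_dequeue(struct mpmc_queue *q, void **data)
{
    size_t pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
    for (;;) {
        struct cell *cell = &q->cells[pos & q->mask];
        size_t seq = atomic_load_explicit(&cell->sequence, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)(pos + 1);
        if (diff == 0) {
            if (atomic_compare_exchange_weak_explicit(&q->dequeue_pos, &pos, pos + 1,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed)) {
                *data = cell->data;
                /* Mark the slot reusable for the next wrap-around. */
                atomic_store_explicit(&cell->sequence, pos + q->mask + 1, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;                 /* queue empty */
        } else {
            pos = atomic_load_explicit(&q->dequeue_pos, memory_order_relaxed);
        }
    }
}
```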

It also moves the rebalancing housekeeping into packet.free, and puts an upper bound on the work performed in any single rebalance/reclaim step.
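As a hypothetical sketch of what bounded rebalancing hooked into packet free can look like (the helper names, thresholds, and layout below are made up for illustration and are not the Snabb API; mpmc_queue is the one sketched above):

```c
/* Hypothetical sketch only: when the local freelist grows past a
 * high-water mark, spill at most REBALANCE_STEP packets per free()
 * into the shared group freelist, so each call does bounded work. */
#define LOCAL_CAPACITY   4096
#define LOCAL_HIGH_WATER 2048
#define REBALANCE_STEP   32

struct local_freelist {
    void *packets[LOCAL_CAPACITY];
    size_t n;
};

static void packet_free_sketch(struct local_freelist *local,
                               struct mpmc_queue *group, void *packet)
{
    local->packets[local->n++] = packet;          /* (overflow handling omitted) */
    if (local->n > LOCAL_HIGH_WATER) {
        /* Bounded rebalance: move at most REBALANCE_STEP packets,
         * and stop early if the group freelist is full. */
        for (int i = 0; i < REBALANCE_STEP && local->n > 0; i++) {
            if (!mpmc_enqueue(group, local->packets[local->n - 1])) break;
            local->n--;
        }
    }
}
```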

In the density plots below, master is in red and this branch is in blue (green was an alternative WIP branch).

[Figure: branches-latency, latency density plots for master vs. this branch]

This certainly improves latency, and thus the performance of interlinks; however, it does not make interlinks scale without restriction. My benchmarking seems to show that sharing memory between cores still becomes a bottleneck, and depending on your CPU architecture you will run into it sooner or later. I compared results between Intel and EPYC machines and they are quite different, but for now I am going to focus on EPYC as an example.

[Figure: nreceivers-latency, latency by number of receivers for a single transmitter]

In the plot above we compare latencies by number of receivers for a single transmitter, and we see a significant blowup in latency once we go beyond 1 transmitter + 5 receivers. Now why is that?

If we look at the topology of our CPU as reported by AMD uProf we can get a hint:

CPU Topology:
Socket, CCD, Core(s)
0,0, 0 1 2 3 4 5
0,1, 6 7 8 9 10 11
0,2, 12 13 14 15 16 17
0,3, 18 19 20 21 22 23

Each CCD spans six CPU cores. So while our workload fits on a single CCD we get OK performance (~60 Mpps), but as soon as we add a receiver running on a distinct CCD, latency and performance tank (~10 Mpps).
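For context on how such a placement is arranged, here is a standalone Linux sketch that pins a process to the cores of CCD 0 (cores 0-5 in the AMD uProf output above) via CPU affinity. Snabb has its own core-binding configuration; this is only meant to illustrate the kind of placement these measurements depend on.

```c
/* Illustration only: pin the calling process to cores 0-5 (CCD 0 in the
 * topology above) using Linux CPU affinity. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu <= 5; cpu++) {  /* cores of CCD 0 per AMD uProf */
        CPU_SET(cpu, &set);
    }
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CCD 0 (cores 0-5)\n");
    return 0;
}
```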

[Figure: Screenshot 2022-02-21 151358, EPYC CPU architecture/topology diagram]

The above diagram of the CPU's architecture/topology gives some hints. Each CCD houses two CCXs, and the cores in a CCX share an L3 cache. I am assuming the CCXs in a CCD also have a faster interconnect to each other than to a CCX in a remote CCD?

Anyway, if we look at some PMU counters using AMD uProf we can maybe see why a workload distributed across CCDs fares worse (take this with a grain of salt, this is me reading the tea leaves):

  • workloads that fit a single CCD can fetch data from the shared L3 or from other L2 caches in the same CCD or CCX(?): DCFillsFromL3orDiffL2 is higher
  • they incur fewer L2DtlbMisses, and have to perform fewer DCFillsFromLocalMemory
  • hence they can retire more instructions per cycle

[Figure: pmu_ccd, PMU counter comparison for workloads within vs. across CCDs]

@eugeneia added the merged label Apr 7, 2022
@eugeneia linked an issue Jun 22, 2022 that may be closed by this pull request
@eugeneia merged commit 2ff0924 into snabbco:master Jul 4, 2022
Successfully merging this pull request may close these issues.

Group freelist contention