
How much bandwidth does the L2 have to give, anyway?


If your core computational loop primarily depends on the data resident in the L2 cache, you might be curious how fast this layer can actually deliver the bytes you need. Choosing the L2 as the target for cache blocking is common in HPC codes since the L1 is often too small, and the L3 is slow and shared between all cores on a chip.

In Table 2-5 of the optimization manual, Intel quotes the Skylake client L2 cache as having a "Peak Bandwidth" of 64 bytes/cycle and a "Sustained Bandwidth" of about 29 bytes/cycle.

Earlier versions of that document only included the 64-byte figure, but this led to some complaints that it wasn't achievable. Indeed, just because 64 bytes can travel between the L1 and L2 every cycle doesn't mean you can get 64 bytes into registers every cycle so you can operate on them! In this scenario the limitation is actually the L1D cache: it can't simultaneously service two 32-byte read requests from the core and accept a 64-byte line from the L2 cache[1].

So the ~29-byte figure probably reflects 1 cycle to accept a 64-byte line from the L2, and then 1 cycle to do the two 32-byte reads. That would lead to an ideal bandwidth of 32 bytes/cycle; you end up with slightly less (~29), probably due to interference from the L2 streaming prefetcher.
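
As a quick sanity check, converting between cycles per line and bytes per cycle:

64 bytes/line / 2 cycles/line   = 32 bytes/cycle     (ideal: one fill cycle + one read cycle)
64 bytes/line / ~29 bytes/cycle ≈ 2.2 cycles/line    (Intel's sustained figure)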

Can we do better? In fact we can: on Skylake client we will derive a specific read pattern that gets to about 42 bytes/cycle sustained over moderate to large regions, a 45% improvement on Intel's 29 bytes/cycle figure.

Let's try to verify the numbers from the Intel guide with a quick test. We can test a straightforward linear read with ./uarch-bench.sh --test-name=memory/bandwidth/*normal* which just iterates linearly over the data, reading 32 bytes at a time (and adding together all the byte values using vpaddb):

vpaddb ymm0,ymm0,YMMWORD PTR [rcx]       ; cache line 0
vpaddb ymm1,ymm1,YMMWORD PTR [rcx+0x20]  ; cache line 0
vpaddb ymm0,ymm0,YMMWORD PTR [rcx+0x40]  ; cache line 1
vpaddb ymm1,ymm1,YMMWORD PTR [rcx+0x60]  ; cache line 1
vpaddb ymm0,ymm0,YMMWORD PTR [rcx+0x80]  ; cache line 2
vpaddb ymm1,ymm1,YMMWORD PTR [rcx+0xa0]  ; cache line 2
vpaddb ymm0,ymm0,YMMWORD PTR [rcx+0xc0]  ; cache line 3
; etc
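
For readers who prefer C, here's a rough intrinsics sketch of the same loop - illustrative only, with invented names; the actual benchmark kernels are written in assembly as shown above.

#include <immintrin.h>
#include <stddef.h>

/* Sketch of the linear bandwidth kernel: two 32-byte loads per 64-byte
   cache line, summed with byte adds (vpaddb). Assumes buf is 64-byte
   aligned and len is a multiple of 64. Compile with -mavx2. */
static __m256i sum_linear(const char *buf, size_t len) {
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    for (size_t i = 0; i < len; i += 64) {
        acc0 = _mm256_add_epi8(acc0,
            _mm256_load_si256((const __m256i *)(buf + i)));
        acc1 = _mm256_add_epi8(acc1,
            _mm256_load_si256((const __m256i *)(buf + i + 32)));
    }
    return _mm256_add_epi8(acc0, acc1);
}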

We get results like:

** Running group memory/bandwidth : Linear AVX2 loads **
                               Benchmark    Cycles     Nanos
                  8-KiB linear bandwidth      1.02      0.39
                 16-KiB linear bandwidth      1.03      0.40
                 32-KiB linear bandwidth      1.17      0.45
                 54-KiB linear bandwidth      2.15      0.83
                 64-KiB linear bandwidth      2.08      0.80
                128-KiB linear bandwidth      2.07      0.80
                256-KiB linear bandwidth      2.11      0.81
                512-KiB linear bandwidth      3.60      1.39

The numbers above are all "per 64-byte cache line". So while the region is smaller than the 32 KiB L1, we get about 1 cycle per line, exactly as expected since Skylake has two load units. In the L2 range, we see an average of about 2.1 cycles per line, which works out to ~30.5 bytes per cycle, slightly better than the sustained bandwidth published by Intel.

Let's turn off all the prefetchers with wrmsr -a 0x1a4 "$((2#1111))", and try again:

** Running group memory/bandwidth : Linear AVX2 loads **
                               Benchmark    Cycles     Nanos
                  8-KiB linear bandwidth      1.02      0.39
                 16-KiB linear bandwidth      1.03      0.40
                 32-KiB linear bandwidth      1.16      0.45
                 54-KiB linear bandwidth      2.02      0.78
                 64-KiB linear bandwidth      2.01      0.78
                128-KiB linear bandwidth      2.00      0.77
                256-KiB linear bandwidth      2.01      0.77
                512-KiB linear bandwidth      4.18      1.61

Now, for the 128 KiB size which fits squarely in L2, we get exactly 2.00 cycles per line, or 32 bytes/cycle. So we can be pretty sure one of the prefetchers was competing with our accesses and causing the ~5% slowdown (hint: it's the L2 streamer, which you can confirm by turning off only that one: wrmsr -a 0x1a4 "$((2#0001))"). From now on we'll leave the prefetchers off, to better isolate the L1/L2 behavior (we'll return to this topic at the end, since we want to know what happens in the usual case where the prefetchers are on).

Can we do better than one line every two cycles?

Our theory is that the L1D is the bottleneck: it takes 1 cycle to receive a line from the L2, and then 1 cycle to do the two reads of the new data from the L1[2]. For example, you can find a similar explanation here. Under this theory, even if we access each line only once (a single read per line), we'll still take more than one cycle per line. Let's try that with ./uarch-bench.sh --test-name=memory/load-parallel/parallel-load-128:

** Running group memory/load-parallel : Parallel loads from fixed-size regions **
                               Benchmark    Cycles     Nanos
                   128-KiB parallel load      1.00      0.39

This benchmark performs a series of randomized loads[3], one per cache line, and reports the time per line. On Skylake client we get exactly 1.00 cycles per line. So in fact the L1D and L2 are capable of coordinating in such a way that the L1D can accept a line from the L2, service the associated load request for that line from the core, and kick off a new L1D miss for the most recent load, all within a single cycle.
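
The exact kernel lives in the uarch-bench sources; a simplified C sketch of the access pattern might look like this (illustrative only, with invented names; order is assumed to hold a pre-shuffled permutation of the line indices):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* One independent 32-byte load per 64-byte line, visiting lines in a
   pre-shuffled order so consecutive loads don't depend on each other
   (measuring throughput, not latency). Assumes buf is 64-byte aligned. */
static __m256i touch_each_line(const char *buf, const uint32_t *order, size_t nlines) {
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < nlines; i++) {
        const char *line = buf + (size_t)order[i] * 64;
        acc = _mm256_add_epi8(acc, _mm256_load_si256((const __m256i *)line));
    }
    return acc;
}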

In some sense, we are now achieving a bandwidth of 64 bytes from the L2, since we know that a full cache line is coming in every cycle. In a more real sense, we aren't, since we only make a single read of each line, which means at most 32 bytes if we do a full-width 256-bit AVX read. So we need a second read to really get to 64 bytes of consumed bandwidth per line. What happens then?

We can run this test with ./uarch-bench.sh --test-name=memory/bandwidth/load/bandwidth-normal-128, which is even simpler than the previous (randomized) test: it's just a linear read, 32 bytes at a time. Results:

** Running group memory/bandwidth : Linear AVX2 loads **
                               Benchmark    Cycles     Nanos
                128-KiB linear bandwidth      2.00      0.78

So as soon as we add a second load of each cache line, everything takes twice as long. In other words, the second read costs a full cycle, even though it doesn't itself require an additional line from the L2. In principle, we should be able to do these "additional" reads two per cycle, since they are essentially L1 hits. Of course, we can imagine that the second read of each line isn't actually handled as efficiently as an L1 hit when we scan through the region linearly: both the first and second reads miss (to the same line), so the second will be less efficient than a true L1 hit (for example, the load has to probe the L1, determine that it missed, wait for the data to arrive from the L2, and then access the L1 again or read the value off the bypass network).

So maybe we can make this work by waiting to do the second read: read the first half of each line (which we fully expect to take a full cycle), then come back "later" to read the second half once it is sitting in the L1, so we take the efficient L1-hit path which we expect to cost 0.5 cycles. In total then, we might expect to fully read one line in 1.5 cycles. Let's try it with a loop that does two reads per iteration: one from the first half of a cache line, and one from the second half, where the second-half read "trails behind" the first by a certain number of cache lines. Like so:

vpaddb ymm0, ymm0, [rcx + offset + UNROLLB * 64]
vpaddb ymm1, ymm1, [rcx + offset + 32]

Here, UNROLLB controls how far ahead (in cache lines) the leading read is compared to the trailing read. We also need "lead-in" and "lead-out" portions outside the main loop: the initial UNROLLB first-half reads before the loop, and the same number of second-half reads after it, so that we still read every line - but the impact of these portions is small.
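
In C, the overall structure might be sketched like this (a sketch only, with invented names and an arbitrary distance; not the actual uarch-bench code):

#include <immintrin.h>
#include <stddef.h>

/* Distance, in cache lines, between the leading and trailing reads.
   8 is an arbitrary example value; the experiment below sweeps this. */
#define UNROLLB 8

/* Leading reads touch the first 32 bytes of a line UNROLLB lines ahead;
   trailing reads pick up the second 32 bytes of lines that should by then
   be sitting in L1. Assumes buf is 64-byte aligned and nlines > UNROLLB. */
static __m256i sum_lead_trail(const char *buf, size_t nlines) {
    __m256i lead  = _mm256_setzero_si256();
    __m256i trail = _mm256_setzero_si256();
    size_t i;
    /* lead-in: the first UNROLLB first-half reads, no trailing partner yet */
    for (i = 0; i < UNROLLB; i++)
        lead = _mm256_add_epi8(lead,
            _mm256_load_si256((const __m256i *)(buf + i * 64)));
    /* main loop: one leading read UNROLLB lines ahead, one trailing read */
    for (i = 0; i + UNROLLB < nlines; i++) {
        lead = _mm256_add_epi8(lead,
            _mm256_load_si256((const __m256i *)(buf + (i + UNROLLB) * 64)));
        trail = _mm256_add_epi8(trail,
            _mm256_load_si256((const __m256i *)(buf + i * 64 + 32)));
    }
    /* lead-out: the remaining UNROLLB second-half reads */
    for (; i < nlines; i++)
        trail = _mm256_add_epi8(trail,
            _mm256_load_si256((const __m256i *)(buf + i * 64 + 32)));
    return _mm256_add_epi8(lead, trail);
}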

We don't actually know how many lines ahead we need to run to see any impact (if indeed there is any), so let's just test a bunch. The chart below shows the per-cache-line performance of the above code, for various offsets (the x axis) between the leading and trailing loads.

It's an abject failure: regardless of the offset, the performance is the same, flat at just a fraction above 2 cycles per line.

The code we just tried interleaves the leading and trailing loads one-by-one. Since we expect the leading loads to miss and the trailing loads to hit, it seems plausible that this pattern results in one cycle servicing the L1 miss (the leading load), and then in the next cycle we could service two L1 hits - but only one is available in source order because we interleave one-by-one, so we service just that single L1 hit in that cycle. Rinse and repeat. For maximum performance we want to service two L1 hits in any cycle where we service any, so let's try unrolling the loop by two, so we can interleave the leading and trailing loads two-by-two, like so:

vpaddb ymm0, ymm0, [rcx + offset + UNROLLB * 64]
vpaddb ymm0, ymm0, [rcx + offset + UNROLLB * 64 + 64]
vpaddb ymm1, ymm1, [rcx + offset + 32]
vpaddb ymm1, ymm1, [rcx + offset + 96]

If all goes well, we hope that the two trailing loads can "pair up" and both hit in L1 in the same cycle. Each of the two leading loads will still take a cycle, since each misses in L1 and brings in a new line from the L2. So at best we hope to read two lines in three cycles. Here's the result:

Very nice! Once the leading reads run about 7 lines ahead of the trailing reads, we drop down to about 1.52 cycles per line, very close to the limit of 1.5 cycles. That's about 42 bytes per cycle sustained, much better than the ~29 bytes per cycle Intel publishes as the limit on Skylake. Note that we aren't just reading the memory and then dropping it on the floor: we are accumulating the sum into ymm registers. This indicates that you can probably achieve this throughput in real-world kernels that are blocked in L2.
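
Working out the numbers:

64 bytes/line / 1.52 cycles/line ≈ 42 bytes/cycle
42 / 29 ≈ 1.45, i.e. roughly a 45% improvement over the quoted sustained bandwidth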

I thought this was pretty cool, and presumed it would apply to earlier architectures as well, since I hadn't seen any big documented improvement in the L2-to-core path in Skylake. So I tried it on Haswell. Nope. Even with only a single read per cache line (not even trying to read the rest of the line), Broadwell and Haswell appear capped at 2 cycles per line. So the maximum bandwidth between the L2 and the core seems to be 32 bytes/cycle on those architectures, and something really did change in Skylake (perhaps in preparation for AVX-512 in Skylake-SP). Unfortunately this limits the trick to Skylake and subsequent Skylake-derived architectures (Kaby Lake, Coffee Lake, Cascade Lake, etc.).

Prefetching

Everything above has been with all four prefetchers turned off. In the real world, we are probably running with them on, almost all of the time. How does that change the results?

Let's take a look at the final chart, which includes the result shown above with prefetching off, the same algorithm with prefetching on (2-wide-pfon), and the default "linear" algorithm with prefetching on (linear-pfon) and off (linear). The A and B variants are just two runs with the same parameters, to give you an idea of what's noise and what is a reproducible feature[4].

Prefetching Comparison

The first thing we notice is that all the runs with prefetching are generally much noisier. For our new optimized algorithm, prefetching also somewhat reduces performance: for moderate unroll factors the time per line bounces around between 1.6 and 1.7, versus the 1.5 we were getting with prefetching off. So the technique still helps, but you throw away perhaps a third of the benefit.

The basic problem is that even though almost no actual prefetching occurs (since the test data is entirely resident in L2), the L2 prefetcher is still triggered by the access pattern and hence still accesses the L2 to check whether the lines identified as prefetch candidates are present. They always are, so the prefetches stop there, but these lookups compete with the core's frequent accesses to the L2, occasionally delaying things by a cycle.

This same effect presumably occurs in the linear access case as well, but in this case everything is already proceeding at a slower rate, so there are more gaps in L2 and L1 accesses to absorb the interference by the L2 prefetchers.

Turning the four prefetchers on one-at-a-time reveals that the only prefetcher that interferes with the access pattern is the L2 streamer. Unfortunately, this is the most important prefetcher and what people are usually talking about when they say "the prefetcher": it is the only one that runs far ahead of the request stream making requests all the way to main memory if needed. So turning it off isn't generally a good idea, unless somehow your workload really, really wants optimized L2 access.

Future Investigation

Some areas that would make for interesting future investigation include:

  • Checking if there is some way to "locally" defeat the L2 prefetcher so that it doesn't compete with demand accesses to the L2 cache. For example, some type of access pattern that doesn't trigger the prefetcher.
  • Determining if there are benefits to the leading/trailing pattern for smaller L2-resident regions. This test used a 128-KiB region, or half the size of the L2, so the edge effects were not very large. Evidently the leading/trailing pattern also works for shorter regions, but how short? Is it viable for a copy of a few hundred bytes? A few thousand?
  • Determining the impact of the leading/trailing pattern when the data is L1-resident, L3-resident or only in main memory. Here we've shown that this pattern is very effective for L2-resident data, but the effects that make it work are absent (for example) for data already in L1. Usually only the few types of HPC applications that specifically block for the L2 have the privilege of knowing their data is L2-resident, so for everyone else we'd like to check that using this pattern doesn't slow things down (much) when the data lives elsewhere.
  • What happens when writes are mixed in?
  • What happens when the accesses are not aligned?

This post was inspired by a discussion on a thread on the Real World Technologies (RWT) forum, where I had found a weird effect: if two loads that hit in L2 issue in the same cycle, the second load has a larger than expected latency of 17-19 cycles (similar to the combined latencies of an L2 hit and an L1 hit). This gave me the idea that there might be a special path that provides the data from an L2 hit both to the L1 cache and to the dependent operation in the core in the same cycle, so that the core doesn't need to access the L1 at all. This path only supports a single load, so two loads issued in the same cycle conflict and the second one has to retry, incurring a penalty similar to an L1 hit. Taking advantage of this direct L2-to-core path means that a pattern of 2x L2-hitting loads followed by 2x L1-hitting loads should make the most of this behavior and of the 2-loads-per-cycle capacity of the L1 cache, and this turned out to be correct.

Questions or comments can be directed to travis.downs@gmail.com.

I would like to thank Nathan Kurz, who provided valuable feedback on this article, and Daniel Lemire, who provided access so I could test the behavior on hardware other than my laptop.


[1] ... and possibly also serving a tag check for a read that will miss and need to allocate a miss handling buffer (fill buffer).

[2] Of course, the fill from the L2 and the subsequent access of the data from the L1 don't happen in back-to-back cycles: it takes a dozen or so cycles for the line to arrive from the L2, so the associated reads are satisfied many cycles after the initial request. In a loop with many reads, however, you reach a steady state where you are receiving lines requested by some previous iteration and satisfying reads from an even earlier one, so we can mostly ignore the actual latencies here: Intel CPUs have enough MLP to largely hide the L2 latency.

[3] In this case, "parallel" means that the loads do not depend on one another, so they can potentially overlap in execution. memcpy is an example of obviously parallel loads (and stores), while pointer chasing would be the poster child for serial loads. In this way, we measure load throughput rather than load latency.
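
To illustrate the distinction, here is a toy C sketch (not benchmark code):

#include <stddef.h>
#include <stdint.h>

/* Parallel loads: each address depends only on the loop counter, so the
   loads are independent and can overlap (throughput-bound). */
static uint64_t sum_parallel(const uint64_t *a, size_t n) {
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Serial loads: each address comes from the previous load (pointer
   chasing), so the loads form a dependency chain (latency-bound). */
struct node { struct node *next; };
static size_t chase(const struct node *p) {
    size_t hops = 0;
    while (p) {
        p = p->next;
        hops++;
    }
    return hops;
}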

[4] Note, for example, that the linearA and linearB results overlap almost exactly (you can't see linearB unless you zoom in and look for brown pixels, because linearA is almost completely on top of it). So that particular curvy pattern is in some sense "real", and is probably some execution artifact related to the unrolling.
