
Memory Disambiguation on Skylake


Here you'll find a description of the memory disambiguation behavior of the Intel Skylake processor, based on a reverse engineering using micro-benchmarks written in uarch-bench, as well as peeking at some patents.

While this is only definitely applicable to Skylake (client), it is likely that the same or similar behavior is shared among nearby-in-time Intel micro-architectures, though I haven't tested them. You'll probably want to know a bit of x86 assembly to get full value, but the prose should still be understandable even if the assembly isn't.

This is a long one, so if you don't have time you can skip to the summary.

Thanks to Nate Kurz for proofreading and suggesting various welcome improvements.

The Basics

I won't describe memory disambiguation in great detail, but here's a brief background on the topic.

First, the naming: this general topic may also be known as store forward prediction, or memory aliasing prediction or memory dependence speculation, or various other terms. The basic idea is that in an out-of-order architecture, a load that follows an earlier store to the same (or overlapping) address needs to be satisfied0 from the most recent such store, and not from stale data in the L1 cache. Such loads are said to alias the earlier store, and in high-performance implementations the store is typically forwarded to the load directly from the store buffer.

In a simple implementation with a store buffer that buffers stores before they are committed to the L1 cache, this means that loads cannot execute before all prior store addresses are known, since it isn't possible to determine if any prior in-progress store aliases the load. In high performance implementations, this is a significant limitation for some code since hoisting loads above address-unknown stores may provide a large speedup. Hoisting here refers to allowing the load to execute (to receive its value) before some earlier stores, especially earlier stores whose addresses are unknown.

Since hoisting loads is both important for performance, and potentially incorrect if the load is hoisted past a store that aliases it, it is common for implementations to have machinery to speculate that some load doesn't alias any in-progress store and to hoist it even above earlier address-unknown stores. The speculation is checked before the load retires and if it turns out wrong, execution is typically rolled back to before the offending load and replayed from that point, at a cost somewhat similar to other types of speculation failure such as branch mispredictions.
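
To make the stakes concrete, here is a minimal C sketch (purely illustrative, not taken from any of the benchmarks below) of the kind of loop where this matters:

#include <stddef.h>

/* Purely illustrative: the hardware can't tell statically whether dst and src
 * overlap, so every load of src[i] follows stores (to dst[0..i-1]) whose
 * addresses it might alias. Waiting for every earlier store address to
 * resolve serializes the loop; speculating "no alias" and hoisting the loads
 * lets iterations overlap, at the cost of a rollback if dst and src really
 * do overlap. */
void scale_into(int *dst, const int *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i];
}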

I highly recommend this article by Henry Wong for a deeper look at this specific topic and measurements on various architectures of steady state behavior for non-aliasing, partially-aliasing and fully-aliasing cases, as well as "fast data" and "fast address" cases. Here's a brief description of store buffers. See also this article by David Kanter describing the memory subsystem of Intel's Silvermont CPU in some detail (Skylake is very different, but it provides good coverage of the concepts).

Prediction

Background, Patents

The motivation for prediction is to identify loads that with high likelihood don't alias an earlier address-unknown store, so that they can be hoisted. This prediction should probably be conservative (erring on the side of not hoisting loads), since even the occasional misprediction is very costly (typically 10-20 cycles on modern Intel). The basic implementation is pretty simple and follows the same pattern as branch prediction: track the past behavior of loads based on IP and only hoist loads that have a pattern of not aliasing. The details are important though, to the point that WARF has variously sued (update here) and licensed their '752 patent on this topic to the tune of somewhere around one billion USD (my own rough estimate). That patent, which expired in 2016 and is sometimes called the Moshovos patent, also makes good reading on basic predictor designs.
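
Before getting to Skylake specifically, here's a toy C model of what such an IP-indexed predictor might look like. It is purely illustrative: the table size, indexing, and threshold are made-up values, not anything taken from the patents or the hardware:

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES   64   /* made-up table size */
#define THRESHOLD 15   /* made-up: consecutive non-aliasing executions before hoisting */

/* one saturating counter per entry, indexed by (a hash of) the load's IP */
static uint8_t no_alias_streak[ENTRIES];

/* conservative prediction: only hoist a load with a long clean history */
static bool may_hoist(uintptr_t load_ip) {
    return no_alias_streak[load_ip % ENTRIES] >= THRESHOLD;
}

/* update after the load's aliasing behavior is actually known */
static void train(uintptr_t load_ip, bool aliased_a_store) {
    uint8_t *c = &no_alias_streak[load_ip % ENTRIES];
    if (aliased_a_store)
        *c = 0;                 /* any aliasing resets the streak */
    else if (*c < THRESHOLD)
        (*c)++;                 /* saturate at the threshold */
}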

Skylake

Let's try to figure out how Skylake actually works. In fact, I'll jump straight to the conclusion:

Skylake appears to use a 256-entry hashed per-load-address0.5 memory-aliasing predictor without exact instruction confirmation, in combination with a core-global watchdog component that can override the per-PC predictor when it is "active".

Now we can work backwards to explain piece-by-piece what that abomination of a sentence means. In my own investigation, I did a very basic initial reverse engineering, which allowed me to Google using more specific terms, at which point I came across the '263 patent from Intel, which accelerated the process since I now was just checking if the Skylake implementation lined up with the patent (hint: it does, mostly), and validating various parameters. So you can probably learn almost as much by reading the patent, but frankly that's not fun and the below is at least backed up by real-world tests.

Attempt 1

Let's try to write some code that will trigger store-forwarding and hence possibly trigger a store-forwarding misspeculation. That's easy: just read from a location that was recently written to:

mov     DWORD [rsi], 0    # store 0 to location pointed to by rsi
mov     eax, DWORD [rsi]  # load from that same location into eax
mov     DWORD [rsi], 0
mov     eax, DWORD [rsi]
...

On entry to the above code, rsi holds the address of a page-aligned heap allocated buffer (we only ever touch the first DWORD of this buffer).
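
For reference, the buffer that rsi points at might be set up something like this in C (a sketch to make the assumptions explicit; the actual uarch-bench setup code may differ):

#include <stdlib.h>
#include <string.h>

/* Sketch: a page-aligned heap buffer whose first DWORD is the only location
 * the benchmark touches. */
static void *make_test_buffer(void) {
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0)  /* one page, page-aligned */
        return NULL;
    memset(buf, 0, 4096);  /* fault it in so the timed region sees warm memory */
    return buf;            /* this pointer is what ends up in rsi */
}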

With this pair of instructions repeated 100 times and run with oneshot mode in uarch-bench using:

./uarch-bench.sh --timer=libpfc --test-name=studies/store-fwd-try/stfwd-try1 --extra-events=MACHINE_CLEARS.MEMORY_ORDERING

... we get the following results1:

stfwd-try1 @ 0x0x485000
                     Benchmark   Sample   Cycles   MACHIN
                    stfwd-try1        1   267.00     0.00
                    stfwd-try1        2    98.00     0.00
                    stfwd-try1        3    98.00     0.00
                    stfwd-try1        4    98.00     0.00
                    stfwd-try1        5    98.00     0.00
                    stfwd-try1        6    98.00     0.00
                    stfwd-try1        7    98.00     0.00
                    stfwd-try1        8    98.00     0.00
                    stfwd-try1        9    98.00     0.00
                    stfwd-try1       10    98.00     0.00
                    stfwd-try1       11    98.00     0.00
                    stfwd-try1       12    98.00     0.00
                    stfwd-try1       13    98.00     0.00
                    stfwd-try1       14    98.00     0.00
                    stfwd-try1       15    98.00     0.00
                    stfwd-try1       16    98.00     0.00
                    stfwd-try1       17    98.00     0.00
                    stfwd-try1       18    98.00     0.00
                    stfwd-try1       19    98.00     0.00
                    stfwd-try1       20    98.00     0.00

The last column, with heading MACHIN, is the number of MACHINE_CLEARS.MEMORY_ORDERING events which occurred during the sample - this applies to all the remaining tests as well. Yes, the name could be better.

The first execution of the code shows a large number of cycles compared to the subsequent runs, probably due to missing the L1I cache and other effects of executing cold code1.5. The remaining 19 iterations all execute in exactly 98 cycles, reflecting the fact that stores can occur at one per cycle2, and that store-forwarding can happen in parallel with a throughput of at least 1 per cycle. More importantly, we see zero memory ordering clears caused by mispredicted store forwarding, even on the first run, where the predictors are necessarily cold for this code.

Now perhaps this isn't entirely surprising. The store address in the code above will generally always be available at the moment the store is executed, so we can expect the store data to be available pretty much right away to the following load. In other words, it isn't likely the CPU is even going to hoist the load above the store before its address is known, since the address is always known immediately - there is no need to speculate at all here.

Attempt 2

So let's delay the availability of the store address, but not the load address so that we might actually see loads being hoisted. We'll do that like this:

; setup: copy rsi into rdx so we can have independent-but-equal store and load addresses
mov     rdx, rsi 

imul    rdx, 1
mov     DWORD [rdx], 0
mov     eax, DWORD [rsi]
imul    rdx, 1 
mov     DWORD [rdx], 0
mov     eax, DWORD [rsi]
... ; repeat the imul/store/load triplet 98 more times

It's similar to the first test, but we use rdx as the store address, and in between each store we multiply rdx by 1: this is a logical no-op, but it adds 3 cycles of latency to the calculation of the address for the next store. So the store address's readiness will always lag the load address's readiness: the load address is the same value, but it comes from rsi, which isn't part of the long imul dependency chain.

Here's a typical result:

stfwd-try2 @ 0x0x485400
                     Benchmark   Sample   Cycles   MACHIN
                    stfwd-try2        1   447.00     4.00
                    stfwd-try2        2   304.00     0.00
                    stfwd-try2        3   304.00     0.00
                    stfwd-try2        4   304.00     0.00
                    stfwd-try2        5   304.00     0.00
                    stfwd-try2        6   304.00     0.00
                    stfwd-try2        7   304.00     0.00
                    stfwd-try2        8   303.00     0.00
                    stfwd-try2        9   304.00     0.00
                    stfwd-try2       10   304.00     0.00
                    stfwd-try2       11   304.00     0.00
                    stfwd-try2       12   304.00     0.00
                    stfwd-try2       13   304.00     0.00
                    stfwd-try2       14   303.00     0.00
                    stfwd-try2       15   304.00     0.00
                    stfwd-try2       16   304.00     0.00
                    stfwd-try2       17   304.00     0.00
                    stfwd-try2       18   304.00     0.00
                    stfwd-try2       19   304.00     0.00
                    stfwd-try2       20   303.00     0.00

Finally we are getting some memory speculation failures, even if it's only 4. That result, while the most common, isn't the only one: on repeated runs this result was also common:

stfwd-try2 @ 0x0x485400
                     Benchmark   Sample   Cycles   MACHIN
                    stfwd-try2        1   395.00     2.00
                    stfwd-try2        2   404.00     3.00
                    stfwd-try2        3   304.00     0.00
                    stfwd-try2        4   304.00     0.00
                    stfwd-try2        5   303.00     0.00
                    stfwd-try2        6   304.00     0.00
                    stfwd-try2        7   304.00     0.00
... snip

Similarly, runs with 3 and then 1 mispredictions in the first and second samples (followed as usual by all zeros) were also common.

The overall behavior was fairly consistent though: you'd get a few misspeculations limited usually to the first or second pass through the code, and then zero. In particular, you never see more than 4 misspeculations in one iteration. The fact that we get zero misspeculations on most subsequent iterations makes sense: at this point the CPU has seen the loads multiple times, and knows they do alias earlier stores3, so the loads won't be hoisted.

The big question is: why only 4 (or sometimes 2 or 3) misspeculations on the first iteration? Our trick of delaying the store address to force misspeculations has pretty clearly worked (virtually every run shows misspeculations), but I would expect every load to misspeculate here, since we are delaying every store. Sure, we are training the predictors to predict "does alias" for future samples, but here all our 100 loads live at separate PCs and should have separate predictors4.

Perhaps every misprediction causes a small window where no more mispredictions happen, so we only get a maximum of 1 misprediction every 25 loads or so? Let's test it by varying the number of load store pairs.

So I tried this with 20 load-store pairs instead of 100. The results were very similar to the prior case: many runs with 4 mispredictions followed by all-zeros, some with the 3, 2, 0, 0, ... pattern. Some start out with zero for the first sample, but have 3 or 4 mispredictions in a later sample. In general though we see about the same number of mispredictions as the case with 100 loads. Only when we drop the number of store-load pairs down very low, e.g., to 4 loads, do we see a reduction in mispredictions (typically 2). On the other hand, greatly increasing the number of store-load pairs to 1,000 shows again the same 4, 0, 0, ... pattern (although more consistently: there are fewer 2, 3, ... and 3, 1, ... patterns in this case). You can try these variants yourself with --test-name */stfwd-try2-* on the command line.

One could reasonably conclude that the mispredictions are thus not spread out among all the store-load pairs, but largely occur at the start, in the first few loads, and then stop occurring at all. It seems there is some global mechanism where misspeculations earlier in the instruction stream can affect the behavior of later different loads that have never been executed. We also find that 4 continues to be a magic number: we never see samples with more than 4 misspeculations.

In fact, Intel describes exactly such a global mechanism in the '263 patent, which they call a watchdog. From the first claim:

first logic to predict a load operation will not conflict with older pending store operations if a saturation counter circuit corresponding to the load operation is at least at a threshold value and a maximum rate of mispredictions has not occurred, wherein the saturation counter circuit is to be incremented if the load operation does not conflict with the older pending store operations;

a watchdog flush counter circuit to be decremented if a prediction by the first logic is not correct; and

a watchdog disambiguation counter circuit to be incremented if the prediction by the first logic is correct, wherein the first logic is to be disabled if the ratio of the disambiguation counter circuit value and the flush counter circuit value reaches a minimum value.

If you read the rest of the patent, it becomes clear that the first logic they describe above is the regular PC-based predictor tied to a specific load (let's call this the fine-grained predictor from now on). Then the watchdog described in the last paragraph is the global mechanism we are seeing: if the watchdog is active, loads will be assumed to alias (i.e., will not be hoisted), regardless of whatever prediction is/would be made by the fine-grained predictor.

In fact, we find the likely origin of the magic number 4 in the section describing the detailed operation of the watchdog counter:

Similarly, the prediction mechanism may be disabled after a desired ratio of successful to unsuccessful predictions is met. For example, in one embodiment the prediction mechanism is disabled if 4 (corresponding to a 2 bit flush counter, for example) or more mispredictions occurs for every 1024 (corresponding to a 16 bit disambiguation counter, for example) successful predictions. In other embodiments, other techniques may be used to track the success rate of predictions, such as time-dependent counters, in order to determine when to enable or disable the prediction mechanism.

In one embodiment they use a mechanism where 4 mispredictions in a row5 will enable the watchdog, which basically means the CPU is running in a conservative mode where loads are never hoisted. That would exactly explain the behavior above, where we never see more than 4 mispredictions. It also explains the various behaviors where we don't see 4 mispredictions in the first run: the counters described start in some semi-random state, and if they are already partway to the threshold it may take fewer than 4 mispredictions to enable the watchdog. Runs with outcomes 3, 1, 0, 0, ... or 3, 2, 0, 0, ... are probably explained by the fine-grained predictors mostly predicting "aliases", so not enough actual mispredictions occur on the first run to trigger the watchdog, but by the second run the watchdog is enabled.
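
To make the arithmetic concrete, here's a small C sketch of how the counters in that embodiment might fit together. The counter widths and the 4:1024 ratio come straight from the patent text; the reset and re-arm policy is my own guess, and the real hardware certainly differs in detail:

#include <stdint.h>
#include <stdbool.h>

/* Sketch of the '263 patent's "one embodiment": a 2-bit flush counter and a
 * 16-bit disambiguation counter, disabling the fine-grained predictor when
 * mispredictions exceed roughly 4 per 1024 successful hoists.  Counter reset
 * and re-enable behavior below is guesswork. */
static uint8_t  flush_count;   /* mispredicted (flushed) hoisted loads */
static uint16_t disamb_count;  /* successfully hoisted loads           */
static bool     watchdog_on;   /* true = no loads are hoisted          */

static void watchdog_update(bool hoist_mispredicted) {
    if (hoist_mispredicted) {
        if (++flush_count >= 4) {            /* 4 mispredictions ...           */
            watchdog_on = true;              /* ... before 1024 good hoists:   */
            flush_count = disamb_count = 0;  /* ... go conservative            */
        }
    } else if (++disamb_count >= 1024) {     /* ratio looks fine this window   */
        flush_count = disamb_count = 0;
        watchdog_on = false;                 /* (guess) allow prediction again */
    }
}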

So our observed results line up exactly with a two-bit counter as the patent describes "in one embodiment". Now about that "one embodiment" thing: it's my experience that many of these CPU technology patents have that exact pattern: they go into considerable detail about "one embodiment", including specific details like the size of the counters, or number of X, or amount of Y, then finish it off with a general "in other embodiments some other unspecified method might be used". The embodiment described in detail often aligns exactly with the released hardware, as confirmed by later published details, testing, etc. I don't know what legal considerations motivate this behavior, but you can see it repeatedly. All that is just to say that the numbers you find in these patents usually aren't picked out of thin air, and if the tested behavior matches, it's a strong indication that the numbers and mechanism described in the patent may be the ones that appear in actual hardware.

Now With Training

Now that we have a working theory, let's try to control the watchdog state more explicitly and make some more measurements to understand its behavior.

One problem we are having is that the watchdog state on entry to the benchmark is unpredictable, affected by the vagaries of whatever happened during the initialization of the uarch-bench process. Even if we trimmed the benchmark down to a microscopic executable written in assembly that skips even initializing the C library, it probably wouldn't help: the fine-grained and watchdog predictors would then just depend on the state of whatever process (e.g., the shell) was running previously on that CPU.

So let's try to train the predictors into a known state before running our benchmark by running some specially crafted code for that purpose!

The more interesting option is probably to train them into the "doesn't alias" state, since then we expect to see our 4 mispredictions consistently in the first run when we run the benchmark. Ideally, the training will train both the fine-grained and watchdog predictors. The latter, being global, should be easy. The fine-grained (per PC) predictors are another matter: how can we train the predictor for the loads in our benchmark using other, unrelated loads? Well, under the assumption that the load address is hashed into some type of small(ish) predictor structure, if we just train with a lot of loads at different addresses, we can hope to collide into most or all of the predictor entries, effectively training the fine-grained predictors for loads they haven't encountered yet.

This assumes, of course, that the predictors don't "confirm" that the accessed entry corresponds to the current PC. For example, a predictor might store the PC (or some hash of it, different from the indexing hash) of the load that corresponds to the prediction stored there, and when consulted for a different load that happens to hash to the same entry (a collision), it could detect this and fall back to a conservative default prediction. On the other hand, such a check may be too expensive or require too much storage in the predictor structure (which otherwise only needs a few bits' worth of counters per entry), so collisions might not be detected at all. Let's find out!
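
To make the two alternatives concrete before we run the experiment, here's a quick C sketch of an untagged versus a tagged lookup. The sizes, the hash and the threshold are placeholders, not measured values (we measure the real ones shortly):

#include <stdint.h>
#include <stdbool.h>

#define ENTRIES 256   /* placeholder size */

/* (a) untagged: any load that hashes to an entry simply uses it, so colliding
 *     loads silently train each other -- exactly what our training trick
 *     hopes to exploit */
struct untagged_entry { uint8_t no_alias_streak; };

static bool predict_untagged(const struct untagged_entry *t, uintptr_t ip) {
    return t[ip % ENTRIES].no_alias_streak >= 15;   /* 15: placeholder threshold */
}

/* (b) tagged: the entry also remembers (a hash of) the IP that trained it, and
 *     a colliding load that fails the check falls back to a conservative
 *     "assume it aliases" default */
struct tagged_entry { uint8_t no_alias_streak; uint8_t tag; };

static bool predict_tagged(const struct tagged_entry *t, uintptr_t ip) {
    const struct tagged_entry *e = &t[ip % ENTRIES];
    if (e->tag != (uint8_t)(ip >> 8))   /* confirmation failed: different load */
        return false;                   /* conservative: don't hoist           */
    return e->no_alias_streak >= 15;
}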

Single Training Load, Two Payload Loads

First, we'll try to understand the hash function which maps load instructions to predictor entries. As it turns out, this is quite easy.

We use a function like this:

; separate training and timed regions
define_bench aliasing_loads_raw

    mov     r8, rdx ; rdx:rax are clobbered by rdpmc macros

    readpmc4_start

    mov     ecx, 10

.outer:

    lea     r11, [rsi + 64]
    mov     r10, 10000
    
    ; training loop
.train_top:
%rep 1
    imul    r11, 1
    imul    r11, 1
    imul    r11, 1
    mov     DWORD [r11], 0
    mov     eax, DWORD [rsi]
    nop
%endrep
    dec     r10
    jnz     .train_top

    lea     r9, [rsi + 8]

    ; this lfence is critical because we want to finish all the instructions related to
    ; training before we start the next section, since otherwise our attempt to delay
    ; the store may not work (e.g., the training section will probably end with a
    ; backlog of imuls that execute and retire 1 every 3 cycles, so the instructions in
    ; the next section would also only be fed in one every 3 cycles, and OoO couldn't
    ; do its magic)
    lfence
    jmp .top
ALIGN 256
.top:

    nops
%rep 2
    times 5 imul    r9, 1
    ; payload load(s)
    mov     DWORD [r9], 0
    add     eax, DWORD [rsi + 8]
%endrep

    ;dec     rdi
    ;jnz     .top

    dec     ecx
    jnz     .outer

    readpmc4_end
    store_pfcnow4 r8
    ret
    ud2

When run, the usual observed behavior is that we get exactly 1 aliasing misprediction for every outer iteration. That is, after doing the 10,000 training loops, one of the two actually-aliasing loads immediately following (the payload) mispredicts. Since we do 10 outer loops, this leads to 10 mispredictions, like this typical run7:

./uarch-bench.sh --timer=libpfc --test-name=studies/store-fwd-try/stfwd-raw-trained --extra-events=MACHINE_CLEARS.MEMORY_ORDERING
...
stfwd-raw-trained @ 0x0x43e800
                 Benchmark    Sample    Cycles    MACHIN
         trained loads raw         1 900754.00     10.00
         trained loads raw         2 900734.00     10.00
         trained loads raw         3 900734.00     10.00
         trained loads raw         4 900734.00     10.00
         trained loads raw         5 900734.00     10.00
         trained loads raw         6 901820.00     10.00
         trained loads raw         7 900731.00     10.00
         trained loads raw         8 900734.00     10.00
         trained loads raw         9 900734.00     10.00
         trained loads raw        10 901286.00     12.00
         trained loads raw        11 900732.00     10.00
         trained loads raw        12 900732.00     10.00
         trained loads raw        13 901432.00     11.00
         trained loads raw        14 905594.00      6.00
         trained loads raw        15 906903.00      8.00
         trained loads raw        16 901914.00     10.00
         trained loads raw        17 900732.00     10.00
         trained loads raw        18 900733.00     10.00
         trained loads raw        19 900734.00     10.00
         trained loads raw        20 901742.00     10.00

The takeaway here is that we get one misprediction after training even though the training load is at a different address. This is surprising. I would expect that the 2 loads in the payload section map to different predictor entries than the training load (indeed, we'll see later that this is the case), and while we expect the training section to set the watchdog to allow hoisting, I would expect the per-load entry to still prevent it.

Three States?

It seems that the global aliasing predictor state may actually have three possible values:

  1. Hoisting always allowed, regardless of what the predictor says (maybe don't even consult the predictor).
  2. Hoisting may be allowed, but depends on value returned by per-load predictor.
  3. Hoisting disallowed globally (watchdog active).

I don't find any mention of state (1) in the Intel patent, only state (2). Still state (1) might make sense when you have a large working set such that the alias predictor tables usually don't have an entry for loads as they are encountered (the number of loads in the working set exceeds the capacity). If loads rarely or never alias in this code, state (1) allows hoisting based on the observed global behavior, even if the per-load resources are insufficient to track all loads.

The existence of state (1) can also be seen as an effort to conserve predictor resources: if you have a stretch of code (e.g., an unrolled memcpy-like loop) where loads never alias earlier stores, you don't really want to use all your per-load resources storing predictions for loads that never alias. Perhaps when the core enters state (1) it no longer updates the predictor entries, preserving the useful entries from other parts of the code that have more interesting aliasing behavior.

Global Predictor Sensitivity

While we are here, let's check out how many training loads (that don't alias) are needed to put the global predictor back into state (1) during the training phase. Above we use 10,000 loads, but it turns out the tipping point is exactly 1024. When we modify the training count to 1023 (mov r10, 1023), we get results like this:

stfwd-raw-trained @ 0x0x43e800
                     Benchmark    Sample    Cycles    MACHIN
             trained loads raw         1  92659.00      6.00
             trained loads raw         2  92586.00      5.00
             trained loads raw         3  92584.00      5.00
             trained loads raw         4  92584.00      5.00
             trained loads raw         5  92584.00      5.00
             trained loads raw         6  92584.00      5.00
             trained loads raw         7  92584.00      5.00
             ... more lines all like the above

The number of mispredicts is cut in half. What happens is that it now takes two training loops to get back to state (1), so only every other payload suffers a mispredict. This is roughly consistent with the '263 patent, which doesn't directly describe state (1) but does mention a threshold of 1024 successful predictions for the global counters.
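
Putting the pieces so far together, here's how I'd sketch the global three-state behavior in C. This extends the earlier watchdog sketch with the state (1) transitions we just measured; only the 4 and 1024 thresholds are backed by the patent and the tests, the rest of the bookkeeping (including how the core ever leaves state (3)) is a guess:

#include <stdint.h>
#include <stdbool.h>

/* My working model of the global state, not anything Intel documents:
 *   state 1: hoist everything, don't consult the per-load predictor
 *   state 2: consult the per-load predictor
 *   state 3: watchdog active, hoist nothing
 * The exit path from state 3 is not modeled here. */
enum gstate { HOIST_ALL = 1, USE_PREDICTOR = 2, HOIST_NONE = 3 };

static enum gstate g_state = USE_PREDICTOR;
static uint32_t good_hoists;   /* successful hoists since the last reset   */
static uint32_t bad_hoists;    /* mispredicted hoists since the last reset */

static void global_update(bool hoist_mispredicted) {
    if (hoist_mispredicted) {
        good_hoists = 0;
        if (g_state == HOIST_ALL)
            g_state = USE_PREDICTOR;   /* one mispredict drops us out of (1) */
        else if (++bad_hoists >= 4)
            g_state = HOIST_NONE;      /* too many: the watchdog kicks in    */
    } else if (++good_hoists >= 1024) {
        good_hoists = bad_hoists = 0;
        g_state = HOIST_ALL;           /* long clean run: stop consulting
                                          the per-load predictor entirely    */
    }
}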

Aliasing Loads

Above, I claimed without further proof that the training and payload loads map to different predictor entries, as the evidence for state (1). Isn't it possible, however, for the training load to map to the same entry as one or more of the payload loads? That would explain why a misprediction occurs (because the per-load predictor entry used by the payload was also being trained by the training load).

Well, let's look at exactly this: we'll vary the number of nop instructions inserted by the nops macro which appears between the training loop and the payload, and then look for some behavior change as we increase the number of nops.

I ran it with the nop-iterator.sh script, which repeatedly re-assembles the function above, with NOPCOUNT varying from 1 to 4000. Just eyeballing the results, everything looks like the usual 10 mispredictions (along with non-repeatable outliers), except for a few runs that have exactly double: 20.0 mispredictions.

I grep6 for the NOPCOUNT (number of nops) for these 20 mispredict runs, and they are as follows:

91
347
603
859
1115
1371
1627
1883
2139
2395
2651
2907
3163
3419
3675
3931

What's special about this sequence? The gap between successive nop counts is exactly 256. So every time we shift our function by 256 bytes this double mispredict shows up for exactly one alignment. Let's take a look at the code generated for the first interesting nop count of 91. Specifically, let's look at the training load and two payload loads:

> objdump -drwC -Mintel x86_methods2.o | grep 'eax,DWORD'
  98:	8b 06                	mov    eax,DWORD PTR [rsi]      <-- the training load
 17a:	03 46 08             	add    eax,DWORD PTR [rsi+0x8]  <-- the first payload load
 198:	03 46 08             	add    eax,DWORD PTR [rsi+0x8]  <-- the second payload load

There's a bunch more code in that file, but the grep 'eax,DWORD' isolates the three loads. The thing to notice is that the address of the training load at 0x98 and the second payload load at 0x198 end in the same byte, i.e., they are some multiple of 256 bytes apart.

All of the other 20 misprediction cases share this trait (with different addresses for the payload load, like 0x298, etc). We can see what's going on here: the mapping to the per-load predictors is apparently very simple: it's just a direct mapping based on the last 8 bits of the address into a 256-entry table, or possibly instead into a smaller table which stores the last 8 bits of the address to add a second layer of confirmation to a hit.

Note also that any cases where the first payload load collides with the training load don't cause an extra mispredict: this load was already mispredicting regardless of the state of the per-load predictor, so we only get the second misprediction when the second load collides. We can add more loads to the payload (e.g., changing %rep 2 to %rep 4) and the same pattern holds: whenever one of the loads ends in the same 8 bits as the training load, a collision occurs and we get an extra mispredict.
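
In code terms, the mapping implied by these measurements is about as simple as it gets (a sketch of the inferred indexing, not anything Intel documents):

#include <stdint.h>

/* Inferred from the 256-byte period: the predictor entry for a load is chosen
 * purely by the bottom 8 bits of the load instruction's address, so two loads
 * whose addresses differ by a multiple of 256 share an entry.  Whether there
 * are really 256 entries, or a smaller tagged table, is settled in the next
 * section. */
static unsigned predictor_index(uintptr_t load_ip) {
    return (unsigned)(load_ip & 0xFF);
}

/* e.g. the training load at 0x98 and the colliding payload load at 0x198:
 *      predictor_index(0x98) == predictor_index(0x198) == 0x98 */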

Predictor Table Size

Above we have shown that the load prediction behavior has period 256 with respect to the load address, but this doesn't necessarily mean the predictor has 256 entries: it could instead be that a smaller set of predictor entries exists, say 16 entries, each tagged with the bottom 8 bits of the load address. When a load address is looked up, if the tag doesn't match, a default prediction is made instead. This would explain the period 256 above without necessarily needing 256 entries.

So let's try to determine the predictor size. This is a bit trickier than it seems at first: the usual approach for this type of reverse engineering is to create an increasing sequence of instructions until prediction starts to fail, but the straightforward approaches here don't work due to the various global states.

For example, we might consider creating a series of aliasing loads, and keep increasing the number of distinct loads, expecting to see memory order mispredictions when we exceed the table size. This doesn't actually work though, because as soon as we get a few mispredictions (if any) the global state changes to (3) and no loads will be hoisted, so we will get zero mispredictions.

Alternately, we might create a loop with a non-aliasing store-load pair which has very different performance if loads are hoisted versus if they are not. For example, a loop whose store address or data depends on the result of the load: in this case, if the load is not hoisted the store and load form a loop-carried serial dependency and the iteration time will be limited by the sum of the store and load latency. However, if the load can be hoisted, the dependency chain is broken and the pair can execute as fast as one per cycle. We could try an increasing sequence of such pairs until performance started to degrade from the fast "hoisted" speed to the slower "serially dependent" speed, because we had too many loads and they overflowed our predictor table. This also doesn't work, however, since the global state will soon change to (1) where all loads are allowed to be hoisted without even consulting the per-load predictor, so performance won't degrade.

We can, however, combine these two approaches into a single test: one store, one aliasing load and one non-aliasing load per unit. The global state can't stay in (1) or (3) since we have both aliasing and non-aliasing loads. We increase the number of store-load triplets until we exceed the predictor tables, at which point something must break: either the non-aliasing loads will stop being hoisted and performance will tank, or the aliasing loads will be hoisted and mispredicts will skyrocket and performance will tank (if you have followed the global predictor logic up until this point you'll already be able to predict which of the two is likely to occur).

So here's the loop we use (mixed_loads_raw):

    lea     r11, [rsi+ 64]
    mov     rdi, 1000

.top:
%assign r 1
%rep REPCOUNT
    ; rax is always zero in this loop
    times 2 lea    r11, strict [byte r11 + rax*2 + 0] ; delay the store address
    mov     DWORD [r11 + rax], eax ; (A) the store to [rsi + 64]
    mov     r9d, DWORD [rsi + 64]  ; (B) aliasing load
    nop9                           ; (C)
    nop6
    mov     eax, DWORD [rsi]       ; (D) non-aliasing load
    imul    eax, 1                 ; (E) make the eax dep chain longer
%if r == (REPCOUNT/2)
    ; half way through the unroll, we add 19 bytes of nop so that the aliasing loads
    ; in the second half of the loop collide with the non-aliasing loads in the first
    ; half and vice versa. Without this, we'll never get any slowdown, because even though
    ; we get collisions, the prediction is always exactly correct since aliasing loads collide
    ; with aliasing loads after "wrap around" (and non-aliasing with non-aliasing)
    nop9
    nop9
    nop1
%endif
%assign r r+1
%endrep

    dec     rdi
    jnz     .top

The key features are a store (A) and two loads: aliasing (B) and non-aliasing (D). Some nop instructions are strategically added so that each load is exactly 19 bytes away from the previous and subsequent loads, and consequently the whole repeated section is 38 bytes. Since 19 is relatively prime to 256, we'll visit all the possible 256 final address bytes before repeating any. That is, we won't get any predictor collisions before all predictor entries have been visited, for a power-of-two sized table and a direct mapping scheme.
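
If you want to convince yourself of the relatively-prime argument, here's a throwaway C check (not part of the benchmark) that stepping through load addresses 19 bytes at a time really does visit all 256 possible low bytes before repeating:

#include <stdint.h>
#include <stdio.h>

/* gcd(19, 256) == 1, so i*19 mod 256 cycles through every value 0..255
 * before repeating: with a direct-mapped, power-of-two table indexed by the
 * low 8 bits, none of the first 256 loads collide with each other. */
int main(void) {
    uint8_t seen[256] = {0};
    unsigned distinct = 0;
    for (unsigned i = 0; i < 256; i++) {
        uint8_t low_byte = (uint8_t)(i * 19);  /* low 8 bits of the i-th load's address */
        if (!seen[low_byte]) { seen[low_byte] = 1; distinct++; }
    }
    printf("distinct low bytes after 256 loads: %u\n", distinct);  /* prints 256 */
    return 0;
}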

When everything is working well the whole repeated section runs in about 6 cycles, limited by the two 3-cycle lea operations in the dependency chain involving r11. The store, loads, nop and imul instructions don't really cost anything since they easily fit into the ample spare execution cycles provided by the 6-cycle minimum latency.

Indeed, for REPCOUNT=10, that's exactly what we see:

stfwd-raw-mixed @ 0x0x43e9c0
                     Benchmark    Sample    Cycles    MACHIN
            mixed aliasing raw         1  61475.00     12.00
            mixed aliasing raw         2  60390.00      9.00
            mixed aliasing raw         3  60431.00     10.00
            mixed aliasing raw         4  60390.00      9.00
            mixed aliasing raw         5  60431.00     10.00
            mixed aliasing raw         6  60431.00     10.00
            mixed aliasing raw         7  60390.00      9.00
            mixed aliasing raw         8  60431.00     10.00
            mixed aliasing raw         9  60389.00      9.00
            mixed aliasing raw        10  60431.00     10.00
            mixed aliasing raw        11  60390.00      9.00
            mixed aliasing raw        12  60431.00     10.00
            mixed aliasing raw        13  60390.00      9.00
            mixed aliasing raw        14  60431.00     10.00
            mixed aliasing raw        15  60389.00      9.00
            mixed aliasing raw        16  60431.00     10.00
            mixed aliasing raw        17  60390.00      9.00
            mixed aliasing raw        18  60431.00     10.00
            mixed aliasing raw        19  60390.00      9.00
            mixed aliasing raw        20  60431.00     10.00
            mixed aliasing raw    median  60431.00     10.00

Divide by 10,000 because there are 10 unrolled load pairs times 1000 iterations, and you get ~6.04 cycles/load pair. Note also there are 9 or 10 memory ordering clears per 1000 iterations. This happens because every ~1024 load pairs, we switch from global state (2) to global state (1) and subsequently get a single misprediction before switching back to state (2). We saw this before in the trained tests, but here the aliasing and non-aliasing loads are interleaved and equal in number, which tells us something new: the counter that tracks successful hoisting in order to change from state (2) to state (1) isn't affected by non-hoisted aliasing loads. This apparently means that any workload of mixed aliasing and non-aliasing loads will get occasional single mispredictions (after ~1024 hoisted loads) as we flip into state (1). This effect actually accounts for most of the .04 in the measured 6.04 cycles: 10 mispredictions add about 360 extra cycles to the runtime.

We're almost there now. Let's use the same nop-iterator.sh script, but this time vary REPCOUNT from 1 to 1,000 to see if there is a phase change in behavior. Note that each REPCOUNT accounts for 2 loads. Here's the command I used:

VAR=REPCOUNT scripts/nop-iterator.sh | tee tmp/out.mixed

I threw the results8 into a spreadsheet and plotted REPCOUNT versus cycles/load pair and aliasing mispredictions (shown as mispredictions per 1000 pairs):

The effect here is very clear. Right at a REPCOUNT of 128 (i.e., 256 loads in the loop body), we see a slowdown in performance from the baseline of 6 cycles, increasing smoothly to 16 cycles at REPCOUNT of 256 (512 loads). The slowdown is entirely due to non-aliasing loads not being hoisted any more, as the per-load predictor entry for a non-aliasing load collides with an aliasing load earlier in the sequence. At the same time, the mispredicted loads start dropping off to zero, since these are caused by entering state (1) which needs 1024 hoisted loads to occur: as hoisted loads become less frequent, state (1) transitions become less frequent and drop off to zero when the hoisting stops completely at REPCOUNT 256.

Given this, we can safely say that the predictor size is 256 entries, since we predict perfectly up to 256 loads. Note also that when aliasing and non-aliasing loads collide, the predictor predicts "aliases", which makes sense: the penalty for an incorrect hoist is large, so the predictors are biased towards not hoisting if any aliasing has been observed (the patents mention 15 consecutive non-aliasing loads before the predictor will allow hoisting, and another test performed but not described here seems roughly consistent with a figure in that range).

Summary

  • Skylake seems to use a predictor closely aligned with that described in Intel's '263 patent, with both per-load-address predictors as well as a core-global "watchdog" that can override the per-load behavior.
  • The global watchdog was found to include a state (2) where the per-address predictors are used to determine if any particular load can be hoisted and state (3) where no hoisting occurs for any load, as described in the patent.
  • The watchdog transitions from (2) to (3) when more than 4 mispredictions occur for every 1024 successfully hoisted loads and vice versa. The implication is that you will rarely see a burst of more than 4 ordering mispredictions in a small section of code, since a transition to state (3) will occur.
  • The global watchdog also seems to include a state (1), not described in the patent, where all loads are eligible for hoisting, without consulting the per-address predictor. Starting at state (2) the core enters state (1) after 1024 successful load hoistings. A single misprediction is enough to return to state (2). The idea here may be to save power by gating the prediction circuitry, to decrease load latency in the presence of earlier-address-unknown stores by executing without waiting for the predictor, or to allow a quicker transition to hoisting all loads when there is a phase change from a mixed-aliasing situation to a no-aliasing one.
  • The size of the per-address predictor is 256 entries and the "hash function" from address to entry is simply taking the bottom 8-bits of the address (the first byte of the load instruction).
  • The aliasing misprediction cost was generally 32 or 33 cycles, considerably worse than say a branch misprediction which usually introduces only a 16-20 cycle front-end bubble.

Future Work

Interesting future work on this subject that I may undertake (or that someone else might undertake in which case I would be very interested in the result) includes:

  • Determining the exact per-address predictor behavior, e.g., at what ratio of aliasing to non-aliasing loads does the predictor transition to allowing hoisting. This is relatively straightforward now that we know the global predictor behavior: we can just try various runs of aliasing and non-aliasing loads at the same address and check when a transition occurs.
  • Determining the effect of 4K-aliasing. We know that a store followed by a load that shares the bottom 12 bits of address (i.e., whose address is the same modulo 4096) acts in some ways like an actually aliased load - but how this interacts with the predictor isn't clear (e.g., is the store forwarded and then later there is a machine clear when the error is detected?).
  • Trying to understand the purpose of state (1), which stops using the per-address predictors and lets all loads be hoisted. On the face of it, this seems mostly pointless, since at this point the per-address predictors should be doing a good job of allowing all loads to hoist anyways, and there are some downsides such as increased mispredictions. Perhaps this lets the CPU turn off or bypass the prediction circuitry entirely which may (a) save power or (b) improve load latency or (c) something else. One could try to test (a) by examining if state (1) is associated with a reduction in power draw via the RAPL counters and (b) by trying to observe a decrease in some types of load latency.

Thanks

Thanks to Noah Goldsteinn who pointed out an error in the text.


Footnotes

0 Or perhaps partially satisfied from the earlier store, in the case of a load that overlaps partly, but not completely, the earlier store.

0.5 When we refer to address in "per-load-address" we mean the address of the load instruction in the executable - not the source address for the loaded value (which varies at runtime).

1 These results are fairly representative of repeated runs of this benchmark, although you can see the occasional outlier perhaps reflecting an interrupt or other external event. Some runs have a mix of 98 cycles and 99 cycles, rather than all 98 cycles - there are a lot of micro-architectural details that can make a 1-cycle difference!

1.5 In my testing, L1I misses were the most prominent effect, but other things that might slow down the first iteration include: untrained branch predictors, L1D misses, and use of legacy decoders instead of uop cache.

2 The reason this can be less than 100, despite stores executing at a maximum rate of 1 per cycle, is itself somewhat interesting: the results are calculated by subtracting measurement overhead based on running the identical measurement code against an empty function. A function call has a minimum latency of about 4 cycles (i.e., back-to-back function calls take at least 4 cycles), but other work can happen in parallel with that cost. The upshot is that an empty function may take the same time as a function with 1-4 cycles of "real work", since the real work cost is hidden by the minimum function call latency. A function that does say 5 cycles of real work will show up as taking 5 - 4 = 1 cycle of total work, since most of the cost is hidden by the function call latency. There are a couple of solutions to this problem, but I'll leave that to another day.

3 I'm speaking a bit loosely here when I say the predictor knows they alias: I should really say that if the predictor had earlier thought they didn't alias, it would have corrected itself by now. In this case, however, for the vast majority of the loads (96 out of 100, usually) the predictor never allowed the load to be hoisted in the first place, and we aren't sure if it's even tracking the aliasing state for loads that it didn't hoist (although we'll investigate it).

4 Granted, the number of predictors (or more precisely the size of the predictor state) isn't unlimited, so we might expect distinct loads to alias in the predictor tables, but in general these tables are much larger than 4.

5 Of course, the mispredictions don't need to be all in a row: the basic idea is that if the mispredict:predict ratio is worse than 4:1024 the watchdog will eventually become enabled - but 4 in a row should always be enough to trigger the watchdog from any starting state.

6 I used grep -A5 ' 20.00' output.txt | gawk '{ if (match($0,/NOPCOUNT=(.*) -f/,m)) print m[1] }' to semi-reliably pick out the values of NOPCOUNT which resulted in counts of 20 mispredictions.

7 There are some outlier runs: this is a common feature of this benchmark. By adding HW_INTERRUPTS.RECEIVED I can see that many of the outliers where mispredictions are still 10 or higher are caused by interrupts, but there are also a lot of outliers where mispredictions are less than 10 (yet runtime increases over the nominal behavior) and I don't have an explanation for those.

8 This time using this nasty pipeline: egrep 'REPCOUNT| median' tmp/out.mixed | gawk '{ if (match($0,/REPCOUNT=(.*) -f/,m)) print m[1]; else if (match($0,/median (.*)/,m)) print m[1] m[2] }' | gawk 'NR%2{printf "%s ",$0;next;}1' | xsel -b.





























































this space left intentionally blank (so that the footnote anchors work)