
Intel Performance Quirks


Weird performance quirks discovered (not necessarily by me) on modern Intel chips. We don't cover things already mentioned in Agner's guides or the Intel optimization manual.

L1-miss loads other than the first hitting the same L2 line have a longer latency of 19 cycles vs 12

This seems to happen only if the two loads issue in the same cycle: if they issue in different cycles, the penalty appears to be reduced to 1 or 2 cycles. The effect goes back to Sandy Bridge (probably), but not to Nehalem (which can't issue two loads per cycle).
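A minimal sketch of the access pattern (registers are illustrative; the real benchmark is in uarch-bench, see below) is two loads to the same L2-resident line that can issue in the same cycle:

mov rdx, [rax]        ; first load to an L1-miss, L2-hit line: ~12 cycle latency
mov rcx, [rax + 8]    ; second load to the same line (assuming alignment): ~19 cycles if it issues in the same cycle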

Benchmarks in uarch-bench can be run with --test-name=studies/memory/l2-doubleload/*.

Details at RWT.

adc with a zero immediate, i.e., adc reg, 0 is twice as fast as with any other immediate or register source on Haswell-ish machines

Normally adc is 2 uops to p0156 and 2 cycles of latency, but in the special case that the immediate zero is used, it only takes 1 uop and 1 cycle of latency on Haswell machines. This is a pretty important optimization for adc since adc reg, 0 is a common pattern used to accumulate the result of comparisons and other branchless techniques. Presumably the same optimization applies to sbb reg, 0. In Broadwell and beyond, adc is usually a single uop with 1 cycle latency, regardless of the immediate or register source.
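For example, a branchless accumulation of comparison results (a sketch of the general pattern, not taken from the benchmark) looks like:

cmp rdx, rbx    ; sets CF if rdx < rbx (unsigned)
adc rcx, 0      ; rcx += CF: 1 uop / 1 cycle on Haswell thanks to the zero immediate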

Discussion at the bottom of this SO answer and the comments. Test in uarch bench can be run with --test-name=misc/adc*.

Short form adc and sbb using the accumulator (rax, eax, ax, al) are two uops on Broadwell and Skylake

In Broadwell and Skylake, most uses of adc and sbb only take one uop, versus two on prior generations. However, the "short form" specially encoded versions, which use the rax register (and all the sub-registers like eax, ax and al), still take two uops.

This short form is only used with an immediate operand of the same size as the destination register (or 32 bits for rax), so it wouldn't be used (for example) for an addition to eax if the immediate fits in a byte, because the 2-byte opcode form would be 2 bytes shorter (one byte longer opcode, but three bytes shorter immediate).
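Concretely, the encodings involved look like this (byte sequences shown for illustration; which form gets emitted depends on the assembler):

adc eax, 1000   ; 15 e8 03 00 00    - short accumulator form: 2 uops on BDW/SKL
adc ecx, 1000   ; 81 d1 e8 03 00 00 - ModRM form: 1 uop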

I first saw it mentioned by Andreas Abel in this comment thread.

Minimum store-forwarding latency is 3 on new(ish) chips, but the load has to arrive at exactly the right time to achieve this

In particular, if the load address is ready to go as soon as the store address is, or at most 1 or 2 cycles later (you have to look at the dependency chains leading into each address to determine this), you'll get a variable store forwarding delay of 4 or 5 cycles (seems to be a 50% chance of each, giving an average of 4.5 cycles). If the load address is available exactly 3 cycles later, you'll get a faster 3 cycle store forwarding (i.e., the load won't be further delayed at all). This can lead to weird effects like adding extra work speeding up a loop, or call/ret sequence, because it delayed the load enough to get the "ideal" store-forwarding latency.
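As a sketch, the simplest forwarding loop has both addresses ready immediately, so per the above you'd expect the slower ~4.5 cycle average forwarding latency:

top:
mov [rdi], rax    ; store address (rdi) is ready as soon as the iteration starts
mov rax, [rdi]    ; so is the load address: expect 4 or 5 cycles of forwarding latency, not 3
jmp top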

Initial discussion here.

Stores to a cache line that is an L1-miss but L2-hit are unexpectedly slow if interleaved with stores to other lines

When stores that miss in L1 but hit in L2 are interleaved with stores that hit in L1 (for example), the throughput is much lower than one might expect (considering L2 latencies, fill buffer counts, and store buffer prefetching), and also weirdly bi-modal: each L2 store (paired with an L1 store) might take 9 or 18 cycles. This effect can be eliminated by prefetching the lines into L1 before the store (or using a demand load to achieve the same effect), at which point you'll get the expected performance (similar to what you'd get for independent loads). A corollary is that it only happens with "blind stores", i.e., stores that write to a location without first reading it: if the location is read first, the load determines the caching behavior and the problem goes away.
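A sketch of the kind of interleaving involved (register names and stride are illustrative):

top:
mov [rsi], rax    ; store that hits in L1
mov [rdi], rax    ; store to a line that misses L1 but hits L2: each such pair may take 9 or 18 cycles
add rdi, 64       ; advance to the next L2-resident line
dec rcx
jnz top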

Discussion at RWT and StackOverflow.

The 4-cycle load-to-load latency applies only in the load-feeds-load case

A load using an addressing mode like [reg + offset] may have a latency of 4 cycles, which is the often-quoted best-case latency for recent Intel L1 caches - but this only applies when the value of reg was the result of an earlier load, e.g., in a pure pointer-chasing loop. If the value of reg was instead calculated by an ALU instruction, the fast path is not applicable, so the load will take 5 cycles (in addition to whatever latency the ALU operation adds).

For example, the following pointer-chase runs at 4 cycles per iteration in a loop:

mov rax, [rax + 128]

While the following equivalent code, which adds only a single-cycle latency ALU operation to the dependency chain, runs in 6 cycles per iteration rather than the 5 you might expect:

mov rax, [rax]
add rax, 128

The add instruction feeds the subsequent load which disables the 4-cycle L1-hit latency path.

Peter Cordes mentioned this behavior somewhere on Stack Overflow but I can't find the place at the moment.

The 4-cycle best-case load latency fails and the load must be replayed when the base register points to a different page

Consider again a pointer-chasing loop like the following:

top:
mov  rax, [rax + 128]
test rax, rax
jnz  top

Since this meets the documented conditions for the 4-cycle best case latency (simple addressing mode, non-negative offset < 2048) we expect to see it take 4 cycles per iteration. Indeed, it usually does: but in the case that the base pointer [rax] and the full address including offset [rax + 128] fall into different 4k pages, the load actually takes 9 or 10 cycles on recent Intel and dispatches twice: i.e., it is replayed after the different-page condition is detected. Update: actually, it is likely the load following the different-page load that dispatches twice: it first tries to dispatch 4 cycles after the prior load, but finds its operand not available (since the prior load is taking 9-10 cycles) and has to try again later.

Apparently, to achieve the 4-cycle latency the TLB lookup happens based on the base register even before the full address is calculated (which presumably happens in parallel), and this condition is checked and the load has to be replayed if the offset puts the load in a different page. On Haswell loads can "mispredict" continually in this manner, but on Skylake a load being mispredicted will force the next load to use the normal full 5-cycle path which won't mispredict, which leads to alternating 5 and 10 cycle loads when the full address is always on a different page (for an average latency of 7.5).

Originally reported by user harold on StackOverflow and investigated in depth by Peter Cordes.

Lines in L3 are faster to access if their last access by another core was a write

As reported in Evaluating the Cost of Atomic Operations on Modern Architectures, the speed of access to a line in the L3 depends on whether the last access by another core was a read or a write. In the read case, the line may be silently evicted from the L1/L2 of the reading core, which doesn't update the "core valid" bits, and hence on a new access the L1/L2 of the core(s) that earlier accessed the line must be snooped and invalidated even if they no longer contain the line. In the case of a modified line, it is written back when evicted and hence the core valid bits are updated (cleared), so no invalidation of other cores' private caches needs to occur. Quoting from the above paper, pages 6-7:

In the S/E states executing an atomic on the data held by a different core (on the same CPU) is not influenced by the data location (L1, L2 or L3) ... The data is evicted silently, with neither writebacks nor updating the core valid bit in L3. Thus, all [subsequent] accesses snoop L1/L2, making the latency identical .... M cache lines are written back when evicted updating the core valid bits. Thus, there is no invalidation when reading an M line in L3 that is not present in any local cache. This explains why M lines have lower latency in L3 than E lines.

An address that would otherwise be complex may be treated as simple if the index register is zeroed via idiom

General purpose register loads have a best-case latency of either 4 cycles or 5 cycles depending mostly on whether they are simple or complex. In general, a simple address has the form [reg + offset] where offset < 2048, and a complex address is anything with a too-large offset, or which involves an index register, like [reg1 + reg2 * 4]. However, in the special case that the index register has been zeroed via a zeroing idiom, the address is treated as simple and is eligible for the 4-cycle latency.

So a pointer-chasing loop like this:

xor esi, esi
.loop
mov rdi, [rdi + rsi*4]
test rdi, rdi ;  exit on null ptr
jnz .loop

can run at 4 cycles per iteration, but the identical loop runs at 5 cycles per iteration if the initial zeroing of rsi is changed from xor esi, esi to mov esi, 0, since the former is a zeroing idiom while the latter is not.

This is mostly a curiosity since if you know the index register rsi is always zero (as in the above example), you'd simply omit it from the addressing entirely. However, perhaps you have a scenario where rsi is often but not always zero; in that case, a check for zero followed by an explicit xor-zeroing could speed things up by a cycle when the index is zero:

test rsi, rsi
jnz notzero
xor esi, esi  ; semantically redundant, but speeds things up!
notzero:
; loop goes here

Perhaps more interesting than this fairly obscure optimization possibility is the implication for the micro-architecture. It implies that the decision on whether an address generation is simple or complex is made at least as late as the rename stage, since that is where this zeroed-register information is available (dynamically). It means that simple-vs-complex is not decided earlier, near the decode stage, based on the static form of the address - as one might have expected (as I did).

Reported and discussed on RWT.

Registers zeroed via vzeroall are sometimes slower to use as source operands

When the most recent modification of a register has been to zero it with vzeroall, it seems to stay in a special state that causes slowdowns when it is used as a source (this stops as soon as the register is written to). In particular, a dependent series of vpaddq instructions using a vzeroall-zeroed ymm register seems to take 1.67 cycles, rather than the expected 1 cycle, on SKL and SKLX.
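A sketch of the reported pattern on SKL/SKX:

vzeroall                  ; ymm0 (among others) is now in the special vzeroall-zeroed state
top:
vpaddq ymm1, ymm1, ymm0   ; dependent chain through ymm1: ~1.67 cycles/iteration instead of 1
dec ecx
jnz top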

More details and initial report at RWT.

Uops from an unlaminated instruction must be part of the same allocation group

Normally, if an instruction requires more than one fused-domain uop, the component uops can be renamed (the rename bottleneck of 4 or 5 per cycle is one of the narrowest bottlenecks on most Intel CPUs) without regard to instruction boundaries. That is, there is no issue if uops from a single instruction fall into different rename groups. As an example, alternating two and three uop instructions could be renamed as AABB BAAB BBAA BBBA ..., where AA and BBB are the two and three uop sequences from each instruction and the spacing shows the 4-uop rename groups: uops from a single instruction span freely across groups.

However, in the case that AA are the two uops from an unlaminated instruction (i.e., one which takes 1 slot in the uop cache but two rename slots), they apparently must appear in the same rename (allocation) group. With the same example as above, but with A being a 1:2 unlaminated instruction and B unchanged as a 3-uop-at-decode instruction, the rename pattern would look like: AABB BAAB BBAA BBBx AABB ..., where x is a 1-slot rename bubble, not present in the first example, caused by the requirement for unlaminated uops to be in the same rename group.
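For reference, one reported example of such a 1:2 unlaminated instruction on HSW/SKL is a three-operand VEX instruction with an indexed memory source, e.g.:

vpaddd ymm0, ymm1, [rdi + rsi*4]   ; 1 slot in the uop cache, but 2 fused-domain uops at rename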

In some cases with high density of unlaminated ops, this could cause a significant reduction in rename throughput, and this can sometimes be avoided with careful reordering.

First known report by Andreas Abel in this StackOverflow question.

POP r12 is slower than popping other registers (except rsp)

The usual encoding of the pop r12 instruction is slower to decode (it goes to the complex decoder, so it is limited to 1 per cycle) than pops of other registers (except pop rsp, which is also slow). The cause seems to be that the short form of pop encodes the register in 3 bits of the opcode byte: those three bits can encode only 8 registers, of course, so the bit that chooses between the classic registers (rax et al) and the x86-64-only registers (r8 through r15) lives in the REX prefix byte. In this encoding, pop rsp and pop r12 share the same 3 bits and so the same opcode byte. As pop rsp needs to be handled specially, it gets handled by the complex decoder and pop r12 gets sucked along for the ride (although it does ultimately decode to a single uop, unlike pop rsp).
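The relevant encodings, shown as a sketch:

pop rbx   ; 5b     - opcode 58 + 3: simple decoder, 1 uop
pop rsp   ; 5c     - opcode 58 + 4: special-cased, complex decoder
pop r12   ; 41 5c  - same opcode byte as pop rsp, selected only by REX.B: also complex decoder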

Reported by Andreas Abel with a detailed explanation of the effect given by Peter Cordes.

Single uop instructions which have other forms that decode to 2 fused-domain uops go to the complex decoder

This is the general case of the pop r12 quirk mentioned above. For example, cdq results in only 1 uop, but it can only be decoded by the complex decoder. It seems like these instructions are mostly (all?) ones where some other variant with the same opcode needs >= 2 fused-domain uops. Presumably what happens is that the predecoder doesn't look beyond the opcode to distinguish the cases and sends them all to the complex decoder, which then ends up emitting only 1 uop in the simple cases.

Reported by Andreas Abel on StackOverflow with the theory above relating to opcodes shared with 2+ uop variants fleshed out by myself, Andreas and Peter Cordes in the comments.

Page walks are cancelled if there are outstanding address-unknown stores

If a page walk starts for a load and the page walker finds that there are address-unknown stores in the store buffer (stores for which the store-address uop has not completed execution, probably because inputs to the addressing calculation are not available), the page walk will be cancelled, on Skylake family and earlier Intel CPUs.
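A sketch of the situation (register names and the delay-producing instruction are illustrative):

imul rcx, rcx, 1    ; stand-in for a longer computation that produces the store address
mov  [rcx], rdx     ; store whose address is still unknown when the load below executes
mov  rax, [rsi]     ; if this load misses in the TLB while the store address is unknown,
                    ; its page walk is cancelled and must be retried later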

Details and investigation on Stack Overflow and original Twitter thread.

Unconfirmed

I haven't confirmed quirks in this section myself, but they have been reported elsewhere.

Dirty data in the L2 comes into L1 in the dirty state so it needs to be written back when evicted

This seems like it would obviously affect L1 <-> L2 throughput since you would need an additional L1 access to read the line to write back and an additional access to L2 to accept the evicted line (and also probably an L2 -> L3 writeback), but on the linked post the claim was actually that it increased the latency of L2-resident pointer chasing. It isn't clear on what uarch the results were obtained.

I could not reproduce this effect in my tests.

After an integer to FP bypass, latency can be increased indefinitely

When an FP instruction receives an input from a SIMD integer instruction, we expect a bypass delay, but Peter Cordes reports that this delay can affect all operands indefinitely, even when the other operands are FP and when bypass definitely no longer occurs:

xsave/rstor can fix the issue where writing a register with a SIMD-integer instruction like paddd creates extra latency indefinitely for reading it with an FP instruction, affecting latency from both inputs. e.g. paddd xmm0, xmm0 then in a loop addps xmm1, xmm0 has 5c latency instead of the usual 4, until the next save/restore. It's bypass latency but still happens even if you don't touch the register until after the paddd has definitely retired (by padding with >ROB uops) before the loop.

Loads have a delay of 4 cycles from allocation to dispatch

Most operations can dispatch in the cycle following allocation, but according to IACA loads can only dispatch in the 4th cycle after allocation. I have not tested to see whether real hardware follows this rule or whether it is an error in IACA's model. Update: I didn't find any evidence for this effect.

Reported by Rivet Amber and discussed on Twitter. My tests seem to indicate that this effect does not occur.
