• L1-miss loads other than the first hitting the same L2 line have a longer latency of 19 cycles vs 12
  • adc with a zero immediate, i.e., adc reg, 0 is twice as fast as with any other immediate or register source on Haswell-ish machines
  • Short form adc and sbb using the accumulator (rax, eax, ax, al) are two uops on Broadwell and Skylake
  • Minimum store-forwarding latency is 3 cycles on new(ish) chips, but the load has to arrive at exactly the right time to achieve this
  • Stores to a cache line that is an L1-miss but L2-hit are unexpectedly slow if interleaved with stores to other lines
  • The 4-cycle load-to-load latency applies only in the load-feeds-load case
  • The 4-cycle best-case load latency fails and the load must be replayed when the base register points to a different page
  • Lines in L3 are faster to access if their last access by another core was a write
  • An address that would otherwise be complex may be treated as simple if the index register is zeroed via idiom
  • Registers zeroed via vzeroall are sometimes slower to use as source operands
  • Uops from an unlaminated instruction must be part of the same allocation group
  • POP r12 is slower than popping other registers (except rsp)
  • Single-uop instructions that have other forms that decode to 2 fused-domain uops go to the complex decoder
  • Page walks are cancelled if there are outstanding address-unknown stores
  • Unconfirmed
  • Dirty data in the L2 comes into L1 in the dirty state so it needs to be written back when evicted
  • After an integer to FP bypass, latency can be increased indefinitely
  • Loads have a delay of 4 cycles from allocation to dispatch