Performance limiters
travisdowns edited this page May 28, 2018 · 2 revisions
A non-exhaustive list of things that can limit performance of a series of instructions on modern x86. In general you'll be limited by the strictest of the limits that apply (and actual performance might be worse than that since this list isn't exhaustive).
- Rename limit: at most 4 fused-domain uops per cycle can be renamed. Currently all uops go through rename, even ones that appear not to need it, such as `nop`. This generally caps IPC at 4, if you count macro-fused op/jcc pairs as 1.
- ROB limit. The ROB limit of ~200 fused uops means that the CPU cannot execute any instruction further downstream than ~200 uops from the oldest unretired instruction. This is especially relevant when you have long stalls, like a miss to DRAM, since it puts an upper limit on how far the CPU can run ahead while waiting for the stalled instruction. At a UPC (uops per cycle) of 4, a 200-entry ROB will be consumed in 50 cycles, or about 17 ns on a 3 GHz core. This means that you can never completely hide a miss to DRAM (at least 50 ns) through out-of-order execution at 4 UPC (but you could if the UPC was lower). You can barely hide misses to L3. The blog entry "Measuring ROB capacity" uses exactly this scenario to measure the capacity of the ROB.
- PRF limit. The number of physical registers available for speculative execution limits execution in a similar way to the ROB limit. The number of physical registers is generally less than the number of ROB entries, so if all your instructions consume PRF entries (i.e., have a destination) of the same type, you'll hit this limit before you hit the ROB limit. In practice, some instructions such as branches and zeroing idioms (and possibly RMW and compare-with-memory, but that depends on the flags implementation) don't consume PRF entries, so which limit you hit first depends on the ratio of PRF-consuming to non-consuming instructions. Note that the integer and FP PRFs are distinct on recent Intel, so you can consume from each PRF independently, meaning that vectorized code mixed with at least some GP code is unlikely to hit the PRF limit before it hits the ROB limit. The same blog entry covers PRF limits as well, including some non-ideal behaviors that I've glossed over here. This thread includes newer results, indicating also that Skylake has approximately 150 speculative GP registers available, 30 fewer than the documented size of the PRF (this is interesting in the sense that 30 is a lot of registers to reserve for non-speculative state).
- EU/port limit. Each instruction can only execute on a limited number of ports/execution units, listed in Agner's instruction tables. The number of instructions that need to go to an EU or a combination of EUs puts an upper limit on the throughput of those instructions. This is sometimes called "port pressure". IACA considers this limit and "solves" for port pressure.
- Store limit. Max of 1 store per cycle. This is really just a consequence of the more general EU/port limit described above, but it's worth mentioning by itself since stores are common in practice.
- Load limit. Max of 2 loads per cycle. This is really just a consequence of the more general EU/port limit described above, but it's worth mentioning by itself since loads are common in practice.
- Complex addressing limit. Max of 1 load (any addressing) concurrent with a store with complex addressing per cycle. Here, complex addressing means any indexed addressing mode (i.e., two registers in the address calculation). This occurs because the dedicated store-address unit on port 7 can only handle non-complex addresses. Even if stores have non-complex addressing, it may not be possible to sustain 2 loads and 1 store per cycle, because the store may sometimes choose one of the port 2 or port 3 AGUs instead, starving a load that cycle.
- Branches in flight. Max of 48 branches in flight. Intel CPUs since Sandy Bridge can apparently only keep 48 branches in flight (i.e., count of unretired branches) before stalling. This might hit you before you reach the ROB or PRF limits if you have branch-heavy code. The test used to determine this limit is described here, but the exact stall mode isn't clear: will the front-end stall on the 49th branch, blocking new instructions of all types from executing, or does it only stall branches from execution, potentially allowing other types of instructions to proceed? The latter case seems unlikely to affect performance in most cases, since branches don't participate in data-dependent dependency chains.
- Calls in flight. Max of 14 or 15 calls in flight on recent Intel CPUs. This is similar to the branch limit described above, but with a much lower limit. The number comes from the same place, and the same caveat with respect to the unknown stall mechanism applies.
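
As a rough illustration of how these limits combine, the sketch below redoes the ROB run-ahead arithmetic from the list and picks the strictest of three throughput limits (rename, load, store) for one loop iteration. The function names and the example uop counts are hypothetical, chosen only for illustration. The default widths (4 uops renamed, 2 loads and 1 store per cycle, ~200 ROB entries, 3 GHz, 50 ns DRAM miss) are the figures quoted above, not measurements.

```python
def rob_runahead_ns(rob_entries=200, uops_per_cycle=4.0, freq_ghz=3.0):
    """Time in ns until the ROB fills up behind a stalled instruction."""
    cycles = rob_entries / uops_per_cycle
    return cycles / freq_ghz  # 1 GHz is one cycle per ns


def strictest_limit(fused_uops, loads, stores,
                    rename_width=4, loads_per_cycle=2, stores_per_cycle=1):
    """Return (limit name, cycles) for the tightest per-iteration lower bound.

    Only three of the limits from the list are modelled here; real code can
    be slower because of the other limits or of dependency-chain latency.
    """
    bounds = {
        "rename": fused_uops / rename_width,
        "load": loads / loads_per_cycle,
        "store": stores / stores_per_cycle,
    }
    name = max(bounds, key=bounds.get)
    return name, bounds[name]


DRAM_MISS_NS = 50.0  # lower bound for a DRAM miss, as quoted in the text

# At 4 UPC the ROB fills long before a DRAM miss returns.
print(rob_runahead_ns() < DRAM_MISS_NS)
# At 1 UPC the same ROB covers more than the 50 ns miss.
print(rob_runahead_ns(uops_per_cycle=1.0) > DRAM_MISS_NS)
# Hypothetical loop body: 6 fused uops, of which 2 are loads and 2 are stores.
print(strictest_limit(fused_uops=6, loads=2, stores=2))
```

For the hypothetical loop body, the store limit (2 cycles per iteration) is stricter than the rename limit (1.5 cycles) or the load limit (1 cycle), so it sets the bound. The result is only a lower bound on cycles per iteration, in line with the caveat that this list is not exhaustive.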