While AVX-512 is most visibly an extension of AVX and AVX2 to a 512 bit width, AVX-512VL instructions are 128 or 256 bits wide. The VL subset comprises 27% of AVX-512 intrinsics and is often of greater interest than 512 bit operation: AMD Zen 4 processors implement AVX-512 at 256 bit width, and Intel processors may be no faster at 512 bits than at 256. The primary advantage of AVX-512 and AVX-512VL over previous instruction sets (AVX, AVX2, and FMA) is arguably reduced register spilling, due to the expansion from 16 ymm to 32 zmm registers and the addition of eight mask registers. The number of intrinsics triples to provide mask and maskz versions which make use of the mask registers. Of the 13 current instruction groups, four are general (F, VL, DQ, BW) and nine accelerate more specific workloads (BITALG, BF16, CD, FP16, IFMA52, VBMI, VBMI2, VNNI, and VPOPCNTDQ).
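For example (a minimal sketch; the wrapper function names are invented for illustration), the AVX2 intrinsic `_mm256_add_epi32` gains merge-masked and zero-masked forms under AVX-512VL, with a `__mmask8` selecting which 32 bit lanes are written:

```c
#include <immintrin.h>

/* Merge masking: lanes where k is 0 keep the corresponding value from src. */
__m256i masked_add(__m256i src, __mmask8 k, __m256i a, __m256i b) {
    return _mm256_mask_add_epi32(src, k, a, b);
}

/* Zero masking: lanes where k is 0 are set to zero. */
__m256i maskz_add(__mmask8 k, __m256i a, __m256i b) {
    return _mm256_maskz_add_epi32(k, a, b);
}
```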
group | what | AVX10 | instructions | intrinsics (VL) | Zen | Raptor, Golden Cove | Sunny Cove | Skylake | Knights |
---|---|---|---|---|---|---|---|---|---|
F | foundation | yes | 389 | 1435 | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | all |
VL | 128 and 256 bit widths | yes | 223 | 1208 (1028) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
DQ | doubleword and quadword | yes | 87 | 399 (176) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
BW | byte and word | yes | 150 | 764 (446) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | |
CD | conflict detection | yes | 8 | 42 (28) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | X, Xeon 2017+ | all |
BITALG | population count expansion | yes | 5 | 24 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | |
IFMA52 | big integer FMA | yes | 3 | 18 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | (Cannon Lake) | |
VBMI | vector byte manipulation | yes | 8 | 30 (20) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | (Cannon Lake) | |
VBMI2 | vector byte manipulation | yes | 21 | 150 (100) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | |
VNNI | vector neural network | yes | 5 | 36 (24) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | Cascade Lake | |
VPOPCNTDQ | population count | yes | 3 | 18 (12) | 4 | Emerald, Sapphire | Rocket, Tiger, Ice | | Mill |
BF16 | bfloat16 | yes | 5 | 27 (18) | 4 | Emerald, Sapphire | | Cooper Lake | |
FP16 | half precision | yes | 96 | 938 (1600) | 4 | Emerald, Sapphire | | | |
VP2INTERSECT | vector pair to mask pair | no | 2 | 6 | | | Tiger | | |
ER | exponential and reciprocal | no | 12 | 60 | | | | | all |
PF | prefetch | no | 9 | 20 | | | | | all |
4FMAPS | single precision 4x1 FMA | no | 4 | 12 | | | | | Mill |
4NNIW | vector neural network | no | 2 | 6 | | | | | Mill |
total | | | 1031 | 5193 (2060) | | | | | |
AVX-512 was introduced by Intel in 2016 on Xeon Phi processors (Knights Landing and, later, Knights Mill). Beginning in Q3 2017, Intel Skylake X-series parts (i7 and i9) and Xeon processors enabled support for 3959 of the 5193 AVX-512 intrinsics now defined by Intel. In Q3 2019, Ice Lake (Sunny Cove microarchitecture) expanded the set to 4130 intrinsics, and the Golden Cove microarchitecture expanded it to 5095 intrinsics (announced Q4 2021, though parts with AVX-512 enabled appear unlikely to ship until 2023). Xeon Phi's 4FMAPS, 4NNIW, and PF instruction groups have been superseded by more recent groups and architectural changes, and so appear to be obsolete. ER instructions are valuable for certain floating point calculations but have not been reimplemented.
For AMD parts, the table above is based on Phoronix's performance analysis, as AMD hasn't updated the AMD64 Architecture Programmer's Manual.
For Intel parts, the table above derives from the Intel Intrinsics Guide, Intel ARK, and the Intel 64 and IA-32 Architectures Software Developer's Manuals. It will therefore be inaccurate if Intel's information is inaccurate or if transcription errors were made. In particular, sections 15.2-4 of the architecture manual, volume 1, require software to check for F before using other groups. However, the Intrinsics Guide does not indicate corresponding dependencies for many groups. The spreadsheet in this repo lists each group's instructions and intrinsics.
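As a sketch of the check described above (using GCC and Clang's `__builtin_cpu_supports`; the wrapper function is invented for illustration), a VBMI2 code path should verify F, and VL if 128 or 256 bit forms are used:

```c
/* Returns nonzero if the 128/256 bit VBMI2 intrinsics may be used.
   F is checked first per the architecture manual; VL is needed for
   the 128 and 256 bit forms. */
int can_use_vbmi2_vl(void) {
    return __builtin_cpu_supports("avx512f")
        && __builtin_cpu_supports("avx512vl")
        && __builtin_cpu_supports("avx512vbmi2");
}
```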
In July 2023, Intel announced AVX10. AVX10 formalizes consistent availability of 128, 256, and 512 bit instructions (AVX10/128, AVX10/256, and AVX10/512) across AVX-512 subsets (Intel 2023a, 2023b) and is expected to launch as AVX10.1 in 2024 via Granite Rapids Xeons. As of late 2023, it appears most likely Xeon P-cores will support AVX10/512 while E-cores (and possibly desktop P-cores) will support AVX10/256. AVX10 appears to be backwards compatible with existing AVX-512 code and VL intrinsics at the given width.
release dates | processor | laptop, desktop | workstation, server |
---|---|---|---|
Q4 2023 | Emerald Rapids | | Silver, Gold, Platinum |
Q2 to Q4 2023 | Zen 4 | | 7900, 8004, 9004, 97x4 |
Q1 2023 | Sapphire Rapids | | W, Bronze, Silver, Gold, Platinum |
Q3 2022 to Q4 2023 | Zen 4 | 7040, 7000 | |
Q1 2021 to Q3 2021 | Rocket Lake | i5, i7, i9 | E, W |
Q3 2020 to Q3 2021 | Tiger Lake | i3, i5, i7, i9 | W |
Q2 2020 | Cooper Lake | | Gold, Platinum |
Q3 2019 to Q2 2021 | Ice Lake | i3, i5, i7 | W, Silver, Gold, Platinum |
Q2 2019 to Q1 2020 | Cascade Lake | i9 | W, Bronze, Silver, Gold, Platinum |
Q3 2018 | Cannon Lake | i3-8121U | |
Q3 2017 to Q2 2018 | Skylake | X-Series i7, i9 | W, Bronze, Silver, Gold, Platinum |
Q4 2017 | Knights Mill | | Phi |
Q2 2016 to Q4 2016 | Knights Landing | | Phi |
Prior to AMD's Zen 4 release in September 2022, AVX-512 was most readily, albeit somewhat briefly, available on Intel 11th generation i5, i7, and i9 parts (Rocket Lake) before being disabled in the 12th generation (Alder and, likely, Raptor Lake). Cascade Lake and Skylake Xeons provided AVX-512, with more limited availability from Cooper, Tiger, and Ice Lake. Cannon Lake i3s (Palm Cove microarchitecture) are rare, and the Kaby and Coffee Lake iterations of Skylake lack AVX-512. Alder Lake P-cores implement 14 instruction groups (the Sunny Cove groups plus BF16, FP16, and VP2INTERSECT) but typically have AVX-512 fused off. AVX-512 could be enabled on early Alder Lake parts, but Intel has since suppressed this ability through microcode.
Specific processors are listed in this repo's spreadsheet, and the table above uses Intel ARK release dates for Intel parts. No Pentium or Celeron processor supports AVX-512. AMD did not support AVX-512 prior to Zen 4.
The 18 AVX-512 instruction groups have individual CPUID flags and, in principle, an instruction could require an arbitrary number of groups to be present. In practice this is rarely a concern, as the only dependencies which exist are on groups F and VL. Since VL is not present independent of F on any of Intel's processors, its dependency is always satisfied. Similarly, all processors with instruction groups containing VL-dependent intrinsics implement VL. There are also 324 intrinsics of 128 or 256 bit width for which the Intrinsics Guide does not indicate a VL requirement. These are primarily ss and sd floating point intrinsics which modify only the low 32 or 64 bits of a register, as in the sketch below.
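One such case (the wrapper function is invented for illustration): `_mm_mask_add_ss` operates on 128 bit registers yet is listed under AVX-512F alone, since it writes only the low 32 bits; the upper 96 bits are copied from `a` and the low lane merges with `src` under `k`:

```c
#include <immintrin.h>

/* dst[31:0] = k[0] ? a[31:0] + b[31:0] : src[31:0]; dst[127:32] = a[127:32].
   Requires only AVX-512F despite the 128 bit operands. */
__m128 low_lane_add(__m128 src, __mmask8 k, __m128 a, __m128 b) {
    return _mm_mask_add_ss(src, k, a, b);
}
```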
It appears reasonable to assume all Skylake and later processors with AVX-512 support will implement at least the F, CD, VL, DQ, and BW groups. While Intel could choose otherwise, doing so would complicate hardware implementation, compiler support, and software compatibility. It is also plausible the BITALG, IFMA52, VBMI, VBMI2, VNNI, and VPOPCNTDQ groups will be consistently implemented together from Ice Lake forward for the same reasons. However, Intel does not seem to have made any official statement regarding future compatibility. Similarly, AMD does not appear to have documented which groups Zen 4 implements, though an AMD implementation seems likely to be broadly compatible with Intel's.
Since AVX-512 is restricted to 512 bit widths on the now-discontinued Xeon Phis, these processors do not implement the vector length extensions of the VL group. They therefore lack dependencies between groups and share only the F and CD groups with Skylake and later implementations.
The Skylake-Cascade Lake and Sunny Cove-Cypress Cove microarchitectures provide SIMD computation on ports 0, 1, and 5. It appears AVX-512 operation is obtained by combining ports 0 and 1 and, when two AVX-512 instructions per clock are supported, possibly by combining ports 5 and 6. In addition to downclocking when the instruction mix and active thread count cross power license boundaries, use of AVX-512 may sometimes be slower than implementing the same workload with AVX and AVX2. This occurs because the instruction rate decrease from 3x256 per clock to 2x512 may not be offset by wider loads and stores (Fog 2015, Stackoverflow), zmm register availability, or use of more efficient instructions. AVX-512 may be similarly disadvantageous on processors restricted to 1x512, which include Bronze, Silver, some 5000 series Gold, and D Skylake Xeons as well as Knights processors. As of September 2022, it is uncertain how the addition of port 10 and other microarchitectural changes in Golden Cove may alter these considerations.
In general, compute kernel throughput is sensitive to the instruction level parallelism available within the kernel's inner loop and to a processor's FMA, ALU, shift, floating point divide, shuffle, and load and store capabilities. For Ice Lake, Intel indicates one AVX-512 FMA unit and shuffle unit and two AVX-512 ALUs (e.g. Cutress 2019). Some kernels may therefore execute more quickly at 256 bit width due to accessing two FMA and shuffle units rather than one. Additionally, using the 1400 128 and 256 bit AVX-512VL intrinsics to reduce register spilling (they can address 32 registers rather than the 16 available to AVX and AVX2) may be more beneficial than the doubled width of the 3700 512 bit intrinsics. In some cases 128 bit kernels can also be faster than 256 or 512 bit ones due to computational details of the kernel, such as dependencies between AVX lanes.
It's therefore useful to profile SIMD implementations at 128, 256, and 512 bit widths across processors and across the amounts of computation to be performed. In performance critical code segments, this can result in width dispatching controlled by loop content or iteration count rather than by which instructions the processor supports.
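A minimal sketch of such dispatch (the crossover threshold of 64 elements is illustrative and would come from profiling), assuming both paths are compiled with AVX-512F enabled:

```c
#include <immintrin.h>
#include <stddef.h>

/* Scale n floats by s, choosing vector width by trip count. */
void scale_f32(float *x, size_t n, float s) {
    size_t i = 0;
    if (n >= 64) {                      /* hypothetical profiled crossover */
        __m512 vs = _mm512_set1_ps(s);
        for (; i + 16 <= n; i += 16)
            _mm512_storeu_ps(x + i, _mm512_mul_ps(vs, _mm512_loadu_ps(x + i)));
    } else {
        __m256 vs = _mm256_set1_ps(s);
        for (; i + 8 <= n; i += 8)
            _mm256_storeu_ps(x + i, _mm256_mul_ps(vs, _mm256_loadu_ps(x + i)));
    }
    for (; i < n; i++)                  /* scalar tail */
        x[i] *= s;
}
```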