proposal to cover more outer loop vectorization cases by vediv #182

jnk0le · 2019-05-20T00:27:30Z

This mostly concerns block algorithms that are hard to auto-vectorize without handling corner cases and/or cannot be efficiently executed partially (shuffling, rotations, etc.), such as bitsliced crypto.

If we want to, for example, parallelize a 16x8-bit element algorithm, we can't use vediv because of the maximum EDIV limitation, and none of the alternatives lets us do it efficiently.

The mentioned alternatives are:
a) Iterate over every single block and suffer from SIMD syndrome.
b) Try to vectorize in the default mode and handle all possible corner cases, e.g.:

vsetvl{i} is allowed to set vl anywhere between ceil(AVL / 2) and VLMAX.
So in the case of SLEN=64b, SEW=8b, VLMAX=64, AVL=80 we will most probably get vl equal to ceil(AVL/2)=40, which is 2.5 blocks. As a solution, we need to ensure that the requested AVL is never between VLMAX and VLMAX*2.
The issues raised in "about vid.v" (#178) and "vrgather instruction under SEW=8 can only read from 256 elements" (#177) require us to handle corner cases for every SEW<XLEN.
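To make the AVL=80 case concrete, here is a minimal Python sketch of the vl range described above. The helper names `vl_bounds` and `safe_avl` are hypothetical, and the two outer branches (vl = AVL when AVL <= VLMAX, vl = VLMAX when AVL >= 2*VLMAX) are assumed from the draft rule rather than stated in this issue. It is a model of the constraint, not of any particular implementation.

```python
import math

BLOCK = 16  # elements per block (16 x 8-bit), as in the example above


def vl_bounds(avl, vlmax):
    """Return (lowest, highest) vl an implementation may pick for this AVL.

    Assumed draft rule: exact for AVL <= VLMAX, VLMAX for AVL >= 2*VLMAX,
    and anything in [ceil(AVL/2), VLMAX] in between.
    """
    if avl <= vlmax:
        return (avl, avl)
    if avl >= 2 * vlmax:
        return (vlmax, vlmax)
    return (math.ceil(avl / 2), vlmax)


def safe_avl(remaining, vlmax):
    """Clamp the request so it never lies strictly between VLMAX and 2*VLMAX."""
    if vlmax < remaining < 2 * vlmax:
        return vlmax
    return remaining


lo, hi = vl_bounds(80, 64)
print(lo, hi, lo / BLOCK)              # 40 64 2.5 -> vl may split a block
print(vl_bounds(safe_avl(80, 64), 64))  # (64, 64) -> four whole blocks
```

Clamping the request this way keeps vl a multiple of the block size only if VLMAX itself is a multiple of it, which is one more corner case the loop has to check.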

The minimum set of changes that I propose is:

- Add one extra bit to the vediv[] field in the vtype register so that all SEWs can be divided into 8-bit sub-elements.
- "Half-operational" or capped large SEWs (integer ALU/accumulators limited to the XLEN LSBs, as is already the case for FP), since we are unlikely to require 256+ bit accumulators in e.g. the default unix profile.

Additionally we could consider more case specific changes like:

- Masking on sub-elements (SEW/EDIV>=8): since it is going to sit in the actual loop, frequent vsetvl changes will clobber caches/decoders, and some architectures may not handle that efficiently. EDIT: it could take an extra bit in vtype to select sub-element or whole-element masking if both are valuable.
- ~~In non-crypto cases, making instructions like vlx{b,h,w} work on sub-elements, while vlxe works on whole elements as advertised, makes more sense to me (SEW/EDIV>=8)~~ EDIT: no longer relevant to 1.0

What are your thoughts about it?


kasanovic · 2019-05-31T14:30:32Z

There is space to extend EDIV later in the vtype register, but for now we'll keep 1,2,4,8 in base V spec.

jnk0le · 2023-02-23T22:18:35Z

ediv was replaced by element groups

kasanovic added the "Resolve after v1.0" label (Does not need to be resolved for v1.0 draft) Jun 28, 2020

jnk0le mentioned this issue Jan 24, 2021: K-extension vector instructions #566 (Open)

jnk0le closed this as completed Feb 23, 2023
