This repository has been archived by the owner on Mar 20, 2024. It is now read-only.
proposal to cover more outer loop vectorization cases by vediv #182
Labels
Resolve after v1.0
Does not need to be resolved for v1.0 draft
This mostly touches block algorithms that are hard to auto vectorize without handling corner cases and/or cannot be efficiently executed partially (shuffling, rotations etc.), like bitsliced crypto.
If we want to, for example, parallelize a 16x8bit element algorithm, we can't use vediv due to the maximum `EDIV` limitation, and neither of the alternative options lets us do it efficiently. The mentioned alternatives are:
a) Iterate over every single block and suffer from SIMD syndrome.
b) Try to vectorize with default mode and handle all possible corner cases, e.g.:
- `vsetvl{i}` is allowed to set `vl` anywhere between `ceil(AVL / 2)` and `VLMAX`. So in case `SLEN=64b SEW=8b, VLMAX=64, AVL=80` we will most probably get `vl` equal to `ceil(AVL/2)=40`, which is 2.5 blocks of 16 elements. As a solution we need to ensure that the requested `AVL` is never between `VLMAX` and `VLMAX*2`.
- `SEW<XLEN`
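The `vl` arithmetic in alternative b) can be checked with a small Python sketch. It assumes the draft-spec constraints on `vsetvl{i}`: `vl = AVL` when `AVL <= VLMAX`, anything from `ceil(AVL/2)` to `VLMAX` when `AVL < 2*VLMAX`, and `vl = VLMAX` otherwise. Only the middle case is stated in this issue; the other two are my assumption about the surrounding rule. The names `allowed_vl`, `safe_avl` and `BLOCK` are hypothetical helpers for illustration, not part of any toolchain.

```python
import math

BLOCK = 16  # elements per block, as in the 16x8bit example


def allowed_vl(avl, vlmax):
    """Return (lowest, highest) vl that vsetvl{i} may pick under the assumed rule."""
    if avl <= vlmax:
        return avl, avl
    if avl < 2 * vlmax:
        # The implementation is free to choose any value in this range.
        return math.ceil(avl / 2), vlmax
    return vlmax, vlmax


def safe_avl(avl, vlmax):
    """Workaround: never request an AVL strictly between VLMAX and 2*VLMAX."""
    return vlmax if vlmax < avl < 2 * vlmax else avl


lo, hi = allowed_vl(80, 64)
print(lo, hi, lo / BLOCK)               # lowest vl covers a fractional number of blocks
print(allowed_vl(safe_avl(80, 64), 64))  # clamped request yields exactly VLMAX
```

With the clamp applied, the request is exactly `VLMAX`, so the returned `vl` is a whole number of blocks whenever `VLMAX` itself is a multiple of the block size. The cost is an extra compare in the strip-mining loop.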
The minimum set of changes that I propose is:
- `vediv[]` field in `vtype` register so all SEWs can be divided into 8bit sub elements.
- `XLEN` LSBs like is the FP) large SEWs as we are not likely to force 256+bit accumulators in e.g. the default unix profile.

Additionally we could consider more case specific changes like:
- `SEW/EDIV>=8`) as it's going to sit in the actual loop, frequent `vsetvl` changes will clobber cache/decoders and some architectures may not handle it efficiently. EDIT: it can take an extra bit in `vtype` to select sub/whole masking if both are valuable in non-crypto cases.
- making instructions like `vlx{b,h,w}` work on sub elements, whereas `vlxe` works on whole elements as advertised, makes more sense to me (`SEW/EDIV>=8`). EDIT: no longer relevant to 1.0.

What are your thoughts about it?