This repository has been archived by the owner on Mar 20, 2024. It is now read-only.

proposal to cover more outer loop vectorization cases by vediv #182

Closed
jnk0le opened this issue May 20, 2019 · 2 comments
Labels
Resolve after v1.0 Does not need to be resolved for v1.0 draft

Comments

@jnk0le

jnk0le commented May 20, 2019

This mostly concerns block algorithms that are hard to auto-vectorize without handling corner cases and/or cannot be executed efficiently on partial blocks (shuffles, rotations, etc.), such as bitsliced crypto.

If we want to, for example, parallelize an algorithm that works on blocks of 16 x 8-bit elements, we can't use vediv because of the maximum EDIV limit, and none of the alternatives lets us do it efficiently.

The mentioned alternatives are:
a) Iterate over every single block and suffer from SIMD syndrome.
b) Try to vectorize in the default mode and handle all possible corner cases, e.g.:

  • vsetvl{i} is allowed to set vl anywhere between ceil(AVL / 2) and VLMAX when AVL lies between VLMAX and VLMAX*2.
    So with SLEN=64b, SEW=8b, VLMAX=64, AVL=80 we will most probably get vl equal to ceil(AVL/2)=40, which is 2.5 blocks. As a workaround, we need to ensure that the requested AVL is never between VLMAX and VLMAX*2.
  • The issues raised in "about vid.v" #178 and "vrgather instruction under SEW=8 can only read from 256 elements" #177 require us to handle corner cases for every SEW<XLEN.
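
The vl corner case above can be sketched in a few lines of Python. This is only an illustrative model, assuming the draft-spec rule as I understand it (vl = AVL when AVL <= VLMAX, anything from ceil(AVL/2) to VLMAX when AVL < 2*VLMAX, and VLMAX otherwise). The helper names `vl_range`, `block_safe_avl` and `strip_mine_worst_case` are hypothetical and not part of any spec or toolchain.

```python
def vl_range(avl, vlmax):
    """Return (lowest, highest) vl the assumed draft rule permits for a requested AVL."""
    if avl <= vlmax:
        return avl, avl
    if avl < 2 * vlmax:
        return (avl + 1) // 2, vlmax  # ceil(AVL / 2) .. VLMAX
    return vlmax, vlmax


def block_safe_avl(remaining, vlmax, block):
    """Pick an AVL that never falls strictly between VLMAX and 2*VLMAX.

    Assumes both the remaining element count and VLMAX are whole
    multiples of the block size (16 elements in the example above).
    """
    assert remaining % block == 0 and vlmax % block == 0
    if vlmax < remaining < 2 * vlmax:
        return vlmax  # clamp, so vl is forced to exactly VLMAX
    return remaining


def strip_mine_worst_case(total, vlmax, block):
    """Simulate a strip-mined loop where hardware always picks the lowest legal vl."""
    vls = []
    remaining = total
    while remaining > 0:
        lowest, _ = vl_range(block_safe_avl(remaining, vlmax, block), vlmax)
        vls.append(lowest)
        remaining -= lowest
    return vls
```

With VLMAX=64 and 16-element blocks, `vl_range(80, 64)` allows vl=40 (2.5 blocks), while clamping the request through `block_safe_avl` keeps every iteration on a whole number of blocks. The cost is the extra compare-and-clamp in every loop that needs block granularity.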

The minimum set of changes that I propose is:

  • Add one extra bit to the vediv[] field in the vtype register so that all SEWs can be divided into 8-bit sub-elements.
  • Allow "half-operational" or capped large SEWs (integer ALU/accumulators limited to the XLEN LSBs, as FP already is), since we are not likely to require 256+ bit accumulators in e.g. the default unix profile.
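
To show why a single extra bit is enough, here is a small arithmetic sketch. It assumes the vediv[] field encodes log2(EDIV), so the current 2-bit field covers EDIV = 1, 2, 4, 8, and that the largest SEW of interest is 1024 bits. The function names are hypothetical.

```python
def max_ediv(field_bits):
    """Largest EDIV expressible if the field holds log2(EDIV) (an assumption)."""
    return 1 << ((1 << field_bits) - 1)


def ediv_for_8bit_subelements(sew):
    """EDIV needed to split one SEW-bit element into 8-bit sub-elements."""
    return sew // 8


def fits(sew, field_bits):
    """True if an SEW-bit element can be divided down to 8-bit sub-elements."""
    return ediv_for_8bit_subelements(sew) <= max_ediv(field_bits)
```

Under these assumptions a 16 x 8-bit block held in SEW=128 needs EDIV=16, which `fits(128, 2)` rejects and `fits(128, 3)` accepts. A 3-bit field reaches EDIV=128, so even SEW=1024 can be split into bytes.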

Additionally we could consider more case specific changes like:

  • Masking on sub-elements (SEW/EDIV>=8): since the masking is going to sit in the actual loop, frequent vsetvl changes will clobber cache/decoders, and some architectures may not handle that efficiently. EDIT: it can take an extra bit in vtype to select sub-element or whole-element masking if both are valuable.
  • In non-crypto cases, making instructions like vlx{b,h,w} work on sub-elements, while vlxe works on whole elements as advertised, makes more sense to me (SEW/EDIV>=8). EDIT: no longer relevant to 1.0

What are your thoughts about it?

@kasanovic
Collaborator

There is space to extend EDIV later in the vtype register, but for now we'll keep 1,2,4,8 in base V spec.

@kasanovic kasanovic added the Resolve after v1.0 Does not need to be resolved for v1.0 draft label Jun 28, 2020
@jnk0le
Author

jnk0le commented Feb 23, 2023

EDIV was replaced by element groups.

@jnk0le jnk0le closed this as completed Feb 23, 2023