# SIMD Programming

Kenjiro Taura

#### Contents

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

#### Contents

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

### SIMD: basic concepts

- SIMD : single instruction multiple data
- a *SIMD register* (or a *vector register*) can hold many values (2 16 values or more) of a single type
- a *SIMD instruction* is an instruction that can apply (typically the same) operation on all or some values on a SIMD register(s)
- each value in a SIMD register is called a *SIMD lane* or simply a *lane*
- they are indispensable tools for CPUs to get performance



## Evolving Intel instruction set

 Recent processors increasingly rely on SIMD as an energy efficient way to boost peak FLOPS

| Microarchitecture | ISA     | throughput      | vector     | max SP flops/cycle |
|-------------------|---------|-----------------|------------|--------------------|
|                   |         | (per clock)     | width (SP) | /core              |
| Nehalem           | SSE     | 1  add + 1  mul | 4          | 8                  |
| Sandy Bridge      | AVX     | 1  add + 1  mul | 8          | 16                 |
| Haswell           | AVX2    | 2 fmas          | 8          | 32                 |
| Ice Lake          | AVX-512 | 2 fmas          | 16         | 64                 |

- ISA: Instruction Set Architecture
- vector width: the number of single precision (SP) operands
- fma: fused multiply-add instruction
- e.g., Peak FLOPS of a machine having  $2 \times$  Intel Xeon Gold 6130 (2.10GHz, 32 cores) = 8.6 TFLOPS
- no SIMD?  $\rightarrow$  can tap at most 1/16 of SP peak performance on machines having AVX-512

## Intel SIMD instructions at a glance

Some example AVX-512F (a subset of AVX-512) instructions

| operation | syntax                              | C-like expression     |  |
|-----------|-------------------------------------|-----------------------|--|
| multiply  | <pre>vmulps %zmm0,%zmm1,%zmm2</pre> | zmm2 = zmm1 * zmm0    |  |
| add       | <pre>vaddps %zmm0,%zmm1,%zmm2</pre> | zmm2 = zmm1 + zmm0    |  |
| fmadd     | vfmadd132ps %zmm0,%zmm1,%zmm2       | zmm2 = zmm0*zmm2+zmm1 |  |
| load      | <pre>vmovups 256(%rax),%zmm0</pre>  | zmm0 = *(rax+256)     |  |
| store     | vmovups %zmm0,256(%rax)             | *(rax+256) = zmm0     |  |

- zmm0 ... zmm31 are 512 bit registers; each can hold
  - 16 single-precision (float of C; 32 bits) or
  - 8 double-precision (double of C; 64 bits) floating point numbers
- XXXps stands for packed single precision

### xmm, ymm and zmm registers

• ISA and available registers

| ISA     | registers                         |  |
|---------|-----------------------------------|--|
| SSE     | xmm0,xmm15                        |  |
| AVX     | $\{x,y\}$ mm0, $\{x,y\}$ mm15     |  |
| AVX-512 | $\{x,y,z\}$ mm0, $\{x,y,z\}$ mm31 |  |

• registers and their widths (vector widths)

| register names | register width (bits) |
|----------------|-----------------------|
| xmmi           | 128                   |
| ymmi           | 256                   |
| zmmi           | 512                   |

 $\bullet$  xmmi, ymmi and zmmi are aliased



## Intel SIMD instructions at a glance

• look at register names (x/y/z) and the last two characters of a mnemonic (p/s and s/d) to know what an instruction operates on

|        |                   | operands           | vector                  | ISA     |
|--------|-------------------|--------------------|-------------------------|---------|
|        |                   |                    | /scalar?                |         |
| vmulss | %xmm0,%xmm1,%xmm2 | 1 SPs              | scalar                  | SSE     |
| vmulsd | %xmm0,%xmm1,%xmm2 | $1 \mathrm{\ DPs}$ | $\operatorname{scalar}$ | SSE     |
| vmulps | %xmm0,%xmm1,%xmm2 | $4 \mathrm{SPs}$   | vector                  | SSE     |
| vmulpd | %xmm0,%xmm1,%xmm2 | 2  DPs             | vector                  | SSE     |
| vmulps | %ymm0,%ymm1,%ymm2 | $8 \mathrm{~SPs}$  | vector                  | AVX     |
| vmulpd | %ymm0,%ymm1,%ymm2 | 4 DPs              | vector                  | AVX     |
| vmulps | %zmm0,%zmm1,%zmm2 | $16 \mathrm{SPs}$  | vector                  | AVX-512 |
| vmulpd | %zmm0,%zmm1,%zmm2 | 8 DPs              | vector                  | AVX-512 |

- ...ss : scalar single precision
- $\bullet \dots sd : scalar \ double \ precision$
- ...ps : packed single precision
- ...pd : packed double precision

# Applications/limitations of SIMD

- SIMD is good at parallelizing computations doing *almost* exactly the same series of instructions on contiguous data
- ⇒ generally, main targets are simple loops whose index values can be easily identified

L is the SIMD width

#### Contents

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

### Several ways to use SIMD

- auto vectorization
  - loop vectorization
  - basic block vectorization
- language extensions/directives for SIMD
  - SIMD directives for loops (OpenMP 4.0/OpenACC)
  - SIMD-enabled functions (OpenMP 4.0/OpenACC)
  - array languages (Cilk Plus)
  - specially designed languages
- vector types
  - GCC vector extensions
  - Boost.SIMD
- intrinsics
- assembly programming

#### Contents

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

### Auto loop vectorization

- write scalar loops and hope the compiler does the job
- e.g.,

```
void axpy_auto(float a, float * x, float c, long m) {
for (long j = 0; j < m; j++) {
    x[j] = a * x[j] + c;
}
}
</pre>
```

• compile and run

```
$ clang -o simd_auto -mavx512f -mfma -O3 simd_auto.c
```

- -mavx512f -mfma say "should use AVX-512F and FMA instructions" (better to be explicit for the time being)
- $\bullet$  -03 increases the optimization level (so the compiler should work hard to vectorize it)
- read the notebook about options of other compilers (NVIDIA and GCC)

### How to know if the compiler vectorized it?

• there are options useful to know whether a loop is successfully vectorized and if not, why not

|        | report options                      |
|--------|-------------------------------------|
| Clang  | -R{pass,pass-missed}=loop-vectorize |
| NVIDIA | -M{info,neginfo}=vect               |
| GCC    | -fopt-info-vec-{optimized,missed}   |

- but don't hesitate to dive into assembly code
  - make -S option your friend
  - a trick: enclose loops with inline assembler comments to easily locate assembly code for the loop

```
asm volatile ("# xxxxxx loop begins");
for (i = 0; i < n; i++) {
    ... /* hope to be vectorized */
}
asm volatile ("# xxxxxx loop ends");</pre>
```

#### Contents

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

# OpenMP SIMD constructs

- simd pragma
  - directive to vectorize for loops
  - syntax restrictions similar to omp for pragma apply
- declare simd pragma
  - instructs the compiler to generate vectorized versions of a function
  - with it, loops with function calls can be vectorized

### simd pragma

• basic syntax (similar to omp for):

```
#pragma omp simd clauses
for (i = ...; i < ...; i += ...)
S</pre>
```

- clauses
  - aligned(var,var,...:align)
  - uniform(var,var,...) says variables are loop invariant
  - linear(var,var,...:stride) says variables have the specified stride between consecutive iterations

### simd pragma

```
void axpy_omp(float a, float * x, float c, long m) {
    #pragma omp simd
    for (long j = 0; j < m; j++) {
        x[j] = a * x[j] + c;
    }
}</pre>
```

- note: there are no points in using omp simd here, when auto vectorization does the job
- in general, omp simd declares "you don't mind that the vectorized version is not the same as non-vectorized version"

## simd pragma to vectorize programs explicitly

• computing an inner product:

```
void inner_omp(float * x, float * y, long m) {
  float c = 0;
  #pragma omp simd reduction(c:+)
  for (long j = 0; j < m; j++) {
      c += x[j] * y[j];
  }
}</pre>
```

 note that the above loop is unlikely to be auto-vectorized, due to dependency through c

### declare simd pragma

- when given before a function definition, vectorizes a function body
- when given before a function declaration, tells the compiler a vectorized version of the function is available
- basic syntax (similar to omp for):

```
#pragma omp declare simd clauses
the function definition or declaration
```

- clauses
  - those for simd pragma
  - notinbranch
  - inbranch

#### Reasons that a vectorization fails

- potential aliasing makes auto vectorization difficult/impossible
- complex control flows make vectorization impossible or less profitable
- non-contiguous data accesses make vectorization impossible or less profitable
  - giving hints to the compiler sometimes (not always) addresses the problem

### Aliasing and auto vectorization

- "auto" vectorizer succeeds only when the compiler can guarantee a vectorized version produces an *identical result* with a non-vectorized version
- vectorization of loops operating on two or more arrays is often invalid if they point to be the same array

```
1 for (i = 0; i < m; i++) {
2     y[i] = a * x[i] + c;
3</pre>
```

```
what if, say, &y[i] = &x[i+1]?
```

- N.B., good compilers generate code that first checks
   x[i:i+L] and y[i:i+L] overlap
- if you know they don't overlap, you can make that explicit
- restrict keyword, introduced by C99, does just that

### restrict keyword

• annotate parameters of pointer type with restrict, if you know they never point to the same data

```
void axpy_auto(float a, float * restrict x, float c,

float * restrict y, long m) {
for (long j = 0; j < m; j++) {
    y[j] = a * x[j] + c;
}
}
</pre>
```

• you need to specify -std=gnu99 (C99 standard)

### Control flows within an iteration — conditionals

• a conditional execution (e.g., if statement) within an iteration requires a statement to be executed only for a part of SIMD lanes

• AVX-512 supports *predicated execution (execution mask)* for that

### Control flows within an iteration — nested loops

• a nested loop within an iteration causes a similar problem with conditional executions

• if end depends on i (SIMD lanes), it requires a predicated execution

#### Control flows within an iteration — function calls

- if an iteration has an unknown (not inlined) function call, almost no chance that the loop can be vectorized
  - the function body would have to be executed by scalar instructions anyways

• you can declare that f has a vectorized version with #pragma omp declare simd (with such a definition, of course)

```
#pragma omp declare simd uniform(a, x, b, y) linear(i:1) notinbranch
void f(float a, float * restrict x, float b, float * restrict y, long i);
```

### Non-contiguous data accesses

• ordinary vector load/store instructions access a contiguous addresses

```
vmovups (a),%zmm0
```

loads  ${\tt zmm0}$  with the contiguous 64 bytes from address a

 → they can be used only when iterations next to each other access addresses next to each other

### Non-contiguous data accesses

• that is, they cannot be used for

```
void loop_stride(float a, float * restrict x, float b,

float * restrict y, long n) {

#pragma omp simd

for (long i = 0; i < n; i++) {

y[i] = a * x[2 * i] + b;

}

}</pre>
```

#### let alone

• AVX-512 supports *gather* instructions for such data accesses

### Non-contiguous stores

• what about store

- AVX-512 supports *scatter* instructions for such data accesses
- it is your responsibility to guarantee idx[i:i+L] do not point to the same element

# High level vectorization: summary and takeaway

- CPUs (especially recent ones) have necessary tools
  - arithmetic  $\rightarrow$  vector arithmetic instructions
  - load  $\rightarrow$  vector load and gather instructions
  - ullet store o vector store and scatter instructions
  - if and loops  $\rightarrow$  predicated executions
- generally, the compiler is behind CPUs; whether the compiler is able to use them is another story
- become a friend of compiler reports and assembly (-S)

#### Contents

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

### Vector types

• many compilers extend C by allowing you to define a type that explicitly represents a vector of values

```
typedef float floatv __attribute__((vector_size(64)));
```

• you can use familiar arithmetic expressions on vector types

```
1 floatv x, y, z;
z += x * y;
```

• Clang/NVIDIA/GCC allow you to mix scalars and vectors

```
float a, b;
floatv x, y;
y = a * x + b;
```

- you can combine them with *intrinsics* I'll get to later
- for reasons I don't get into, a better definition is

### An example using vector extension

• scalar code

```
for (long i = 0; i < n; i++) {
   y[i] = a * x[i] + b;
}</pre>
```

• pseudo code (assume L | n (L divides n))

```
for (long i = 0; i < n; i += L) {
   y[i:i+L] = a * x[i:i+L] + b;
}</pre>
```

• a function or macro (V) implementing x[i:i+L]

```
/* take the address, cast it to (floatv*) and deref it */
#define V(lv) (*((floatv*)&(lv)))
```

• it is then

```
for (long i = 0; i < n; i += L) {
   V(y[i]) = a * V(x[i]) + b;
}</pre>
```

### Dealing with remainder iterations

 $\bullet$  when L  $/\!\!/$  n, run remainders after the vectorized version

```
long i;
for (i = 0; i + L <= n; i += L) {
    V(y[i]) = a * V(x[i]) + b;
}
for ( ; i < n; i++) {
    y[i] = a * x[i] + b;
}</pre>
```

- manually doing this is tedious ...
- make n a multiple of L when the problem allows it (otherwise do the tedious work)

### Make a vector value from scalar value(s)

• you typically make a vector value from an array of scalars

```
float * a = ...;
floatv v = *((*floatv)&a[i]);
```

• a macro/function like the following makes the life better

```
floatv& V(float& lv) { return *((floatv*)(&lv)); } // C++
define V(lv) (*((floatv*)&(lv))) // C
```

with which we can write

```
float * a = ...;
floatv v = V(a[i]);
```

#### ... and vice versa

• you typically store a vector value to an array of scalars

```
float * a = ...;
floatv v = ...;
V(a[i]) = v;
```

and get individual scalars from the array

• you can access a particular lane of a vector directly, as if a vector is a C array. e.g.,

```
floatv v;
float s = v[3];
```

• but a CPU generally lacks instructions to access a lane designated by a value not known at the compile time. e.g.,

```
floatv v; int i = ...;
float s = v[i];
```

it might be essentially doing the former each time you access an element, so might be very inefficient

- SIMD Instructions
- 2 SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

### Vector intrinsics

- processor/platform-specific functions and types
- on x86 processors, put this in your code

```
#include <x86intrin.h>
```

and you get

- a set of available vector types
- a lot of functions operating on vector types
- bookmark "Intel Intrinsics Guide" (https://software. intel.com/sites/landingpage/IntrinsicsGuide/) when using intrinsics

# Vector types + intrinsics

vectorizing a loop is largely about converting

```
1 for (i = 0; i < n; i++) {
2   S(i);
}

\Rightarrow

1 for (i = 0; i + L <= n; i += L) {
2   S(i:i+L);
3 } // + remainder code (omitted)
```

• the combination of vector types + intrinsics gives you a powerful way to manually vectorize code (i.e., write S(i:i+L)) the compiler fails to vectorize

## When you want to use manual vectorization

- whenever your compiler fails, but in general
  - a loop containing  $a \ branch \Rightarrow predicated \ execution + value-blending$
  - 2 a loop accessing an array  $non\text{-}contiguously \Rightarrow \text{gather} + \text{scatter}$
  - 3 a loop containing another loop  $\Rightarrow$ 
    - easy if all inner loops have the same trip count
    - follow the strategy for branches (tedious)

#### Vector intrinsics

- vector types:
  - \_m512 (512 bit vector)  $\approx$  float  $\times$  16
  - \_m512d (512 bit vector)  $\approx$  double  $\times$  8
  - \_m512i (512 bit vector)  $\approx$  long  $\times$  8
  - there are no int  $\times$  16
  - similar types for 256/128 bit values (\_m256, \_m256d, \_m256i, \_m128, \_m128d and \_m128i
- functions operating on vector types:
  - \_mm512\_xxx (512 bit),
  - $_{mm256}_{xxx}$  (256 bit),
  - \_mm\_xxx (128 bit),
  - . . .
- each function almost directly maps to a single assembly instruction

# Convenient intrinsics to make a vector value from scalar value(s)

• make a uniform vector

```
1 floatv v = _mm512_set1_ps(f); // { f, f, ..., f }
```

• make an arbitrary vector

```
floatv v = _mm512_set_ps(f0, f1, f2, ..., f15);
```

- SIMD Instructions
- SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

- SIMD Instructions
- SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

#### Predicated instructions

- SIMD instructions that take a vector of boolean values (mask) that specifies lanes for which the instruction is executed
- results on other lanes are taken from another SIMD register (or set zero)
- e.g., an ordinary SIMD add instruction (intrinsics)
  - \_m512 \_mm512\_add\_ps(\_m512 a, \_m512 b)  $\equiv$  [  $a[i] + b[i] \mid i \in 0..L$  ]
- predicated versions
  - \_m512 \_mm512\_maskz\_add\_ps(\_mmask16 k, a, b)  $\equiv [(k[i] ? a[i] + b[i] : 0) | i \in 0..L]$
  - \_m512 \_mm512\_mask\_add\_ps(\_m512 c, k, a, b)  $\equiv$  [ (k[i] ? a[i] + b[i] : c[i]) |  $i \in 0..L$  ]

## Generating a mask

- compare all values of two vectors (with <) \_\_mmask16 k = \_mm512\_cmp\_ps\_mask(a, b, \_CMP\_LT\_OS)  $\equiv$  [  $u[i] < v[i] \mid i \in 0..L$  ]
- you get a 16 bit *mask* that can be used for predicated execution
- search intrinsics guide for symbols to compare in other ways

# A template to vectorize loops containing branches

• a loop having a branch

```
for (i = 0; i < n; i++) {

if (C(i)) {

T(i)

} else {

E(i)
} }
```

```
• ⇒
```

```
for (i = 0; i + L \le n; i += L) {

k = C(i:i+L)

if (any(k)) {

T(i:i+L) predicated on k

}

if (any(^{\kappa}k)) {

E(i:i+L) predicated on ^{\kappa}k

} }
```

• note: values used after the original if statement are made by blending results from both branches (see next slide)

## Blending values

- there are instructions specifically for blending two vectors. e.g., \_m512 \_mm512\_mask\_blend\_ps(k, a, b)  $\equiv [(k[i]? a[i]: b[i]) | i \in i...L]$
- recall that predicated instructions already have a provision for it. e.g.,

```
__m512 _mm512_mask_add_ps(__m512 c, k, a, b) \equiv _mm512_mask_blend_ps(k, a+b, c)
```

## Example

scalar version

```
for (i = 0; i < n; i++) {
   if (i % 2 == 0) {
      y[i] = x[i] + 1;
   } else {
      y[i] = x[i] * 2;
   }
}</pre>
```

•  $\Rightarrow$  pseudo code (assume  $L \mid n$ )

```
for (i = 0; i < n; i += L) {
   __mmask16 k = (i:i+L % 2 == 0);
   t = x[i:i+L] + 1;
   y[i:i+L] = blend(~k, x[i:i+L] * 2 : t);
}</pre>
```

# Example

 $\bullet \Rightarrow \text{actual code}$ 

```
for (i = 0; i < n; i += L) {
   __m512i z = _mm512_set1_epi64(0)
   __mmask16 k = _mm512_cmp_epi64_mask(linear(i) & 1L, z, _MM_CMPINT_EQ)
   __m512i t = V(x[i]) + 1;
   V(y[i]) = _mm512_mask_mul_ps(t, ~k, V(x[i]), 2);
}</pre>
```

- linear(i) is a function (not shown) to generate a vector {
   i, i+1, ..., i+L-1 }
- there are C++ tricks (operator overloading) that make this code less ugly

- SIMD Instructions
- SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

### Gather

- an instruction that can get  $[a[i_0], a[i_1], \dots, a[i_{L-1}]]$  as a vector value
- \_\_m512 \_mm512\_i32gather\_ps(\_\_m512i I, void\* a, int s) takes 16 32-bit indices I and scale s. that is,  $\equiv [f(a[I[i]*s]) \mid i \in 0..L]$  where f(p) gets the value at p as a float value ( $\equiv *((float*)\&p))$
- similar versions for different index/value widths
  - 64 bit indices to gather 8 double precision (64 bit) values
    \_\_m512d \_mm512\_i64gather\_pd
  - 64 bit indices to gather 8 single precision (32 bit) values \_\_m256 \_mm512\_i64gather\_ps
  - 32 bit indices to gather 8 double precision (64 bit) values \_\_m512d \_mm512\_i32gather\_pd
- there are predicated versions as well (\_mm512\_mask\_ixxgather\_ps/pd)

#### Scatter

- an instruction that can assignments  $a[i_0] = x_0; a[i_1] = x_1; \dots; a[i_{L-1}] = x_{L-1};$
- similar name conventions to gather
  - 32 bit indices, to get 32 bit values \_mm512\_i32scatter\_ps
  - 64 bit indices, to get 64 bit values \_mm512\_i64scatter\_pd
  - 64 bit indices, to get 32 bit values: \_mm512\_i64scatter\_ps
  - 32 bit indices, to get 64 bit values: \_mm512\_i32scatter\_pd
- you guessed it. there are masked versions (\_mm512\_mask\_ixxscatter\_ps/pd)

- SIMD Instructions
- SIMD programming alternatives
  - Auto loop vectorization
  - OpenMP SIMD Directives
  - Vector Types
  - Vector intrinsics
- 3 Vectorizing loops compilers fail to vectorize
  - Loops with branches
  - Loops with non-contiguous memory access
  - Loops having another loop inside

# Loops having another loop inside

• consider how to vectorize the *outer* loop

```
for (i = 0; i < m; i++) {
  for (j = 0; j < limit; j++) {
    B(i)  }
}</pre>
```

- if the trip count of the inner loop is the same across lanes (i.e., *limit* does not depend on *i*), then there is no particular difficulty (the compiler nevertheless often fails to vectorize it)
- ullet more difficult is when inner loop has different trip counts depending on i

# Loops having another loop inside

• a general template of scalar code

```
for (i = 0; i < m; i++) {
    while (C(i)) {
        B(i)
    }
}</pre>
```

# Vector types and intrinsics: summary

template

```
for (i = 0; i < n; i++) {

S(i)
}

\rightarrow

for (i = 0; i < n; i += L) {

S(i:i+L)
}
```

- ullet convert every expression into its vector version, which contains what the original expression would have for the L consecutive iterations
- use masks to handle conditional execution and nested loops with variable trip counts
- vectorizing SpMV is challenging but possible with this approach