# Session 6: Vectorisation and data layout

COMP52315: performance engineering

Lawrence Mitchell\*

<sup>\*</sup>lawrence.mitchell@durham.ac.uk

# A beginning

We've seen that for high floating point performance we need vectorisation

#### What does this mean for code?

Vectorisation is typically a transformation performed on loops

- 1. Which loops can be vectorised?
- 2. How do we convince the compiler to vectorise them?
- 3. What might we expect?  $\Rightarrow$  roofline and other models

# Simple guidelines

#### **Necessary**

- 1. Inner loop
- 2. Countable (number of iterations is known at loop entry)
- 3. Single entry/exit
- 4. No conditionals (but...)
- 5. No function calls (but...)


#### Best performance when

- 1. Simple inner loops (ideally stride-1 access), e.g. for (i = 0; i < N; i++)
- 2. Minimize indirect addressing
- 3. Align data structures to SIMD width


&a[0] should be divisible by 64 (i.e. the array starts on a 64-byte boundary).
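One way to obtain such alignment is sketched below. It assumes a C11 toolchain that provides aligned_alloc; the helper names alloc_doubles_aligned64 and is_aligned64 are hypothetical and only serve the illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate n doubles starting on a 64-byte boundary.
 * aligned_alloc (C11) expects the size to be a multiple of the alignment,
 * so the byte count is rounded up. Returns NULL on failure; release the
 * memory with free(). */
double *alloc_doubles_aligned64(size_t n) {
  size_t bytes = n * sizeof(double);
  size_t rounded = (bytes + 63) / 64 * 64;
  if (rounded == 0)
    rounded = 64;
  return aligned_alloc(64, rounded);
}

/* Returns 1 if p sits on a 64-byte boundary, 0 otherwise. */
int is_aligned64(const void *p) {
  return ((uintptr_t)p % 64) == 0;
}
```

Calling alloc_doubles_aligned64(n) in place of malloc(n * sizeof(double)) is enough for the address check to hold; whether the compiler then exploits the alignment is a separate question.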

#### Details

#### Inner loop

```
for (i = 0; i < N; i++)
  for (j = 0; j < P; j++) // Vectorisation candidate
    ...
```

#### Not countable
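A minimal sketch of the distinction, with hypothetical function names. In the first loop the trip count is fixed at loop entry; in the second, the exit depends on data read inside the loop, which typically makes vectorisation much harder.

```c
#include <stddef.h>

/* Countable: the trip count n is known before the loop starts. */
double sum_all(const double *x, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += x[i];
  return sum;
}

/* Not countable: the loop stops at the first negative value, so the
 * number of iterations depends on the contents of x. */
double sum_until_negative(const double *x, size_t n) {
  double sum = 0.0;
  size_t i = 0;
  while (i < n && x[i] >= 0.0) {
    sum += x[i];
    i++;
  }
  return sum;
}
```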

#### **Conditionals**

```
for (i = 0; i < N; i++) // Vectorisable with masking
  if (data[i] > 0)
    sum += data[i];
```
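To illustrate what masking amounts to, here is a sketch that rewrites the conditional sum so that every iteration does the same work and elements failing the test contribute zero. The function names are hypothetical, and a compiler may well generate the masked form from the branchy version on its own.

```c
#include <stddef.h>

/* Branchy form, as in the loop above. */
double sum_positive_branch(const double *data, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    if (data[i] > 0)
      sum += data[i];
  return sum;
}

/* Branch-free form: each element is either kept or replaced by 0.0,
 * mimicking a per-lane SIMD mask. */
double sum_positive_select(const double *data, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++) {
    double keep = (data[i] > 0) ? data[i] : 0.0;
    sum += keep;
  }
  return sum;
}
```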

#### **Function calls**

```
for (i = 0; i < N; i++) // Not vectorisable
  sum += some_function(data[i]);
```

Can get this to vectorise if **some\_function** can be *inlined*.
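A sketch of a call the compiler can see through. The function square is a hypothetical stand-in for some_function: it is small, has no side effects, and is defined in the same translation unit as the loop, so inlining is possible.

```c
#include <stddef.h>

/* Hypothetical stand-in for some_function; visible to the compiler at
 * the call site, so it can be inlined into the loop body. */
static inline double square(double x) {
  return x * x;
}

double sum_of_squares(const double *data, size_t n) {
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += square(data[i]);
  return sum;
}
```

If the function lived in another source file, the compiler would usually need link-time optimisation to do the same.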

## SIMD units, a reminder

Scalar addition, 1 output element per instruction.

Register widths (number of double-precision operands):

- 1 operand (scalar)
- 2 operands (SSE)
- 4 operands (AVX)
- 8 operands (AVX512)

## Challenge

Best code requires SIMD loads, stores, and arithmetic.



AVX addition, 4 output elements per instruction.



## Data types in SIMD registers

Will focus on AVX (Advanced Vector eXtensions). 32-byte (256 bit) registers.



Machine learning is really excited about half precision because "look how many numbers you can handle".
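The number of values per register follows directly from the element size. A small sketch of the arithmetic, where avx_lanes is a hypothetical helper:

```c
#include <stddef.h>

enum { AVX_REGISTER_BYTES = 32 }; /* 256 bits */

/* Number of elements of the given size that fit in one AVX register. */
size_t avx_lanes(size_t element_bytes) {
  return AVX_REGISTER_BYTES / element_bytes;
}
```

With 8-byte doubles this gives 4 lanes, with 4-byte floats 8 lanes, and with 2-byte half-precision values 16 lanes.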

## What does the compiler do?

## Unrolling

```
// Before
for (i = 0; i < n; i++) // Vectorisable loop
  c[i] = a[i] + b[i];

// After: four independent double-precision additions per iteration,
// which map onto a single AVX add
for (i = 0; i < (n/4)*4; i += 4) {
  c[i] = a[i] + b[i];
  c[i+1] = a[i+1] + b[i+1];
  c[i+2] = a[i+2] + b[i+2];
  c[i+3] = a[i+3] + b[i+3];
}
// Remainder loop
for (i = (n/4)*4; i < n; i++)
  c[i] = a[i] + b[i];
```

Don't do this by hand, unless you really must.

## Roadblocks



## Pointer aliasing (C/C++ only)

```
void foo(double *a, double *b, double *c, int n) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}
```

#### This is allowed

```
void bar(double *data, int n) {
  double *a = data + 2;
  double *b = data + 1;
  double *c = data;
  foo(a, b, c, n - 3);
}
```
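With these arguments a[i] and b[i+1] are the same memory location (data[i+2]), so each iteration reads the value written by the previous one. The sketch below shows why that matters. foo is copied from above; foo_blocked4 is a hypothetical emulation of a 4-wide vectorised version that performs all four loads before any of the four stores.

```c
void foo(double *a, double *b, double *c, int n) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}

/* Hypothetical emulation of a 4-wide SIMD foo: compute four sums from
 * the old contents, then store all four. Only equivalent to foo when
 * a does not overlap b or c. */
void foo_blocked4(double *a, double *b, double *c, int n) {
  int i = 0;
  for (; i + 4 <= n; i += 4) {
    double t0 = b[i] + c[i];
    double t1 = b[i + 1] + c[i + 1];
    double t2 = b[i + 2] + c[i + 2];
    double t3 = b[i + 3] + c[i + 3];
    a[i] = t0;
    a[i + 1] = t1;
    a[i + 2] = t2;
    a[i + 3] = t3;
  }
  for (; i < n; i++) /* remainder loop */
    a[i] = b[i] + c[i];
}
```

Starting from data = {1, 1, 0, 0, 0, 0, 0} and calling each version with (data + 2, data + 1, data, 4), the scalar loop builds a running recurrence through the array while the blocked version reads stale zeros, so the two disagree from the second output element onwards.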



## Pointer aliasing solution

- A smart compiler will "multiversion" this code: check for aliasing at run time and dispatch to the appropriate version (expensive in hot loops).
- You can guarantee that pointers won't alias with the restrict keyword (C99 or later; not standard C++, but many C++ compilers provide \_\_restrict\_\_).
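A sketch of foo with the guarantee spelled out, in C99. The name foo_restrict and the added const qualifiers are illustrative choices, not part of the original code.

```c
/* restrict promises the compiler that, inside this function, the memory
 * written through a is not also accessed through b or c. Calling it with
 * overlapping arrays breaks the promise and is undefined behaviour. */
void foo_restrict(double *restrict a, const double *restrict b,
                  const double *restrict c, int n) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];
}
```

With this signature the compiler no longer needs a run-time overlap check before vectorising the loop.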

# Options for generating SIMD code

#### In decreasing order of preference

1. Compiler does it for you (but...)

2. You tell the compiler what to do (directives, i.e. #pragmas)

3. You choose a different (better) programming model, e.g. OpenCL, ispc

4. You use intrinsics (C/C++ only)
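For a flavour of the last option, here is a sketch of an array addition written with AVX intrinsics. The function name add_arrays is hypothetical. The intrinsic path is only compiled when the compiler defines \_\_AVX\_\_ (e.g. with -mavx or a suitable -march); otherwise the scalar loop handles everything.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* c[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(double *c, const double *a, const double *b, size_t n) {
  size_t i = 0;
#if defined(__AVX__)
  /* Four doubles per iteration: SIMD loads, one SIMD add, SIMD store.
   * The unaligned variants are used so no alignment is assumed. */
  for (; i + 4 <= n; i += 4) {
    __m256d va = _mm256_loadu_pd(&a[i]);
    __m256d vb = _mm256_loadu_pd(&b[i]);
    _mm256_storeu_pd(&c[i], _mm256_add_pd(va, vb));
  }
#endif
  /* Remainder loop (or the whole loop when AVX is unavailable). */
  for (; i < n; i++)
    c[i] = a[i] + b[i];
}
```

The code is tied to one instruction set and has to be rewritten for SSE or AVX512, which is one reason this option sits last in the list.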

## Compiler options: Intel

- At -O2 or above, the Intel compiler will start vectorising
- Specific vector instruction sets can be selected with -xFOO
  - 1. -xSSE2 Everything post 2000ish
  - 2. -xAVX SandyBridge/IvyBridge (quite old)
  - 3. -xCORE-AVX2 Haswell/Broadwell (somewhat old)
  - 4. -xCORE-AVX512 Skylake/Icelake (pretty recent)
  - Or use -xHost to target the CPU you are compiling on (cat /proc/cpuinfo shows what it supports)
- Also prefer to set -march=RELEVANTCPU, e.g. -march=broadwell (or -march=native)
- If you want AVX512 to really work, also need -qopt-zmm-usage=high
- Tell the compiler there are no pointer aliasing issues with -fno-alias. Don't lie here!
- See icc -help for more

## Compiler options: GCC

At -O2 GCC vectorises if you also say -ftree-vectorize (or just use -O3).

- Specific instruction sets with
  - -msse2
  - -mavx
  - -mavx2 + -mfma (to enable FMAs)
  - -mavx512f
  - See gcc --help=target for more
- Also prefer to set -march=RELEVANTCPU, e.g. -march=broadwell (or -march=native for the machine you are compiling on)
- Provide a preferred vector width (in bits) with -mprefer-vector-width={128,256,512}
- Tell the compiler there are no pointer aliasing issues with -fargument-noalias. Don't lie here!

# How to know what is happening

- Compilers can provide feedback on what they are doing
- Intel compiler is by far the best here
- Enable output with -qopt-report=n ( $n \in \{1, 2, 3, 4, 5\}$ ) with larger values producing more detail

#### Example

```
for (int i = 1; i < n; i++)
  a[i] = a[i] + a[i-1]; // a[i] depends on a[i-1]: loop-carried dependence
```

# What if things aren't vectorised

- Suppose you know that a particular loop is vectorisable, and should be, but the compiler just won't do it.
- ⇒ use OpenMP SIMD pragmas
  - Turn on with -qopenmp-simd (Intel); -fopenmp-simd (GCC)
  - Basic usage #pragma omp simd
  - Tells the compiler "it's fine, please vectorise" ⇒ don't lie!

#### Example

#### Prerequisites

- Countable loop
- Inner loop

```
for (j = 0; j < n; j++)
  #pragma omp simd
  for (i = 0; i < n; i++)
    b[j] += a[j*n + i];
```
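A self-contained variant of the row-sum loop is sketched below. It accumulates into a local scalar named in a reduction clause, which makes the intent explicit to the compiler. The function name row_sums is hypothetical, and without -fopenmp-simd or -qopenmp-simd the pragma is simply ignored.

```c
/* Add the sum of each row of the n x n row-major matrix a to b[j]. */
void row_sums(double *b, const double *a, int n) {
  for (int j = 0; j < n; j++) {
    double sum = 0.0;
    /* Inner, countable loop: ask for vectorisation and declare that
     * sum is a + reduction across SIMD lanes. */
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
      sum += a[j * n + i];
    b[j] += sum;
  }
}
```

Note that a vectorised reduction adds the elements in a different order from the scalar loop, so floating-point results can differ in the last bits.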

# Overriding the compiler cost model

- Compilers decide to do things based on some cost model
- ⇒ if the cost model is wrong, they may take "bad" decisions
  - For vectorisation purposes we usually need to say "please vectorise this loop" or "please unroll this loop".

```
// Vectorisation
#pragma omp simd

// Unrolling (Intel)
#pragma unroll
#pragma unroll(n)

// Unrolling (GCC)
#pragma GCC unroll n

// Unrolling (Clang)
#pragma clang loop unroll_count(n)
```

## Exercise

- What do you need to do to convince the Intel compiler to vectorise the GEMM micro kernel?

 $\Rightarrow$  Exercise 9.