## Instruction-Level Parallelism and Vectorization

Instruction-Level Parallelism (ILP) refers to the parallel execution of a sequence of instructions in a program. The amount of parallelism is measured by the number of instructions completed per cycle. ILP is not concurrent execution in the multithreaded sense, because the program remains a single-threaded, serial program. Rather, it is a set of techniques that are used to ensure that a processor completes as many instructions as possible by executing them simultaneously. These techniques include:

* Instruction pipelining: instructions are completed in stages that can be overlapped when instructions are independent.

* Vector processing: a single instruction is executed in parallel on multiple adjacent data elements. It is related to superscalar processing, which is a more general term that includes the idea of using different hardware units at the same time.

* Out-of-order execution: instructions may be run in an order different from the order written in the program. This can be done statically at compile time or dynamically by the hardware.

* Speculative execution (branch prediction): Running a program past a control point. Most often this is done by predicting the outcome of an if-else branch and running the program past the expected outcome. 


You should be aware that all of these techniques exist and have a high-level understanding of them. As a parallel programmer, you will use a programmatic interface only for **vector processing**.

It is typical to assume that CPUs complete one instruction per cycle. This is *NOT TRUE*. The completion rate is a complex function of architecture and program.
This is measured by:
* Cycles per instruction (CPI): # of clock cycles / # of instructions
* Instructions per cycle (IPC): # of instructions / # of clock cycles

These are dynamic measures taken against a running program. It is more typical to use IPC in parallel computing, because we are expecting to complete more than one instruction per cycle.
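As a small worked example of the two ratios, the sketch below computes both from a pair of counters. The function names and the numbers are made up for illustration; on real hardware the cycle and instruction counts would come from performance counters.

```c
/* Cycles per instruction: lower is better. */
double cycles_per_instruction(double cycles, double instructions) {
    return cycles / instructions;
}

/* Instructions per cycle: higher is better; it is the reciprocal of CPI. */
double instructions_per_cycle(double cycles, double instructions) {
    return instructions / cycles;
}
```

For a hypothetical run that completes 2,000 instructions in 1,000 cycles, CPI is 0.5 and IPC is 2.0: the processor completed two instructions per cycle on average.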

If you would like to go down the well, you can look at the [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf) at the [Software optimization resources](https://www.agner.org/optimize/) page.
On many processors, the simplest instructions take about one cycle and complex instructions (such as division) take tens of cycles.

### Instruction Pipelining

Having a high-level knowledge of instruction pipelining is valuable in writing efficient code, be it parallel or not. The illustrated example of the [Wikipedia page](https://en.wikipedia.org/wiki/Instruction_pipelining) and the subsequent pipeline bubble example are sufficient. You should understand the following concepts:
* Instructions consist of multiple stages of execution
* Each stage can operate at the same time
* Independent instructions are issued and complete at a rate of one per clock cycle
* Data dependencies between instructions result in stalls/bubbles that prevent concurrent execution
* Waiting on instructions or data (often from memory) can prevent instructions from being issued

Managing the pipeline is the domain of the compiler writer or assembly-level programmer. However, an application programmer who is aware of pipelines can write programs that are easier for the compiler to process by minimizing data dependencies, avoiding unnecessary branches, or explicitly prefetching data.
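To make "minimizing data dependencies" concrete, here is a minimal sketch of one common trick: splitting a single running sum into several independent sums. The function names are hypothetical. An optimizing compiler may perform this transformation on its own for integer code, so treat it as an illustration of the idea, not as a guaranteed speedup.

```c
#include <stddef.h>

/* One accumulator: every add must wait for the result of the previous add. */
long sum_one_chain(const long *x, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Four accumulators: four independent dependency chains whose adds
   a pipelined processor is free to overlap. */
long sum_four_chains(const long *x, size_t n) {
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        s += x[i];
    }
    return s;
}
```

Both functions return the same value for integer data. For floating-point data the second version adds the numbers in a different order, which can change rounding, and that is one reason compilers are reluctant to do this automatically for floats.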

### Vector Processing

A _vector processor_ is a CPU that is designed to perform the same operation simultaneously on a one-dimensional array (vector) of data. The design space of vector processors is rich and varied. We will consider a limited subset of vector operations called _Single Instruction Stream, Multiple Data Stream_ (SIMD) vectors of fixed width. This is called Pure (fixed) SIMD or Packed SIMD. The Graphics Processing Units (GPUs) that drive modern AI hardware are another example of a vector processor, one that operates on very wide vectors.

Our examples will use the Intel intrinsic functions to program 128- and 256-bit vectors to demonstrate the speedup possible from parallel execution in the SIMD model.

#### What is a vector?

A vector is a packed array of data elements, each 8 to 64 bits wide, stored in a contiguous region of 128 to 512 bits. An operation on vectors applies a basic operation (add, multiply) to all elements of the vector simultaneously.

![Vector Operation](./images/vector_op.JPG "Vector Operation")

Compilers provide _intrinsic_ functions that allow one to call vector instructions using C-style functions. They are much more convenient than writing assembly code.
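As a first taste, the sketch below adds four 32-bit integers with a single 128-bit vector add (128 bits / 32 bits per element = 4 lanes). It assumes an x86 compiler that defines `__SSE2__` and ships `<emmintrin.h>`; the function name `add4_i32` is made up for illustration, and the `#else` branch is a plain-C fallback so the code still builds elsewhere.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#endif

/* Adds four 32-bit integers lane by lane: out[k] = a[k] + b[k]. */
void add4_i32(const int32_t *a, const int32_t *b, int32_t *out) {
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vsum = _mm_add_epi32(va, vb);              /* four adds at once */
    _mm_storeu_si128((__m128i *)out, vsum);            /* unaligned 128-bit store */
#else
    /* Fallback for targets without SSE2: same result, one lane at a time. */
    for (int k = 0; k < 4; k++) {
        out[k] = a[k] + b[k];
    }
#endif
}
```

Calling `add4_i32` on `{1, 2, 3, 4}` and `{10, 20, 30, 40}` writes `{11, 22, 33, 44}` to `out`.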

#### What about my compiler?

For the most part, compilers do a reasonable job of vectorizing code, particularly code in loops. However, the suitability of code for vectorization depends upon how it is written. If you write code that conducts the same operation on a sequential array of data, then the compiler will usually vectorize the code. If you write semantically equivalent code that does not access data sequentially, then the compiler will often be unable to vectorize it.
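The pair of functions below sketches this contrast. Both names are hypothetical. When `idx[i] == i` the two loops compute the same result, but the second hides the access pattern behind an index array. Whether a particular compiler vectorizes either loop depends on the compiler, its version, and the flags, so check the generated assembly before drawing conclusions.

```c
#include <stddef.h>

/* Contiguous access: elements are read and written in order,
   a pattern that compilers can usually vectorize. */
void scale_contiguous(float *x, size_t n, float a) {
    for (size_t i = 0; i < n; i++) {
        x[i] *= a;
    }
}

/* Indirect access: the element touched is x[idx[i]]. Even when idx[i] == i,
   the compiler cannot know that, and repeated indices would make the
   iterations depend on each other. */
void scale_indirect(float *x, const size_t *idx, size_t n, float a) {
    for (size_t i = 0; i < n; i++) {
        x[idx[i]] *= a;
    }
}
```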

__Conclusion__: You can rely on your compiler to automatically vectorize your code when you write it in a way that it can be vectorized. To do so, you must understand vectorization.

There are cases in which the compiler won't do a good job and you will need to vectorize your code by hand. To determine this, you will need to inspect the compiler-generated code. Again, you need to understand vectorization.

### Interpreting Code with Compiler Explorer

Given an array of data, we want to count how many ints are equal to the target.

Here is some simple code to count.
```c
long count_ints(int* data, long n, int target) {
    long count = 0;
    for (long i = 0; i < n; i++) {
        if (data[i] == target) {
            count+=1;
        }
    }
    return count;
}
```

To help us look at code and how it compiles, we are going to use [Compiler Explorer](https://godbolt.org/). This website allows you to, among other things, compile code online and view the assembly. The specific problem we are looking at can be seen at https://godbolt.org/z/sdn6nx9Ph. Compiling with `-O3 -mno-sse -fno-unroll-loops` produces the following:

```
count_ints(int*, long, int):
        test    rsi, rsi
        jle     .LBB0_1
        xor     ecx, ecx
        xor     eax, eax
.LBB0_4:
        xor     r8d, r8d
        cmp     dword ptr [rdi + 4*rcx], edx
        sete    r8b
        add     rax, r8
        inc     rcx
        cmp     rsi, rcx
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret
```

If we ask ChatGPT to help us read the core loop, we get the following annotations:

```
.LBB0_4:
    xor     r8d, r8d                       ; Clear R8D (temporary storage for match flag)
    cmp     dword ptr [rdi + 4*rcx], edx   ; Compare array[i] with target
    sete    r8b                            ; Set R8B to 1 if equal, else 0
    add     rax, r8                        ; Add R8 to RAX (increment count if match)
    inc     rcx                            ; Increment loop counter
    cmp     rsi, rcx                       ; Compare loop counter with length
    jne     .LBB0_4                        ; If not equal, continue loop
    ret                                    ; Return the count
```
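Notice that the loop body has no conditional jump for the `if`: the `sete`/`add` pair adds the 0-or-1 result of the comparison to the count. The same idea can be written directly in C. This is a minimal sketch, and `count_ints_branchless` is a name made up for illustration:

```c
/* Counts matches without an if statement: in C, (a == b) evaluates
   to the int 1 when true and 0 when false. */
long count_ints_branchless(const int *data, long n, int target) {
    long count = 0;
    for (long i = 0; i < n; i++) {
        count += (data[i] == target);
    }
    return count;
}
```

At `-O3` the compiler found this form on its own, so the rewrite is not needed here. It is still a useful pattern to recognize when reading assembly.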

Compiling the same code with `-O3 -mavx2 -fno-unroll-loops` uses the AVX2 extensions to the Intel architecture. The core loop (after some complex setup) compares four 32-bit integers at a time in a 128-bit vector and accumulates the counts in a 256-bit vector:


```
.LBB0_6:
        vpcmpeqd   xmm3, xmm1, xmmword ptr [rdi + 4*rax]  ; Compare four integers with 'target'
        vpmovzxdq  ymm3, xmm3                             ; Zero-extend comparison results
        vpand   ymm3, ymm3, ymm2                          ; Mask comparison results with '1'
        vpaddq  ymm0, ymm0, ymm3                          ; Accumulate counts
        add     rax, 4                                    ; Move to the next block of four
        cmp     rcx, rax                                  ; Check if all blocks are processed
        jne     .LBB0_6                                   ; If not, continue the loop

```

The main instruction here is `vpcmpeqd`, which has the following pseudocode:
```
FOR j := 0 to 3
	i := j*32
	dst[i+31:i] := ( a[i+31:i] == b[i+31:i] ) ? 0xFFFFFFFF : 0
ENDFOR
```
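If the bit-slice notation is hard to read, the following scalar C model may help. It is only a sketch: the function names are hypothetical, the register names in the comments are a reading of the listing above, and the real hardware performs the four lane operations simultaneously instead of in a loop.

```c
#include <stdint.h>

/* Scalar model of the pseudocode: each 32-bit lane becomes all ones
   on a match and zero otherwise. */
void cmpeq_epi32_model(const int32_t a[4], const int32_t b[4], uint32_t dst[4]) {
    for (int j = 0; j < 4; j++) {
        dst[j] = (a[j] == b[j]) ? 0xFFFFFFFFu : 0u;
    }
}

/* Scalar model of the vector loop: four per-lane counters that are
   summed once the loop is done. */
long count_ints_lane_model(const int32_t *data, long n, int32_t target) {
    const int32_t targets[4] = {target, target, target, target};
    uint64_t lane_count[4] = {0, 0, 0, 0};  /* plays the role of ymm0 */
    long i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t mask[4];
        cmpeq_epi32_model(data + i, targets, mask);  /* vpcmpeqd */
        for (int j = 0; j < 4; j++) {
            lane_count[j] += mask[j] & 1u;           /* vpmovzxdq, vpand, vpaddq */
        }
    }
    long total = (long)(lane_count[0] + lane_count[1] + lane_count[2] + lane_count[3]);
    /* Elements left over when n is not a multiple of 4. */
    for (; i < n; i++) {
        total += (data[i] == target);
    }
    return total;
}
```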

### Intrinsics

If you want to program vectors manually and you don't want to write assembly code, you can use function-call wrappers around vector instructions. These are known as [intrinsics](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html), and each one typically compiles down to a single assembly instruction.

For example, to call [vpcmpeqd](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm256_cmpeq_epi32&ig_expand=878,879) and compare 8 32-bit integers at once, we can use

`__m256i _mm256_cmpeq_epi32 (__m256i a, __m256i b)`

We are going to compare 8 elements at a time. This means our main loop will increment by 8 each time. It also means we will have to perform work at the end to deal with any extra iterations. The code must also initialize a vector that holds the target in every lane.

```c

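#include <immintrin.h>  // AVX2 intrinsics; build with AVX2 enabled (e.g. -mavx2)

long count_ints(int* data, long n, int target) {
  long count = 0;
  long clean_end = (n / 8) * 8;  // largest multiple of 8 that is <= n
  // Broadcast the target into all 8 lanes once, outside the loop.
  const __m256i target_vec = _mm256_set1_epi32(target);
  long i = 0;
  for (; i < clean_end; i += 8) {
    // Compare 8 ints at once; a matching lane becomes 32 one-bits.
    __m256i cmp_vec = _mm256_cmpeq_epi32(
        _mm256_loadu_si256((const __m256i *)(data + i)), target_vec);
    // movemask extracts one bit per byte, so each matching int sets 4 bits.
    int movemask = _mm256_movemask_epi8(cmp_vec);
    count += _mm_popcnt_u32((unsigned int)movemask);
  }
  // Each match was counted 4 times (once per byte), so divide once here
  // instead of inside the loop.
  count /= 4;
  // Scalar loop for the remaining n % 8 elements.
  for (; i < n; i++) {
    if (data[i] == target) {
      count += 1;
    }
  }
  return count;
}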

```

I don't expect you to fully understand this example. I do want you to know that intrinsics exist.
This code and the assembly can be seen at [https://godbolt.org/z/zdaac8fsT](https://godbolt.org/z/zdaac8fsT).

The downside of intrinsics is that they are compiled for a particular architecture. In practice, code that relies on intrinsics will have a base (non-optimized) implementation available when the hardware does not have the optimized instruction.
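One simple way to provide such a base implementation is to select the code path at compile time. The sketch below is an illustration under assumptions, not the approach of any particular library: `count_ints_portable` is a hypothetical name, it uses 128-bit SSE2 intrinsics where the example above uses AVX2, and it assumes the compiler defines `__SSE2__` when those intrinsics are available and that `int` is 32 bits wide.

```c
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

long count_ints_portable(const int *data, long n, int target) {
    long count = 0;
    long i = 0;
#if defined(__SSE2__)
    /* Vector path: four 32-bit lanes per 128-bit vector. */
    const __m128i vtarget = _mm_set1_epi32(target);
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i eq = _mm_cmpeq_epi32(v, vtarget);          /* all ones per matching lane */
        int mask = _mm_movemask_ps(_mm_castsi128_ps(eq));  /* one bit per lane */
        count += (mask & 1) + ((mask >> 1) & 1) + ((mask >> 2) & 1) + ((mask >> 3) & 1);
    }
#endif
    /* Base path: the whole array without SSE2, or the leftover tail with it. */
    for (; i < n; i++) {
        count += (data[i] == target);
    }
    return count;
}
```

Because the choice is made when the program is compiled, one binary contains only one of the two paths. Selecting a path while the program runs, based on what the CPU reports, is a common alternative that this sketch does not show.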

Compilers are more portable and should generate vectorized code when it's possible. This is what happened in the matrix multiplication example. 

### Takeaways

* Vector instructions are critical to realize high IPC.
* Your compiler can help you:
    * but you should check its work
* You can use intrinsics if your compiler is not accomplishing your goals.