# Processors Alone

![](images/cpu.jpg)

$$\Huge\color{blue}{\text{Registers}}$$

$$\Huge\color{red}{\text{Scheduler}}$$

$$\Huge\color{green}{\text{Functional Units}}$$

### When introducing OpenMP (which we will look at in more depth later in the class), it's typical to start with a simple example of how easy it makes it to parallelize your code:

```C
/* We added one pragma and it's parallel! */
#pragma omp parallel for
for (i = 0; i < N; i++) {
  A[i] = func (b[i], c[i]);
}
```

### In fact, parallelizing your code can be even simpler: you don't even have to change it.

You simply have to change from this:

In [5]:
cd $CSE6230_DIR/assignments/first-week-flops
cc -g -c -o fma_loop_host.o fma_loop_host.c -O0

To this:

In [17]:
cc -g -c -o fma_loop_host_opt.o fma_loop_host.c -O3

By asking the compiler to try its best to optimize my code, it is able to exploit parallelism within the CPU core, even for my serial program.

---

Questions we'd like to answer today:

- What kind of parallelism is available in a single core, and how much of it?
- How can I exploit it?
- Can all applications exploit it?
- How can I make the compiler do the work for me?
- How does the parallelism on a CPU core compare to the parallelism in a GPU?


Tools we will use today:

- Code compilers, like `cc` above, focusing on their optimization options.  `cc` is typically an alias for a major compiler (you can typically run `man cc` or `cc --help` to get big lists of optimization and other options):
  * GNU `gcc`
  * LLVM `clang`
  * NVIDIA `nvcc`
  * Vendor specific (like Intel `icc`)
- Code decompilers and diagnostics (to see what the heck compilers are doing)
- Hardware counters for things that happen in the processor

**Note:** For many, many applications, optimal performance can't be achieved by optimizing the processor's performance alone: we have to optimize its interactions with the memory system. That is why I'd like to finish the "Processors alone" module today.

Luckily, optimizing the processor in isolation is something that compilers are quite good at.

If you have to take away one key concept today, it is **Little's law**, which will often tell you how to structure your code to set the compiler up for success.

## Recall the end stage of compilation we discussed in the first lecture: machine code

When I compiled `fma_loop_host.o` above, the compiler created a file with those instructions.  The file is encoded in binary, so opening it in a text editor won't tell us much, but I can still find out what those machine code instructions are by decompiling (strictly speaking, disassembling) the binary into assembly language.  The utility that lets me do that is called `objdump`.

(We haven't talked about how CUDA code is different, but for now let's just mention that it comes with its own decompiler: `cuobjdump`)

Here is the entirety of `fma_loop_host.c` from assignment 1:

In [9]:
cat fma_loop_host.c | pygmentize

#include "fma_host.h"

/* fma_loop: Fused Multiply Add loop
 *           -     -        -
 *
 * a[:] = a[:] * b + c, T times
 *
 * Inputs:
 * N : the size of the array
 * T : the number of loops
 * b : the multiplier
 * c : the shift
 *
 * Input-Outputs:
 * a : the array
 */
void
fma_loop_host (int N, int T, float *a, float b, float c)
{
  for (int i = 0; i < N; i++) {
    for (int j = 0; j < T; j++) {
      a[i] = a[i] * b + c;
    }
  }
}


In [11]:
objdump -Sd fma_loop_host.o | pygmentize -l c-objdump

fma_loop_host.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <fma_loop_host>:
 * Input-Outputs:
 * a : the array
 */
void
fma_loop_host (int N, int T, float *a, float b, float c)
{
   0:	55                   	push   %rbp
   1:	48 89 e5             	mov    %rsp,%rbp
   4:	89 7d ec             	mov    %edi,-0x14(%rbp)
   7:	89 75 e8             	mov    %esi,-0x18(%rbp)
   a:	48 89 55 e0          	mov    %rdx,-0x20(%rbp)
   e:	f3 0f 11 45 

(Note that, as illegible as this is, it would be much worse if we didn't have source code interspersed with instructions.  You should always compile C code with `-g` and CUDA with `-G` for this reason and others.)

For our purposes, this assembly code has three types of instructions:

- Instructions that take **registers** as inputs (those things that are addressed like `%rax` and `%rbp`) and
  write their outputs over the locations of their inputs.  Examples are floating point operations like `mulss` (multiply two single precision numbers together), integer operations like `addl` (add two 32-bit integers), and logical operations like `cmp` (determine if one integer is less than another and write the output to a special register).

  * In hardware, registers are data locations in a register file: the storage closest to the execution units.
    Register space is quite dear, so to reflect that, most instruction sets have a limited number of registers (see,
    e.g., the Wikipedia page for the [AVX512](https://en.wikipedia.org/wiki/AVX-512#Extended_registers) instruction
    set).  When a thread has too many computations to keep track of, data that would otherwise be stored in a register is *spilled* to memory, which slows things down.  I mention all of this just to say that one thing compilers are trying to do is figure out how to squeeze your complex computations into the limited scratchpad space provided by the registers.
    
- Instructions that load and store data from memory like `mov`: we're not going to open that can of worms today.

- Branching instructions that control the flow of instructions like `jl` (jump to a given code location based on the outcome of a comparison)

Again, for our simple purposes today, a **thread** is:

- a stream of instructions, with
- a limited set of registers as a workspace for partial computations

### How a thread is executed

(This is a simplification of the [classic RISC pipeline](https://en.wikipedia.org/wiki/Classic_RISC_pipeline))

1. An instruction like `add    %rdx,%rax` is:

  1. *fetched* from the instruction queue, 
  2. *decoded* (its operation and register inputs / outputs are identified), 
  3. **executed** (the part we care about), and
  4. *written back* to registers
  
2. Move to the next instruction and repeat

### Pipelining

If a *cycle* is the smallest unit of time of a processor, and an instruction has multiple steps (each step takes a cycle), does that mean that an instruction takes multiple cycles?

Yes!  Let's say $k$ cycles.

Does that mean that a processor takes $kN$ cycles to complete $N$ instructions?

No! Instructions are **Pipelined:**

`(TODO: Whiteboard)`
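As a sketch of the counting argument, assume an idealized pipeline with $k$ one-cycle stages, one new instruction entering per cycle, and no stalls (an assumption: real pipelines have hazards). The first instruction completes after $k$ cycles, and each later instruction completes one cycle after its predecessor, so the total time $T(N)$ and the speedup over the unpipelined $kN$ are:

$$T(N) = k + (N - 1), \qquad \frac{kN}{T(N)} = \frac{kN}{k + N - 1} \to k \quad \text{as } N \to \infty$$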

The key thing to understand about pipelined operations:

**The results of an operation can't be inputs to another operation until they exit
the pipeline.**

Any cycle of a pipeline when there isn't a new input is a *bubble*.

The *efficiency* (work / cycle) of your pipelined algorithm is the *fraction of non-bubble cycles*.

**A fully efficient pipelined algorithm has at least $k$ concurrent independent operations at any point in time, where $k$ is the depth of the pipeline.**
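To make the bubble counting concrete, here is a minimal sketch of a toy model, not a description of any real processor. It assumes a pipeline of depth `k` that accepts at most one new operation per cycle, and a workload made of several dependency chains, where each operation needs the result of the previous operation in its own chain. The names `issue_cycles`, `efficiency`, and `MAX_CHAINS` are made up for this illustration.

```c
#include <assert.h>

enum { MAX_CHAINS = 64 };  /* arbitrary limit for this toy model */

/* Count the cycles a depth-k pipeline spends issuing `chains * ops_per_chain`
 * operations, at most one per cycle.  Within a chain, each operation needs the
 * result of the previous one, so it may only issue k cycles after it. */
long issue_cycles(int k, int chains, int ops_per_chain)
{
    assert(k >= 1 && chains >= 1 && chains <= MAX_CHAINS && ops_per_chain >= 0);

    long ready[MAX_CHAINS];  /* first cycle at which each chain may issue */
    int  left[MAX_CHAINS];   /* operations each chain still has to issue */
    for (int c = 0; c < chains; c++) {
        ready[c] = 0;
        left[c] = ops_per_chain;
    }

    long total = (long)chains * ops_per_chain, issued = 0, cycle = 0;
    while (issued < total) {
        for (int c = 0; c < chains; c++) {
            if (left[c] > 0 && ready[c] <= cycle) {
                ready[c] = cycle + k;  /* result exits the pipeline k cycles later */
                left[c]--;
                issued++;
                break;                 /* only one new input per cycle */
            }
        }
        cycle++;                       /* if nothing issued, this cycle was a bubble */
    }
    return cycle;
}

/* Efficiency as defined above: the fraction of non-bubble cycles. */
double efficiency(int k, int chains, int ops_per_chain)
{
    long cycles = issue_cycles(k, chains, ops_per_chain);
    return cycles > 0 ? (double)chains * ops_per_chain / (double)cycles : 0.0;
}
```

In this model, with `k = 4`, a single chain of 100 operations needs 397 issue cycles (efficiency about 0.25), while four independent chains of 100 operations each need 400 cycles for 400 operations (efficiency 1).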

(I think that the way that many diagrams show pipelined instructions (time axis horizontal, data axis vertical, instructions labeled) is not helpful, because the "pipe" in the pipeline is always moving, and because the diagram gets larger in both dimensions as time goes on.  I prefer to have *instructions* on the vertical axis and *data* labeled on the diagram, because that way the diagram only grows on one axis, and each column looks like a time slice of the pipeline)

`(TODO: Whiteboard)`

### Pipelines and branching

Even in our simple program, we saw that my nice clean breakdown of register-register instructions and memory instructions wasn't respected: some instructions like `mulss  -0x24(%rbp),%xmm0` combine a memory access with an arithmetic operation
(`-0x24(%rbp)` accesses the memory location whose address is stored in `%rbp`, offset by a certain amount).

More complex instructions require more decoding: the pipeline of operations before **execute** is quite long on a modern CPU.

`(TODO: Whiteboard)`

That is why branching instructions are hard to combine with pipelined execution.  We don't know which instruction
should go into the pipeline next: it depends on the output of a computation like a comparison.  The CPU could:

- Hold up everything to wait until it is known which branch to take (always stall the pipeline, bad)
- Try to *predict* which branch will be taken and keep feeding the pipeline with that branch (bad when there is a *misprediction*)

Branch prediction is a complicated, sophisticated thing on modern CPUs.  In your programming, you should assume the following:

- Computers are good at recognizing patterns: there is a branch in every iteration of a for loop, but if you keep looping, the predictor will eventually start predicting that "keep looping" is the branch to take, and the branch will be a negligible part of the execution time.  For loops with known bounds can also be **unrolled**, meaning copy-pasted the right number of times with no branching at all.

```C
for (int i = 0; i < N; i++) { /* if N = 10000000000, branch prediction will almost always be right */
    /* ... */
}
```

```C
for (int  i = 0; i < 8; i++) {
  /* If the bound is known at compile time, the loop can be unrolled with no branching */
  /* ... */
}
```

```C
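/* You can give the compiler hints about how you want to break up a loop into unrolled sections,
   reducing the number of branches.  The spelling is compiler specific: `#pragma unroll(8)` is
   accepted by icc and clang, while gcc uses `#pragma GCC unroll 8`. */
#pragma unroll(8)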
for (int i = 0; i < N; i++) {
    /* ... */
}
```

- If your branching has no patterns, then you should expect lots of branch misprediction: the instruction pipeline 
  has to be cleared out, leading to a stall in your code *proportional in length to the pipeline depth* (~10-29 cycles)
  
- Branch misprediction is the kind of hardware event that is counted by hardware performance counters, which a tool like `perf` can read
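As an illustration of a branch with no pattern, here is a minimal sketch with two ways of summing the entries of an array that are at least some threshold. The function names are hypothetical. If the data are in random order, the `if` in the first version is taken unpredictably; the second version uses the 0-or-1 result of the comparison arithmetically, so the source has no data-dependent jump. Note the hedge: an optimizing compiler may well turn the first version into branch-free code on its own, so check the disassembly before assuming there is a difference.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: whether the `if` is taken depends on the data, so on
 * patternless input the branch predictor has nothing to learn. */
long sum_above_branchy(const int *x, size_t n, int threshold)
{
    assert(x != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (x[i] >= threshold) {
            sum += x[i];
        }
    }
    return sum;
}

/* Branch-free version: the comparison evaluates to 0 or 1, which multiplies
 * the value, so every iteration does the same work regardless of the data. */
long sum_above_branchless(const int *x, size_t n, int threshold)
{
    assert(x != NULL || n == 0);
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (long)(x[i] >= threshold) * x[i];
    }
    return sum;
}
```

Both functions return the same value for any input; only the control flow differs. Timing them on sorted versus shuffled data, with branch-miss counts from a tool like `perf`, is one way to see the effect.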

`(TODO: perf demonstration)`

In [None]:
cd $CSE6230_DIR/assignments/first-week-flops
make run_fma_prof PERF="perf stat -v"

### Executing instructions

Like I said, the depth of the pipeline before and after execution really only affects us when there is branching.  Let's talk about *execute*:

- Different types of instructions are executed on different *functional units*:

  - *ALU*: arithmetic and logic unit
  - *FPU*: floating point unit
  - etc.
  
See, e.g., the [Kaby Lake](https://en.wikichip.org/wiki/intel/microarchitectures/kaby_lake) diagram from Wikichip that we saw in the first lecture.  This is what the cartoon at the top of the lecture is supposed to be a simplification of.


### Superscalarity

There are multiple functional units in a processor.  In the pipeline diagrams we've seen so far, there is only one `execute` instruction happening per cycle.  That would mean that only one functional unit is called on per cycle, leaving the others idle.  Is that a waste of resources?

It is: the diagrams are wrong! Modern CPUs are **superscalar:** there are multiple instruction pipelines that can run at once.

`(TODO: Whiteboard diagram)`

### How can we exploit superscalarity?

- Some combination of a smart **scheduler,** which is able to
  1. Look ahead several instructions,
  2. Identify *independent* operations, and
  3. Reorder for concurrent independence
  
- And a smart **compiler**, which
  1. Knows the functional units that are available
  2. Knows the amount of register space available and the superscalar factor, and
  3. Tries to reorder and change which registers are used to solve the
     optimal scheduling problem
     
In almost all cases, the compiler is better than you are at this: don't try to outthink it.

If you think the compiler is getting it wrong:

- Use the decompiler to see what it's doing
- Use *optimization reports* (like Intel `-qopt-report=5`) to ask the compiler to tell you what it's doing
 

### We can also exploit superscalarity with multiple threads

When one thread has a pipeline stall, another can issue instructions.

- If there is hardware support for multiple threads, that means they can both have their registers in the register file at the same time, and the scheduler can switch between them.  If there is OS support for multiple threads, that means the OS switches which threads have their registers in the processor at a given time.  We can talk more about this another day.


### Let's see how well the compiler optimizes a simple loop

- Pass `-xHost` for compiling to your current chip with `icc`

In [27]:
cd $CSE6230_DIR/assignments/first-week-flops
cat fma_loop_short.c | pygmentize
cc -g -c -o fma_loop_short.o fma_loop_short.c -O3 -march="broadwell"

#include "fma_host.h"

/* fma_loop: Fused Multiply Add loop
 *           -     -        -
 *
 * a[:] = a[:] * b + c, T times
 *
 * Inputs:
 * N : the size of the array
 * T : the number of loops
 * b : the multiplier
 * c : the shift
 *
 * Input-Outputs:
 * a : the array
 */
void
fma_loop_short (int N, int T, float *a, float b, float c)
{
  for (int i = 0; i < N; i++) {
    a[i] = a[i] * b + c;
  }
}


In [28]:
objdump -Sd fma_loop_short.o | pygmentize -l c-objdump

fma_loop_short.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <fma_loop_short>:
 * a : the array
 */
void
fma_loop_short (int N, int T, float *a, float b, float c)
{
  for (int i = 0; i < N; i++) {
   0:	85 ff                	test   %edi,%edi
   2:	0f 8e a7 02 00 00    	jle    2af <fma_loop_short+0x2af>
   8:	48 89 d0             	mov    %rdx,%rax
   b:	8d 77 ff             	lea    -0x1(%rdi),%esi
   e:	48 c1 e8 02          	shr

 106:	44 39 c1             	cmp    %r8d,%ecx
 109:	72 e5                	jb     f0 <fma_loop_short+0xf0>
 10b:	44 89 c9             	mov    %r9d,%ecx
 10e:	83 e1 f8             	and    $0xfffffff8,%ecx
 111:	42 8d 04 11          	lea    (%rcx,%r10,1),%eax
 115:	41 39 c9             	cmp    %ecx,%r9d
 118:	0f 84 b2 01 00 00    	je     2d0 <fma_loop_short+0x2d0>
 11e:	c5 f8 77             	vzeroupper 
 121:	48 63 c8             	movs

  for (int i = 0; i < N; i++) {
 1ef:	8d 48 07             	lea    0x7(%rax),%ecx
 1f2:	39 cf                	cmp    %ecx,%edi
 1f4:	0f 8e b5 00 00 00    	jle    2af <fma_loop_short+0x2af>
    a[i] = a[i] * b + c;
 1fa:	48 63 c9             	movslq %ecx,%rcx
 1fd:	c5 f8 28 d0          	vmovaps %xmm0,%xmm2
 201:	48 8d 0c 8a          	lea    (%rdx,%rcx,4),%rcx
 205:	c4 e2 71 99 11       	vfmadd132ss (%rcx),%xmm1,%xmm2
 20a

 2c0:	41 ba 01 00 00 00    	mov    $0x1,%r10d
 2c6:	e9 03 fe ff ff       	jmpq   ce <fma_loop_short+0xce>
 2cb:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
 2d0:	c5 f8 77             	vzeroupper 
 2d3:	c3                   	retq   
 2d4:	0f 1f 40 00          	nopl   0x0(%rax)
 2d8:	45 31 d2             	xor    %r10d,%r10d
 2db:	e9 ee fd ff ff       	jmpq   ce <fma_loop_short+0xce>
 2e0:	31 c0                	xor    %eax,%eax

The compiler actually compiled several versions of the loop, some optimized for different inputs.

There are instructions that we didn't see before, like `vfmadd132ps`.  Let's go to Intel's [intrinsics reference](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#!=undefined) to see what we can see.

What did we learn:

- There are **vectorized** instructions, where a single instruction operates on multiple data at one time (SIMD).
- **Fused multiply add** is an instruction that counts as two flops at once!  It is so fundamental to linear algebra that it deserves optimization.
- **Execution itself is pipelined**, with the pipeline depth depending on the instruction.
- Sometimes there are multiple functional units that can do the same instruction (2 FPUs on a modern Intel chip, for instance).
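To see why a vectorizing compiler emits more than one version of such a simple loop, here is a rough sketch in plain C of the shape of the result: a main body that handles a whole vector's worth of elements per trip, plus a scalar loop for the leftovers. This is an illustration, not the compiler's actual output, which also handles details such as alignment. The name `fma_loop_strip_mined` and the width `VEC_WIDTH = 8` are assumptions (8 single-precision floats fit in a 256-bit register).

```c
#include <assert.h>

enum { VEC_WIDTH = 8 };  /* assumption: 8 floats per 256-bit vector register */

void fma_loop_strip_mined(int N, float *a, float b, float c)
{
    assert(N >= 0);
    int n_main = N - (N % VEC_WIDTH);  /* largest multiple of VEC_WIDTH <= N */

    /* Main body: each trip of the outer loop stands in for one vector FMA
     * that updates VEC_WIDTH lanes at once. */
    for (int i = 0; i < n_main; i += VEC_WIDTH) {
        for (int lane = 0; lane < VEC_WIDTH; lane++) {
            a[i + lane] = a[i + lane] * b + c;
        }
    }

    /* Remainder: the leftover N % VEC_WIDTH elements, one scalar FMA each. */
    for (int i = n_main; i < N; i++) {
        a[i] = a[i] * b + c;
    }
}
```

The result is the same as the original one-line loop; only the iteration structure changes. In the listing above, the scalar `vfmadd132ss` and the `and $0xfffffff8` mask (which rounds a count down to a multiple of 8) are consistent with this kind of main-plus-remainder structure.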

### Putting it all together

Because pipelined instructions have to be independent, how many independent FMAs do we have to have to issue one per cycle to each FPU on a core, thus achieving peak flops/cycle?

An application of

## Little's Law

$$\Huge L = \lambda W$$

- $L$: The concurrency, number of concurrent, independent operations that will fill the pipeline
- $\lambda$: the "width" of the data that can be entered into the pipeline in a single cycle
- $W$: the depth of the pipeline

Multiply this by the number of cores in a node and the number of nodes in a machine to get
the CPU flops of that machine!
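As a worked example of the arithmetic, here is a minimal sketch. The function names are hypothetical, and the numbers used afterwards (2 FMA units, 8 single-precision lanes per vector, a 5-cycle FMA latency) are assumptions for illustration; look up the real values for your chip.

```c
#include <assert.h>

/* Little's law, L = lambda * W: the number of independent operations that
 * must be in flight to keep a pipeline full.
 *   issue_width    (lambda): operations that can enter per cycle
 *   latency_cycles (W):      depth of the pipeline in cycles */
int required_concurrency(int issue_width, int latency_cycles)
{
    assert(issue_width > 0 && latency_cycles > 0);
    return issue_width * latency_cycles;
}

/* Peak flops per cycle of one core when every FMA unit issues one vector
 * FMA per cycle: each lane does a multiply and an add, i.e. 2 flops. */
int peak_flops_per_cycle(int fma_units, int simd_lanes)
{
    assert(fma_units > 0 && simd_lanes > 0);
    return fma_units * simd_lanes * 2;
}
```

With those assumed numbers, `required_concurrency(2, 5)` is 10 independent vector FMAs in flight (80 independent single-precision values), and `peak_flops_per_cycle(2, 8)` is 32 flops per cycle per core. Multiplying the latter by the clock rate, the cores per node, and the number of nodes gives a peak flops/second figure for a machine.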

## Comparing CPU cores and GPU streaming multiprocessors (SMs)

`(TODO: inline Prof. Vuduc's slides)`

Some key takeaways:

- The CUDA programming model is Single Instruction Multiple Thread: each thread has its own registers, but a shared instruction stream.
- One instruction is executed on a **warp**, a group of 32 threads that mostly work in lock step
- Every instruction is vectorized, not just special instructions on the CPU.
- Mostly: any branch divergence between them is *serialized*, so in addition to misprediction, branching has another steep price on GPUs.
- Question: what are the depths of the pipelines on a Streaming Multiprocessor?

## Exploiting The Concurrency In Your Code

(TODO: inline Prof. Vuduc's slides)
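As one possible illustration (a sketch, not taken from the slides), consider the `fma_loop_host` loop from earlier: for a fixed `i`, the `T` updates of `a[i]` form a dependency chain, because each FMA needs the result of the previous one. Updating several array entries in an interleaved way gives the pipeline that many independent chains to work on. The name `fma_loop_interleaved` and the value `CHAINS = 8` are assumptions; a suitable number would come from Little's law for your core, and an optimizing compiler may already perform a similar transformation on its own.

```c
#include <assert.h>

enum { CHAINS = 8 };  /* hypothetical: number of independent chains in flight */

void fma_loop_interleaved(int N, int T, float *a, float b, float c)
{
    assert(N >= 0 && T >= 0);
    int i = 0;

    /* Blocks of CHAINS entries: within one step j, the CHAINS updates do not
     * depend on each other, so they can overlap in the pipeline. */
    for (; i + CHAINS <= N; i += CHAINS) {
        float r[CHAINS];  /* small enough that it can live in registers */
        for (int l = 0; l < CHAINS; l++) r[l] = a[i + l];
        for (int j = 0; j < T; j++) {
            for (int l = 0; l < CHAINS; l++) {
                r[l] = r[l] * b + c;
            }
        }
        for (int l = 0; l < CHAINS; l++) a[i + l] = r[l];
    }

    /* Remainder: fewer than CHAINS entries left, one chain at a time. */
    for (; i < N; i++) {
        float r = a[i];
        for (int j = 0; j < T; j++) r = r * b + c;
        a[i] = r;
    }
}
```

Each entry of `a` goes through the same sequence of `T` updates as in `fma_loop_host`; only the order in which independent entries are interleaved changes.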