# Cores & Memory: Streaming Kernels

## - Assignment 1 Miscellany
## - Review: Little's Law, CPUs vs. GPUs
## - Streaming Kernels
## - Arithmetic Intensity, Machine Balance, & the Roofline Model

## First-Week-Flops Miscellany

### My mistakes:

#### - It took a few tries to get the Jupyter notebook job working, but it works now

#### - `nvcc` syntax: use `-lineinfo` for symbols in binaries; `-G` turns off all optimizations :(

#### - I'll try to include `module load cse6230` in future assignments.  (If it doesn't work, check if modules are loaded first)



### Useful commands:

#### - `pbsnodes -a` / `qnodes -a` on the head node

#### - `nvidia-smi`: present on all nodes with GPUs.  Useful in and of itself, but also good for detecting when no GPUs are present: `which nvidia-smi || echo "No GPUs"`

### Questions / confusion people have had:

#### - "Intel says this is a 6-core / 12-hardware thread package; PACE says this is a 12 core node.  Is PACE wrong?"

Try this on a compute node when you have X forwarding:

In [None]:
module load hwloc
lstopo

You'll see that PACE has it right: all of the nodes we will be using have *two sockets per node*, so two 6-core packages make a 12-core node.

### There is now a grading script that you can try for yourself

`./grading-script.sh` will:

- split up your notebook into host and compute node components
- use your qsub expressions to run your compute node script on one of each type of node
- even though you tune for one type of node, it should run without crashing on any node

### For those seeking peak GPU performance, you can also control the "grid size" (number of thread blocks)

Add, e.g., `Gs=15` to the `run_fma_prof` and `run_fma_prof_opt` targets

This will let you experiment with ILP vs. TLP.
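As a rough illustration of what "ILP" means here, the sketch below runs `K` independent multiply-add chains inside one thread of execution. It is host-side C++ written for this note, not the assignment's kernel: the name `fma_chains` and its parameters are hypothetical. TLP, by contrast, would come from running more threads (for example, a larger `Gs`), each with fewer chains.

```cpp
// Hypothetical sketch of instruction-level parallelism (ILP):
// K independent multiply-add chains advance in lockstep within one thread.
// Chain k only ever reads its own previous value, so the hardware is free
// to overlap the K multiply-adds of a single step in the pipeline.
template <int K>
constexpr float fma_chains(float a, float b, float x0, int steps) {
    float x[K] = {};
    for (int k = 0; k < K; ++k) {
        x[k] = x0 + static_cast<float>(k);  // distinct starting value per chain
    }
    for (int s = 0; s < steps; ++s) {
        for (int k = 0; k < K; ++k) {
            x[k] = a * x[k] + b;  // depends only on chain k's previous value
        }
    }
    float sum = 0.0f;  // combine the chains so that none of them is dead code
    for (int k = 0; k < K; ++k) {
        sum += x[k];
    }
    return sum;
}
```

With `K = 1` every multiply-add must wait for the previous one to finish; with larger `K` there are `K` operations available to issue while earlier ones are still in flight.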

## Review: Little's Law

$\Huge L = \lambda W$

#### - $L$: amount of $x$ in a system.
#### - $\lambda$: arrival rate, $x$ / sec.
#### - $W$: time spent in the system (sec).

- Note that $L$ has dimensions of $x$, whatever it is we're trying to count.
- For pipelines, we often think of $\lambda$ as $x$ / cycle and $W$ as length of the pipeline in cycles.  Every unit but $x$ cancels out, so we can take whichever form is more convenient.
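As a quick check that the two forms agree, take a hypothetical pipeline (the numbers are made up for illustration) that accepts 2 $x$ per cycle, is 5 cycles deep, and is clocked at 2 GHz:

$$L = \lambda W = 2\ \frac{x}{\text{cycle}} \times 5\ \text{cycles} = 10\ x$$

$$L = \lambda W = 4 \times 10^{9}\ \frac{x}{\text{sec}} \times 2.5 \times 10^{-9}\ \text{sec} = 10\ x$$

Either way, 10 items must be in flight to keep the pipeline full.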

### Example from last time: how many independent FMAs are needed for peak flop/s?

#### - $W$: depth of the pipeline
#### - $\lambda$: arrival rate = # FPUs * # FMAs in a *vectorized* instruction

[[Intel's intrinsics guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/)]

![vfmadd132ps](./images/intel-intrinsics.png)

latency = $W$, CPI (cycles per instruction) = 1 / $\lambda_v$, where $\lambda_v$ = throughput for *vectorized* FMAs.  Multiply by vector width (see operation pseudocode) to get full $\lambda$.
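The arithmetic can be sketched in a few lines of C++. The function names are hypothetical, and the numbers in the usage note are illustrative assumptions only; read the actual latency and CPI for your architecture off the intrinsics guide.

```cpp
// Little's Law for an FMA pipeline: L = lambda * W.
// lambda_v = vector FMAs that can start per cycle (1 / CPI),
// W        = latency of one FMA in cycles.

// Independent *vector* FMA instructions needed in flight per core.
constexpr int independent_vector_fmas(int latency_cycles, int vector_fmas_per_cycle) {
    return latency_cycles * vector_fmas_per_cycle;
}

// Independent *scalar* FMAs needed in flight per core: multiply by the
// number of lanes in one vector instruction to get the full lambda.
constexpr int independent_scalar_fmas(int latency_cycles, int vector_fmas_per_cycle,
                                      int vector_width) {
    return independent_vector_fmas(latency_cycles, vector_fmas_per_cycle) * vector_width;
}
```

For example, assuming a latency of 5 cycles, a CPI of 0.5 (2 vector FMAs started per cycle), and 8 single-precision lanes per 256-bit register, `independent_vector_fmas(5, 2)` gives 10 instructions and `independent_scalar_fmas(5, 2, 8)` gives 80 scalar FMAs in flight.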

### Question from last time:

#### - What is the latency of FMA on the GPUs?  How could we estimate it?

### Quick Review: CPUs  (Hosts)

#### - One set of instructions per thread
#### - OS schedules threads (software multithreading), can *migrate* them between cores
#### - x86-64 (AVX2) instruction set has 16 256-bit vector registers per thread
#### - Parallelism via software multithreading, hardware multithreading, superscalar execution, vectorization (**SIMD**)


### Quick Review: GPUs (Devices)

(See the nice illustrations from Prof. Vuduc's [slides](http://vuduc.org/cse6230/slides/cse6230-fa14--05-cuda.pdf), starting on slide 27)

#### - A *compute kernel* is a task that the host assigns to the device in a kernel launch

```c++

// Launch configuration order is <<<grid size, block size>>>
solveForX<<<BlocksPerGrid,ThreadsPerBlock>>>(A,x,y);
```

- Proceeds asynchronously from the host until the host requires the results
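A common way to pick the number of blocks for a launch is to round the problem size up to a whole number of blocks. The helper below is a minimal host-side sketch; the name `blocks_per_grid` is hypothetical and not part of the CUDA API.

```cpp
// Number of thread blocks needed so that (blocks * threads_per_block) >= n.
// Integer ceiling division: the last block may be only partially used,
// so the kernel itself has to guard against indices >= n.
constexpr int blocks_per_grid(int n, int threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For instance, `blocks_per_grid(1024 * 15, 1024)` is 15, while `blocks_per_grid(1000, 256)` is 4, with the last block covering only 232 valid elements.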

#### - The task is broken down into a **grid** of *independent* thread blocks

- The host has no control over which thread blocks are assigned where and in what order

#### - Each thread block is assigned to a streaming multiprocessor (SM), where it stays

- An SM may be assigned multiple thread blocks

#### - The SM breaks down the thread blocks into **warps** (32 threads): the threads of a warp share one instruction stream, so all (non-divergent) instructions are vectorized
#### - The SM issues instructions from multiple warps per cycle (sometimes multiple instructions per warp per cycle) 
#### - When a warp is stalled, another is scheduled
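#### - Avoid: thread divergence (threads of one warp taking different branches, which the warp must then execute one after another)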
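To make the warp bookkeeping concrete, here is a small host-side sketch of how a block size maps onto warps. The helper names are hypothetical; the only fact taken from above is the warp size of 32.

```cpp
constexpr int kWarpSize = 32;  // threads per warp, as stated above

// Warps an SM needs to hold one thread block (rounded up).
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSize - 1) / kWarpSize;
}

// Lanes of the last warp that have no thread to run when the block size
// is not a multiple of the warp size.
constexpr int idle_lanes(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSize - threads_per_block;
}
```

A block of 1024 threads is exactly 32 warps with no idle lanes, whereas a block of 100 threads still occupies 4 warps and leaves 28 lanes of the last warp unused, which is one reason block sizes are usually chosen as multiples of 32.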


### Using `nvprof` to estimate the FMA latency

`nvprof` is a performance analysis tool for the GPU, like `perf` and `gprof` combined

It can be inserted before a program in the same way as `perf`:

In [None]:
make run_fma_prof Nh=0 Bs=1024 Gs=15 Nd=$((1024*15))