# Streams and Roofs

In this week's assignment we are going to make some roofline diagrams for some $n$-body problems.

This week's assignment is meant to be run on a node with a Tesla P100 GPU.  Let's load in our class module:

In [1]:
module use $CSE6230_DIR/modulefiles

In [2]:
module load cse6230

|                                                                         |
|       A note about python/3.6:                                          |
|       PACE is lacking the staff to install all of the python 3          |
|       modules, but we do maintain an anaconda distribution for          |
|       both python 2 and python 3. As conda significantly reduces        |
|       the overhead with package management, we would much prefer        |
|       to maintain python 3 through anaconda.                            |
|                                                                         |
|       All pace installed modules are visible via the module avail       |
|       command.                                                          |
|                                                                         |


In [3]:
module list

Currently Loaded Modulefiles:
  1) curl/7.42.1
  2) git/2.13.4
  3) python/3.6
  4) /nv/coc-ice/tisaac3/opt/pace-ice/modulefiles/jupyter/1.0
  5) intel/16.0
  6) cuda/8.0.44
  7) cse6230/default


And verify that we're running where we expect to run:

In [4]:
nvidia-smi

Thu Sep  6 21:24:18 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.55                 Driver Version: 367.55                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P100-PCIE...  On   | 0000:81:00.0     Off |                    0 |
| N/A   24C    P0    25W / 250W |      0MiB / 16276MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|  No ru

Great!

Now, about the $n$-body simulations we're going to run: a classical $n$-body simulation has each body, or *particle*, interacting with each other, for $n(n+1)/2$ total interactions.  That hardly matches up to the streaming kernels we've been talking about!  So we're going to simplify a bit.

We are going to simulate $n$ infinitesimal particles circling around an infinitely massive sun at the origin.  In this system, the sun is unmoved, and the particles are not affected by each other.

We're going to normalize our coefficients and say that each particle is an ordinary differential equation with *six* components: three of position $X=(x, y, z)$ and three of velocity $U=(u, v, w)$.  The position, is changed by the velocity, of course, but the velocity changes under acceleration that depends on position:

$$\begin{aligned} \dot{X} &= V \\ \dot{V} &= - \frac{X}{|X|^3}.\end{aligned}$$

To discretize this differential equation, we are going to use a time stepping method called the Verlet leap-frog method, which is good for calculating long simulations of stable orbits.  Given a time step length `dt`, our pseudocode for one time step for one particle looks like the following:

1. `X += 0.5 * dt * V`
2. `R2 = X . X` (dot product)
3. `R = sqrt (R2)`
4. `IR3 = 1. / (R2 * R)`
5. `V -= X * dt * IR3`
6. `X += 0.5 * dt * V`

**Question 1.** Assuming `sqrt` and `div` count for one flop each, and assuming `x, y, z` and `u, v, w` are **double-precision** floating point
numbers, **estimate the arithmetic intensity of a *particle time step***.  You should ignore the time it takes to load `dt`.  Your answer should have units of flops / byte.  Give your answer in a new cell below this one, and show how you arrived at that number.

In [None]:
echo 0.302 flops/byte
#the total flops of a particle time step: 1+3+3 + 5 + 1 + 2 + 1+3+3 + 1+3+3 = 29 flops
#the amount of memory access: 3*8*4 = 96 bytes
# Estimated arithmetic intensity is 29 / 96 = 0.302


**Question 2.** Using the peak theoretical **double-precision** flop/s of this node (flop/s on the CPUs and GPU combined), calculated the same way as in the last assignment, and reported peak memory bandwidths from the manufacturers, **estimate the system balance of CPUs and the GPU of this node separately**.  Note that the bandwidth estimate from intel will be for one socket (4 cores) with attached memory, and our node has two such sockets.

In [None]:
#1 node 8 cores(2 sockets), 1 gpu  teslap100
#from last assignment's calculation method, the peak flop/s is as below
#CPU peak double-precision flops/s: 1.4 e+10 flop/s
#GPU peak double-precision flop/s: 4.3 e+12 flop/s
#from manufacturers reports, the bandwiths is as below:
#CPU Intel® Xeon® Processor E5-2623 v4 68.3GB/s 6.83 e+10 byte/s
#GPU Tesla P100 16GB 732GB/s 7.32 e+11 byte/s
#So we have the machine ballance:
#CPU machine ballance: 0.205
#GPU machine ballance: 5.874

Last week, we didn't take the peak flop/s values from the manufacturers at face value, and this week we are not going to take the beak Gbyte/s for granted either.  Last week we used a custom benchmark in our calculations; this week we will use an industry standard: the
[STREAM benchmark](https://www.cs.virginia.edu/stream/ref.html).

We can run the stream benchmark on the CPUs for this assignment with a makefile target:

In [13]:
make runstream STREAM_N=40000000 COPTFLAGS=-Ofast OMP_NUM_THREADS=8 OMP_PROC_BIND=spread

icc -g -Wall -fPIC -Ofast -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -o stream stream.c -DSTREAM_ARRAY_SIZE=40000000
OMP_NUM_THREADS=1  ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 40000000 (elements), Offset = 0 (elements)
Memory per array = 305.2 MiB (= 0.3 GiB).
Total memory required = 915.5 MiB (= 0.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the or

The `STREAM_N` argument will control the size of the stream arrays.

**Question 3:** Modify the invocation of `make runstreams` by modifying the values of
`STREAM_N`, `COPTFLAGS` (optimization flags), and/or `OMPENV` (the openMP environment variables) to get the largest streaming bandwidth from main memory that you can for this node.

- Follow the directions in the output of the file and make sure you are testing streaming bandwidth from memory and not from a higher level of cache.
- You should try to get close to the same bandwidth for all tests:

- There are two variables in the openMP environment you should care about, OMP_NUM_THREADS, which is self explanatory, and OMP_PROC_BIND is discussed [here](http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-affinity.html).  **You should try to use as few threads as possible** to achieve peak bandwidth.

**Question 4:** What does `OMP_PROC_BIND=close` mean, and why is it a bad choice, not just for this benchmark, but for any streaming kernel?

In [None]:
#it is a bad choice because when using OMP_PROC_BIND=close, the assignents 
#goes successively through available places, where they could ends up being on
#the same socket, thus reducing the bandwith.

**Question 5:** I've modified the benchmark, calling it `stream2.c`.  Here's the difference, it's one line of code:

In [8]:
diff stream.c stream2.c

267d266
< #pragma omp parallel for


: 1

Copy your options for `runstream` to `runstream2` below.  The reported results should be different: why?

In [None]:
#The result is different because in stream2.c, we didn't do the parallelization
#for the initialization of the arrays, and this causes that all memories will be
#associated with the master thread's socket. Which makes it significantly slower
#to run because subsequent acess by other sockets will access data from remote

In [14]:
make runstream2 STREAM_N=40000000 COPTFLAGS=-Ofast OMP_NUM_THREADS=8 OMP_PROC_BIND=spread

icc -g -Wall -fPIC -Ofast -I/usr/local/pacerepov1/cuda/8.0.44/include -qopenmp -o stream2 stream2.c -DSTREAM_ARRAY_SIZE=40000000
OMP_NUM_THREADS=1  ./stream2
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 40000000 (elements), Offset = 0 (elements)
Memory per array = 305.2 MiB (= 0.3 GiB).
Total memory required = 915.5 MiB (= 0.9 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the

**Question 6:** Now we're going to run stream benchmarks for the GPU.  As above, modify the array size until you believe you are testing streaming bandwidth from memory and not from cache.

In [None]:
make runstreamcu STREAM_N=40000000

**Question 7 (2 pts):** This is final time we're running a stream benchmark, I promise.  This benchmark is also for the GPU, but instead of the arrays originating in the GPUs memory, they start on the CPUs memory, and must be transfered to the GPU in back.  This mimics a common design pattern when people try to modify their code for GPUs: identify the bottleneck kernel, and try to "offload" it to the GPU, where it will have a higher throughput (once it get's there).  You don't have to modify this run, I just want you to see what bandwidths it reports:

In [9]:
make runstreamcu2 STREAM_N=1000000

nvcc -ccbin=icpc -lineinfo -Xcompiler '-fPIC' -O -o streamcu2 stream2.cu -DSTREAM_ARRAY_SIZE=1000000


./streamcu2
-------------------------------------------------------------
CSE6230 CUDA STREAM based on version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 1000000 (elements), Offset = 0 (elements)
Memory per array = 7.6 MiB (= 0.0 GiB).
Total memory required = 22.9 MiB (= 0.0 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
Ordinal of GPUs requested = 0
  Device name: Tesla P100-PCIE-16GB
  Memory Clock Rate (KHz): 715000
  Memory Bus Width (bits): 4096
  Peak Memory Bandwidth (GB/s): 732.160000

-------------------------------------------------------------
1.000000 2.000000 0.000000
-----------------------------------------

Now, with the three peak bandwidths that we have *computed* (not the reported values from question 2) -- CPU, GPU with arrays on the GPU, and GPU with arrays on the CPU -- and with the theoretical peak flop/s for the CPU and GPU, compute *effective system balances* and create a plot with rooflines for all three balances overlayed.

- The y axis should be absolute Gflop/s, not relative, so we can compare them, and should be labeled "Gflop/s"
- Label with roofline goes with which balance: "CPU", "GPU", "CPU->GPU->CPU"
- The x axies should be in units of "double precision flops / byte"

Save your plot as the jpg `threerooflines.jpg` so that it can embed in the cell below

In [None]:
#CPU Bandwidth: 65.632 GB/s Peak Performance: 140 Gflop/s
#GPU Bandwidth: 509.043 GB/s Peak Performance: 4300 Gflop/s
#CPU-GPU-CPU Bandwidth : 6.220 GB/s Peak Performance: 4320 Gflop/s

![Three rooflines](./threerooflines.jpg)

**Question 8 (2 pts):** Remember those particles all the way back in question 1?  Your arithmetic intensity estimate could be placed on the roofline plot for the CPUs, and you could make a judgement about whether the kernel is compute bound or memory bound.

Now let's put it to the test.  The `make runcloud` target simulates `NPOINT` particles orbiting the sun for `NT` time steps.  Because these particles are independent, you can optionally "chunk" multiple time steps for each particle independent of the other particles.  Doing this reduces the number of memory accesses per flop:  each particle stays in register for `NCHUNK` time steps.

Do your best to optimize the throughput of the simulation both in the limit of few particles and many time steps, and in the limit of many particles and few time steps.
Do that by modifying the commands below.

- Make the simulations each run about a second
- Do your best to optimize the compiler flags and the runtime (openMP) environment

Using the outputs of those runs, estimate the *effective* arithmetic intensity: take the peak flop/s of the CPU and divide by the throughput of particle time steps per second.  Give that effective arithmetic intensity below.  If that differs from the estimated arithmetic intensity from the first question, can you point to any assumptions that we made that are bad and could explain the difference?

In [17]:
make runcloud NPOINT=32 NT=10000000 NCHUNK=128 COPTFLAG=-O3 OMP_NUM_THREAD=8 OMP_PROC_BIND=spread


rm -f cloud cloud.o verlet.o
make verlet.o DEFINES="-DNT=1"
make[1]: Entering directory `/nv/coc-ice/yzhao492/cse6230/assignments/streams-and-roofs'
icc -std=c99 -g -Wall -fPIC -O -qopt-report=3 -I/usr/local/pacerepov1/cuda/8.0.44/include -DNT=1 -qopenmp -c -o verlet.o verlet.c
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
make[1]: Leaving directory `/nv/coc-ice/yzhao492/cse6230/assignments/streams-and-roofs'
make cloud
make[1]: Entering directory `/nv/coc-ice/yzhao492/cse6230/assignments/streams-and-roofs'
icc -std=c99 -g -Wall -fPIC -O -qopt-report=3 -I/usr/local/pacerepov1/cuda/8.0.44/include  -qopenmp -c -o cloud.o cloud.c
icc: remark #10397: optimization reports are generated in *.optrpt files in the output location
icpc -qopenmp -o cloud verlet.o cloud.o -Wl,-rpath,.
make[1]: Leaving directory `/nv/coc-ice/yzhao492/cse6230/assignments/streams-and-roofs'
OMP_NUM_THREADS=1  ./cloud 32 1000000 0.01 1
./cloud, NUM_POINTS=32, NUM_STEPS

In [None]:
make runcloud NPOINT=6400000 NT=100 NCHUNK=1 COPTFLAG=-O3 OMP_NUM_THREAD=8 OMP_PROC_BIND=spread


In [None]:
#few particles, many time steps: 3.22e+08 particle time steps per second
# many particles, few time steps: 7.25e+08 particle time steps per second
#effective arithmetic intensity:
# 6.56e+10 flops/s * 1s / (7.25e+8 * 96 bytes) = 0.94 flops/byte
#it differs from our calculation in the beggining, which is 0.302
#One explanation is that the assumption we made, that the sqrt and / all counts
# as 1 flop, is wrong, because, you know, sqrt takes about 20 flops and divide 
#takes about 15 flops, that will gives us around 60 flop per particle time steps,
#which makes the arithmetic intensity 0.67, which is closer to our observation.
#Another issue might be that the memory is occupied by some other programs, and 
#thus it is not fully engaged in our calculation, which also increases the effective
#arithmetic intensity.