# <img src='./images/logo.svg' width=90 style="vertical-align:middle" /> SHAREing: High-level performance assessment notebook

This is a template notebook for performing a high-level performance assessment, designed by the SHAREing consortium. We see this notebook as a working document, where performance analysts can enter measured data for a code and use the markdown cells to record notes on their assessment.

## Software details (**in progress!**)

We recommend that the analyst record the following key details (the entries below are currently dummy values):
* Program name - **Tester**
* Parallel model (e.g., OpenMP, MPI, etc.) - **OpenMP**
* Compiler (including optimisation flags) - **gcc** with `-Ofast`
* Libraries/dependencies - `hdf5`
* Details on data input - using the `test_short.dat` configuration

## Report

In [None]:
from topics.core_table import core_perf
from topics.intra_node import intra_node_perf
from topics.summary_radar import summary

### Core
For a high-level core analysis of a serial code we need just four measurements:

1. Target peak FLOPS (MFLOP/s)
2. Measured average application FLOPS (MFLOP/s)
3. Target memory bandwidth (MByte/s)
4. Measured average memory bandwidth (MByte/s)

**FLOPS** \
We measure these with LIKWID. For FLOPS we use
```bash
likwid-bench -t peakflops -W S0:16kB:1
likwid-perfctr -f -C 0 -g FLOPS_DP ./my_exe
```
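The first command provides the target peak and the second measures the application itself. The 16 kB input data for the `peakflops` microbenchmark is chosen to be half the size of the L1 cache, so adjust it to match your hardware.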

**Memory Bandwidth** \
For memory bandwidth we use
```bash
likwid-bench -t triad -W S0:8GB:1
likwid-perfctr -f -C 0 -g MEM ./my_exe
```
Now input your measured values in the cell below:

In [None]:
# Target peak per core (MFLOP/s)
peak_perf_single_core = 7255.60
# Measured average application performance (MFLOP/s)
measured_average_perf = 241.1913

# Target bandwidth per core (MB/s)
target_bw_per_core = 19506.62
# Measured average bandwidth requirements (MB/s)
measured_avg_bw_requirements = 18017.7017

We read these values into our `core_perf` class

In [None]:
core_perf_statistics = core_perf(peak_perf_single_core,
                                 measured_average_perf,
                                 target_bw_per_core,
                                 measured_avg_bw_requirements)

Generate the **core performance** table below

In [None]:
core_perf_statistics.core_perf_table()
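
As a quick sanity check on the table, the measured values can be expressed as a percentage of their targets by hand. The sketch below is a minimal illustration using the dummy values from the input cell. The helper `fraction_of_target` is a hypothetical name, and whether `core_perf` reports exactly these ratios is an assumption.

```python
def fraction_of_target(measured, target):
    """Return measured/target as a percentage; both values must share a unit."""
    if target <= 0:
        raise ValueError("target must be positive")
    return 100.0 * measured / target


# Dummy values copied from the input cell
flops_pct = fraction_of_target(241.1913, 7255.60)    # MFLOP/s
bw_pct = fraction_of_target(18017.7017, 19506.62)    # MB/s

print(f"FLOPS: {flops_pct:.1f}% of target peak")
print(f"Bandwidth: {bw_pct:.1f}% of target bandwidth")
```

With the dummy data this gives roughly 3.3% of peak FLOPS and 92.4% of target bandwidth, a pattern that would suggest a memory-bound code.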

### Intra-node

To quantify intra-node performance at a high level, we rely simply on runtimes from a strong scaling analysis.

**Serial run** \
If possible, we first measure the runtime of a serial application built without any parallel libraries, e.g., compiled without the `-fopenmp` flag. This can seem redundant, but comparing it against a single-core run with the parallel library enabled shows the overhead of the parallel library.

If the parallel library cannot be switched off simply, then we suggest just setting the serial runtime equal to a single-core (with parallel library enabled) runtime.
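
The comparison described above can be reduced to a single number. This is a minimal sketch, assuming the overhead is defined as the relative increase of the single-core runtime over the serial runtime. The function name and the runtimes are hypothetical and are not part of the notebook's classes.

```python
def parallel_library_overhead(serial_time, single_core_time):
    """Relative overhead of the parallel library on one core, as a fraction."""
    if serial_time <= 0:
        raise ValueError("serial_time must be positive")
    return single_core_time / serial_time - 1.0


# Hypothetical runtimes in seconds, for illustration only
overhead = parallel_library_overhead(serial_time=160.0, single_core_time=168.0)
print(f"Parallel library overhead: {overhead:.1%}")
```

If the fallback is used and both runtimes are equal, the overhead is zero by construction.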

**Strong scaling** \
We now perform a strong scaling analysis by keeping the problem size fixed while increasing the core count up to the maximum for your hardware. For an OpenMP code the number of threads can be set simply with the `OMP_NUM_THREADS` environment variable. With this method alone, however, thread affinity can be an issue: without pinning, performance can vary from run to run relative to a thread-pinned run.

Thread pinning can be easily achieved by setting the `OMP_PROC_BIND` environment variable to `close`. Here, however, we again make use of LIKWID:
```bash
likwid-pin -c N:0-3 ./my_exe
```
This command can be nested in a for loop over increasing core counts to perform a strong scaling analysis efficiently.
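
One way to write that loop from Python is sketched below. It assumes LIKWID is installed and that the executable is `./my_exe`. The helper names `pin_command`, `timed_run` and `strong_scaling` are hypothetical, and setting `OMP_NUM_THREADS` explicitly is a precaution rather than a LIKWID requirement. The timer is imported as `perf_counter` so that it does not clash with the `time` list used elsewhere in this notebook.

```python
import os
import subprocess
from time import perf_counter


def pin_command(n_cores, exe="./my_exe"):
    """Build a likwid-pin command pinned to the first n_cores cores of the node."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    cores = "N:0" if n_cores == 1 else f"N:0-{n_cores - 1}"
    return ["likwid-pin", "-c", cores, exe]


def timed_run(command, env=None):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = perf_counter()
    subprocess.run(command, check=True, env=env)
    return perf_counter() - start


def strong_scaling(core_counts, exe="./my_exe"):
    """Return one wall-clock runtime per core count, in the same order."""
    runtimes = []
    for n in core_counts:
        # Match the thread count to the number of pinned cores
        env = {**os.environ, "OMP_NUM_THREADS": str(n)}
        runtimes.append(timed_run(pin_command(n, exe), env=env))
    return runtimes


# Example (requires LIKWID and the executable to be present):
# runtimes = strong_scaling([1, 2, 3, 4, 6, 8, 12, 16])
```

The measured time includes the start-up of `likwid-pin` itself, so for very short runs an application-internal timer may be more reliable.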

**Input data** \
In the cell below we ask for:
1. The serial runtime
2. A list of the core counts used for the strong scaling
3. A list of the corresponding runtime for each core count

The core counts and runtimes are currently set up as lists of dummy data. For larger strong scaling analyses, it can be quicker to save the data to a `*.csv` file and read it into lists rather than entering values by hand.
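
A minimal sketch of the CSV route is shown here, assuming a file with a header row and two hypothetical columns named `cores` and `runtime_s`. An in-memory string stands in for the file so the example is self-contained.

```python
import csv
import io


def read_scaling_csv(handle):
    """Read core counts and runtimes from a CSV with `cores` and `runtime_s` columns."""
    cores, runtimes = [], []
    for row in csv.DictReader(handle):
        cores.append(int(row["cores"]))
        runtimes.append(float(row["runtime_s"]))
    return cores, runtimes


# In-memory stand-in for a file; with a real file use
# `with open("scaling.csv", newline="") as handle:` instead.
example = io.StringIO("cores,runtime_s\n1,162.22\n2,55.61\n4,35.35\n")
cores_from_csv, runtimes_from_csv = read_scaling_csv(example)
print(cores_from_csv, runtimes_from_csv)
```

The two returned lists can then be assigned to `number_of_cores` and `time` in place of the hand-typed values.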

In [None]:
# Enter serial performance time (s)
serial_time = 162.22

# Enter number of cores in each trial
number_of_cores = [1, 2, 3, 4, 6, 8, 12, 16]

# Enter time for each number of cores (s)
time = [162.22,  # 1 core
        55.61,   # 2 cores
        42.85,   # 3 cores
        35.35,   # 4 cores
        30.82,   # 6 cores
        24.45,   # 8 cores
        21.93,   # 12 cores
        19.72    # 16 cores
       ]

We read these values into our `intra_node_perf` class

In [None]:
intra_node_statistics = intra_node_perf(serial_time, number_of_cores, time)

Generate the intra-node **parallel efficiency** plot below. The amber and red vertical lines indicate the core counts at which the parallel efficiency drops below 80% and 60%, respectively.

In [None]:
intra_node_statistics.plot_efficiency_graph()
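
For reference, the efficiencies behind such a plot can be reproduced by hand. This is a sketch assuming the conventional strong-scaling definition, serial runtime divided by (core count × parallel runtime). The `intra_node_perf` class may define or interpolate the thresholds differently, and both function names here are hypothetical.

```python
def parallel_efficiency(serial_time, core_counts, runtimes):
    """Return T_serial / (p * T_p) for each core count p."""
    if len(core_counts) != len(runtimes):
        raise ValueError("core_counts and runtimes must have the same length")
    return [serial_time / (p * t) for p, t in zip(core_counts, runtimes)]


def first_count_below(core_counts, efficiencies, threshold):
    """Return the first core count whose efficiency is below threshold, else None."""
    for p, e in zip(core_counts, efficiencies):
        if e < threshold:
            return p
    return None


# Dummy data copied from the input cell
cores = [1, 2, 3, 4, 6, 8, 12, 16]
runtimes = [162.22, 55.61, 42.85, 35.35, 30.82, 24.45, 21.93, 19.72]
efficiencies = parallel_efficiency(162.22, cores, runtimes)

print("Below 80% at:", first_count_below(cores, efficiencies, 0.80))
print("Below 60% at:", first_count_below(cores, efficiencies, 0.60))
```

With the dummy data the efficiency exceeds 100% at 2 to 4 cores, which is a sign that the placeholder runtimes are not realistic measurements.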

Generate the intra-node **runtimes** plot below

In [None]:
intra_node_statistics.plot_time_graph()

Generate the **intra-node** performance table below

In [None]:
intra_node_statistics.intra_perf_table()

### Summary diagram (radar plot) - **In Progress!**

In [None]:
summary_statistics = summary(core_perf_statistics,intra_node_statistics)
summary_statistics.draw_radar()

This project has received funding through the UKRI Digital Research Infrastructure Programme under grant UKRI1801 (SHAREing).

<img src='./images/ukri.png' width=200 style="vertical-align:middle" /> 