# Performance
|Author| Stanley A. Baronett|
|--|-------------------------------|
|Created | 7/8/2021|
|Updated | 7/9/2021|

## Setup
- The following runs used a 256$^2$ grid with 32$^2$ meshblocks (64 meshblocks in total) and `np4`.
- All `<output#>` blocks were removed from `athinput.si`.
- Athena++ configuration based on [Recommended Compiler Options (HECC KB)](https://www.nas.nasa.gov/hecc/support/kb/recommended-compiler-options_99.html):
```bash
./configure.py --prob=streaming_instability -p --eos=isothermal --nghost=3 -hdf5 -h5double -mpi --cxx=icpc --mpiccmd="icpc -lmpi -lmpi++" --cflag="-axCORE-AVX512,CORE-AVX2 -xAVX"
```
With the latest Intel compiler, these compiler options (`--cflag`) produce a single executable that runs on any of the processor types listed below, with the suitable optimized code path selected at runtime.
- Requested all cores on the minimum number of nodes needed for 64 processes (e.g., 80 cores across 2 Aitken Cascade Lake nodes).
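
The process and node counts above follow from the grid decomposition. A minimal sketch of that arithmetic, assuming one MPI process per meshblock; the function and variable names are illustrative and not part of Athena++:

```python
import math

def layout(grid=256, block=32, cores_per_node=40, ndim=2):
    """Return (meshblocks, nodes, cores requested) for a square grid."""
    meshblocks = (grid // block) ** ndim          # (256 / 32)^2 = 64 here
    nodes = math.ceil(meshblocks / cores_per_node)  # whole nodes only
    return meshblocks, nodes, nodes * cores_per_node

# Cores per node as listed in the results table of this document.
for name, cores in [("Cascade Lake", 40), ("Broadwell", 28), ("Sandy Bridge", 16)]:
    print(name, layout(cores_per_node=cores))
```

For Cascade Lake this gives 64 meshblocks on 2 nodes with 80 cores requested, matching the example in the last bullet.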

## Configuration Details

|                                    | Sandy Bridge        | Ivy Bridge             | Haswell                | Broadwell              | Skylake                | Cascade Lake           |
|------------------------------------|---------------------|------------------------|------------------------|------------------------|------------------------|------------------------|
| Processor                          | 8-core Xeon E5-2670 | 10-core Xeon E5-2680v2 | 12-core Xeon E5-2680v3 | 14-core Xeon E5-2680v4 | 20-core Xeon Gold 6148 | 20-core Xeon Gold 6248 |
| Newest Instruction Set             | AVX                 | AVX                    | AVX2                   | AVX2                   | AVX-512                | AVX-512                |
| Base CPU-Clock                     | 2.6 GHz             | 2.8 GHz                | 2.5 GHz                | 2.4 GHz                | 2.4 GHz                | 2.5 GHz                |
| Max. Double Prec. Flops/Cycle/Core | 8                   | 8                      | 16                     | 16                     | 32                     | 32                     |
| Memory Bandwidth (read/write)      | 51.2 GB/sec         | 59.7 GB/sec            | 68 GB/sec              | 76.8 GB/sec            | 128 GB/sec             | 141 GB/sec             |
| Intersocket Interconnect           | 32 GB/sec           | 32 GB/sec              | 38.4 GB/sec            | 38.4 GB/sec            | 41.6 GB/sec            | 62.4 GB/sec            |
| Inter-node InfiniBand              | 56 Gbits/s          | 56 Gbits/s             | 56 Gbits/s             | 56 Gbits/s             | 100 Gbits/s            | 200 Gbits/s            |

HECC KB Sources:
- [Pleiades Configuration Details](https://www.nas.nasa.gov/hecc/support/kb/pleiades-configuration-details_77.html)
- [Electra Configuration Details](https://www.nas.nasa.gov/hecc/support/kb/electra-configuration-details_537.html)
- [Aitken Configuration Details](https://www.nas.nasa.gov/hecc/support/kb/aitken-configuration-details_580.html)
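
The base clock and flops/cycle rows of the table combine into a theoretical double-precision peak per core. A rough sketch, assuming each core sustains its base clock under full vector load (which may not hold in practice); the values are copied from the table and the variable names are illustrative:

```python
# (base clock in GHz, max double-precision flops/cycle/core), from the table
specs = {
    "Sandy Bridge": (2.6, 8),
    "Ivy Bridge": (2.8, 8),
    "Haswell": (2.5, 16),
    "Broadwell": (2.4, 16),
    "Skylake": (2.4, 32),
    "Cascade Lake": (2.5, 32),
}

# Peak GFLOP/s per core = clock (GHz) * flops per cycle
peak = {arch: ghz * flops for arch, (ghz, flops) in specs.items()}

for arch, gflops in peak.items():
    print(f"{arch:<13} {gflops:5.1f} GFLOP/s per core")
```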

## `np4` Results

| Compute Node | Micro-architecture | Cores /Node | PBS Code | CPU Time (s) (tlim=0.5) | zone-cycles /cpu_second (tlim=0.5) | CPU Time (s) (tlim=5) | zone-cycles /cpu_second (tlim=5) | CPU Time (s) (tlim=50) | zone-cycles /cpu_second (tlim=50) |
|--------------|--------------------|-------------|----------|--------------------------|------------------------------------|-----------------------|----------------------------------|------------------------|-----------------------------------|
| Aitken       | Cascade Lake       | 40          | cas_ait  | 2.96                     | 2.25e+07                           | 30.04                 | 2.22e+07                         | 688.81                 | 1.0306e+07                        |
| Electra      | Skylake            | 40          | sky_ele  | 3.05                     | 2.18e+07                           | 31.00                 | 2.15e+07                         | 774.20                 | 9.3390e+06                        |
| Pleiades     | Broadwell          | 28          | bro      | 2.94                     | 2.27e+07                           | 29.49                 | 2.26e+07                         | 688.62                 | 1.0310e+07                        |
|              | Haswell            | 24          | has      | 3.09                     | 2.16e+07                           | 31.45                 | 2.12e+07                         | 731.00                 | 9.6947e+06                        |
|              | Ivy Bridge         | 20          | ivy      | 3.24                     | 2.05e+07                           | 30.65                 | 2.17e+07                         | 747.96                 | 9.5300e+06                        |
|              | Sandy Bridge       | 16          | san      | 4.94                     | 1.35e+07                           | 48.76                 | 1.37e+07                         | 981.67                 | 7.2294e+06                        |
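
To compare architectures at a glance, the throughput column can be ranked and normalized. A small sketch using the tlim=5 values from the table, with Sandy Bridge chosen as the baseline purely for illustration:

```python
# zone-cycles/cpu_second at tlim=5, copied from the table above
throughput = {
    "Cascade Lake": 2.22e7,
    "Skylake": 2.15e7,
    "Broadwell": 2.26e7,
    "Haswell": 2.12e7,
    "Ivy Bridge": 2.17e7,
    "Sandy Bridge": 1.37e7,
}

baseline = throughput["Sandy Bridge"]  # illustrative choice of reference

# Sort fastest first and report each rate relative to the baseline
ranked = sorted(throughput.items(), key=lambda kv: kv[1], reverse=True)
for arch, rate in ranked:
    print(f"{arch:<13} {rate:.2e}  {rate / baseline:.2f}x")
```

Broadwell comes out on top at tlim=5, which is why the particle overhead runs use it.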

## Particle Overhead Results (Broadwell)

| tlim | CPU Time (s) (np4) | CPU Time (s) (1p) | zone-cycles /cpu_second (np4) | zone-cycles /cpu_second (1p) |
|------|--------------------|-------------------|-------------------------------|------------------------------|
| 0.5  | 2.94               | 0.68              | 2.27e+07                      | 9.72e+07                     |
| 5    | 29.49              | 6.83              | 2.26e+07                      | 9.73e+07                     |
| 50   | 688.62             | 66.60             | 1.03e+07                      | 9.97e+07                     |
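
The overhead can be summarized as the ratio of `np4` to `1p` CPU time at each `tlim`. A minimal sketch with the values from the table above; the variable names are illustrative:

```python
# tlim -> (CPU time np4, CPU time 1p) in seconds, from the table above
cpu_time = {
    0.5: (2.94, 0.68),
    5: (29.49, 6.83),
    50: (688.62, 66.60),
}

# Ratio > 1 means the np4 run took longer than the 1p run
ratios = {tlim: np4 / one_p for tlim, (np4, one_p) in cpu_time.items()}

for tlim, ratio in ratios.items():
    print(f"tlim={tlim:<4} np4/1p CPU time = {ratio:.2f}")
```

The ratio stays near 4.3 for the two shorter runs and rises to about 10.3 at tlim=50.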

## Tasks
- [x] Try rerunning on different CPU architectures
- [x] Run 1000 cycles/CPU (use working np4)
  - [x] Verify theoretical calculation of cycles/CPU based on input file
  - [x] Adjust tlim accordingly
  - [x] Adjust PBS job script requested times
  - [x] Use dev queue
- [x] Find fastest architecture (zone-cycles/cpu_second)
  - [x] With the fastest architecture, try 1-particle (npx1=npx2=npx3=1)
- [x] Make comparison table (Performance.ipynb)
