RAJA Performance Suite
======================

The RAJA Performance Suite is designed to explore the performance of loop-based
computational kernels found in HPC applications. Specifically, it can be
used to assess and monitor the runtime performance of kernels implemented using
[RAJA] C++ performance portability abstractions and to compare them to variants
implemented directly in common parallel programming models, such as OpenMP and
CUDA. Some important terminology used in the Suite includes:

  * `Kernel` is a distinct loop-based computation that appears in the Suite in
    multiple variants (or implementations), each of which performs the same
    computation.
  * `Variant` is an implementation or set of implementations of a kernel in the
    Suite that share the same approach/abstraction and programming model,
    such as baseline OpenMP, RAJA OpenMP, etc.
  * `Tuning` is a particular implementation of a variant of a kernel in the
    Suite, such as GPU block size 128, GPU block size 256, etc.
  * `Group` is a collection of kernels in the Suite that are grouped together
    because they originate from the same source, such as a specific benchmark
    suite.
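
The terminology above can be illustrated with a minimal, self-contained sketch.
It is not code from the Suite and does not use the RAJA API. The names
`triad_base_seq`, `forall_chunked`, and `triad_lambda` are hypothetical, and the
chunk size is only a stand-in for a real tuning parameter such as a GPU block
size. The kernel is a TRIAD-style update, written as two variants, with two
tunings of the second variant.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel: a TRIAD-style update a[i] = b[i] + alpha * c[i].
// Every variant below performs this same computation.

// Variant 1 (hypothetical "baseline sequential"): a plain C++ loop.
void triad_base_seq(std::vector<double>& a, const std::vector<double>& b,
                    const std::vector<double>& c, double alpha) {
  assert(a.size() == b.size() && b.size() == c.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    a[i] = b[i] + alpha * c[i];
  }
}

// Hypothetical stand-in for a loop abstraction (NOT the RAJA API): applies
// `body` to every index in [0, n), visiting the indices in chunks of
// ChunkSize. ChunkSize plays the role of a tuning parameter.
template <std::size_t ChunkSize, typename Body>
void forall_chunked(std::size_t n, Body&& body) {
  static_assert(ChunkSize > 0, "chunk size must be positive");
  for (std::size_t begin = 0; begin < n; begin += ChunkSize) {
    const std::size_t end = (begin + ChunkSize < n) ? begin + ChunkSize : n;
    for (std::size_t i = begin; i < end; ++i) {
      body(i);
    }
  }
}

// Variant 2 (hypothetical "lambda abstraction"): the loop body is a lambda
// handed to the abstraction. Each ChunkSize value is one tuning of it.
template <std::size_t ChunkSize>
void triad_lambda(std::vector<double>& a, const std::vector<double>& b,
                  const std::vector<double>& c, double alpha) {
  assert(a.size() == b.size() && b.size() == c.size());
  forall_chunked<ChunkSize>(a.size(), [&](std::size_t i) {
    a[i] = b[i] + alpha * c[i];
  });
}

// Returns true when the baseline variant and two tunings of the lambda
// variant produce identical results for a problem of size n.
bool variants_agree(std::size_t n) {
  std::vector<double> b(n), c(n);
  for (std::size_t i = 0; i < n; ++i) {
    b[i] = static_cast<double>(i);
    c[i] = 0.5 * static_cast<double>(i);
  }
  std::vector<double> a_base(n), a_128(n), a_256(n);
  triad_base_seq(a_base, b, c, 2.0);
  triad_lambda<128>(a_128, b, c, 2.0);
  triad_lambda<256>(a_256, b, c, 2.0);
  return a_base == a_128 && a_base == a_256;
}
```

Calling `variants_agree(1000)` checks that all three implementations compute
the same result, which is the property that makes timing comparisons between
variants and tunings meaningful.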

Each kernel in the Suite appears in multiple RAJA and non-RAJA (i.e., baseline)
variants using parallel programming models that RAJA supports. Some kernels have
multiple tunings of a variant to explore some of the parametrization that the
programming model supports. The kernels originate from various HPC benchmark
suites and applications. For example, the "Stream" group contains kernels from
the Babel Stream benchmark, the "Apps" group contains kernels extracted from
real scientific computing applications, and so forth.

The Suite can be run as a single process or with multiple processes when
configured with MPI support. Running with MPI in the same configuration used
by an HPC application allows the Suite to gather performance data that is more
relevant to that application than data gathered from a single-process run. For
example, running sequentially with one MPI rank per core and running
sequentially with a single process yield different performance results on most
multi-core CPUs.

Complete documentation for RAJAPerf is available here: [[RAJAPerf Documentation](https://rajaperf.readthedocs.io/en/develop/)]

* * *

RAJA Performance Suite Tutorial
===============================

The RAJA Performance Suite Tutorial guides the user through the process of running
RAJAPerf, including generating sweeps through various problem sizes, visualizing
the timing hierarchy, and walking through some simple post-processing analysis
that compares performance across two compilers, GCC and Clang.
In addition, the tutorial guides the user through adding a new kernel to the Suite
and rerunning the analysis on the new kernel.

The Jupyter environment comes prebuilt with Caliper-enabled versions of RAJAPerf
built with GCC and Clang. It also supports interactive C++ interpretation
of OpenMP constructs, using Xeus-Cling under the hood.
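
As a minimal sketch of the kind of OpenMP construct one might try
interactively, the function below computes a dot product with a parallel loop
and a sum reduction. The name `dot_omp` is illustrative and is not taken from
the tutorial notebooks. How OpenMP is enabled in a given environment is not
covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dot product using an OpenMP parallel-for with a sum reduction.
// If OpenMP is not enabled at compile time, the pragma is ignored and the
// loop simply runs sequentially.
double dot_omp(const std::vector<double>& x, const std::vector<double>& y) {
  assert(x.size() == y.size());
  const long n = static_cast<long>(x.size());
  double sum = 0.0;
  // Each thread accumulates a private partial sum; OpenMP combines them
  // into `sum` when the loop finishes.
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < n; ++i) {
    sum += x[i] * y[i];
  }
  return sum;
}
```

Because a reduction may combine partial sums in a different order from run to
run, floating-point results can differ slightly between thread counts unless
the inputs sum exactly.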

Technologies Referenced in the Tutorial
=======================================

* [[RAJA](https://raja.readthedocs.io/)]
* [[Caliper](https://software.llnl.gov/Caliper/)]
* [[Hatchet](https://llnl-hatchet.readthedocs.io/en/latest/user_guide.html)]
* [[OpenMP](https://www.openmp.org/)]
* [[Xeus-Cling](https://xeus-cling.readthedocs.io/en/latest/)]
* [[Cling](https://root.cern/cling/)]

Table of Contents
=================

1. [Intro Xeus-Cling](./01-intro-xeus-cling.ipynb)
2. [RAJA OpenMP](./02-raja-openmp.ipynb)
3. [Running RAJAPerf](./03-running-rajaperf.ipynb)
4. [Running RAJAPerf Sweeps](./04-rajaperf-sweeps.ipynb)
5. [Kernel Prep](./05-kernel-prep.ipynb)
6. [Add Kernel](./06-add-kernel.ipynb)
7. [Run Tutorial Kernel](./07-run-tutorial-kernel.ipynb)

* * *