# An Introduction to Enzyme

In this short tutorial we introduce the main concepts behind [Enzyme](https://enzyme.mit.edu), a modern compiler plugin within the [LLVM](https://llvm.org) ecosystem which synthesizes gradients while being deeply integrated into, and leveraging, the LLVM compiler infrastructure. In short, the most important core features of Enzyme are:

* Forward-Mode Derivatives
* Reverse-Mode Gradients
* Vector-Modes, and Vectorization of Undifferentiated Code
* Automatic Differentiation of Parallelism (MPI, MPI+OpenMP, OpenMP, RAJA, CUDA, ROCm)
* Cross-Language Automatic Differentiation

We will give a glimpse of some of these capabilities with code examples, and take an in-depth look at the key concepts behind Enzyme and at what makes Enzyme different from preceding tools. As Enzyme is a compiler plugin, it requires simulations to be compiled with an LLVM-based compiler into which it can be loaded. Such compilers are:

* [Clang](https://clang.llvm.org) & Flang(-new)
* Intel Compiler Suite: [Intel C/C++ Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html#gs.qzto86) & [Intel Fortran Compiler](https://www.intel.com/content/www/us/en/developer/tools/oneapi/fortran-compiler.html#gs.qztmhg)
* [AMD Optimizing C/C++ and Fortran Compilers (AOCC)](https://developer.amd.com/tools-and-sdks/)
* [ARM HPC Compiler Toolchain](https://community.arm.com/arm-community-blogs/b/high-performance-computing-blog/posts/arm-compilers-and-libraries-for-hpc-now-free): ARM C/C++ Compiler & ARM Fortran Compiler

While your simulation needs to be compiled with an LLVM-based compiler, you do not need a source installation of any of these compilers to follow along with this introduction to Enzyme.

## Outline

* [1. Getting Set Up for the Tutorial](#getting-set-up)
* [2. The Core Ideas Behind Enzyme's Design](#enzyme-design)
* [3. Differentiating First Functions](#first-differentiation)
  * [3.1 Square](#_square)
  * [3.2 Norm](#_norm)
* [4. In-Depth Look at Enzyme's Intermediate Steps in the Case of ReLU-3](#enzyme-relu3)
  * [4.1 Activity Analysis](#enzyme-activity-analysis)
  * [4.2 Allocation and Zeroing of the Shadow Memory](#enzyme-allocation-shadow)
  * [4.3 Computation of the Adjoints](#enzyme-comp-adjoints)
  * [4.4 Post Optimization](#enzyme-post-opt)
* [5. The Importance of Activity Analysis](#activity-analysis)
  * [5.1 Activity Annotations in Code](#enzyme-annotations)
* [6. Parallel Automatic Differentiation](#enzyme-parallel-ad)
  * [6.1 Seamless Automatic Differentiation of OpenMP](#enzyme-openmp)
* [7. Further Resources](#enzyme-further-resources)

## 1. Getting Set Up for the Tutorial <a name="getting-set-up"></a>

There are two main ways to follow along with this tutorial:

* Staying inside of this tutorial and utilizing the interactive [compiler explorer instances](https://enzyme.mit.edu/explorer) to interact with the examples
* Setting Enzyme up on your device and running the examples locally

For the sake of time, we will follow the first approach in this SIAM-CSE tutorial. We nevertheless strongly encourage you to sit down and play around with the examples on your local device and compile them yourself. To do so, you will need to

1. Install Enzyme
   * [Install Enzyme with Brew](https://formulae.brew.sh/formula/enzyme#default)
   * [Install Enzyme with Spack](https://spack.readthedocs.io/en/latest/package_list.html#enzyme)
   * [Pull the Enzyme Docker Container](https://github.com/EnzymeAD/enzyme-dev-docker)
   * [Install Enzyme from Source](https://enzyme.mit.edu/Installation/)
2. [Download the Enzyme-Tutorial](https://github.com/EnzymeAD/Enzyme-Tutorial)

We link the respective source file alongside each example as we go through this tutorial. Although this is hidden away in the compiler explorer instances, building Enzyme produces a number of shared libraries, which invoke Enzyme at different stages of the compilation process:

* `ClangEnzyme-<LLVM Version>.so`
* `FlangEnzyme-<LLVM Version>.so`
* `LLDEnzyme-<LLVM Version>.so`
* `LLVMEnzyme-<LLVM Version>.so`

The first two are simply loaded into the compiler, while `LLDEnzyme` is invoked during link-time optimization and is mostly used for the automatic differentiation of multi-source projects. `LLVMEnzyme` is mostly used to invoke Enzyme manually during the optimization process with `opt`, a stage which is usually handled automatically by a compiler such as Clang or Flang.

## 2. The Core Ideas Behind Enzyme's Design <a name="enzyme-design"></a>

In existing automatic differentiation pipelines, you customarily have a host language such as C, C++, Fortran, Julia, or Rust, and the derivative is generated at the level of that host language. Only afterwards is the differentiated code lowered into the compilation pipeline, where it is optimized and the executable is generated. For languages which use the LLVM compiler infrastructure, this takes the following shape:

<center>
    <img src = "https://i.imgur.com/uBfAsVh.png" width = "750">
</center>

Because Enzyme sits inside the compiler, it can run optimizations before gradient synthesis and benefit from the large library of optimizations that compiler developers have written over the years.

<center>
    <img src = "https://i.imgur.com/xtHeCon.png" width = "750">
</center>

## 3. Differentiating First Functions <a name="first-differentiation"></a>

### 3.1 Square <a name="_square"></a>

The first example we differentiate with Enzyme is a simple square function, whose derivative we take with respect to `x`.

```c
// Function to differentiate
double square(double x) {
  return x * x;
}
```

The Enzyme autodiff entry point is then declared in the C source file. The call can additionally take a number of annotations, whose purpose we explain in later parts of the tutorial.

```c
double __enzyme_autodiff(void*, ...);
```

After initializing an input value `x`, we can obtain the gradient of the square function at the point `x`.

```c
double x = 3.0; // example input; any value works
double grad_x = __enzyme_autodiff((void*)square, x);
printf("Gradient square(%f) = %f\n", x, grad_x);
```

We then compile this first square example with Clang, or the LLVM-based C compiler of our choice, loading Enzyme as an LLVM compiler plugin:

```bash
clang-12 square.c -O3 -Xclang -load -Xclang /path/to/Enzyme/ClangEnzyme-12.so
```

An interactive version of this example can be accessed in this saved [Enzyme Explorer instance](https://fwd.gymni.ch/L4Lu2A) or in the Enzyme-Tutorial under `1_square`.
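
As a sanity check on the value the call above is expected to return, the following sketch compares the hand-derived derivative $2x$ with a central finite difference. It is plain C and does not use Enzyme; the helper names `square_grad_by_hand` and `square_grad_fd` are hypothetical and only serve this illustration.

```c
#include <assert.h>

// Function to differentiate (same as in the tutorial)
double square(double x) {
  return x * x;
}

// Hand-derived derivative d/dx x^2 = 2x
double square_grad_by_hand(double x) {
  return 2.0 * x;
}

// Central finite difference, used only as an independent numerical check
double square_grad_fd(double x, double h) {
  assert(h > 0.0);
  return (square(x + h) - square(x - h)) / (2.0 * h);
}
```

For `x = 3.0`, both helpers should agree on a gradient of 6 up to rounding, which is also the value one would expect `grad_x` to hold in the Enzyme example.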

### 3.2 Norm <a name="_norm"></a>

To see the asymptotic speed-ups afforded by Enzyme's integration into the compiler, we differentiate a normalization function with and without optimizations being run before Enzyme, and time the two variants against each other. The normalization function we consider for this is

```c
// Normalization function
void normalize(double *__restrict__ out, const double *__restrict__ in, int n) {
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / mag(in, n);
}
```

We then compile two executables of this function: one which applies optimizations _before_ the gradient synthesis, and one which applies optimizations only _after_ the gradient synthesis. The two pipelines are:

1. O2-Optimization -> Enzyme -> O2-Optimization
2. Enzyme -> O2-Optimization -> O2-Optimization

Timing the two shows that while AD on the unoptimized IR takes >400s, running the optimization before AD results in a runtime of <1s. An interactive version of the unoptimized example can be accessed in this [Enzyme Explorer instance](), the optimized version in this [Enzyme Explorer Instance](), or in the Enzyme-Tutorial under `2_norm`.
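
A likely explanation for this gap, assuming `mag` is an O(n) reduction over the input, is loop-invariant code motion: as written, `mag(in, n)` is re-evaluated in every iteration, giving O(n^2) work, whereas an optimizer can hoist the call out of the loop before the gradient is synthesized. The sketch below is plain C without Enzyme. The body of `mag` is a hypothetical stand-in (its definition is not shown in this section), and `normalize_hoisted` is a hand-written illustration of what such an optimization may produce.

```c
#include <assert.h>

// Hypothetical stand-in for `mag`: an O(n) reduction over the whole input
// (here the sum of absolute values; the tutorial's definition may differ)
double mag(const double *in, int n) {
  double sum = 0.0;
  for (int i = 0; i < n; ++i)
    sum += in[i] < 0.0 ? -in[i] : in[i];
  return sum;
}

// As written in the source: mag is re-evaluated in every iteration, O(n^2) work
void normalize_naive(double *__restrict__ out, const double *__restrict__ in, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / mag(in, n);
}

// After hoisting the loop-invariant call: mag is evaluated once, O(n) work
void normalize_hoisted(double *__restrict__ out, const double *__restrict__ in, int n) {
  assert(n >= 0);
  const double m = mag(in, n);
  for (int i = 0; i < n; ++i)
    out[i] = in[i] / m;
}
```

Both variants compute the same output, so differentiating the hoisted form yields the same gradient while avoiding the derivative of the repeated `mag` calls.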

## 4. In-Depth Look at Enzyme's Intermediate Steps in the Case of ReLU-3 <a name="enzyme-relu3"></a>

To see what Enzyme actually does under the hood, we now consider the following ReLU-3 function and walk through the intermediate steps involved in the gradient synthesis. The figures show how each step translates into LLVM, at the abstraction level of the basic blocks of the LLVM IR. The C function is given by

```c
#include <math.h>  // pow

double relu3(double x) {
    double result;
    if (x > 0)
        result = pow(x, 3);
    else
        result = 0;
    return result;
}
```

and invoked as in the first code example above with `__enzyme_autodiff`.

<center>
    <img src = "https://i.imgur.com/cQHqoE3.png" width = "750">
</center>

### 4.1 Activity Analysis <a name="enzyme-activity-analysis"></a>

Enzyme begins by running a series of analyses. The most important of these is the _activity analysis_, which deduces which individual instructions and which particular values influence the gradient computation. This reduces the total amount of derivative computation, as only the derivatives which impact the final gradient we are interested in are taken. The resulting gradient code is much more compact, which reduces the runtime and makes our derivatives faster. It furthermore avoids attempts to differentiate functions which are not meant to be differentiated, such as `CPUID`.

<center>
    <img src = "https://i.imgur.com/jhaYti6.png" width = "750">
</center>
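
To make the distinction between active and inactive values concrete, the following hand-annotated sketch marks which values of a small function depend on the differentiated input. It is plain C without Enzyme; the function `cube_plus_steps` and the annotations in the comments are illustrative assumptions, since Enzyme performs this analysis itself on the LLVM IR.

```c
#include <assert.h>

// f(x, steps) = x^3 + steps, where the second term never depends on x
double cube_plus_steps(double x, int steps) {
  assert(steps >= 0);
  double offset = 0.0;             // inactive: computed from `steps` only
  for (int k = 0; k < steps; ++k)  // inactive: loop counter and bound
    offset += 1.0;
  double cube = x * x * x;         // active: depends on x
  return cube + offset;            // active: carries the dependence on x to the output
}

// Hand-written derivative with respect to x: only the active values are differentiated,
// the loop that builds `offset` does not appear at all
double cube_plus_steps_grad_by_hand(double x, int steps) {
  assert(steps >= 0);
  (void)steps;
  return 3.0 * x * x;
}
```

The derivative code contains no trace of the loop, which is the kind of saving the activity analysis is meant to enable.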

### 4.2 Allocation and Zeroing of the Shadow Memory <a name="enzyme-allocation-shadow"></a>

Next, Enzyme initializes the shadows, into which the gradients are accumulated later. When invoking Enzyme with a variable passed by reference, for example through a pointer, we need to _seed_ those shadows ourselves. Shadows of internal data structures are handled by Enzyme automatically. For the current example, Enzyme initializes shadows for all of the active variables and sets them to 0. Intermediate results are then accumulated into the shadows throughout the computation, and at the end Enzyme returns the total derivative.

<center>
    <img src = "https://i.imgur.com/Ilryot1.png" width = "750">
</center>
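
The role of seeding is easiest to see for a function whose input and output are both passed by reference. The following hand-written sketch is plain C without Enzyme; `square_into` and `square_into_reverse` are hypothetical names, and `d_x` and `d_out` play the role of the shadows of `x` and `out`.

```c
#include <assert.h>

// Primal function with an input and an output passed by reference
void square_into(const double *x, double *out) {
  assert(x && out);
  *out = (*x) * (*x);
}

// Hand-written reverse pass operating on the shadows
void square_into_reverse(const double *x, double *d_x, double *d_out) {
  assert(x && d_x && d_out);
  *d_x += 2.0 * (*x) * (*d_out);  // accumulate into the input shadow
  *d_out = 0.0;                   // the output shadow has been consumed
}
```

In this sketch the caller zero-initializes `d_x` and seeds `d_out` with 1.0 before the reverse pass. Afterwards `d_x` holds the derivative of `out` with respect to `x`.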

### 4.3 Computation of the Adjoints <a name="enzyme-comp-adjoints"></a>

With the shadows in place, we can compute the adjoints, focusing in particular on the $x>0$ branch of the computation.

<center>
    <img src = "https://i.imgur.com/AyVAzvI.png" width = "750">
</center>

The derivative of `pow(x, 3)`, i.e. $x^3$, is $3x^2$, or `3 * pow(x, 2)` in code, and Enzyme performs the corresponding propagation of partial gradients.

<center>
    <img src = "https://i.imgur.com/PmGZzY9.png" width = "750">
</center>
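
Written out by hand in C, the steps so far (zeroed shadows, a seed on the result, and the adjoint of the $x>0$ branch) could look roughly as follows. This is an illustrative sketch rather than Enzyme's actual output: the function name is hypothetical, and $x^2$ is written as a plain multiplication instead of `pow(x, 2)`.

```c
#include <assert.h>

// Hand-written, unoptimized reverse pass for relu3
double relu3_grad_with_shadows(double x) {
  // Shadows of the active values, zero-initialized
  double d_x = 0.0;
  double d_result = 0.0;

  // Seed: derivative of the returned value with respect to itself
  d_result = 1.0;

  // Reverse of the branch: only the x > 0 path depends on x
  if (x > 0) {
    // result = x^3, hence d_x += 3 * x^2 * d_result
    d_x += 3.0 * x * x * d_result;
  }
  // else: result = 0 is a constant, so there is nothing to propagate

  d_result = 0.0;  // the adjoint of result has been consumed
  assert(d_result == 0.0);
  return d_x;
}
```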

### 4.4 Post Optimization <a name="enzyme-post-opt"></a>

After the adjoint computation, Enzyme runs further optimization passes on the generated code. If we were to translate the resulting LLVM IR back into C by hand, it would amount to the optimal gradient we could have written ourselves.

<center>
    <img src = "https://i.imgur.com/s13x0oS.png" width = "750">
</center>
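
For reference, a compact hand-written gradient of ReLU-3 in C might look like the sketch below, together with a finite-difference check. This is an assumption about what "optimal" means here, not a transcription of Enzyme's output. The names `relu3_mul`, `relu3_grad_by_hand` and `relu3_grad_fd` are hypothetical, and the cube is written with plain multiplications instead of `pow`.

```c
#include <assert.h>

// ReLU-3 written with plain multiplications instead of pow(x, 3)
double relu3_mul(double x) {
  return x > 0 ? x * x * x : 0.0;
}

// Compact hand-written gradient: 3x^2 on the positive branch, 0 otherwise
double relu3_grad_by_hand(double x) {
  return x > 0 ? 3.0 * x * x : 0.0;
}

// Central finite difference as an independent numerical check
double relu3_grad_fd(double x, double h) {
  assert(h > 0.0);
  return (relu3_mul(x + h) - relu3_mul(x - h)) / (2.0 * h);
}
```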

## 5. The Importance of Activity Analysis <a name="activity-analysis"></a>

As seen in the previous example, reasoning about which variables are relevant to the gradient synthesis is of utmost importance for performant automatic differentiation. In Enzyme, there are three main activity categories with which we annotate our variables:

* `enzyme_const`: Values which do not impact the derivative computation and/or do not need to be differentiated. For example, there is no need to differentiate the size of an array.
* `enzyme_dup`: Duplicated arguments are active/differentiable values passed by reference, for example through a pointer. `enzyme_dup` furthermore requires you to pass a second, shadow variable into which the derivative information is stored. Reverse mode adds the derivative into the shadow, which should therefore generally be zero-initialized as shown in the seeding tutorial. A sub-variant of `enzyme_dup` is `enzyme_dupnoneed`, which only returns the gradient while discarding the original return of the function.
* `enzyme_out`: Output arguments are active/differentiable values passed by value. Examples include floats and doubles.

### 5.1 Activity Annotations in Code <a name="enzyme-annotations"></a>

To show the difference activity annotations can make to the performance of our differentiated code, we examine the dot-product of two vectors:

```c
double dot(double* __restrict__ A, double* __restrict__ B, double C, int n) {
  double sum = 0;
  for (int i=0; i<n; i++) {
    sum += A[i] * B[i];
  }
  return C + sum;
}
```

We can call Enzyme without any activity annotations:

```c
__enzyme_autodiff((void*)dot,
                   A, grad_A,
                   B, grad_B,
                   C,
                   n);
```

Alternatively, we can add activity annotations so that Enzyme knows exactly how to handle each provided argument:

```c
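// The annotations are global integers which Enzyme recognizes by name
int enzyme_dup;
int enzyme_out;
int enzyme_const;

__enzyme_autodiff((void*)dot,
                   enzyme_dup, A, grad_A,
                   enzyme_dup, B, grad_B,
                   enzyme_out, C,
                   enzyme_const, n);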
```

Viewing both examples in the compiler explorer, we can see the true impact activity annotations have on the performance of our differentiation:

* [No Activity Annotations](https://fwd.gymni.ch/ou4xEJ)
* [With Activity Annotations](https://fwd.gymni.ch/caYQGk)

The activity annotations give us a ~33% performance uplift. As you can see from the compilation flags, Enzyme does not require additional flags for the annotations. An example connecting both examples and showing the performance difference between the two can be found in `3_dot` of the Enzyme-Tutorial.
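
To see what the annotated call is expected to compute, here is a hand-written reverse-mode gradient of `dot` in plain C, without Enzyme. The function name `dot_grad_by_hand` is hypothetical. The sketch mirrors the annotations: `grad_A` and `grad_B` are shadows which are accumulated into (like `enzyme_dup` arguments), the derivative with respect to `C` is returned by value (like an `enzyme_out` argument), and `n` is not differentiated (like an `enzyme_const` argument).

```c
#include <assert.h>

// Hand-written reverse-mode gradient of dot(A, B, C, n) = C + sum_i A[i] * B[i]
double dot_grad_by_hand(const double *A, double *grad_A,
                        const double *B, double *grad_B,
                        int n) {
  assert(n >= 0);
  const double seed = 1.0;  // derivative of the return value with respect to itself
  for (int i = 0; i < n; i++) {
    grad_A[i] += B[i] * seed;  // d(dot)/dA[i] = B[i]
    grad_B[i] += A[i] * seed;  // d(dot)/dB[i] = A[i]
  }
  return seed;  // d(dot)/dC = 1
}
```

Because the shadows are accumulated into with `+=`, calling the function twice without resetting `grad_A` and `grad_B` doubles their contents, which is why the shadows should generally be zero-initialized.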

## 6. Parallel Automatic Differentiation <a name="enzyme-parallel-ad"></a>

As simulations become ever more highly resolved or include ever more physics, they need to parallelize ever better, and the automatic differentiation of such simulations has to keep pace. This introduces multiple difficulties for automatic differentiation. For example, reversing the control flow turns concurrent reads of a shared location in the forward pass into concurrent writes, and hence a write race, in the reverse pass. Enzyme provides a number of optimizations and strategies which solve these issues for automatic differentiation users and ensure scalability similar to that of the original program, including but not limited to:

* Non-atomic load/store
* Atomic increments
* ...
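
The read-to-write reversal and the two strategies listed above can be illustrated with a hand-written sketch in plain C with OpenMP pragmas. It does not use Enzyme, and `scale` and `scale_reverse` are hypothetical names. In the forward pass every iteration reads the shared scalar `a`. In the reverse pass every iteration updates the shared shadow `*d_a`, which needs an atomic increment, while the per-element shadow `d_in[i]` can be updated non-atomically.

```c
#include <assert.h>

// Forward pass: all iterations read the same shared scalar `a` (concurrent reads are safe)
void scale(double *out, const double *in, double a, int n) {
  assert(n >= 0);
#pragma omp parallel for
  for (int i = 0; i < n; i++)
    out[i] = a * in[i];
}

// Hand-written reverse pass: all iterations write to the same shadow `*d_a`
void scale_reverse(const double *d_out, const double *in, double *d_in,
                   double a, double *d_a, int n) {
  assert(n >= 0);
#pragma omp parallel for
  for (int i = 0; i < n; i++) {
    d_in[i] += a * d_out[i];   // distinct location per iteration: non-atomic update
#pragma omp atomic
    *d_a += in[i] * d_out[i];  // shared location: atomic increment
  }
}
```

Without `-fopenmp` the pragmas are ignored and the code runs serially with the same result. A reduction over `d_a` would be an alternative design to the atomic increment.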

Operating inside the compiler affords us many niceties in this regard, such as the ability to freely combine different parallelization paradigms, including

* MPI
* Julia Threads
* OpenMP
* CUDA
* ROCm

Enzyme's approach of operating on the optimized intermediate representation of the compiler is preserved throughout:

<center>
    <img src = "https://i.imgur.com/QNgiw4U.png" width = "750">
</center>

The implementation of each individual parallelization paradigm can be viewed as a basic building block. These blocks can be combined to differentiate higher-level parallelization paradigms such as the [RAJA Performance Portability Layer](https://github.com/LLNL/RAJA), without requiring custom logic for them inside of Enzyme. Such combinations of individual automatic differentiation logic are only possible due to Enzyme's deep integration into the compiler and the abstraction level it operates on. Different parallelization paradigms utilizing e.g. fork-join parallelism look the same to Enzyme at the level of the LLVM IR, as visualized in the following graphic:

<center>
    <img src = "https://i.imgur.com/yM1cMUv.png" width = "750">
</center>

The automatic differentiation of CPU parallelism in particular is entirely seamless, as the following code example with OpenMP parallelism shows.

### 6.1 Seamless Automatic Differentiation of OpenMP <a name="enzyme-openmp"></a>

Returning to the earlier `square` example, we now trivially parallelize the evaluation of the square function over a number of inputs with an OpenMP `parallel for` region, through which Enzyme is able to differentiate seamlessly:

```c
void omp(float *x, int npoints) {
#pragma omp parallel for
    for (int i = 0; i < npoints; i++) {
        x[i] *= x[i];
    }
}
```

We can then apply the Enzyme autodiff call to it in the usual fashion, without the need for any additional annotation:

```c
double __enzyme_autodiff(void*, ...);

__enzyme_autodiff((void*)omp, array, d_array, 1000);
```

This is compiled in the usual fashion, with the added `-fopenmp` flag customary for all OpenMP compilation with LLVM-based compilers. An interactive version of this example can be found in the [Enzyme Explorer](https://fwd.gymni.ch/QcHiCN), or in the Enzyme-Tutorial under `openmp/parallel_for`.
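
For comparison, a hand-written derivative of the in-place parallel square might look like the sketch below. It is plain C with OpenMP pragmas and does not use Enzyme. The function name is hypothetical, and the convention that `d_x` holds the seed for the outputs on entry and the gradient with respect to the inputs on exit is an assumption of this sketch. Since the forward pass overwrites `x`, the original values are cached for the reverse pass.

```c
#include <assert.h>
#include <stdlib.h>

// Hand-written forward and reverse pass for the in-place square x[i] *= x[i]
void omp_square_grad_by_hand(float *x, float *d_x, int npoints) {
  assert(npoints >= 0);
  // Cache for the original inputs, which the forward pass overwrites
  float *x_in = malloc((size_t)npoints * sizeof *x_in);
  assert(x_in != NULL || npoints == 0);

  // Forward pass: remember the input, then square in place
#pragma omp parallel for
  for (int i = 0; i < npoints; i++) {
    x_in[i] = x[i];
    x[i] *= x[i];
  }

  // Reverse pass: d(x_in^2)/d(x_in) = 2 * x_in, scaled by the incoming seed.
  // Each iteration touches only its own element, so no atomics are needed.
#pragma omp parallel for
  for (int i = 0; i < npoints; i++)
    d_x[i] = 2.0f * x_in[i] * d_x[i];

  free(x_in);
}
```

With a seed of 1 in every entry of `d_x`, the result is the element-wise derivative `2 * x[i]` of the original inputs.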

## 7. Further Resources <a name="enzyme-further-resources"></a>


While we have now gotten a glimpse of Enzyme's approach to automatic differentiation, there are a number of features which we have left out:

* Forward-Mode Automatic Differentiation
* Split-Mode Automatic Differentiation
* Vector-Mode Forward & Reverse-Mode
* Custom Gradients
* Custom Behaviour
* Automatic Differentiation of further parallel programming models such as MPI, CUDA, and ROCm

To explore examples illustrating these individual features, please take a look at:

* [Enzyme Tutorial](https://github.com/EnzymeAD/Enzyme-Tutorial)
* [Enzyme Documentation: Getting Started](https://enzyme.mit.edu/getting_started/)
* [Example of Enzyme in a Makefile with the manual Compilation Steps for Multisource Projects](https://github.com/EnzymeAD/Enzyme-Tutorial/blob/main/9_multisource/Makefile)
* [Enzyme CMake Example Project](https://github.com/EnzymeAD/CMake-Template)