# Part I - Devito Performance modes

This tutorial is the first in a series describing the code generated under different DLE modes. It presents the performance optimizations applied by the Devito compiler, including those of the Devito Loop Engine (DLE), which targets parallelism and cache locality.

For the purposes of this tutorial we will compare the generated code across several combinations of DLE modes.

We will use a trivial `Operator` that, at each time step, increments all points in the physical domain by 1, and we will inspect the code produced in each case.

In [1]:
# Helper used to print the difference between two versions of the generated code.
def _unidiff_output(expected, actual):
    """
    Helper function. Returns a string containing the unified diff of two multiline strings.
    """
    import difflib
    expected = expected.splitlines(keepends=True)
    actual = actual.splitlines(keepends=True)

    diff = difflib.unified_diff(expected, actual)

    return ''.join(diff)
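
To see what this kind of helper returns before applying it to generated C code, here is a small self-contained sketch. The function name `unified_diff_text` and the two toy strings are made up for illustration; only `difflib.unified_diff` comes from the standard library.

```python
import difflib


def unified_diff_text(expected, actual):
    """Return the unified diff of two multiline strings as a single string."""
    expected_lines = expected.splitlines(keepends=True)
    actual_lines = actual.splitlines(keepends=True)
    return ''.join(difflib.unified_diff(expected_lines, actual_lines))


# Two toy "generated code" strings (hypothetical, not real Devito output).
before = "for x in X:\n  for y in Y:\n    u[x][y] += 1\n"
after = "for x in X:\n  #pragma omp simd\n  for y in Y:\n    u[x][y] += 1\n"

diff_text = unified_diff_text(before, after)
print(diff_text)

# Identical inputs produce an empty diff.
empty_diff = unified_diff_text(before, before)
```

Lines starting with `+` exist only in the second string and lines starting with `-` exist only in the first, which is how the diffs in this tutorial should be read.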

In [2]:
from devito import clear_cache
import numpy as np
clear_cache()


In [3]:
from devito import Grid, TimeFunction, Eq, Operator

# Initialise our problem parameters
nx = 200
ny = 200
grid = Grid(shape=(nx, ny))
u = TimeFunction(name='u', grid=grid)
eq = Eq(u.forward, u + 1)  # increment every point by 1 at each time step

# Set up one operator with the DLE set to 'noop' and one with the DLE set to 'advanced'.
op_noop = Operator(eq, dle='noop')
op_advanced = Operator(eq, dle='advanced')

str_op_noop = str(op_noop)
str_op_advanced = str(op_advanced)

print(_unidiff_output(str_op_noop, str_op_advanced))

--- 
+++ 
@@ -2,6 +2,8 @@
 #include "stdlib.h"
 #include "math.h"
 #include "sys/time.h"
+#include "xmmintrin.h"
+#include "pmmintrin.h"
 
 struct dataobj
 {
@@ -23,6 +25,9 @@
 int Kernel(struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x_M, const int x_m, const int y_M, const int y_m)
 {
   float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;
+  /* Flush denormal numbers to zero in hardware */
+  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
   for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))
   {
     struct timeval start_section0, end_section0;
@@ -30,6 +35,7 @@
     /* Begin section0 */
     for (int x = x_m; x <= x_M; x += 1)
     {
+      #pragma omp simd aligned(u:32)
       for (int y = y_m; y <= y_M; y += 1

The diff in the cell above highlights the differences between the two modes.
First of all, we notice the addition of
```
+#include "xmmintrin.h"
+#include "pmmintrin.h"
```
and 
```
+  /* Flush denormal numbers to zero in hardware */
+  _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
+  _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
```
Denormals are normally flushed to zero when using SSE-based instruction sets, except when compiling shared objects, which is why the flush is requested explicitly here.
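
As a rough illustration of what a denormal (subnormal) number is and what flushing means, consider the following pure-Python sketch. It works in double precision, whereas the generated kernel uses `float`, and `flush_to_zero` is a hypothetical helper that only mimics the effect of the hardware flags; it is not how the flags are implemented.

```python
import sys

# Smallest positive *normal* double; anything smaller (but non-zero) is subnormal.
smallest_normal = sys.float_info.min
subnormal = smallest_normal / 4


def flush_to_zero(value):
    """Hypothetical helper: replace subnormal values with zero, keep the rest."""
    if 0.0 < abs(value) < sys.float_info.min:
        return 0.0
    return value


print(0.0 < subnormal < smallest_normal)  # the value is representable, just tiny
print(flush_to_zero(subnormal))           # flushed
print(flush_to_zero(1.5))                 # ordinary values are untouched
```

Arithmetic on subnormal values is often much slower than on normal ones, so flushing trades a little accuracy near zero for speed.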


Next, we see that SIMD vectorization has been enabled through
```
+      #pragma omp simd aligned(u:32)
```
which is placed right before the innermost loop, the one iterating along the y direction.

In our next comparison we import `mode_performance` from devito and call it, which enables more optimizations. We then build an `Operator` with the DLE set to `speculative`:

In [5]:
from devito import mode_performance
mode_performance()

op_speculative = Operator(eq, dle='speculative')

str_op_noop = str(op_noop)
str_op_speculative = str(op_speculative)

print(_unidiff_output(str_op_noop, str_op_speculative))

--- 
+++ 
@@ -19,26 +19,42 @@
   double section0;
 } ;
 
+void bf0(struct dataobj *restrict u_vec, const int t0, const int t1, const int x0_blk_M, const int x0_blk_m, const int x0_blk_size, const int y0_blk_M, const int y0_blk_m, const int y0_blk_size);
 
-int Kernel(struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x_M, const int x_m, const int y_M, const int y_m)
+int Kernel(struct dataobj *restrict u_vec, const int time_M, const int time_m, struct profiler * timers, const int x0_blk_size, const int x_M, const int x_m, const int y0_blk_size, const int y_M, const int y_m)
 {
-  float (*restrict u)[u_vec->size[1]][u_vec->size[2]] __attribute__ ((aligned (64))) = (float (*)[u_vec->size[1]][u_vec->size[2]]) u_vec->data;
   for (int time = time_m, t0 = (time)%(2), t1 = (time + 1)%(2); time <= time_M; time += 1, t0 = (time)%(2), t1 = (time + 1)%(2))
   {
     struct timeval start_section0, end_section0;
     gettimeofday(&start_section0

Notice that the field computation now happens inside blocked loops.
The size of the blocks is chosen by the autotuner.
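
The body of `bf0` is cut off in the diff above, so the following is only a sketch of the general loop-blocking technique, not the code Devito generates. The function names and the block sizes (4 and 3) are arbitrary choices for illustration. It shows that iterating block by block visits exactly the same points as the plain loop nest, only in a different order.

```python
def increment_plain(u):
    """Plain loop nest: visit every point row by row."""
    for x in range(len(u)):
        for y in range(len(u[x])):
            u[x][y] += 1


def increment_blocked(u, x_blk_size, y_blk_size):
    """Blocked loop nest: visit the grid one (x, y) tile at a time."""
    nx, ny = len(u), len(u[0])
    for x_blk in range(0, nx, x_blk_size):
        for y_blk in range(0, ny, y_blk_size):
            # min(...) handles the partial tiles at the grid boundary
            for x in range(x_blk, min(x_blk + x_blk_size, nx)):
                for y in range(y_blk, min(y_blk + y_blk_size, ny)):
                    u[x][y] += 1


nx, ny = 10, 7  # deliberately not multiples of the block sizes
plain = [[0] * ny for _ in range(nx)]
blocked = [[0] * ny for _ in range(nx)]

increment_plain(plain)
increment_blocked(blocked, x_blk_size=4, y_blk_size=3)

same = plain == blocked
print(same)
```

Blocking pays off when each tile fits in cache and the stencil reuses neighbouring points; for a point-wise increment like this one there is no reuse, so the benefit is mostly illustrative.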

In [None]:
print(u_skew.data[0])
print("----------")
print(u_adv.data[0])
print("----------")

print("Comparison of the results is:", np.array_equal(u_skew.data[0], u_adv.data[0]))

In particular, we observe that:

* `u` has size 3 along the time dimension, since it was built with `save=3`. Therefore `op` could only execute 2 timesteps, namely time=0 and time=1; given `Eq(u.forward, u + 1)`, executing time=2 would cause out-of-bounds access errors. Devito figures this out automatically and sets appropriate minimum and maximum iteration points.
* All 16 points in each timeslice of the 4x4 `Grid` have been computed.
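
The arithmetic behind the first observation can be sketched as follows. `default_time_bounds` is a hypothetical helper written for this illustration and is not part of Devito; it assumes a single forward access `u[time + 1]` and a time dimension of size 3, as described above.

```python
def default_time_bounds(time_size, forward_offset=1):
    """Hypothetical helper: largest range keeping u[time + offset] in bounds."""
    time_m = 0
    time_M = time_size - 1 - forward_offset
    return time_m, time_M


time_size = 3
n_points = 16  # the 4x4 grid, flattened for simplicity

time_m, time_M = default_time_bounds(time_size)
steps = list(range(time_m, time_M + 1))

# Emulate u.forward = u + 1 on a buffer with one slice per saved time step.
u = [[0] * n_points for _ in range(time_size)]
for time in steps:
    u[time + 1] = [value + 1 for value in u[time]]

print(steps)
print(u[-1])
```

With three time slices, the last valid iteration is `time=1`, because it writes into slice 2; an iteration at `time=2` would write into a non-existent slice 3.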

To access all default arguments used by `op` *without* running the `Operator`, one can run:

In [None]:
#print(op_skew.arguments())
#print(op_adv.arguments())

`'u'` stores a pointer to the allocated data; `'timers'` stores a pointer to a data structure used for C-level performance profiling.

One may want to replace some of these default arguments. For example, we could increase the minimum iteration point along the spatial Dimensions `x` and `y`, and execute only the very first timestep:

In [None]:
u_skew.data[:] = 0.  # Explicit reset to initial value
summary = op_skew.apply(x_m=2, y_m=2, time_M=0)

We look at the computed data again to convince ourselves that everything went as intended:

In [None]:
u_skew.data

Given a generic `Dimension` `d`, the naming convention is such that:

* `d_m` is the minimum iteration point
* `d_M` is the maximum iteration point

Hence, `op.apply(..., d_m=4, d_M=7, ...)` will run `op` in the closed interval `[4, 7]` along `d`. For historical reasons, `d=...` is an alias for `d_M=...`; many Devito examples use `op.apply(..., time=10, ...)`, which is equivalent to `op.apply(..., time_M=10, ...)`.
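
The convention can be mimicked in a few lines of plain Python. Both functions below are hypothetical helpers written for this illustration and are not Devito APIs; in particular, letting `d_M` take precedence when both keywords are given is an assumption of the sketch.

```python
def iteration_points(d_m, d_M):
    """Points visited along a dimension: the closed interval [d_m, d_M]."""
    return list(range(d_m, d_M + 1))


def resolve_max(d=None, d_M=None):
    """Treat `d=...` as an alias for `d_M=...` (d_M wins if both are given)."""
    return d if d_M is None else d_M


points = iteration_points(d_m=4, d_M=7)
print(points)

# `time=10` and `time_M=10` resolve to the same maximum iteration point.
print(resolve_max(d=10) == resolve_max(d_M=10))
```

Note that both extremes are inclusive, unlike Python's own `range`, which is why the sketch adds 1 to `d_M`.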

If we try to specify an invalid iteration extreme, Devito will raise an exception.

In [None]:
from devito.exceptions import InvalidArgument
try:
    op_skew.apply(time_M=2)
except InvalidArgument as e:
    print(e)

The same `Operator` can be applied to a different `TimeFunction`. For example:

In [None]:
u2 = TimeFunction(name='u', grid=grid)
summary = op_skew.apply(u_skew=u2, time=10)
u2.data

Note that this was the third call to `op.apply`, but code generation and JIT-compilation only occurred upon the very first call.

There is one relevant case in which the maximum iteration point along the time dimension must be specified: whenever `save` is unset, since the `Operator` would not otherwise know for how many iterations to run.

In [None]:
v = TimeFunction(name='v', grid=grid)
op2 = Operator(Eq(v.forward, v + 1))
try:
    op2.apply()  # maximum time iteration point deliberately omitted
except ValueError as e:
    print(e)

In [None]:
summary = op2.apply(time_M=4)
v.data

The `summary` variable can be inspected to retrieve performance metrics.

In [None]:
summary

We observe that essentially all entries except the execution time are fixed at 0. This is because, by default, Devito avoids computing performance metrics in order to minimize the processing time before returning control to the user (in complex `Operators`, retrieving, for instance, the operation count or the memory footprint could take a significant amount of time). To compute all performance metrics, a user can either set the environment variable `DEVITO_PROFILING` to `'advanced'` or change the profiling level programmatically *before* the `Operator` is constructed:

In [None]:
from devito import configuration
configuration['profiling'] = 'advanced'

op = Operator(Eq(u_skew.forward, u_skew*u_skew + 1))
op.apply(time=10)