# Performance and Acceleration

In this section, we shall see how to accelerate our Python assembly code and reach near-native speed.
We will be using [Numba](https://numba.pydata.org/) and [Pyccel](https://github.com/pyccel/pyccel).

#### Important

The **StencilMatrix** format is based on negative indexing, which we provide as syntactic sugar by overriding the **__getitem__** and **__setitem__** methods. The entries are in fact stored in the private attribute **\_data**, which is a *numpy.ndarray*. We therefore shift the indices by adding **p**, as done in the following **StencilMatrix** method:

```python
class StencilMatrix( object ):
    ...
    @staticmethod
    def _shift_index( index, shift ):
        if isinstance( index, slice ):
            start = None if index.start is None else index.start + shift
            stop  = None if index.stop  is None else index.stop  + shift
            return slice(start, stop, index.step)
        else:
            return index + shift
```
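
To make the shift concrete, here is a minimal sketch of the idea on a one-dimensional toy container. The class `ShiftedRow` is hypothetical and is not part of simplines; it only assumes that logical indices run from `-p` to `p` and are stored at positions `0` to `2p`.

```python
import numpy as np

class ShiftedRow:
    """Hypothetical toy container: logical indices -p..p map to storage 0..2p."""

    def __init__(self, p):
        self._p = p
        self._data = np.zeros(2 * p + 1)

    @staticmethod
    def _shift_index(index, shift):
        # same logic as the method shown above
        if isinstance(index, slice):
            start = None if index.start is None else index.start + shift
            stop = None if index.stop is None else index.stop + shift
            return slice(start, stop, index.step)
        return index + shift

    def __getitem__(self, key):
        return self._data[self._shift_index(key, self._p)]

    def __setitem__(self, key, value):
        self._data[self._shift_index(key, self._p)] = value


row = ShiftedRow(p=2)
row[-2] = 1.0   # stored in _data[0]
row[0] = 5.0    # stored in _data[2]
row[2] = 9.0    # stored in _data[4]
```

An accelerated kernel receives only `_data`, so it has to apply the shift by `p` itself.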

In order to use a Python accelerator, we first need to pass the *ndarray* as an argument to the assembly function, and not the **StencilMatrix** object.

Since this makes the calls *ugly*, it is highly recommended to create an *interface* function that calls the *core* assembly function. It is the latter that we will accelerate using Numba or Pyccel.
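
The interface/core split can be sketched on a toy problem as follows. The names `ToyMatrix`, `core_fill_diagonal` and `fill_diagonal` are hypothetical and only illustrate the pattern: the core sees scalars and arrays only, while the interface unpacks the object.

```python
import numpy as np

class ToyMatrix:
    """Hypothetical stand-in for a matrix object exposing its storage as _data."""

    def __init__(self, n, p):
        self._data = np.zeros((n, 2 * p + 1))


def core_fill_diagonal(n: 'int', p: 'int', value: 'float', data: 'float[:,:]'):
    # core: takes only scalars and arrays, writes in place, returns nothing;
    # this is the function one would hand over to an accelerator
    for i in range(n):
        data[i, p] += value


def fill_diagonal(M, n, p, value):
    # interface: unpacks the object and forwards its raw array to the core
    core_fill_diagonal(n, p, value, M._data)
    return M


M = fill_diagonal(ToyMatrix(4, 1), 4, 1, 2.0)
```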

In [1]:
from simplines import SplineSpace
from simplines import TensorSpace
from simplines import StencilMatrix
from simplines import StencilVector

## 1D case

### Pure Python

The novelties here are:

* we add type annotations to the function arguments
* we use the matrix as an **inout** argument, in the spirit of Fortran. This means that we write a procedure that modifies its argument in place, and not a function that returns a result.
* we shift the index used to access memory when storing the matrix entries
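
As an illustration of the shifted storage, the sketch below expands an array stored as `data[i, p + j - i]` into a dense matrix. It assumes an array of shape `(n, 2p+1)` without any ghost rows; the helper `stencil_to_dense` is hypothetical and not part of simplines.

```python
import numpy as np

def stencil_to_dense(data, p):
    """Expand stencil storage data[i, p + j - i] into a dense (n, n) array."""
    n = data.shape[0]
    dense = np.zeros((n, n))
    for i in range(n):
        # only columns j with |j - i| <= p are stored
        for j in range(max(0, i - p), min(n, i + p + 1)):
            dense[i, j] = data[i, p + j - i]
    return dense


# tridiagonal example (p = 1): the three columns hold the sub-, main and super-diagonal
n, p = 4, 1
data = np.zeros((n, 2 * p + 1))
data[:, 0] = -1.0   # entries A[i, i-1]
data[:, 1] = 2.0    # entries A[i, i]
data[:, 2] = -1.0   # entries A[i, i+1]
A = stencil_to_dense(data, p)
```

Entries of `data` that fall outside the matrix, such as `data[0, 0]`, are simply ignored by the conversion.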

In [2]:
def assemble_stiffness_1d(nelements: 'int', 
                          degree: 'int', 
                          spans: 'int[:]', 
                          basis: 'double[:,:,:,:]', 
                          weights: 'double[:,:]', 
                          points: 'double[:,:]', 
                          matrix: 'double[:,:]'):
    """
    assembling the stiffness matrix using stencil forms
    """

    # ... sizes
    ne1       = nelements
    p1        = degree
    spans_1   = spans
    basis_1   = basis
    weights_1 = weights
    points_1  = points

    k1 = weights.shape[1]
    # ...

    # ... build matrices
    for ie1 in range(0, ne1):
        i_span_1 = spans_1[ie1]
        for il_1 in range(0, p1+1):
            for jl_1 in range(0, p1+1):
                i1 = i_span_1 - p1 + il_1
                j1 = i_span_1 - p1 + jl_1

                v = 0.0
                for g1 in range(0, k1):
                    bi_0 = basis_1[ie1, il_1, 0, g1]
                    bi_x = basis_1[ie1, il_1, 1, g1]

                    bj_0 = basis_1[ie1, jl_1, 0, g1]
                    bj_x = basis_1[ie1, jl_1, 1, g1]

                    wvol = weights_1[ie1, g1]

                    v += (bi_x * bj_x) * wvol

                # we shift the test index by p1
                matrix[i1, p1+j1-i1]  += v
    # ...

    # NOTE: we will not return the matrix.
    #       explanations will come later
    #return matrix

#### Timing using pure Python

In [3]:
V = SplineSpace(degree=3, nelements=400)
M = StencilMatrix(V.vector_space, V.vector_space)

%timeit assemble_stiffness_1d( V.nelements, V.degree, V.spans, V.basis, V.weights, V.points, M._data )

38 ms ± 2.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
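
Before accelerating a kernel, it can help to check it on data whose result is known. The sketch below runs the same loop structure on synthetic arrays for piecewise-linear functions (degree 1) on a uniform grid of `[0, 1]`. The arrays are built by hand in the layout that the kernel indexes (`basis[element, local index, derivative, quadrature point]`); they are an assumption for illustration and are not produced by simplines.

```python
import numpy as np

def stiffness_kernel(ne, p, spans, basis, weights, matrix):
    # same loop structure as assemble_stiffness_1d, without the unused arguments
    k = weights.shape[1]
    for ie in range(ne):
        i_span = spans[ie]
        for il in range(p + 1):
            for jl in range(p + 1):
                i = i_span - p + il
                j = i_span - p + jl
                v = 0.0
                for g in range(k):
                    v += basis[ie, il, 1, g] * basis[ie, jl, 1, g] * weights[ie, g]
                matrix[i, p + j - i] += v


ne, p, k = 4, 1, 2
h = 1.0 / ne
spans = np.arange(ne) + p              # assumed span index of each element
weights = np.full((ne, k), h / k)      # equal weights summing to h on each element
basis = np.zeros((ne, p + 1, 2, k))
basis[:, 0, 1, :] = -1.0 / h           # derivative of the decreasing hat function
basis[:, 1, 1, :] = +1.0 / h           # derivative of the increasing hat function
# the function values (third index 0) are not needed for the stiffness matrix

matrix = np.zeros((ne + p, 2 * p + 1))
stiffness_kernel(ne, p, spans, basis, weights, matrix)
```

With `h = 0.25`, the main diagonal (column `p`) should contain `4, 8, 8, 8, 4` and the stored off-diagonal entries should equal `-4`, which is the classical `1/h * [-1, 2, -1]` pattern with boundary rows included.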


### Using Numba

When using Numba, one needs to decorate the function with **njit** (or **jit**). Here we redefine the function under a new name so that the pure Python version remains available:

In [4]:
from numba import njit

@njit
def assemble_stiffness_1d_numba(nelements: 'int', 
                          degree: 'int', 
                          spans: 'int[:]', 
                          basis: 'double[:,:,:,:]', 
                          weights: 'double[:,:]', 
                          points: 'double[:,:]', 
                          matrix: 'double[:,:]'):
    """
    assembling the stiffness matrix using stencil forms
    """

    # ... sizes
    ne1       = nelements
    p1        = degree
    spans_1   = spans
    basis_1   = basis
    weights_1 = weights
    points_1  = points

    k1 = weights.shape[1]
    # ...

    # ... build matrices
    for ie1 in range(0, ne1):
        i_span_1 = spans_1[ie1]
        for il_1 in range(0, p1+1):
            for jl_1 in range(0, p1+1):
                i1 = i_span_1 - p1 + il_1
                j1 = i_span_1 - p1 + jl_1

                v = 0.0
                for g1 in range(0, k1):
                    bi_0 = basis_1[ie1, il_1, 0, g1]
                    bi_x = basis_1[ie1, il_1, 1, g1]

                    bj_0 = basis_1[ie1, jl_1, 0, g1]
                    bj_x = basis_1[ie1, jl_1, 1, g1]

                    wvol = weights_1[ie1, g1]

                    v += (bi_x * bj_x) * wvol

                # we shift the test index by p1
                matrix[i1, p1+j1-i1]  += v
    # ...

    # NOTE: we will not return the matrix.
    #       explanations will come later
    #return matrix

#### Timing

In [5]:
V = SplineSpace(degree=3, nelements=400)
M = StencilMatrix(V.vector_space, V.vector_space)

%timeit assemble_stiffness_1d_numba( V.nelements, V.degree, V.spans, V.basis, V.weights, V.points, M._data )

42.3 µs ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Using Pyccel

In [6]:
from pyccel.epyccel import epyccel

In [7]:
assemble_stiffness_1d_pyccel = epyccel(assemble_stiffness_1d)

#### Timing

In [8]:
V = SplineSpace(degree=3, nelements=400)
M = StencilMatrix(V.vector_space, V.vector_space)

%timeit assemble_stiffness_1d_pyccel( V.nelements, V.degree, V.spans, V.basis, V.weights, V.points, M._data )

21.4 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)


## 2D case

### Pure Python

In [9]:
def assemble_stiffness_2d(ne1: 'int', ne2: 'int', 
                          p1:  'int', p2:  'int', 
                          spans_1:          'int[:]', spans_2:          'int[:]', 
                          basis_1: 'double[:,:,:,:]', basis_2: 'double[:,:,:,:]', 
                          weights_1:   'double[:,:]', weights_2:   'double[:,:]', 
                          points_1:    'double[:,:]', points_2:    'double[:,:]', 
                          matrix: 'double[:,:,:,:]'):
    """
    assembling the stiffness matrix using stencil forms
    """

    # ... sizes
    k1 = weights_1.shape[1]
    k2 = weights_2.shape[1]
    # ...

    # ... build matrices
    for ie1 in range(0, ne1):
        i_span_1 = spans_1[ie1]
        for ie2 in range(0, ne2):
            i_span_2 = spans_2[ie2]
            # evaluation depending only on the element

            for il_1 in range(0, p1+1):
                for il_2 in range(0, p2+1):
                    for jl_1 in range(0, p1+1):
                        for jl_2 in range(0, p2+1):
                            i1 = i_span_1 - p1 + il_1
                            j1 = i_span_1 - p1 + jl_1

                            i2 = i_span_2 - p2 + il_2
                            j2 = i_span_2 - p2 + jl_2

                            v = 0.0
                            for g1 in range(0, k1):
                                for g2 in range(0, k2):
                                    bi_0 = basis_1[ie1, il_1, 0, g1] * basis_2[ie2, il_2, 0, g2]
                                    bi_x = basis_1[ie1, il_1, 1, g1] * basis_2[ie2, il_2, 0, g2]
                                    bi_y = basis_1[ie1, il_1, 0, g1] * basis_2[ie2, il_2, 1, g2]

                                    bj_0 = basis_1[ie1, jl_1, 0, g1] * basis_2[ie2, jl_2, 0, g2]
                                    bj_x = basis_1[ie1, jl_1, 1, g1] * basis_2[ie2, jl_2, 0, g2]
                                    bj_y = basis_1[ie1, jl_1, 0, g1] * basis_2[ie2, jl_2, 1, g2]

                                    wvol = weights_1[ie1, g1] * weights_2[ie2, g2]

                                    v += (bi_x * bj_x + bi_y * bj_y) * wvol

                            matrix[i1, i2, p1+j1-i1, p2+j2-i2]  += v
    # ...

#### Timing

In [10]:
# create the spline space for each direction
V1 = SplineSpace(degree=3, nelements=32)
V2 = SplineSpace(degree=3, nelements=32)

# create the tensor space
V = TensorSpace(V1, V2)

M = StencilMatrix(V.vector_space, V.vector_space)

%timeit assemble_stiffness_2d( V1.nelements, V2.nelements, V1.degree, V2.degree, V1.spans, V2.spans, V1.basis, V2.basis, V1.weights, V2.weights, V1.points, V2.points, M._data )

15.8 s ± 493 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Using Numba

In [11]:
@njit
def assemble_stiffness_2d_numba(ne1: 'int', ne2: 'int', 
                          p1:  'int', p2:  'int', 
                          spans_1:          'int[:]', spans_2:          'int[:]', 
                          basis_1: 'double[:,:,:,:]', basis_2: 'double[:,:,:,:]', 
                          weights_1:   'double[:,:]', weights_2:   'double[:,:]', 
                          points_1:    'double[:,:]', points_2:    'double[:,:]', 
                          matrix: 'double[:,:,:,:]'):
    """
    assembling the stiffness matrix using stencil forms
    """

    # ... sizes
    k1 = weights_1.shape[1]
    k2 = weights_2.shape[1]
    # ...

    # ... build matrices
    for ie1 in range(0, ne1):
        i_span_1 = spans_1[ie1]
        for ie2 in range(0, ne2):
            i_span_2 = spans_2[ie2]
            # evaluation depending only on the element

            for il_1 in range(0, p1+1):
                for il_2 in range(0, p2+1):
                    for jl_1 in range(0, p1+1):
                        for jl_2 in range(0, p2+1):
                            i1 = i_span_1 - p1 + il_1
                            j1 = i_span_1 - p1 + jl_1

                            i2 = i_span_2 - p2 + il_2
                            j2 = i_span_2 - p2 + jl_2

                            v = 0.0
                            for g1 in range(0, k1):
                                for g2 in range(0, k2):
                                    bi_0 = basis_1[ie1, il_1, 0, g1] * basis_2[ie2, il_2, 0, g2]
                                    bi_x = basis_1[ie1, il_1, 1, g1] * basis_2[ie2, il_2, 0, g2]
                                    bi_y = basis_1[ie1, il_1, 0, g1] * basis_2[ie2, il_2, 1, g2]

                                    bj_0 = basis_1[ie1, jl_1, 0, g1] * basis_2[ie2, jl_2, 0, g2]
                                    bj_x = basis_1[ie1, jl_1, 1, g1] * basis_2[ie2, jl_2, 0, g2]
                                    bj_y = basis_1[ie1, jl_1, 0, g1] * basis_2[ie2, jl_2, 1, g2]

                                    wvol = weights_1[ie1, g1] * weights_2[ie2, g2]

                                    v += (bi_x * bj_x + bi_y * bj_y) * wvol

                            matrix[i1, i2, p1+j1-i1, p2+j2-i2]  += v
    # ...

#### Timing

In [12]:
# create the spline space for each direction
V1 = SplineSpace(degree=3, nelements=32)
V2 = SplineSpace(degree=3, nelements=32)

# create the tensor space
V = TensorSpace(V1, V2)

M = StencilMatrix(V.vector_space, V.vector_space)

%timeit assemble_stiffness_2d_numba( V1.nelements, V2.nelements, V1.degree, V2.degree, V1.spans, V2.spans, V1.basis, V2.basis, V1.weights, V2.weights, V1.points, V2.points, M._data )

9.1 ms ± 70.8 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Using Pyccel

In [13]:
assemble_stiffness_2d_pyccel = epyccel(assemble_stiffness_2d)

#### Timing

In [14]:
# create the spline space for each direction
V1 = SplineSpace(degree=3, nelements=32)
V2 = SplineSpace(degree=3, nelements=32)

# create the tensor space
V = TensorSpace(V1, V2)

M = StencilMatrix(V.vector_space, V.vector_space)

%timeit assemble_stiffness_2d_pyccel( V1.nelements, V2.nelements, V1.degree, V2.degree, V1.spans, V2.spans, V1.basis, V2.basis, V1.weights, V2.weights, V1.points, V2.points, M._data )

7.33 ms ± 372 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
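
The timings reported in this section can be turned into speedup factors with a few lines of arithmetic. The numbers below are copied from the outputs above; they depend on the machine, and the 1D Numba measurement has a large spread, so the ratios are only indicative.

```python
# timings reported above, converted to seconds
timings = {
    "1D": {"python": 38e-3, "numba": 42.3e-6, "pyccel": 21.4e-6},
    "2D": {"python": 15.8, "numba": 9.1e-3, "pyccel": 7.33e-3},
}

# speedup of each accelerator with respect to pure Python
speedups = {
    case: {name: t["python"] / t[name] for name in ("numba", "pyccel")}
    for case, t in timings.items()
}

for case, s in speedups.items():
    print(f"{case}: Numba x{s['numba']:.0f}, Pyccel x{s['pyccel']:.0f}")
```

In both cases the accelerated versions are roughly three orders of magnitude faster than pure Python.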


## Improving the API

In [15]:
def assemble_matrix(core, V, out=None):
    if out is None:
        out = StencilMatrix(V.vector_space, V.vector_space)
        
    # ...
    args = []
    if isinstance(V, TensorSpace):
        args += list(V.nelements)
        args += list(V.degree)
        args += list(V.spans)
        args += list(V.basis)
        args += list(V.weights)
        args += list(V.points)
        
    else:
        args = [V.nelements, 
                V.degree, 
                V.spans, 
                V.basis, 
                V.weights, 
                V.points]
    # ...
        
    core( *args, out._data )
    
    return out

In what follows, we shall use the [partial](https://docs.python.org/3/library/functools.html#functools.partial) function from **functools**.
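
As a reminder of what `partial` does, here is a minimal sketch on a toy interface. The functions `assemble` and `core_ones` are hypothetical and use plain Python lists instead of simplines objects; `partial` freezes the first positional argument (the core) and returns a new callable that only expects the remaining ones.

```python
from functools import partial

def assemble(core, space, out=None):
    # hypothetical interface: 'core' is the kernel, 'space' is a plain dict here
    if out is None:
        out = [0.0] * space["n"]
    core(space["n"], out)
    return out


def core_ones(n, out):
    # toy kernel writing in place
    for i in range(n):
        out[i] += 1.0


# freeze the kernel: the result only needs the space
assemble_ones = partial(assemble, core_ones)
result = assemble_ones({"n": 3})
```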

In [16]:
from functools import partial

### 1D case

In [17]:
assemble_stiffness = partial(assemble_matrix, assemble_stiffness_1d_pyccel)

In [18]:
V = SplineSpace(degree=2, nelements=400)

stiffness = assemble_stiffness(V)

### 2D case

In [19]:
assemble_stiffness = partial(assemble_matrix, assemble_stiffness_2d_pyccel)

In [20]:
# create the spline space for each direction
V1 = SplineSpace(degree=2, nelements=64)
V2 = SplineSpace(degree=2, nelements=64)

# create the tensor space
V = TensorSpace(V1, V2)

stiffness = assemble_stiffness(V)

In **simplines** you will find the function **compile_kernel**. It uses the **partial** function, but you will need to give it the **arity** (2 for a matrix, 1 for a vector and 0 for a scalar).

## Exercises

1. Write the accelerated versions of the rhs assembly procedure using Pyccel and Numba, and measure their timings.
2. Write the API for the rhs assembly.
3. Perform a benchmark between Pyccel and Numba while varying the B-Spline degree.