# Python for High Performance Computing
# Interfacing to C and Fortran using OpenMP
<hr style="border: solid 4px green">
<br>
<center> <img src="images/arc_logo.png"; alt="Logo" style="float: center; width: 20%"></center>
<br>
## http://www.arc.ox.ac.uk
## support@arc.ox.ac.uk

## Python extensions and OpenMP
<hr style="border: solid 4px green">

### C and Fortran extensions
* one way to improve performance of critical functions
* results are fast compared to equivalent `numpy` implementations
* but are *serial* -- *single thread* of execution, using a *single core*
* this can be easily improved on using OpenMP (reasonable multicore performance is achievable with minimal effort)
<br><br>

### We shall be looking at
* how to use OpenMP to multithread Python extensions
* some factors that influence performance

## A quick overview of OpenMP
<hr style="border: solid 4px green">

### Introducing OpenMP
* a full introduction is beyond the scope of this presentation
  * online tutorials
  * dedicated ARC course
* however, the examples are enough as a starting point
<br><br>

### What is OpenMP?
* **Open** **M**ulti-**P**rocessing
* an API that supports multithreaded programming in C, C++, and Fortran
* primarily targets data parallelism (*e.g.* loops over arrays)
  * multiple threads work on separate parts of the data
  * execution is in parallel

## A quick overview of OpenMP (cont'd)
<hr style="border: solid 4px green">

### *Shared memory* programming model
* multiple (identical) CPUs linked to a single, unified main memory
* the program is a single process, which consists of independent threads of execution
* all threads can access the shared data stored in the main memory
<br><br>

### Explicit parallelism
* explicit (not automatic) programming model
* full programmer control over parallelisation
  * can be as simple as taking a serial program and inserting compiler directives
  * or as complex as using subroutines to set multiple levels of parallelism, locks, nested locks, etc.

## A quick overview of OpenMP (cont'd)
<hr style="border: solid 4px green">

### Thread-based parallelism
* parallelism achieved exclusively through the use of threads
* thread of execution = the smallest unit of processing that can be scheduled by an operating system
* threads exist within the resources of a single process
<br><br>

### Performance notes
* for performance, the number of threads should match the number of cores
* thread scheduling can be
  * left to the operating system
  * controlled using appropriate tools
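
A minimal sketch of how a Python driver script might find out how many cores it can use before choosing a thread count. The helper name `usable_core_count` is hypothetical, and the result counts logical CPUs, which can exceed the number of physical cores on machines with hyperthreading.

```python
import os

def usable_core_count():
    """Return the number of CPUs this process may run on (at least 1)."""
    # os.sched_getaffinity exists only on some platforms (e.g. Linux);
    # it respects CPU masks set by a batch scheduler or by taskset.
    if hasattr(os, "sched_getaffinity"):
        return max(1, len(os.sched_getaffinity(0)))
    # Fallback: total number of logical CPUs, or 1 if that is unknown.
    return os.cpu_count() or 1

print("suggested number of threads:", usable_core_count())
```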

## A quick overview of OpenMP (cont'd)
<hr style="border: solid 4px green">

### OpenMP multithreading 
* fork / join mechanism
  * start with master thread
  * fork into multiple threads (at the start of the parallel region)
  * each thread performs part of the *processing* on a part of the *data*
  * join threads into the master one (at the end of the parallel region)

<table>
  <tr>
    <th>Single thread execution</th>
    <th>Multithreaded execution (fork-join)</th>
  </tr>
  <tr>
    <th><img src="./images/tasks-serial.png";   style="float: center; width: 100%"></th>
    <th><img src="./images/tasks-parallel.png"; style="float: center; width: 100%"></th>
  </tr>
</table>

## A quick overview of OpenMP (cont'd)
<hr style="border: solid 4px green">

### API components
* compiler directives
* environment variables
* runtime libraries
<br><br>

### Compiler directives
* the bread-and-butter of OpenMP programming
* aimed at data parallelism in loops
* inserted in the source code as `#pragma` lines (C/C++) or special comments (Fortran)
* control
  * spawning a parallel region
  * distributing loop iterations between threads
  * synchronising work among threads
  * setting the number of threads
  * specifying how loop iterations are divided (scheduling)

## A quick overview of OpenMP (cont'd)
<hr style="border: solid 4px green">

### Environment
* setting the number of threads
* specifying how loop iterations are divided (scheduling)
* binding threads to processors
* etc.
<br><br>

### RTL routines
* setting (and querying) the number of threads
* querying a thread's unique identifier
* querying a thread's team size
* querying wall clock time and resolution
* etc.
<br><br>

> *Note*: there is overlap in functionality (*e.g.* setting the number of threads), which gives programming flexibility.

## OpenMP practical example
<hr style="border: solid 4px green">

### How to
* write code (step #1)
* compile it (step #2)
* run it (step #3)
<br><br>

### A simple example
* one for loop
* 2 arrays

## OpenMP practical example: step #1
<hr style="border: solid 4px green">

### Using *directives* to tell the compiler what and how to multithread
<br><br>

### Original C code
```c
for (i=0; i<N; i++) {
    y[i] = x[i]*x[i];
}
```
<br><br>

### Thread parallelised C code
```c
# pragma omp parallel default (none) shared (N, x,y) private (i)
{
# pragma omp for schedule (static)
for (i=0; i<N; i++) {
    y[i] = x[i]*x[i];
}
}
```
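
To make the effect of `schedule (static)` concrete, here is a small Python sketch of how the loop iterations might be split into one contiguous block per thread. It is an illustration only: the function `static_chunks` is hypothetical, and the exact split an OpenMP runtime uses is implementation-defined (the blocks are only required to be approximately equal in size).

```python
def static_chunks(n_iterations, n_threads):
    """Split range(n_iterations) into one contiguous block per thread."""
    base, extra = divmod(n_iterations, n_threads)
    chunks = []
    start = 0
    for tid in range(n_threads):
        # the first `extra` threads take one additional iteration
        size = base + (1 if tid < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

for tid, chunk in enumerate(static_chunks(10, 4)):
    print("thread", tid, "-> iterations", list(chunk))
```

Each thread then works only on its own block of `x` and `y`, which is why the loop needs no synchronisation between iterations.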

## OpenMP practical example: step #1 (cont'd)
<hr style="border: solid 4px green">

### Same for Fortran
<br><br>

### Fortran code
```fortran
do i=1, N
    y(i) = x(i)*x(i)
end do
```
<br><br>

### Thread parallelised Fortran code
```fortran
!$omp parallel default (none) shared (N, x,y) private (i)
!$omp do schedule (static)
do i=1, N
    y(i) = x(i)*x(i)
end do
!$omp end do
!$omp end parallel
```

## OpenMP practical example: step #2
<hr style="border: solid 4px green">

### Compilation
* the source code (containing compiler directives) is compiled
* an OpenMP support flag tells the compiler to act on the directives (without it, they are ignored and the code is built serial)
  * `gcc` and `gfortran` flag: `-fopenmp`
  * `icc` and `ifort` flag: `-qopenmp`
  * more at http://openmp.org/wp/openmp-compilers/

## OpenMP practical example: step #3
<hr style="border: solid 4px green">

### The executable
* run in the normal way
* there are extra controls for OpenMP
  * number of threads
  * scheduling (how loop iterations are divided)
  * *etc.*

## OpenMP practical example: summary
<hr style="border: solid 4px green">

### Standalone C code
```
$ gcc -o example -fopenmp example.c
$ export OMP_NUM_THREADS=8
$ ./example
```
<br><br>

### Options for controlling the number of threads
* environment variable `OMP_NUM_THREADS`
  * *e.g.* `export OMP_NUM_THREADS=16`
  * set in shell, before running the code
* run-time library function `omp_set_num_threads()`
* compiler directive `num_threads()`

> *Note*: `OMP_NUM_THREADS` has the widest scope and can be overridden by the other two.
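
A hedged sketch of how a Python driver might mirror this precedence when deciding which thread count to pass on to an extension: an explicit request wins, then `OMP_NUM_THREADS`, then the core count. The helper `resolve_num_threads` and its `environ` parameter are hypothetical and not part of OpenMP or of the course code.

```python
import os

def resolve_num_threads(requested=None, environ=None):
    """Pick a thread count: explicit request, else OMP_NUM_THREADS, else core count."""
    if environ is None:
        environ = os.environ
    if requested is not None:
        return max(1, int(requested))
    value = environ.get("OMP_NUM_THREADS", "")
    # OMP_NUM_THREADS may be a comma-separated list (one entry per nesting
    # level); only the first entry is used in this sketch.
    first = value.split(",")[0].strip()
    if first.isdigit() and int(first) > 0:
        return int(first)
    return os.cpu_count() or 1

print("threads to use:", resolve_num_threads())
```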

## Python extensions with OpenMP
<hr style="border: solid 4px green">

### Write the extensions
* simply follow the same guidelines as for serial extensions
* additionally, code is enhanced with compiler directives
* compilation uses the extra OpenMP support flags 
<br><br>

### Build the extensions: option #1
Use the same tools and guidelines as for serial extensions
* `gcc` and `f2py`
* add the `-fopenmp` flag
* *Pros*
  * complete control over the process
* *Cons*
  * most of the build process takes place outside Python
  * potential problems (depending on `NumPy` configuration)
    * *e.g.* linking to and loading the right OpenMP RT library

## Python extensions with OpenMP (cont'd)
<hr style="border: solid 4px green">

### Build the extensions: option #2
Using the `numpy.distutils` package
* support for building and installing modules
  * can be pure Python or C/Fortran extension modules
  * can be collections of Python packages which include modules
* *Pros*
  * the build process is well integrated with Python
  * links to the right OpenMP RT libraries
  * established standard procedure
* *Cons*
  * `distutils` is old and can be temperamental (`numpy.distutils` is deprecated in recent NumPy releases)
<br><br>

> *Notes*:
> * `setuptools` (includes `easy_install`) is a modern tool
> * options discussed at https://packaging.python.org/installing/

## Installing Python extensions via <span style="font-family: Courier New, Courier, monospace;">distutils</span>
<hr style="border: solid 4px green">

### The setup script
* the centre of all activity in building, distributing, and installing a module
* describes the module distribution to `distutils`
<br><br>

### Usage
* as listed by `python setup.py --help`
  * `python setup.py build`: build package under directory `build/`
  * `python setup.py install`: install the package at "standard" location or at location specified via `--prefix`
  * `python setup.py clean`: clean the build

## Installing Python extensions via <span style="font-family: Courier New, Courier, monospace;">distutils</span> (cont'd)
<hr style="border: solid 4px green">

### OpenMP extensions
* build C code
  * `setup.py` contains C flags
  * `python setup.py install`
* build Fortran code
  * Fortran flags passed on to `f2py` at command line
  * `python setup.py config_fc --f90flags="-O2 -fopenmp" install`
<br><br>

### Installing in current directory
* adding `--prefix=$PWD` (Linux, Mac OS) to `python setup.py install`
  * builds in `$PWD/build`
  * installs in `$PWD/lib/python2.7/site-packages`

## Using Python extensions installed via <span style="font-family: Courier New, Courier, monospace;">distutils</span>
<hr style="border: solid 4px green">

### The extensions (libraries) have to be on the search path
* Option #1: control paths from Python code
  * *e.g.* using `sys.path.append()`
* Option #2: use the shell environment
  * C (`ctypes`) extensions: update `LD_LIBRARY_PATH` for `ctypes.cdll.LoadLibrary()`
  * Fortran extensions: update `PYTHONPATH` for `import`
<br><br>
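
Option #1 can be sketched as below. The directory layout `<prefix>/lib/pythonX.Y/site-packages` is an assumption based on the `--prefix=$PWD` install used in this course and may differ on other systems (*e.g.* `lib64`); the helper name `add_local_site_packages` is hypothetical.

```python
import os
import sys

def add_local_site_packages(prefix):
    """Append <prefix>/lib/pythonX.Y/site-packages to sys.path and return it."""
    version_dir = "python%d.%d" % sys.version_info[:2]
    site_dir = os.path.join(prefix, "lib", version_dir, "site-packages")
    if site_dir not in sys.path:
        sys.path.append(site_dir)
    return site_dir

site_dir = add_local_site_packages(os.getcwd())
print("extension modules are also searched for in:", site_dir)
```

`sys.path` only affects `import`; for a `ctypes` library, the full path to the `.so` file can be passed to `ctypes.cdll.LoadLibrary()` instead.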

### The extensions are loaded in the normal way
* `lib = ctypes.cdll.LoadLibrary("c_lib.so")`
* `import fortran_lib`
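
For the `ctypes` route, a minimal sketch of a wrapper around a C function with the signature `array_sqrt (n, a_in, a_out, nt)` used in this course. The wrapper name `array_sqrt_c` is hypothetical, and plain `ctypes` arrays are used here (rather than NumPy arrays) to keep the sketch self-contained.

```python
import ctypes

def array_sqrt_c(lib, values, nt):
    """Call lib.array_sqrt on a list of floats using nt threads; return a list."""
    n = len(values)
    array_type = ctypes.c_double * n
    a_in = array_type(*values)      # copy the input into a C array
    a_out = array_type()            # zero-initialised output array
    # declare the C signature so ctypes converts the arguments correctly
    lib.array_sqrt.argtypes = [ctypes.c_int,
                               ctypes.POINTER(ctypes.c_double),
                               ctypes.POINTER(ctypes.c_double),
                               ctypes.c_int]
    lib.array_sqrt.restype = None   # the C function returns void
    lib.array_sqrt(n, a_in, a_out, nt)
    return list(a_out)
```

With the library built, a call might look like `array_sqrt_c(ctypes.cdll.LoadLibrary("c_array_sqrt_omp.so"), [1.0, 4.0, 9.0], 2)`, assuming the `.so` file can be found by the loader.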

## Example
<hr style="border: solid 4px green">

### Task
Compute the square root of the entries in an array (example used before).
<br><br>

### Steps
* code in directory `src/`
* inspect `setup.py`
* inspect source
  * `c_array_sqrt_omp.c`
  * `f_array_sqrt_omp.f90`
* install extensions
* test extensions using `test_array_sqrt_omp.py`
  * run the extension functions using 1, 2, 4, ... threads
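
The contents of `test_array_sqrt_omp.py` are not shown here; the following is only a sketch of what such a timing loop could look like. The helpers `time_with_threads` and `report` are hypothetical, and a stand-in workload replaces the real extension call.

```python
import time

def time_with_threads(func, thread_counts=(1, 2, 4)):
    """Call func(nt) for each thread count and return {nt: elapsed seconds}."""
    timings = {}
    for nt in thread_counts:
        start = time.perf_counter()   # wall-clock timer
        func(nt)
        timings[nt] = time.perf_counter() - start
    return timings

def report(label, timings):
    """Print the timings, one line per thread count."""
    print(" === %s" % label)
    for nt in sorted(timings):
        print(" %d threads, %f seconds" % (nt, timings[nt]))

# Stand-in workload; a real test would call the extension with nt threads.
report("dummy workload", time_with_threads(lambda nt: sum(range(10000))))
```

Wall-clock time is the relevant measure for threaded code, since CPU time adds up the work done by all threads.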

## Example: inspect <span style="font-family: Courier New, Courier, monospace;">setup.py</span>
<hr style="border: solid 4px green">

```python
# setup.py
#
# purpose: setup file to install the compiled-language python libraries
# usage:   python setup.py config_fc --f90flags="-O2 -fopenmp" install --prefix=$PWD
#

from numpy.distutils.core import Extension

# one flag per list element, so that each reaches the compiler as a separate argument
c_array_sqrt = Extension (name = "c_array_sqrt_omp",
                          sources = ["./src/c_array_sqrt_omp.c"],
                          extra_compile_args = ["-O2", "-ffast-math", "-std=c99", "-fopenmp"],
                          extra_link_args = ["-lgomp"])

f_array_sqrt = Extension (name = "f_array_sqrt_omp",
                          sources = ["./src/f_array_sqrt_omp.f90"],
                          extra_compile_args = ["-O2", "-ffast-math", "-fopenmp"],
                          extra_link_args = ["-lgomp"])

if __name__ == "__main__":
    from numpy.distutils.core import setup
    setup ( name = "array-sqrt-openmp",
            description  = "Illustration of Python extensions using OpenMP",
            author       = "Mihai Duta",
            author_email = "mihai.duta@it.ox.ac.uk",
            ext_modules  = [c_array_sqrt, f_array_sqrt]
          )

# end
```
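
`extra_compile_args` is conventionally a list with one command-line argument per element. If the flags are kept as a single string, a small sketch like the one below can split them; whether an unsplit string is accepted depends on how the build tool launches the compiler, so treat this as a precaution rather than a requirement.

```python
import shlex

# flags as they might be written on a command line
flag_string = "-O2 -ffast-math -std=c99 -fopenmp"

# split into one list element per flag, honouring shell quoting rules
compile_args = shlex.split(flag_string)
print(compile_args)
```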




## Example: inspect C source
<hr style="border: solid 4px green">

* make use of `omp_set_num_threads()` to set the number of threads in the parallel region
* this makes threading easier to control from the test script

```c
// src/c_array_sqrt_omp.c
# include <math.h>
# ifdef _OPENMP
# include <omp.h>
# endif

void array_sqrt (const int n,
                 double *restrict a_in,
                 double *restrict a_out,
                 const int nt)
{

  int i;

# ifdef _OPENMP
  // set the number of threads to input nt
  omp_set_num_threads(nt);
  // schedule a parallel loop
  # pragma omp parallel for \
    default (none)          \
    shared (a_in,a_out)     \
    firstprivate (n)        \
    private (i)
# endif
  for (i = 0; i < n; i++) {
    a_out[i] = sqrt (a_in[i]);
  }
}
```


## Example: inspect Fortran source
<hr style="border: solid 4px green">

* make use of `omp_set_num_threads()` to set the number of threads in the parallel region
* this makes threading easier to control from the test script

```fortran
! src/f_array_sqrt_omp.f90
subroutine array_sqrt (n, a_in, a_out, nt)
  use omp_lib

  implicit none
  integer, intent(in) :: n
  real(kind=8), dimension(n), intent(in)  :: a_in
  real(kind=8), dimension(n), intent(out) :: a_out
  integer, intent(in) :: nt

  integer :: i

  !! set the number of threads to input nt
  call omp_set_num_threads (nt)

  !! schedule a parallel loop
  !$omp parallel do default(none) shared(a_in,a_out,n) private(i)
  do i = 1, n
     a_out(i) = sqrt (a_in(i))
  end do
  !$omp end parallel do

  return

end subroutine array_sqrt
```



## Example: build extensions
<hr style="border: solid 4px green">

```
$ python setup.py config_fc --f90flags="-O2 -fopenmp" install --prefix=$PWD
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running install
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running build_src
build_src
building extension "c_array_sqrt_omp" sources
building extension "f_array_sqrt_omp" sources
f2py options: []
f2py:> build/src.linux-x86_64-2.7/f_array_sqrt_ompmodule.c
creating build
creating build/src.linux-x86_64-2.7
Reading fortran codes...
	Reading file './src/f_array_sqrt_omp.f90' (format:free)
Post-processing...
	Block: f_array_sqrt_omp
			Block: array_sqrt
In: :f_array_sqrt_omp:./src/f_array_sqrt_omp.f90:array_sqrt
get_useparameters: no module omp_lib info used by array_sqrt
Post-processing (stage 2)...
Building modules...
	Building module "f_array_sqrt_omp"...
		Constructing wrapper function "arr
```

Check that the dynamic libraries were created.

```
$ ls -l ./lib/python2.7/site-packages
total 112
-rw-rw-r-- 1 ouit0554 ouit0554   248 Mar  6 17:55 array_sqrt_openmp-0.0.0-py2.7.egg-info
-rwxrwxr-x 1 ouit0554 ouit0554 11176 Mar  6 17:55 c_array_sqrt_omp.so
-rwxrwxr-x 1 ouit0554 ouit0554 96328 Mar  6 17:55 f_array_sqrt_omp.so
```


## Example: run test
<hr style="border: solid 4px green">

```
$ python test_array_sqrt_omp.py
 === C extensions
 1 threads, 0.695964 seconds
 2 threads, 0.365554 seconds
 4 threads, 0.383612 seconds
 === F90 extensions
 1 threads, 0.637081 seconds
 2 threads, 0.322948 seconds
 4 threads, 0.445562 seconds
```
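
To read these timings as scaling figures, the sketch below computes speedup (one-thread time divided by the time on `nt` threads) and parallel efficiency (speedup divided by `nt`). The helper `speedup_table` is hypothetical; the numbers are copied from the run above.

```python
def speedup_table(timings):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread time."""
    t1 = timings[1]
    return {nt: (t1 / t, t1 / (t * nt)) for nt, t in timings.items()}

# timings (in seconds) reported by the test run
c_timings = {1: 0.695964, 2: 0.365554, 4: 0.383612}
f_timings = {1: 0.637081, 2: 0.322948, 4: 0.445562}

for label, timings in (("C", c_timings), ("F90", f_timings)):
    for nt, (s, e) in sorted(speedup_table(timings).items()):
        print("%s: %d threads, speedup %.2f, efficiency %.0f%%" % (label, nt, s, 100 * e))
```

In this run the speedup is close to 2 on two threads, while four threads bring no further gain.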


## Summary
<hr style="border: solid 4px green">

### OpenMP
* easy to use
* *pros*
  * very portable (supported by most major compilers)
  * easy incremental parallel code development
  * achieving reasonably good performance is easy
* *cons*
  * achieving very good scaling is not trivial
  * easy to make mistakes and trigger elusive bugs

<img src="../../images/reusematerial.png"; style="float: center; width: 90"; >
<br>
<br>