# Python for High Performance Computing
# Parallel Programming
<hr style="border: solid 4px green">
<br>
<center> <img src="images/arc_logo.png" alt="Logo" style="float: center; width: 20%"></center>
<br>
## http://www.arc.ox.ac.uk
## support@arc.ox.ac.uk

## Bad news and good news
<hr style="border: solid 4px green; ">

### Bad news
* serial software no longer runs faster on newer hardware than on old
* *parallel processing* has been the key to increased performance for the past decade
<br><br>

### Good news
* there are relatively easy ways to do parallel programming in Python

##  Parallel processing: motivation
<hr style="border: solid 4px green; ">

###  Traditional motivation
* make large simulations *possible* -- spread the memory load by running on several hosts
* make large simulations *faster* -- spread the computational load (more CPUs doing a fixed amount of work)
<br><br>

### Multicore design
* all systems (servers, desktops, smart phones) are *parallel*

##  Parallel processing: motivation (cont'd)
<hr style="border: solid 4px green; ">

###  Moore's law:
* the number of transistors on a CPU doubles roughly every 2 years
<br><br>

### Speed for free: CPU design until circa 2005
* the doubling in the number of transistors has correlated well with a doubling in speed owing to
  * clock speed increase and
  * instruction level parallelism
* for over 40 years, Moore's law has practically meant a doubling in CPU performance every 2 years
<br><br>

### Cores for free: CPU design after 2005
* reached a power wall (heat generated by the chip cannot be dissipated fast enough)
* froze clock speed at just under 3GHz
* has gone *multicore*

<img src="./images/cpu_clockspeed.jpg" style="float: center; width: 60%">

## Parallel processing: multicore systems
<hr style="border: solid 4px green; ">

### CPUs are not getting faster anymore, they are just getting more cores
<br><br>

### The promise: all cores can be harnessed for processing
* result: overall performance is proportional to the number of cores
<br><br>

### The reality: performance needs careful programming
* cores are *fast*, access to data (memory, network, files) is *slow*
* programming needs some care to extract performance

## Memory access
<hr style="border: solid 4px green; ">

### Memory access is a "traditional" limitation
* the bus interconnect between CPU and main memory is a bandwidth limitation
  * *e.g.* a 4-core Intel Core i7 + DDR3 RAM
    * memory bandwidth 25.6GB/s max (2 memory channels)
    * peak theoretical speed (single precision): 100 -- 190 GFlops (depending on frequency)
* the main memory has large latency
<br><br>

### CPUs are starved of data
* the bus is slower to "feed" data than the rate at which the CPU can "consume"
* theoretical peak performance is rarely reached in practice
  * the HPL (Linpack) benchmark achieves 90% of peak because it is designed to
  * real world applications do not come this close
* caching alleviates the problem (up to a point)
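
The mismatch can be put in numbers using the example figures quoted earlier (25.6GB/s, 100 -- 190 GFlops). This is a back-of-the-envelope sketch, not a measurement: the function name `flops_per_float_needed` is made up for illustration, and it assumes 4-byte single precision values streamed from main memory with no cache reuse.

```python
def flops_per_float_needed(peak_gflops, bandwidth_gb_per_s, bytes_per_value=4):
    """Operations the CPU must perform per value loaded from memory to stay busy.

    Hypothetical helper: assumes every value comes from main memory
    (no cache reuse) and that the bus delivers its maximum bandwidth.
    """
    values_per_second = bandwidth_gb_per_s * 1e9 / bytes_per_value
    return peak_gflops * 1e9 / values_per_second


for peak in (100.0, 190.0):
    ratio = flops_per_float_needed(peak, 25.6)
    print(f"peak {peak:.0f} GFlops: about {ratio:.1f} flops needed per float loaded")
```

Under these assumptions, a code that performs fewer than roughly 16 to 30 floating point operations per float fetched from main memory is limited by the bus, not by the CPU.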

## Memory access (cont'd)
<hr style="border: solid 4px green; ">

### Cache = *fast* (but *small*) memory that sits between CPU and main memory
* all program data is stored in main memory (*slow* but *large*)
* often used data is stored temporarily in cache (*fast* but *small*)
<br><br>

### Cached reads (easiest to understand)
* a CPU memory read from main memory fetches a whole *cache line*
  * data requested by current instruction plus
  * data *likely* to be requested by next instructions
* assumptions about data already requested
  * is physically stored near data to be requested in future (*spatial locality*)
  * is soon going to be requested again (*temporal locality*)

## Memory access (cont'd)
<hr style="border: solid 4px green; ">

### Cached reads (1 cache level)
* when data is requested (read) from memory, the cache line containing the data (plus adjacent data) is transferred to cache
* subsequent reads will look in cache first
  * *cache hit* -- data already in the cache line stored in cache (fast memory access)
  * *cache miss* -- data not cached, send request for another cache line from main memory (slow memory access)
* the cache line size on the x86 architecture is 64 bytes (16 integers/floats or 8 doubles)
  * *e.g.*: if `x[20]` is requested (`float *x`), the whole cache line containing `x[16]` through to `x[31]` is transferred (assuming `x` starts on a cache line boundary)
<br><br>
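
The `x[20]` example can be checked with a few lines of arithmetic. A minimal sketch, assuming the array starts exactly on a cache line boundary; the helper name `cache_line_index_range` is hypothetical.

```python
CACHE_LINE_BYTES = 64  # x86 cache line size


def cache_line_index_range(index, itemsize):
    """Return (first, last) array indices that share a cache line with x[index].

    Assumption: x[0] sits at the start of a cache line.
    """
    per_line = CACHE_LINE_BYTES // itemsize    # elements per cache line
    first = (index // per_line) * per_line     # round down to the start of the line
    return first, first + per_line - 1


print(cache_line_index_range(20, 4))  # 4-byte floats
print(cache_line_index_range(20, 8))  # 8-byte doubles
```

With 4-byte floats the result is the range `x[16]` to `x[31]`; with 8-byte doubles the same index shares its line with only 8 elements.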

### Cache writes are similar
* https://en.wikipedia.org/wiki/Cache_(computing) is instructive
<br><br>

### Caching is transparent to applications
* applications do not "see" the cache levels (only memory access)
* but they "feel" it (good or bad performance)

## Memory access (cont'd)
<hr style="border: solid 4px green; ">

### NUMA architecture further complicates things
<br><br>

### **N**on-**U**niform **M**emory **A**ccess architecture
* modern systems have multiple multicore CPUs
* each CPU is attached (via a bus) to a *part* of the global memory
* CPU plus its part of memory = NUMA node
* *all* memory is seen from *anywhere* but "far" memory is accessed more slowly than "near" memory
<br><br>

<table>
  <tr>
    <th>Uniform Access</th>
    <th>NUMA (modern systems)</th>
  </tr>
  <tr>
    <th><img src="./images/numa-node-1.png" style="float: center; width: 100%"></th>
    <th><img src="./images/numa-node-4.png" style="float: center; width: 100%"></th>
  </tr>
</table>

## Memory access (cont'd)
<hr style="border: solid 4px green; ">

### The lesson
* computation is cheap (almost free)
* memory access is costly
<br><br>

### What we can do
* memory access cannot be avoided but
* application performance can be optimised through
  * good programming
  * careful run configuration
<br><br>

### Programming: respect data locality
* single core: *minimise cache misses*
  * *e.g.* traverse a matrix in the order it is stored in memory (row-wise in Python and C)
* multiple cores: *minimise remote data access*
  * *e.g.* initialise data using the same multithreaded scheduling used to process data (*i.e.* respect first touch policy in memory initialisation)
<br><br>
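
Why row-wise traversal minimises cache misses in a row-major layout can be seen from the memory offsets each loop order visits. A small sketch using only the standard library; the function names are made up for illustration and the matrix is assumed to be stored row by row in one flat block (the C convention, also the NumPy default).

```python
def row_major_offset(i, j, ncols):
    """Position of element (i, j) in the flat memory of a row-major matrix."""
    return i * ncols + j


def traversal_offsets(nrows, ncols, order):
    """Flat memory offsets in the order a nested loop visits them.

    order="row": inner loop runs over columns j (row-wise traversal)
    order="col": inner loop runs over rows i (column-wise traversal)
    """
    if order == "row":
        return [row_major_offset(i, j, ncols)
                for i in range(nrows) for j in range(ncols)]
    if order == "col":
        return [row_major_offset(i, j, ncols)
                for j in range(ncols) for i in range(nrows)]
    raise ValueError("order must be 'row' or 'col'")


print(traversal_offsets(2, 4, "row"))  # consecutive offsets
print(traversal_offsets(2, 4, "col"))  # offsets ncols apart within each column
```

Row-wise traversal walks through memory one element at a time, so every element of a fetched cache line is used. Column-wise traversal jumps `ncols` elements between accesses, and for wide matrices each access can land on a different cache line.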

### Running
* avoid process and thread migration
  * *e.g.* process and/or thread pinning
<br><br>

> *Note*: network is another parallel performance factor but only affects distributed computing.

## Hardware options for parallel programming
<hr style="border: solid 4px green; ">

### Specialised hardware: clusters
* collection of individual hosts that work together in a tightly connected fashion
  * hosts are connected via a fast network, exploitable via appropriate programming
  * the entire cluster (or a portion of it) works as a single system
* *pros*
  * the key to large simulations
  * parallel solutions can be scaled to a large number of CPUs
* *cons*
  * more difficult to program than a single machine (or host)
  * achieving good scaling is not easy
  * Python does not fit well and solutions (*e.g.* `mpi4py`) are acceptable at best

## Hardware options for parallel programming (cont'd)
<hr style="border: solid 4px green; ">

### Commodity hardware: laptops, workstations, servers, ...
* multicore CPU (or CPUs)
* shared common memory
* *pros*
  * fits well with general Python programmability
  * relatively easy programming solutions
* *cons*
  * memory and parallel processing are both limited by what is available on a single system

## Programming models
<hr style="border: solid 4px green">

### In order to run large simulations
* exploit the multiple cores available in a single host and/or
* expand parallel processing beyond a single host and run in a distributed fashion on multiple hosts
<br><br>

### Two memory models for parallel programming
* *shared memory*
* *distributed memory*

## Programming models: distributed memory
<hr style="border: solid 4px green">

### The programming model
* multi-core system in which each core has its own private memory
* local core memory is invisible to all other cores
* agent of parallelism: the *process* (the program = a collection of processes)
* exchanging information between processes requires *explicit* message passing
* the dominant programming standard: *MPI*
<br><br>

### Hardware
* clusters
* single host
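
The model can be illustrated on a single host with nothing but the standard library. The sketch below is a stand-in for the idea, not MPI: each worker is a separate Python process with private memory, and the only way data moves is through explicit messages (JSON text sent over pipes). The names `WORKER_SOURCE` and `distributed_sum_of_squares` are made up for illustration.

```python
import json
import subprocess
import sys

# Program run by every worker process: receive one message, compute on
# private data, send one message back.
WORKER_SOURCE = """
import json, sys
chunk = json.loads(sys.stdin.read())
print(json.dumps(sum(x * x for x in chunk)))
"""


def distributed_sum_of_squares(data, nworkers=2):
    """Sum of squares computed by nworkers separate processes."""
    chunks = [data[i::nworkers] for i in range(nworkers)]
    workers = [
        subprocess.Popen([sys.executable, "-c", WORKER_SOURCE],
                         stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
        for _ in chunks
    ]
    # Explicit "send": each worker only ever sees the chunk it is sent
    for worker, chunk in zip(workers, chunks):
        worker.stdin.write(json.dumps(chunk))
        worker.stdin.close()
    # Explicit "receive": collect one partial result per worker
    partial = []
    for worker in workers:
        partial.append(json.loads(worker.stdout.read()))
        worker.stdout.close()
        if worker.wait() != 0:
            raise RuntimeError("worker process failed")
    return sum(partial)


print(distributed_sum_of_squares(list(range(10))))
```

With MPI the same pattern (split, send, compute, receive, combine) is expressed through the library's send/receive and collective operations, and the processes can sit on different hosts.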

## Programming models: distributed memory (cont'd)
<hr style="border: solid 4px green">

<img src="./images/prog-model-distrib.png" style="float: center; width: 80%">

## Programming models: shared memory
<hr style="border: solid 4px green">

### Shared Memory Programming Model
* multi-core system
* each core has access to a shared memory space
* agent of parallelism: the *thread* (program = collection of threads)
* threads exchange information *implicitly* by reading/writing shared variables
* the dominant programming standard: *OpenMP*
<br><br>

### Hardware
* single host
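
A minimal sketch of implicit sharing, using the standard `threading` module rather than OpenMP. All threads read the same input list and write into the same output list without any message being sent; the function name `threaded_sum_of_squares` is made up for illustration. It shows the sharing model only and makes no claim about speed-up.

```python
import threading


def threaded_sum_of_squares(data, nthreads=2):
    """Sum of squares computed by nthreads threads sharing one address space."""
    partial = [0] * nthreads    # shared: every thread sees this same list

    def work(tid):
        # Read the shared input, write this thread's own slot of the shared
        # output; no two threads write the same slot, so no lock is needed
        partial[tid] = sum(x * x for x in data[tid::nthreads])

    threads = [threading.Thread(target=work, args=(tid,)) for tid in range(nthreads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()           # wait until every slot has been written
    return sum(partial)


print(threaded_sum_of_squares(list(range(10))))
```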

## Programming models: shared memory (cont'd)
<hr style="border: solid 4px green">

<img src="./images/prog-model-shared.png" style="float: center; width: 80%">

## Python solutions
<hr style="border: solid 4px green">

### Five solutions
* single system processing
  * C/Fortran extensions using OpenMP (data parallelism)
  * `numba` (data parallelism)
  * `cython` (data parallelism)
  * `multiprocessing` (task parallelism)
* distributed processing
  * `mpi4py`
<br><br>

### Alternatives
* shared memory processing
  * PyCuda, PyOpenCL
  * OpenACC
  * OpenMP 4 (capability to offload to GPU)
* distributed memory processing
  * ParallelIPython
  * IPython cluster
  * Apache Spark
  * pypar, pyMPI

## Nomenclature
<hr style="border: solid 4px green; ">

### Distinction between process and processor
* *processor* = a physical piece of hardware
* *process* = an instance of a computer program (software)
  * essentially it has two components: instructions to execute and associated data
  * in parallel programming, we often have multiple instances (processes) of the same program

## Nomenclature (cont'd)
<hr style="border: solid 4px green; ">

### Distinction between process and thread
* *thread* = the smallest unit of processing the operating system can handle (*i.e.* to which it allocates processor time)
* a process always consists of one or more *threads* of execution
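
The relation between the two can be observed directly: threads started by a program report the same process ID but different thread IDs. A short sketch with the standard library; the function name `observe_threads` is made up for illustration.

```python
import os
import threading


def observe_threads(nthreads=3):
    """Start nthreads threads; return the (process id, thread id) seen by each."""
    seen = []
    # The barrier keeps all threads alive until each one has recorded its
    # identifiers, so no thread id can be reused by a later thread
    barrier = threading.Barrier(nthreads)

    def work():
        seen.append((os.getpid(), threading.get_ident()))
        barrier.wait()

    threads = [threading.Thread(target=work) for _ in range(nthreads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return seen


for pid, tid in observe_threads():
    print(f"process {pid}, thread {tid}")
```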

## Nomenclature (cont'd)
<hr style="border: solid 4px green; ">

### Distinction between process and thread
* *implicit sharing*
  * threads share the memory and state of the parent process
  * processes share nothing
* *explicit communication*
  * processes use inter-process communication to share data
  * threads share data implicitly

<img src="../../images/reusematerial.png" style="float: center; width: 90">
<br>
<br>