# Code parallelisation

## How can we make our code run faster?

1. Optimize the code.


2. Move computationally demanding parts of the code from an interpreted language (Python, Ruby, etc.) to a compiled language (C/C++, Julia, Rust, etc.).


3. Use better theoretical methods that require less computation for the same accuracy.


4. Use **parallelization**: split the computational work among multiple processing units that work simultaneously.

## Processing units:

The **processing units** might include central processing units (CPUs), graphics processing units (GPUs), vector processing units (VPUs), or something similar.

- Multiple processing units can complete a calculation faster than a single processing unit.


- For example, if a calculation takes 1 hour to run using one CPU, it might be possible to parallelize the work of the calculation across two CPUs and run it in only 30 minutes.


- Note that parallelization cannot reduce the total amount of computational work required to run a calculation; in fact, it generally introduces additional work associated with communication and coordination between the processing units.


- In general, if a calculation takes **t** hours to run in serial (that is, on a single processing unit), it will take at least **t/n** hours to run on **n** processing units. This principle is more formally expressed through **Amdahl’s law** (https://en.wikipedia.org/wiki/Amdahl%27s_law). A primary goal of parallelization is to ensure that the actual parallelized runtimes are as close to the ideal runtime of **t/n** as possible.
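
To make the **t/n** bound concrete, here is a minimal sketch that evaluates Amdahl's law in Python. The function names `amdahl_speedup` and `parallel_runtime`, and the parameter `parallel_fraction` (the share of the serial runtime that can be parallelized), are illustrative assumptions and do not come from the original lesson.

```python
def amdahl_speedup(parallel_fraction, n):
    """Ideal speedup on n processing units according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial part of the work does not get faster with more units.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)


def parallel_runtime(t_serial, parallel_fraction, n):
    """Best-case runtime on n units, in the same time unit as t_serial."""
    return t_serial / amdahl_speedup(parallel_fraction, n)


if __name__ == "__main__":
    # Hypothetical job: 60 minutes in serial, 90% of it parallelizable.
    for n in (1, 2, 4, 8):
        minutes = parallel_runtime(60.0, 0.9, n)
        print(f"n = {n}: about {minutes:.1f} minutes")
```

If the whole calculation can be parallelized (`parallel_fraction = 1.0`), two processing units give the ideal 30 minutes from the example above. If only 90% can be parallelized, two units give about 33 minutes, and no number of units can push the runtime below the 6 minutes of serial work.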

## What is High-Performance Computing?

The field of high-performance computing (HPC) takes this concept of parallelization to its logical limit:

- Rather than parallelizing over just a handful of processing units, such as those you might find on a typical desktop or laptop computer, HPC applications involve the use of supercomputers that might consist of many thousands of processing units.


- Parallelizing code at such scales is often very difficult and requires careful coordination of the work across the processing units.

#### Reference:
https://education.molssi.org/parallel-programming/01-introduction/index.html

### Types of Parallelization

#### 1. Distributed-memory parallelization.

In this approach, multiple instances of the same executable, called “processes,” are run simultaneously, with each process being run on a different core. Each process has its own independent copy of any information required to run the simulation; in other words, if you run a calculation with 2 processes, it might require twice as much memory as a calculation run with a single process (although there are some things you can do to mitigate this).
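
As a rough illustration of the idea, the sketch below splits a sum across several processes using Python's standard `multiprocessing` module. Each process works on its own copy of its chunk and sends its partial result back to the parent as an explicit message. The names `partial_sum` and `parallel_sum` are hypothetical, and the `"fork"` start method is an assumption that limits this sketch to Unix-like systems. Real HPC codes typically use a message-passing library such as MPI for this purpose.

```python
import multiprocessing as mp


def partial_sum(chunk, conn):
    """Runs in a child process: sum a private chunk and send the result back."""
    conn.send(sum(chunk))
    conn.close()


def parallel_sum(values, n_procs=2):
    """Sum `values` by splitting the work across `n_procs` processes."""
    # Assumption: a Unix-like OS where the "fork" start method is available.
    ctx = mp.get_context("fork")
    # Strided split: process i gets elements i, i + n_procs, i + 2*n_procs, ...
    chunks = [values[i::n_procs] for i in range(n_procs)]

    procs, receivers = [], []
    for chunk in chunks:
        # One-way pipe: the parent receives, the child sends.
        recv_end, send_end = ctx.Pipe(duplex=False)
        proc = ctx.Process(target=partial_sum, args=(chunk, send_end))
        proc.start()
        procs.append(proc)
        receivers.append(recv_end)

    # No memory is shared: results arrive only as messages from each process.
    total = sum(conn.recv() for conn in receivers)
    for proc in procs:
        proc.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(100)), n_procs=4))
```

Because nothing is shared, every piece of data a process needs must be copied or sent to it, which is where the extra memory and communication costs come from.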


#### 2. Shared-memory parallelization.

In this approach, multiple “threads” are run using a single, shared memory allocation in RAM. This has the advantage of being more memory efficient than distributed-memory parallelism, but because all threads must have access to the same memory, it cannot be used for inter-node parallelization.
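
The memory model can be sketched with Python's standard `threading` module: several threads write into different parts of one list that they all share, so no data is copied or sent. The names `fill_squares` and `threaded_squares` are hypothetical. Note that in the standard CPython interpreter the global interpreter lock generally prevents threads from speeding up CPU-bound pure-Python code, so this example illustrates the shared-memory model rather than a real speedup.

```python
import threading


def fill_squares(shared, start, stop):
    """Write i*i into shared[i] for start <= i < stop."""
    for i in range(start, stop):
        shared[i] = i * i


def threaded_squares(n, n_threads=4):
    """Compute the squares of 0..n-1 using threads that share one list."""
    shared = [0] * n  # a single allocation, visible to every thread
    step = (n + n_threads - 1) // n_threads  # ceiling division

    threads = []
    for t in range(n_threads):
        start = t * step
        stop = min(start + step, n)
        # Each thread gets a disjoint index range, so no lock is needed here.
        thread = threading.Thread(target=fill_squares, args=(shared, start, stop))
        thread.start()
        threads.append(thread)

    for thread in threads:
        thread.join()
    return shared


if __name__ == "__main__":
    print(threaded_squares(10, n_threads=3))
```

Giving each thread its own index range avoids two threads writing to the same element. When threads do need to update the same data, some form of synchronization (for example a lock) is required.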


#### 3. Vectorization.

Vectorization takes advantage of the fact that modern CPU cores support “single instruction, multiple data” (SIMD) operations, which means that they can perform the same operation on multiple operands simultaneously. For example, if you want to multiply each element of a vector by a scalar, SIMD can enable individual cores to perform this multiplication on multiple elements of the vector simultaneously. Vectorization is largely handled by the compiler (although there are various ways you can influence when and how the compiler vectorizes code) and is not covered in these lessons.

#### 4. Heterogeneous computing (e.g. GPUs, FPGAs).

One of the big trends in HPC in recent years is the use of GPU acceleration, in which parts of a calculation are run on a CPU while other parts of the calculation are run on a GPU.


The above forms of parallelization are not mutually exclusive. Codes can benefit from vectorization and GPU-acceleration in addition to other parallelization techniques. Notably, one popular approach to parallelization is to use shared-memory parallelization to handle intra-node parallelization, while using a distributed-memory parallelization technique to handle inter-node parallelization.