<center><img src=img/MScAI_brand.png width=70%></center>

# High-Performance Computing

This notebook introduces the basic terminology and concepts of high-end/high-performance computing (HPC). If you already know the topics below, you can skip it!
* Threads versus processes
* Distributed versus parallel computing
* GPUs
* Taskfarming
* Profilers

### Python and HPC
Python is slow, but it's still used for (some) HPC! Why?

* NumPy can move many inner loops to C or Fortran
* Python code can be compiled, e.g. just-in-time with Numba or PyPy, or ahead-of-time with Cython
* Deep learning runs on GPUs, and Python has libraries that drive them
* Python is slower than C only by a (multiplicative) constant. Getting the right algorithm remains more important.
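
As a minimal sketch of the first point, the two functions below compute the same sum of squares. The function names and the test data are made up for illustration, and the example assumes NumPy is installed. In the first version the Python interpreter executes every iteration; in the second, the loop over elements happens inside NumPy's compiled code.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python version: the interpreter executes every iteration."""
    total = 0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(array):
    """NumPy version: the loop over elements runs inside compiled code."""
    return int(np.dot(array, array))


# Illustrative data: the integers 0..999
data = np.arange(1000, dtype=np.int64)

loop_result = sum_of_squares_loop(data.tolist())
numpy_result = sum_of_squares_numpy(data)

# Both approaches give the same answer; only the speed differs.
print(loop_result == numpy_result)
```

On large arrays the NumPy version is usually much faster, but the exact ratio depends on the machine, so it is worth timing both (e.g. with `%timeit` in a notebook).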

### GPUs

<center><img src=img/GPU-images.png width=60%></center>


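A **graphics processing unit** (GPU) is a specialised processor originally designed for graphics, especially video games. Graphics work consists largely of the same arithmetic applied to large arrays of numbers, so a GPU is also very good at processing large numerical matrices. Many AI and numerical computing problems run very effectively on GPUs!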



<center><img src=img/nvidia-fission-GPU.png width=60%></center>
    
* Deep learning is almost always run on GPUs, except for toy problems
* Cryptocurrency <a href=https://www.sciencedaily.com/releases/2019/06/190613104533.htm>:-(</a>
* Weather simulations, nuclear simulations, and more.


### Distributed versus parallel computing


Parallel computing: running multiple things at once on a single machine, especially one with multiple CPUs.

Distributed computing: running one thing across multiple machines, usually with message-passing between them.


### Parallel computing 

<center><img src=img/serial-parallel-processing.png width=60%></center>
             
<font size=1>https://www.explainthatstuff.com/how-supercomputers-work.html</font>

### Parallel computing with threads and processes

Threads and processes are two similar concepts. They both involve multiple parts of a program running in parallel (on a single machine).

* Every program, when it runs, becomes a **process**
* A program may **fork** to become multiple processes
* A program may also create multiple **threads**

Threads are more "lightweight": they are quicker to start and stop than processes, and the threads of one process share its memory, whereas each process has its own memory.

If we have multiple CPUs, then multiple processes or multiple threads can take advantage of that.

Even if we have a single CPU, multiple processes or threads can be useful. (They don't really run in **parallel**, but the OS gives that illusion by interleaving their work, and it can still help a lot!)
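
Here is a small sketch of why interleaving helps even without true parallelism. The `fake_download` function is a made-up stand-in for an I/O-bound task (it just sleeps), and the task names are arbitrary. While one thread is waiting, the others can make progress, so four waits of 0.2 s overlap instead of adding up.

```python
import threading
import time


def fake_download(name, results):
    """Stand-in for an I/O-bound task: sleeps instead of using the network."""
    time.sleep(0.2)
    results[name] = "done"


names = ["a", "b", "c", "d"]
results = {}

start = time.perf_counter()
threads = [
    threading.Thread(target=fake_download, args=(name, results))
    for name in names
]
for t in threads:
    t.start()   # all four threads begin waiting at roughly the same time
for t in threads:
    t.join()    # block until every thread has finished
elapsed = time.perf_counter() - start

print(f"4 tasks of 0.2 s each took {elapsed:.2f} s in total")
```

Note that in standard CPython builds the global interpreter lock (GIL) stops threads from executing Python bytecode on several cores at once. Threads therefore help most with waiting (I/O), while CPU-bound pure-Python work usually needs multiple processes to use multiple cores.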

### Examples
* A program with a GUI typically has at least a back-end process or thread and a GUI process or thread. Even when the back end is busy processing, the GUI remains responsive.

* A web server typically launches a new thread for every incoming request.

* Some optimisation algorithms can run multiple processes or threads in parallel to take advantage of multiple CPUs, e.g. scikit-learn's `GridSearchCV` has a parameter `n_jobs`.

Communication between threads and/or processes is a complex topic and causes many bugs :-/
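
One classic example of such a bug is a race condition on shared data. The sketch below uses a hypothetical `SafeCounter` class to show the usual remedy: `self.value += 1` is a read-modify-write sequence, so two threads can interleave and lose an update unless a lock ensures that only one thread performs it at a time.

```python
import threading


class SafeCounter:
    """A counter that several threads can increment without losing updates."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may execute the read-modify-write below.
        with self._lock:
            self.value += 1


def add_many(counter, n):
    """Increment the shared counter n times."""
    for _ in range(n):
        counter.increment()


counter = SafeCounter()
threads = [
    threading.Thread(target=add_many, args=(counter, 10_000))
    for _ in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter.value)  # 4 threads x 10,000 increments each
```

Without the lock the final value may sometimes come out too low, and whether that happens can vary from run to run, which is part of what makes these bugs hard to find.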

### Using multiple cores -- the hard way

Take our program, and rewrite it to use multiple processes and/or threads with communication between them. When it runs, the OS will schedule the processes/threads across the available cores in a fairly smart/adaptive way. We'll get a speedup, but probably a sublinear one, because some time is lost to communication and coordination :-(

### Using multiple cores -- the easy way

If we need to run our program **multiple times** and we have **multiple cores** (and if each run won't use up too much RAM), then don't rewrite the program. Just run it multiple times at once. E.g. if we have 4 cores, open up 4 terminals and type `python myprog.py` in each one. We'll probably get a 4x speedup :-)

This is a bit too **manual** for large-scale tasks (e.g. many runs on multiple machines). **Taskfarming** takes this to its logical conclusion: we put our commands into a file and hand it to a specialised taskfarming program. It decides how many to run in parallel, and starts the next task whenever one finishes, to keep the CPUs busy.

It can even use multiple machines if available. 
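
To make the idea concrete, here is a toy taskfarm on a single machine. It is only a sketch: the task list is made up (each "task" is a tiny Python command that prints a square), whereas a real task file would hold commands such as `python myprog.py` with different arguments. Each command runs as its own OS process, and at most three run at once.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task list: each entry is one independent command.
tasks = [
    [sys.executable, "-c", f"print({seed} * {seed})"]
    for seed in range(6)
]


def run_task(command):
    """Run one command as its own OS process and return what it printed."""
    completed = subprocess.run(
        command, capture_output=True, text=True, check=True
    )
    return completed.stdout.strip()


# At most 3 commands run at once; when one finishes, the next one starts.
# The threads only wait for the child processes, which do the actual work.
with ThreadPoolExecutor(max_workers=3) as pool:
    outputs = list(pool.map(run_task, tasks))

print(outputs)
```

The program being run needs no changes at all, which is the main attraction of this approach.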

In the next notebook/video we'll see how this is used on ICHEC.

### Distributed computing

<center><img src=img/HPC-images.png width=80%></center>

In distributed computing, we have multiple machines running a single job, so usually they need to communicate during the job. This is outside our scope.

### Understanding performance in terms of bottlenecks

When a program is too slow, there are many possible reasons:
* Slow CPU
* Not enough RAM, so the machine keeps moving data from RAM to disk and back ("thrashing")
* Wrong algorithm (e.g. "accidentally quadratic")

And many possible solutions:

* Buy a bigger computer or a GPU
* Use parallel and/or distributed computing
* Find a better algorithm.
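
As an illustration of "accidentally quadratic", the two hypothetical functions below both check a list for duplicates. The first looks harmless, but `item in seen` scans a list, so the total work grows with the square of the input size. Swapping the list for a set makes each lookup constant-time on average, so the total work grows linearly.

```python
def has_duplicates_quadratic(items):
    """Accidentally quadratic: each `in` test scans the whole list."""
    seen = []
    for item in items:
        if item in seen:      # linear scan of a list
            return True
        seen.append(item)
    return False


def has_duplicates_linear(items):
    """Same logic with a set: each `in` test is a hash lookup."""
    seen = set()
    for item in items:
        if item in seen:      # constant time on average
            return True
        seen.add(item)
    return False


sample = list(range(2000)) + [7]   # illustrative data with one duplicate
print(has_duplicates_quadratic(sample), has_duplicates_linear(sample))
```

No faster computer is needed here: changing one data structure is enough.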


<center><img src=img/bottleneck.jpg width=50%></center>
<font size=1>https://www.kissclipart.com/manufacturing-bottleneck-clipart-bottleneck-busine-0xtdag/</font>

A **bottleneck** model is useful for thinking about performance. Usually when a program is slow, there's one main reason why it's slow, e.g. just one function or one inner loop. Optimising anywhere else yields little or no improvement, because the bottleneck still dominates the running time.

Before trying to optimise a function, we should find out **which functions** are using most of the time in our program! 

That's the job of the **profiler**. A profiler is a tool that runs your program and tells you where it spends most of its time. It gives a report like this, telling us how often each function is called and the total time spent in each function.

<font size=6>
    
```text
197 function calls (192 primitive calls) in 0.002 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000    0.001    0.001 <string>:1(<module>)
     1    0.000    0.000    0.001    0.001 re.py:212(compile)
     1    0.000    0.000    0.001    0.001 re.py:268(_compile)
     1    0.000    0.000    0.000    0.000 sre_compile.py:172(_compile_charset)
     1    0.000    0.000    0.000    0.000 sre_compile.py:201(_optimize_charset)
     4    0.000    0.000    0.000    0.000 sre_compile.py:25(_identityfunction)
   3/1    0.000    0.000    0.000    0.000 sre_compile.py:33(_compile)
```

</font>

Python has a built-in profiler: https://docs.python.org/3/library/profile.html
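
A minimal sketch of using it from code is shown below. The functions `slow_square_sum`, `fast_constant` and `main` are made up for illustration; `cProfile` and `pstats` are the standard-library modules described at that link. The report is sorted by cumulative time, so the most expensive calls appear first.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Deliberately does a lot of work, so it should dominate the profile."""
    return sum(i * i for i in range(n))


def fast_constant():
    """Does almost nothing."""
    return 42


def main():
    slow_square_sum(200_000)
    fast_constant()


# Collect timing data only while main() is running.
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Format the report into a string, most expensive calls first.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)   # show only the top 5 entries
report = stream.getvalue()
print(report)
```

For a whole script, the same profiler can be run from the command line with `python -m cProfile myprog.py`.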

Choosing a solution may require a good understanding of algorithms, engineering, and budget trade-offs! You could pay a developer to make your algorithm use less RAM, but it might be cheaper to just buy a load of extra RAM. You could implement a Hadoop solution, but it might be better to just <a href=https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html>learn to use basic tools properly</a>.

We have barely scratched the surface here, but you can see more in **Tools and Techniques for Large-Scale Data Analytics** in Semester 2.

For now, we'll assume that our program is reasonably well-optimised, and we want to run it multiple times with different hyperparameters, so a possible solution is to use **taskfarming** on a large compute server with multiple CPUs (maybe even multiple machines) and/or machines with GPUs. That's what we'll study next.