# Parallel Computing, HPC, and Slurm [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ua-2025q3-astr501-513/ua-2025q3-astr501-513.github.io/blob/main/501/07/lab.ipynb)

Modern scientific research, from simulating black holes to modeling
climate systems, requires computational resources that far exceed what
a single processor can provide.
Problems involving massive datasets or computationally expensive
algorithms (e.g., Monte Carlo simulations, numerical PDE solvers,
machine learning training) demand performance beyond sequential
execution.

Parallel computing addresses this by breaking a problem into smaller
tasks that can be solved simultaneously on multiple processing
elements.
With the rise of multicore CPUs, distributed systems, GPUs, and
specialized accelerators, parallel computing has become central to
high-performance computing (HPC).
This lab will introduce you to the theory, programming models, and
practical execution of parallel codes, with examples in Python, C, and
MPI.
You will also gain experience running jobs on a modern HPC cluster
with a workload manager like Slurm.

## Theoretical Foundations

Before we dive into implementation, we review key concepts that define
the limits and opportunities of parallelism.

### Amdahl's Law (Strong Scaling)

If a fraction $f$ of a program is inherently sequential, the maximum
speedup $S$ with $P$ processors is:
\begin{align}
  S(P) = \frac{1}{f + (1-f)/P}.
\end{align}
Note that, as $P \to \infty$, $S \to 1/f$.

Implication: Even a small sequential portion limits total speedup.
* Example: If 5% of your code is sequential, the maximum speedup is
  20x, no matter how many processors you add.
* This highlights why HPC algorithms for systems like Frontier (the
  DOE exascale machine) must minimize sequential bottlenecks.

This corresponds to strong scaling tests, where the problem size is
fixed and we ask how performance improves as resources increase.
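To make the saturation concrete, here is a minimal sketch that evaluates Amdahl's law for the 5% example above. The function name `amdahl_speedup` is just an illustrative choice, not part of any library.

```python
def amdahl_speedup(f, p):
    """Speedup on p processors when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / p)

# With f = 0.05 the speedup approaches, but never reaches, 1/f = 20.
for p in (1, 10, 100, 1000, 10**6):
    print(f"P = {p:>7d}  S = {amdahl_speedup(0.05, p):6.2f}")
```

The printed values level off below 20 as $P$ grows, which is the behavior a strong scaling test would reveal.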

### Gustafson's Law (Weak Scaling)

A more optimistic view: as we increase $P$, we also increase the
problem size to fully utilize resources:
\begin{align}
  S(P) = f + (1-f)P.
\end{align}

Implication: In scientific computing, we often want higher resolution
or larger domains, so performance scales with problem size.

This corresponds to weak scaling tests, which measure how performance
changes when the workload grows proportionally with resources.
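As a rough illustration, the sketch below compares the scaled speedup from Gustafson's law with the fixed-size speedup from Amdahl's law for the same sequential fraction. The function names are hypothetical helpers introduced only for this example.

```python
def gustafson_speedup(f, p):
    """Scaled speedup when the parallel part grows with p (f = sequential fraction)."""
    return f + (1.0 - f) * p

def amdahl_speedup(f, p):
    """Fixed-size speedup, shown for comparison."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.05
for p in (1, 10, 100, 1000):
    print(f"P = {p:>5d}  Amdahl = {amdahl_speedup(f, p):6.2f}  "
          f"Gustafson = {gustafson_speedup(f, p):8.2f}")
```

The scaled speedup keeps growing almost linearly with $P$, because the sequential part becomes a smaller share of the enlarged problem.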

### Flynn's Taxonomy

To better understand computing architectures, 
[Flynn (1972)](https://en.wikipedia.org/wiki/Flynn%27s_taxonomy)
classified them into four categories:
* SISD:
  Single Instruction, Single Data (traditional CPU execution)
* SIMD:
  Single Instruction, Multiple Data (vector units, GPUs, and NumPy or JAX
  vectorization where the hardware supports it)
* MISD:
  Multiple Instruction, Single Data (rare, mostly theoretical)
* MIMD:
  Multiple Instruction, Multiple Data (clusters, multicore CPUs,
  distributed MPI systems)

Programming models map naturally onto these:
* [OpenMP](https://www.openmp.org/):
  shared-memory SIMD/MIMD
* [CUDA](https://en.wikipedia.org/wiki/CUDA)/[OpenCL](https://www.khronos.org/opencl/):
  SIMD execution on GPUs
* [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface):
  distributed-memory MIMD
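The difference between SISD-style and SIMD-style execution can be sketched in Python by contrasting an explicit loop with a NumPy array expression. This is only an analogy: whether NumPy actually issues SIMD instructions depends on how it was built and on the hardware, and the function names here are hypothetical.

```python
import numpy as np

def sum_squares_loop(values):
    """SISD-style: one multiply-add per loop iteration."""
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_squares_vectorized(values):
    """SIMD-style: one array expression applied to all elements."""
    return float(np.sum(values * values))

data = np.arange(1000, dtype=np.float64)
print(sum_squares_loop(data), sum_squares_vectorized(data))
```

Both functions compute the same result; the vectorized form hands the whole array to compiled code that is free to process several elements per instruction.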

### Additional Resources

* HPC Carpentry lessons:
  https://hpc-carpentry.github.io
* MPI Tutorial:
  https://mpitutorial.com/
* Slurm quick-start guide:
  https://slurm.schedmd.com/quickstart.html

## Monte Carlo Computation of $\pi$

We will parallelize a simple algorithm using different techniques.
The algorithm is the Monte Carlo computation of $\pi$.
This is an embarrassingly parallel problem, so little algorithmic
design is needed.
We mainly use it to familiarize ourselves with the different tools.

### Python Serial Code

Here is the algorithm in pure Python:

In [None]:
import random

def mcpi_loop(n_total):
    n_inside = 0
    for _ in range(n_total):
        x, y = random.random(), random.random()
        if x*x + y*y < 1.0:
            n_inside += 1
    return n_inside

In [None]:
pi = 4 * mcpi_loop(1000_000) / 1000_000
print(pi)

In [None]:
%timeit mcpi_loop(1000_000)

On my laptop, it takes about 80 ms to draw 1M samples.
The number of significant digits is $\sim \log_{10}\sqrt{N} = 3$.
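The digit estimate can be checked with the standard error of the estimator. Each sample lands inside the quarter circle with probability $p = \pi/4$, so the estimate $4 N_\mathrm{inside}/N$ has standard deviation $4\sqrt{p(1-p)/N}$. A minimal sketch, with `mcpi_std_error` as a hypothetical helper name:

```python
import math

def mcpi_std_error(n_total):
    """Expected one-sigma error of the Monte Carlo estimate of pi."""
    p = math.pi / 4.0                      # probability of landing inside
    return 4.0 * math.sqrt(p * (1.0 - p) / n_total)

for n in (10**4, 10**6, 10**8):
    print(f"N = {n:>9d}  expected error ~ {mcpi_std_error(n):.1e}")
```

The error shrinks as $1/\sqrt{N}$, so each additional digit costs about 100 times more samples.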

### Embarrassingly Parallel Computing

Since this algorithm is embarrassingly parallelizable, we can simply
run it multiple times and compute the mean.
Let's do this as a class exercise using this
[Google Sheet](https://docs.google.com/spreadsheets/d/11h8p5dsJzD8vCcgBBvA4B0RC2oWzuWjT2HLnogf9nlc/edit?gid=245417564#gid=245417564).

Effectively, we just did a weak scaling test!
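The class exercise can be mimicked in code. The sketch below assumes each "participant" is an independent run with its own seed. The runs execute one after another here, standing in for separate machines, and the names `mcpi_run` and `mcpi_mean` are hypothetical.

```python
import random

def mcpi_run(n_total, seed):
    """One independent estimate of pi, using its own seeded generator."""
    rng = random.Random(seed)
    n_inside = 0
    for _ in range(n_total):
        x, y = rng.random(), rng.random()
        if x*x + y*y < 1.0:
            n_inside += 1
    return 4 * n_inside / n_total

def mcpi_mean(n_runs, n_total):
    """Average the estimates from n_runs runs with distinct seeds."""
    estimates = [mcpi_run(n_total, seed) for seed in range(n_runs)]
    return sum(estimates) / n_runs

print(mcpi_mean(8, 100_000))
```

Giving each run a different seed matters: identical seeds would reproduce the same samples, and averaging them would not reduce the error.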

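### NumPy Parallel Code

When NumPy is linked against a multithreaded BLAS backend such as
OpenBLAS or MKL, BLAS-backed operations (e.g., matrix products) are
automatically distributed across multiple cores.
Elementwise arithmetic and random number generation, which the example
below relies on, generally do not go through BLAS, so its speedup comes
mainly from vectorization.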

In [None]:
import numpy as np

np.__config__.show()

In [None]:
import os
print(os.environ.get('OPENBLAS_NUM_THREADS', 0))
print(os.environ.get('MKL_NUM_THREADS',      0))

In [None]:
def mcpi_numpy(n_total):
    x = np.random.rand(n_total)
    y = np.random.rand(n_total)
    return np.sum(x*x + y*y < 1.0)

In [None]:
pi = 4 * mcpi_numpy(1000_000) / 1000_000
print(pi)

In [None]:
%timeit mcpi_numpy(1000_000)

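In [None]:
# NOTE: BLAS libraries typically read these variables once, when they are
# first loaded, so setting them after `import numpy` may have no effect.
# For a reliable comparison, set them before importing NumPy in a fresh kernel.
os.environ['MKL_NUM_THREADS']      = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
%timeit mcpi_numpy(1000_000)

In [None]:
os.environ['MKL_NUM_THREADS']      = '4'
os.environ['OPENBLAS_NUM_THREADS'] = '4'
%timeit mcpi_numpy(1000_000)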
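Thread-count variables such as `OPENBLAS_NUM_THREADS` are typically read when the BLAS library is loaded, so one way to compare settings reliably is to start a fresh interpreter for each value. The sketch below does this with a child process and times a matrix product, which is an operation that does go through BLAS. The helper `time_matmul` and the 500x500 matrix size are arbitrary choices for illustration, and a single timing like this is noisy.

```python
import os
import subprocess
import sys

# Script run in a fresh interpreter; it times one BLAS-backed matrix product.
SNIPPET = """
import time
import numpy as np
a = np.random.rand(500, 500)
t0 = time.perf_counter()
a @ a
print(time.perf_counter() - t0)
"""

def time_matmul(n_threads):
    """Time a matrix product in a child process with the given thread setting."""
    env = dict(os.environ)
    env["OPENBLAS_NUM_THREADS"] = str(n_threads)
    env["MKL_NUM_THREADS"] = str(n_threads)
    out = subprocess.run([sys.executable, "-c", SNIPPET], env=env,
                         capture_output=True, text=True, check=True)
    return float(out.stdout.strip())

for n in (1, 4):
    print(n, "thread(s):", time_matmul(n), "s")
```

Whether the two timings differ depends on the BLAS library NumPy was built against and on the number of cores available.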