

## PARALLEL AND GPU PROGRAMMING IN PYTHON

Mohsen Safari and Ben Czaja HPC Advisors, SURF

November 15&16, 2023

SURF

# Amsterdam Science Park



#### **Outline**

- GPUs as hardware accelerator
- PyCUDA programming
  - CUDA programming and execution model
- Examples:
  - Vector (1D array) addition
  - Matrix (2D array) addition
  - Matrix multiplication
  - Reduction
- Optimization tips
- Two bugs in GPU programming



#### Resources

- The slides and source code of the examples can be found at:
  - https://github.com/sara-nl/Parallel-and-GPU-programming-in-Python



## **Jupyter Notebook**

- The notebook for the GPU part of the course:
  - https://jupyter.snellius.surf.nl/jhssrf006



## Hardware Accelerator (e.g., GPUs)

What is it?

Why do we need it?

How to benefit it?

- ..



### A computer is





### **Peripherals**



#### **Main Gloals**

General-Purpose

Low latency



## **Complicated CPUs**





## **Memory Hierarchy**





#### Moore's Law



Number of transistors on a CPU chip will double every 18 months



## **Dennard Scaling Law**

As transistors become smaller, their power density stays constant

- As a result of Moore's and Dennard's law:
  - CPU manufacturers can raise clock frequency without significantly increasing overall circuit power consumption



## **Clock Frequency**





### **End of Moore's/Dennard's Law**

The prediction had been true for a long time

 We observe that #transistors does not increase in the scale of Moore's law

We reach the end of Dennard scaling law



#### **Multi-core CPUs**



## **Multi-processors**





## **New requirements**

Big data

- New applications:
  - Massively parallel
  - Certain operations

Faster computation



#### **Accelerator**





## **Accelerators/Co-processors**

Graphics Processing Units (GPUs)

Field Programmable Gate Arrays (FPGAs)

Tensor Processing Units (TPUs)

**—** ...



## Simplere many cores

Simpler cores (i.e., simplified ALUs and CUs)

Replicate many of them

As a co-processor



#### **GPUs**

Initially invented for image rendering purposes

Gradually evolved to be used as General Purpose GPU



#### **GPUs vs CPUs**





#### **Two Metrics**

Latency: the time it takes an instruction to be processed

 Throughput: the number of instructions that can be processed in a certain amount of time



#### **Two Metrics**

CPUs are latency-optimized processors

GPUs are throughput-optimized (co-)processors



#### **GPU Manufacturers**









## Supercomputers

| Rank | System                                                                                                                                                                         | Cores     | Rmax<br>(PFlop/s) | Rpeak<br>(PFlop/s) | Power<br>(kW) |
|------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------|-------------------|--------------------|---------------|
| 1    | Frontier - HPE Cray EX235a, AMD Optimized 3rd<br>Generation EPYC 64C 2GHz, AMD Instinct MI250X,<br>Slingshot-11, HPE<br>D0E/SC/Oak Ridge National Laboratory<br>United States  | 8,730,112 | 1,102.00          | 1,685.65           | 21,100        |
| 2    | Supercomputer Fugaku - Supercomputer Fugaku,<br>A64FX 48C 2.2GHz, Tofu interconnect D, Fujitsu<br>RIKEN Center for Computational Science<br>Japan                              | 7,630,848 | 442.01            | 537.21             | 29,899        |
| 3    | LUMI - HPE Cray EX235a, AMD Optimized 3rd<br>Generation EPYC 64C 2GHz, AMD Instinct MI250X,<br>Slingshot-11, HPE<br>EuroHPC/CSC<br>Finland                                     | 1,110,144 | 151.90            | 214.35             | 2,942         |
| 4    | Summit - IBM Power System AC922, IBM POWER9 22C 3.07GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR Infiniband, IBM DOE/SC/Oak Ridge National Laboratory United States         | 2,414,592 | 148.60            | 200.79             | 10,096        |
| 5    | Sierra - IBM Power System AC922, IBM POWER9 22C<br>3.1GHz, NVIDIA Volta GV100, Dual-rail Mellanox EDR<br>Infiniband, IBM / NVIDIA / Mellanox<br>DOE/NNSA/LLNL<br>United States | 1,572,480 | 94.64             | 125.71             | 7,438         |



## Supercomputers

| 6  | Sunway TaihuLight - Sunway MPP, Sunway SW26010<br>260C 1.45GHz, Sunway, NRCPC<br>National Supercomputing Center in Wuxi<br>China                                                                                                                            | 10,649,600 | 93.01 | 125.44 | 15,371 |
|----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|-------|--------|--------|
| 7  | Perlmutter - HPE Cray EX235n, AMD EPYC 7763 64C<br>2.45GHz, NVIDIA A100 SXM4 40 GB, Slingshot-10, HPE<br>D0E/SC/LBNL/NERSC<br>United States                                                                                                                 | 761,856    | 70.87 | 93.75  | 2,589  |
| 8  | Selene - NVIDIA DGX A100, AMD EPYC 7742 64C<br>2.25GHz, NVIDIA A100, Mellanox HDR Infiniband, Nvidia<br>NVIDIA Corporation<br>United States                                                                                                                 | 555,520    | 63.46 | 79.22  | 2,646  |
| 9  | Tianhe-2A - TH-IVB-FEP Cluster, Intel Xeon E5-2692v2<br>12C 2.2GHz, TH Express-2, Matrix-2000, NUDT<br>National Super Computer Center in Guangzhou<br>China                                                                                                 | 4,981,760  | 61.44 | 100.68 | 18,482 |
| 10 | Adastra - HPE Cray EX235a, AMD Optimized 3rd<br>Generation EPYC 64C 2GHz, AMD Instinct MI250X,<br>Slingshot-11, HPE<br>Grand Equipement National de Calcul Intensif - Centre<br>Informatique National de l'Enseignement Suprieur<br>(GENCI-CINES)<br>France | 319,072    | 46.10 | 61.61  | 921    |



## **Snellius Supercomputer**

72 GPU nodes

Each node has 4 A100 NVIDIA GPU devices

- In total 288 GPUs



## **GPU CPU Connectivity**





## **GPU Usability**

How to Use GPUs?



## **GPU Usability**

3 Ways to Accelerate Applications

# **Applications**

Libraries

OpenACC Directives

Programming Languages

"Drop-in"
Acceleration

Easily Accelerate Applications

Maximum Flexibility



#### **GPU Libraries**



## **GPU Usability**

3 Ways to Accelerate Applications

# **Applications**

Libraries

OpenACC Directives

Programming Languages

"Drop-in"
Acceleration

Easily Accelerate Applications

Maximum Flexibility



## OpenACC/OpenMP

OpenACC stands for Open Accelerators

OpenMP stands for Open Multi-Processing

Directive-based APIs

Simple compiler hints to parallelize the code



## **GPU Usability**

3 Ways to Accelerate Applications

# **Applications**

Libraries

OpenACC Directives

Programming Languages

"Drop-in"
Acceleration

Easily Accelerate Applications

Maximum Flexibility



### **GPU Programming Languages**





## **Core GPU Programming**

- Nvidia GPUs:
  - CUDA, OpenCL, HIP
- AMD GPUs:
  - OpenCL, HIP
- Intel GPUs:
  - OpenCL



### **Accessing to GPUs in Python**





## PyTorch/TensorFlow

They are powerful and mature deep learning libraries

They benefit from GPUs without knowing GPU programming knowledge

They are open sources

They are taught in machine learning courses



### **CuPy vs NumPy**



```
mPy CuPy
```

```
import numpy as np
X_cpu = np.zeros((10,))
W_cpu = np.zeros((10, 5))
y_cpu = np.dot(x_cpu, W_cpu)
```

```
import cupy as cp
x_gpu = cp.zeros((10,))
W_gpu = cp.zeros((10, 5))
y_gpu = cp.dot(x_gpu, W_gpu)
```



#### Numba

 It is an open-source Just-In Time (JIT) compiler that translates a subset of Python and Numpy into GPU machine code.

 It uses a collection of decorators that can be applied to your functions to instruct Numba to compile them.

For more information: https://numba.pydata.org/



### **PyCUDA**

 It gives you easy, Pythonic access to NVIDIA's CUDA parallel computation API.

There is more flexibility to write custom CUDA kernels

For more information: https://documen.tician.de/pycuda/



#### **NVIDIA GPU Hardware**





### **NVIDIA GPU Hardware**

#### GPU device:





# Flynn's classical taxonomy

|             |          | Instruction stream |          |
|-------------|----------|--------------------|----------|
|             |          | Single             | Multiple |
| Data stream | Single   | SISD               | MISD     |
|             | Multiple | SIMD               | MIMD     |



- Introduced by NVIDIA in 2006, Compute Unified Device Architecture
- General purpose programming model that leverages the parallel compute engine in NVIDIA GPUs
- An extension of C language
- CUDA programs are CPU-GPU programs:
  - CPU part is called host
  - GPU part is called kernel



To execute any CUDA program, there are three main steps:

- Copy the input data from host memory to device memory, also known as host-to-device transfer
- Call the kernel from host and execute the GPU program
- Copy the results from device memory to host memory, also called device-to-host transfer







Threads are organized into two

hierarchical levels:

- Threads are grouped into blocks
- Blocks are grouped into grids
- Blocks and grids can be

1D, 2D and 3D





#### Built-in functions:

- Dimension:
  - · gridDim.x, gridDim.y, gridDim.z
  - blockDim.x, blockDim.z
- Index:
  - · blockIdx.x, blockIdx.y, blockIdx.z
  - threadIdx.x, threadIdx.y, threadIdx.z



### **CUDA** Grid

- gridDim.x = 3
- gridDim.y = 3
- blockDim.x = 3
- blockDim.y = 3





















### Synchronization in CUDA

- There is a mechanism to synchronize all threads in a block:
  - Built-in function \_\_syncthreads()
- There is no mechanism to synchronize all threads across all blocks
  - Decouple the kernel into two separate kernels



#### **GPU Node**

- 4 NVIDIA A100 GPUs per node
  - Multiprocessors: 108
  - Streaming cores: 6912
  - Tensor Cores: 432
  - Global memory: 40 GB
- MIG partitions: 1/7th of A100 GPUs
- One GPU is shared among 7 people
- Note that you have around 5 GB memory:
  - Matrix  $(35,000 * 35,000) = 35,000*35,000*4 \approx 5 \text{ GB}$



First Example:

Parallel Vector (1D array) Addition in PyCUDA



### Calculate Global Index (1D grid, 1D block)



- Global Thread ID: blockIdx.x \* blockDim.x + threadIdx.x
- For global thread ID 26:
  - blockldx.x = 3
  - blockDim.x = 8
  - threadIdx.x = 2
  - Global thread ID = 3 \* 8 + 2 = 26



### **PyCUDA Implementation**

- Implement vector addition in PyCUDA
- Compare its execution time to the sequential version



#### **Automatic Data Transfer**

- Automatic data transfer using PyCUDA driver:
  - In()
  - Out()
  - InOut()
- PyCUDA programs become simpler



Second Example:

Parallel Matrix (2D array) Addition in PyCUDA



### Calculate Global Index (2D grid, 2D block)



Matrix 12\*16

- Global Thread ID:
  - row = blockIdx.y \* blockDim.y + threadIdx.y = 2 \* 4 + 1 = 9
  - column = blockldx.x \* blockDim.x + threadIdx.x = 1 \* 4 + 3 = 7



### **Row-Major Flattening of a Matrix**

- Matrix 3\*3
- For each element (row, col):
  - New ID = row \* (No of col) + col
- For instance element "5" in location (1, 2):
  - New ID = 1 \* 3 + 2 = 5

How we see a 2D array





### **Row-Major Flattening of a Matrix**





### **PyCUDA Implementation**

- Implement matrix addition in PyCUDA
- Compare its execution time to the sequential version



### **Exercise 1**

- Try to transpose a matrix in parallel using PyCUDA
- Compare its execution time to the sequential version



Third Example:

Parallel Matrix (2D array) Multiplication in PyCUDA



### Sequential Matrix Multiplication

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
        sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}</pre>
```



$$\begin{bmatrix} a & b \\ c & d \end{bmatrix} \times \begin{bmatrix} e & f \\ g & h \end{bmatrix} = \begin{bmatrix} ae + bg & af + bh \\ ce + dg & cf + dh \end{bmatrix}$$
A
B
C



### **Parallel Matrix Multiplication**

```
int k, sum = 0;
int col = threadIdx.x + blockDim.x * blockIdx.x;
int row = threadIdx.y + blockDim.y * blockIdx.y;
if(col < width && row < width) {
for (k = 0; k < width; k++)
 sum += a[row * width + k] * b[k * width + col];
 c[row * width + col] = sum;
```



### **PyCUDA Implementation**

- Implement matrix multiplication in PyCUDA
- Compare its execution time to
  - Sequential CPU-based
  - Numpy.matmul()
  - @ operator



# Fourth Example: Reduction in PyCUDA



#### Reduction





#### **Reduction (addition)**





#### **Reduction (addition)**





#### **Reduction (addition)**



$$(tid\%(2^{level})==0) ==> a[tid] += a[tid+2^{level}]$$



#### **PyCUDA Implementation**

- Implement reduction in PyCUDA using one thread block
- Compare its execution time to the sequential version and Python reduce operator



#### **PyCUDA Implementation**

- Extend it to use arbitrary size (i.e., multiple thread blocks)
- Compare its execution time to the sequential version and Python reduce operator



#### **PyCUDA Implementation**

- How to use shared memory in reduction?
- Compare its execution time to the sequential version and Python reduce operator



#### **Exercise 2**

 Reduce an array using other operators (subtraction, multiplication, etc.)



#### **Optimization**

There are different ways to optimize CUDA codes:

- Number of threads per block
- Workload per thread
- Total work per thread block
- Correct memory access and data locality
- **-** ...



#### **Tips for Optimization**

Global Memory Access:



Coalesced



Non-coalesced



#### **Tips for Optimization**

Avoid Warp Divergence:

```
if ( threadIdx.x < 16 )
   ... A ...
else
   ... В ...
```





#### **Tips for Optimization**

- Use shared memory in two cases:
  - When threads in a block need to shared data
  - When there are repeated accesses to one location in global memory
    - In this case, it is possible to use registers as local memory to each thread



## **GPU Development Cycle**





#### **Data Races**

 A data race is a situation where two or more threads may access the same memory location simultaneously and at least one of them is a write

It causes undefined behavior of programs



### **Data Race Example**

```
__global__ void kernel(int *arr)
{

arr += 1;
}
```

One solution is to use built-in atomic operations in GPU programming languages



### **Data Race Example**

```
global void kernel(int *arr, int size)
 if (tid < size-1)
    arr[tid] += arr[tid+1];
 }
```

One solution is to use synchronization methods in GPU programming



### **Barrier Divergence**

 A barrier divergence happens when threads from the same thread block diverge and hit different (syntactical) barriers



#### **Barrier Divergence Example**

```
_global__ void kernel(...){
 if (tid \% 2 == 0){
     syncthreads();
     . . . .
 } else{
     syncthreads(); }
```



#### **Questions**

Thank you for participating! Any questions?

