# Session 3

## Implementing a 2D simulation of Active Brownian Particles (ABP) on GPUs

In the third session of this tutorial we port the C++ and Python code developed in Sessions 1 and 2 to GPUs. We keep the structure and naming the same as in the C++ version and only make changes where necessary. We focus on how to implement the **force** and **integrator** algorithms in parallel, keeping the rest of the code as before.

<div align="center">
<img src="./layout_interface.png" style="width: 600px;"/>
</div>

## Parallel computing: a short introduction

From a practical point of view, parallel computing can be defined as a form of computation in which many calculations are executed simultaneously. Large problems can often be divided into smaller ones, which are then solved concurrently on different computing resources.

### Sequential vs Parallel

As we've seen in the previous tutorials, a computer program consists of a series of calculations that perform a specified task. For example, our ``evolve`` step in the ``evolver`` class needs to perform the following calculations **sequentially**:
1. *Check if the neighbour list needs rebuilding*
2. *Perform the preintegration step, i.e., the step before forces and torques are computed*
3. *Apply periodic boundary conditions*
4. *Reset all forces and torques*
5. *Compute all forces and torques*
6. *Perform the second step of integration*
7. *Apply periodic boundary conditions*
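
The fixed order of these steps can be made visible with a small C++ sketch. It is not the Session 1 `evolver` class: the struct name, the method names, and the `rebuild_needed` flag are all hypothetical, and each step only records its name instead of doing any physics.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the Session 1 evolver: each step only records its
// name, so the order of execution can be inspected afterwards.
struct EvolverSketch
{
    std::vector<std::string> log;   // order in which the steps were executed
    bool rebuild_needed = true;     // assumption: set elsewhere by a displacement check

    void rebuild_neighbour_list()   { log.push_back("rebuild_neighbour_list"); }
    void pre_integrate()            { log.push_back("pre_integrate"); }
    void apply_periodic()           { log.push_back("apply_periodic"); }
    void reset_forces_torques()     { log.push_back("reset_forces_torques"); }
    void compute_forces_torques()   { log.push_back("compute_forces_torques"); }
    void post_integrate()           { log.push_back("post_integrate"); }

    // One time step: the calls run strictly one after another.
    void evolve()
    {
        if (rebuild_needed) rebuild_neighbour_list();  // step 1
        pre_integrate();                               // step 2
        apply_periodic();                              // step 3
        reset_forces_torques();                        // step 4
        compute_forces_torques();                      // step 5
        post_integrate();                              // step 6
        apply_periodic();                              // step 7
    }
};
```

Calling `evolve()` once on an `EvolverSketch` fills `log` with the seven step names in the order listed above. The ordering between steps stays sequential on the GPU as well; what can change is how the work inside a single step is carried out.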

In general, the relationship between two pieces of computation falls into one of two classes: **(a)** they are related by a precedence constraint and must therefore be calculated sequentially, like the steps inside ``evolve``, or **(b)** they are not related by a precedence constraint and can therefore be calculated concurrently. Any program containing tasks that are performed concurrently is a **parallel program**.

## Parallelism

There are two fundamental types of parallelism:

* Data parallelism: when there are many data items that can be operated *on* at the same time.
* Task parallelism: when there are many tasks that can be executed independently and largely in parallel.

<div align="center">
<img src="./parellel_comp.png" style="width: 500px;"/>

<sub><sup>(Left) Data parallelism focuses on distributing the data across multiple cores. (Right) Task parallelism focuses on distributing functions across multiple cores.</sup></sub>
</div>
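
As a rough illustration of data parallelism in the ABP setting, the C++ sketch below moves every particle along its self-propulsion direction. The `Particle` layout, the function names, and the parameters `v0` and `dt` are assumptions made for this example; forces, torques, and noise are deliberately left out, so this is not the tutorial's integrator.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical particle layout; the Session 1 code may store different fields.
struct Particle
{
    double x, y;    // position
    double theta;   // direction of self-propulsion
};

// Work for ONE particle: reads and writes only particles[i].
// On a GPU, this body is what a single thread would execute.
void move_particle(std::vector<Particle>& particles, std::size_t i,
                   double v0, double dt)
{
    Particle& p = particles[i];
    p.x += v0 * std::cos(p.theta) * dt;
    p.y += v0 * std::sin(p.theta) * dt;
}

// Sequential driver: visits the particles one by one. No iteration depends on
// another, so the loop order is irrelevant and the iterations could equally
// be spread over many cores.
void move_all(std::vector<Particle>& particles, double v0, double dt)
{
    for (std::size_t i = 0; i < particles.size(); ++i)
        move_particle(particles, i, v0, dt);
}
```

Because `move_particle` touches only element `i`, running the loop forwards, backwards, or with all iterations at once gives the same result. That independence between data items is the property a data-parallel device exploits.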


In this session we focus on graphics processing units (GPUs), which are especially well suited to problems that can be expressed as data-parallel computations.

## GPU computing: Heterogeneous Architecture

Nowadays a typical compute node consists of one or more multicore central processing unit (**CPU**) sockets and one or more many-core GPUs. The GPU acts as a co-processor to the CPU, communicating with it through a **PCI-Express bus**. GPUs were originally designed to fulfill specific needs, especially for workloads in the graphics industry such as video games and rendering.

<div align="center">
<img src="./GPU-transistor2.png" style="width: 500px;"/>
</div>


A CPU core is designed to optimize the execution of sequential programs. In contrast, a GPU core is optimized for data-parallel tasks, focusing on the throughput of parallel programs. Since the GPU functions as a "co-processor", the CPU is called the ``host`` and the GPU is called the ``device``.

Any GPU-based application consists of two parts:

* Host code
* Device code

As the names suggest, the *host code* runs on the CPU and the *device code* runs on the GPU. Typically the CPU is in charge of initializing the environment for the GPU, i.e., allocating memory, copying data, etc.

## The CUDA platform

**CUDA** is a parallel computing platform and programming model that makes use of the parallel compute engine in NVIDIA GPUs. From a practical point of view, CUDA is exposed through extensions to familiar programming languages such as C, C++, and Fortran. NVIDIA GPUs can also be programmed through other APIs such as OpenCL.

<div align="center">
<img src="./cuda-plataform.png" style="width: 500px;"/>
</div>

### The CUDA Programming Model

Here we learn how to write CUDA C/C++ code to make use of the GPU. As before, our application consists of two types of code: the ``host`` code and the ``device`` code. A function that is launched from the host and executed on the GPU is called a **``kernel``**. Kernel functions are identified by the ``__global__`` qualifier and by the fact that they **never** return a value (their return type must be ``void``). The following example shows a first ``hello_world`` program:

```c
//@file: hello_world.cu
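#include <cstdio>          // printf, also usable inside device code
#include "hello_world.hpp"

// device code start
__global__ void hello_world_kernel()
{
    printf("hello\n");
}
// device code end

// host code start
void call_hello_world_kernel(void)
{
    // launch the kernel with 1 block of 1 thread
    hello_world_kernel<<<1,1>>>();
    // kernel launches are asynchronous: wait for the kernel to finish
    // so that the device-side printf output is flushed
    cudaDeviceSynchronize();
}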
// host code end

//@file: hello_world.hpp
void call_hello_world_kernel(void); //forward declaration

//@file: main.cpp
#include "hello_world.hpp"

// host code start
int main()
{
    ///...

    call_hello_world_kernel();
    
    ///...
}
// host code end

```
Kernels have several properties:
1. They are identified by the ``__global__`` qualifier and have a ``void`` return type.
2. They are executed only on the device.
3. They are called from the host (GPUs that support dynamic parallelism can also launch kernels from device code).

#### Where is the parallelism here?

So far we haven't seen any parallelism. To understand 


## References
* [https://devblogs.nvidia.com/](https://devblogs.nvidia.com/)
* Cheng, John, Max Grossman, and Ty McKercher. *Professional CUDA C Programming*. John Wiley & Sons, 2014.
