#  OpenMP for GPU


Enter your name and student ID.

 * Name:
 * Student ID:



# 1. OpenMP for GPU
* <a href="http://openmp.org/" target="_blank" rel="noopener">OpenMP</a> is the de facto programming model for multicore environments
* More recently, it supports GPU offloading
* In this notebook you are going to learn OpenMP for GPU
* Consult [the spec](https://www.openmp.org/spec-html/5.0/openmp.html) when necessary
* Take a look at [a talk slide OPENMP IN NVIDIA'S HPC by Jeff Larkin](https://openmpcon.org/wp-content/uploads/openmpcon2021-nvidia.pdf)


# 2. Compilers
* [NVIDIA HPC SDK](https://docs.nvidia.com/hpc-sdk/index.html) (`nvc` and `nvc++`) and recent [LLVM](https://llvm.org/) (`clang` and `clang++`) have decent support for OpenMP on GPU

## 2-1. Set up NVIDIA HPC SDK
Execute this before you use NVIDIA HPC SDK

In [None]:
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin:$PATH

Check if it works (check if full paths of nvc/nvc++ are shown)

In [None]:
which nvc
which nvc++

## 2-2. Set up LLVM
Execute this before you use LLVM

In [None]:
export PATH=/home/share/llvm/bin:$PATH
export LD_LIBRARY_PATH=/home/share/llvm/lib:/home/share/llvm/lib/x86_64-unknown-linux-gnu:$LD_LIBRARY_PATH

Check if it works (check if full paths of clang/clang++ are shown)

In [None]:
which clang
which clang++

* Compilers work on any host, but make sure you are on the GPU host before running GPU programs

In [None]:
hostname
hostname | grep tauleg || echo "Oh, you are not on the right host, access https://tauleg.zapto.org/ instead"

## 2-3. Summary of compiler options to compile OpenMP programs for GPU
* `nvc`/`nvc++` : `-mp=gpu` option
* `clang`/`clang++` : `-fopenmp -fopenmp-targets=nvptx64` options

# 3. Summary of directives you are going to learn
* [`#pragma omp target`](https://www.openmp.org/spec-html/5.0/openmpsu60.html#x86-2820002.12.5) : offloads the immediately following statement to the device
* [`#pragma omp teams`](https://www.openmp.org/spec-html/5.0/openmpse15.html#x57-910002.7) : creates a number of teams (similar to `#pragma omp parallel`)
* [`#pragma omp distribute`](https://www.openmp.org/spec-html/5.0/openmpsu43.html#x66-1580002.9.4) : distributes iterations of the immediately following for loop to teams
* [`#pragma omp parallel`](https://www.openmp.org/spec-html/5.0/openmpse14.html#x54-800002.6) : creates a number of threads within a team
* [`#pragma omp for`](https://www.openmp.org/spec-html/5.0/openmpsu41.html#x64-1290002.9.2) : distributes iterations of the immediately following for loop to threads of a team
* [`#pragma omp target data`](https://www.openmp.org/spec-html/5.0/openmpsu57.html#x83-2580002.12.2) : maps data between the host CPU and the GPU


# 4. [`#pragma omp target`](https://www.openmp.org/spec-html/5.0/openmpsu60.html) $\sim$ moving control to a GPU
* <font color="blue">syntax</font>
```
#pragma omp target
    S
```
executes $S$ on (_offloads_ $S$ to) a device (hopefully a GPU)


In [None]:
%%writefile omp_target.cc
#include <stdio.h>
int main() {
  printf("hello on host\n");
#pragma omp target
  printf("hello from target (hopefully GPU)\n");
  printf("back on host\n");
  return 0;
}

* Compiling

In [None]:
nvc++ -mp=gpu omp_target.cc -o omp_target
# clang++ -fopenmp -fopenmp-targets=nvptx64 omp_target.cc -o omp_target

* Running

In [None]:
./omp_target

* note:
  * although `target` is almost always used with the intention of running on a GPU, the program can actually run without a GPU (fallback to the host)
  * the above program produces identical output whether or not your machine has a GPU
  * while good for portability, this may be confusing, so you can force it to run on a GPU, or signal an error when no GPU is available, by setting the environment variable `OMP_TARGET_OFFLOAD=MANDATORY`.  `OMP_TARGET_OFFLOAD=DISABLED` has the opposite effect: it forces execution on the host
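
Because the fallback is silent, it can help to check from inside the program where a `target` region actually ran. The following is a minimal sketch, assuming your compiler provides the standard `omp_is_initial_device()` routine inside target regions; the function name `target_ran_on_host` and the non-OpenMP stub are made up for illustration.

```cpp
#ifdef _OPENMP
#include <omp.h>
#else
// stub (an assumption of this sketch) so it also builds when OpenMP is disabled
static int omp_is_initial_device(void) { return 1; }
#endif

// returns 1 if the target region ran on the host (fallback),
// and 0 if it ran on a device such as a GPU
int target_ran_on_host(void) {
  int on_host = 1;
  // map(from: ...) copies the value written in the region back to the host
#pragma omp target map(from: on_host)
  on_host = omp_is_initial_device();
  return on_host;
}
```

Calling `target_ran_on_host()` from `main` and printing the result should show 0 when offloading worked and 1 when the region fell back to the host.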

In [None]:
# force it to run on GPU or signal an error
OMP_TARGET_OFFLOAD=MANDATORY ./omp_target
# force it to run on the host even if GPU is available
OMP_TARGET_OFFLOAD=DISABLED ./omp_target

# 5. [`#pragma omp teams`](https://www.openmp.org/spec-html/5.0/openmpse15.html#x57-910002.7) $\sim$ creating thread blocks
## 5-1. basics
* <font color="blue">syntax</font>
```
#pragma omp target
#pragma omp teams
    S
```
creates a number of _teams_ and the master of each team will execute $S$

* it is similar to `#pragma omp parallel` in the sense that the effect is to have many threads execute the same statement
* you can think of `teams` as an extra layer of parallelism outside `parallel` (`parallel` is a construct that creates threads _within_ a team)

In [None]:
%%writefile omp_teams.cc
#include <stdio.h>
int main() {
  printf("hello on host\n");
#pragma omp target
#pragma omp teams
  printf("hello, I am the master of a team\n");
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_teams.cc -o omp_teams
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_teams.cc -o omp_teams

* Running

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_teams

* note:
  * `teams` should appear right inside `target`
  * as such, `target` and `teams` are often used in the combined form (`#pragma omp target teams`)

## 5-2. specifying the number of teams
* you can set the number of teams created by `teams` construct to $x$ either by
  * having `num_teams(x)` clause in the `teams` construct
  * setting `OMP_NUM_TEAMS=x` environment variable when running the command
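
The `num_teams(x)` clause form can be sketched as follows. This is only an illustration, not part of the exercise: the function name `count_teams` is hypothetical, and the sketch counts teams with a reduction instead of printing from each team. Note that the runtime is allowed to create fewer teams than requested.

```cpp
// counts how many teams were actually created when asking for `requested` teams
int count_teams(int requested) {
  int count = 0;
  // each team's master adds 1; the reduction sums the contributions of all teams,
  // and map(tofrom: count) brings the total back to the host
#pragma omp target teams num_teams(requested) map(tofrom: count) reduction(+: count)
  count += 1;
  return count;
}
```

For example, `count_teams(3)` is expected to return a value between 1 and 3; when compiled without OpenMP the pragma is ignored and it returns 1.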

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 ./omp_teams
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=5 ./omp_teams

## 5-3. getting team ID and the number of teams
* just as `omp_get_thread_num()` and `omp_get_num_threads()` tell you the thread ID and the number of threads of your team, you can get the team ID and the number of teams by
  * `omp_get_team_num()`
  * `omp_get_num_teams()`

In [None]:
%%writefile omp_team_num.cc
#include <stdio.h>
#include <omp.h>
int main() {
  printf("hello on host\n");
#pragma omp target
#pragma omp teams
  printf("in teams: %03d/%03d\n", omp_get_team_num(), omp_get_num_teams());
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_team_num.cc -o omp_team_num
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_team_num.cc -o omp_team_num

* Running

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=5 ./omp_team_num

# 6. [`#pragma omp distribute`](https://www.openmp.org/spec-html/5.0/openmpsu43.html#x66-1580002.9.4) $\sim$ distributing iterations to thread blocks
* <font color="blue">syntax</font>
```
#pragma omp target
#pragma omp teams
    {
      ...
#pragma omp distribute
      for (...) {
        ...
      }
    }
```
distributes iterations of the for-loop across teams


In [None]:
%%writefile omp_distribute.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  int m = (argc > i ? atoi(argv[i]) : 5); i++;
  printf("hello on host\n");
#pragma omp target
#pragma omp teams
  {
    printf("in teams: %03d/%03d\n", omp_get_team_num(), omp_get_num_teams());
#pragma omp distribute
    for (int i = 0; i < m; i++) {
      printf("in distribute: i=%03d executed by %03d/%03d\n",
             i, omp_get_team_num(), omp_get_num_teams());
    }
  }
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_distribute.cc -o omp_distribute
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_distribute.cc -o omp_distribute

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 ./omp_distribute 5

* execute the above command with different numbers of teams and different command-line arguments (the number of iterations) and make sense of the results

# <font color="green"> Problem 1 :  Understand teams and distribute</font>
* a small quiz before things get more confusing
* reason about which lines are executed by how many threads, and as a result, how many lines are printed when you run the above program with <font color="blue"><tt>OMP_NUM_TEAMS=$T$ ./omp_distribute $m$</tt></font>
* answer with an expression of $T$ and $m$
* you can easily check your answer by counting the number of lines using `wc` command

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 ./omp_distribute 5 | wc

BEGIN SOLUTION
END SOLUTION


* execute the following command with different numbers of teams and different command-line arguments (the number of iterations)

* note:
  * if there are no statements between `teams` and `distribute`, they can be combined into one directive, just as `parallel` and `for` can be combined
  * recall that `target` can be combined with `teams`, so you can combine all three

In [None]:
%%writefile omp_target_teams_distribute.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  int m = (argc > i ? atoi(argv[i]) : 5);
  printf("hello on host\n");
#pragma omp target teams distribute
  for (int i = 0; i < m; i++) {
    printf("in distribute: i=%03d executed by %03d/%03d\n",
           i, omp_get_team_num(), omp_get_num_teams());
  }
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_target_teams_distribute.cc -o omp_target_teams_distribute
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_target_teams_distribute.cc -o omp_target_teams_distribute

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 ./omp_target_teams_distribute 7

* note:
  * you can parallelize a loop with just `teams` and `distribute` without `parallel` and `for` described below
  * however, to effectively use GPUs, you need to use `parallel` within each team
  * while implementation dependent, you can think of a team as a single thread block, so only using teams, you end up creating many thread blocks each having only a single thread, resulting in very inefficient use of GPUs 

# 7. [`#pragma omp parallel`](https://www.openmp.org/spec-html/5.0/openmpse14.html#x54-800002.6) $\sim$ having threads in a thread block
## 7-1. `parallel` inside `teams`
* syntax:
```
#pragma omp target
#pragma omp teams
    {
      ...
#pragma omp parallel
      S
    }
```
creates a number of threads within each team

* recall that you used `parallel` to create threads when executing on CPUs
* used inside `teams`, it will create threads within the team, each executing $S$

* here is an example that illustrates it
<font color="blue"><tt>OMP_NUM_TEAMS=$T$ OMP_NUM_THREADS=$H$ ./omp_parallel</tt></font>
creates $T$ teams, each of which creates $H$ threads

In [None]:
%%writefile omp_parallel.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int getenv_int(const char * v) {
  char * s = getenv(v);
  if (!s) {
    fprintf(stderr, "specify environment variable %s\n", v);
    exit(1);
  }
  return atoi(s);
}

int main(int argc, char ** argv) {
  int n_threads= getenv_int("OMP_NUM_THREADS");
  if (n_threads != 1 && n_threads % 32) {
    fprintf(stderr, "OMP_NUM_THREADS (%d) must be 1 or a multiple of 32\n", n_threads);
    exit(1);
  }
  printf("hello on host\n");
#pragma omp target teams
  {
    printf("in teams: %03d/%03d\n", omp_get_team_num(), omp_get_num_teams());
#pragma omp parallel num_threads(n_threads)
    printf("in parallel: %03d/%03d %03d/%03d\n",
           omp_get_team_num(), omp_get_num_teams(),
           omp_get_thread_num(), omp_get_num_threads());
  }
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_parallel.cc -o omp_parallel
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_parallel.cc -o omp_parallel

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_parallel


* <font color="red">important remarks on the number of threads you specify in `parallel` directive</font>
  * on CPU, the number of threads created by `parallel` could be specified either with `OMP_NUM_THREADS=x` environment variable or `num_threads(x)` in `parallel` directive
  * but the `OMP_NUM_THREADS` environment variable seems to have no effect when executing on GPUs (I don't know whether it is an implementation issue or specification)
  * you have to use `num_threads(x)` if you need to set it, just as done above
  * or you can just omit it to leave it to the system
* also, it seems that with both clang and nvc, <font color=red>_the number of threads must be 1 or a multiple of 32_</font>
  * it does not even signal an error, so you must be careful not to unintentionally specify a wrong number
  * this is another reason to leave it to the system unless necessary
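
Since a wrong thread count is not reported as an error, one way to see what you actually got is to count the threads yourself. The sketch below is an assumption-laden illustration (the function name `threads_actually_created` is made up): it creates a single team, asks for `requested` threads, and lets every thread add 1 to a reduction variable.

```cpp
// returns the number of threads that actually executed the parallel region
// inside a single team when `requested` threads were asked for
int threads_actually_created(int requested) {
  int count = 0;
  // one team only, so the result is the thread count of that team
#pragma omp target teams num_teams(1) map(tofrom: count)
#pragma omp parallel num_threads(requested) reduction(+: count)
  count += 1;
  return count;
}
```

Comparing the return value with `requested` for a few values (for instance 1, 32, and 48) is a quick experiment to see how your compiler treats counts that are not a multiple of 32.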


# <font color="green"> Problem 2 :  Understand teams and parallel</font>
* a similar quiz about the combination of teams and parallel
* reason about which lines are executed by how many threads, and as a result, how many lines are printed when you run the above program with <font color="blue"><tt>OMP_NUM_TEAMS=$T$ OMP_NUM_THREADS=$H$ ./omp_parallel</tt></font>
* answer with an expression of $T$ and $H$
* you can easily check your answer by counting the number of lines using `wc` command

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_parallel | wc

BEGIN SOLUTION
END SOLUTION


## 7-2. `parallel` inside `distribute` inside `teams`
* more typically you call `parallel` inside `distribute` (which is necessarily inside `teams`), as you will be parallelizing loops
* there is nothing new syntactically

In [None]:
%%writefile omp_distribute_parallel.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int getenv_int(const char * v) {
  char * s = getenv(v);
  if (!s) {
    fprintf(stderr, "specify environment variable %s\n", v);
    exit(1);
  }
  return atoi(s);
}

int main(int argc, char ** argv) {
  int n_threads= getenv_int("OMP_NUM_THREADS");
  int i = 1;
  int m = (argc > i ? atoi(argv[i]) : 5); i++;
  if (n_threads != 1 && n_threads % 32) {
    fprintf(stderr, "OMP_NUM_THREADS (%d) must be 1 or a multiple of 32\n", n_threads);
    exit(1);
  }
  printf("hello on host\n");
#pragma omp target teams
  {
    printf("in teams: %03d/%03d\n", omp_get_team_num(), omp_get_num_teams());
#pragma omp distribute
    for (int i = 0; i < m; i++) {
      printf("in distribute: i=%03d executed by %03d/%03d\n",
             i, omp_get_team_num(), omp_get_num_teams());
#pragma omp parallel num_threads(n_threads)
      printf("in parallel: i=%03d %03d/%03d %03d/%03d\n",
             i, omp_get_team_num(), omp_get_num_teams(),
             omp_get_thread_num(), omp_get_num_threads());
    }
  }
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_distribute_parallel.cc -o omp_distribute_parallel
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_distribute_parallel.cc -o omp_distribute_parallel

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_distribute_parallel 5

# <font color="green"> Problem 3 :  Understand teams, distribute, and parallel</font>
* a similar quiz about the combination of teams, distribute, and parallel
* reason about which lines are executed by how many threads, and as a result, how many lines are printed when you run the above program with <font color="blue"><tt>OMP_NUM_TEAMS=$T$ OMP_NUM_THREADS=$H$ ./omp_distribute_parallel $m$</tt></font>
* answer with an expression of $T$, $H$, and $m$
* you can easily check your answer by counting the number of lines using `wc` command

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_distribute_parallel 5 | wc

BEGIN SOLUTION
END SOLUTION


# 8. [`#pragma omp for`](https://www.openmp.org/spec-html/5.0/openmpsu41.html#x64-1290002.9.2) $\sim$ distributing iterations to threads within a thread block
* syntax:
```
#pragma omp target
#pragma omp teams
    {
      ...
#pragma omp distribute
      for (...) {
        ...
#pragma omp parallel
        {
          ...
#pragma omp for
          for (...) {
            ...
          }
        }
      }
    }
```

* used inside `parallel`, it will distribute iterations of the loop to threads


In [None]:
%%writefile omp_for.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int getenv_int(const char * v) {
  char * s = getenv(v);
  if (!s) {
    fprintf(stderr, "specify environment variable %s\n", v);
    exit(1);
  }
  return atoi(s);
}

int main(int argc, char ** argv) {
  int n_threads = getenv_int("OMP_NUM_THREADS");
  int i = 1;
  int m = (argc > i ? atoi(argv[i]) : 5); i++;
  int n = (argc > i ? atoi(argv[i]) : 6); i++;
  if (n_threads != 1 && n_threads % 32) {
    fprintf(stderr, "OMP_NUM_THREADS (%d) must be 1 or a multiple of 32\n", n_threads);
    exit(1);
  }
  printf("hello on host\n");
#pragma omp target teams
  {
    printf("in teams: %03d/%03d\n", omp_get_team_num(), omp_get_num_teams());
#pragma omp distribute
    for (int i = 0; i < m; i++) {
      printf("in distribute: i=%03d executed by %03d/%03d\n",
             i, omp_get_team_num(), omp_get_num_teams());
#pragma omp parallel num_threads(n_threads)
      {
        // braces make the worksharing loop below part of the parallel region
        printf("in parallel: i=%03d %03d/%03d %03d/%03d\n",
               i, omp_get_team_num(), omp_get_num_teams(),
               omp_get_thread_num(), omp_get_num_threads());
#pragma omp for
        for (int j = 0; j < n; j++) {
          printf("in for: i=%03d j=%03d executed by %03d/%03d %03d/%03d\n",
                 i, j, omp_get_team_num(), omp_get_num_teams(),
                 omp_get_thread_num(), omp_get_num_threads());
        }
      }
    }
  }
  printf("back on host\n");
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_for.cc -o omp_for
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_for.cc -o omp_for

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_for 5 6

# <font color="green"> Problem 4 :  Understand teams, distribute, parallel, and for</font>
* a similar quiz about the combination of teams, distribute, parallel, and for
* reason about which lines are executed by how many threads, and as a result, how many lines are printed when you run the above program with <font color="blue"><tt>OMP_NUM_TEAMS=$T$ OMP_NUM_THREADS=$H$ ./omp_for $m$ $n$</tt></font>
* answer with an expression of $T$, $H$, $m$, and $n$
* you can easily check your answer by counting the number of lines using `wc` command

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_for 5 6 | wc

BEGIN SOLUTION
END SOLUTION


# 9. Common combined directives
* anybody who has a right mind will feel sick with the whole series of different directive names that have little or no consistency
* each of them is nominally an independent, standalone directive, but many of them are almost always used together in practice
* since the purpose is often to execute a loop nest in parallel, most typically they are used in one of the following forms

* combine everything 
```
#pragma omp target teams distribute parallel for
    for (...) {
      ...
    }
```

* parallelize an outer loop with `teams` $+$ `distribute` and an inner loop with `parallel` $+$ `for`

```
#pragma omp target teams distribute
    for (...) {
#pragma omp parallel for
      for (...) {
        ...
      }
    }  
```
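
As a concrete instance of the second form, here is a minimal sketch that fills a small table on the device and sums it on the host. The names `M`, `N`, and `sum_of_products` are made up for illustration, and the `map(from: c)` clause (covered in Section 10) is written explicitly only to make the data movement visible.

```cpp
enum { M = 4, N = 8 };

// fills c[i][j] = i * j on the device and returns the sum of all entries
int sum_of_products(void) {
  int c[M][N];
  // rows (outer loop) are distributed across teams
#pragma omp target teams distribute map(from: c)
  for (int i = 0; i < M; i++) {
    // columns (inner loop) are distributed across the threads of the team
#pragma omp parallel for
    for (int j = 0; j < N; j++) {
      c[i][j] = i * j;
    }
  }
  // back on the host: add up the results copied from the device
  int sum = 0;
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      sum += c[i][j];
    }
  }
  return sum;
}
```

Each `c[i][j]` is written by exactly one thread, so no synchronization is needed; the sum is $(0+1+2+3) \times (0+1+\cdots+7) = 6 \times 28 = 168$.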

# 10. [`#pragma omp target data`](https://www.openmp.org/spec-html/5.0/openmpsu57.html#x83-2580002.12.2) $\sim$ mapping data between the host CPU and GPU
* in CUDA programming, the only data transfer that more or less automatically occurs is passing call-by-value arguments (scalars and structures)
* arrays and data pointed to by pointers must all be explicitly (1) allocated on GPU memory by `cudaMalloc` and (2) moved between CPU and GPU by `cudaMemcpy`, which quickly becomes tedious and error-prone
* what we are conceptually doing when programming in CUDA is to maintain the mapping between data address on CPU and corresponding data address on GPU and synchronize their contents (data in that address) when necessary
```
a = malloc(...); // data on CPU @ a
cudaMalloc(&a_dev, ...); // data on GPU @ a_dev
cudaMemcpy(a_dev, a, ...); // move contents a[..] -> a_dev[..]
  ...
cudaMemcpy(a, a_dev, ...); // move contents a[..] <- a_dev[..]
```

* `target data` and its `map` clauses make it possible to do this task more easily and declaratively

* <font color="red">Warning:</font> I could not (and do not want to) decipher this [super lawyerish spec document about it](https://www.openmp.org/spec-html/5.0/openmpsu109.html#x142-6180002.19.7) to fully understand the behavior of `map` clauses
* I am trying to explain it hopefully in a more non-lawyer-friendly and straight-to-the-point way, but part of it is not backed up by the spec document but rather based on actual experiments and my imagination and common sense about what the implementation is doing
* when you are not sure, play safe or conduct a similar experiment yourself

* <font color="blue">syntax:</font>
```
#pragma omp target data map(to: ...) map(from: ...) map(tofrom: ...) ...
    S
```
where ... is a variable, array name, or base address + range (e.g., a[0:n])

* basically, these clauses say that the specified variables, arrays, or address ranges are valid expressions from which you can get the "expected" values during or after $S$
* more specifically, 
  * those specified in `map(to: ...)` become valid on GPU during $S$
  * those specified in `map(from: ...)` become valid on CPU after $S$
* to accomplish that, the <font color="blue">_mapping_</font> between CPU addresses and GPU addresses is maintained by the runtime system and contents may be moved to or from GPU as necessary
  * data specified in `map(to: ...)` may be copied to GPU (CPU -> GPU) before $S$
  * data specified in `map(from: ...)` may be copied from GPU (GPU -> CPU) after $S$
* `map(tofrom: ...)` has the effect of both; it makes data available to GPU during $S$ and to CPU after $S$

* it helps to think of it as having two effects
  * one is "transfer data" that may be accessed from GPU
  * the other is "redirecting pointers" so that the same expression (e.g., a, a[i], p->x) accesses different locations depending on whether you are on GPU or CPU

* you typically use this directive together with `#pragma omp target` and you can in fact specify these clauses in `#pragma omp target`
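
To make the syntax concrete, here is a minimal sketch of a `target data` region that encloses two `target` regions, so the array is copied to the GPU once before the first and copied back once after the second. The function name `scale_then_shift` is hypothetical, and the sketch assumes `a` points to at least `n` valid elements on the CPU.

```cpp
// doubles and then increments a[0], ..., a[n-1] on the device
void scale_then_shift(float * a, int n) {
  // a[0:n] is valid on GPU during the block and valid on CPU after it
#pragma omp target data map(tofrom: a[0:n])
  {
    // first kernel: a[i] is already on the device, so nothing is re-sent
#pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++) {
      a[i] *= 2.0f;
    }
    // second kernel: works on the values left on the device by the first
#pragma omp target teams distribute parallel for
    for (int i = 0; i < n; i++) {
      a[i] += 1.0f;
    }
  }
}
```

Without the enclosing `target data`, each `target` region would need its own `map` clause and the array would typically be transferred for every region.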

## 10-1. local variables and arrays
* local variables and arrays that do not appear in any `map` clause are sent to GPU automatically
* so, normally, you don't have to write anything to use (i.e., read) local variables/arrays visible in the scope of the `#pragma omp target` directive
* the following program demonstrates that; note that a local array (`a`) and a structure (`p`) are available on GPU without any declaration

In [None]:
%%writefile omp_map_local.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
struct point { float x; float y; };
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  point p = { t + 3, t + 4 };
  // you do not have to explicitly say anything about t, a, or p.
  // they are automatically available on GPU
#pragma omp target
  {
    printf("t = %f\n", t);
    printf("a = { %f, %f, %f }\n", a[0], a[1], a[2]);
    printf("p = { %f, %f }\n", p.x, p.y);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_local.cc -o omp_map_local
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_local.cc -o omp_map_local

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_local

## 10-2. need map(from: $x$) or map(tofrom: $x$) to get the result back
* the following code fails to obtain the result written to variable `t`
  * to my surprise, values written to `a` and `p` are available back on CPU
  * I didn't try to decipher [the lawyerish spec document](https://www.openmp.org/spec-html/5.0/openmpsu109.html#x142-6180002.19.7) to understand this behavior
  * for now, I think it's a safe bet to always specify variables through which you want to obtain results from GPU when you are not sure

* you need to specify `map` clause for `t`, either with `map(from: t)` when you don't have to send the value set by CPU to GPU, or with `map(tofrom: t)` when you have to


In [None]:
%%writefile omp_map_from.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
struct point {
  float x;
  float y;
};

int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  point p = { t + 3, t + 4 };
#pragma omp target 
  {
    t *= 2.0;
    for (int i = 0; i < 3; i++) a[i] *= 2.0;
    p.x *= 2.0; p.y *= 2.0;
  }
  printf("t = %f\n", t);
  printf("a = { %f, %f, %f }\n", a[0], a[1], a[2]);
  printf("p = { %f, %f }\n", p.x, p.y);
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_from.cc -o omp_map_from
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_from.cc -o omp_map_from

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_from

# <font color="green"> Problem 5 :  Use map(from: ..) or map(tofrom: ..) to get the result back</font>
* add an appropriate `map` clause above so the CPU can get all the results back

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_from

## 10-3. global variables and arrays
* global variables and arrays are similar to local variables and arrays in that they are sent to GPU automatically when they do not appear in any `map` clause 
* again, the opposite is not true

In [None]:
%%writefile omp_map_global.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
struct point {
  float x;
  float y;
};

float t;
float a[3];
point p;

int main(int argc, char ** argv) {
  int i = 1;
  t = (argc > i ? atof(argv[i]) : 10.0); i++;
  for (int i = 0; i < 3; i++) { a[i] = t + i; }
  p.x = t + 3; p.y = t + 4;
#pragma omp target
  {
    printf("t = %f\n", t);
    printf("a = { %f, %f, %f }\n", a[0], a[1], a[2]);
    printf("p = { %f, %f }\n", p.x, p.y);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_global.cc -o omp_map_global
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_global.cc -o omp_map_global

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_global

## 10-4. what happens on pointers?
* interestingly, a local pointer pointing to another local variable or an array mapped by a map clause (or a lack thereof) gets automatically "redirected" so that it points to the GPU version
* in the following program, data accesses through a pointer `pa` are valid without any map clause, as the data it points to (`a`) are automatically mapped on GPU

In [None]:
%%writefile omp_map_ptr.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  float * pa = a;
#pragma omp target
  {
    printf(" a = { %f, %f, %f }\n", a[0], a[1], a[2]);
    printf("pa = { %f, %f, %f }\n", pa[0], pa[1], pa[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_ptr.cc -o omp_map_ptr
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_ptr.cc -o omp_map_ptr

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_ptr

* it's interesting to see the addresses of these data
* the addresses of array `a` are naturally different between CPU and GPU
* remarkably, the addresses held in a pointer variable `pa` are _adjusted_ so it now points to the GPU version of `a`

In [None]:
%%writefile omp_map_ptr_with_addr.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  float * pa = a;
  printf("[host]  a @ %p = { %f, %f, %f }\n", a, a[0], a[1], a[2]);
  printf("[host] pa @ %p = { %f, %f, %f }\n", pa, pa[0], pa[1], pa[2]);
#pragma omp target
  {
    printf("[dev ]  a @ %p = { %f, %f, %f }\n", a, a[0], a[1], a[2]);
    printf("[dev ] pa @ %p = { %f, %f, %f }\n", pa, pa[0], pa[1], pa[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_ptr_with_addr.cc -o omp_map_ptr_with_addr
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_ptr_with_addr.cc -o omp_map_ptr_with_addr

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_ptr_with_addr

* this adjustment happens because `a` is mapped on the GPU as well, since expressions involving `a`, such as `a[0]`, `a[1]`, etc., appear in the target statement
* if you remove the first statement to leave only the expressions involving `pa`, the adjustment does not occur and you get an error

In [None]:
%%writefile omp_map_ptr_err.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  float * pa = a;
#pragma omp target
  {
    printf("pa = { %f, %f, %f }\n", pa[0], pa[1], pa[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_ptr_err.cc -o omp_map_ptr_err
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_ptr_err.cc -o omp_map_ptr_err

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_ptr_err

## 10-5. a pointer buried in another data
* another situation you need to explicitly handle data mapping is when a pointer is buried in another data structure (e.g., a struct containing a pointer)
* such a pointer is not automatically _adjusted_ even if it happens to point to a local variable or an array that will be mapped automatically or by an explicit `map` clause

* here is an example

In [None]:
%%writefile omp_map_ptr_in_data.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
struct cell {
  float x;
  float * a;
};

int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  cell c = { t + 3, a };
#pragma omp target
  {
    printf("  t = %f\n", t);
    printf("  a = { %f, %f, %f }\n", a[0], a[1], a[2]);
    printf("c.x = %f\n", c.x);
    printf("c.a = %p\n", c.a);
    printf("c.a = { %f, %f, %f }\n", c.a[0], c.a[1], c.a[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_ptr_in_data.cc -o omp_map_ptr_in_data
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_ptr_in_data.cc -o omp_map_ptr_in_data

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_ptr_in_data

# <font color="green"> Problem 6 :  make pointer in another data structure valid</font>
* specify a map clause to indicate that you want to read `c.a[0:3]` in GPU
* <font color="red">if you do that, however, a surprising side effect happens (another thing I couldn't yet get from the spec); witness it yourself and fix it</font>

In [None]:
%%writefile omp_map_ptr_in_data.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
struct cell {
  float x;
  float * a;
};

int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float a[3] = { t, t + 1, t + 2 };
  cell c = { t + 3, a };
#pragma omp target
  {
    printf("  t = %f\n", t);
    printf("  a = { %f, %f, %f }\n", a[0], a[1], a[2]);
    printf("c.x = %f\n", c.x);
    printf("c.a = %p\n", c.a);
    printf("c.a = { %f, %f, %f }\n", c.a[0], c.a[1], c.a[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_ptr_in_data.cc -o omp_map_ptr_in_data
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_ptr_in_data.cc -o omp_map_ptr_in_data

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_ptr_in_data

## 10-6. pointer to heap-allocated data
* it's not that everything is handled so nicely, of course
* the most basic situation you need to handle yourself is a pointer to heap-allocated data (by `malloc` or `new`, or anything other than local/global variables/arrays visible and used in `target`, as a matter of fact)
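* in these cases you need to explicitly specify a pointer and a range you want to make valid on GPU, by a range expression (array section) like <font color="blue">_p_[_start_:_length_]</font> or <font color="blue">_p_[_start_:_length_:_stride_]</font>; note that the second component is the number of elements, not the end index (e.g., `a[0:n]` covers `a[0]` through `a[n-1]`)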

* here is an example

In [None]:
%%writefile omp_map_heap.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float * a = new float[3];     // heap-allocated data
  for (int i = 0; i < 3; i++) { a[i] = t + i; }
#pragma omp target
  {
    printf("t = %f\n", t);
    printf("a = { %f, %f, %f }\n", a[0], a[1], a[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_heap.cc -o omp_map_heap
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_heap.cc -o omp_map_heap

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_heap

# <font color="green"> Problem 7 :  Use `map` clause (with a range expression) to make pointer to heap valid</font>
* add an appropriate `map` clause so the GPU can get data in array `a` from CPU


In [None]:
%%writefile omp_map_heap.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float * a = new float[3];     // heap-allocated data
  for (int i = 0; i < 3; i++) { a[i] = t + i; }
#pragma omp target
  {
    printf("t = %f\n", t);
    printf("a = { %f, %f, %f }\n", a[0], a[1], a[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu omp_map_heap.cc -o omp_map_heap
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_map_heap.cc -o omp_map_heap

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY ./omp_map_heap

## 10-7. GOOD NEWS: `nvc(++) -gpu=mem:managed` makes heap-allocated data automatically shared
* If you give the `-gpu=mem:managed` option to the NVIDIA HPC SDK compilers (`nvc` or `nvc++`), heap-allocated data (data allocated by `malloc` or `new`) is automatically shared between CPU and GPU
* This makes working on pointer-based data structures particularly easy

* Notes:
  * This is presumably implemented by replacing calls to `malloc` by `cudaMallocManaged`
  * Data allocated by `mmap` is NOT shared on this platform
  * More recent operating systems that support heterogeneous memory management (HMM) share mmap-allocated data, too
    * [Details](https://developer.nvidia.com/blog/simplifying-gpu-application-development-with-heterogeneous-memory-management/)
  * More recent GPUs that support hardware unified memory share local and global variables, too


In [None]:
%%writefile omp_nomap_heap.cc
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char ** argv) {
  int i = 1;
  float t = (argc > i ? atof(argv[i]) : 10.0); i++;
  float * a = new float[3];     // heap-allocated data
  for (int i = 0; i < 3; i++) { a[i] = t + i; }
#pragma omp target
  {
    printf("t = %f\n", t);
    printf("a = { %f, %f, %f }\n", a[0], a[1], a[2]);
  }
  return 0;
}

In [None]:
nvc++ -mp=gpu -gpu=mem:managed omp_nomap_heap.cc -o omp_nomap_heap

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_nomap_heap

# 11. Summary : when is `map` clause necessary?
 |allocated as/by       |syntax                        |CPU -> GPU         |GPU -> CPU         | Remarks |
 |----------------------|------------------------------|-------------------|-------------------|---------|
 |local/global variable |`int v;`                      |                   |`map(from:v)`      |         |
 |local/global array    |`int a[N];`                   |                   |`map(from:a[p:q])` |         |
 |malloc/new            |`int * h = (int *)malloc(..);`|`map(to:h[p:q])`   |`map(from:h[p:q])` | \*      |
 |                      |`int * h = new int[N];`       |`map(to:h[p:q])`   |`map(from:h[p:q])` | \*      |
 |mmap                  |`int * h = (int *)mmap(..);`  |`map(to:h[p:q])`   |`map(from:h[p:q])` |         |

* \* unnecessary with the `nvc++ -gpu=mem:managed` option
* Therefore, if you
  * restrict GPU-to-CPU communication to data allocated by `malloc` or `new` (i.e., not through local or global variables), and
  * do not use `mmap`,

  then the `nvc++ -gpu=mem:managed` option makes `map` clauses unnecessary
* That is, the data will largely be transparently shared between the GPU and CPU


# 12. Visualizing execution
* Let's perform the same experiment we did for multicore before, this time on GPU
* The program below executes the function `iter_fun` in each iteration of the following loop
```
#pragma omp target teams distribute parallel for num_teams(n_teams) num_threads(n_threads_per_team)
  for (long i = 0; i < L; i++) {
    iter_fun(a, b, i, M, N, R, T);
  }
```

* `iter_fun(a, b, i, M, N, R, T)` repeats x = a x + b many (M * N) times and records the GPU clock every N iterations

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_gpu_sched_rec.cc
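#include <assert.h>
#include <err.h>
#include <limits.h>   // LONG_MAX (used in dump)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <omp.h>
#if __NVCOMPILER
#include <nv/target>  // nv::target::is_device (NVIDIA compiler only)
#endif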

long cur_time_ns() {
  struct timespec ts[1];
  if (clock_gettime(CLOCK_REALTIME, ts) == -1) err(1, "clock_gettime");
  return ts->tv_sec * 1000000000L + ts->tv_nsec;
}

#if __NVCOMPILER
/* get SM id (for NVIDIA compiler).
   return -1 if called on CPU */
__host__ __device__ static unsigned int get_smid(void) {
  if target(nv::target::is_device) {
    unsigned int sm;
    asm("mov.u32 %0, %%smid;" : "=r"(sm));
    return sm;
  } else {
    return (unsigned int)(-1);
  }
}
#endif

#if __clang__
/* get SM id (for Clang LLVM compiler).
   return -1 if called on CPU */
__attribute__((unused))
static unsigned int get_smid(void) {
#if __CUDA_ARCH__
  unsigned int sm;
  asm("mov.u32 %0, %%smid;" : "=r"(sm));
  return sm;
#else
  return (unsigned int)(-1);
#endif
}

/* get GPU clock (for Clang LLVM compiler).
   return -1 if called on CPU */
__attribute__((unused))
static long long int clock64(void) {
#if __CUDA_ARCH__
  long long int clock;
  // "=l" : 64-bit register operand ("=r" is a 32-bit register)
  asm volatile("mov.s64 %0, %%clock64;" : "=l" (clock));
  return clock;
#else
  return -1;
#endif
}
#endif

__attribute__((unused))
static long long int get_gpu_clock(void) {
  long long int t = 0;
#pragma omp target map(from: t)
  t = clock64();
  return t;
}

typedef struct {
  double x;
  int team[2];
  int thread[2];
  int sm[2];
} record_t;

/* the function for an iteration
   perform
   x = a x + b
   (M * N) times and record the GPU clock
   every N iterations to T.
   record team, thread and SM to R.
 */
void iter_fun(double a, double b, long i, long M, long N,
              record_t * R, long * T) {
  // initial value (not important)
  double x = i;
  // record in T[i * M] ... T[(i+1) * M - 1]
  T = &T[i * M];
  // record starting team/thread/SM
  R[i].team[0] = omp_get_team_num();
  R[i].thread[0] = omp_get_thread_num();
  R[i].sm[0] = get_smid();
  // repeat a x + b many times.
  // record time every N iterations
  for (long j = 0; j < M; j++) {
    T[j] = clock64();
    for (long k = 0; k < N; k++) {
      x = a * x + b;
    }
  }
  // record ending team/thread/SM (should be the same as at the start)
  R[i].team[1] = omp_get_team_num();
  R[i].thread[1] = omp_get_thread_num();
  R[i].sm[1] = get_smid();
  // record result, just so that the computation is not
  // eliminated by the compiler
  R[i].x = x;
}

void dump(record_t * R, long * T, long L, long M) {
  long t0 = LONG_MAX;
  long k = 0;
  assert(L * M > 0);
  // find min clock
  for (long i = 0; i < L; i++) {
    for (long j = 0; j < M; j++) {
      t0 = (T[k] < t0 ? T[k] : t0);
      k++;
    }
  }
  assert(t0 < LONG_MAX);
  k = 0;
  for (long i = 0; i < L; i++) {
    printf("i=%ld x=%f team0=%d thread0=%d sm0=%d team1=%d thread1=%d sm1=%d",
           i, R[i].x,
           R[i].team[0], R[i].thread[0], R[i].sm[0],
           R[i].team[1], R[i].thread[1], R[i].sm[1]);
    for (long j = 0; j < M; j++) {
      printf(" %ld", T[k] - t0);
      k++;
    }
    printf("\n");
  }
}

int getenv_int(const char * v) {
  char * s = getenv(v);
  if (!s) {
    fprintf(stderr, "specify environment variable %s\n", v);
    exit(1);
  }
  return atoi(s);
}

int main(int argc, char ** argv) {
  int idx = 1;
  long L   = (idx < argc ? atol(argv[idx]) : 100);  idx++;
  long M   = (idx < argc ? atol(argv[idx]) : 100);  idx++;
  long N   = (idx < argc ? atol(argv[idx]) : 100);  idx++;
  double a = (idx < argc ? atof(argv[idx]) : 0.99); idx++;
  double b = (idx < argc ? atof(argv[idx]) : 1.00); idx++;
  int n_teams = getenv_int("OMP_NUM_TEAMS");
  int n_threads_per_team = getenv_int("OMP_NUM_THREADS");
  record_t * R = (record_t *)calloc(L, sizeof(record_t));
  long * T = (long *)calloc(L * M, sizeof(long));
  long t0 = get_gpu_clock();
#pragma omp target teams distribute parallel for num_teams(n_teams) num_threads(n_threads_per_team) map(tofrom: R[:L]) map(tofrom: T[:L*M])
  for (long i = 0; i < L; i++) {
    iter_fun(a, b, i, M, N, R, T);
  }
  long t1 = get_gpu_clock();
  printf("%ld GPU clocks\n", t1 - t0);
  dump(R, T, L, M);
  return 0;
}


In [None]:
BEGIN SOLUTION
END SOLUTION
nvc++ -mp=gpu -cuda omp_gpu_sched_rec.cc -o omp_gpu_sched_rec
# clang++ -Wall -fopenmp -fopenmp-targets=nvptx64 omp_gpu_sched_rec.cc -o omp_gpu_sched_rec

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_TARGET_OFFLOAD=MANDATORY OMP_NUM_TEAMS=3 OMP_NUM_THREADS=32 ./omp_gpu_sched_rec > a.dat

* Execute the following cell to visualize it
* In the graph,
  * horizontal axis is the time from the start in the number of clock cycles on GPU
  * vertical axis is the iteration number (i)
  * the color represents the thread that executed the iteration


In [None]:
BEGIN SOLUTION
END SOLUTION
import omp_gpu_sched_vis
omp_gpu_sched_vis.sched_plt(["a.dat"])
# omp_gpu_sched_vis.sched_plt(["a.dat"], start_t=1.5e7, end_t=2.0e7, show_every=1)

# <font color="green"> Problem 8 :  Understanding scheduling by visualization</font>
* Set the number of teams to 1 and increase the number of threads (per team) from 32 to larger numbers, to find how many iterations can execute almost simultaneously in a single team (i.e., on a single SM)
  * use the `show_every` parameter to reduce the number of iterations visualized
* Then fix the number of threads per team and increase the number of teams, again to find how many iterations can execute almost simultaneously on the whole device
* Find the equivalent number on CPU and compare the two
  * Unsurprisingly, you will find that the number for GPU is much larger than that for CPU
  * Make no mistake; CPU has other axes of parallelism (SIMD and superscalar) that cannot be tapped just by using multiple cores (`omp parallel`), which we will see later in this course (do not interpret the ratio between the two numbers as the ratio of their peak performance)
  * Still, it's safe to say GPU "simplifies" high-performance programming, in the sense that the effort required to tap all available hardware-level parallelism is much lower if the program has ample loop-level parallelism (the number of independently executable iterations)


# <font color="green"> Problem 9 :  Putting them together: calculating an integral</font>
Write an OpenMP program that calculates

$$ \int \int_D \sqrt{1 - x^2 - y^2}\,dx\,dy $$

where

$$ D = \{\;(x, y)\;|\;0\leq x \leq 1, 0\leq y \leq 1, x^2 + y^2 \leq 1 \}$$

<font color=red>on GPU.</font>

* Note: an alternative way to put it is to calculate

$$ \int_0^1 \int_0^1 f(x, y)\,dx\,dy $$

where

$$ f(x, y) = \left\{\begin{array}{ll}\sqrt{1 - x^2 - y^2} & (x^2 + y^2 \leq 1) \\ 0 & (\mbox{otherwise}) \end{array}\right. $$

* Use a nested loop to calculate the double integral
* Use `target`, `teams`, `distribute`, `parallel`, and `for` to execute it on GPU
* The result should be close to $\pi/6 = 0.52359..$ (1/8 of the volume of the unit ball)
* Play with the number of infinitesimal intervals for integration and the number of threads so that you can observe a speedup

* Take the number of thread blocks (passed to `num_teams(..)`) and the number of threads per block (passed to `num_threads(..)`) from the command line

* Compare the execution speed of OpenMP (CPU), CUDA (GPU), and OpenMP (GPU) in various settings
  * a single CPU thread vs single CUDA thread
  * a single CPU thread vs multiple CUDA threads in a single thread block 
  * multiple CPU threads vs multiple CUDA threads in multiple thread blocks


In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_gpu_integral.cc


In [None]:
nvc++ -O4 -mp=gpu omp_gpu_integral.cc -o omp_gpu_integral

In [None]:
OMP_TARGET_OFFLOAD=MANDATORY ./omp_gpu_integral