
#  OpenMP Programming Tutorial and Hands-on


Enter your name and student ID.

 * Name:
 * Student ID:



# 1. OpenMP
* <a href="http://openmp.org/" target="_blank" rel="noopener">OpenMP</a> is the de facto standard programming model for multicore environments
* More recently, it supports GPU offloading
* We are going to learn OpenMP both for CPU (multicore) and GPU programming
* See <a href="https://www.openmp.org/spec-html/5.0/openmp.html" target="_blank" rel="noopener">the spec</a>


# 2. Compilers
* We use [NVIDIA HPC SDK ver. 24.9](https://docs.nvidia.com/hpc-sdk/index.html) (`nvc` and `nvc++`) and [LLVM ver. 18.1.8](https://llvm.org/) (`clang` and `clang++`) for C/C++ compilers, as they support OpenMP GPU offloading

## 2-1. Set up NVIDIA CUDA and HPC SDK
Execute this before you use NVIDIA HPC SDK

In [None]:
export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/24.9/compilers/bin:$PATH
export PATH=/usr/local/cuda/bin:$PATH

* Check if it works (check if full paths of nvc/nvc++ are shown)
* make sure
  * `which nvc` and `which nvc++` show `/opt/nvidia/...`
  * `which nvcc` shows `/usr/local/...`

In [None]:
which nvc
which nvc++
which nvcc

## 2-2. Set up LLVM
Execute this before you use LLVM

In [None]:
export PATH=/home/share/llvm/bin:$PATH
export LD_LIBRARY_PATH=/home/share/llvm/lib:/home/share/llvm/lib/x86_64-unknown-linux-gnu:$LD_LIBRARY_PATH

Check if it works (check if full paths of clang/clang++ are shown)

In [None]:
which clang
which clang++

# 3. Compiling and running OpenMP programs
* Summary
  * `clang`/`clang++` : give `-fopenmp` option
  * `nvc`/`nvc++` : give `-mp` option
  * Set `OMP_NUM_THREADS` environment variable when running the executable

In [None]:
%%writefile omp_hello.c
#include <stdio.h>

int main() {
  printf("hello\n");
#pragma omp parallel
  printf("world\n");
  printf("good bye\n");
  return 0;
}

* Compiling with clang
* Add `-fopenmp` option to compile OpenMP programs
* Other generally useful options
  * `-Wall` warns about many kinds of suspicious code
  * `-O3` maximally optimizes code for performance

In [None]:
clang -fopenmp omp_hello.c -o omp_hello_clang

* Compiling with nvc
* Add `-mp` option to compile OpenMP programs
* Other generally useful options
  * `-Wall` warns about many kinds of suspicious code
  * `-O4` maximally optimizes code for performance

In [None]:
nvc -mp omp_hello.c -o omp_hello_nvc

* Running

* Set environment variable `OMP_NUM_THREADS` to the number of threads created by `#pragma omp parallel`

In [None]:
OMP_NUM_THREADS=3 ./omp_hello_clang

In [None]:
OMP_NUM_THREADS=3 ./omp_hello_nvc

# <font color="green"> Problem 1 :  Change the number of threads</font>
* Execute them with various numbers of threads and see what happens

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=3 ./omp_hello_clang

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=3 ./omp_hello_nvc

# 4. `#pragma omp parallel`
* [#pragma omp parallel](https://www.openmp.org/spec-html/5.0/openmpse14.html#x54-800002.6) creates a _team_ of threads, each of which executes the statement below
* Note that only the statement that is right below the pragma is executed by the team of threads
* Of course, the statement can be a compound statement and/or include a function call, so each thread can actually execute an arbitrary number of statements
* See [Determining the Number of Threads for a parallel Region](https://www.openmp.org/spec-html/5.0/openmpsu35.html#x55-880002.6.1) for more details on the number of threads created by `#pragma omp parallel`

# <font color="green"> Problem 2 :  Executing multiple statements by threads</font>
* Change the following program so that both "world" and "good bye" are printed as many times as the number of threads

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_hello.c
#include <stdio.h>

int main() {
  printf("hello\n");
#pragma omp parallel
  printf("world\n");
  printf("good bye\n");
  return 0;
}

* Below, choose `clang` or `nvc` depending on your taste by commenting out the other one
* Below, I chose `clang` by commenting out `nvc`

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_hello.c -o omp_hello
# nvc -mp omp_hello.c -o omp_hello

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=3 ./omp_hello

# 5. `omp_get_num_threads()` and `omp_get_thread_num()`
* When threads are executing a statement with `#pragma omp parallel`,
  * they are said to be in a _parallel region_
  * they are called a _team_ of threads

* While a thread is executing a parallel region,
  * [omp_get_num_threads()](https://www.openmp.org/spec-html/5.0/openmpsu111.html#x148-6450003.2.2) returns the number of threads in the team 
  * [omp_get_thread_num()](https://www.openmp.org/spec-html/5.0/openmpsu113.html#x150-6570003.2.4) returns the unique id of the calling thread within the team (0, 1, ..., the number of threads in the team - 1)
* You need `#include <omp.h>` to use these functions or any OpenMP API functions, for that matter


# <font color="green"> Problem 3 :  Using `omp_get_num_threads()` and `omp_get_thread_num()`</font>
* Change the following program so that each thread prints its id in the team and the number of threads in the team, like this.  The exact order of lines may differ.  Strictly speaking, even characters in two lines can be mixed into a single line.

```
hello
0/5 world
4/5 world
1/5 world
3/5 world
2/5 world
good bye
```

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_hello_id.c
#include <stdio.h>

int main() {
  printf("hello\n");
#pragma omp parallel
  {
    printf("world\n");
    printf("good bye\n");
  }
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_hello_id.c -o omp_hello_id
# nvc -mp omp_hello_id.c -o omp_hello_id

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=3 ./omp_hello_id

# 6. `#pragma omp for`
* `#pragma omp parallel` merely creates a team of threads executing the same statement
* In this sense, `#pragma omp parallel` alone cannot make a program run faster with multiple cores
* A program can be made faster only when you _divide_ the work among threads (work-sharing)
* `#pragma omp for` lets you divide the iterations of a loop among the threads created by `#pragma omp parallel`

# <font color="green"> Problem 4 :  How does `#pragma omp for` divide iterations to threads?</font>
* Execute the following cell and observe which iteration is executed by which thread
* Based on the observation, change the number of iterations and threads and predict the mapping between iterations and threads

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_for.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
#pragma omp parallel
  {
    printf("I am thread %d in a team of %d threads\n",
           omp_get_thread_num(), omp_get_num_threads());
#pragma omp for
    for (int i = 0; i < 24; i++) {
      usleep(100 * 1000 * i);
      printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
      fflush(stdout);
    }
  }
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_for.c -o omp_for
# nvc -mp omp_for.c -o omp_for

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=4 ./omp_for

## 6-1. for loops allowed by `#pragma omp for`
* There is a severe syntactic restriction on the kind of for loops `#pragma omp for` can be applied to
* See [Canonical Loop Form](https://www.openmp.org/spec-html/5.0/openmpsu40.html#x63-1260002.9.1) for the spec
* In short, it should look like `for (var = _init_; var < _limit_; var += _inc_)` where _init_, _limit_, and _inc_ are all loop-invariant (do not change throughout the loop)

## 6-2. Combined pragma (parallel + for)
* `#pragma omp parallel` and `#pragma omp for` are often used together
* If `#pragma omp for` immediately follows `#pragma omp parallel`, they can be combined into a single pragma `#pragma omp parallel for`

In [None]:
%%writefile omp_parallel_for.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
#pragma omp parallel for
  for (int i = 0; i < 24; i++) {
    usleep(100 * 1000 * i);     /* sleep 100 x i milliseconds */
    printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    fflush(stdout);
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_parallel_for.c -o omp_parallel_for
# nvc -mp omp_parallel_for.c -o omp_parallel_for

In [None]:
OMP_NUM_THREADS=4 ./omp_parallel_for

# 7. Scheduling a work-sharing for loop
* As you witnessed, the default scheduling policy in our environment appears to be static scheduling, which assigns roughly the same number of contiguous iterations to each thread (the default is implementation-defined, so it may differ elsewhere)
* Is it enough? Clearly, it does not do a good job when iterations take different amounts of time
* You can change the policy with the [schedule clause](https://www.openmp.org/spec-html/5.0/openmpsu41.html#x64-1290002.9.2)

## 7-1. Visualizing scheduling
* The program below executes the function `iter_fun`
```
#pragma omp parallel for
  for (long i = 0; i < L; i++) {
    iter_fun(a, b, i, M, N, R, T);
  }
```

* `iter_fun(a, b, i, M, N, R, T)` repeats x = a x + b many (M * N) times and records the time every N iterations

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_sched_rec.c
#include <err.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <omp.h>

long cur_time_ns() {
  struct timespec ts[1];
  if (clock_gettime(CLOCK_REALTIME, ts) == -1) err(1, "clock_gettime");
  return ts->tv_sec * 1000000000L + ts->tv_nsec;
}

typedef struct {
  double x;
  int thread[2];
  int cpu[2];
} record_t;

/* the function for an iteration
   perform
   x = a x + b
   (M * N) times and record current time
   every N iterations to T.
   record thread and cpu to R.
 */
void iter_fun(double a, double b, long i, long M, long N,
              record_t * R, long * T) {
  // initial value (not important)
  double x = i;
  // record in T[i * M] ... T[(i+1) * M - 1]
  T = &T[i * M];
  // record starting thread/cpu
  R[i].thread[0] = omp_get_thread_num();
  R[i].cpu[0] = sched_getcpu();
  // repeat a x + b many times.
  // record time every N iterations
  for (long j = 0; j < M; j++) {
    T[j] = cur_time_ns();
    for (long k = 0; k < N; k++) {
      x = a * x + b;
    }
  }
  // record ending thread/cpu (the thread must be the same as thread[0])
  R[i].thread[1] = omp_get_thread_num();
  R[i].cpu[1] = sched_getcpu();
  // record result, just so that the computation is not
  // eliminated by the compiler
  R[i].x = x;
}

void dump(record_t * R, long * T, long L, long M, long t0) {
  long k = 0;
  for (long i = 0; i < L; i++) {
    printf("i=%ld x=%f thread0=%d cpu0=%d thread1=%d cpu1=%d",
           i, R[i].x, R[i].thread[0], R[i].cpu[0], R[i].thread[1], R[i].cpu[1]);
    for (long j = 0; j < M; j++) {
      printf(" %ld", T[k] - t0);
      k++;
    }
    printf("\n");
  }
}

int main(int argc, char ** argv) {
  int idx = 1;
  long L   = (idx < argc ? atol(argv[idx]) : 100);  idx++;
  long M   = (idx < argc ? atol(argv[idx]) : 100);  idx++;
  long N   = (idx < argc ? atol(argv[idx]) : 100);  idx++;
  double a = (idx < argc ? atof(argv[idx]) : 0.99); idx++;
  double b = (idx < argc ? atof(argv[idx]) : 1.00); idx++;
  record_t * R = (record_t *)calloc(L, sizeof(record_t));
  long * T = (long *)calloc(L * M, sizeof(long));
  long t0 = cur_time_ns();
#pragma omp parallel for
  for (long i = 0; i < L; i++) {
    iter_fun(a, b, i, M, N, R, T);
  }
  long t1 = cur_time_ns();
  printf("%ld nsec\n", t1 - t0);
  dump(R, T, L, M, t0);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp -D_GNU_SOURCE omp_sched_rec.c -o omp_sched_rec
# nvc -mp -D_GNU_SOURCE omp_sched_rec.c -o omp_sched_rec

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=4 ./omp_sched_rec > a.dat

* Execute the following cell to visualize it
* In the graph,
  * horizontal axis is the time from the start in nanoseconds
  * vertical axis is the iteration number
  * the color represents the thread that executed the iteration


In [None]:
BEGIN SOLUTION
END SOLUTION
import sched_vis
sched_vis.sched_plt(["a.dat"])
# sched_vis.sched_plt(["a.dat"], start_t=1.5e7, end_t=2.0e7)

# <font color="green"> Problem 5 :  Understanding scheduling by visualization</font>
* Add `schedule` clause to the program (`schedule(runtime)` allows you to set the schedule in the command line)
* Change the number of threads and schedule and observe how iterations are executed
* Set the number of threads very large (higher than the physical number of cores) and see what happens
  * Hint : you can get the number of cores by `nproc` command 
* In the above program, each iteration performs exactly the same amount of computation (i.e., x = a x + b (M * N) times), thus takes almost exactly the same time
* See what happens if this is not the case
  * Specifically, make iteration `i` repeat x = a x + b (M * (i * N)) times (i.e., change the inner loop in `iter_fun` to `for (long k = 0; k < i * N; k++) { ...`)

* The `sched_plt` function used above takes optional parameters `start_t` and `end_t` specifying the horizontal range to display
* If you zoom _very_ closely to a particular point, you can see individual points and intervals between them, from which you can deduce how long it takes to perform `x = a x + b` once

In [None]:
nproc

# <font color="green"> Problem 6 :  Specifying the scheduling policy by schedule clause</font>
* In the following (artificial) loop, iteration _i_ roughly sleeps for (100 x _i_) milliseconds and this is almost exactly the time it takes
1. predict the executing time of the parallel for loop with the default (static) scheduling policy
1. vary the scheduling policy and reason about their execution times


In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_schedule.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
  /* ----- add schedule clause below ----- */
#pragma omp parallel for
  for (int i = 0; i < 12; i++) {
    usleep(100 * 1000 * i);     /* sleep 100 x i milliseconds */
    printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    fflush(stdout);
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_schedule.c -o omp_schedule
# nvc -mp omp_schedule.c -o omp_schedule

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=4 ./omp_schedule

* Predict the execution time with the default (static) policy
* Predict the execution time with the dynamic policy
* Compare them with what you observed

* Explain your reasoning below

BEGIN SOLUTION
END SOLUTION


# 8. Collapse clause
* `#pragma omp for` can take a [collapse clause](https://www.openmp.org/spec-html/5.0/openmpsu41.html#x64-1290002.9.2) to apply work-sharing to a limited kind of nested loop
* With `#pragma omp for collapse(2)`, OpenMP treats the doubly-nested loop that follows the pragma as the subject of work-sharing (i.e., it distributes the iterations of the doubly-nested loop to threads); the loop that follows must be a _perfectly-nested_, rectangular doubly-nested loop
* A perfectly-nested loop is a nested loop whose outer loops (all loops except for the innermost one) do not have any statement except the inner loop. e.g.
```
for (i = 0; i < 100; i++) {
  for (j = 0; j < 100; j++) {
    S(i,j);
  }
}
```
is perfectly nested whereas
```
for (i = 0; i < 100; i++) {
  S;
  for (j = 0; j < 100; j++) {
    T;
  }
}
```
is not.  
* A perfectly nested loop is conceptually a flat loop with a mechanical transformation.
```
for (ij = 0; ij < 100 * 100; ij++) {
  i = ij / 100;
  j = ij % 100;
  S(i,j);
}
```
* A rectangular loop is a loop whose iteration counts of inner loops never depend on outer loops.  For example,
```
for (i = 0; i < 100; i++) {
  for (j = 0; j < i; j++) {
    S(i,j);
  }
}
```
is not a rectangular loop.
* Generally speaking, `#pragma omp parallel` + `#pragma omp for` cannot handle nested parallelism very well, but the collapse clause alleviates the problem to some extent
* Consider using tasks (described below) for more general forms of nested parallelism

# <font color="green"> Problem 7 :  Apply collapse and schedule</font>
* Apply collapse and schedule to the following loop
* Trick: write `schedule(runtime)` and you can change the scheduling policy at execution time by setting environment variable `OMP_SCHEDULE=` in the command line. See [OMP_SCHEDULE environment variable](https://www.openmp.org/spec-html/5.0/openmpsu41.html#x64-1370002.9.2.1) for details
* Reason about the execution time of various schedule policies and with/without collapse

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_collapse.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
  /* apply collapse and schedule */
#pragma omp parallel for
  for (int i = 0; i < 5; i++) {
    for (int j = 0; j < 5; j++) {
      usleep(100 * 1000 * (i + j));
      printf("iteration (%d, %d) executed by thread %d\n", i, j, omp_get_thread_num());
      fflush(stdout);
    }
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}




In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_collapse.c -o omp_collapse
# nvc -mp omp_collapse.c -o omp_collapse

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=3 ./omp_collapse

# 9. Task parallelism
* Tasking is a more general mechanism to extract parallelism: it distributes units of computation (called tasks) _dynamically_ to the threads in a team created by `#pragma omp parallel`
* A thread can create a task at any point in the execution of a parallel region, and tasks are dispatched to available threads at runtime
* As a thread can create a task at any point, a task can create another task. That is, parallelism can be arbitrarily nested and the number of tasks can be difficult to predict (unlike the number of iterations of a for loop)
* A common pattern
  1. enter a parallel region by `#pragma omp parallel`
  1. ensure the statement is executed by only a single (root) thread with [#pragma omp master](https://www.openmp.org/spec-html/5.0/openmpse24.html#x118-4380002.16)
  1. create tasks at any point by [#pragma omp task](https://www.openmp.org/spec-html/5.0/openmpsu46.html#x70-2000002.10.1)
  1. a task waits for tasks it created to finish by [#pragma omp taskwait](https://www.openmp.org/spec-html/5.0/openmpsu93.html#x124-4690002.17.5)

* Let's see the effect of `#pragma omp master` first (without creating any task)

In [None]:
%%writefile omp_master.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
#pragma omp parallel
  {
#pragma omp master
    printf("inside the master pragma: I am thread %d of a team of %d threads\n",
           omp_get_thread_num(), omp_get_num_threads());
    printf("out of the master pragma: I am thread %d of a team of %d threads\n",
           omp_get_thread_num(), omp_get_num_threads());
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_master.c -o omp_master
# nvc -mp omp_master.c -o omp_master

In [None]:
OMP_NUM_THREADS=3 ./omp_master

* Since this is a common idiom, they can be combined into one pragma (`#pragma omp parallel master`)
  * <font color=red>This feature is not supported by NVIDIA compiler, however</font>
* The program below creates a parallel region that is executed entirely by the master thread; it serves no useful purpose other than demonstrating the feature

In [None]:
%%writefile omp_parallel_master.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
#pragma omp parallel master
  printf("I am thread %d of a team of %d threads\n",
         omp_get_thread_num(), omp_get_num_threads());
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_parallel_master.c -o omp_parallel_master
# NVIDIA compiler does not support this program
# nvc -mp omp_parallel_master.c -o omp_parallel_master

In [None]:
OMP_NUM_THREADS=3 ./omp_parallel_master

* Let's create a few tasks now

In [None]:
%%writefile omp_task.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
#pragma omp parallel
#pragma omp master
  {
    printf("I am thread %d of a team of %d threads\n",
           omp_get_thread_num(), omp_get_num_threads());
#pragma omp task
    {
      printf("task A executed by %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
      usleep(500 * 1000);
    }
#pragma omp task
    {
      printf("task B executed by %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
      usleep(1000 * 1000);
    }
#pragma omp taskwait
    printf("two tasks done, executed by %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_task.c -o omp_task
# nvc -mp omp_task.c -o omp_task

In [None]:
OMP_NUM_THREADS=3 ./omp_task

* Tasks are particularly good at parallel recursions, as the following program demonstrates
* This is a common pattern that appears in many algorithms, particularly divide-and-conquer algorithms

In [None]:
%%writefile omp_rec_task.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

void recursive_tasks(int n, int tid) {
  printf("task %d by %d of %d\n",
         tid, omp_get_thread_num(), omp_get_num_threads());
  fflush(stdout);
  if (n == 0) {
    usleep(300 * 1000);
  } else {
#pragma omp task
    recursive_tasks(n - 1, 2 * tid + 1);
#pragma omp task
    recursive_tasks(n - 1, 2 * tid + 2);
#pragma omp taskwait
  }
}
int main() {
  double t0 = omp_get_wtime();
#pragma omp parallel
#pragma omp master
  {
    recursive_tasks(5, 0);
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_rec_task.c -o omp_rec_task
# nvc -mp omp_rec_task.c -o omp_rec_task

In [None]:
OMP_NUM_THREADS=10 ./omp_rec_task

# <font color="green"> Problem 8 :  A quiz about recursive tasks</font>
* Answer the following questions
* How many tasks are created by `recursive_tasks(n, 0)`?  Include the caller of `recursive_tasks(n, 0)` as a task, i.e., consider that `recursive_tasks(0, 0)` creates one task
* How many of them are leaf tasks?
* Express them in terms of $n$

BEGIN SOLUTION
END SOLUTION

* Approximately what is the ideal execution time of `recursive_tasks(5, 0)` when using 10 threads?
* Compare it with what you observed

BEGIN SOLUTION
END SOLUTION

* Name an algorithm or two for which recursive tasks will be useful for parallelizing it and explain why you think so

BEGIN SOLUTION
END SOLUTION

# 10. Taskloop
* As you can easily imagine, tasks can handle general nested loops if they can handle recursions
* Recent OpenMP actually has a construct just for that, which is [#pragma omp taskloop](https://www.openmp.org/spec-html/5.0/openmpsu47.html#x71-2080002.10.2)
  * <font color=red>This feature is not supported by NVIDIA compiler</font>
* Here is a demonstration showing it can handle non perfectly-nested loops

In [None]:
%%writefile omp_taskloop.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  double t0 = omp_get_wtime();
#pragma omp parallel
#pragma omp master
#pragma omp taskloop
  for (int i = 0; i < 5; i++) {
    printf("i = %d starts\n", i);
    fflush(stdout);
#pragma omp taskloop
    for (int j = 0; j < 5; j++) {
      usleep(100 * 1000 * (i + j));
      printf("iteration (%d, %d) executed by thread %d\n", i, j, omp_get_thread_num());
      fflush(stdout);
    }
  }
  double t1 = omp_get_wtime();
  printf("%f sec\n", t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_taskloop.c -o omp_taskloop
# nvc -mp omp_taskloop.c -o omp_taskloop

In [None]:
OMP_NUM_THREADS=3 ./omp_taskloop

<font color="red">NOTE:</font>

* Implementing tasks requires a more general mechanism than implementing the work-sharing for statement; the former must be able to distribute tasks generated in the course of execution, whereas the latter merely needs to distribute iterations that are easily identifiable at the point of entering `#pragma omp for`, thanks to the ["canonical form" restriction](https://www.openmp.org/spec-html/5.0/openmpsu40.html#x63-1260002.9.1)
* Task scheduling is always dynamic, whereas work-sharing for (particularly with static scheduling) gives you more control and predictability about which thread executes which iteration
* This is one reason why the two seemingly redundant mechanisms coexist, besides the historical reason that OpenMP initially had no tasking construct

# 11. Data sharing
* OpenMP is a shared memory programming model, which means threads see each other's updates
* That is, when a thread updates a variable _x_ that is then read by another, the reader thread will see the updated value
* This is the default behavior of OpenMP ([Data Environment](https://www.openmp.org/spec-html/5.0/openmpse27.html#x135-5430002.19) in OpenMP spec)
* It is not always convenient, however
* `#pragma omp parallel` can thus specify that variables defined outside the parallel region are privatized (i.e., each thread gets its own copy)

# <font color="green"> Problem 9 :  Observe the effect of privatization</font>
* Execute the following and make sense of the output
* Add the `private(x)` clause and observe the difference

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_private.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  int x = 123;
  printf("before : x = %d\n", x);
  /* add private(x) clause below and see the difference */
#pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("thread %d : x = %d\n", id, x);
  }
  printf("after : x = %d\n", x);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_private.c -o omp_private
# nvc -mp omp_private.c -o omp_private

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=10 ./omp_private

* `private(x)` essentially ignores the original variable `x` defined outside the parallel region and behaves as if a variable of the same name is defined by each thread
* `firstprivate(x)` is like `private(x)`, except that `x` of each thread is initialized by the value of `x` just before entering the parallel region

# <font color="green"> Problem 10 :  Observe the effect of `private` and `firstprivate`</font>
* Execute the following and observe the output
* Add the `private(x)` clause and execute it
* Add the `firstprivate(x)` clause and execute it
* Make sense of the differences

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_firstprivate.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  int x = 123;
  printf("before : x = %d\n", x);
  /* add private(x)/firstprivate(x) clause and see the difference */
#pragma omp parallel
  {
    int id = omp_get_thread_num();
    x++;
    printf("thread %d : x = %d\n", id, x);
  }
  printf("after : x = %d\n", x);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_firstprivate.c -o omp_firstprivate
# nvc -mp omp_firstprivate.c -o omp_firstprivate

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=3 ./omp_firstprivate

# 12. Race condition
* Execute the above code without private or firstprivate many times
* Observe that the value of `x` after the parallel region is not always 123 + the number of threads executing the region (126 with `OMP_NUM_THREADS=3`), and that it can even differ across runs
* For example, with two threads, the following execution order loses one of the two increments
  1. thread A reads 123
  1. thread B reads 123
  1. thread A writes 124
  1. thread B writes 124
* A similar lost update occurs whenever another thread's update falls between a thread's read and its subsequent write
* More generally, the following situation is called a "race condition" and if there is a race condition in your program, it almost always means your program is broken
  * Two or more threads concurrently access the same variable, and
  * at least one of them writes to it
Here, "concurrently access" means these accesses are not guaranteed to be separated in time by a synchronization primitive

* In all but trivial parallel programs, threads need to communicate with each other to accomplish a task
* Threads _communicate_ by having one thread write to a variable and having another read it
* If we simply do it without any mechanism to guarantee that they are separated in time, it is a race

* Below, we describe three ways to _safely_ communicate among threads without making race conditions

  * `#pragma omp critical`
  * `#pragma omp atomic`
  * reduction

# 13. `#pragma omp critical`
* [#pragma omp critical](https://www.openmp.org/spec-html/5.0/openmpsu89.html#x120-4470002.17.1) guarantees that executions of the statement following the pragma by different threads never overlap in time

# <font color="green"> Problem 11 :  Apply `#pragma omp critical`</font>
* Execute the following program a few times and observe that the result is nondeterministic and often not what we want (i.e., 123 + the number of threads)
* Then, add `#pragma omp critical` to the statement `x++` and see the result

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_critical.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  int x = 123;
  printf("before : x = %d\n", x);
#pragma omp parallel
  {
    int id = omp_get_thread_num();
    x++;
  }
  printf("after : x = %d\n", x);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_critical.c -o omp_critical
# nvc -mp omp_critical.c -o omp_critical

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=100 ./omp_critical

# 14. `#pragma omp atomic`
* [#pragma omp atomic](https://www.openmp.org/spec-html/5.0/openmpsu95.html#x126-4840002.17.7) is similar to `#pragma omp critical` but its effect is slightly different and its applicability limited (see below)

# <font color="green"> Problem 12 :  Apply `#pragma omp atomic`</font>
* Add `#pragma omp atomic` to the statement `x++` and see the result

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_atomic.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  int x = 123;
  printf("before : x = %d\n", x);
#pragma omp parallel
  {
    int id = omp_get_thread_num();
    x++;
  }
  printf("after : x = %d\n", x);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_atomic.c -o omp_atomic
# nvc -mp omp_atomic.c -o omp_atomic

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=100 ./omp_atomic

* The statement that follows this pragma cannot be an arbitrary statement
* See [atomic Construct](https://www.openmp.org/spec-html/5.0/openmpsu95.html#x126-4840002.17.7) for the spec
* Typically, it is an update to a variable, such as
```  
x += expr;
```  
* What is guaranteed by `#pragma omp atomic` is different from what `#pragma omp critical` guarantees
```
#pragma omp atomic
x += expr;
``` 
guarantees that no other update labeled `#pragma omp atomic` falls between the read and the write of _x_, whereas
```
#pragma omp critical
x += expr;
``` 
guarantees that the entire statement `x += expr` does not overlap with another statement labeled critical.
* When applicable, `#pragma omp atomic` is more efficient than `#pragma omp critical` because the evaluation of expr can overlap

# 15. Reduction clause
* [Reduction](https://www.openmp.org/spec-html/5.0/openmpsu107.html#x140-5800002.19.5) is the best way to resolve race conditions where applicable and it often is
* It is applicable when threads altogether calculate $v = v_0 \oplus v_1 \oplus ... \oplus v_{n-1}$ where $v_i$ can be computed independently and $\oplus$ is an associative operator (such as +)
* In a serial loop, this could be written as
```
v = initial value;
for (i = 0; i < n; i++) {
  v_i = ...
  v = v + v_i;
}
```
* If we parallelize the above loop, updating $v$ will result in a race condition
* This can be safely parallelized by introducing `reduction(+ : v)`

# <font color="green"> Problem 13 :  Apply reduction</font>
* Add a `reduction` clause to `#pragma omp parallel` below and observe the result

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_reduction.c
#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main() {
  int x = 123;
  printf("before : x = %d\n", x);
#pragma omp parallel
  {
    int id = omp_get_thread_num();
    x++;
  }
  printf("after : x = %d\n", x);
  return 0;
}

In [None]:
BEGIN SOLUTION
END SOLUTION
clang -fopenmp omp_reduction.c -o omp_reduction
# nvc -mp omp_reduction.c -o omp_reduction

In [None]:
BEGIN SOLUTION
END SOLUTION
OMP_NUM_THREADS=100 ./omp_reduction

# 16. How reduction clause works and why it is preferable when applicable
* Where applicable, reduction is generally much faster than using `#pragma omp atomic` or `#pragma omp critical`
* This is because, internally, each thread accumulates its partial result in a private variable, and the partial results are combined only once at the end
* That is, each update, e.g., `x += expr`, updates the thread's private version of the variable `x` instead of the shared variable
* Omitting details, you can think of the reduction clause as converting something like

```
int x = G;
#pragma omp parallel reduction(+ : x)
{
   ... x += expr; ...
}
  
```

into something like

```
int x = G;
#pragma omp parallel
{
  int x_priv = 0; // (I) initialize a private version of x
  {
   ... x_priv += expr; ...
  }
#pragma omp atomic
  x += x_priv; // (C) combine the partial results in the private variable into the global variable
}
```
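* This is valid because the operation is associative: the updates can be grouped per thread and the groups combined afterwards without changing the outcome
* The private copy starts from the identity of the operation (0 for +), so the original value `G` is counted exactly once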

# 17. User-defined reduction
* Reduction is a general concept for efficiently executing many computations of $v_i$'s in parallel, when the final outcome we wish to compute is $v = v_0 \oplus v_1 \oplus ... \oplus v_{n-1}$ 
* It is applicable whenever the order of combining partial results via $\oplus$ does not affect the final outcome (e.g., +)
* Yet the builtin reduction clause of OpenMP can only specify a few builtin operations for a few builtin types (e.g., int, float, etc.)
* Sometimes you want to apply the same efficient execution mechanism to more general types (perhaps types you defined yourself)
* [User-defined reduction](https://www.openmp.org/spec-html/5.0/openmpsu107.html#x140-5800002.19.5) exists exactly for that
* You need to define expressions to
  * (C) combine two partial results into one (more specifically, combine a partial result held in a variable named `omp_in` into another variable named `omp_out`)
  * (I) initialize a thread-private version of the variable to which the reduction is applied, named `omp_priv`
* For example, in the case of the builtin + operator,
  * (C) would be `omp_out += omp_in`
  * (I) would be `omp_priv = 0`

## 17-1. Apply a user-defined reduction
* Here is a simple (broken) parallel for loop that is meant to do a reduction on a 3-element vector
* Define a reduction with `#pragma omp declare reduction` and apply it to the parallel loop

In [None]:
%%writefile omp_ud_reduction.c
#include <stdio.h>
#include <unistd.h>
#include <math.h>
#include <omp.h>

/* 3-element vector */
typedef struct {
  double a[3];
} vec_t;

/* x += y */
void vec_add(vec_t * x, vec_t * y) {
  for (int i = 0; i < 3; i++) {
    x->a[i] += y->a[i];
  }
}

/* x = {0,0,0} */
void vec_init(vec_t * x) {
  for (int i = 0; i < 3; i++) {
    x->a[i] = 0;
  }
}


/* add an appropriate #pragma omp declare reduction ... here */
  
int main() {
  vec_t v;
  vec_init(&v);
  double t0 = omp_get_wtime();
  /* add an appropriate reduction clause, so that
     the result is always {10000,10000,10000} */
#pragma omp parallel for
  for (int i = 0; i < 30000; i++) {
    v.a[i % 3]++;
  }
  double t1 = omp_get_wtime();
  printf("ans = {%.1f, %.1f, %.1f} in %f sec\n", v.a[0], v.a[1], v.a[2], t1 - t0);
  return 0;
}

In [None]:
clang -fopenmp omp_ud_reduction.c -o omp_ud_reduction
# nvc -mp omp_ud_reduction.c -o omp_ud_reduction

In [None]:
OMP_NUM_THREADS=10 ./omp_ud_reduction

# <font color="green"> Problem 14 :  Putting them together: calculating an integral</font>
Write an OpenMP program that calculates

$$ \int \int_D \sqrt{1 - x^2 - y^2}\,dx\,dy $$

where

$$ D = \{\;(x, y)\;|\;0\leq x \leq 1, 0\leq y \leq 1, x^2 + y^2 \leq 1 \}$$

* Note: an alternative way to put it is to calculate

$$ \int_0^1 \int_0^1 f(x, y)\,dx\,dy $$

where

$$ f(x, y) = \left\{\begin{array}{ll}\sqrt{1 - x^2 - y^2} & (x^2 + y^2 \leq 1) \\ 0 & (\mbox{otherwise}) \end{array}\right. $$

* Use a nested loop to calculate the double integral
* Try a work-sharing `for`, `taskloop`, and recursive tasks to parallelize it
* The result should be close to $\pi/6 = 0.52359..$ (1/8 of the volume of the unit ball)
* Play with the number of subintervals used for the integration and the number of threads so that you can observe a speedup
* As you are using a shared cloud environment, you do not have to be serious about speedup; nearly perfect speedup is unlikely when other students are using the same machine simultaneously and/or the cloud is doing many other things (e.g., serving the page you are looking at right now)

* If you want to work with an editor you are accustomed to rather than web browser, see [this page](https://taura.github.io/programming-languages/html/jupyter.html?lang=en)

In [None]:
BEGIN SOLUTION
END SOLUTION
%%writefile omp_integral.c


* Compile it

In [None]:
BEGIN SOLUTION
END SOLUTION

* and run it

In [None]:
BEGIN SOLUTION
END SOLUTION