<a href="https://colab.research.google.com/github/trefftzc/cis677/blob/main/Intro_to_OpenMP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to OpenMP
Examples taken from the book "Programming Your GPU with OpenMP" by Tom Deakin and Timothy G. Mattson.

In [None]:
%%writefile hello.c
#include <stdio.h>
#include <omp.h>
int main()
{
  #pragma omp parallel
  {
    int id = omp_get_thread_num();
    printf("hello %d",id);
    printf(" world %d\n",id);
  }
}


Writing hello.c


Now compile the file hello.c.
With the gcc compiler, the flag that enables OpenMP is
 -fopenmp

In [None]:
!gcc hello.c -o hello -fopenmp

Now, execute the program:

In [None]:
!./hello

hello 1 world 1
hello 0 world 0


The default number of threads in this Colab environment is 2, one per logical CPU.
The Linux command lscpu shows more about the characteristics of the CPU running the system.

In [None]:
!lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) CPU @ 2.20GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1
    Stepping:            0
    BogoMIPS:            4399.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clf
                         lush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_
                         good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fm
                         a cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hyp
                         ervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd i
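
The thread count can also be queried from inside a program instead of being inferred from `lscpu`. The following is a minimal sketch, not code from the book: the helper name `report_threads` is hypothetical, and the `_OPENMP` guard is an assumption that lets the same file build without `-fopenmp`, in which case it reports a single thread.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: returns the number of threads a parallel region
// would use by default, and prints it next to the processor count.
int report_threads(void)
{
  int max_threads = 1;  // fallback when compiled without -fopenmp
  int num_procs = 1;
#ifdef _OPENMP
  max_threads = omp_get_max_threads();  // upper bound on the next team size
  num_procs = omp_get_num_procs();      // processors available to the program
#endif
  assert(max_threads >= 1);
  printf("max threads = %d, processors = %d\n", max_threads, num_procs);
  return max_threads;
}
```

Calling `report_threads()` from `main` prints both values. The default can be changed by setting the environment variable `OMP_NUM_THREADS` before running the program, or by calling `omp_set_num_threads()` before the parallel region.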

## Calculating Pi
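The following integral evaluates to the constant pi.

$$ \int_{0}^{1} \frac{4.0}{(1 + x^2)} dx = \pi $$

It can be approximated as a sum of $N$ rectangles of width $\Delta x = 1/N$, where $F(x) = 4.0/(1 + x^2)$ is evaluated at the midpoint $x_i$ of each rectangle:

$$ \sum_{i = 0}^{N-1} F(x_i) \Delta x \approx \pi $$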
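
As a worked check of the rectangle sum, assume the rectangles are evaluated at their midpoints, $x_i = (i + 0.5)\,\Delta x$ with $\Delta x = 1/N$, which is what the code that follows does. The choice $N = 2$ is only an illustration: it gives $\Delta x = 0.5$, $x_0 = 0.25$ and $x_1 = 0.75$.

$$ \Delta x \left[ F(0.25) + F(0.75) \right] = 0.5 \left[ \frac{4}{1.0625} + \frac{4}{1.5625} \right] \approx 0.5 \, (3.7647 + 2.56) \approx 3.1624 $$

Two rectangles already land about 0.02 above $\pi$. Increasing $N$ shrinks the error, which is why the programs use a large number of steps.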

In [None]:
%%writefile pi.c
//
// Sequential version
//
#include <stdio.h>
#include <omp.h>

static long num_steps = 1024*1024*1024;

int main()
{
  double x, pi, step, sum = 0.0;
  step = 1.0 / (double) num_steps;

  for(int i = 0;i < num_steps;i++) {
    x = (i+0.5) * step;
    sum += 4.0 / (1.0 + x*x);
  }
  pi = step * sum;
  printf("Pi = %lf , with %ld steps\n ",pi,num_steps);
}

Writing pi.c


In [None]:
!gcc pi.c -o pi -O3

In [None]:
!./pi

Pi = 3.141593 , with 1073741824 steps
 

Use the time command to find out how long the program takes to execute.


In [None]:
!time ./pi

Pi = 3.141593 , with 1073741824 steps
 
real	0m2.553s
user	0m2.498s
sys	0m0.003s


In [None]:
%%writefile pi_spmd.c
//
// SPMD version
//
#include <stdio.h>
#include <omp.h>

static long num_steps = 1024*1024*1024;

int main()
{
  int numthreads;
  double pi, step, full_sum = 0.0;
  step = 1.0 / (double) num_steps;

#pragma omp parallel
  {
    int id = omp_get_thread_num();
    double x, partial_sum = 0.0;
    #pragma omp single
      numthreads = omp_get_num_threads();

    for(int i = id;i < num_steps;i += numthreads) {
      x = (i+0.5) * step;
      partial_sum += 4.0 / (1.0 + x*x);
    }
    #pragma omp critical
      full_sum += partial_sum;
}
  pi = step * full_sum;
  printf("Pi = %lf , with %ld steps\n ",pi,num_steps);
}

Overwriting pi_spmd.c


The compilation:

In [None]:
!gcc pi_spmd.c -o pi_spmd -O3 -fopenmp

The execution and its timing.

In [None]:
!time ./pi_spmd

Pi = 3.141593 , with 1073741824 steps
 
real	0m1.924s
user	0m3.635s
sys	0m0.006s


## The meaning of the different omp directives

1.   #pragma omp parallel: Creates a team of threads. Every thread in the team executes the block that follows.
2.   #pragma omp single: Only one thread in the team executes the line or block that follows this pragma; the other threads wait at the end of it.
3.   #pragma omp critical: The threads execute the line or block that follows one at a time, which avoids concurrent-access problems.
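
The three directives can be seen working together in a small counting example. This is a sketch under stated assumptions, not code from the book: the helper name `count_team` is hypothetical, and the `_OPENMP` guard is added so the file also builds without `-fopenmp`, where the pragmas are ignored and a single thread runs the block.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: every thread in the team adds 1 to a shared counter.
// Returns 1 if the counter matches the team size, 0 otherwise.
int count_team(void)
{
  int team_size = 1;  // stays 1 when compiled without -fopenmp
  int counter = 0;

  #pragma omp parallel
  {
    // One thread records the team size; the others wait at the
    // implicit barrier at the end of the single construct.
    #pragma omp single
    {
#ifdef _OPENMP
      team_size = omp_get_num_threads();
#endif
    }

    // Threads take turns here, so no increment is lost.
    #pragma omp critical
    counter += 1;
  }
  assert(counter >= 1);
  return counter == team_size;
}
```

Removing the `critical` pragma would turn `counter += 1` into a data race, and the final count could then come out lower than the team size.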

## The meaning of the different omp functions


1.   omp_get_thread_num(): Returns the id of the calling thread. Every thread in a team has a unique id, ranging from 0 to the number of threads in the team minus one.
2.   omp_get_num_threads(): Returns the number of threads in the team that is executing the current parallel region.
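
The SPMD version uses these two functions to deal out loop iterations cyclically: thread `id` takes iterations `id`, `id + numthreads`, and so on. The sketch below checks that this pattern covers every iteration exactly once. The helper name `cyclic_covers_all` and the size `NUM_ITEMS` are hypothetical, and the `_OPENMP` guard is an assumption that keeps the file buildable without `-fopenmp`.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define NUM_ITEMS 16  // hypothetical problem size for the illustration

// Hypothetical helper: each thread claims iterations id, id + nthreads, ...
// Returns 1 if every iteration was claimed exactly once, 0 otherwise.
int cyclic_covers_all(void)
{
  int claimed[NUM_ITEMS] = {0};

  #pragma omp parallel
  {
    int id = 0, nthreads = 1;  // serial fallback values
#ifdef _OPENMP
    id = omp_get_thread_num();         // 0 .. nthreads-1
    nthreads = omp_get_num_threads();  // size of the current team
#endif
    assert(id >= 0 && id < nthreads);
    for (int i = id; i < NUM_ITEMS; i += nthreads)
      claimed[i] += 1;  // no race: each index belongs to one thread
  }

  for (int i = 0; i < NUM_ITEMS; i++)
    if (claimed[i] != 1) return 0;
  return 1;
}
```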





## A second version with OpenMP

In [None]:
%%writefile pi_parallel_for.c
//
// Parallel for version
//
#include <stdio.h>
#include <omp.h>

static long num_steps = 1024*1024*1024;

int main()
{
  double pi, step, sum = 0.0;
  step = 1.0 / (double) num_steps;

#pragma omp parallel
  {
    double x;  // declared inside the region, so each thread has its own copy

    // Split the iterations among the threads. Each thread adds into a
    // private copy of sum; the copies are combined after the loop.
    #pragma omp for reduction(+:sum)
    for(int i = 0;i < num_steps;i++) {
      x = (i+0.5) * step;
      sum = sum + 4.0 / (1.0 + x*x);
    }
  }
  pi = step * sum;
  printf("Pi = %lf , with %ld steps\n ",pi,num_steps);
}

Overwriting pi_parallel_for.c


The compilation:

In [None]:
!gcc pi_parallel_for.c -o pi_parallel_for -O3 -fopenmp

The execution and the timing:

In [None]:
!time ./pi_parallel_for

Pi = 3.141593 , with 1073741824 steps
 
real	0m2.089s
user	0m3.632s
sys	0m0.006s
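
The `parallel` and `for` directives can also be merged into one `parallel for` directive. The following is a minimal sketch of that shorter form, not code from the book; the function name `pi_reduction` and its `num_steps` parameter are hypothetical. Without `-fopenmp` the pragma is ignored and the loop runs sequentially with the same result.

```c
#include <assert.h>

// Hypothetical helper: midpoint-rule estimate of pi using the combined
// "parallel for" construct with a reduction.
double pi_reduction(long num_steps)
{
  assert(num_steps > 0);
  double step = 1.0 / (double) num_steps;
  double sum = 0.0;

  // Each thread accumulates into its own private copy of sum;
  // the copies are added into the original when the loop ends.
  #pragma omp parallel for reduction(+:sum)
  for (long i = 0; i < num_steps; i++) {
    double x = (i + 0.5) * step;  // declared in the loop, so private per thread
    sum += 4.0 / (1.0 + x * x);
  }
  return step * sum;
}
```

Declaring `x` inside the loop body makes it private to each thread without a `private` clause, and the reduction clause replaces the manual `partial_sum` plus `critical` pattern of the SPMD version.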


## Using tasks to parallelize recursive code

In [None]:
%%writefile pi_with_tasks.c
//
// Parallel recursive version with tasks
//
#include <omp.h>
#include <stdio.h>
static long num_steps = 1024*1024*1024;
#define MIN_BLK (1024*256)

double pi_comp(int Nstart,int Nfinish,double step)
{
  double x, sum = 0.0,sum1,sum2;
  // Base case.
  // This is a relatively small sub-block
  // Don't use recursion.
  if (Nfinish -Nstart < MIN_BLK)
  {
    for (int i = Nstart;i < Nfinish;i++)
    {
      x = (i + 0.5) * step;
      sum = sum + 4.0 / (1.0 + x * x);
    }
  }
  else // Recursive case
  {
    int iblk = Nfinish - Nstart;
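    // Each half of the range becomes a task. sum1 and sum2 are shared
    // so that the values the tasks write are visible here.
    #pragma omp task shared(sum1)
      sum1 = pi_comp(Nstart,Nfinish-iblk/2,step);
    #pragma omp task shared(sum2)
      sum2 = pi_comp(Nfinish - iblk/2, Nfinish, step);
    // Wait for both tasks to finish before combining their results.
    #pragma omp taskwait
      sum = sum1 + sum2;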
  }
  return sum;
}

int main()
{
  double step,pi,sum;
  step = 1.0 / (double) num_steps;
  #pragma omp parallel
    #pragma omp single
      sum = pi_comp(0,num_steps,step);
  pi = step * sum;
  printf("Pi = %lf , with %ld steps\n ",pi,num_steps);
}

Overwriting pi_with_tasks.c


The compilation:

In [None]:
!gcc pi_with_tasks.c -o pi_with_tasks -O3 -fopenmp

The execution and timing:

In [None]:
!time ./pi_with_tasks

Pi = 0.000000 , with 1073741824 steps
 
real	0m1.553s
user	0m2.897s
sys	0m0.004s
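
The task pattern is easier to follow on a smaller problem. The sketch below sums the integers in a range by splitting it in half recursively, one task per half. It is an illustration under stated assumptions, not code from the book: the names `range_sum`, `parallel_range_sum` and `CUTOFF` are hypothetical. Without `-fopenmp` the pragmas are ignored and the recursion runs sequentially.

```c
#include <assert.h>

#define CUTOFF 1000  // hypothetical threshold below which no tasks are created

// Hypothetical helper: sums the integers in [start, finish) by splitting
// the range in half and handing each half to a task.
long long range_sum(long long start, long long finish)
{
  assert(start <= finish);
  if (finish - start < CUTOFF) {
    long long sum = 0;
    for (long long i = start; i < finish; i++)
      sum += i;
    return sum;
  }

  long long mid = start + (finish - start) / 2;
  long long left = 0, right = 0;
  #pragma omp task shared(left)
  left = range_sum(start, mid);
  #pragma omp task shared(right)
  right = range_sum(mid, finish);
  // Block until both child tasks have written left and right.
  #pragma omp taskwait
  return left + right;
}

// Hypothetical driver: one thread creates the root of the task tree,
// and the rest of the team executes tasks as they become available.
long long parallel_range_sum(long long n)
{
  long long total = 0;
  #pragma omp parallel
  {
    #pragma omp single
    total = range_sum(0, n);
  }
  return total;
}
```

The `taskwait` is what makes the result reliable: without it, `left + right` may be evaluated before the child tasks have stored their values. The `shared` clauses matter for the same reason, since a task would otherwise write to its own private copy of the variable.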
