<a href="https://colab.research.google.com/github/rkurniawati/CDER-notebooks/blob/main/Introduction_to_OpenMP_with_Java_and_Pyjama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to OpenMP with Java and Pyjama

This notebook was created by Ruth Kurniawati (Westfield State University), based on material from the CDER Workshop 2021.

# Pyjama

OpenMP is only available for the C/C++ and Fortran languages. For Java, Pyjama provides support for OpenMP-like directives. More information about Pyjama can be found in the following paper:

Vikas, Nasser Giacaman, and Oliver Sinnen. 2013. Pyjama: OpenMP-like implementation for Java, with GUI extensions. In <i>Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores</i> (<i>PMAM '13</i>). Association for Computing Machinery, New York, NY, USA, 43–52. DOI:https://doi.org/10.1145/2442992.2442997

Note that Pyjama is NOT mature yet. In particular, it does not implement all core OpenMP features and has some peculiarities and limitations (see the Pyjama Limitations section at the end of this notebook).


## Pyjama setup

In this section, we will download and set up the Pyjama Java source code compiler and runtime library. The Pyjama compiler and runtime library used in this notebook include additional bug fixes made by [Tennessee Tech](https://www.csc.tntech.edu/pdcincs/index.php/installation) that have not been contributed back to the original Pyjama project.

In [1]:
!wget https://www.csc.tntech.edu/pdcincs/resources/modules/tools/updated/Pyjama.zip
!unzip Pyjama.zip
!ln Pyjama/Pyjama-3.1.0.jar Pyjama/Pyjama.jar

--2021-08-16 14:28:15--  https://www.csc.tntech.edu/pdcincs/resources/modules/tools/updated/Pyjama.zip
Resolving www.csc.tntech.edu (www.csc.tntech.edu)... 149.149.134.5
Connecting to www.csc.tntech.edu (www.csc.tntech.edu)|149.149.134.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 688047 (672K) [application/zip]
Saving to: ‘Pyjama.zip’


2021-08-16 14:28:16 (1.28 MB/s) - ‘Pyjama.zip’ saved [688047/688047]

Archive:  Pyjama.zip
   creating: Pyjama/
  inflating: Pyjama/Pyjama-3.1.0.jar  
  inflating: Pyjama/set_pyjama.bat   
 extracting: Pyjama/set_pyjama.sh    


## Hello world

In this example, we will verify that the Pyjama installation is working and able to create multiple threads as specified in the `#omp parallel num_threads` directive. 

In [2]:
%%writefile HelloWorld.java
public class HelloWorld
{	
    public static void main(String[] args) {
        
        int threadCount = Integer.parseInt(args[0]);

        //#omp parallel num_threads(threadCount)
        {
            int myID = Pyjama.omp_get_thread_num();
            int tCount = Pyjama.omp_get_num_threads();
            System.out.println("Hello from "+myID +" of "+tCount);
        }
    }
}

Writing HelloWorld.java


First, let's use Pyjama to process the `#omp` directive in the program. 

In [3]:
!java -jar Pyjama/Pyjama.jar HelloWorld.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/16	14:28:29
-----------------------------------------------------
Processing file: HelloWorld.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


Now we're ready to run the HelloWorld program. To do this, you need to put `Pyjama.jar` on the classpath so that the Pyjama OpenMP-like runtime library is available to the HelloWorld program.

In [5]:
!java -cp Pyjama/Pyjama.jar:. HelloWorld 4

Hello from 1 of 4
Hello from 2 of 4
Hello from 0 of 4
Hello from 3 of 4


## A brief overview of how Pyjama works

Pyjama is a Java compiler-runtime system that provides OpenMP-like support for Java programs. `Pyjama.jar` contains both the compiler and the runtime. When you run it using `java -jar Pyjama.jar HelloWorld.java`, you invoke the compiler, which processes the `#omp` directives in `HelloWorld.java` and translates them into calls into the Pyjama runtime. By default the translated code is compiled directly into a `.class` file; the compiler can also emit the translated Java source instead.
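
To make "directives translated into runtime calls" more concrete, here is a hand-written sketch of roughly what a `parallel` region amounts to, using only standard `java.lang.Thread`. This is not the code Pyjama generates: the class name `HelloThreads` and the helper `runRegion` are made up for illustration, and the sketch spawns every thread itself, whereas in OpenMP the calling thread normally takes part in the team as thread 0. It only mimics the observable behavior of the HelloWorld example: start the requested number of threads, give each one an ID, and wait for all of them at the end of the region.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class HelloThreads {

    // Hypothetical helper: runs the "parallel region" body on threadCount
    // threads and returns the messages they produced, in completion order.
    public static List<String> runRegion(final int threadCount) {
        final List<String> messages = new CopyOnWriteArrayList<String>();
        Thread[] team = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++) {
            final int myID = t; // plays the role of omp_get_thread_num()
            team[t] = new Thread(new Runnable() {
                public void run() {
                    messages.add("Hello from " + myID + " of " + threadCount);
                }
            });
            team[t].start();
        }

        // A parallel region ends with an implicit barrier: wait for every thread.
        for (Thread th : team) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        int threadCount = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        for (String line : runRegion(threadCount)) {
            System.out.println(line);
        }
    }
}
```

As with the Pyjama version, the order of the lines can change from run to run, because the threads are scheduled independently.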

Here's how you can list the options that the Pyjama compiler accepts:

In [None]:
!java -jar Pyjama/Pyjama.jar -h

usage: Pyjama
 -cp,--classpath <PATH>   Specify where to find user class files and
                          annotation processors
 -d,--directory <DIR>     output file directory
 -h,--help                print usage of Pyjama compiler
 -j2c,--javatoclass       (default)compile .java file to paralleled .class
                          file
 -j2j,--javatojava        compile .java file to paralleled .java file.
                          Remember new parallel java file will overwrite
                          old sequential java file, if there is no target
                          directory is specified.
 -p2c,--pjtoclass         compile .pj file to paralleled .class file
 -p2j,--pjtojava          compile .pj file to paralleled .java file
Error: no input files.

Let's look at the Java file generated for our HelloWorld.java. Before we use the `-j2j` option, note that we need to specify the directory where the generated Java file should be placed; otherwise the original file will be overwritten. First, let's create a directory called `generated`.

In [None]:
!mkdir generated

Let's generate a Java file instead of a class file using `Pyjama.jar`:

In [None]:
!java -jar Pyjama/Pyjama.jar -d generated -j2j HelloWorld.java 

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/05	03:47:44
-----------------------------------------------------
Processing file: HelloWorld.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
-----------------------------------------------------
Paralleled .java code is generated.
Processing Done.


Now, execute the command below to examine the content of the generated file. 

In [None]:
!cat generated/HelloWorld.java


# Examples

Below you'll find some example multithreaded Java programs written using OpenMP-like directives.

## Parallel Sum

Here's an example program that sums up an array. The work is divided explicitly in the code, based on the thread ID.

In [33]:
%%writefile ParallelSum.java
public class ParallelSum {
    public static void main(String[] args) {

        int numThreads = Integer.parseInt(args[0]);
        int n = Integer.parseInt(args[1]);
        double sum = 0;
        double [] results = new double[numThreads];
        double [] arr = new double[n];
        for (int i = 0; i < n; i++) arr[i] = 1;

        //#omp parallel num_threads(numThreads) shared(numThreads, n, arr, results)
        {
            int id = Pyjama.omp_get_thread_num();
            int chunk = (int)n/numThreads;
            int start = id * chunk;
            int end;

            if (id == numThreads - 1) end = n;
            else end = start + chunk;
            
            for (int i = start; i < end; i++)
                results[id] += arr[i];
        }
        for (int i = 0; i < numThreads; i++)
            sum += results[i];
        System.out.println("sum = " + sum);
    }
}

Overwriting ParallelSum.java
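
To see how the index arithmetic in `ParallelSum` divides the array, the sketch below applies the same rule (equal chunks of `n / numThreads` elements, with the last thread taking whatever remains) to a small case. The class `ChunkBounds` and its `bounds` helper are illustrative names, not part of Pyjama or of the program above.

```java
public class ChunkBounds {

    // Returns {start, end} (end exclusive) for the given thread id, using the
    // same rule as ParallelSum: equal chunks, last thread takes the remainder.
    public static int[] bounds(int id, int numThreads, int n) {
        int chunk = n / numThreads;
        int start = id * chunk;
        int end = (id == numThreads - 1) ? n : start + chunk;
        return new int[] { start, end };
    }

    public static void main(String[] args) {
        int n = 10;
        int numThreads = 4;
        for (int id = 0; id < numThreads; id++) {
            int[] b = bounds(id, numThreads, n);
            System.out.println("thread " + id + ": [" + b[0] + ", " + b[1] + ")");
        }
    }
}
```

With `n = 10` and 4 threads, the chunk size is 2, so threads 0 to 2 each get two elements and thread 3 gets the range [6, 10), that is, four elements. When `n` is not a multiple of the thread count, the last thread always does the extra work.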


In [34]:
!java -jar Pyjama/Pyjama.jar ParallelSum.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/16	15:05:25
-----------------------------------------------------
Processing file: ParallelSum.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


First, let's make sure that it works with just one thread, summing up 100,000,000 numbers.

In [35]:
!java -cp Pyjama/Pyjama.jar:. ParallelSum 1 100000000

sum = 1.0E8


Let's try running this program with different numbers of threads and compare the timings.

In [36]:
!time java -cp Pyjama/Pyjama.jar:. ParallelSum 1 100000000

sum = 1.0E8

real	0m0.875s
user	0m0.559s
sys	0m0.412s


In [37]:
!time java -cp Pyjama/Pyjama.jar:. ParallelSum 2 100000000

sum = 1.0E8

real	0m0.826s
user	0m0.584s
sys	0m0.401s


In [38]:
!time java -cp Pyjama/Pyjama.jar:. ParallelSum 4 100000000

sum = 1.0E8

real	0m0.840s
user	0m0.599s
sys	0m0.417s
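
Keep in mind that `time` measures the whole JVM process: JVM startup, allocating the array, and the sequential loop that fills it with 1s are all included, not just the parallel region. One way to isolate the summation is to time it inside the program with `System.nanoTime()`. The sketch below does this for a plain sequential sum; `TimedSum` and `timeSum` are hypothetical names, and the same pattern could be wrapped around the parallel region in `ParallelSum`.

```java
import java.util.Arrays;

public class TimedSum {

    // Sums the array sequentially and returns the elapsed time, in
    // nanoseconds, of the summation loop only. The sum goes into result[0].
    public static long timeSum(double[] arr, double[] result) {
        long startNs = System.nanoTime();
        double sum = 0;
        for (int i = 0; i < arr.length; i++) {
            sum += arr[i];
        }
        long elapsedNs = System.nanoTime() - startNs;
        result[0] = sum;
        return elapsedNs;
    }

    public static void main(String[] args) {
        int n = args.length > 0 ? Integer.parseInt(args[0]) : 10000000;
        double[] arr = new double[n];
        Arrays.fill(arr, 1.0); // setup work, deliberately outside the timed part

        double[] result = new double[1];
        long ns = timeSum(arr, result);
        System.out.println("sum = " + result[0]);
        System.out.println("loop time = " + (ns / 1.0e6) + " ms");
    }
}
```

A single measurement like this is still noisy (the JIT compiler may not have warmed up yet), so it is best treated as a rough indication rather than a benchmark.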


Note that you may not notice any speedup if you specify more threads than there are available CPUs. To see the available CPUs, use the following command:

In [20]:
!lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:            0
CPU MHz:             2199.998
BogoMIPS:            4399.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_sin
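
The same information is available from inside a Java program through the standard `Runtime.availableProcessors()` method, which can be handy for choosing a default thread count. The class name `CpuCount` below is just for illustration.

```java
public class CpuCount {

    // Number of processors the JVM reports as available to this process.
    public static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Available processors: " + cpus());
    }
}
```

The value counts logical CPUs, so on the machine shown above (one core with two hardware threads) it would be expected to report 2.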

## Parallel sum with automatic work subdivision 

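Below you'll find the same ParallelSum program, but this time we let Pyjama divide the work among the threads using the `omp for` directive. Note also that we have to use the `reduction` clause, because every thread adds to the variable `sum`: with `reduction(+:sum)`, each thread accumulates into its own private copy, and the copies are combined when the loop finishes.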

In [39]:
%%writefile ArraySum.java
import java.lang.*;
import java.lang.Math;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class ArraySum {
	public static void main(String[] args) {
		int numThreads = 0;
		int numItems = 0;
		if (args.length < 2) {
			System.err.println("usage: ArraySum <numThreads> <numItems>");
			System.exit(1);
		}
		try {
			numThreads = Integer.parseInt(args[0]);
			numItems = Integer.parseInt(args[1]);
		} catch (Exception ex) {
			System.err.println("Bad argument");
			System.exit(1);
		}
		double[] a = new double[numItems];

		fillArray(a);

		double sum = 0;
		
		//#omp parallel for shared(a) num_threads(numThreads) reduction(+:sum)
		for (int i = 0; i < a.length; i++) {
			sum += a[i];
		}
		System.out.println("Sum is " + String.valueOf(sum));
	
	}
	private static void fillArray(double[] a) {
		for (int i = 0; i < a.length; i++) {
			a[i] = 1; // can choose random: Math.random() * 100;
		}
	}
}


Overwriting ArraySum.java
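
To illustrate what `parallel for` with `reduction(+:sum)` does conceptually, here is a sketch written with plain Java threads. It is not Pyjama's implementation: the class `ManualReduction`, the helper `parallelSum`, and the block-wise split of the iterations are all assumptions made for this example. Each thread sums its share of the array into a private local variable, and the partial results are combined once all threads have finished.

```java
import java.util.Arrays;

public class ManualReduction {

    // Hypothetical helper: sums the array using numThreads threads.
    public static double parallelSum(final double[] a, final int numThreads) {
        final double[] partial = new double[numThreads];
        Thread[] team = new Thread[numThreads];

        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            team[t] = new Thread(new Runnable() {
                public void run() {
                    // Block distribution of the loop iterations (an assumption).
                    int start = (int) ((long) a.length * id / numThreads);
                    int end = (int) ((long) a.length * (id + 1) / numThreads);

                    double local = 0; // private copy, starts at the identity of +
                    for (int i = start; i < end; i++) {
                        local += a[i];
                    }
                    partial[id] = local; // a single write per thread
                }
            });
            team[t].start();
        }

        double sum = 0;
        for (int t = 0; t < numThreads; t++) {
            try {
                team[t].join(); // after join, partial[t] is safely visible
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new RuntimeException(e);
            }
            sum += partial[t]; // the combine step of the reduction
        }
        return sum;
    }

    public static void main(String[] args) {
        int numThreads = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        double[] a = new double[1000000];
        Arrays.fill(a, 1.0);
        System.out.println("Sum is " + parallelSum(a, numThreads));
    }
}
```

Without the private copies, all threads would update one shared variable at the same time, and additions could be lost.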


In [40]:
!java -jar Pyjama/Pyjama.jar ArraySum.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/16	15:05:58
-----------------------------------------------------
Processing file: ArraySum.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


Let's compare the runtime of this program with 1 and 2 threads (feel free to adjust this to another number based on the number of CPUs available to you):

In [43]:
!time java -cp Pyjama/Pyjama.jar:. ArraySum 1 100000000

Sum is 1.0E8

real	0m0.891s
user	0m0.564s
sys	0m0.396s


In [45]:
!time java -cp Pyjama/Pyjama.jar:. ArraySum 2 100000000

Sum is 1.0E8

real	0m0.837s
user	0m0.604s
sys	0m0.395s


## Parallel Min/Max

Here's another example of `reduction`, this time using the `max` operator.

In [49]:
%%writefile ParallelMinMax.java
import java.util.concurrent.ThreadLocalRandom;

public class ParallelMinMax {

    public static void main(String[] args) {

        int numThreads = Integer.parseInt(args[0]);
        int n = Integer.parseInt(args[1]);
        int max_val = 0;
        int [] arr = new int[n];
        for (int i = 0; i < n; i++) 
            arr[i] = ThreadLocalRandom.current().nextInt(0, 101);
        
        //#omp parallel num_threads(numThreads) shared(n, arr) reduction(max:max_val)
        {
            //#omp for
            for (int i = 0; i < n; i++)
            {
                if (arr[i] > max_val)
                    max_val = arr[i];
            }
        }

        // use this only for debugging with small n
        // for (int i = 0; i < n; i++) System.out.print(arr[i] + " ");

        System.out.println("\nmax value = " + max_val);
    }
}


Overwriting ParallelMinMax.java


In [50]:
!java -jar Pyjama/Pyjama.jar ParallelMinMax.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/16	15:10:21
-----------------------------------------------------
Processing file: ParallelMinMax.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


Let's run this with 4 threads to find the maximum of 100,000,000 random numbers between 0 and 100.

In [51]:
!java -cp Pyjama/Pyjama.jar:. ParallelMinMax 4 100000000


max value = 100


Now, modify the code to also find the minimum!

## A more complex example

In this example, we'll estimate Pi (3.14159...) using the Monte Carlo method. The area of a circle with radius 1 is Pi * r^2 = Pi * 1^2 = Pi, so the quarter of that circle lying in the first quadrant of the Cartesian plane has area Pi/4. If we generate random points (x, y) with x and y in the range [0, 1), they fall uniformly inside the unit square (area 1), and a point is inside the circle when sqrt(x^2 + y^2) <= 1. The ratio between the number of points inside the circle and the total number of points therefore approaches Pi/4, so Pi is approximately 4 * (number of points inside)/(total number of points).

Also note that the call to sqrt is not needed: since sqrt(1) = 1, the test sqrt(x^2 + y^2) <= 1 is equivalent to x^2 + y^2 <= 1.
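
Before looking at the parallel version, it may help to see the estimate computed sequentially. The sketch below is plain Java with no Pyjama directives; the class `MonteSequential`, the helper `estimatePi`, and the fixed seed are choices made for this illustration so that runs are repeatable.

```java
import java.util.Random;

public class MonteSequential {

    // Estimates Pi from numIters random points in the unit square.
    public static double estimatePi(int numIters, long seed) {
        Random rand = new Random(seed);
        int numIn = 0;
        for (int i = 0; i < numIters; i++) {
            double x = rand.nextDouble(); // uniform in [0, 1)
            double y = rand.nextDouble();
            // No sqrt needed: x^2 + y^2 <= 1 is the same test.
            if (x * x + y * y <= 1.0) {
                numIn++;
            }
        }
        return 4.0 * numIn / numIters;
    }

    public static void main(String[] args) {
        int numIters = args.length > 0 ? Integer.parseInt(args[0]) : 1000000;
        System.out.println("Pi is approximately " + estimatePi(numIters, 42L));
    }
}
```

The estimate converges slowly (the error shrinks roughly with the square root of the number of points), which is why the method benefits from running many iterations in parallel.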

In [54]:
%%writefile Monte.java
import java.lang.*;
import java.lang.Math;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

public class Monte {
	public static void main(String[] args) {
		int numIters = 0;
		try {
			numIters = Integer.parseInt(args[0]);
		} catch (Exception ex) {
			System.err.println("Exception when parsing argument");
			System.exit(1);
		}

		int numIn = 0;
		int numOut = 0;
		//#omp parallel num_threads(4) shared(numIters) reduction(+:numIn) reduction(+:numOut)
		{
			ThreadLocalRandom rand = ThreadLocalRandom.current();
			//#omp for
			for (int i = 0; i < numIters; i++) {
				// get random number from 0 to 1
				double x = rand.nextDouble();
				double y = rand.nextDouble();
				double hyp = x*x + y*y;
				if (hyp <= 1.0) {
					numIn++;
				} else {
					numOut++;
				}
			}
		}
		float p = (((float)numIn)/(numIn+numOut));
		float fourp = 4*p;
		System.out.println("Pi is " + String.valueOf(fourp));
	}
}


Overwriting Monte.java


In [55]:
!java -jar Pyjama/Pyjama.jar Monte.java


In [56]:
!java -cp Pyjama/Pyjama.jar:. Monte 100000000

Pi is 3.141376


# Pyjama Limitations

- It supports only a subset of OpenMP directives. The list of supported directives can be found [here](https://github.com/ParallelAndReconfigurableComputing/Pyjama/blob/master/src/pj/parser/java5OMP.jj). Note that not all of the keywords listed there are fully supported. Some things to be aware of:
  - There is an implied `default(none)` clause: all shared variables have to be declared using the `shared` clause.
  - All Pyjama directives must begin with `//#omp`, with no extra whitespace between `//` and `#omp`.
  - There is no support for user-defined reduction operators.
- Pyjama has the following limits on Java support:
  - Currently the parser only supports Java 1.5. Even if the `javac` compiler you use is a later version that supports all the latest syntax, you cannot use Java constructs beyond Java 1.5, since the Pyjama compiler will not accept them. Specifically, it doesn't support lambdas.
  - It doesn't support Java packages.
- Other quirks (not a complete list):
  - Arrays used in Pyjama parallel regions have to be declared as Java-style arrays (brackets before the variable name, not after):
    - CORRECT: `int [] arr_var = new int[n];`
    - INCORRECT: `int arr_var [] = new int[n];`
  - You can't use Pyjama function names, e.g. `omp_get_num_threads`, as variable names.
  - Issues with the `-j2j` (source-to-source) option:
    - If you don't also specify the `-d` (output directory) option along with `-j2j`, Pyjama will silently overwrite your source code with its generated code.
    - This option will append `src` to the path if you specify source code located outside the current directory.
