<a href="https://colab.research.google.com/github/rkurniawati/pyjama-patternlets/blob/master/Java_OpenMP_Patternlets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Java OpenMP Patternlet Notebook

Adapted to Java by Ruth Kurniawati (Westfield State University) based on the [PDC book](https://pdcbook.calvin.edu/pdcbook/RaspberryPiHandout/) from [CSInParallel](https://csinparallel.org/index.html). 

This notebook contains OpenMP patternlet examples in the Java programming language. Patterns are reusable solutions to commonly occurring problems. The OpenMP patternlets were originally written by Joel Adams in the C language with the OpenMP library: 

Adams, Joel C. "Patternlets: A Teaching Tool for Introducing Students to Parallel Design Patterns." 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. IEEE, 2015.

However, OpenMP is only available for the C/C++ and Fortran languages. For Java, Pyjama provides support for OpenMP-like directives. More information about Pyjama can be found in the paper below:

Vikas, Nasser Giacaman, and Oliver Sinnen. 2013. Pyjama: OpenMP-like implementation for Java, with GUI extensions. In <i>Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores</i> (<i>PMAM '13</i>). Association for Computing Machinery, New York, NY, USA, 43–52. DOI:https://doi.org/10.1145/2442992.2442997

# Multicore Systems and Multi-Threading

Before proceeding with the examples, let's investigate the computer that this notebook is running on. For this, let's use the `lscpu` command. 




In [None]:
!lscpu

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping:            0
CPU MHz:             2299.998
BogoMIPS:            4599.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs 

## Cores, Processes and Threads



If you run `lscpu` in the notebook, you may see output similar to below:

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              2
On-line CPU(s) list: 0,1
Thread(s) per core:  2
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:            0
CPU MHz:             2199.998
BogoMIPS:            4399.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            56320K
NUMA node0 CPU(s):   0,1
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat md_clear arch_capabilities
```

This output shows that the operating system sees 2 logical CPUs. There is only one physical core, however; that core can run 2 hardware threads at the same time. 

The `lscpu` command gives us a lot of useful information, including the number of available cores. In this case, there is one socket (or chip) with 1 physical core, and that core supports 2 threads. On larger systems, it is also common to see multiple threads supported per core. This is an example of **simultaneous multi-threading** (SMT, or Hyper-Threading on Intel systems). A **core** can be thought of as the compute unit of the CPU. It includes registers, an ALU, and control units.
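The logical CPU count can also be checked from inside a Java program. The sketch below is plain Java (no Pyjama needed); the class name `CpuCount` is just an illustrative choice. On the machine described above you would likely see 2, although the value reported to the JVM can differ from `lscpu` in some environments, for example inside containers.

```java
// Minimal sketch: ask the JVM how many logical processors it can use.
class CpuCount {
    // Returns the number of logical processors available to this JVM.
    static int logicalProcessors() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("Logical processors visible to the JVM: " + logicalProcessors());
    }
}
```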

Before we can discuss what a thread is, we must first discuss what a process is. A **process** can be thought of as an abstraction of a running program. When you type a command into the command line and press Enter, the Bash shell launches a process associated with that program executable. Each process contains a copy of the code and data of the program executable, and its own allocation of the stack and heap.

A **thread** is a light-weight process. While each thread gets its own stack allocation, it shares the heap, code and data of the parent process. As a result, all the threads in a multi-threaded process can access a common pool of memory. This is why multi-threading is commonly referred to as shared memory programming. A single-threaded process is also referred to as a serial process or program.
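To make the idea of shared memory concrete, here is a small sketch in plain Java (it uses `java.lang.Thread` directly rather than Pyjama, and the class and method names are made up for illustration). The array lives on the heap and is visible to every thread, while the variable `local` lives on each thread's own stack.

```java
import java.util.Arrays;

// Minimal sketch: threads share heap data but keep their own local variables.
class SharedHeapDemo {
    // Starts numThreads threads that each write one slot of a shared array.
    static int[] fillSharedArray(int numThreads) {
        int[] shared = new int[numThreads];       // on the heap, visible to all threads
        Thread[] workers = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            final int id = t;                     // each thread captures its own id
            workers[t] = new Thread(() -> {
                int local = id * id;              // on this thread's own stack
                shared[id] = local;               // each thread writes a different slot
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();                         // wait until the thread has finished
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // prints [0, 1, 4, 9]
        System.out.println(Arrays.toString(fillSharedArray(4)));
    }
}
```

Because every thread writes to a different slot, no two threads touch the same memory location, so this particular program needs no extra synchronization beyond the final `join`.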

### Process Execution
A multicore CPU allows multiple processes to execute simultaneously, or in **parallel**. While the terms concurrency and parallelism are related, it is useful to think of concurrency as a software/OS-level concept and parallelism as a hardware/execution concept. A multi-threaded program, while capable of parallel execution, runs concurrently but not in parallel on a system with only a single CPU core.

## Thread Execution

The primary goal of creating multi-threaded programs is to decrease a program’s execution time. In a program that is perfectly parallelizable (that is, all of its components are parallelizable), it is usually possible to distribute the work equally among all the threads. If such a program takes time _p_ on one core and its work is equally distributed among _t_ threads, it will take roughly _p_/_t_ time when executed on _t_ cores.

For example, if a multi-threaded process that is perfectly parallelized takes 100 seconds to execute on one core, then on a multi-core system with 4 cores the program will take approximately 100/4 = 25 seconds to execute.
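The arithmetic can be written as a tiny Java helper. This is only a sketch of the ideal case, under the assumption that the work divides perfectly and that threads add no overhead; the names `IdealSpeedup` and `idealSeconds` are illustrative.

```java
// Minimal sketch: ideal running time of perfectly parallel work.
class IdealSpeedup {
    // Time needed when serialSeconds of work is split evenly over the given number of cores.
    static double idealSeconds(double serialSeconds, int cores) {
        return serialSeconds / cores;
    }

    public static void main(String[] args) {
        double serialSeconds = 100.0;   // the 100-second example from the text
        for (int cores = 1; cores <= 8; cores *= 2) {
            System.out.println(cores + " core(s): about "
                    + idealSeconds(serialSeconds, cores) + " seconds");
        }
    }
}
```

Real programs rarely reach these numbers, since some part of the work is usually serial and creating threads takes time.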

## Leveraging Multiple Cores
While multicore processors are ubiquitous in today’s world, most of the popular programming languages were designed to support single-thread execution. However, several native libraries are available for supporting multi-threading in popular languages like C/C++ and FORTRAN.

One of these libraries is OpenMP (Open Multi-Processing), a popular API for shared memory programming and a standard since 1997. A key benefit of OpenMP over explicit threading libraries like POSIX threads is the ability to incrementally add parallelism to a program. With explicit threading, it is usually necessary to write a lot of extra code to make a program multi-threaded. OpenMP instead employs a series of pragmas, or special compiler directives, that tell the compiler how to parallelize the code.

OpenMP itself is only available for the C/C++ and Fortran languages. For Java, the Pyjama compiler and runtime provide support for OpenMP-like directives, as described in the Pyjama paper (Vikas, Giacaman, and Sinnen, 2013) cited in the introduction.

In the rest of this notebook, we will look at Java programs that use the OpenMP-like directives provided by Pyjama to explore small patterns (_patternlets_) in parallel programming. 

# Pyjama OpenMP library

Before proceeding to the code samples, we need to make the Pyjama compiler and runtime available to this notebook. 

The Pyjama library used in this notebook is obtained from the [CDER project](https://tcpp.cs.gsu.edu/curriculum/?q=node/21183), specifically [the version with additional bug fixes provided by Tennessee Tech](https://www.csc.tntech.edu/pdcincs/index.php/installation). 

## Pyjama setup

First, let's download and set up the Pyjama Java source code compiler and runtime library. The commands below will download a ZIP file from Tennessee Tech, unzip it, and create the link Pyjama/Pyjama.jar, which points to the specific jar file extracted from the ZIP file.

In [None]:
!wget -O Pyjama.zip https://www.csc.tntech.edu/pdcincs/resources/modules/tools/updated/Pyjama.zip
!unzip Pyjama.zip
!ln Pyjama/Pyjama-3.1.0.jar Pyjama/Pyjama.jar

--2021-08-12 03:20:23--  https://www.csc.tntech.edu/pdcincs/resources/modules/tools/updated/Pyjama.zip
Resolving www.csc.tntech.edu (www.csc.tntech.edu)... 149.149.134.5
Connecting to www.csc.tntech.edu (www.csc.tntech.edu)|149.149.134.5|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 688047 (672K) [application/zip]
Saving to: ‘Pyjama.zip’


2021-08-12 03:20:24 (1.82 MB/s) - ‘Pyjama.zip’ saved [688047/688047]

Archive:  Pyjama.zip
   creating: Pyjama/
  inflating: Pyjama/Pyjama-3.1.0.jar  
  inflating: Pyjama/set_pyjama.bat   
 extracting: Pyjama/set_pyjama.sh    


## Hello world

In this example, we will verify that the Pyjama installation is working and able to create multiple threads: the program requests 10 threads with `Pyjama.omp_set_num_threads` and then starts them with the `#omp parallel` directive.

In [None]:
%%writefile HelloWorld.java
public class HelloWorld
{
    public static void main(String[] args)
    {
        Pyjama.omp_set_num_threads(10);
        //#omp parallel
        {
            int id = Pyjama.omp_get_thread_num();
            int numThreads = Pyjama.omp_get_num_threads();
            System.out.print("Hello from thread " + id + ", ");
            System.out.println("of a total of " + numThreads + " threads.");
        }
    }
}

Overwriting HelloWorld.java


Note that the OpenMP directive is specified inside a single-line comment that starts with `//`. For the directive to be recognized by the Pyjama compiler, the `//` has to be followed immediately by `#omp`. 

First, let's use the Pyjama compiler to process the `#omp parallel` directive in the program.

In [None]:
!java -jar Pyjama/Pyjama.jar HelloWorld.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/11	20:09:54
-----------------------------------------------------
Processing file: HelloWorld.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


Now, we're ready to run the HelloWorld program. To do this, you will need to put Pyjama.jar on the classpath so that the Pyjama OpenMP-like runtime library is available to the HelloWorld program.

In [None]:
!java -cp Pyjama/Pyjama.jar:. HelloWorld

Hello from thread 7, Hello from thread 6, of a total of 10 threads.
of a total of 10 threads.
Hello from thread 4, Hello from thread 3, Hello from thread 0, of a total of 10 threads.
of a total of 10 threads.
of a total of 10 threads.
Hello from thread 5, of a total of 10 threads.
Hello from thread 2, Hello from thread 8, of a total of 10 threads.
Hello from thread 9, of a total of 10 threads.
Hello from thread 1, of a total of 10 threads.
of a total of 10 threads.


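You should see 10 greetings of the form `Hello from thread x, of a total of y threads.` Because each thread prints its greeting with two separate calls (`print`, then `println`), the two halves of a greeting may be interleaved with output from other threads when more than one processor is running the code. An example output is below.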

```
Hello from thread 9, Hello from thread 5, Hello from thread 8, Hello from thread 2, of a total of 10 threads.
of a total of 10 threads.
Hello from thread 6, Hello from thread 3, of a total of 10 threads.
Hello from thread 4, of a total of 10 threads.
of a total of 10 threads.
of a total of 10 threads.
Hello from thread 7, of a total of 10 threads.
Hello from thread 0, of a total of 10 threads.
Hello from thread 1, of a total of 10 threads.
of a total of 10 threads.
```

# A Simple Parallel Program

## The SPMD Patternlet


A patternlet is a small program that succinctly illustrates a common pattern in parallel programming. The first patternlet we will study is Single Program, Multiple Data (SPMD). Let’s start by examining Spmd2.java, a program that uses an OpenMP-like directive to make it easy to run a portion of the program on multiple threads. Note that the variables `id` and `numThreads` are declared outside the parallel block and listed in the directive's `shared` clause.

In [None]:
%%writefile Spmd2.java
class Spmd2 {
    public static void main(String[] args) {
        if (args.length >= 1) {
            Pyjama.omp_set_num_threads(Integer.parseInt(args[0]));
        }
        System.out.println();

        int id, numThreads;
        //#omp parallel shared(id, numThreads)
        {
            numThreads = Pyjama.omp_get_num_threads();
            id = Pyjama.omp_get_thread_num();
            System.out.println("Hello from thread "+ id +" of " + numThreads);
        }

        System.out.println();
    }
}


Overwriting Spmd2.java


The `omp parallel` directive tells the Pyjama compiler that the block of code within the curly braces should be run on separate threads. Up to the line with the directive, the program runs serially. When the block of code marked with the `omp parallel` directive executes, Pyjama creates a team of threads (known as forking). Each thread is assigned its own id and runs its own copy of the code between the curly braces. At the end of the block, Pyjama combines all the threads back into a single-threaded process (known as joining). Conceptually, the process looks like the following.

<img src="https://pdcbook.calvin.edu/pdcbook/RaspberryPiHandout/_images/ForkJoin_SPMD.png">
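For comparison, the fork-join steps can be written by hand with plain Java threads. The sketch below is an illustration only: it is not the code that Pyjama generates, and the names `ForkJoinSketch` and `runTeam` are made up. Each thread stores its greeting in its own array slot instead of printing, so the greetings come back in thread order.

```java
// Minimal sketch of fork-join with plain Java threads (not Pyjama's generated code).
class ForkJoinSketch {
    // Forks numThreads threads, lets each build its greeting, then joins them all.
    static String[] runTeam(int numThreads) {
        String[] messages = new String[numThreads];
        Thread[] team = new Thread[numThreads];

        // Fork: every thread runs the same body with a different id.
        for (int t = 0; t < numThreads; t++) {
            final int id = t;
            team[t] = new Thread(() ->
                messages[id] = "Hello from thread " + id + " of " + numThreads);
            team[t].start();
        }

        // Join: the main thread waits here until every team member is done.
        for (Thread member : team) {
            try {
                member.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        for (String message : runTeam(4)) {
            System.out.println(message);
        }
    }
}
```

The single `//#omp parallel` line in Spmd2.java stands in for both loops here, which is the incremental style that directives make possible.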

## Running the Program

Just like in the HelloWorld example, first we need to use the Pyjama compiler to process the `#omp` directive. 

In [None]:
!java -jar Pyjama/Pyjama.jar Spmd2.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/11	18:55:15
-----------------------------------------------------
Processing file: Spmd2.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


Now, we're ready to run the Spmd2 program. Let's request 10 threads by supplying this number as a command-line argument. 

In [None]:
!java -cp Pyjama/Pyjama.jar:. Spmd2 10 



Hello from thread 0 of 10
Hello from thread 1 of 10
Hello from thread 2 of 10
Hello from thread 3 of 10
Hello from thread 4 of 10
Hello from thread 5 of 10
Hello from thread 6 of 10
Hello from thread 7 of 10
Hello from thread 8 of 10
Hello from thread 9 of 10


Try running the program a few times with 10 threads (press the run button in the cell above). Observe the output. Occasionally something will be amiss. Do you notice it?

## Race Conditions

Watch this [video](https://d32ogoqmya1dw8.cloudfront.net/files/csinparallel/raceconditions_workshop.mov) to help you understand what's going on. Note that the video was made for the C++ version of the program, but the underlying issue is the same. 

The Spmd2 program has a race condition in which more than one thread modifies a shared variable. Which shared variable(s) is/are causing the problem?

## Fixing the code

For this example, the race condition can be avoided by ensuring that each thread has its own copy of the `id` and `numThreads` variables. Instead of declaring them as `shared` in the `omp parallel` directive, use the `private` clause as shown below.

In [None]:
%%writefile Spmd2.java
class Spmd2 {
    public static void main(String[] args) {
        if (args.length >= 1) {
            Pyjama.omp_set_num_threads(Integer.parseInt(args[0]));
        }
        System.out.println();

        int id, numThreads;
        //#omp parallel private(id, numThreads)
        {
            numThreads = Pyjama.omp_get_num_threads();
            id = Pyjama.omp_get_thread_num();
            System.out.println("Hello from thread "+ id +" of " + numThreads);
        }

        System.out.println();
    }
}


Overwriting Spmd2.java


Let's compile and run this modified program.

In [None]:
!java -jar Pyjama/Pyjama.jar Spmd2.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/11	19:19:29
-----------------------------------------------------
Processing file: Spmd2.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


In [None]:
!java -cp Pyjama/Pyjama.jar:. Spmd2 10 


Hello from thread 7 of 10
Hello from thread 4 of 10
Hello from thread 2 of 10
Hello from thread 9 of 10
Hello from thread 8 of 10
Hello from thread 0 of 10
Hello from thread 5 of 10
Hello from thread 1 of 10
Hello from thread 3 of 10
Hello from thread 6 of 10



Were you able to reproduce the race condition using the corrected program? Why should you also declare `numThreads` as a private variable?
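Race conditions are hard to observe on demand, so it can help to replay one unlucky ordering of steps by hand. The sketch below uses no threads at all: it simply executes, in sequence, the steps that two threads might take, first with a single shared `id` and then with one private `id` per thread. The class and method names are made up for illustration, and the ordering shown is only one of many possible interleavings.

```java
// Minimal sketch: a hand-written replay of one interleaving of two threads.
class RaceReplay {
    // Both "threads" use the same id variable, as with shared(id).
    static String[] withSharedId() {
        String[] printed = new String[2];
        int id;                                   // one variable shared by both threads
        id = 0;                                   // thread 0 stores its id
        id = 1;                                   // thread 1 overwrites it before thread 0 prints
        printed[0] = "Hello from thread " + id;   // thread 0 prints and sees 1
        printed[1] = "Hello from thread " + id;   // thread 1 prints and sees 1
        return printed;
    }

    // Same order of steps, but each "thread" has its own id, as with private(id).
    static String[] withPrivateId() {
        String[] printed = new String[2];
        int id0 = 0;                              // thread 0's private copy
        int id1 = 1;                              // thread 1's private copy
        printed[0] = "Hello from thread " + id0;  // thread 0 prints 0
        printed[1] = "Hello from thread " + id1;  // thread 1 prints 1
        return printed;
    }

    public static void main(String[] args) {
        System.out.println("shared:  " + String.join(" | ", withSharedId()));
        System.out.println("private: " + String.join(" | ", withPrivateId()));
    }
}
```

With the shared variable, thread 0's greeting reports the wrong id; with private copies, the same ordering of steps is harmless.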

# Running Loops in Parallel

Next we will consider a program that has a loop in it. An iterative for loop is a remarkably common pattern in all programming, primarily used to perform a calculation N times, often over a set of data containing N elements, using each element in turn inside the for loop.

If there are no dependencies between the iterations (i.e. the order of them is not important), then the code inside the loop can be split between forked threads. However, the programmer must first decide how to partition the work between the threads. Specifically, how many and which iterations of the loop will each thread complete on its own?

The **data decomposition** pattern describes how work is distributed across multiple threads. This section presents two patternlets, parallelLoop-equalChunks and parallelLoop-chunksOf1, that describe two common data decomposition strategies.

## Parallel Loop, Equal Chunks

Let's experiment with another OpenMP directive that will divide the work in a loop into equal chunks. 

In [None]:
%%writefile ParallelLoopEqualChunks.java
class ParallelLoopEqualChunks {
    final static int REPS = 16;
    public static void main(String[] args) {
        if (args.length >= 1) {
            Pyjama.omp_set_num_threads(Integer.parseInt(args[0]));
        }
        System.out.println();

        //#omp parallel for  
        for (int i = 0; i < REPS; i++) {
            int id = Pyjama.omp_get_thread_num();
            System.out.println("Thread "+id+" performed iteration "+i);
        }

        System.out.println();
    }
}


Writing ParallelLoopEqualChunks.java


The `omp parallel for` directive tells the Pyjama compiler to do the following:
- Generate a team of threads (by default, as many threads as there are cores).
- Assign each thread an equal number of iterations (a chunk) of the for loop.
- At the end of the scope of the for loop, join all the threads back into a single-threaded process.

As in our previous example, the code up to the `omp parallel for` directive is run serially. The code that is in the scope of the `omp parallel for` directive (everything inside the for loop) is run in parallel, with a subset of iterations assigned to each thread. After the implicit join at the end of the for loop, the program once again is a single-threaded process that executes serially to completion.

In the above program, REPS is set to 16. If the program is run with 4 threads, then each thread gets 4 iterations of the loop (see illustration below):

<img src="https://pdcbook.calvin.edu/pdcbook/RaspberryPiHandout/_images/ParallelFor_Chunks-4_threads-1.png">

## Try It Out

Compile and run the program using the commands below.

In [None]:
!java -jar Pyjama/Pyjama.jar ParallelLoopEqualChunks.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/11	19:27:40
-----------------------------------------------------
Processing file: ParallelLoopEqualChunks.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


In [None]:
!java -cp Pyjama/Pyjama.jar:. ParallelLoopEqualChunks 4


Thread 2 performed iteration 8
Thread 2 performed iteration 9
Thread 2 performed iteration 10
Thread 2 performed iteration 11
Thread 3 performed iteration 12
Thread 1 performed iteration 4
Thread 1 performed iteration 5
Thread 3 performed iteration 13
Thread 3 performed iteration 14
Thread 3 performed iteration 15
Thread 0 performed iteration 0
Thread 1 performed iteration 6
Thread 1 performed iteration 7
Thread 0 performed iteration 1
Thread 0 performed iteration 2
Thread 0 performed iteration 3



Try running the program a few times with 4 threads. How does the work in the for loop get assigned to the threads?

### Unequal Iterations

Also try using a different number of threads. Pick a number, such as 5, so that the number of iterations cannot be evenly divided by the number of threads. 

In [None]:
!java -cp Pyjama/Pyjama.jar:. ParallelLoopEqualChunks 5


Thread 2 performed iteration 7
Thread 2 performed iteration 8
Thread 2 performed iteration 9
Thread 0 performed iteration 0
Thread 0 performed iteration 1
Thread 1 performed iteration 4
Thread 0 performed iteration 2
Thread 0 performed iteration 3
Thread 3 performed iteration 10
Thread 3 performed iteration 11
Thread 3 performed iteration 12
Thread 1 performed iteration 5
Thread 1 performed iteration 6
Thread 4 performed iteration 13
Thread 4 performed iteration 14
Thread 4 performed iteration 15



What happens to the extra iterations?

This equal-chunk decomposition is especially useful in the following scenarios:
- Each iteration of the loop takes the same amount of time to finish
- The loop involves accesses to data in consecutive memory locations (e.g. an array), allowing the program to take advantage of spatial locality.
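One common way to compute equal chunks is sketched below in plain Java. It gives every thread `iterations / numThreads` iterations and hands one leftover iteration to each of the first `iterations % numThreads` threads. This scheme reproduces the assignments seen in the runs above, but it is an assumption for illustration; Pyjama's actual implementation may compute its chunks differently. The names `EqualChunks` and `chunkBounds` are made up.

```java
// Minimal sketch: splitting a loop into near-equal contiguous chunks.
class EqualChunks {
    // Returns {start, endExclusive} of the chunk owned by thread id.
    static int[] chunkBounds(int iterations, int numThreads, int id) {
        int base = iterations / numThreads;        // iterations every thread gets
        int extra = iterations % numThreads;       // leftovers, one each for the first threads
        int start = id * base + Math.min(id, extra);
        int length = base + (id < extra ? 1 : 0);
        return new int[] { start, start + length };
    }

    public static void main(String[] args) {
        int iterations = 16;
        int numThreads = 5;
        for (int id = 0; id < numThreads; id++) {
            int[] bounds = chunkBounds(iterations, numThreads, id);
            System.out.println("Thread " + id + " performs iterations "
                    + bounds[0] + " to " + (bounds[1] - 1));
        }
    }
}
```

With 16 iterations and 5 threads, thread 0 receives iterations 0 to 3 and the remaining threads receive three iterations each.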

## Parallel Loop, Chunks of 1

In some cases, it makes sense to have iterations assigned to threads in “round-robin” style. In other words, iteration 0 goes to thread 0, iteration 1 goes to thread 1, iteration 2 goes to thread 2, and so on.

Let's examine the code below that directs Pyjama to do this.

In [None]:
%%writefile ParallelLoopChunksOf1.java
class ParallelLoopChunksOf1 {
    final static int REPS = 16;
    public static void main(String[] args) {
        if (args.length >= 1) {
            Pyjama.omp_set_num_threads(Integer.parseInt(args[0]));
        }
        System.out.println();

        //#omp parallel for schedule(static,1)
        for (int i = 0; i < REPS; i++) {
            int id = Pyjama.omp_get_thread_num();
            System.out.println("Thread "+id+" performed iteration "+i);
        }

        System.out.println();
    }
}

Writing ParallelLoopChunksOf1.java


The code is nearly identical to the previous program. The difference is in the `omp` directive. The `omp parallel for` directive has a new `schedule` clause, which specifies how iterations should be assigned to threads. The `static` keyword indicates that the assignment of iterations to threads follows a fixed pattern determined before the loop starts running (a **static** scheduling policy). The `1` indicates that the chunk size should be 1 iteration. Therefore, the above code has 16 chunks in total.

When the number of chunks exceeds the number of threads, each successive chunk is assigned to a thread in round-robin fashion.
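The round-robin rule can be expressed as a one-line formula: chunk number `iteration / chunkSize` goes to thread `chunk % numThreads`. The sketch below is plain Java written for illustration (the names `StaticChunkOwner` and `owner` are made up); it describes the mapping, not how Pyjama implements it internally.

```java
// Minimal sketch: which thread owns an iteration under schedule(static, chunkSize).
class StaticChunkOwner {
    // Chunks are dealt out to threads 0, 1, ..., numThreads-1 and then around again.
    static int owner(int iteration, int chunkSize, int numThreads) {
        int chunk = iteration / chunkSize;   // index of the chunk containing this iteration
        return chunk % numThreads;           // round-robin over the team
    }

    public static void main(String[] args) {
        int reps = 16;
        int numThreads = 4;
        for (int i = 0; i < reps; i++) {
            System.out.println("Iteration " + i + " belongs to thread " + owner(i, 1, numThreads));
        }
    }
}
```

With a chunk size of 1 and 4 threads, thread 1 owns iterations 1, 5, 9, and 13.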

### Try It Out

Let's compile and run the code. 

In [None]:
!java -jar Pyjama/Pyjama.jar ParallelLoopChunksOf1.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/11	19:36:51
-----------------------------------------------------
Processing file: ParallelLoopChunksOf1.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


In [None]:
!java -cp Pyjama/Pyjama.jar:. ParallelLoopChunksOf1 4


Thread 3 performed iteration 3
Thread 1 performed iteration 1
Thread 1 performed iteration 5
Thread 1 performed iteration 9
Thread 1 performed iteration 13
Thread 3 performed iteration 7
Thread 3 performed iteration 11
Thread 3 performed iteration 15
Thread 2 performed iteration 2
Thread 2 performed iteration 6
Thread 2 performed iteration 10
Thread 2 performed iteration 14
Thread 0 performed iteration 0
Thread 0 performed iteration 4
Thread 0 performed iteration 8
Thread 0 performed iteration 12



## Another way to do round-robin scheduling

At this point, you might have guessed that `#omp parallel` creates the threads and `#omp for` divides the work according to the specified schedule (the default is `static` with a chunk size equal to the number of iterations divided by the number of threads). With this in mind, we can write code that does round-robin scheduling of the for loop without using `#omp for`. 



In [None]:
%%writefile ParallelLoopChunksOf1.java
class ParallelLoopChunksOf1 {
    final static int REPS = 16;
    public static void main(String[] args) {
        if (args.length >= 1) {
            Pyjama.omp_set_num_threads(Integer.parseInt(args[0]));
        }
        System.out.println();

        //#omp parallel
        {
            int numThreads = Pyjama.omp_get_num_threads();
            int id = Pyjama.omp_get_thread_num();
            for (int i = id; i < REPS; i+=numThreads) {
                System.out.println("Thread "+id+" performed iteration "+i);
            }
        }
        System.out.println();
    }
}



Overwriting ParallelLoopChunksOf1.java


In [None]:
!java -jar Pyjama/Pyjama.jar ParallelLoopChunksOf1.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/11	19:46:28
-----------------------------------------------------
Processing file: ParallelLoopChunksOf1.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


In [None]:
!java -cp Pyjama/Pyjama.jar:. ParallelLoopChunksOf1 4


Thread 3 performed iteration 3
Thread 0 performed iteration 0
Thread 0 performed iteration 4
Thread 0 performed iteration 8
Thread 0 performed iteration 12
Thread 2 performed iteration 2
Thread 1 performed iteration 1
Thread 3 performed iteration 7
Thread 2 performed iteration 6
Thread 1 performed iteration 5
Thread 1 performed iteration 9
Thread 1 performed iteration 13
Thread 2 performed iteration 10
Thread 2 performed iteration 14
Thread 3 performed iteration 11
Thread 3 performed iteration 15



Compare the work assignment in this program with that of the previous program, which uses the `#omp parallel for` directive.

## Dynamic Scheduling

In some cases, it is beneficial for the assignment of loop iterations to occur at run-time. This is especially useful when each iteration of the loop can take a different amount of time. Dynamic scheduling, or assigning iterations to threads at run time, allows threads that have finished work to start on new work, while letting threads that are still busy continue to work in peace.
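One way to picture dynamic scheduling is a shared counter holding the next unclaimed iteration: whenever a thread becomes free, it claims the next value. The sketch below shows this idea in plain Java with `AtomicInteger`. It is an illustration under that assumption and not Pyjama's implementation; the names `DynamicScheduleSketch`, `run`, and `simulateWork` are made up.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch: dynamic scheduling with a shared "next iteration" counter.
class DynamicScheduleSketch {
    // Runs all iterations on numThreads threads; returns how often each iteration ran.
    static int[] run(int iterations, int numThreads) {
        AtomicInteger next = new AtomicInteger(0);   // next unclaimed iteration
        int[] timesRun = new int[iterations];
        Thread[] team = new Thread[numThreads];
        for (int t = 0; t < numThreads; t++) {
            team[t] = new Thread(() -> {
                int i;
                // A free thread claims the next iteration until none remain.
                while ((i = next.getAndIncrement()) < iterations) {
                    simulateWork(i);
                    timesRun[i]++;   // safe: each index is claimed by exactly one thread
                }
            });
            team[t].start();
        }
        for (Thread member : team) {
            try {
                member.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return timesRun;
    }

    // Uneven work: the sleep time depends on the iteration number.
    static void simulateWork(int i) {
        try {
            Thread.sleep(i % 5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int[] timesRun = run(20, 4);
        int total = 0;
        for (int count : timesRun) {
            total += count;
        }
        System.out.println("Iterations executed: " + total);
    }
}
```

Which thread performs which iteration varies from run to run, but every iteration is performed exactly once.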

An example of a program that benefits from this kind of scheduling is shown below. The loop simulates uneven work: iteration _i_ sleeps for _i_ milliseconds, so later iterations take longer than earlier ones, much as checking whether a larger number is prime takes more work than checking a smaller one. 



In [None]:
%%writefile SimpleDynamicScheduling.java
class SimpleDynamicScheduling {

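    // Pause the calling thread for numMillis milliseconds to simulate work.
    static void sleepALittle(int numMillis) {
        try {
            Thread.sleep(numMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
        }
    }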

    public static void main(String[] args) {
        int numThreads = Pyjama.omp_get_num_procs();
        if (args.length >= 1) {
            numThreads = Integer.parseInt(args[0]);
        }

        long startTime = System.currentTimeMillis();
        int count = 1;

        //#omp parallel for num_threads(numThreads) /* schedule(dynamic)  */
        for(int i = 1; i <= 100; i++) {
            sleepALittle(i);
        }
    
        long endTime = System.currentTimeMillis();        
        System.out.println("Time = " + (endTime-startTime) + " ms");

    }
}

Overwriting SimpleDynamicScheduling.java


To employ a dynamic scheduling policy, specify `schedule(dynamic)` or `schedule(dynamic, chunkSize)` instead of `schedule(static)` or `schedule(static, chunkSize)`. In the program above, the `schedule(dynamic)` clause sits inside a `/* */` comment on the directive line; move it out of the comment to enable it. Try dynamic scheduling with different chunk sizes and see what happens to the program's runtime.



In [None]:
!java -jar Pyjama/Pyjama.jar SimpleDynamicScheduling.java

Pyjama Compiler Version: 3.1.0
-----------------------------------------------------
2021/08/12	03:22:19
-----------------------------------------------------
Processing file: SimpleDynamicScheduling.java
-----------------------------------------------------
Processing 1st Phase: Parse and Normalisation
Processing 2nd Phase: Symbol scoping visiting
Processing 3rd Phase: Pyjama code translation visiting
Processing 4th Phase: Generating java code
Paralleled .class file is generated.
Processing Done.


In [None]:
!java -cp Pyjama/Pyjama.jar:. SimpleDynamicScheduling 4

Time = 2219 ms
