# Difference between login node and compute node on a cluster

| Feature       | Login Node | Compute Node |
|--------------|-----------|-------------|
| Purpose      | User interaction, job submission, file management | Running computational jobs. Compute nodes carry out the HPC calculations on the cluster, launched through a job script |
| Accessibility | Direct user access via SSH | Accessed via a job scheduler such as the Portable Batch System (PBS) |
| Resources    | Limited CPU, memory | High-performance CPUs, GPUs, memory |
| Multi-user   | Shared | Often exclusive to a job |

OpenMP:   
OpenMP (Open Multi-Processing) is a set of compiler directives and runtime functions that allow you to write parallel code easily in C, C++, and Fortran. It enables shared-memory multiprocessing.

## Job script that changes the present working directory on the cluster

In [None]:
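#PBS -N first-example
#PBS -q teachingq
#PBS -l select=1:ncpus=1:mpiprocs=1
#PBS -l walltime=00:01:00
#PBS -o log.out
#PBS -e log.err
# -N job name, -q queue, -l resources (1 chunk with 1 core, 1 minute limit), -o/-e stdout/stderr files
echo -e "Job started from $(pwd)"
echo "Changing directory to..."
# PBS sets PBS_O_WORKDIR to the directory qsub was run from; here it is overwritten with a fixed path
PBS_O_WORKDIR=~/parllel_computing/
cd $PBS_O_WORKDIR
echo -e "$(pwd)"
cat my_file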

# Command for submitting a job script to cluster
qsub job.script

PBS assigns a job ID on submission. Example:
(base) [sy37tovi@mlogin01 parllel_computing]$ qsub change_pwd_cluster
927540.mmaster02

`qstat` lists the jobs that are queued or running. `qdel <job-id>` deletes a specific job that is currently running (status R) or waiting (status Q).

Status of the job as shown by `qstat`:

| Job ID          | Name           | User      | Time Use | S | Queue      |
|-----------------|----------------|-----------|----------|---|------------|
| 929936.mmaster02| first-example  | sy37tovi  |        0 | R | teachingq  |


| S | Meaning   | Description                                                               |
|------|-----------|---------------------------------------------------------------------------|
| Q    | Queued    | Job is waiting in the queue to be scheduled.                             |
| R    | Running   | Job is currently running on compute nodes.                               |
| C    | Completed | Job has finished running (you might see this only briefly or in logs).   |
| E    | Exiting   | Job is in the process of ending (wrapping up, cleaning up output, etc).  |
| H    | Held      | Job is being held and won't be scheduled until released.                 |
| S    | Suspended | Job has been suspended (paused) — usually by an admin or scheduler.      |
| W    | Waiting   | Job is waiting for its start time (e.g., a scheduled future run).        |


# Difference between shared memory and distributed memory

| Feature          | Shared Memory | Distributed Memory |
|-----------------|--------------|-------------------|
| **Definition**   | All processors access a common memory space. | Each processor has its own local memory. |
| **Communication** | Data is shared through global memory. | Processors communicate via message passing (e.g., MPI). |
| **Programming Model** | OpenMP, Pthreads | MPI (Message Passing Interface) |
| **Scalability**  | Limited by memory bandwidth and bus speed. | Scales well across multiple nodes. |
| **Performance**  | Fast communication but limited scalability. | Higher latency due to network communication. |
| **Hardware Example** | Multi-core CPUs, SMP (Symmetric Multiprocessing) systems. | Cluster of computers, supercomputers. |
| **Best For** | Programs running on a single machine with multiple cores. | Large-scale parallel computing across multiple nodes. |

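# C code for an OpenMP example (open_mpi.c):

Note: despite the file name, this program uses OpenMP (shared-memory threads), not Open MPI (a message-passing library).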

In [None]:
#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv){

    #pragma omp parallel
    {
        int numthreads, num;
        numthreads = omp_get_num_threads();
        num = omp_get_thread_num();
        printf("Hello from thread %d of %d\n", num, numthreads);
    }
    return 0;
}

Compilation command: gcc -o output.bin open_mpi.c -O0 -fopenmp

| **Part**              | **Meaning**                                               |
|-----------------------|-----------------------------------------------------------|
| `gcc`                 | Calls the GNU C compiler.                                 |
| `-o output.bin`       | Specifies the output file name as `output.bin`.           |
| `open_mpi.c`          | The source file to compile.                               |
| `-O0`                 | Disables optimizations (useful for debugging).            |
| `-fopenmp`            | Enables OpenMP support for parallel programming.          |


Understanding Nodes, CPUs, and Threads:

Node: A node is a machine or a computer in a cluster that usually has multiple CPU cores.

CPU/Core: This is a single processing unit inside a node. Typically, each core can run one thread at a time.

Thread: This is a unit of execution within a program. In OpenMP, threads are typically created within a process, and each thread can be mapped to a separate core.

# Job script (job_script_4threads) that runs the compiled program on the cluster from the 'open_mp_example' directory

In [None]:
#PBS -N first-example
#PBS -q teachingq
#PBS -l select=1:ncpus=4:mpiprocs=1
#PBS -l walltime=00:01:00
#PBS -o log.out1
#PBS -e log.err1
export OMP_NUM_THREADS=4

echo -e "Job started from $(pwd)."
echo "Changing directory to..."
PBS_O_WORKDIR=/home/sy37tovi/parllel_computing/open_mp_example
cd $PBS_O_WORKDIR
echo -e "$(pwd)"

./output.bin

Job started from /home/sy37tovi/pbs.929936.mmaster02.x8z.
Changing directory to...
/home/sy37tovi/parllel_computing/open_mp_example
Hello from thread 0 of 4
Hello from thread 3 of 4
Hello from thread 1 of 4
Hello from thread 2 of 4

# C code example showing how the thread functions behave inside and outside a parallel region

In [None]:
#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv){

    int numthreads = omp_get_num_threads();
    int num = omp_get_thread_num();

    printf("Hello from the master thread %d of %d\n\n", num, numthreads);

    #pragma omp parallel
    {
        numthreads = omp_get_num_threads();
        num = omp_get_thread_num();
        printf("     Hello from the forked thread %d of %d\n", num, numthreads);
    }

    printf("\nHello (again) from the master thread %d of %d\n", num, numthreads);

    return 0;
}

Compilation command: gcc -o output2.bin open_mpi2.c -O0 -fopenmp (saves the compiled file as output2.bin)

# Job script (job_script_4threads2) that runs the compiled program from the 'open_mp_example' directory

In [None]:
#PBS -N first-example
#PBS -q teachingq
#PBS -l select=1:ncpus=4:mpiprocs=1
#PBS -o log.out2
#PBS -e log.err2
export OMP_NUM_THREADS=4

echo -e "Job started from $(pwd)."
echo "Changing directory to..."
PBS_O_WORKDIR=/home/sy37tovi/parllel_computing/open_mp_example
cd $PBS_O_WORKDIR
echo -e "$(pwd)"

./output2.bin

Before the parallel region, the program runs:

int numthreads = omp_get_num_threads();
int num = omp_get_thread_num();

At this point the program has not entered the parallel region yet, so `omp_get_num_threads()` returns 1 and `omp_get_thread_num()` returns 0. The first printf therefore prints:

Hello from the master thread 0 of 1

In the `#pragma omp parallel` block, OpenMP forks multiple threads.

Each thread will:

Get its own thread number (`omp_get_thread_num()`)

Get the total number of threads (`omp_get_num_threads()`)

With OMP_NUM_THREADS=4, the block prints one line per thread, in no guaranteed order:

Hello from the forked thread 0 of 4
Hello from the forked thread 1 of 4
Hello from the forked thread 2 of 4
Hello from the forked thread 3 of 4


Command for submitting the job script:
qsub job_script_4threads2

After the job has run, the output is saved in log.out2, as specified in the job script:

Job started from /home/sy37tovi/pbs.929938.mmaster02.x8z.
Changing directory to...
/home/sy37tovi/parllel_computing/open_mp_example
Hello from the master thread 0 of 1

     Hello from the forked thread 0 of 4
     Hello from the forked thread 2 of 4
     Hello from the forked thread 1 of 4
     Hello from the forked thread 3 of 4

Hello (again) from the master thread 3 of 4

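Here, this line:
printf("\nHello (again) from the master thread %d of %d\n", num, numthreads);

reports thread 3 of 4 as the master thread, even though the master thread is always thread 0 and only one thread is running after the parallel region.

Variables declared before the parallel region, such as num and numthreads, are shared by default in OpenMP. All threads therefore write to the same memory location, which is a race condition: after the region, num simply holds the value written by whichever thread stored last (here thread 3).

One fix is to rewrite the C program so that both variables are assigned again after the parallel block.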

In [None]:
#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv){

    int numthreads = omp_get_num_threads();
    int num = omp_get_thread_num();

    printf("Hello from the master thread %d of %d\n\n", num, numthreads);

    #pragma omp parallel
    {
        numthreads = omp_get_num_threads();
        num = omp_get_thread_num();
        printf("     Hello from the forked thread %d of %d\n", num, numthreads);
    }

    // FIX: Recomputing these values after parallel section to get the master thread
    numthreads = omp_get_num_threads();
    num = omp_get_thread_num();
    printf("\nHello (again) from the master thread %d of %d\n", num, numthreads);

    return 0;
}

Now the output is:
Job started from /home/sy37tovi/pbs.929983.mmaster02.x8z.
Changing directory to...
/home/sy37tovi/parllel_computing/open_mp_example
Hello from the master thread 0 of 1

     Hello from the forked thread 0 of 4
     Hello from the forked thread 1 of 4
     Hello from the forked thread 3 of 4
     Hello from the forked thread 2 of 4

Hello (again) from the master thread 0 of 1

# omp parallel private(num)
`#pragma omp parallel`
This tells the compiler to create a parallel region, meaning that multiple threads will execute the block of code that follows.

`private(num)`
This clause declares the variable num as private to each thread. That means:

Each thread gets its own uninitialized copy of num.

Modifications to num by one thread do not affect the value of num seen by other threads.

The original value of num (before the parallel region) is not visible to threads inside the parallel region.

The value assigned to num inside the parallel region is not preserved after the parallel region ends.


# omp parallel for: partitioning the iterations among threads

If a for loop is placed inside a plain parallel block, every thread executes all n iterations, which is redundant in most cases. With omp parallel for, the iterations are instead divided among the threads, so each iteration runs exactly once.

In [None]:
#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv) {
    int i;
    int N = 10;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        int num = omp_get_thread_num();
        printf("Thread %d does iteration %d\n", num, i);
    }
    return 0;
}

Compilation command: gcc -o output2.bin open_mp_parllel_for.c -O0 -fopenmp


Output:

Job started from /home/sy37tovi/pbs.932784.mmaster02.x8z.
Changing directory to...
/home/sy37tovi/parllel_computing/open_mp_for_loop
Thread 0 does iteration 0
Thread 0 does iteration 1
Thread 0 does iteration 2
Thread 2 does iteration 6
Thread 2 does iteration 7
Thread 3 does iteration 8
Thread 3 does iteration 9
Thread 1 does iteration 3
Thread 1 does iteration 4
Thread 1 does iteration 5

# Scheduling a parallel for loop with static, dynamic, and runtime schedules
Present in: /parllel_computing/open_mp_for_loop/scheduling folder

In OpenMP, scheduling controls how loop iterations are divided among threads in a parallel for loop.

Why is scheduling important?
Different types of work may take different time per iteration, so a bad schedule might lead to:

Some threads finishing early and sitting idle.

Unbalanced load → reduced performance.

| **Schedule Type** | **How it Works**                                                                 |
|-------------------|-----------------------------------------------------------------------------------|
| `static`          | Iterations are divided evenly before the loop starts. Fast but rigid.            |
| `dynamic`         | Threads grab chunks of iterations as they finish work. Better for load balancing.|
| `guided`          | Like dynamic, but chunk sizes shrink over time. Good for decreasing workloads.   |
| `auto`            | Lets the compiler/runtime decide the best scheduling strategy.                   |
| `runtime`         | Uses the `OMP_SCHEDULE` environment variable to decide at runtime.               |

If no schedule clause is given, the default schedule is implementation defined; in practice it is usually static (for example with GCC).


What is the difference between

– schedule(static,1),

– schedule(dynamic), and

– schedule(dynamic,3)?


schedule(static,1):

Static schedule: iterations are assigned to threads up front, in round-robin order.

Chunk size = 1: each thread gets one iteration at a time, but the assignment is predetermined.

Example (4 threads, 8 iterations):

| Thread | Iterations |
|--------|------------|
| 0      | 0, 4       |
| 1      | 1, 5       |
| 2      | 2, 6       |
| 3      | 3, 7       |

schedule(dynamic):

Threads ask for more work as they finish.

The chunk size defaults to 1 when it is not specified.

Best when the workload per iteration is unpredictable.

Example (dynamic load balancing): a thread that finishes early asks for more iterations. There is no fixed assignment as in static.

schedule(dynamic,3):

Threads ask for more work as they finish.

Threads grab 3 iterations at a time (called a chunk).

Possible behavior (which thread gets which chunk is not fixed):

Thread 0 grabs 0–2

Thread 1 grabs 3–5

Thread 2 grabs 6–8

And so on...
    




# Example: Monte Carlo simulation for OpenMP reduction and thread safety

Monte Carlo simulation is a statistical technique used to understand the impact of risk and uncertainty in mathematical models and decision-making. It draws random samples, runs many simulations, and estimates the result by averaging their outcomes.

Instead of solving a problem analytically, we simulate the problem many times with random inputs and look at the average outcome.

# Idea behind the Monte Carlo simulation:
The program estimates the value of π using a Monte Carlo method.

Imagine a square with side length 1, from (0,0) to (1,1).

Inside the square, fit a quarter circle of radius 1 centered at (0,0).

Randomly throw points inside the square.

The ratio of points inside the circle to total points approximates the area of the quarter circle.

Since:

Area of quarter circle = (π * r²) / 4 = π / 4

Area of square = 1

Then:

π ≈ 4 × (number of points inside circle / total number of points)
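Written as one chain, with the symbols N_hit (points inside the quarter circle) and N (total points) introduced here for illustration, the estimate reads:

```latex
\frac{N_{\text{hit}}}{N}
\;\approx\;
\frac{A_{\text{quarter circle}}}{A_{\text{square}}}
= \frac{\pi r^{2}/4}{1}
= \frac{\pi}{4}
\quad (r = 1)
\qquad\Longrightarrow\qquad
\pi \;\approx\; 4\,\frac{N_{\text{hit}}}{N}
```

As a made-up numerical illustration (not a measured result): if 7854 out of 10000 points land inside the quarter circle, the estimate is 4 × 0.7854 = 3.1416.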


In [None]:
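#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv){
        unsigned int i,hit=0;
        unsigned int simulations;

        if (argc<2 || atoi(argv[1])<=0){
                printf("Usage: ./<exe> <unsigned int = problem size>\n");
                simulations=10;
        }else {
                simulations=atoi(argv[1]);
        }
        double x,y;

        #pragma omp parallel private(x,y) reduction(+:hit)
        {
                // rand_r() takes a pointer to a seed; a separate seed per thread keeps it thread safe.
                // The seed value is an arbitrary choice (assumption), offset by the thread number.
                unsigned int seed = 1234u + omp_get_thread_num();

                #pragma omp for
                for(i=0; i<simulations;i++)
                {
                        x = ((double)rand_r(&seed))/RAND_MAX;
                        y = ((double)rand_r(&seed))/RAND_MAX;
                        if ((x*x)+(y*y)<=1){
                                hit++;
                        }
                }
        }
        printf("Pi=%16.16f\n",(4.0*hit)/simulations);
        return 0;
}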

reduction(+:hit)
hit is being incremented (hit++) inside the loop.

If multiple threads tried to increment the same hit variable at once, you'd get incorrect results (race condition).

reduction(+:hit) means:

Each thread gets its own local copy of hit, initialized to 0.

After the loop, OpenMP adds up all the local hits into the shared/global hit variable.

# Jobscript for montecarlo.c

In [None]:
#PBS -N first-example
#PBS -q teachingq
#PBS -l select=1:ncpus=4:mpiprocs=1
#PBS -l walltime=00:01:00
#PBS -o log.out1
#PBS -e log.err1
export OMP_NUM_THREADS=4

echo -e "Job started from $(pwd)."
echo "Changing directory to..."
PBS_O_WORKDIR=/home/sy37tovi/parllel_computing/montecarlo_parllel_computing
cd $PBS_O_WORKDIR
echo -e "$(pwd)"

./montecarlo.bin 400000000 # executing with 400000000 simulations

# Strong scaling test:

A strong scaling test measures how the performance of a parallel program improves when you increase the number of processors (threads/cores) while keeping the problem size fixed.