# HW3 Notebook

## Submission instructions
- **Publication Date: 5/1.**
- **Submission Date: 22/1**.
- **Submission in groups of up to 2 students (individually or in pairs).** 
- **Submission on the course website, in zip format including this directory with the relevant output, specifically:** 
  - the source files.
  - this notebook (run_hw3.ipynb) after executing all the cells. 
  - output files of queued jobs that might be created during the execution. 
  
- **Make sure the timers wrap the same work as in the provided code.**


### Fill the name and ID of the submitters:
#### Student Name: Yosef Goren Student ID: 211515606

#### Student Name: Konstantin Kishinevsky Student ID: 

**Note:** If you submit in pairs, it is sufficient for one student to submit the assignment on the course website. \
Remove one line if submitting individually, or leave it empty.

# Part 1: Vectorization and Compiler Optimization
**In this section you are required to work on a CPU node with Intel® Xeon® Scalable 6128 processors.** \
**Instruction Set Extension for this processor: Intel® SSE4.2, Intel® AVX, Intel® AVX2, Intel® AVX-512.** \
To submit a job to such a node, we use: \
```qsub -l nodes=1:gold6128:ppn=2``` (we ask for allocation of 1 compute node with the property of 'gold6128'). 


## Problem 1: PI computation with icc optimizations (30 points)

In [None]:
import os
os.chdir(os.path.expanduser('~')+'/HW3/icc_optimizaitons')

**Compile the code using icc without any optimizations (-O0) and execute the code:**

In [None]:
! chmod 755 ../q-cpu; chmod 755 no_opt.sh;if [ -x "$(command -v qsub)" ]; then ./../q-cpu no_opt.sh; else ./no_opt.sh; fi

Report the execution time:

**Compile the code using icc with the default optimization level (-O2):**

In [None]:
! chmod 755 ../q-cpu; chmod 755 default_opt.sh;if [ -x "$(command -v qsub)" ]; then ./../q-cpu default_opt.sh; else ./default_opt.sh; fi

Report the execution time:

What optimizations were enabled?
Was the main loop of the code **vectorized**? Explain. (use the compiler reports)

**Edit the compilation line in _ipo_simd_opt.sh_ to achieve better results.** \
Include **InterProcedural Optimizations (IPO)** and try different **SIMD instruction set extensions** (the -xSSE4.2, -xAVX, -xCORE-AVX2, -xCORE-AVX512 or -xHost options; consider adding -qopt-zmm-usage=high). You can also try -O3 instead of -O2. Consider also including the Profile-Guided Optimizations (PGO) that we mentioned in class. 

In [None]:
! chmod 755 ../q-cpu; chmod 755 ipo_simd_opt.sh;if [ -x "$(command -v qsub)" ]; then ./../q-cpu ipo_simd_opt.sh; else ./ipo_simd_opt.sh; fi

Report the execution time:

What optimizations were enabled?
Was the main loop of the code **vectorized**? Explain. (use the compiler reports)

Now, we will use **OpenMP _simd_ pragmas** to enable vectorization without using IPO. \
**Edit _openmp_simd/pi.c_ and _openmp_simd/fx.c_ to enable vectorization of the main loop by adding OpenMP pragmas**.
You can add the SIMD instruction set extensions to the compiler options as before, just avoid using -ipo. You can also try -O3 instead of -O2. \
Try to achieve the best vectorization possible with OpenMP. 
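
As a rough illustration of the kind of pragmas meant here, the sketch below vectorizes a midpoint-rule integration loop that calls a function which, in the real exercise, would live in a separate source file. The integrand `4/(1+x*x)`, the function names `f` and `compute_pi`, and the `num_steps` parameter are assumptions for illustration; the provided _pi.c_ and _fx.c_ may be organized differently.

```c
#include <stddef.h>

/* Hypothetical integrand (assumed; the real one is in fx.c).
   `declare simd` asks the compiler to also generate a vector variant of
   the function, so a call from a simd loop in another source file can be
   vectorized without interprocedural optimization. */
#pragma omp declare simd
double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Midpoint-rule estimate of the integral of f over [0, 1]. */
double compute_pi(long num_steps)
{
    const double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    /* Each SIMD lane accumulates a partial sum; the reduction clause
       combines the partial sums when the loop finishes. */
    #pragma omp simd reduction(+:sum)
    for (long i = 0; i < num_steps; i++) {
        double x = ((double)i + 0.5) * step;
        sum += f(x);
    }
    return step * sum;
}
```

The pragmas only take effect when OpenMP SIMD support is enabled in the compiler (for icc, the `-qopenmp-simd` or `-qopenmp` option); otherwise they are ignored and the loop runs as ordinary scalar code. In a two-file layout, the `declare simd` pragma would also need to appear on the function's prototype where the caller can see it.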

In [None]:
import os
os.chdir(os.path.expanduser('~')+'/HW3/icc_optimizaitons/openmp_simd')

In [None]:
! chmod 755 ../../q-cpu; chmod 755 openmp_simd.sh;if [ -x "$(command -v qsub)" ]; then ./../../q-cpu openmp_simd.sh; else ./openmp_simd.sh; fi

Report the execution time:

# Part 2: Offloading with OpenMP
**In this section you are requested to work on a GPU node.** \
To submit a job to such a node, we use: \
```qsub -l nodes=1:gpu:ppn=2``` 
- You are encouraged to check the node specifications with _lscpu_, _cat /proc/cpuinfo_, _numactl --hardware_ and more. 
- When you implement an offloading version, run with _export LIBOMPTARGET_DEBUG=1_ to see some important details, including the available devices on the node and their specifications (amount of memory, number of compute units, etc).
- It might be helpful to use GPU analysis with the VTune™ Profiler. See the following links: 
  - https://www.intel.com/content/www/us/en/develop/documentation/oneapi-gpu-optimization-guide/top/tools/vtune.html 
  - https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top/analyze-performance/accelerators-group/gpu-offload-analysis.html


## Problem 2: Heat Problem (30 points)
The following code in _serial_heat.c_ implements a finite difference method to solve the heat equation.
1) **Run the given serial code** on a standard GPU node and report the run time. 
2) Edit _parallel_cpu_heat.c_ to implement an **OpenMP version with parallelism on CPU only (multi-core)**, and run on a standard GPU node.
3) Edit _parallel_gpu_heat.c_ to implement an **OpenMP accelerated version with offloading to GPU device**, and run on the GPU node.

- For the CPU and GPU versions try to provide your best implementations. 
- In this exercise we compile with -O2, but you can enhance compiler optimizations if it helps you achieve better performance (do not go crazy with compiler optimizations here; this part is meant mainly to exercise OpenMP GPU offloading).
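
To make the difference between the CPU and GPU versions concrete, here is a minimal sketch of both on an assumed 2D explicit scheme with a 5-point stencil. The grid layout (row-major `n x n`), the constant `r = alpha*dt/(dx*dx)`, and the names `heat_cpu`, `heat_gpu` and `STENCIL` are hypothetical; _serial_heat.c_ may use a different discretization and data layout.

```c
#include <stddef.h>

/* Explicit 5-point update for one interior cell of a row-major n x n grid. */
#define STENCIL(u, i, j, n, r)                                           \
    ((u)[(i) * (n) + (j)] +                                              \
     (r) * ((u)[((i) - 1) * (n) + (j)] + (u)[((i) + 1) * (n) + (j)] +    \
            (u)[(i) * (n) + (j) - 1] + (u)[(i) * (n) + (j) + 1] -        \
            4.0 * (u)[(i) * (n) + (j)]))

/* Multi-core CPU version. Both buffers must hold the same boundary
   values on entry. Returns the buffer that holds the final grid. */
double *heat_cpu(double *u, double *u_new, int n, double r, int steps)
{
    double *cur = u, *next = u_new;
    for (int s = 0; s < steps; s++) {
        #pragma omp parallel for collapse(2)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                next[i * n + j] = STENCIL(cur, i, j, n, r);
        double *tmp = cur; cur = next; next = tmp;   /* swap buffers */
    }
    return cur;
}

/* GPU offload version. The target data region keeps both grids on the
   device for all time steps, so data moves host<->device only once. */
double *heat_gpu(double *u, double *u_new, int n, double r, int steps)
{
    double *cur = u, *next = u_new;
    #pragma omp target data map(tofrom: u[0:n*n], u_new[0:n*n])
    {
        for (int s = 0; s < steps; s++) {
            /* cur and next point into already-mapped buffers, so the
               kernel uses the corresponding device copies. */
            #pragma omp target teams distribute parallel for collapse(2)
            for (int i = 1; i < n - 1; i++)
                for (int j = 1; j < n - 1; j++)
                    next[i * n + j] = STENCIL(cur, i, j, n, r);
            double *tmp = cur; cur = next; next = tmp;
        }
    }
    return cur;
}
```

The main design point in the GPU variant is the enclosing `target data` region: mapping the arrays on every time step instead would typically make transfer time dominate the run. When compiled without OpenMP support, both functions fall back to the same serial loop.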

In [None]:
import os
os.chdir(os.path.expanduser('~')+'/HW3/heat/')

In [None]:
! chmod 755 ../q-gpu; chmod 755 run_serial.sh;if [ -x "$(command -v qsub)" ]; then ./../q-gpu run_serial.sh; else ./run_serial.sh; fi

In [None]:
! chmod 755 ../q-gpu; chmod 755 run_cpu.sh;if [ -x "$(command -v qsub)" ]; then ./../q-gpu run_cpu.sh; else ./run_cpu.sh; fi

In [None]:
! chmod 755 ../q-gpu; chmod 755 run_gpu.sh;if [ -x "$(command -v qsub)" ]; then ./../q-gpu run_gpu.sh; else ./run_gpu.sh; fi

**Record the run times in the next cells respectively:**

In [None]:
Parallel CPU Time: _______ sec 

In [None]:
Parallel GPU Time: _______ sec 

**What are your conclusions from this work?**

## Problem 3: Jacobi Problem (40 points)
The following code in _serial_jacobi.c_ implements the Jacobi solver for a system of linear equations.
1) **Run the given serial code** on a standard GPU node and report the run time. 
2) Edit _parallel_gpu_jacobi.c_ (and the other files if needed) to implement an **OpenMP accelerated version with offloading to GPU device**, and run on the GPU node.

- Try to provide your best implementation. 
- In this exercise we compile with -O2, but you can enhance compiler optimizations if it helps you achieve better performance (do not go crazy with compiler optimizations here; this part is meant mainly to exercise OpenMP GPU offloading).
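
For orientation, one possible shape of an offloaded Jacobi iteration is sketched below. It assumes a dense row-major matrix `A`, a max-norm stopping test, and the hypothetical names `jacobi_offload`, `x_new`, `tol` and `max_iter`; the provided _serial_jacobi.c_ may store the system and check convergence differently.

```c
#include <math.h>

/* Jacobi iteration for A x = b with a dense row-major n x n matrix A.
   x holds the initial guess on entry and the result on return.
   x_new is caller-provided scratch space of length n.
   Returns the number of iterations performed. */
int jacobi_offload(const double *A, const double *b, double *x,
                   double *x_new, int n, int max_iter, double tol)
{
    int iter = 0;
    double diff = tol + 1.0;

    /* Keep the matrix and vectors resident on the device across all
       iterations; only the scalar diff travels per iteration. */
    #pragma omp target data map(to: A[0:n*n], b[0:n]) \
                            map(tofrom: x[0:n]) map(alloc: x_new[0:n])
    {
        while (iter < max_iter && diff > tol) {
            /* Each row update is independent of the others. */
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; i++) {
                double sigma = 0.0;
                for (int j = 0; j < n; j++)
                    if (j != i)
                        sigma += A[i * n + j] * x[j];
                x_new[i] = (b[i] - sigma) / A[i * n + i];
            }

            /* Max-norm of the change, then commit the new iterate. */
            diff = 0.0;
            #pragma omp target teams distribute parallel for \
                    reduction(max:diff) map(tofrom: diff)
            for (int i = 0; i < n; i++) {
                double d = fabs(x_new[i] - x[i]);
                if (d > diff)
                    diff = d;
                x[i] = x_new[i];
            }
            iter++;
        }
    }
    return iter;
}
```

Two kernels are used per iteration because the update of `x` must not start until every `x_new[i]` has been computed; the kernel boundary provides that synchronization. Without OpenMP support the pragmas are ignored and the function runs serially with the same result.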

In [None]:
import os
os.chdir(os.path.expanduser('~')+'/HW3/jacobi/')

In [None]:
! chmod 755 ../q-gpu; chmod 755 run_serial.sh;if [ -x "$(command -v qsub)" ]; then ./../q-gpu run_serial.sh; else ./run_serial.sh; fi

In [None]:
! chmod 755 ../q-gpu; chmod 755 run_gpu.sh;if [ -x "$(command -v qsub)" ]; then ./../q-gpu run_gpu.sh; else ./run_gpu.sh; fi

**Record the run times in the next cells respectively:**

In [None]:
Parallel GPU Time: _______ sec 