### **Module 3: Using HPC for Computational Tasks**
- **Basics of Python:** Learn the basics of Python, why it is widely used, and how to work with Anaconda and Spyder.
- **Running Jobs on HPC Systems:** Learn how to submit, monitor, and manage jobs on an HPC cluster.
- **Batch Processing and Job Scheduling:** Understand how batch processing works and how to use job schedulers like SLURM or PBS.
- **Optimizing Job Performance:** Tips on optimizing computational tasks to make the most of HPC resources.
- **Handling Large Data Sets:** Learn techniques for managing and processing large data sets on HPC systems.
- **Common HPC Workflows:** Explore typical workflows and examples of tasks that can be completed using HPC.

**Learning Outcome:** Students will gain the skills to effectively run and manage computational tasks on HPC systems.

---

# Introduction to Python, Anaconda, and Spyder

## 1. Introduction to Python

Python is a versatile, high-level programming language known for its simplicity and readability. It’s widely used in various fields like web development, data science, artificial intelligence, and more.

### Why Learn Python?
- **Easy to Learn and Use**: Python’s syntax is clear and simple, making it an ideal choice for beginners.
- **Versatile**: Python can be used for a wide range of applications, from web development to data analysis.
- **Strong Community Support**: Python has a large and active community, providing extensive resources and libraries.

### Basic Python Syntax




In [None]:
# This is a comment
print("Hello, World!")  # Output: Hello, World!

# Variables
x = 5
y = "Python"
print(x, y)  # Output: 5 Python

# Simple Arithmetic
a = 10
b = 3
print(a + b)  # Addition: Output: 13
print(a - b)  # Subtraction: Output: 7
print(a * b)  # Multiplication: Output: 30
print(a / b)  # Division: Output: 3.3333...

### Python Data Structures Examples

#### 1. Dictionary

A dictionary in Python is a collection of key-value pairs. Each key is unique, and it is used to store and retrieve values.



In [None]:
student = {
    "name": "John Doe",
    "age": 21,
    "major": "Computer Science"
}

# Accessing values
print(student["name"])  # Output: John Doe

# Adding a new key-value pair
student["grade"] = "A"

# Updating a value
student["age"] = 22

# Deleting a key-value pair
del student["major"]

print(student)
# Output: {'name': 'John Doe', 'age': 22, 'grade': 'A'}


In [None]:
'''A set is an unordered collection of unique elements. Sets are useful for removing duplicates 
and performing mathematical set operations like union, intersection, etc.'''
# Creating a set
fruits = {"apple", "banana", "cherry"}

# Adding an element
fruits.add("orange")

# Removing an element
fruits.remove("banana")

# Checking membership
print("apple" in fruits)  # Output: True

# Set operations
set1 = {1, 2, 3}
set2 = {3, 4, 5}

# Union
print(set1 | set2)  # Output: {1, 2, 3, 4, 5}

# Intersection
print(set1 & set2)  # Output: {3}


In [None]:
'''A tuple is an immutable sequence of elements. 
Once created, the elements of a tuple cannot be changed.'''
# Creating a tuple
coordinates = (10.0, 20.0, 30.0)

# Accessing elements
print(coordinates[0])  # Output: 10.0

# Unpacking a tuple
x, y, z = coordinates
print(x, y, z)  # Output: 10.0 20.0 30.0

# Tuples are immutable, so the following line would raise an error:
# coordinates[0] = 15.0  # Uncommenting this will raise a TypeError


In [None]:
'''A list is an ordered collection of elements. Lists are mutable, 
meaning their elements can be changed after creation.'''
# Creating a list
colors = ["red", "green", "blue"]

# Adding an element
colors.append("yellow")

# Removing an element
colors.remove("green")

# Accessing elements
print(colors[1])  # Output: blue

# Slicing a list
print(colors[0:2])  # Output: ['red', 'blue']

# Sorting a list
colors.sort()
print(colors)  # Output: ['blue', 'red', 'yellow']

# List comprehension is a concise way to create lists.
# List of squares of numbers from 0 to 9
squares = [x**2 for x in range(10)]
print(squares)  # Output: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]



In [None]:
'''Python allows you to nest data structures, 
such as lists within dictionaries, or dictionaries within lists.'''
# Creating a dictionary with lists as values
grades = {
    "math": [90, 85, 88],
    "science": [92, 81, 79],
    "english": [87, 94, 91]
}

# Accessing a list from the dictionary
print(grades["math"])  # Output: [90, 85, 88]

# Adding a new score to the science list
grades["science"].append(85)
print(grades["science"])  # Output: [92, 81, 79, 85]


In [None]:
# Creating a list of dictionaries
students = [
    {"name": "John", "age": 21},
    {"name": "Alice", "age": 23},
    {"name": "Bob", "age": 22}
]

# Accessing the first student's name
print(students[0]["name"])  # Output: John

# Adding a new student
students.append({"name": "Eve", "age": 20})
print(students)
# Output: [{'name': 'John', 'age': 21}, {'name': 'Alice', 'age': 23}, {'name': 'Bob', 'age': 22}, {'name': 'Eve', 'age': 20}]


## 2. Introduction to Anaconda (Optional)

Anaconda is a free and open-source distribution of the Python and R programming languages. It is used for scientific computing and data science, and it simplifies package management and deployment.

#### Key Features of Anaconda:
- **Comes with Python**: Anaconda installs Python automatically.
- **Package Management**: Anaconda includes conda, a package manager that makes it easy to install, update, and manage libraries and dependencies.
- **Integrated Development Environments (IDEs)**: Anaconda comes with IDEs like Jupyter Notebook and Spyder.

#### Table of Commands for Managing Anaconda Environments and Packages

| Command | Description |
| --- | --- |
| `conda install <package>` | Installs a single package or a list of packages. (e.g., `conda install numpy`, `conda install numpy pandas matplotlib`) |
| `conda update <package>` | Updates a specific package to the latest version. (e.g., `conda update numpy`) |
| `conda update --all` | Updates all packages in the current environment. |
| `conda remove <package>` | Removes a single package or a list of packages from the current environment. (e.g., `conda remove numpy`) |
| `conda env export > environment.yaml` | Exports the current environment to a YAML file named `environment.yaml`. |
| `conda env create -f environment.yaml` | Creates an environment from a YAML file. (e.g., creates an environment from `environment.yaml`) |
| `anaconda-navigator` | Launches Anaconda Navigator from the terminal. |

---


## 3. Introduction to Spyder

Spyder (Scientific Python Development Environment) is an open-source Integrated Development Environment (IDE) designed for Python. It is particularly useful for scientific computing and data science, offering tools for editing, debugging, and profiling Python code.

#### Key Features of Spyder
- **Interactive Console**: Allows you to run Python code interactively, see the output immediately, and experiment with your code.
- **Variable Explorer**: Lets you view and manage the variables in your code, including data types and values, in a convenient, table-like format.
- **Integrated Development Environment (IDE)**: Combines a powerful code editor, debugging tools, and more in one interface.
- **Support for Scientific Libraries**: Spyder comes with support for scientific libraries like NumPy, SciPy, Matplotlib, and others, making it ideal for data analysis.

#### Getting Started with Spyder

##### Step 1: Launching Spyder

1. Open Anaconda Navigator.
2. Click on "Launch" under the Spyder icon.

##### Step 2: Writing Your First Script

1. In the Spyder interface, you’ll see the code editor on the left, the console at the bottom right, and the variable explorer at the top right.
2. Start by typing the following code in the code editor:




In [None]:
# Simple Python Script in Spyder
import numpy as np

# Creating an array
array = np.array([1, 2, 3, 4, 5])

# Printing the array
print("Array:", array)

# Performing a basic operation
squared_array = array ** 2
print("Squared Array:", squared_array)

3. Save the script.
4. Run the script.

##### Step 3: Using the Variable Explorer

After running the script, take a look at the Variable Explorer at the top right:

- You’ll see `array` and `squared_array` listed with their data types and values.
- Double-click a variable to view its contents in detail.

##### Step 4: Interactive Coding in the Console

Spyder’s IPython console allows for interactive coding. You can run commands and see the output instantly.

1. Click on the console at the bottom.
2. Try running the following commands interactively:

In [None]:
# Console Example
import math

# Calculate the square root of 16
sqrt_value = math.sqrt(16)
print("Square Root of 16:", sqrt_value)

# Calculate the sine of 90 degrees (converted to radians)
sine_value = math.sin(math.radians(90))
print("Sine of 90 degrees:", sine_value)

You’ll see the results immediately in the console. This is a great way to test small pieces of code quickly.

Example 1: Plotting Data
You can use Spyder to create and visualize data plots interactively.

In [None]:
# Importing Matplotlib
import matplotlib.pyplot as plt

# Creating data
x = [0, 1, 2, 3, 4, 5]
y = [0, 1, 4, 9, 16, 25]

# Plotting the data
plt.plot(x, y)
plt.title("Simple Plot")
plt.xlabel("x-axis")
plt.ylabel("y-axis")
plt.show()

Spyder has built-in debugging tools that make it easy to step through your code, inspect variables, and understand what’s happening.

1. Place a breakpoint by clicking to the left of the line numbers in the editor.
2. Run the script in debug mode by clicking the debug button in the toolbar.
3. Step through the code line by line using the debug toolbar.

This helps you understand how your code executes and catch errors or bugs.



In [None]:
# Debugging Example
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

# Test the function
result = factorial(5)
print("Factorial of 5:", result)


# Optimizing Job Performance on HPC Systems

## Introduction to High-Performance Computing (HPC)

High-Performance Computing (HPC) refers to the use of supercomputers and parallel processing techniques to solve complex computational problems. HPC systems allow for massive computational power, enabling researchers and engineers to perform simulations, data analysis, and other tasks that would be infeasible on standard computers.

However, to make the most of HPC resources, it's crucial to optimize your computational tasks. Proper optimization can lead to significant improvements in performance, resource utilization, and overall efficiency.

## 1. Understanding HPC Architecture

### Key Components of HPC Systems
- **Nodes**: Individual computers within an HPC system. Each node typically contains CPUs, memory, and storage.
- **CPUs and Cores**: The central processing units (CPUs) in each node, often consisting of multiple cores that can run tasks in parallel.
- **Interconnect**: The network that connects the nodes, allowing them to communicate and share data.
- **Storage**: High-speed storage systems that provide the necessary read/write speeds for large datasets.

Understanding the architecture of your HPC system is crucial for optimizing performance. Different systems may have varying numbers of cores per node, memory configurations, and interconnect speeds, all of which influence how you should optimize your jobs.
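
As a quick way to see what one node offers, the following minimal sketch reports the cores visible to a Python process. It assumes a Linux node for the affinity query (`os.sched_getaffinity` is not available on every platform); the function name `describe_node` is only illustrative. Inside a scheduled job, the usable count is often smaller than the node total because the scheduler may restrict a job to its allocation.

```python
import os

def describe_node():
    """Return a small summary of the CPU resources visible to this process."""
    total_cores = os.cpu_count()  # logical cores on the node (may be None)
    # os.sched_getaffinity is Linux-only; fall back to the total elsewhere.
    if hasattr(os, "sched_getaffinity"):
        usable_cores = len(os.sched_getaffinity(0))
    else:
        usable_cores = total_cores
    return {"total_cores": total_cores, "usable_cores": usable_cores}

node_info = describe_node()
print(node_info)
```

Comparing the two numbers on a compute node is a simple check that a job script requested the cores you intended.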

## 2. Efficient Resource Allocation

### Right-Sizing Jobs
- **Request Only What You Need**: When submitting jobs, request only the number of cores, memory, and wall time you need. Over-requesting resources can lead to inefficient use of the HPC system and longer queue times.
- **Use Node Resources Efficiently**: Ensure that you are fully utilizing the CPUs and memory on each node. For multi-threaded applications, consider using thread affinity settings to bind threads to specific cores.

### Example: Job Script for a Multi-Core Task

```bash
#!/bin/bash
#SBATCH --job-name=my_job          # Job name
#SBATCH --nodes=2                  # Number of nodes
#SBATCH --ntasks-per-node=16       # Number of tasks per node
#SBATCH --time=04:00:00            # Wall time limit (HH:MM:SS)
#SBATCH --mem=64G                  # Memory per node
#SBATCH --output=output_%j.log     # Standard output log

# Load necessary modules
module load python/3.8

# Run the program
srun --mpi=pmi2 python my_script.py
```

### Considerations for Resource Allocation
- **Memory Usage:** Monitor the memory usage of your jobs and adjust your requests accordingly. Use tools like `top` or `htop` to observe real-time memory usage.
- **Job Arrays:** If you have many similar tasks, use job arrays to submit them in a single batch job. This reduces the overhead of job submission and improves scheduling efficiency.
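
To illustrate the job-array idea, here is a minimal sketch of how a Python script might choose its own input inside a SLURM job array. SLURM sets the `SLURM_ARRAY_TASK_ID` environment variable for each array task; the file names and the `select_input` helper are hypothetical, and the array is assumed to be submitted with something like `--array=0-2`.

```python
import os

def select_input(files, env=None):
    """Pick the input file for this array task.

    SLURM sets SLURM_ARRAY_TASK_ID for each task of a job array;
    the default of 0 lets the script also run outside a job array.
    """
    env = os.environ if env is None else env
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    if not 0 <= task_id < len(files):
        raise IndexError(f"task id {task_id} has no matching input file")
    return files[task_id]

# Hypothetical input files, one per array task.
inputs = ["sample_0.csv", "sample_1.csv", "sample_2.csv"]

# Simulate the environment of array task 1.
print(select_input(inputs, env={"SLURM_ARRAY_TASK_ID": "1"}))  # sample_1.csv
```

Every array task runs the same script, so the task ID is the only thing that has to differ between tasks.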

### Parallelization Strategies
**1. Data Parallelism**
- **Split Data Across Nodes**: Divide your dataset into smaller chunks that can be processed in parallel across multiple nodes. This is particularly effective for large-scale simulations or data analysis tasks.

**2. Task Parallelism**
- **Divide and Conquer**: Break down a large computational task into smaller, independent tasks that can be executed concurrently. This strategy is effective for problems that can be naturally decomposed, such as Monte Carlo simulations.

**3. Hybrid Parallelism**
- **Combine MPI and OpenMP**: For applications that require both distributed memory (MPI) and shared memory (OpenMP) parallelism, consider a hybrid approach. MPI can be used to communicate between nodes, while OpenMP manages parallelism within each node.

In [None]:
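// Example: hybrid MPI/OpenMP code structure (in C)
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) 
{
    // Request thread support, since OpenMP threads run inside each MPI process
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);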
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        printf("Hello from thread %d of process %d\n", thread_id, rank);
    }

    MPI_Finalize();  // Finalize MPI
    return 0;
}
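
Returning to the data-parallel strategy, the core step is deciding which slice of the data each worker owns. The sketch below is one common way to do that, not the only one; the function name `chunk_bounds` is illustrative, and in an MPI program the `worker` argument would typically be the process rank.

```python
def chunk_bounds(n_items, n_workers, worker):
    """Return the half-open range [start, stop) owned by one worker.

    The first (n_items % n_workers) workers get one extra item, so chunk
    sizes differ by at most one.
    """
    base, extra = divmod(n_items, n_workers)
    start = worker * base + min(worker, extra)
    stop = start + base + (1 if worker < extra else 0)
    return start, stop

# Split 10 items across 4 workers.
for worker in range(4):
    print(worker, chunk_bounds(10, 4, worker))
# 0 (0, 3)
# 1 (3, 6)
# 2 (6, 8)
# 3 (8, 10)
```

Because every worker computes its own bounds from the same three numbers, no communication is needed to agree on the split.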

### Optimize Data I/O

1. **Minimize I/O Operations**:
   - Reduce the frequency of reading/writing to disk. Use in-memory computations where possible to avoid slow disk I/O.

2. **Use High-Speed Storage**:
   - Store large datasets on high-speed storage (e.g., SSDs or parallel file systems like Lustre) to minimize I/O bottlenecks.

3. **Efficient File Handling**:
   - When reading or writing large files, use efficient file formats (e.g., HDF5, NetCDF) that support parallel I/O.

### Load Balancing and Task Scheduling

1. **Load Balancing**:
   - Ensure that computational tasks are evenly distributed across all available cores/nodes. Uneven load distribution can cause some nodes to sit idle while others are overworked.

2. **Task Scheduling**:
   - Schedule tasks in a way that maximizes resource utilization. For example, staggered start times for large jobs can help avoid peak load times on the cluster.
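
When tasks have unequal run times, a simple heuristic for the load balancing described above is to hand out the longest tasks first, always to the least-loaded worker. The sketch below assumes the run times are known or estimated in advance; the durations and the `assign_tasks` helper are hypothetical, and the greedy rule gives a good but not always optimal split.

```python
import heapq

def assign_tasks(durations, n_workers):
    """Greedy longest-processing-time-first assignment.

    Tasks are sorted from longest to shortest and each one goes to the
    worker with the smallest total load so far.
    """
    heap = [(0.0, w) for w in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for task, duration in sorted(enumerate(durations), key=lambda t: -t[1]):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + duration, worker))
    loads = {w: sum(durations[t] for t in tasks) for w, tasks in assignment.items()}
    return assignment, loads

# Hypothetical task run times in minutes.
assignment, loads = assign_tasks([7, 5, 4, 3, 3, 2], n_workers=2)
print(assignment)  # {0: [0, 3, 5], 1: [1, 2, 4]}
print(loads)       # {0: 12, 1: 12}
```

Here both workers finish after 12 minutes, whereas assigning the first three tasks to one worker and the rest to the other would leave one worker idle for half the run.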

---


The dataset is loaded into memory, processed with a mathematical operation, and then written to disk only once. This minimizes the number of I/O operations.

In [None]:
# Efficient data processing using in-memory computation
#
# Instead of frequently reading and writing data to disk, load the data into
# memory, process it, and write the results back to disk in one go.
# This reduces the overhead associated with disk I/O.
import numpy as np

# Generate a large dataset (a 10000x10000 float64 array, about 800 MB)
data = np.random.rand(10000, 10000)

# Perform some in-memory computations
processed_data = np.sqrt(data)  # Example operation: compute square root

# Write the processed data to disk
np.save('processed_data.npy', processed_data)


The example below stores a large array in an HDF5 file and reads it back. HDF5 keeps large arrays in a binary format and lets you read back all or part of a dataset, which helps I/O performance when working with large datasets.

In [None]:
# Writing and reading large files with HDF5
#
# HDF5 is a file format that supports the storage of large datasets
# and is optimized for fast I/O operations.
# It's particularly useful when dealing with big data.

import h5py
import numpy as np

# Create a large dataset
data = np.random.rand(10000, 10000)

# Write data to an HDF5 file
with h5py.File('data.h5', 'w') as hf:
    hf.create_dataset('dataset_1', data=data)

# Read data back from the HDF5 file
with h5py.File('data.h5', 'r') as hf:
    read_data = hf['dataset_1'][:]

print(read_data.shape)


Using Buffered I/O for Large Text Files

Buffered I/O allows you to read and write large files in chunks, reducing memory usage and the number of I/O operations.

In the program below, the file is written line by line; Python's file object buffers these small writes and flushes them to disk in larger blocks. The file is then read back in 1 MB chunks, so only one chunk is held in memory at a time.

In [None]:
# Writing to a large text file using buffered I/O
with open('large_file.txt', 'w') as f:
    for i in range(1000000):
        f.write(f"Line {i}\n")

# Reading from the large file in 1 MB chunks (binary mode, so sizes are in bytes)
with open('large_file.txt', 'rb') as f:
    buffer_size = 1024 * 1024  # 1 MB buffer
    while True:
        data = f.read(buffer_size)
        if not data:
            break
        # Process the data
        print(f"Read {len(data)} bytes")
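
As a complement to the chunked read above, text files can also be streamed line by line, which keeps memory use small regardless of file size. This is a minimal sketch under the assumption that each line holds one integer; the file name and the `sum_column` helper are made up for the demonstration.

```python
import os
import tempfile

def sum_column(path):
    """Sum the integer on each line while holding only one line in memory."""
    total = 0
    with open(path, "r") as f:
        for line in f:  # the file object yields one line at a time
            total += int(line)
    return total

# Write a small demonstration file, then stream it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w") as f:
        f.writelines(f"{i}\n" for i in range(1000))
    result = sum_column(path)

print(result)  # 499500
```

Iterating over the file object still uses Python's internal buffer, so the disk is read in blocks even though the loop sees single lines.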


Parallel I/O with MPI

Example: Using MPI for Parallel File Writing

MPI (Message Passing Interface) can be used to write data to a file in parallel, distributing the workload across multiple processors or nodes.

The example below demonstrates how to use MPI for parallel file writing, where each process writes its own portion of data to the file. The call to `comm.Barrier()` ensures that all processes have finished writing before moving on.

In [None]:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process creates its own portion of the data
data = np.arange(1000 * rank, 1000 * (rank + 1))

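# Rank 0 creates (or truncates) the file once; opening with 'wb' in every
# process would truncate data already written by other ranks
if rank == 0:
    open('parallel_data.bin', 'wb').close()
comm.Barrier()

# Parallel write: each process writes its own slice at a rank-specific offset
with open('parallel_data.bin', 'r+b') as f:
    f.seek(rank * len(data) * data.itemsize)  # Seek to the correct position
    f.write(data.tobytes())

# Use MPI barrier to ensure all processes have finished writing
comm.Barrier()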

# Read back the data in parallel (for validation)
with open('parallel_data.bin', 'rb') as f:
    f.seek(rank * len(data) * data.itemsize)
    read_data = np.frombuffer(f.read(len(data) * data.itemsize), dtype=data.dtype)

print(f"Process {rank} read data: {read_data}")


## Monitoring and Profiling Jobs

### Monitor Job Performance

1. **Use Job Monitoring Tools**:
   - Monitor your job's CPU, memory, and I/O usage using available HPC tools (e.g., `top`, `htop`, and `qstat` on PBS, or `squeue` and `sacct` on SLURM). This helps identify bottlenecks and inefficient resource usage.

2. **Adjust Based on Feedback**:
   - If your job consistently underutilizes resources (e.g., low CPU usage), adjust the resource requests for future jobs.
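
One way to put a number on "underutilizes" is CPU efficiency: the CPU time a job actually used divided by the core time it reserved. The sketch below is only the arithmetic; the job figures are hypothetical, and in practice the inputs would come from your scheduler's accounting output.

```python
def cpu_efficiency(cpu_seconds, wall_seconds, cores):
    """Fraction of the allocated core time that was actually spent computing."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    return cpu_seconds / (wall_seconds * cores)

# Hypothetical job: 16 cores for 1 hour of wall time, 4 hours of CPU time used.
eff = cpu_efficiency(cpu_seconds=4 * 3600, wall_seconds=3600, cores=16)
print(f"CPU efficiency: {eff:.0%}")  # CPU efficiency: 25%
```

An efficiency this low suggests the job could have run on about four cores, so a smaller request would likely start sooner and waste less of the allocation.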

### Profiling and Debugging

1. **Profiling Tools**:
   - Use profiling tools (e.g., `gprof`, `nvprof` for GPU jobs) to analyze the performance of your code. Profiling helps identify time-consuming functions and potential areas for optimization.

2. **Debugging**:
   - Use debugging tools to identify and fix issues in parallel code, ensuring that all processes and threads run as expected.

---

This example uses the psutil library to monitor the memory usage of a Python script. It prints the memory usage at different stages of the computation, helping you identify parts of your code that consume a lot of memory.

In [None]:
# Example: using Python to monitor memory usage
#
# Periodically checking the memory usage of your script helps identify
# potential memory leaks or inefficient memory usage.
# Requires the third-party psutil package.
import psutil
import time

# Function to monitor memory usage
def monitor_memory():
    process = psutil.Process()
    mem_info = process.memory_info()
    print(f"Memory Usage: {mem_info.rss / (1024 * 1024):.2f} MB")

# Example function that uses memory
def compute_large_array():
    monitor_memory()  # Initial memory usage
    large_array = [x ** 2 for x in range(10**6)]
    monitor_memory()  # Memory usage after creating the array
    del large_array
    monitor_memory()  # Memory usage after deleting the array

# Run the function and monitor memory usage
compute_large_array()
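
If `psutil` is not installed, the standard library's `tracemalloc` module offers a related view. This is a minimal sketch, not a replacement for the example above: `tracemalloc` tracks memory allocated by Python objects, while `psutil` reports the memory of the whole process, so the two numbers will differ.

```python
import tracemalloc

tracemalloc.start()

squares = [x ** 2 for x in range(100_000)]
current, peak = tracemalloc.get_traced_memory()  # both values are in bytes
print(f"current: {current / 1024:.0f} KiB, peak: {peak / 1024:.0f} KiB")

del squares
after_del, _ = tracemalloc.get_traced_memory()
print(f"after del: {after_del / 1024:.0f} KiB")

tracemalloc.stop()
```

The peak value is useful for sizing a memory request, since it reflects the largest amount traced at any point rather than the amount at the end of the run.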


Using qstat to Monitor HPC Jobs

If you are running jobs on an HPC system that uses PBS, you can use `qstat` to check the status of your jobs, including resource usage like CPU and memory. On SLURM clusters, `squeue` and `sacct` serve the same purpose.

In [None]:
# The qstat command provides a snapshot of your job's current status
# on the HPC cluster. It helps you monitor CPU usage,
# memory consumption, and job progress.

# Command to check the status of your jobs
qstat -u your_username

# Command to get detailed information about a specific job
qstat -f job_id

Adjusting Resource Requests in a Job Script

If your job consistently underutilizes resources, adjust your resource requests to avoid wasting computational power.

After analyzing the performance of your previous jobs (with `qstat` on PBS, or `sacct` on SLURM), you might find that a job underutilizes CPU or memory. In that case, adjust the `--ntasks`, `--cpus-per-task`, and `--mem` parameters in your SLURM job script accordingly.

In [None]:
#!/bin/bash
# Example SLURM job script

#SBATCH --job-name=example_job
#SBATCH --ntasks=1                 # Number of tasks (processes)
#SBATCH --cpus-per-task=4          # Number of CPU cores per task
#SBATCH --mem=8G                   # Memory per node
#SBATCH --time=02:00:00            # Time limit

module load python/3.8
srun python my_script.py


Profiling with cProfile

Python's cProfile module allows you to profile your code, identifying time-consuming functions and optimizing them for better performance.

The example below uses cProfile to profile the example_function. The profiler will output the time spent in each function, allowing you to identify and optimize performance bottlenecks.


In [None]:
import cProfile

def example_function():
    total = 0
    for i in range(1, 1000000):
        total += i ** 2
    return total

# Profile the example function
cProfile.run('example_function()')
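
For longer programs the full `cProfile` report can be overwhelming. The sketch below shows one way to narrow it down with the standard `pstats` module, sorting by cumulative time and keeping only the top entries; the function `slow_sum` is a made-up workload.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    return sum(i ** 2 for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
slow_sum(200_000)
profiler.disable()

# Send the report to a string buffer and show the 5 most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time surfaces the functions whose calls, including everything they call in turn, account for most of the run time.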


Using nvprof for GPU Profiling

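If you are running code on a GPU, you can use NVIDIA's nvprof tool to profile your GPU code and identify performance bottlenecks. Note that NVIDIA has deprecated nvprof in favor of Nsight Systems and Nsight Compute, and nvprof does not support the newest GPU architectures, so check which tools your cluster provides.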

nvprof provides detailed insights into GPU utilization, memory transfers, and kernel execution times. This information helps you optimize your code for better GPU performance.

In [None]:
# Command to profile a GPU-enabled script
nvprof python my_gpu_script.py

Debugging with Python's pdb

The pdb module is Python's built-in debugger, allowing you to step through your code, inspect variables, and fix issues.

The example below sets a breakpoint with pdb.set_trace() before a line that will raise an error. Running this script will enter the pdb interactive debugging mode, where you can inspect variables, step through code, and understand the source of the error.

In [None]:
import pdb

def faulty_function():
    x = 10
    y = 0
    pdb.set_trace()  # Set a breakpoint here
    z = x / y  # This will raise a ZeroDivisionError
    return z

faulty_function()

### **Frequently Asked Questions**

1. **Why use a tuple when we have lists?** 

**Answer:** Tuples are immutable while lists are mutable. We use a tuple when the values should not be changed after creation.  


2. **Why do we prefer to write I/O in one go?**

**Answer:** Writing data in one go reduces the overhead of repeatedly opening, reading, and writing files. It cuts down on the number of system calls, which can be slow, and it lowers the risk of ending up with only part of the required data written or processed. 


3. **What is in-memory computation, and why is it done?**  

**Answer:** In-memory computation refers to storing data in a computing system's main memory (RAM) rather than reading from or writing to slower disk storage. It improves the performance of applications by avoiding the need to wait for data to be fetched from the disk.


4. **What is a multi-threaded application?** 

**Answer:**  A multi-threaded application uses multiple threads of execution within a single process. These threads can run concurrently, letting the program handle several tasks simultaneously. This also improves efficiency and responsiveness.


5. **What is thread affinity setting and why is it used?** 

**Answer:** Thread affinity refers to assigning a specific thread to a particular CPU core (or set of cores). Keeping the thread on the same core means the data it has already loaded into that core's cache stays available, which can improve performance.


6. **What is the difference between distributed memory and shared memory?**

**Answer:** 
1. Distributed Memory: Each processor or computer node has its own private memory, and they communicate with each other by sending messages (for example, using MPI).

2. Shared Memory: All processors share a common memory space, allowing them to exchange data directly (for example, using OpenMP).


7. **What are I/O bottlenecks?**

**Answer:** I/O bottlenecks happen when data moves between storage devices (like hard drives or networks) and the computer too slowly compared to the processing speed. For example, if you send a 150-page document to the printer, your computer can process and send the data quickly, but the printer can only print a few pages per minute. Since the printer is slower than the computer, printing becomes the bottleneck and causes delays.


8. **How does buffered I/O reduce memory usage and improve speed?**

**Answer:** Buffered I/O improves performance by using a temporary storage area called a buffer. Processing small pieces of data one by one can be slow and inefficient, so the system collects a larger amount of data in the buffer before reading or writing it all at once. This reduces the number of I/O operations and the switching between storage and processing, which speeds up data transfer. Because data moves through a fixed-size buffer, a large file never has to be held in memory all at once. 


9. **How do efficient file formats (for example, HDF5 and NetCDF) support parallel I/O?**

**Answer:** File formats like HDF5 and NetCDF are designed so that, when the libraries are built with parallel (MPI) support, many processes can access and work with the same file at the same time. This lets different parts of the file be read or written simultaneously, which makes it much faster to handle large datasets.


10. **What is profiling and how does it help in identifying inefficient resource usage?**

**Answer:** Profiling is the process of analyzing a program’s performance by monitoring how it uses resources like CPU, memory, and disk I/O. It helps identify bottlenecks, inefficient code, or operations that slow down execution. This can help developers optimize specific parts of the program to address the problems identified. 


11. **What is a GPU, and what is GPU profiling?** 

**Answer:** A GPU (graphics processing unit) is a specialized processor designed for tasks like rendering images and performing parallel computations. It is a hardware component with thousands of small cores that work on many calculations at once, and it has high-bandwidth memory. GPU profiling is the process of analyzing how efficiently a program utilizes the GPU.


12. **How does combining MPI (Message Passing Interface) with OpenMP (Open Multi-Processing) enhance parallelism in applications?**

**Answer:** By combining MPI and OpenMP, applications can take advantage of two layers of parallelism. MPI enables multiple nodes to work on different parts of a problem at the same time. OpenMP, on the other hand, allows multiple cores within a single computer (or node) to run tasks concurrently. Together, they make it possible to split complex tasks efficiently across many processors. 
