The Message Passing Interface (MPI) is designed primarily for distributed-memory machines, though it is also useful on shared-memory systems. It was first standardized in 1994.

The MPI paradigm:
    Processes can access only their own local memory, but they can communicate with each other using messages.

    Messages are passed from the local memory of one process to the local memory of another. This is orders of magnitude slower than accessing local memory.

Latency and bandwidth of MPI transfers over InfiniBand:
- InfiniBand latency: < 0.6 µs (maximum cable length 15 m).
- InfiniBand bandwidth: 400 Gbit/s.
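
To see what these two numbers mean for a single message, a common rough cost model adds the latency to the time needed to push the payload through the link: time ≈ latency + size / bandwidth. The sketch below only illustrates that simplified model using the InfiniBand figures quoted above. The function name `transfer_time` and the message sizes are made up for the example, and real transfers also depend on protocol overhead and network contention.

```python
# Rough cost model for one message: time = latency + size / bandwidth.
# The model is a simplification; the function name is only illustrative.

LATENCY_S = 0.6e-6        # 0.6 microseconds, expressed in seconds
BANDWIDTH_BIT_S = 400e9   # 400 Gbit/s, expressed in bits per second


def transfer_time(num_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BIT_S):
    """Estimated time in seconds to send num_bytes in a single message."""
    return latency + (num_bytes * 8) / bandwidth


# Compare a tiny, a medium, and a large message
for size in (8, 1_000_000, 1_000_000_000):
    t = transfer_time(size)
    print(f"{size:>13} bytes: {t * 1e6:12.3f} microseconds")
```

Under this model, small messages are dominated by latency and large messages by bandwidth, which is why it usually pays to combine many small messages into fewer large ones.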

Advantages of MPI:
1. Message passing in shared memory:
    In both distributed-memory and shared-memory systems, MPI lets the programmer manage data locality by specifying exactly where the data goes. On shared-memory systems, passing a message often just means copying to another part of shared memory, so you don't pay a network cost.
2. Debugging/overwriting advantage:
    MPI communication is explicit (using MPI_Send, MPI_Recv, etc.), which makes it easier to trace where data comes from and where it goes, and makes debugging less painful than with OpenMP.
3. MPI is a library, not a language:
    MPI provides functions that you call from your program; it is not a new programming language.
    MPI can be used from C, C++, Fortran, and Python (through the mpi4py package used in the examples here), which makes it flexible.
    There are open-source MPI implementations (such as MPICH, Open MPI, and LAM/MPI) as well as vendor-tuned versions optimized for supercomputers.
4. MPI is Portable:
    Since MPI is a standard, not just an implementation, you can move MPI programs between systems pretty easily.
5. Processes, not Threads:
    MPI programs are made up of processes that talk via messages.
    Processes can technically all run on the same machine, but that is usually not the goal; you typically want them distributed across the nodes of a cluster.

To check the number of nodes, CPUs, and MPI processes in the cluster:

qstat -Bf

Communicator:
- A communicator defines a group of processes that communicate with each other using MPI. The communicator that contains all processes is MPI_COMM_WORLD.


In [None]:
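# In print_rank_size/print_rank_size.py script:
from mpi4py import MPI

# Communicator that contains all processes started by mpiexec
comm = MPI.COMM_WORLD

# Print the rank and size
rank = comm.Get_rank()
size = comm.Get_size()
print("Hello World!")
print(f"rank={rank}, size={size}")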

In [None]:
#PBS -N combined_parllel_for_loop
#PBS -q teachingq
#PBS -l select=1:ncpus=4:mpiprocs=4
#PBS -l walltime=00:01:00
#PBS -o log.out2
#PBS -e log.err2
export OMP_NUM_THREADS=4

echo -e "Job started from $(pwd)."
echo "Changing directory to..."
PBS_O_WORKDIR=/home/sy37tovi/parllel_computing2/MPI/mpi_examples
cd "$PBS_O_WORKDIR"
echo -e "$(pwd)"

mpiexec -n 4 python print_rank_size.py


# output:
Hello World!

rank=1, size=4

Hello World!

rank=2, size=4

Hello World!

rank=0, size=4

Hello World!

rank=3, size=4

# Rank: 
- The unique ID number assigned to each process in a communicator (typically MPI.COMM_WORLD).

- Ranges from 0 to size - 1.

- Used to differentiate processes and direct communication (e.g., “send data from rank 0 to rank 2”).

# Size:
- The total number of processes in a communicator
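
A typical use of rank and size is to decide which part of a loop each process handles. The helper below is a sketch of one common block-partitioning scheme; the function name `block_range` is hypothetical and not part of mpi4py. To keep it runnable without MPI, the four ranks are simulated in a plain loop.

```python
def block_range(n_items, rank, size):
    """Return the half-open range [start, stop) of items owned by `rank`."""
    base, extra = divmod(n_items, size)
    # The first `extra` ranks each take one additional item
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


# Simulate 4 processes sharing 10 loop iterations
size = 4
for rank in range(size):
    start, stop = block_range(10, rank, size)
    print(f"rank={rank} handles iterations {list(range(start, stop))}")
```

In an MPI script, `rank` and `size` would instead come from `comm.Get_rank()` and `comm.Get_size()`, and each process would only run its own range.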

Use Open MPI if:
- You want maximum hardware performance on Linux clusters.
- You're working with custom networks or need modular control.

Use MPICH if:
- You care about portability and standards compliance.
- You're using Intel MPI, MS-MPI, or another implementation that builds on MPICH.

# MPI_Bcast:
MPI_Bcast broadcasts data from one process (the root) to all other processes in the communicator.
- Only the root process provides the data.
- After the call, all processes (including the root) hold the same data.
- It is an efficient way to share the same information across all processes.

# MPI Broadcast:
Broadcast the message in the buffer of the process with rank 'root' to all processes. Note that every process, not only the root, must call MPI_Bcast.

In [None]:
#broadcast_list.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Only the root (rank 0) holds the data before the broadcast
data = [0, 1, 2, 3] if rank == 0 else None

print("data before broadcasting to other threads")
print(f"received_data={data},rank={rank}")

# bcast pickles the object on the root and returns a copy on every rank
received_data = comm.bcast(data, root=0)
print(f"received_data={received_data}; rank={rank}")

# output:


Job started from /home/sy37tovi/pbs.943649.mmaster02.x8z.

Changing directory to...

/home/sy37tovi/parllel_computing2/MPI/mpi_examples/broadcast_mpi

data before broadcasting to other threads

received_data=[0, 1, 2, 3],rank=0

data before broadcasting to other threads

received_data=None,rank=1

data before broadcasting to other threads

received_data=None,rank=2

data before broadcasting to other threads

received_data=None,rank=3

received_data=[0, 1, 2, 3]; rank=0

received_data=[0, 1, 2, 3]; rank=1

received_data=[0, 1, 2, 3]; rank=2

received_data=[0, 1, 2, 3]; rank=3

# Difference between broadcasting Python objects and NumPy arrays

| Feature      | `comm.bcast()`                   | `comm.Bcast()`                                |
| ------------ | -------------------------------- | --------------------------------------------- |
| Abstraction  | High-level                       | Low-level                                     |
| Data types   | Any Python object (via pickling) | Only buffer-like objects (e.g., NumPy arrays) |
| Return value | Returns broadcasted object       | Returns `None` (in-place)                     |
| Performance  | Slower (due to pickling)         | Faster (direct memory broadcast)              |
| Use case     | Simpler, flexible usage          | Performance-critical code, large data         |


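comm.Barrier()
A synchronization point: no process continues past it until all processes have reached it. It is often used to separate phases of output or to align timing measurements, although it does not by itself guarantee the order in which output from different processes appears.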

comm.reduce(n, op=MPI.SUM, root=0)
Each process contributes its value n (here n = rank + 1), the values are summed, and the result is delivered to rank 0:

With 4 ranks: 1 + 2 + 3 + 4 = 10

Only rank 0 receives the result; the other ranks get None.
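
The arithmetic can be checked without MPI. The snippet below is a plain-Python simulation of what a SUM reduction to rank 0 produces for four ranks. It does not use mpi4py, and the variable names are illustrative only.

```python
from functools import reduce
import operator

size = 4
# Value n contributed by each rank, as in the example: n = rank + 1
contributions = [rank + 1 for rank in range(size)]

# What the root ends up with after a SUM reduction
total_at_root = reduce(operator.add, contributions)

# Result as seen by each rank: only the root (rank 0) gets the total
results = [total_at_root if rank == 0 else None for rank in range(size)]
print(contributions)   # [1, 2, 3, 4]
print(results)         # [10, None, None, None]
```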

MPI.Finalize():
Cleans up MPI. In mpi4py this call is optional, because MPI is finalized automatically when the Python process exits.

In [None]:
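from mpi4py import MPI


def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    print("Hello world!")
    print(f"rank={rank}; size={size}")

    # Wait here until every process has reached this point
    comm.Barrier()

    # Each rank contributes rank + 1; the sum arrives only at rank 0
    n = rank + 1
    total = comm.reduce(n, op=MPI.SUM, root=0)

    if rank == 0:
        print(f" Received total sum at rank {rank} ={total}")


if __name__ == "__main__":
    main()
    # Optional: mpi4py finalizes MPI automatically at exit
    MPI.Finalize()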